Thursday, February 25, 2016

Using Version Control and GitHub in the Classroom

This semester I'm co-instructing a course called Introduction to Data Science (BIO 260 and CSCI E-107) with Rafael Irizarry at the Harvard School of Public Health and Harvard Extension School. It is similar to CS 109, a course taught at Harvard University in Fall 2014 for which I was the head TA. We have a fantastic group of people involved with the course this year, which has made developing a course from scratch go much more smoothly than it otherwise could have.

We spent one lecture teaching students the importance of version control. Version control is a way of tracking the change history of a project. Even if you have never heard of version control, you have probably already done it manually. For example, if you have ever written a document or paper, you may have copied and renamed the file multiple times as it went through different stages ("paper-v1.doc", "paper-v2.doc", "paper-final.doc", "paper_finalFINALdraft.doc", etc.). If at any point you wanted to see an older version of the paper, you could simply open the corresponding file. It's not very efficient, but it is in fact a form of version control. One improvement would be a way of keeping only one file (e.g. "paper.doc") while still being able to see older versions of it. You can think of these older versions as little snapshots of the paper as it changed through time. The same idea can be applied to code that you write.
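To make the snapshot idea concrete, here is a minimal Python sketch of a single document that remembers every saved version of itself. This is only an illustration of the concept, not how Git works internally, and the class and method names (VersionedDocument, commit, checkout) are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDocument:
    """One document plus a history of every snapshot saved of it."""
    content: str = ""
    history: list = field(default_factory=list)

    def commit(self, new_content, message):
        # Save a snapshot (message + full text) and update the current content.
        self.history.append((message, new_content))
        self.content = new_content

    def checkout(self, version):
        # Return the text of an older snapshot without changing the current one.
        return self.history[version][1]


paper = VersionedDocument()
paper.commit("Intro.", "first draft")
paper.commit("Intro. Methods.", "add methods section")

print(paper.content)      # the latest version of the single "paper"
print(paper.checkout(0))  # the first snapshot, still available
```

There is only ever one "paper" here, yet any earlier stage can still be retrieved, which is the improvement over keeping "paper-v1.doc", "paper-v2.doc", and so on.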

In data science, it's important to know how to keep track of your code as it changes over time. On top of that, when you are writing code in a collaborative setting, it is almost required that you know something about version control. This is how a group of people can collaboratively contribute code to the same project, even when working on the same files. Git is a tool that automates and enhances a lot of the tasks that arise when dealing with larger, longer-living, and collaborative projects. It has also become the common underpinning of many popular online code repositories, GitHub being the most popular.


One of the unique aspects of the course is that we are requiring the students to work on and submit their homework assignments using GitHub. We started by creating the GitHub organization datasciencelabs-students. Then, we followed the GitHub Education Classroom guide. To create private repositories for each student for each homework assignment, we followed the sandboxing setup. Sandboxing is the idea of creating duplicated repositories (one for each student) in an automated fashion. The tool that actually creates the private repositories for each student for each homework assignment is called "teachers_pet".
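As a rough illustration of what sandboxing produces, the Python sketch below lists one repository per student per homework assignment. It does not call GitHub or teachers_pet; the usernames, the assignment names, and the "username-assignment" naming pattern are all assumptions made for the example.

```python
def sandbox_repo_names(students, assignment):
    """Return one repository name per student for a given assignment.

    The "<student>-<assignment>" pattern is an assumed convention for
    illustration, not necessarily what teachers_pet generates.
    """
    return [f"{student}-{assignment}" for student in students]


# Hypothetical GitHub usernames and assignment names.
students = ["student-a", "student-b"]
assignments = ["homework1", "homework2"]

for assignment in assignments:
    for repo in sandbox_repo_names(students, assignment):
        # Each student gets a private copy inside the course organization.
        print(f"datasciencelabs-students/{repo}")
```

With two students and two assignments this lists four repositories, which is why automating the creation step matters once a class has dozens of students.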



If you are interested in using teachers_pet in your own classroom, I created a set of notes on GitHub describing how to install and use the tool. Once you have set up all the authentication steps with GitHub, you can clone my teachers_pet GitHub repository, which is an enhanced version assembled from multiple repositories on the web. It is enhanced because it can push files to specific branches in the private repositories and it can delete repositories at the end of your course so you can re-use the private repositories in the future.

I hope others find the teachers_pet tutorial helpful for using GitHub in the classroom!

