Thursday, February 6, 2014

Creating A New Ground for 'Data Science' outside of just Statistics or Computer Science

Terry Speed recently wrote a fantastic and inspirational article in AMSTAT News on the field of statistics.  He began by summarizing some themes from the IYS 2013 "The Future of Statistical Science" workshop he attended.  He noted "what an excellent job statisticians are doing", particularly in the areas of "genomics, cancer biology, the study of diet, the environment and climate, in risk and regulation, neuroimaging, confidentiality and privacy and autism research", but saw a lack of representation from "social, agricultural, government, business and industrial statistics".

He openly acknowledged that our field of statistics faces many challenges (I direct you to the link for the complete list). In particular: (1) statistics departments are grappling with the debate over absorbing statistics into the incredibly popular emerging field of "data science" and (2) statistics departments are not able to deliver the type of graduates that companies such as Google, Apple and Amazon want, which "perhaps involves adopting a more engineering approach to our work".  Yet students across the world are seeking out majors and/or courses in statistics and "data science" through many different venues, e.g. taking courses at universities, via MOOCs or through online tutorials. Should Statistics be renamed "Data Science"? What is the overlap? What is the best way to train "Data Scientists"?

Terry's response was essentially that data science is not the same as statistics, and he saw "no evidence that data science ... has any prospect of replacing our discipline" because statistics "is far wider and deeper than data science".  He encouraged statisticians to embrace this emerging field of data science and not fear it: "As with mathematics more generally, we are in this business for the long term. Let's not lose our nerve."

These ideas were all echoed at a symposium I attended last Friday called 'Paths to Precision Medicine: The Role of Statistics' hosted by the Department of Biostatistics at Harvard School of Public Health. At the end of the symposium, there was a panel discussion on 'Education of Future Statisticians in the Big Data Era' with Giovanni Parmigiani, Corsee Sanders, Rafael Irizarry and Marc Pfeffer as the panelists.

Giovanni said that transformations in technology are a characteristic of the "Big Data Era" and that no single field is able to take on "Big Data" as a whole by saying it is a subset of what they do (similar to the argument made by Terry).  Not computer science. Not electrical engineering.  Not statistics. When it comes to creating a curriculum to train individuals who are seeking to become "data scientists" in the Big Data Era, he says we should "teach less and do more".  Rather than starting with a predefined idea of what we should be teaching and/or adding more things to a curriculum for "data science", we should let these individuals dive right into projects as early as possible (with mentorship!). This will also help develop essential skills such as communication and teamwork, which cannot be taught in the classroom.

Furthermore, Rafa proposed creating a new, uncharted territory between statistics and computer science to train individuals who are seeking to become data scientists, with as much emphasis as they want in the direction of statistics, computer science or other fields.  This will more than likely require faculty to step out of their comfort zone and possibly even connect with faculty outside their department to jointly teach a course. Many people believe data science is more about the computing than the statistics, while others would emphasize the statistics more than the computing.  Regardless of how much emphasis each field receives in data science, the one thing I do know is that I agree with Jeff Leek's view that "the key word in 'Data Science' is not Data, it is Science".  We should be emphasizing the science (whether it is statistics, computer science, hacking skills, etc.) in our curriculums for data science and allow the students a say in how diverse a training they want.  As Rafa said, "the faculty in statistics and biostatistics departments are becoming increasingly more diverse and we should make these graduate students more diverse too."