Wednesday, April 8, 2015

Influential works in Data-Driven Discovery

A recent initiative to fund data-driven discoveries was completed last year by the Gordon and Betty Moore Foundation. Over 1,100 applications were received, and each applicant had the opportunity to cite five "influential works in the general field of 'Big Data' for scientific discovery".  An analysis was done to see which works were cited the most and which genres those works came from.  A paper summarizing the results was posted on arXiv on March 30, 2015 and had some interesting results that I wanted to share!

First, I was happy to see I have read several of the most cited influential works, but this also gave me a nice summer reading list of things I haven't read (I think this will basically be my #tbt (throwback Thursday) data-science papers for the next year)!  It was such a great list of works that I wanted to share the most cited influential works on here (each cited at least 10 times):

  1. MapReduce [Dean and Ghemawat, 2008] - 63 (citations)
  2. Fourth Paradigm [Hey et al., 2009] - 51
  3. Elements of Statistical Learning [Hastie et al., 2009] - 43
  4. Initial sequencing of the human genome [Lander et al., 2001] - 30
  5. A mathematical theory of communication [Shannon, 2001] - 24
  6. Sloan Digital Sky Survey [York et al., 2000] - 23
  7. BLAST [Altschul et al., 1990] - 20
  8. Lasso [Tibshirani et al., 1996] - 19
  9. Latent Dirichlet allocation [Blei et al., 2003] - 19
  10. EM algorithm [Dempster et al., 1977] - 17
  11. Support vector networks [Cortes and Vapnik, 1995] - 17
  12. Random forest [Breiman, 2001] - 15
  13. Pattern Recognition [Bishop et al., 2006] - 14
  14. Anatomy of a web search engine [Brin and Page, 1998] - 14
  15. Numerical Recipes [Press, 2007] - 13
  16. Bootstrap methods [Efron, 1979] - 11
  17. Equation of state calculations [Metropolis et al., 1953] - 11
  18. Exploratory data analysis [Tukey, 1977] - 11
  19. Probabilistic reasoning [Pearl, 1988] - 11
  20. PageRank [Page et al., 1999] - 10
  21. Bayesian Data Analysis [Gelman et al., 2013] - 10
  22. Unreasonable effectiveness of data [Halevy et al., 2009] - 10
Other cool things about this article:
  • The R programming language and the IPython Notebook programming environment were highlighted. The authors state the "R language is one of the leading programming languages, and was referenced a significant number of times".  Similarly, the IPython Notebook "is noteworthy as one of the few open source software toolkits for both programming and data analysis that is not database, algorithm or programming language". 
  • Classic/foundational ideas such as Bayes' theorem, the Metropolis-Hastings algorithm, the lasso, the bootstrap, and the Expectation-Maximization (EM) algorithm were sprinkled throughout the article. These ideas are almost standard in any statistics curriculum and are incredibly powerful and useful tools when analyzing data. 
  • The concepts of 'exploratory data analysis' (EDA) and 'data visualization' got a major shout out with Tukey's and Tufte's essential works.  These concepts are critical in the analysis of data and are often overlooked or treated as assumed knowledge.  I would argue that these concepts should be included as a major portion of any course based around teaching the concepts of data analysis.  

So, how many of these have you read?
