Wednesday, April 8, 2015

Influential works in Data-Driven Discovery

A recent initiative to fund data-driven discoveries was completed last year by the Gordon and Betty Moore Foundation. Over 1,100 applications were received, and each applicant had the opportunity to cite five "influential works in the general field of 'Big Data' for scientific discovery". An analysis was done to see which works were cited most often and which genres they came from. A paper summarizing the results was posted on arXiv on March 30, 2015, and it had some interesting results that I wanted to share!

First, I was happy to see that I have read several of the most cited influential works, but this also gave me a nice summer reading list of things I haven't read (I think this will basically be my #tbt (throwback Thursday) data-science papers for the next year)! It was such a great list that I wanted to share the most cited works here (each cited at least 10 times):

  1. MapReduce [Dean and Ghemawat, 2008] - 63 (citations)
  2. Fourth Paradigm [Hey et al., 2009] - 51
  3. Elements of Statistical Learning [Hastie et al., 2009] - 43
  4. Initial sequencing of the human genome [Lander et al., 2001] - 30
  5. A mathematical theory of communication [Shannon, 2001] - 24
  6. Sloan Digital Sky Survey [York et al., 2000] - 23
  7. BLAST [Altschul et al., 1990] - 20
  8. Lasso [Tibshirani, 1996] - 19
  9. Latent Dirichlet allocation [Blei et al., 2003] - 19
  10. EM algorithm [Dempster et al., 1977] - 17
  11. Support vector networks [Cortes and Vapnik, 1995] - 17
  12. Random forest [Breiman, 2001] - 15
  13. Pattern Recognition [Bishop, 2006] - 14
  14. Anatomy of web search engine [Brin and Page, 1998] - 14
  15. Numerical Recipes [Press et al., 2007] - 13
  16. Bootstrap methods [Efron, 1979] - 11
  17. Equation of state calculations [Metropolis et al., 1953] - 11
  18. Exploratory data analysis [Tukey, 1977] - 11
  19. Probabilistic reasoning [Pearl, 1988] - 11
  20. PageRank [Page et al., 1999] - 10
  21. Bayesian Data Analysis [Gelman et al., 2013] - 10
  22. Unreasonable effectiveness of data [Halevy et al., 2009] - 10

Other cool things about this article:
  • The R programming language and the IPython Notebook programming environment were highlighted. The authors state the "R language is one of the leading programming languages, and was referenced a significant number of times".  Similarly, the IPython Notebook "is noteworthy as one of the few open source software toolkits for both programming and data analysis that is not database, algorithm or programming language". 
  • Classic, foundational ideas such as Bayes' theorem, the Metropolis-Hastings algorithm, the lasso, the bootstrap, and the Expectation-Maximization (EM) algorithm were sprinkled throughout the article. These ideas are standard in almost any statistics curriculum and are incredibly powerful tools for analyzing data.
  • The concepts of 'exploratory data analysis' (EDA) and 'data visualization' got a major shout-out through Tukey's and Tufte's essential works. These concepts are critical in the analysis of data and are often overlooked or treated as assumed knowledge. I would argue that they should be a major part of any course that teaches data analysis.

So, how many of these have you read?

Sunday, April 5, 2015

Pasta alla Carbonara


Pasta alla carbonara is one of those wonderfully decadent recipes that works as an easy weeknight dinner or for a special date night! One of my good friends from Sicily recently showed me how to make pasta alla carbonara the traditional Italian way, which uses no cream, wine or stock (a big surprise to me, since that's how it is typically made in the US). Once you try this recipe, you won't even miss the cream!

Ingredients
- 1 package of bacon
- 1 box of pasta (campanelle, farfalle, penne, spaghetti, etc.)
- 4 fresh eggs
- 1/2 cup parmesan cheese
- salt, pepper

Optional: chopped parsley

Recipe
1) Bring a pot of salted water to a boil and cook the pasta to al dente (1-2 minutes less than the package directions).


2) While the water is coming to a boil, chop the raw bacon into bite-size pieces. In a large pan, sauté the bacon until golden brown.


3) Using a slotted spoon, move the bacon from the pan onto a towel-lined plate to drain some of the fat. At this point, you can turn off the heat and pour some of the rendered bacon fat out of the pan (but I prefer to keep all of it). It just adds extra flavor! :)


4) By this point the pasta should have just finished cooking to a perfect al dente, and the pan with the rendered bacon fat should be off the heat. In that same pan, add the pasta, bacon and 4 eggs. Keep mixing as you add the eggs so they don't scramble. The heat from the pasta will slightly cook the eggs. Add the parmesan cheese. Top with salt & pepper.


This recipe cooks so fast and is delicious. Seriously. You won't be disappointed! :)