Thursday, May 17, 2012

Interface 2012: Day 2

JCGS Highlights at the Interface session 
With several great choices to pick from, my Day 2 of the Interface conference began in the JCGS Highlights at the Interface session, where I listened to Jennifer Le-Rademacher of the University of Georgia give a talk on using symbolic covariance PCA and visualization techniques for interval-valued data.  The visualizations she showed were very interesting, but it was suggested she check out the palettes in the colorspace package in R to further improve her figures (a quick look at those palettes is below).
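For anyone curious, here is a minimal look at the HCL-based palettes the colorspace package provides (my own example, not from the talk):

```r
library(colorspace)

# Qualitative palette with balanced luminance across hues
pal <- rainbow_hcl(6)
barplot(rep(1, 6), col = pal, border = NA, axes = FALSE)

# Sequential and diverging alternatives
sequential_hcl(6)
diverge_hcl(6)
```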

Information Mining session
Next, I moved to the Information Mining session to hear William Szewczyk from the NSA give a talk on streaming exploratory data analysis.  He began by summarizing the process of data analysis in five steps:
  1) Choose a default model. This is a controversial subject in itself: he noted that even if you just describe the data by its mean and variance, you are implicitly assuming the data can be described by the first and second moments of the distribution.
  2) Project your data onto the model.
  3) Examine the fit, or lack thereof, of the data to the model.
  4) Adjust the model accordingly.
  5) Repeat.
For streaming data, you have to modify this process slightly because people often incorrectly assume they are the only ones working on that flow of data and that their process is the last one to touch it.  Finally he proposed a "default model for streaming data", similar to the way people often assume a Gaussian distribution as the default model for static data (a toy sketch of the idea is below). The last speaker of the session, Andy Frenkiel of IBM, gave a thought-provoking talk on using keyword searches to fill in the gaps of news stories when information is missing.
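He did not show code, but as my own toy illustration of the five steps with the first two moments as the default model, here is a running (Welford-style) update that flags observations fitting the model poorly as they stream in:

```r
# Toy illustration (mine, not from the talk): treat the first two moments as a
# "default model" and update them with each new observation.
update_moments <- function(state, x) {
  n     <- state$n + 1
  delta <- x - state$mean
  mean  <- state$mean + delta / n
  m2    <- state$m2 + delta * (x - mean)   # running sum of squared deviations
  list(n = n, mean = mean, m2 = m2)
}

state <- list(n = 0, mean = 0, m2 = 0)
for (x in rnorm(1000)) {
  if (state$n > 30) {
    sd_hat <- sqrt(state$m2 / (state$n - 1))
    if (abs(x - state$mean) > 4 * sd_hat)   # step 3: examine lack of fit
      message("poor fit to default model: ", round(x, 2))
  }
  state <- update_moments(state, x)         # step 4: adjust the model; repeat
}
```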

Contributed Paper Session II
Xueying Chen of Rutgers University began this session with her split-and-conquer approach for extremely large datasets.  In her talk, she randomly splits a data set into subsets, estimates a penalized logistic regression model within each subset, and finally combines the subset estimates into a single set of coefficient estimates (a rough sketch of the idea is below). The second speaker was Garrett Grolemund of Rice University (pictured below) who gave a wonderful demo of his R package lubridate, which greatly simplifies the process of working with dates, times and time zones.  Some great features include the ability to display the same instant of time in different time zones, to save and use time intervals as a class of object in R, and to test whether certain dates fall "%within%" a different set of dates (also sketched below).  I'll also advertise for his online course on Visualization in R with ggplot2 on June 19-20!
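Not her code, but a minimal sketch of the split-and-conquer idea, assuming glmnet for the penalized logistic fits and simple averaging as the combining step:

```r
library(glmnet)

# Toy data: n observations, p predictors, binary response
set.seed(1)
n <- 10000; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

# Split the rows into K random subsets and fit a penalized logistic model in each
K <- 5
subset_id <- sample(rep(1:K, length.out = n))
fits <- lapply(1:K, function(k) {
  idx <- which(subset_id == k)
  cv.glmnet(X[idx, ], y[idx], family = "binomial")
})

# Combine: here, simply average the coefficient estimates across subsets
coef_mat <- sapply(fits, function(f) as.numeric(coef(f, s = "lambda.min")))
combined <- rowMeans(coef_mat)
```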
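And a small sketch of the lubridate features mentioned above (the dates here are just for illustration):

```r
library(lubridate)

# The same instant displayed in two different time zones
t <- ymd_hms("2012-05-17 09:00:00", tz = "America/Chicago")
with_tz(t, "Europe/London")

# Save a time interval as an object and test membership with %within%
conf <- interval(ymd("2012-05-16"), ymd("2012-05-18"))
ymd("2012-05-17") %within% conf   # TRUE
```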




David Kahle of Baylor University (pictured below) gave the final talk of the session on his useful R package mpoly, which allows users to work with multivariate polynomials within R.  There are three other packages in R that work with polynomials, but they are not very intuitive or efficient to work with.  Some features include a new class of mpoly objects, basic arithmetic, calculus such as gradients, algebra, and finally evaluating polynomials (a small sketch is below).
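A brief sketch of what working with mpoly looks like (a minimal example I put together, not from the talk):

```r
library(mpoly)

# Construct multivariate polynomials from character strings
p <- mp("x^2 + 2 x y + y^2")
q <- mp("x - y")

# Basic arithmetic on mpoly objects
p * q
p + q

# Calculus: partial derivative with respect to x
deriv(p, "x")

# Evaluate the polynomial by converting it to an ordinary R function
f <- as.function(p)
f(c(1, 2))   # evaluated at (x, y) = (1, 2)
```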




Woman VS Machine: The Inference Battle session
After lunch, quite a few conference attendees moved into the Woman VS Machine: The Inference Battle session. The session began with Andreas Buja, who gave a thought-provoking talk on the problems with post-selection inference and proposed the Post Selection Inference (PoSI) constant, which allows valid post-selection inference. Interestingly, PoSI guarantees coverage of CIs and Type I errors of tests and is not specific to any type of model selection. The second speaker in the session, Heike Hofmann, outlined the concepts of visual inference within the framework of exploratory data analysis. In a classical statistical setting, we reject the null hypothesis if the test statistic is past some threshold; in a visual setting, she argued, we would reject the null hypothesis (that the data plot is not distinguishable from null plots) if viewers can identify the data plot in a lineup.  A great example was shown in which she simulated random data in four dimensions and included one plot with the real signal. Due to an artifact of high dimensionality, the audience was not able to pick it out (including me!).  Finally, she described a set of experiments they designed using Amazon Mechanical Turk in which they recruited people to look at a lineup of plots and pick out which one was different from the rest, using criteria such as bi-modality, outliers and mean shift (and showed the resulting power estimates). Very curious results. The last speaker, Mahbub Majumder, described these "Turk experiments" in greater detail. The session ended with a great question from the audience about the reproducibility of this type of research (which I believe should be possible as long as the same plots could be used again).
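To make the lineup idea concrete, here is a toy example I put together in ggplot2 (not the code from their experiments): the real scatterplot is hidden among 19 null panels made by permuting y to destroy any relationship.

```r
library(ggplot2)

set.seed(42)
n  <- 100
df <- data.frame(x = rnorm(n))
df$y <- 0.4 * df$x + rnorm(n)           # the real (weak) signal

pos <- sample(1:20, 1)                   # where the real plot is hidden
panels <- lapply(1:20, function(i) {
  y <- if (i == pos) df$y else sample(df$y)   # permute y for the null panels
  data.frame(panel = i, x = df$x, y = y)
})
lineup_df <- do.call(rbind, panels)

ggplot(lineup_df, aes(x, y)) +
  geom_point(size = 0.7) +
  facet_wrap(~ panel)
# Can you pick out panel number `pos`?
```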

Banquet Keynote
Mark Hansen of UCLA gave an amazing keynote talk for the banquet tonight!  I was very impressed with the quality of graphics and art projects he has produced over the last decade.

I will wrap up with a few pictures from the banquet of some current and past PhD students at Rice.





Wednesday, May 16, 2012

Interface 2012: Day 1

Today was the first day of the 43rd Interface Conference 2012, which is being held at Rice University this year (follow me for updates on twitter with the hashtag #Interface12).  There were several concurrent technical sessions going on throughout the day, so I will only post about the ones I attended.  It was an early start to the morning, but the coffee definitely helped.  :)



Keynote speaker
The keynote speaker, Trevor Hastie from Stanford, gave a wonderful talk this morning on methods for low-rank factorization with missing data (a perfect application for the Netflix data).  He specifically discussed the methods Soft-Impute (soft-thresholded SVD) and Hard-Impute and showed their relationship to Maximum Margin Matrix Factorization (MMMF).  His method has an expectation-maximization flavor to it and is similar to alternating ridge regression.  Finally he ended with a few generalizations, including Convex Robust Completion (Robust SVD): when the data matrix X can be approximated by L (a low-rank matrix) + S (a sparse matrix), the method simply adds a penalty on the sparse matrix in addition to the penalty on the low-rank matrix.  I enjoyed the level of detail in this talk (a toy sketch of the Soft-Impute iteration is below).
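Here is a toy base-R sketch of the Soft-Impute iteration as I understood it (my own code, not his implementation): fill the missing entries with the current low-rank estimate, take an SVD, soft-threshold the singular values, and repeat.

```r
# Toy Soft-Impute sketch: soft-thresholded SVD imputation of missing entries
soft_impute <- function(X, lambda, n_iter = 100) {
  miss <- is.na(X)
  Z <- X
  Z[miss] <- 0                        # initial fill
  for (i in seq_len(n_iter)) {
    s <- svd(Z)
    d <- pmax(s$d - lambda, 0)        # soft-threshold the singular values
    L <- s$u %*% diag(d) %*% t(s$v)   # low-rank reconstruction
    Z[miss] <- L[miss]                # re-impute only the missing entries
  }
  L
}

# Example: a noisy rank-2 matrix with 30% of entries missing
set.seed(1)
X <- tcrossprod(matrix(rnorm(50 * 2), 50, 2), matrix(rnorm(40 * 2), 40, 2)) +
     matrix(rnorm(50 * 40, sd = 0.1), 50, 40)
X[sample(length(X), 0.3 * length(X))] <- NA
Xhat <- soft_impute(X, lambda = 2)
```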



Software Development in R session
After the keynote, I decided to attend the Software Development in R technical session.  The first speaker, JJ Allaire (founder of RStudio), gave a great high-level demo of many useful features in RStudio.  He stressed the importance of "reproducible research" and "trustworthy computing".  Some of the most exciting things in RStudio include: a searchable history of any piece of code ever run through the console, the ability to page back through plots, quickly traverse nested functions, interact with Git and SVN, and incorporate Sweave and knitr.  You can easily navigate between chunks of code and even be pointed back to the source code after clicking on a compiled pdf.  RStudio now has the feature of writing in the markdown language to quickly publish high-quality web pages instead of having to deal with html.  The second speaker in the session, Norm Matloff of UC Davis, discussed parallel computing in R. He reviewed classical shared-memory loop scheduling methods (static, dynamic, time-varying chunk size, etc.) and how these might be adapted to R. The example he discussed was how to parallelize all possible regressions in a given data set with dim(X) = n x p.  The available R packages discussed for parallel computing were:
  1) snow - serializes/deserializes communications, which takes time; the most widely used of these packages; clusterApply() is static and clusterApplyLB() is dynamic; both are limited to a fixed chunk size of 1 (small chunk sizes are not good because of high overhead); chunk size > 1 must be programmed by the user
  2) Rmpi - more flexible than snow, but still has serialization and network problems
  3) mclapply/multicore - each call involves creating a new unix process
  4) gputools - each call involves a GPU kernel invocation, time intensive; major overhead
He suggested a new scheduling method called 'Random Scheduling'.  After making this small adjustment, you can use the R packages as before (e.g. snow); a toy sketch of the idea is below.  For his presentation slides go here and for his open source book go here.  The final speaker of the session, Duncan Murdoch, gave a great overview of the older tools available for debugging and some examples of visual debuggers.
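My reading of random scheduling (this is my own sketch, not his code) is to randomly permute the tasks before handing them out, so the expected load per worker is balanced without paying dynamic-scheduling overhead. With snow it might look like this, using single-predictor regressions as a simplified stand-in for the all-possible-regressions task list:

```r
library(snow)

set.seed(1)
n <- 200; p <- 8
X <- matrix(rnorm(n * p), n, p)
y <- X %*% rnorm(p) + rnorm(n)

tasks <- sample(1:p)                    # the random-scheduling step: shuffle tasks

cl <- makeCluster(4, type = "SOCK")
clusterExport(cl, c("X", "y"))
fits <- clusterApply(cl, tasks, function(j) coef(lm(y ~ X[, j])))
stopCluster(cl)
```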


Statistical Models for Complex Functional Data session
The session started out with Todd Ogden of Columbia University, who discussed sparse functional principal component regression to predict depression using MRI images as the functional data.  He mentioned that other statistical learning tools such as random forests may be more accurate, but he advocated for a regression-based method with functional data because functional regression has a clear interpretation of the weight function.  After expressing the functional components in the wavelet domain, he applies penalization techniques such as wavelet-based LASSO to the functional model.  The basic idea is to perform sparse functional principal component analysis and then use the loadings as the predictors (a toy sketch of this idea is below).  The second speaker, Lan Zhou of Texas A&M, showed how to use penalized bivariate B-splines in functional data analysis to estimate variability in Texas temperatures over the past 100 years.  Because Texas, the "domain", is complicated (e.g. not rectangular, contains holes, etc.), she uses the idea of triangulations.  The goal is to estimate a bivariate smooth function over the domain to create a temperature map using data from weather stations and to investigate the variability in temperatures over the years.  Veera Baladandayuthapani of UT MD Anderson wrapped the session up by discussing a Bayesian functional mixed model for copy-number variation measured by aCGH array data and extending it to SNP array data (higher resolution).  The goal was to do a joint analysis on a set of samples to look for a small signal by borrowing strength between the samples.
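As a rough, self-contained illustration of the "PCA on the curves, then regress on the scores" idea (my own toy example using glmnet, not Ogden's wavelet-based method):

```r
library(glmnet)

# Simulate n curves observed on a common grid, with an outcome driven by
# part of the curve
set.seed(1)
n <- 100; grid <- seq(0, 1, length.out = 50)
curves <- t(sapply(1:n, function(i) sin(2 * pi * grid) * rnorm(1) +
                                     cos(2 * pi * grid) * rnorm(1) +
                                     rnorm(length(grid), sd = 0.2)))
y <- 2 * curves[, 10] + rnorm(n)

# PCA of the curves, then a penalized regression on the leading scores
pca    <- prcomp(curves, center = TRUE)
scores <- pca$x[, 1:5]
fit    <- cv.glmnet(scores, y)
coef(fit, s = "lambda.min")
```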

So far the talks have been excellent and I'm looking forward to the rest of the sessions tomorrow and Friday!