Statistical [R]ecipes: Interface 2012: Day 1

Today was the first day of the 43rd Interface Conference 2012, which is being held at Rice University this year (follow my updates on Twitter with the hashtag #Interface12). There were several concurrent technical sessions going on throughout the day, so I will only post about the ones I attended. It was an early start to the morning, but the coffee definitely helped. :)

Keynote speaker
The keynote speaker, Trevor Hastie from Stanford, gave a wonderful talk this morning on methods for low-rank factorization with missing data (a perfect application for the Netflix data). He specifically discussed the methods Soft-Impute (soft-thresholded SVD) and Hard-Impute and showed their relationship to Maximum Margin Matrix Factorization (MMMF). His method has an expectation-maximization flavor to it and is similar to alternating ridge regression. Finally, he ended with a few generalizations, including Convex Robust Completion (Robust SVD). When the data matrix X can be approximated by L (a low-rank matrix) + S (a sparse matrix), the method adds a penalty on the sparse matrix in addition to the penalty on the low-rank matrix. I enjoyed the level of detail in this talk.
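For readers who want the formulas, here is my understanding of the Soft-Impute setup, written as a sketch rather than a transcript of the slides. The notation is my own assumption: Ω is the set of observed entries, P_Ω keeps the observed entries and zeroes the rest, and λ, λ_1, λ_2 are penalty parameters. The exact loss in the robust version is also an assumption on my part; the talk summary above only says that a sparsity penalty is added.

```latex
% Soft-Impute: nuclear-norm penalized fit to the observed entries
\min_{Z} \; \tfrac{1}{2} \, \| P_\Omega(X) - P_\Omega(Z) \|_F^2 + \lambda \, \| Z \|_*

% One iteration: fill in the missing entries with the current estimate,
% then soft-threshold the singular values of the completed matrix
Z^{\text{new}} = S_\lambda\!\left( P_\Omega(X) + P_\Omega^{\perp}(Z^{\text{old}}) \right),
\qquad
S_\lambda(W) = U \, \mathrm{diag}\big( (d_i - \lambda)_+ \big) \, V^{\top}
\quad \text{for the SVD } W = U \, \mathrm{diag}(d_i) \, V^{\top}

% Robust variant with X \approx L + S: one penalty per component
\min_{L, S} \; \tfrac{1}{2} \, \| X - L - S \|_F^2 + \lambda_1 \, \| L \|_* + \lambda_2 \, \| S \|_1
```

The fill-in step followed by a re-fit is where the expectation-maximization flavor comes from: the current estimate imputes the missing entries, and the thresholded SVD then updates the estimate.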

Software Development in R session
After the keynote, I decided to attend the Software Development in R technical session. The first speaker, JJ Allaire (founder of RStudio), gave a great high-level demo of many useful features in RStudio. He stressed the importance of "reproducible research" and "trustworthy computing". Some of the most exciting things in RStudio include a searchable history for any piece of code ever run through the console, paging back through plots, quick traversal of nested functions, integration with Git and SVN, and incorporation of Sweave and knitr. You can easily navigate between chunks of code and even be pointed back to the source code after clicking on a compiled PDF. RStudio now also lets you write in Markdown to quickly publish high-quality web pages instead of having to deal with HTML.
The second speaker in the session, Norm Matloff of UC Davis, discussed parallel computing in R. He reviewed classical shared-memory loop scheduling methods (static, dynamic, time-varying chunk size, etc.) and how these might be adapted to R. The example he discussed was how to parallelize all possible regressions in a given data set with dim(X) = n x p. The available R packages discussed for parallel computing were:
1) snow - the most used R package; serializes/deserializes communications, which takes time; clusterApply() schedules statically and clusterApplyLB() schedules dynamically; both are limited to a fixed chunk size of 1 (small chunk sizes are inefficient because of high overhead); a chunk size > 1 must be programmed by the user
2) Rmpi - more flexible than snow, but still has serialization and network problems
3) mclapply/multicore - each call involves creating a new Unix process
4) gputools - each call involves a GPU kernel invocation, which is time intensive; major overhead
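To make the chunk-size trade-off in the list above concrete, here is a minimal sketch in Python (used purely for illustration; none of this is the snow API). It enumerates the tasks in the all-possible-regressions example, one task per non-empty subset of the p predictors, groups them into chunks, and deals the chunks out to workers in a fixed order decided up front, which is the static scheme. The function names, the round-robin rule, and the 2-worker cluster are all assumptions made for the example.

```python
from itertools import combinations


def all_subsets(p):
    """Return every non-empty subset of predictor indices 0..p-1 (2**p - 1 tasks)."""
    return [subset
            for k in range(1, p + 1)
            for subset in combinations(range(p), k)]


def make_chunks(tasks, chunk_size):
    """Group consecutive tasks into chunks of at most chunk_size tasks."""
    return [tasks[i:i + chunk_size] for i in range(0, len(tasks), chunk_size)]


def static_schedule(chunks, n_workers):
    """Assign chunk j to worker j % n_workers before any work starts."""
    assignment = [[] for _ in range(n_workers)]
    for j, chunk in enumerate(chunks):
        assignment[j % n_workers].extend(chunk)
    return assignment


tasks = all_subsets(4)                       # p = 4 predictors
chunks_of_one = make_chunks(tasks, 1)        # one message per task
chunks_of_four = make_chunks(tasks, 4)       # fewer, larger messages
plan = static_schedule(chunks_of_four, 2)    # hypothetical 2-worker cluster

print(len(tasks), len(chunks_of_one), len(chunks_of_four))
print([len(worker_tasks) for worker_tasks in plan])
```

With a chunk size of 1, every regression costs one round of communication; larger chunks cut the number of messages, at the price of coarser load balancing between workers.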
He suggested a new scheduling method called 'Random Scheduling'. After making this small adjustment, you can use the R packages as before (e.g. snow). For his presentation slides go here, and for his open source book go here. The final speaker of the session, Duncan Murdoch, gave a great overview of the older tools available for debugging and showed some examples of visual debuggers.
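My notes do not spell out how random scheduling works. One plausible reading, and it is only an assumption here, is that the task order is randomly permuted before the usual static split, so that expensive tasks are spread across workers instead of clustering on one of them. The Python sketch below illustrates that reading with a made-up workload; the function names, the cost model, and the seed are all hypothetical.

```python
import random


def random_static_schedule(tasks, n_workers, seed=None):
    """Shuffle the tasks, then split them into n_workers contiguous blocks."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    block = -(-len(shuffled) // n_workers)   # ceiling division
    return [shuffled[i * block:(i + 1) * block] for i in range(n_workers)]


def max_load(assignment, cost):
    """Total cost on the busiest worker; cost(task) is a made-up per-task cost."""
    return max(sum(cost(t) for t in worker) for worker in assignment)


# Hypothetical workload: task i costs i units, so the late tasks are expensive.
tasks = list(range(1, 21))
contiguous = [tasks[0:10], tasks[10:20]]               # plain static split
randomized = random_static_schedule(tasks, 2, seed=1)  # shuffled static split

print(max_load(contiguous, cost=lambda t: t))
print(max_load(randomized, cost=lambda t: t))
```

In this toy workload the plain static split hands all of the expensive tasks to the second worker, which is the worst case for two workers; a shuffled order can never do worse than that and will usually land closer to an even split, without any change to the package calls that follow.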

Statistical Models for Complex Functional Data session
The session started out with Todd Ogden of Columbia University, who discussed sparse functional principal component regression to predict depression using MRI images as the functional data. He mentioned that other statistical learning tools such as random forests may be more accurate, but he advocated using a regression-based method with functional data because functional regression gives the weight function a clear interpretation. After expressing the functional components in the wavelet domain, he applies penalization techniques such as the wavelet-based LASSO to the functional model. The basic idea is to perform sparse functional principal component analysis and then use the loadings as the predictors.
The second speaker, Lan Zhou of Texas A&M, showed how to use penalized bivariate B-splines in functional data analysis to estimate variability in Texas temperatures over the past 100 years. Because Texas (the "domain") is complicated (e.g. not rectangular, contains holes, etc.), she uses the idea of triangulations. The goal is to estimate a bivariate smooth function over the domain to create a temperature map using data from weather stations and to investigate the variability in temperatures over the years.
Veera Baladandayuthapani of UT MD Anderson wrapped the session up by discussing a Bayesian functional mixed model for copy-number variation data measured by aCGH arrays and extending it to SNP array data (higher resolution). The goal was to do a joint analysis on a set of samples to look for a small signal by borrowing strength between the samples.
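For reference, the "weight function" in the first talk of this session is the coefficient function in a scalar-on-function regression. The generic textbook form is sketched below; the notation is my own assumption and the speaker's actual model may differ in its details (for images, t ranges over the image domain rather than an interval).

```latex
% Scalar response y_i, functional predictor X_i(t), weight function \beta(t)
y_i = \alpha + \int X_i(t) \, \beta(t) \, dt + \varepsilon_i

% Expanding \beta in a wavelet basis \{\psi_k\} turns this into an ordinary
% regression on the coefficients c_{ik} = \int X_i(t) \, \psi_k(t) \, dt
\beta(t) = \sum_{k} \theta_k \, \psi_k(t)

% A LASSO penalty on the wavelet coefficients then gives a sparse estimate
\min_{\alpha, \theta} \; \sum_{i=1}^{n} \Big( y_i - \alpha - \sum_{k} c_{ik} \, \theta_k \Big)^2
  + \lambda \sum_{k} | \theta_k |
```

The interpretability argument is that the estimated β(t) shows which parts of the domain drive the prediction, something a random forest does not provide directly.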

So far the talks have been excellent and I'm looking forward to the rest of the sessions tomorrow and Friday!

Wednesday, May 16, 2012
