Statistical [R]ecipes: Interface 2012: Day 2

JCGS Highlights at the Interface session
With several great choices to pick from, my Day 2 of the Interface conference began in the JCGS Highlights at the Interface session, where Jennifer Le-Rademacher of the University of Georgia gave a talk on using symbolic-covariance PCA and visualization techniques for interval-valued data. The visualizations she showed were very interesting, but it was suggested that she check out the palettes in the colorspace package in R to further improve her figures.

Information Mining session
Next, I moved to the Information Mining session to hear William Szewczyk from the NSA give a talk on streaming exploratory data analysis. He began by summarizing the process of data analysis in five steps: 1) choose a default model, which is a very controversial subject in itself (he pointed out that even if you just describe the data with its mean and variance, you are implicitly assuming the data can be described by the first and second moments of the distribution); 2) project your data onto the model; 3) examine the fit, or lack thereof, of the data to the model; 4) adjust the model accordingly; 5) repeat. For streaming data, you have to modify this process slightly, because people incorrectly assume that they are the only ones working on that flow of data and that their process is the last one to touch it. Finally, he proposed a "default model for streaming data," similar to the way people often assume a Gaussian distribution as the default model for static data. The last speaker of the session, Andy Frenkiel of IBM, gave a thought-provoking talk on using keyword searches to fill in the gaps in news stories when information is missing.
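To make the "mean and variance as a default model" remark a bit more concrete for a stream, here is a minimal sketch of how the first two moments can be tracked one observation at a time without storing the data. This is my own illustration using Welford's online algorithm, not anything shown in the talk; the class name RunningMoments and the sample values are made up for the example.

```python
class RunningMoments:
    """Track the mean and sample variance of a stream (Welford's algorithm)."""

    def __init__(self):
        self.n = 0          # number of observations seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running summaries."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); undefined for fewer than 2 points."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


# Hypothetical stream of values, consumed one at a time.
stats = RunningMoments()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

Each update costs constant time and memory, which is the property you need when the data never stop arriving.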

Contributed Paper Session II
Xueying Chen of Rutgers University began this session with her split and conquer approach for extremely large datasets. In her talk, she randomly splits a data set into subsets, estimates a penalized logistic regression model within each subset, and finally combines the estimates from the subsets into a final set of coefficient estimates. The second speaker was Garrett Grolemund of Rice University (pictured below), who gave a wonderful demo of his R package lubridate, which greatly simplifies the process of working with dates, times and time zones. Some great features include the ability to display the same instant of time in different time zones, to save and use time intervals as a class object in R, and to test whether certain dates fall "%within%" a different set of dates. I'll also put in a plug for his online course on Visualization in R with ggplot2 on June 19-20!
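Going back to the split and conquer idea from the first talk, here is a rough sketch of the three steps (random split, penalized fit per subset, combine). Several things here are my assumptions and not details from the talk: I use a ridge (L2) penalty fit by Newton's method, I combine the subset estimates by plain averaging, and the data are simulated. The function names are made up for the example.

```python
import numpy as np


def fit_ridge_logistic(X, y, lam=1.0, n_iter=25):
    """Fit L2-penalized logistic regression (no intercept) by Newton's method."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))          # fitted probabilities
        grad = X.T @ (prob - y) + lam * beta              # penalized gradient
        w = prob * (1.0 - prob)                           # Newton weights
        hess = (X * w[:, None]).T @ X + lam * np.eye(p)   # penalized Hessian
        beta = beta - np.linalg.solve(hess, grad)
    return beta


def split_and_conquer(X, y, n_splits=10, lam=1.0, seed=0):
    """Randomly split the rows, fit each subset, and average the coefficients."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(X.shape[0])
    chunks = np.array_split(shuffled, n_splits)
    estimates = [fit_ridge_logistic(X[c], y[c], lam) for c in chunks]
    # Assumption: simple averaging stands in for whatever combination
    # rule was actually presented.
    return np.mean(estimates, axis=0)


# Simulated data with known coefficients (purely illustrative).
rng = np.random.default_rng(42)
n, p = 20000, 5
true_beta = np.array([1.5, -1.0, 0.0, 0.5, 0.0])
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

beta_hat = split_and_conquer(X, y, n_splits=10)
print(np.round(beta_hat, 2))
```

The appeal is that each subset fit only needs its own chunk in memory, so the subsets can be processed one after another or in parallel.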

David Kahle of Baylor University (pictured below) gave the final talk of the session on his useful R package mpoly, which allows users to work with multivariate polynomials within R. There are three other packages in R that work with polynomials, but they are not very intuitive or efficient to work with. Some features include a new class of mpoly objects, basic arithmetic and algebra, calculus operations such as gradients, and evaluation of polynomials.

Woman VS Machine: The Inference Battle session
After lunch, quite a few conference attendees moved into the Woman VS Machine: The Inference Battle session. The session began with Andreas Buja, who gave a thought-provoking talk on the problems with post-selection inference and proposed the Post Selection Inference (PoSI) constant, which allows valid post-selection inference. Interestingly, PoSI guarantees the coverage of CIs and the Type I error rates of tests, and it is not specific to any type of model selection. The second speaker in the session, Heike Hofmann, outlined the concepts of visual inference within the framework of exploratory data analysis. In a classical statistical setting, we reject the null hypothesis if the test statistic is past some threshold, but in a visual setting, she argued, we would reject the null hypothesis (i.e., that the data plot is not distinguishable from null plots) if the data plot is identifiable. A great example was shown in which she simulated random data in four dimensions and included one plot with the real signal. Due to the artifact of high dimensionality, the audience was not able to pick it out (including me!). Finally, she described a set of experiments they designed using Amazon Mechanical Turk, in which they recruited people to look at a line-up of plots and pick out which ones were different from the rest using criteria such as bi-modality, outliers and mean shift (and showed the power estimates). Very curious results. The last speaker, Mahbub Majumder, described these "Turk experiments" in greater detail. The session ended with a great question from the audience about the reproducibility of this type of research (which I believe it should be, as long as the same plots can be used again).
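To give a feel for how a line-up turns into a test, here is a small sketch of one simple way to get a p-value from it. The assumptions are mine and not taken from the talks: each observer sees a line-up of 20 plots (the size is an arbitrary choice), observers pick independently, and under the null hypothesis each one picks the data plot with probability 1/20. The function name is made up for the example.

```python
from math import comb


def lineup_p_value(n_observers, n_correct, n_plots=20):
    """P(at least n_correct of n_observers pick the data plot by pure chance)."""
    p = 1.0 / n_plots  # chance of picking the data plot under the null
    return sum(
        comb(n_observers, k) * p**k * (1.0 - p) ** (n_observers - k)
        for k in range(n_correct, n_observers + 1)
    )


# A single observer who picks the data plot out of 20 gives p = 1/20.
print(lineup_p_value(n_observers=1, n_correct=1))

# Hypothetical experiment: 10 observers, 4 of whom pick the data plot.
print(lineup_p_value(n_observers=10, n_correct=4))
```

The more observers who single out the data plot, the smaller this tail probability gets, which is the visual analogue of a test statistic passing its threshold.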

Banquet Keynote
Mark Hansen of UCLA gave an amazing keynote talk for the banquet tonight! I was very impressed with the quality of graphics and art projects he has produced over the last decade.

I will wrap up with a few pictures from the banquet of some current and past PhD students at Rice.

Thursday, May 17, 2012
