Statistical [R]ecipes: March 2012

Friday, March 30, 2012

Policy on 'Secondary findings' from Whole Genome Sequencing in Clinical Tests

The American College of Medical Genetics and Genomics (ACMG) is having their annual meeting this week in North Carolina. One of the major discussion points is: when a patient has their genome sequenced to look for disease-causing mutations for a specific disease in a clinical setting, what do you do with the 'secondary-findings' or other mutations unrelated to the disease in question that have been found? This is an incredibly difficult and convoluted question to answer.

For example, in the clinical setting, say a patient's genome is being sequenced to test whether it contains mutations related to high cholesterol, but in the process other mutations come back positive for Alzheimer's. Should the patient be informed of the 'secondary-finding' information? What if it were a child? Should the child know at a young age that they have a high predisposition to Alzheimer's?

In a research setting, there are currently hundreds of large sequencing studies that sequence the genomes of many individuals suffering from a particular disease in an effort to study the etiology of that disease. When patients participate in these sequencing projects, thousands of mutations are often found which may or may not be related to a wide spectrum of diseases. When researchers find mutations related to other diseases, should they be responsible for reporting the information to the patient? If the individual's genome is sequenced a second time at a later point and the results include mutations whose relation to a disease was not known before but is known now, should the researcher be responsible for tracking down the individual to inform them? If a patient was told at one point that they carry a deleterious mutation, but in the future that mutation is no longer considered deleterious, what should happen? At the American Society of Human Genetics (ASHG) annual meeting last fall in Montreal, I attended a similar forum that discussed many of these questions. The conversation can only be described as "intense and very heated". There were individuals who were adamantly in support of informing patients of secondary findings and individuals who were adamantly against it, in both the research and clinical settings.

The ACMG is releasing a policy statement, to be finalized this summer, in support of reporting secondary findings to patients in the clinical setting. The policy says only disease-causing mutations with a high prevalence for a treatable condition will be included in this clinical testing. Mutations for diseases with no known treatments will not be included in the list. I will be interested to see how we as a society decide to deal with all the other issues that will come out of this policy. A few of the issues include: How will we relay the information to the patients? Who is responsible for relaying the information? Who will help the patients interpret these variants? Who is responsible for updating the patient on new information in the future? Of course, there are also the legal issues related to the patient's privacy.

Wednesday, March 28, 2012

Live Forum Tomorrow on 'Big Data' from the White House

Tomorrow afternoon the White House will be hosting a 90-minute forum on the Challenges and Opportunities in Big Data! Tune in live at 2pm Eastern on Thursday, March 29th to see leaders in academia and industry, along with the heads of these governmental agencies: OSTP, NSF, NIH, DoE, DoD, DARPA and USGS. A blog post from R-bloggers.com suggested that although how to store large data is of interest, what's more interesting is how to infer information from it. This is exactly the question that genetics and genomics are asking with next-generation sequencing. We are now at a point where sequencing someone's genome is cheap. Interpreting the variants from someone's genome is the million, no, billion dollar question. As a statistician, I'm curious to see the government's stance on analyzing not only the large data coming out of companies such as Amazon, Google and Netflix, but also genome data (hopefully).

Saturday, March 24, 2012

Beer-battered Fish Tacos and Fried Plantains

Tonight we made beer-battered fish tacos and fried plantains with a spicy mango salsa!

I have a confession. I really love fish tacos. Whenever I see them on a menu, I almost always order them in my quest to find the best fish tacos around! In my opinion, Cuatro's in Austin, TX is the reigning champion of 'Best Fish Tacos'. :) So simple, so delicious, so perfect.

Here in Houston, I haven't really found any tacos that compare. So, we decided to spend the evening trying to make some tasty fish tacos. We started out with some plantains that we bought at Central Market this morning.

and ended up with these:

Then we moved on to making the beer batter for the fish. We went with tilapia and seasoned it up a bit before adding it to the batter. This was my first time making beer batter, and it was surprisingly easy. It's about a 1:1 ratio of flour to beer. Add in a bit of salt, sugar and baking powder and you're ready to go.

After coming out, they looked delicious!

Because we had a bit of extra batter, we opted to make a few onion rings. We do not fry things very often, partly because it's terrible for you and partly because it's a big ordeal to clean up the oil afterwards.

Anyways, the mango salsa ended up being a bit too spicy by itself, but went perfectly with both the plantains and the tacos. I think both dishes were a great success!

Thursday, March 22, 2012

Strong smells help people take smaller bites

A new study came out suggesting that foods that have a strong smell help people take smaller bites of food. Nature wrote a blog post about it which is how I came across the study. Researchers from the Netherlands pumped vanilla custard into people's mouths and pumped different smells into their noses at the same time. The higher the intensity of the smell, the smaller the bite. Eventually, they hope to design foods with intense smells to trigger people to take smaller bites.

[Photo taken from Photo-Dictionary.com]

The best part is that this study was published in a new journal called Flavour which, from what I can tell, was launched just this month! How exciting! They describe it as a "peer-reviewed, open access, online journal that publishes interdisciplinary articles on flavour, its generation and perception, and its influence on behaviour and nutrition. We seek articles on the psychophysical, psychological and chemical aspects of flavour as well as those taking brain imaging approaches. We take flavour to be the experience of eating food as mediated through all the senses. Thus we welcome articles that deal with not only taste and aroma, but also chemesthesis, texture and all the senses as they relate to the perception of flavour."

They hope to make the journal accessible to scientists as well as chefs and nutritionists. I will definitely add it to my list of journals to read.

Wednesday, March 21, 2012

Getting WinBUGS and OpenBUGS running on a Mac OS

I came across a great blog post on r-bloggers.com today pointing to two tutorials on how to get WinBUGS and OpenBUGS running on Mac OS using Wine. Very useful for all your Bayesian data analysis.

Tuesday, March 20, 2012

EM Algorithm: Confidence Intervals

Say you have some sample data $X_1, X_2, \ldots, X_n$ that are iid and follow some distribution $f(X|\theta)$ (e.g. Binomial$(n,p)$). Finding the maximum likelihood estimate (MLE) and confidence interval of $\theta$ is fairly straightforward. To find the MLEs, just compute the gradient of the log likelihood, set equal to 0 and solve for $\theta$. Let $\hat{\theta}$ be the MLE of $\theta$ and let the Fisher Information matrix of the sample be given by

\[ I(\theta) = E_{\theta}((\frac{\partial}{\partial \theta} \log f(\mathbf{X} | \theta) )^2 ) = - E_{\theta}(\frac{\partial^2}{\partial \theta^2} \log f(\mathbf{X} | \theta) ) \]

An important property of MLEs is that the estimator $\hat{\theta}$ is asymptotically normal with mean $\theta$, and Var($\hat{\theta}$) is approximated by the inverse of the Fisher Information matrix $I(\theta)$.

To calculate a $100(1-\alpha)\%$ confidence interval for $\theta$, compute the Fisher Information matrix from the sample, evaluate it at the MLE $\hat{\theta}$, and form
\[ \left[ \hat{\theta} - Z_{\alpha/2} \frac{1}{\sqrt{I(\hat{\theta})}} , \; \hat{\theta} + Z_{\alpha/2} \frac{1}{\sqrt{I(\hat{\theta})}} \right] \]
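As a concrete case, here is a minimal Python sketch of this interval for a single Binomial$(n,p)$ observation, where the MLE is $\hat{p} = x/n$ and the Fisher information is $I(p) = n / (p(1-p))$. The function name `wald_interval` and the counts (30 successes in 100 trials) are made up for illustration.

```python
import math
from statistics import NormalDist


def wald_interval(successes, trials, alpha=0.05):
    """Interval p_hat +/- z / sqrt(I(p_hat)) for a binomial proportion."""
    if not 0 < successes < trials:
        # I(p_hat) is undefined when p_hat is exactly 0 or 1.
        raise ValueError("need 0 < successes < trials")
    p_hat = successes / trials                    # MLE of p
    info = trials / (p_hat * (1.0 - p_hat))       # Fisher information at p_hat
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 normal quantile
    half_width = z / math.sqrt(info)
    return p_hat - half_width, p_hat + half_width


# Illustrative data only: 30 successes out of 100 trials.
lower, upper = wald_interval(successes=30, trials=100)
print(f"95% CI for p: ({lower:.3f}, {upper:.3f})")
```

The interval is symmetric around $\hat{p}$ by construction, which is one reason it can behave poorly when $\hat{p}$ is close to 0 or 1.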
The confidence intervals calculated above assume that your data are complete, with no missing values. When there is missing information, the Expectation-Maximization (EM) algorithm is commonly used to estimate the parameters. There are several good guides out there including one of my favorites (here). A question I recently came across was: how do we calculate confidence intervals for MLEs of incomplete data obtained from the EM algorithm? How do we compute the observed information matrix of the incomplete data? Louis (1982), Meilijson (1989), Lange (1995) and Oakes (1999) were a few of the references I found on the topic.

First, we must introduce a little notation. Let $y$ be the observed data, $z$ be the missing data and $x = (y,z)$ be the complete data. Define $L(\theta | y)$ as the observed (or incomplete) likelihood and $L_0(\theta|x)$ as the complete likelihood with the full data. In the EM algorithm, we want to maximize $L(\theta | y)$ in $\theta$, but we do this using a conditional expectation
\[ Q(\theta | \theta^{(t)}) = E_{X|Y,\theta^{(t)}} [\log L_0(\theta | X) ] \]
where $\theta^{(t)}$ is the parameter estimate for $\theta$ at the $t$th iteration. The EM algorithm moves in iterations between two steps:
1) Expectation Step: take the expectation of the complete-data log likelihood $\log L_0(\theta | X)$ conditional on the observed data $y$ and the current parameter estimates $\theta^{(t)}$, which gives $Q(\theta | \theta^{(t)})$.
2) Maximization Step: find the new $\theta^{(t+1)}$ that maximizes $Q(\theta|\theta^{(t)})$.
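The two steps can be sketched in Python on a small one-parameter problem. This is an illustration rather than anything from the references above: it assumes the multinomial counts often used to introduce EM (e.g. in Dempster, Laird and Rubin, 1977), with cell probabilities $(1/2 + \theta/4, (1-\theta)/4, (1-\theta)/4, \theta/4)$. The missing data $z$ is the part of the first count that belongs to the $\theta/4$ piece, and the name `em_linkage` is hypothetical.

```python
# Illustrative data: counts for cells with probabilities
# (1/2 + theta/4, (1 - theta)/4, (1 - theta)/4, theta/4).
COUNTS = (125, 18, 20, 34)


def em_linkage(counts, theta=0.5, tol=1e-12, max_iter=1000):
    """Iterate E- and M-steps until theta stops changing."""
    y1, y2, y3, y4 = counts
    for _ in range(max_iter):
        # E-step: expected number of the y1 observations that came from
        # the theta/4 piece, given y and the current theta.
        z = y1 * theta / (2.0 + theta)
        # M-step: Q is a binomial-type log likelihood in theta, so its
        # maximizer is a simple proportion.
        new_theta = (z + y4) / (z + y2 + y3 + y4)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


theta_hat = em_linkage(COUNTS)
print(f"EM estimate of theta: {theta_hat:.6f}")
```

Both steps have closed forms here, which is what makes this example convenient. In most problems at least one of them needs numerical work.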

Louis (1982) introduces notation for the gradient and the negative of the second derivative of the complete-data log likelihood,
\[ S(X,\theta) = \frac{\partial \log L_0(\theta | X) }{ \partial \theta } \hspace{5mm} \text{ and } \hspace{5mm} B(X,\theta) = - \frac{\partial^2 \log L_0(\theta | X) }{ \partial \theta^2 } \]
and the gradient of the observed log likelihood
\[ S^*(Y,\theta) = \frac{\partial \log L(\theta | Y) }{ \partial \theta } \]
where $S^*(y,\theta) = E_{X|Y,\theta}[S(X,\theta)]$ and $S^*(y,\hat{\theta}) = 0$. Then, the observed information matrix of the incomplete data can be obtained using
\[ I_Y(\theta) = E_{X|Y,\theta}[B(X,\theta)] - E_{X|Y,\theta}[S(X,\theta)S^{T}(X,\theta)] + S^*(y,\theta)S^{*T}(y,\theta) \]
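or, another way to think about it, the observed information is the complete-data information minus the information lost to the missing data:
\[ I_Y = I(\hat{\theta}) = I_X(\theta) - I_{X | Y} \]
Louis notes that Efron and Hinkley (1978) define $I_Y$ as the observed information and say it is "a more appropriate measure of information than the a priori expectation $E_{\theta}[B^*(Y,\theta)]$", where $B^*(Y,\theta)$ is the negative second derivative of the observed log likelihood.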
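To see the three terms of Louis's formula at work, here is a hedged Python sketch on a one-parameter example of my own choosing (it does not come from the references): multinomial counts with cell probabilities $(1/2 + \theta/4, (1-\theta)/4, (1-\theta)/4, \theta/4)$, where the missing $z$ is the part of the first count belonging to the $\theta/4$ piece, so that $z | y$ is binomial. The function names are hypothetical. The result is compared with the curvature of the observed log likelihood, which is available in closed form for this example.

```python
import math

# Illustrative data for cells (1/2 + theta/4, (1-theta)/4, (1-theta)/4, theta/4).
COUNTS = (125, 18, 20, 34)


def mle(counts):
    """Closed-form root of the observed-data score equation."""
    y1, y2, y3, y4 = counts
    n = y1 + y2 + y3 + y4
    b = y1 - 2 * y2 - 2 * y3 - y4
    return (b + math.sqrt(b * b + 8 * n * y4)) / (2 * n)


def louis_information(counts, theta):
    """E[B | y] - E[S S^T | y] + S* S*^T for the scalar parameter theta."""
    y1, y2, y3, y4 = counts
    pi = theta / (2.0 + theta)        # chance a cell-1 count is in the theta/4 piece
    ez = y1 * pi                      # E[z | y, theta]
    var_z = y1 * pi * (1.0 - pi)      # Var[z | y, theta], since z | y is binomial
    expected_b = (ez + y4) / theta**2 + (y2 + y3) / (1.0 - theta)**2
    s_star = (ez + y4) / theta - (y2 + y3) / (1.0 - theta)   # E[S | y]
    expected_ss = var_z / theta**2 + s_star**2               # Var(S) + (E S)^2
    return expected_b - expected_ss + s_star**2


def direct_information(counts, theta):
    """Negative second derivative of the observed log likelihood."""
    y1, y2, y3, y4 = counts
    return y1 / (2.0 + theta)**2 + (y2 + y3) / (1.0 - theta)**2 + y4 / theta**2


theta_hat = mle(COUNTS)
print(f"Louis:  {louis_information(COUNTS, theta_hat):.3f}")
print(f"Direct: {direct_information(COUNTS, theta_hat):.3f}")
```

The two numbers agree, which is the point of the formula: only complete-data quantities and conditional moments of the missing data are needed.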

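Oakes (1999) shows that the observed information can be computed directly from the function $Q(\theta | \theta^{(t)})$ that the EM algorithm already uses, without working with the observed likelihood $L(\theta | y)$ itself. The second derivative of $Q$ in its first argument is not sufficient on its own: it gives the conditional expectation of the complete-data information, $E_{X|Y,\theta}[B(X,\theta)]$, which overstates what the observed data tell us. Oakes adds a mixed-derivative term that accounts for the missing information,
\[ I(\hat{\theta}) = - \left\{ \frac{\partial^2 Q(\theta | \theta^{(t)})}{\partial \theta^2 } + \frac{\partial^2 Q(\theta | \theta^{(t)})}{\partial \theta \, \partial \theta^{(t)} } \right\} \Big|_{\theta = \theta^{(t)} = \hat{\theta} } \]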
To calculate a $100(1-\alpha)\%$ confidence interval for $\theta$, we then use the same formula as above
\[ \left[ \hat{\theta} - Z_{\alpha/2} \frac{1}{\sqrt{I(\hat{\theta})}} , \; \hat{\theta} + Z_{\alpha/2} \frac{1}{\sqrt{I(\hat{\theta})}} \right] \]
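Here is a hedged Python sketch of the $Q$-based calculation on an example of my own choosing (not from Oakes's paper): multinomial counts with cell probabilities $(1/2 + \theta/4, (1-\theta)/4, (1-\theta)/4, \theta/4)$, for which $Q(\theta | \theta^{(t)}) = (E[z | y, \theta^{(t)}] + y_4) \log \theta + (y_2 + y_3) \log(1-\theta)$ and $E[z | y, \theta^{(t)}] = y_1 \theta^{(t)} / (2 + \theta^{(t)})$. The function names are hypothetical. The sketch returns both the curvature of $Q$ in $\theta$ alone and the value after adding the mixed-derivative term from Oakes's identity, because for this example the two differ noticeably.

```python
import math
from statistics import NormalDist

# Illustrative data for cells (1/2 + theta/4, (1-theta)/4, (1-theta)/4, theta/4).
COUNTS = (125, 18, 20, 34)


def mle(counts):
    """Closed-form root of the observed-data score equation."""
    y1, y2, y3, y4 = counts
    n = y1 + y2 + y3 + y4
    b = y1 - 2 * y2 - 2 * y3 - y4
    return (b + math.sqrt(b * b + 8 * n * y4)) / (2 * n)


def oakes_information(counts, theta):
    """Return (-d2Q/dtheta2, -(d2Q/dtheta2 + mixed term)) at theta = theta_t."""
    y1, y2, y3, y4 = counts
    ez = y1 * theta / (2.0 + theta)             # E[z | y, theta_t] at theta_t = theta
    d2q = -(ez + y4) / theta**2 - (y2 + y3) / (1.0 - theta)**2
    dez = 2.0 * y1 / (2.0 + theta)**2           # derivative of E[z | y, theta_t] in theta_t
    mixed = dez / theta                         # d2Q / (dtheta dtheta_t)
    return -d2q, -(d2q + mixed)


def confidence_interval(theta_hat, info, alpha=0.05):
    """theta_hat +/- z / sqrt(info)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z / math.sqrt(info)
    return theta_hat - half_width, theta_hat + half_width


theta_hat = mle(COUNTS)
q_curvature, observed_info = oakes_information(COUNTS, theta_hat)
print(f"-d2Q/dtheta2 alone:   {q_curvature:.1f}")
print(f"with the mixed term:  {observed_info:.1f}")
print("95% CI:", confidence_interval(theta_hat, observed_info))
```

Using the curvature of $Q$ alone would give an interval that is too narrow, since it treats the imputed $z$ as if it had been observed.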

My plan is to make another post on this topic soon applying this to binomial mixtures. I would like to see the differences in confidence intervals using the information derived from Louis (1982) compared to Oakes (1999).

Thursday, March 15, 2012

Nearly 800,000 lung cancer deaths averted by decline in smoking

The results from a major study on lung cancer conducted by six institutions (including Rice) were released yesterday in the Journal of the National Cancer Institute. The goal of the study was to estimate how many deaths have been prevented since the release of the US Surgeon General's report on Smoking and Health in 1964. This is the first time the NCI is publishing a paper of this magnitude on estimates of lung cancer deaths using all model-based approaches. It was estimated that nearly 800,000 lung cancer deaths were averted from 1975 to 2000 by the decline in smoking. What's most impressive is that if smoking had been completely eliminated, an estimated 2.5 million deaths would have been prevented.

Wednesday, March 14, 2012

Pi Day Celebration at Rice

Today the three math-oriented departments at Rice (MATH, STAT and CAAM) put together an event to celebrate Pi Day! We had pies from House of Pies, pizzas (i.e. pies), a Pi Recitation contest, and also a fundraiser for the National Math and Science Initiative. The way it worked was that the department that donated the most money would get to pie a professor from that department in the face. Hadley Wickham from STAT, Steve Cox from CAAM, and Andy Putman from MATH all graciously agreed to take one for the team. Due to some logistical issues at the last minute, it was decided that all three would pie each other.

Here are a few pictures of the event and below is a movie. In the first picture, Darren Ong (graduate student from MATH who organized most of the event) is cheering them on saying "Pie them all!"

For those who want to see the play by play:

Thank you to everyone who helped put this event all together! Also thank you to the three departments, GSA, SIAM, and AWM for their financial support.

Sunday, March 11, 2012

Julia Child's Boeuf Bourguignon

Two words: simply divine!

On Saturday I decided I wanted to attempt the famous boeuf bourguignon recipe by Julia Child. I'll be honest and say this recipe does not take a short amount of time, but boy is it worth every bit of effort. Plus, I love any excuse to use my big blue Le Creuset pot. The recipe starts by rendering out the fat from bacon. Take the bacon out with a slotted spoon and set aside.

In the bacon fat, brown up some stewing meat. I bought two pounds and did this in batches so the meat would get evenly browned on all sides. Make sure you pat the meat dry with a paper towel. This helps the browning process. After the meat is all browned, add 2 tablespoons of flour to thicken everything up.

Other ingredients include 2 cups of a red wine, 3 cups of beef broth, a chopped onion, two chopped carrots, mushrooms, garlic, tomato paste, thyme, salt and pepper.

Put everything in the pot and bring it to a boil. At this point, put the top on the pot and place in an oven at 325 degrees for 2 1/2 - 3 hours. Every hour or so, I took out the pot and gave it a stir just to make sure it was bubbling away properly.

Hopefully you will find this recipe as delicious as I did!

Boeuf Bourguignon ingredients:

5-8 bacon slices
2 pounds of lean stewing beef, cut into cubes
1 chopped onion
1 chopped carrot
8 oz of mushrooms, quartered
2 tablespoons of flour
2 cups of red wine
3 cups of beef broth
1 tablespoon of tomato paste
2 cloves of garlic, chopped
1 teaspoon of thyme
18-20 small white onions

Thursday, March 8, 2012

Khan Academy

Salman Khan, founder of Khan Academy, is scheduled to be the speaker at Rice University's commencement this May. I hadn't heard of Khan Academy before, but it's basically like MIT OpenCourseWare but better! It's a not-for-profit website that is trying to bring a free education to anyone anywhere. The lessons are not full-length classes, but short clips on individual topics, including SAT test preparation. After you watch a video, you can test your knowledge using the practice exercises.

My dad keeps saying that education is going to go through a big change in the near future because of all the online education opportunities. With websites like this, it's hard not to agree. The only thing this doesn't give you is a formal degree. Unfortunately, we live in a society in which you often need the actual degree to get a job even if you have taught yourself the material. Interestingly, I think this will push universities (e.g. Rice) to incorporate more online courses so anyone anywhere can sign up to take the classes as long as they pay the tuition. The question I still have is: will websites like this eventually drive the cost of education down because there is so much freely available information out there?

Wednesday, March 7, 2012

Getting LaTeX in Blogger or Blogspot

As someone who uses $\LaTeX$ quite a bit, I was excited to discover how to integrate it into my blog posts on blogger.com. Since taking .pngs and putting them into the blog as images one by one is not ideal, I sought out scripts to install on blogger.com. This post led me to the link here that got it up and running. There are a lot of links out there that say 'just install this MathJax script' and it will work, but the problem is most of those links are no longer valid. The one I just posted worked for me. Below I've listed one of my favorite theorems!

Central Limit Theorem

Let $X_{1}$, $X_{2}$, $\ldots$ be a sequence of iid random variables whose mgfs exist in a neighborhood of 0 (that is, $M_{X_{i}}(t)$ exists for $|t| < h$, for some $h \in \mathbb{R}^{+}$). Let $E[X_{1}] = \mu_{X_{1}}$ and Var$[X_{1}] = \sigma_{X_{1}}^{2} > 0$, which are both finite since the mgf exists. Then
\[ \frac{ \sum_{k=1}^{n} X_{k} - n\mu_{X_{1}} }{ \sigma_{X_{1}}\sqrt{n}} = \frac{ \bar{X} - \mu_{X_{1}} }{ \sigma_{X_{1}}/\sqrt{n} } \longrightarrow Z \sim \mathcal{N}(0,1) \text{ (converges in distribution)} \]
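For anyone who likes to see a theorem in action, here is a small Python simulation sketch. It assumes Exponential(1) draws (mean 1, standard deviation 1, mgf defined for $t < 1$), standardizes the sample mean as in the display above, and checks what share of the standardized means falls within $\pm 1.96$. The sample size, number of replications and seed are arbitrary choices.

```python
import math
import random


def standardized_means(n, reps, seed=2012):
    """Simulate (xbar - mu) / (sigma / sqrt(n)) for Exponential(1) samples."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0   # mean and standard deviation of Exponential(rate=1)
    values = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        values.append((xbar - mu) / (sigma / math.sqrt(n)))
    return values


z_values = standardized_means(n=200, reps=2000)
coverage = sum(abs(z) <= 1.96 for z in z_values) / len(z_values)
print(f"share within +/-1.96: {coverage:.3f}")
```

If the normal approximation is doing its job, the printed share should land near 0.95 even though the underlying draws are heavily skewed.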

Monday, March 5, 2012

Visit to Fredericksburg and Luckenbach, TX

This weekend I went to visit Fredericksburg, TX with a good friend of mine for her birthday! We had that 'Girls Getaway' kind of weekend and stayed in the most adorable bed & breakfast / cottage.

During our stay, we visited some lovely shops including a yarn shop, fudge shop and a pretty authentic German restaurant with some delicious schnitzel! Below is schnitzel covered in a mushroom sauce.

Here is some heavenly fried chicken covered in a honey pecan sauce

We visited some wineries:

and also made a very brief stop in Luckenbach, TX (definitely worth a quick google if you aren't familiar with it). Yes, the one with Willie, Waylon and the boys. :)

Overall, it was a FANTASTIC trip! I would highly recommend it to anyone looking for a relaxing weekend in the Texas hill country. Hope you had an awesome birthday in your awesome birthday tiara!