Tuesday, March 20, 2012

EM Algorithm: Confidence Intervals

Say you have some sample data $X_1, X_2, \ldots, X_n$ that are iid and follow some distribution $f(X|\theta)$ (e.g. Binomial$(n,p)$).  Finding the maximum likelihood estimate (MLE) and confidence interval of $\theta$ is fairly straightforward.  To find the MLEs, just compute the gradient of the log likelihood, set equal to 0 and solve for $\theta$.  Let $\hat{\theta}$ be the MLE of $\theta$ and let the Fisher Information matrix of the sample be given by
\[ I(\theta) = E_{\theta}((\frac{\partial}{\partial \theta} \log f(\mathbf{X} | \theta) )^2 ) = - E_{\theta}(\frac{\partial^2}{\partial \theta^2} \log f(\mathbf{X} | \theta) ) \]

An important property of MLEs is the distribution of the estimators is asymptotically normal with mean $\theta$ and the Var($\theta$) being approximated by the inverse of the Fisher Information matrix $I(\theta)$.

To calculate a $100(1-\alpha)\%$ confidence interval for $\theta$, compute the Fisher Information matrix from the sample and
\[ [\hat{\theta} - Z_{\alpha/2} (\frac{1}{\sqrt{I(\theta)}}) , \hat{\theta} + Z_{\alpha/2} (\frac{1}{\sqrt{I(\theta)}}) ] \]
The confidence intervals calculated above are when your data is complete and does not contain any missing data. When there is missing information, the Expectation-Maximization Algorithm is commonly used to estimate the parameters.  There are several good guides out there including one of my favorites (here).  A question I recently came across was, how do we calculate the confidence intervals for MLEs of incomplete data out of the EM algorithm? How do we compute the observed information matrix of the incomplete data?  Louis (1982)Meilijson (1989)Lange (1995)Oakes (1999) were a few of the references I found on the topic.

First, we must introduce a little notation.  Let $y$ be the observed data, $z$ be the missing data and $x = (y,z)$ be the complete data.  Define $L(\theta | y)$ as the observed (or incomplete) likelihood and $L_0(\theta|x)$ as the complete likelihood with the full data.  In the EM algorithm, we want to maximize $L(\theta | y)$ in $\theta$, but we do this using a conditional expectation
\[ Q(\theta | \theta^{(t)}) = E_{X|Y,\theta^{(t)}} [\log L_0(\theta | X) ] \]
where $\theta^{(t)}$ is the parameter estimate for $\theta$ at the $t$th iteration.  The EM algorithm moves in iterations between two steps:
1) Expectation Step: take the expectation of the complete data $X$ conditional on the observed data $y$ and the current parameter estimates $\theta^{(t)}$.
2) Maximization Step: find the new $\theta^{(t+1)}$ that maximizes $Q(\theta|\theta^{(t)})$.

Louis (1982) defines the notation of the gradient and the negative of the second derivatives of the complete likelihood,
\[ S(X,\theta) = \frac{\partial \log L_0(\theta | X) }{ \partial \theta } \hspace{5mm} \text{ and } \hspace{5mm} B(X,\theta) = - \frac{\partial^2 \log L_0(\theta | X) }{ \partial \theta^2 } \]
and the gradient of the observed likelihood
\[ S^*(Y,\theta) = \frac{\partial \log L(\theta | Y) }{ \partial \theta } \]
where $S^*(y,\theta) = E_{X|Y,\theta}[S(X,\theta)]$ and $S^*(y,\hat{\theta}) = 0$. Then, the observed information matrix of the incomplete data can be obtained using
\[ I_Y(\theta) = E_{X|Y,\theta}[B(X,\theta)] - E_{X|Y,\theta}[S(X,\theta)S^{T}(X,\theta)] + S^*(y,\theta)S^{*T}(y,\theta) \]
or another way to think about it is
\[ I_Y = I(\hat{\theta}) = I_X(\theta) - I_{X | Y} \]
The authors note Efron and Hinkley (1978) define $I_Y$ as the observed information and say it is "a more appropriate measure of information than the a priori expectation $E_{\theta}[B^*(Y,\theta)]$".

Oakes (1999) shows the function $Q(\theta | \theta^{(t)})$ can be used in the maximization of the observed likelihood $L(\theta | y)$.  Therefore, when calculating the observed information matrix of the incomplete data, it is sufficient to use
\[ I(\theta) = - \frac{\partial^2 Q}{\partial \theta^2 } |_{\theta = \hat{\theta} } \]
To calculate a $100(1-\alpha)\%$ confidence interval for $\theta$, we then use the same formula as above
\[ [\hat{\theta} - Z_{\alpha/2} (\frac{1}{\sqrt{I(\theta)}}) , \hat{\theta} + Z_{\alpha/2} (\frac{1}{\sqrt{I(\theta)}}) ] \]

My plan is to make another post on this topic soon applying this to binomial mixtures.  I would like to see the differences in confidence intervals using the information derived from Louis (1982) compared to Oakes (1999).


  1. May you show how the Louis principle is applied in the Binomial case, if its possible.

  2. Hi, does this hold for multivariate EM as well? Could you please cite a paper/book which describes these things, so that I can include them in work and cite the paper/book as reference?

    1. Hi Sayantee, yes it should hold for the multivariate case. I believe the references I have listed in my post would cover that case. Cheers!

    2. Thank you for the prompt response.