
Tuesday, March 20, 2012

EM Algorithm: Confidence Intervals

Say you have sample data X_1, X_2, \ldots, X_n that are iid and follow some distribution f(X|\theta) (e.g., Binomial(n,p)).  Finding the maximum likelihood estimate (MLE) and a confidence interval for \theta is fairly straightforward.  To find the MLE, compute the gradient of the log likelihood, set it equal to 0, and solve for \theta.  Let \hat{\theta} be the MLE of \theta and let the Fisher Information matrix of the sample be given by
I(\theta) = E_{\theta}((\frac{\partial}{\partial \theta} \log f(\mathbf{X} | \theta) )^2 ) = - E_{\theta}(\frac{\partial^2}{\partial \theta^2} \log f(\mathbf{X} | \theta) ) 

An important property of MLEs is that their distribution is asymptotically normal with mean \theta and variance approximated by the inverse of the Fisher Information matrix, Var(\hat{\theta}) \approx I(\theta)^{-1}.

To calculate a 100(1-\alpha)\% confidence interval for \theta, compute the Fisher Information from the sample, evaluate it at \hat{\theta}, and use
[\hat{\theta} - Z_{\alpha/2} (\frac{1}{\sqrt{I(\hat{\theta})}}) , \hat{\theta} + Z_{\alpha/2} (\frac{1}{\sqrt{I(\hat{\theta})}}) ]
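As a quick illustration of the complete-data case, here is a minimal Python sketch for a Bernoulli proportion, where the Fisher information of the sample is n/(p(1-p)); the data, sample size, and seed are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical complete-data example: X_1, ..., X_n iid Bernoulli(p).
rng = np.random.default_rng(0)
n, p_true = 200, 0.3
x = rng.binomial(1, p_true, size=n)

# MLE: setting the score (gradient of the log likelihood) to zero gives the sample mean.
p_hat = x.mean()

# Fisher information of the whole sample for Bernoulli(p): I(p) = n / (p (1 - p)).
info = n / (p_hat * (1.0 - p_hat))

# 100(1 - alpha)% Wald interval: p_hat +/- z_{alpha/2} / sqrt(I(p_hat)).
alpha = 0.05
z = stats.norm.ppf(1.0 - alpha / 2.0)
ci = (p_hat - z / np.sqrt(info), p_hat + z / np.sqrt(info))
print(p_hat, ci)
```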
The confidence interval above applies when the data are complete.  When there is missing information, the Expectation-Maximization (EM) Algorithm is commonly used to estimate the parameters.  There are several good guides out there, including one of my favorites (here).  A question I recently came across was: how do we calculate confidence intervals for MLEs of incomplete data coming out of the EM algorithm?  How do we compute the observed information matrix of the incomplete data?  Louis (1982), Meilijson (1989), Lange (1995), and Oakes (1999) were a few of the references I found on the topic.

First, we must introduce a little notation.  Let y be the observed data, z be the missing data and x = (y,z) be the complete data.  Define L(\theta | y) as the observed (or incomplete) likelihood and L_0(\theta|x) as the complete likelihood with the full data.  In the EM algorithm, we want to maximize L(\theta | y) in \theta, but we do this using a conditional expectation
Q(\theta | \theta^{(t)}) = E_{X|Y,\theta^{(t)}} [\log L_0(\theta | X) ]
where \theta^{(t)} is the parameter estimate of \theta at the tth iteration.  The EM algorithm alternates between two steps:
1) Expectation Step: compute Q(\theta|\theta^{(t)}), the expectation of the complete-data log likelihood conditional on the observed data y and the current parameter estimate \theta^{(t)}.
2) Maximization Step: find the new \theta^{(t+1)} that maximizes Q(\theta|\theta^{(t)}).  A short numerical sketch of both steps follows below.
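Here is a minimal numerical sketch of the two steps for a hypothetical two-component binomial mixture (the kind of model I want to look at in the follow-up post); the simulated data, starting values, and iteration count are all arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: y_i = number of successes out of m trials, drawn from a
# two-component binomial mixture with weight pi and success probabilities p1, p2.
# The component labels z_i are the missing data.
rng = np.random.default_rng(1)
m, n = 20, 500
z_true = rng.binomial(1, 0.4, size=n)                  # latent labels (unobserved)
y = rng.binomial(m, np.where(z_true == 1, 0.7, 0.2))   # observed counts

pi, p1, p2 = 0.5, 0.6, 0.3      # starting values theta^(0), chosen arbitrarily
for _ in range(200):
    # E-step: posterior probability w_i that observation i came from component 1,
    # conditional on y and the current parameter estimates theta^(t).
    f1 = stats.binom.pmf(y, m, p1)
    f2 = stats.binom.pmf(y, m, p2)
    w = pi * f1 / (pi * f1 + (1 - pi) * f2)

    # M-step: maximize Q(theta | theta^(t)); here these are weighted binomial MLEs.
    pi = w.mean()
    p1 = np.sum(w * y) / (m * np.sum(w))
    p2 = np.sum((1 - w) * y) / (m * np.sum(1 - w))

print(pi, p1, p2)   # parameter estimates after convergence
```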

Louis (1982) defines notation for the gradient and the negative of the second derivative of the complete-data log likelihood,
S(X,\theta) = \frac{\partial \log L_0(\theta | X) }{ \partial \theta } \hspace{5mm} \text{ and } \hspace{5mm} B(X,\theta) = - \frac{\partial^2 \log L_0(\theta | X) }{ \partial \theta^2 }
and the gradient of the observed (incomplete-data) log likelihood
S^*(Y,\theta) = \frac{\partial \log L(\theta | Y) }{ \partial \theta }
where S^*(y,\theta) = E_{X|Y,\theta}[S(X,\theta)] and S^*(y,\hat{\theta}) = 0. Then, the observed information matrix of the incomplete data can be obtained using
I_Y(\theta) = E_{X|Y,\theta}[B(X,\theta)] - E_{X|Y,\theta}[S(X,\theta)S^{T}(X,\theta)] + S^*(y,\theta)S^{*T}(y,\theta)
or, put another way,
I_Y(\theta) = I_X(\theta) - I_{X|Y}(\theta)
i.e., the observed information is the complete information minus the missing information.  Louis notes that Efron and Hinkley (1978) define I_Y as the observed information and say it is "a more appropriate measure of information than the a priori expectation E_{\theta}[B^*(Y,\theta)]".
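To make the Louis formula concrete, here is a sketch for a simplified, hypothetical version of the binomial mixture in which only the mixing weight \pi is unknown and the component probabilities are treated as known; in that case E_{X|Y,\theta}[B] and E_{X|Y,\theta}[SS^T] have closed forms, and the S^* terms cancel at the EM fixed point.

```python
import numpy as np
from scipy import stats

# Hypothetical one-parameter example: a two-component binomial mixture with KNOWN
# success probabilities p1, p2 and unknown mixing weight pi.  The missing data z_i
# are the component labels, so the complete-data log likelihood in pi is
#   sum_i [ z_i log(pi) + (1 - z_i) log(1 - pi) ] + terms not involving pi.
rng = np.random.default_rng(2)
m, n, p1, p2 = 20, 500, 0.7, 0.2
z_true = rng.binomial(1, 0.4, size=n)
y = rng.binomial(m, np.where(z_true == 1, p1, p2))

# EM for pi alone: E-step weights w_i, M-step pi = mean of the weights.
pi = 0.5
for _ in range(200):
    f1, f2 = stats.binom.pmf(y, m, p1), stats.binom.pmf(y, m, p2)
    w = pi * f1 / (pi * f1 + (1 - pi) * f2)
    pi = w.mean()

# Louis (1982): I_Y = E[B|Y] - E[S S^T|Y] + S* S*^T, evaluated at the EM estimate.
# Conditional on y, the z_i are independent Bernoulli(w_i), so
#   E[B|Y]        = sum_i ( w_i / pi^2 + (1 - w_i) / (1 - pi)^2 )
#   E[S^2|Y]      = sum_i Var(S_i | y_i) + S*^2, with Var(S_i|y_i) = w_i (1 - w_i) / (pi^2 (1 - pi)^2)
#   S*(y, pi_hat) = 0 at the EM fixed point, so the S*^2 terms cancel.
EB = np.sum(w / pi**2 + (1 - w) / (1 - pi)**2)
var_S = np.sum(w * (1 - w)) / (pi**2 * (1 - pi)**2)
I_louis = EB - var_S

# Wald-type 95% confidence interval for pi from the observed information.
z = stats.norm.ppf(0.975)
print(pi, I_louis, (pi - z / np.sqrt(I_louis), pi + z / np.sqrt(I_louis)))
```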

Oakes (1999) shows the observed information matrix of the incomplete data can also be obtained directly from the function Q used in the EM iterations.  Writing Q(\theta' | \theta) with \theta' as the argument being maximized and \theta as the conditioning value, his identity is
I_Y(\theta) = - ( \frac{\partial^2 Q(\theta' | \theta)}{\partial \theta'^2} + \frac{\partial^2 Q(\theta' | \theta)}{\partial \theta' \partial \theta} ) |_{\theta' = \theta = \hat{\theta} }
so only derivatives of Q are needed; the first term is the conditional expectation of the complete-data information, and the mixed-derivative term supplies the missing-information correction.
To calculate a 100(1-\alpha)\% confidence interval for \theta, we then use the same formula as above,
[\hat{\theta} - Z_{\alpha/2} (\frac{1}{\sqrt{I_Y(\hat{\theta})}}) , \hat{\theta} + Z_{\alpha/2} (\frac{1}{\sqrt{I_Y(\hat{\theta})}}) ]
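The identity can also be applied numerically.  Below is a sketch for the same hypothetical one-parameter example using finite differences of Q, so that only the Q function itself has to be coded; the helper names and step size h are my own choices for illustration.

```python
import numpy as np
from scipy import stats

# Numerical sketch of the Oakes (1999) identity for the hypothetical
# one-parameter mixture-weight example: only the Q function is needed.
def Q(pi_new, pi_old, y, m=20, p1=0.7, p2=0.2):
    """Q(pi_new | pi_old) = E_{Z|Y, pi_old}[ complete-data log likelihood at pi_new ]."""
    f1, f2 = stats.binom.pmf(y, m, p1), stats.binom.pmf(y, m, p2)
    w = pi_old * f1 / (pi_old * f1 + (1 - pi_old) * f2)        # E-step weights
    return np.sum(w * (np.log(pi_new) + stats.binom.logpmf(y, m, p1))
                  + (1 - w) * (np.log(1 - pi_new) + stats.binom.logpmf(y, m, p2)))

def oakes_information(pi_hat, y, h=1e-4):
    """-(d2Q/dpi_new^2 + d2Q/dpi_new dpi_old) at pi_new = pi_old = pi_hat, by finite differences."""
    d2_new = (Q(pi_hat + h, pi_hat, y) - 2 * Q(pi_hat, pi_hat, y) + Q(pi_hat - h, pi_hat, y)) / h**2
    d2_mix = (Q(pi_hat + h, pi_hat + h, y) - Q(pi_hat + h, pi_hat - h, y)
              - Q(pi_hat - h, pi_hat + h, y) + Q(pi_hat - h, pi_hat - h, y)) / (4 * h**2)
    return -(d2_new + d2_mix)

# Usage, with y and the EM estimate pi from the Louis sketch above:
#   I_hat = oakes_information(pi, y)
#   z = stats.norm.ppf(0.975)
#   ci = (pi - z / np.sqrt(I_hat), pi + z / np.sqrt(I_hat))
```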

My plan is to make another post on this topic soon applying this to binomial mixtures.  I would like to see the differences in confidence intervals using the information derived from Louis (1982) compared to Oakes (1999).


4 comments:

  1. Could you show how the Louis principle is applied in the Binomial case, if it's possible.

  2. Hi, does this hold for multivariate EM as well? Could you please cite a paper/book which describes these things, so that I can include them in work and cite the paper/book as reference?

    1. Hi Sayantee, yes it should hold for the multivariate case. I believe the references I have listed in my post would cover that case. Cheers!

    2. Thank you for the prompt response.

