Introduction

In the last post, we went through the intuition behind maximum likelihood estimation in the continuous case. In this post, we'll cover the mathematics behind it.

Mathematical derivation of the MLE for the normal distribution

If we have n independent draws from the normal distribution, the joint probability density of obtaining some data, x, given specific parameter values for the mean, \( \mu \), and standard deviation, \( \sigma \), is:

\[ p(x_1, x_2, \ldots, x_n \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x_1-\mu}{\sigma}\right)^2} \cdot \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x_2-\mu}{\sigma}\right)^2} \cdots \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x_n-\mu}{\sigma}\right)^2} \]

\[ p(x \mid \mu, \sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2} \]

And as before, this is the likelihood function that we want to maximize. To recap, we want to find the values of \( \mu \) and \( \sigma \) at which the likelihood function reaches its maximum.

\[ \mathcal{L}(\mu, \sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2} \]

Note that the normalizing term \( \frac{1}{\sigma\sqrt{2\pi}} = (\sigma^2 2\pi)^{-1/2} \) is the same in every factor of the product, so it appears n times:

\[ \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} = (\sigma^2 2\pi)^{-1/2} \cdots (\sigma^2 2\pi)^{-1/2} = (\sigma^2 2\pi)^{-1/2 - 1/2 - \ldots - 1/2} = (\sigma^2 2\pi)^{-n/2} \]

Thus, we can pull it out of the product operator:

\[ \mathcal{L}(\mu, \sigma) = (\sigma^2 2\pi)^{-n/2}\prod_{i=1}^n e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2} \]

The product of the exponential functions, on the other hand, is equal to:

\[ \prod_{i=1}^n e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2} = e^{-\frac{1}{2}\left(\frac{x_1-\mu}{\sigma}\right)^2} \cdot e^{-\frac{1}{2}\left(\frac{x_2-\mu}{\sigma}\right)^2} \cdots e^{-\frac{1}{2}\left(\frac{x_n-\mu}{\sigma}\right)^2} \]

Since we multiply n powers of the same base e, we can simply sum the exponents:

\[ = e^{-\frac{1}{2}\left(\frac{x_1-\mu}{\sigma}\right)^2 - \frac{1}{2}\left(\frac{x_2-\mu}{\sigma}\right)^2 - \ldots - \frac{1}{2}\left(\frac{x_n-\mu}{\sigma}\right)^2} \]

Let's make the exponent prettier first. The exponent is:

\[ -\frac{1}{2}\left(\frac{x_1-\mu}{\sigma}\right)^2 - \frac{1}{2}\left(\frac{x_2-\mu}{\sigma}\right)^2 - \ldots - \frac{1}{2}\left(\frac{x_n-\mu}{\sigma}\right)^2 \]

Factor out \( -\frac{1}{2} \):

\[ = -\frac{1}{2}\left(\left(\frac{x_1-\mu}{\sigma}\right)^2 + \left(\frac{x_2-\mu}{\sigma}\right)^2 + \ldots + \left(\frac{x_n-\mu}{\sigma}\right)^2\right) \]

Factor out \( \frac{1}{\sigma^2} \) as well:

\[ = -\frac{1}{2\sigma^2}\left((x_1-\mu)^2 + (x_2-\mu)^2 + \ldots + (x_n-\mu)^2\right) \]

Finally, we obtain a much nicer exponent:

\[ = -\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 \]

Plugging the exponent back in, we obtain our likelihood function:

\[ \mathcal{L}(\mu, \sigma) = (\sigma^2 2\pi)^{-n/2}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2} \]

We could continue with the above form of the likelihood function, but let's continue as before: take natural logarithms and obtain the log-likelihood function, \( l \). If you don't remember the argument for this, check the previous post.
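As a quick numerical aside, it's easy to see why the logarithm helps in practice: the raw product of densities underflows to zero for even moderately large n, while the sum of log-densities stays perfectly manageable. Here is a minimal sketch in Python (NumPy assumed; the sample size and parameter values are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # n = 1000 draws from N(5, 2^2)
mu, sigma = 5.0, 2.0

# Density of each observation under N(mu, sigma^2)
densities = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(np.prod(densities))         # 0.0 -- the raw likelihood underflows
print(np.sum(np.log(densities)))  # about -2100, a perfectly usable log-likelihood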
\[ \ln(\mathcal{L}(\mu, \sigma)) = l(\mu, \sigma) = \ln\left((\sigma^2 2\pi)^{-n/2}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2}\right) \]

Let's make it a bit nicer with the logarithm rules. First the product rule:

\[ = \ln\left((\sigma^2 2\pi)^{-n/2}\right) + \ln\left(e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2}\right) \]

And the product rule again:

\[ = \ln\left((\sigma^2)^{-n/2}\right) + \ln\left((2\pi)^{-n/2}\right) + \ln\left(e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2}\right) \]

The power rule:

\[ = -\frac{n}{2}\ln(\sigma^2) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 \ln(e) \]

The natural logarithm of its own base is one, so we obtain a very pretty log-likelihood function:

\[ = -\frac{n}{2}\ln(\sigma^2) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 \]

Let's begin maximizing the function. As before, we do it with differential calculus. Differentiate with respect to \( \mu \); terms that do not contain \( \mu \) are treated as constants and vanish. Note that we first expand the square, \( (x_i-\mu)^2 = x_i^2 - 2x_i\mu + \mu^2 \):

\[ \frac{\partial}{\partial\mu} l(\mu, \sigma) = \frac{\partial}{\partial\mu}\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n \left(x_i^2 - 2x_i\mu + \mu^2\right)\right) \]

\[ = -\frac{1}{2\sigma^2}\,\frac{\partial}{\partial\mu}\sum_{i=1}^n \left(x_i^2 - 2x_i\mu + \mu^2\right) = -\frac{1}{2\sigma^2}\sum_{i=1}^n \left(-2x_i + 2\mu\right) = \frac{2}{2\sigma^2}\sum_{i=1}^n (x_i - \mu) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) \]

Next, differentiate with respect to \( \sigma^2 \):

\[ \frac{\partial}{\partial\sigma^2} l(\mu, \sigma) = \frac{\partial}{\partial\sigma^2}\left(-\frac{n}{2}\ln(\sigma^2) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right) \]

And as before, terms without \( \sigma^2 \) are treated as constants:

\[ = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2}\,\frac{1}{(\sigma^2)^2}\sum_{i=1}^n (x_i-\mu)^2 \]

\[ = \frac{1}{2\sigma^2}\left(-n + \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right) \]

The first-order conditions are (i.e., set the partial derivatives to zero to find the \( \mu \) and \( \sigma \) that maximize the log-likelihood):

\[ \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0 \]

and

\[ \frac{1}{2\sigma^2}\left(-n + \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right) = 0 \]

Let's solve the first FOC for \( \mu \). Summing the constant \( \mu \) over n terms gives \( n\mu \), so we can pull it out of the summation:

\[ \frac{1}{\sigma^2}\left(-n\mu + \sum_{i=1}^n x_i\right) = 0 \]

\[ -n\mu + \sum_{i=1}^n x_i = 0 \]

\[ n\mu = \sum_{i=1}^n x_i \]

\[ \hat{\mu} = \frac{\sum_{i=1}^n x_i}{n} \]

BOOM! There you have the maximum likelihood estimator for the mean parameter, which is simply the sample mean - quite intuitive and anticlimactic, isn't it? Let's continue with the estimator for the variance:

\[ \frac{1}{2\sigma^2}\left(-n + \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right) = 0 \]

\[ -n + \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 = 0 \]

\[ \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 = n \]

\[ \sigma^2 = \frac{\sum_{i=1}^n (x_i-\mu)^2}{n} \]

KABOOM! There you have the estimator for the variance, with the estimate \( \hat{\mu} \) plugged in for \( \mu \):

\[ \hat{\sigma}^2 = \frac{\sum_{i=1}^n (x_i-\hat{\mu})^2}{n} \]

As a final sanity check, let's verify both estimators numerically.
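Here is a minimal sketch of that check in Python (NumPy and SciPy assumed; the data-generating parameters are arbitrary choices for illustration). It maximizes the log-likelihood numerically, by minimizing its negative, and compares the result against the closed-form \( \hat{\mu} \) and \( \hat{\sigma}^2 \) derived above:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # sample from N(5, 2^2)
n = len(x)

def neg_log_likelihood(params):
    mu, sigma2 = params
    # l(mu, sigma) = -n/2 ln(sigma^2) - n/2 ln(2 pi) - sum (x_i - mu)^2 / (2 sigma^2)
    return (0.5 * n * np.log(sigma2) + 0.5 * n * np.log(2 * np.pi)
            + np.sum((x - mu) ** 2) / (2 * sigma2))

# Numerical maximization from an arbitrary starting guess;
# the bound keeps sigma^2 strictly positive
result = minimize(neg_log_likelihood, x0=[0.0, 1.0],
                  bounds=[(None, None), (1e-9, None)])
mu_num, sigma2_num = result.x

# Closed-form estimators from the derivation
mu_hat = np.sum(x) / n
sigma2_hat = np.sum((x - mu_hat) ** 2) / n

print(mu_num, mu_hat)          # the two should agree to several decimals
print(sigma2_num, sigma2_hat)  # likewise

The numerical optimum agrees with the closed-form estimators up to optimizer tolerance, which confirms the algebra above. Note also that \( \hat{\sigma}^2 \) divides by n rather than n - 1, so the maximum likelihood estimator of the variance is biased in small samples.

In the next post, we'll cover maximum likelihood estimation in a setting where we let the mean parameter vary.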

You can reach me through LinkedIn.