Lecture 26 - Maximum Likelihood Estimation Framework
We have covered how to model many of the non-idealities present in real signals, but in practice we can never model real signals perfectly. For this reason, we need some way to determine what data our imperfect received signal actually corresponds to. The framework we will use to achieve this is maximum likelihood estimation.
1 General Framework
Before delving into examples, we need to define a few terms. First, we have some parameter vector \(\vec{\theta}\). We also have some probabilistic model \(f_{\vec{r} | \vec{\theta}}\) for how that parameter vector affects an output. In our case, the output will be \(\vec{r}\) which is called our measurement or observation vector. Notice that the model is some conditional probability distribution, which means that the output is dependent on the input(s).
The maximum likelihood (ML) estimate \(\hat{\theta}_{ML}(\vec{r})\) is defined as follows:
\[ \hat{\theta}_{ML}(\vec{r}) = \operatorname*{arg max}_{\vec{\theta} \in \Phi} f_{\vec{r} | \vec{\theta}} (\vec{r} | \vec{\theta}) \]
For fixed \(\vec{\theta}\), \(f_{\vec{r} | \vec{\theta}} (\vec{r} | \vec{\theta})\) is a probability distribution. In principle it could be any type of probability distribution. For fixed \(\vec{r}\), \(f_{\vec{r} | \vec{\theta}} (\vec{r} | \vec{\theta})\) is a likelihood function, i.e. the likelihood of that particular observed \(\vec{r}\) as a function of different \(\vec{\theta}\). When we maximize \(f_{\vec{r} | \vec{\theta}}\) over \(\vec{\theta}\), we find the parameter value under which the observed \(\vec{r}\) is most likely.
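To make the definition concrete, here is a minimal numerical sketch of the arg max. The model is assumed purely for illustration: a scalar parameter, i.i.d. unit-variance Gaussian observations, and a discretized candidate set \(\Phi\); we simply pick the candidate with the largest (log-)likelihood.

```python
import numpy as np

# Minimal sketch: ML estimation by brute-force search over a candidate set.
# Assumed model for illustration: r_k ~ N(theta, 1), i.i.d.
rng = np.random.default_rng(0)
theta_true = 1.5
r = rng.normal(theta_true, 1.0, size=50)      # observation vector

candidates = np.linspace(-5.0, 5.0, 1001)     # a discretized candidate set Phi

def log_likelihood(theta, r):
    """ln f(r | theta) for i.i.d. unit-variance Gaussian observations."""
    return -0.5 * np.sum((r - theta) ** 2) - 0.5 * r.size * np.log(2 * np.pi)

theta_ml = candidates[np.argmax([log_likelihood(t, r) for t in candidates])]
print(theta_ml)                               # close to the sample mean of r
```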
2 Jointly Gaussian Likelihood Function
For our purposes, the jointly Gaussian distribution is important. Recall that it is completely specified by two quantities:

- Mean vector \(\vec{\mu} (\vec{\theta})\)
- Covariance matrix \(K (\vec{\theta})\)
When we know the form of the mean vector and the covariance matrix, we can write the likelihood function like so, where \(n\) is the dimension of the vector \(\vec{r}\):
\[ f_{\vec{r} | \vec{\theta}} (\vec{r} | \vec{\theta}) = \frac{1}{(2 \pi)^{n/2} |K (\vec{\theta})|^{1/2}} \exp \left[- \frac{1}{2} (\vec{r} - \vec{\mu}(\vec{\theta}))^T \; K^{-1}(\vec{\theta}) \; (\vec{r} - \vec{\mu}(\vec{\theta}))\right] \]
Because we only care about maximizing this function, not its exact value, we can do some monotonic transformation on it to make the math easier. In this case, taking the logarithm will remove the exponential and make things simpler:
\[ \ln f_{\vec{r} | \vec{\theta}} (\vec{r} | \vec{\theta}) = - \frac{n}{2} \ln (2 \pi) - \frac{1}{2} \ln |K (\vec{\theta})| - \frac{1}{2} (\vec{r} - \vec{\mu}(\vec{\theta}))^T \; K^{-1}(\vec{\theta}) \; (\vec{r} - \vec{\mu}(\vec{\theta})) \]
Very often, the covariance matrix doesn't depend on the parameter vector \(\vec{\theta}\). For instance, in AWGN the noise has a flat power spectrum, so \(K = \sigma^2 I\) regardless of \(\vec{\theta}\); the first two terms are then constants and maximizing the log-likelihood reduces to minimizing the quadratic term.
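As a sketch of how this expression can be evaluated, the following assumes NumPy and a caller-supplied mean vector and covariance matrix (the specific values below are arbitrary); it simply codes the log-likelihood term by term.

```python
import numpy as np

def gaussian_log_likelihood(r, mu, K):
    """ln f(r | theta) for a jointly Gaussian model with mean vector mu(theta)
    and covariance matrix K(theta), following the expression above."""
    n = r.shape[0]
    diff = r - mu
    _, logdet = np.linalg.slogdet(K)            # ln |K(theta)|
    quad = diff @ np.linalg.solve(K, diff)      # (r - mu)^T K^{-1} (r - mu)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

# AWGN-style case: K = sigma^2 I does not depend on theta, so only the
# quadratic term changes as theta varies.
sigma2 = 0.5
n = 4
r = np.array([0.9, 1.1, 1.0, 0.8])
print(gaussian_log_likelihood(r, np.full(n, 1.0), sigma2 * np.eye(n)))
```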
3 Examples
3.1 Unknown Mean, Known Variance
Suppose the \(r_k\) are i.i.d. Gaussian with mean \(\theta\) and variance \(\sigma^2\).
We begin by writing our likelihood equation. Note that in this problem, \(\theta\) is a scalar.
\[ f_{\vec{r} | \theta} (\vec{r} | \theta) = \prod_{k=1}^n f_{r | \theta} (r_k | \theta) \]
Because we know the distribution is Gaussian, we can rewrite the equation.
\[ f_{\vec{r} | \theta} (\vec{r} | \theta) = \prod_{k=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(r_k - \theta)^2/2 \sigma^2} \]
If we now take the logarithm of this expression, we get that
\[ \ln f_{\vec{r} | \theta} (\vec{r} | \theta) = \sum_{k=1}^n - \frac{1}{2}\ln{2 \pi \sigma^2} - \frac{(r_k - \theta)^2}{2 \sigma^2} \]
Now, for any given \(\vec{r}\), we want to find the \(\theta\) that maximizes the function \(f\). In other words, given the data that we observe, what is the parameter value that makes the data most likely?
Recall the maximum likelihood function we defined above:
\[ \begin{aligned} \hat{\theta}_{ML}(\vec{r}) &= \operatorname*{arg max}_{\theta} f_{\vec{r} | \theta} (\vec{r} | \theta) \\ &= \operatorname*{arg max}_{\theta} \ln f_{\vec{r} | \theta} (\vec{r} | \theta) \end{aligned} \]
Notice that in our likelihood equation, the first term does not contain \(\theta\). Notice also that there is a negative sign in front of the second term (which does contain \(\theta\)). Maximizing a negative of some quantity is the same as minimizing the quantity, so our maximum likelihood equation becomes
\[ \hat{\theta}_{ML}(\vec{r}) = \operatorname*{arg min}_{\theta} \sum_{k=1}^n \frac{(r_k - \theta)^2}{2 \sigma^2} \]
To find the minimum, we can take a derivative with respect to our parameter and set it equal to \(0\).
\[ \frac{d}{d \theta} \left. \sum_{k=1}^n \frac{(r_k - \theta)^2}{2 \sigma^2} \right|_{\theta=\hat{\theta}_{ML}} = 0 \]
which can be rewritten
\[ \sum_{k=1}^n \frac{-2(r_k - \hat{\theta}_{ML})}{2 \sigma^2} = 0 \]
Multiplying through by \(\sigma^2\) and rearranging the sum, we can simplify this expression to
\[ n \; \hat{\theta}_{ML} = \sum_{k=1}^n r_k \]
which leaves us with an equation for \(\hat{\theta}_{ML}\):
\[ \hat{\theta}_{ML} = \frac{1}{n} \sum_{k=1}^n r_k \]
What is this saying? The ML estimate of the mean is simply the sample mean of the observed data.
There are a few final things to note about this problem:

- We call this estimator the sample mean estimator.
- The result is independent of the variance \(\sigma^2\).
- \(\mathbb{E} [\hat{\theta}_{ML} (\vec{r})] = \frac{1}{n} \sum_{k=1}^n \mathbb{E}[r_k] = \theta\), which means this estimator is unbiased.
- The mean-square error of this estimator is \(\mathbb{E} [(\hat{\theta}_{ML} (\vec{r}) - \theta)^2] = \frac{\sigma^2}{n}\), which approaches \(0\) as \(n \rightarrow \infty\). We say that this makes the estimator consistent.
- In general, the more data we have, the better our estimate will be with this estimator.
This is perhaps the simplest estimator we will cover, but it illustrates the concepts of maximum likelihood estimation well.
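As a quick, purely illustrative Monte Carlo check of these properties (unbiasedness, and mean-square error roughly \(\sigma^2/n\)), with arbitrary values chosen for \(\theta\), \(\sigma\), and \(n\):

```python
import numpy as np

# Monte Carlo check of the sample-mean ML estimator (illustrative values only).
rng = np.random.default_rng(1)
theta, sigma, trials = 2.0, 1.5, 20_000

for n in (10, 100, 1000):
    r = rng.normal(theta, sigma, size=(trials, n))
    theta_hat = r.mean(axis=1)                  # hat{theta}_ML for each trial
    bias = theta_hat.mean() - theta             # ~0: the estimator is unbiased
    mse = np.mean((theta_hat - theta) ** 2)     # ~sigma^2 / n: consistency
    print(n, round(bias, 4), round(mse, 4), round(sigma**2 / n, 4))
```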
3.2 Known Signal in Noise of Known Variance, Unknown Amplitude
Suppose \(r_k = \theta s_k + w_k\), where \(s_k\) is known and the \(w_k \sim N(0, \sigma^2)\) are i.i.d. We can write the likelihood function:
\[ f_{\vec{r} | \theta} (\vec{r} | \theta) = \prod_{k=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(r_k - \theta s_k)^2/2 \sigma^2} \]
As before, we take the derivative of the log-likelihood with respect to \(\theta\) and set it to \(0\)
\[ \frac{d}{d \theta} \left. \ln f_{\vec{r} | \theta} (\vec{r} | \theta) \right|_{\theta=\hat{\theta}_{ML}} = 0 \]
which gives \(\sum_{k=1}^n s_k (r_k - \hat{\theta}_{ML} s_k) = 0\); solving for \(\hat{\theta}_{ML}\), we end up with the following maximum likelihood estimate
\[ \hat{\theta}_{ML} (\vec{r}) = \frac{\sum_{k=1}^n r_k s_k}{\sum_{k=1}^n s_k^2} \]
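A small sketch of this estimator, with an assumed known signal \(s_k\) (a sampled cosine) and an arbitrary amplitude and noise level chosen for illustration:

```python
import numpy as np

# Sketch of the unknown-amplitude estimator; signal shape, amplitude, and
# noise level are assumed here purely for illustration.
rng = np.random.default_rng(2)
n, theta, sigma = 200, 0.7, 0.3
s = np.cos(2 * np.pi * 5 * np.arange(n) / n)    # known signal samples s_k
r = theta * s + rng.normal(0.0, sigma, size=n)  # observations r_k

theta_ml = np.dot(r, s) / np.dot(s, s)          # sum(r_k s_k) / sum(s_k^2)
print(theta_ml)                                 # close to theta = 0.7
```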
Recalling signal space concepts, if \(s_k = \langle s(t), \varphi_k (t) \rangle\), \(r_k = \langle r(t), \varphi_k (t) \rangle\), and \(w_k = \langle w(t), \varphi_k (t) \rangle\) and we let \(n \rightarrow \infty\) so that we have a complete orthonormal basis, then we can think of our maximum likelihood equation in terms of waveforms.
\[ \hat{\theta}_{ML} (r(t)) = \frac{\langle r(t), s(t) \rangle}{\langle s(t), s(t) \rangle} \]
In integral form this would be
\[ \hat{\theta}_{ML} (r(t)) = \frac{\int_{-\infty}^\infty r(t) s^*(t) dt}{\int_{-\infty}^\infty |s(t)|^2 dt} \]
We see here that we can solve more complicated continuous-time problems with our vector framework.
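As a rough numerical illustration of the waveform form, the two integrals can be approximated by Riemann sums on a fine time grid (the sample rate, waveform, amplitude, and noise level below are all assumed for illustration):

```python
import numpy as np

# Waveform version, approximated numerically: the inner products become
# integrals, evaluated here as Riemann sums (all parameters assumed).
rng = np.random.default_rng(3)
fs, T = 10_000, 1.0                                 # sample rate and duration
t = np.arange(0.0, T, 1.0 / fs)
theta = -1.2
s = np.sin(2 * np.pi * 3 * t)                       # known real waveform s(t)
r = theta * s + rng.normal(0.0, 0.5, size=t.size)   # received waveform r(t)

dt = 1.0 / fs
num = np.sum(r * s) * dt                            # approximates <r(t), s(t)>
den = np.sum(s * s) * dt                            # approximates <s(t), s(t)>
print(num / den)                                    # close to theta
```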