Methods of estimation

Given some data, we may want to estimate the parameter(s) of the true probability distribution they came from. Three common methods are the plug-in estimator, feature matching, and maximum likelihood.

For the plug-in estimator, compute the quantity of interest on the empirical distribution, which places probability $1/n$ on each data point. In other words, wherever the definition uses the true distribution, plug in the data.

$$ \mu = \mathbb{E}[X] $$

$$ \hat{M} = \frac{1}{n} \sum_{i=1}^{n} X_i = \hat{\mathbb{E}}[X] $$

$$ v = \mathbb{E}\left[\left(X - \mathbb{E}[X] \right)^2 \right] $$

$$ \hat{V} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{M})^2 $$

$$ a = \mathrm{median}(\mathbb{P}) $$

$$ \hat{A} = \mathrm{median}(\hat{\mathbb{P}}) $$
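As a concrete illustration, here is a minimal Python sketch of these three plug-in estimators. The function names are ours, chosen for this example, not standard library APIs.

```python
import numpy as np

def plugin_mean(x):
    # hat{M}: the average of the data, each point weighted 1/n
    return np.mean(x)

def plugin_variance(x):
    # hat{V}: average squared deviation from the plug-in mean
    # (np.var(x) also uses the 1/n convention by default)
    return np.mean((x - np.mean(x)) ** 2)

def plugin_median(x):
    # hat{A}: the median of the empirical distribution
    return np.median(x)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)
print(plugin_mean(x), plugin_variance(x), plugin_median(x))
```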

A feature is a property of a distribution, such as the mean, variance, or median.

The goal of feature matching is to choose the parameter(s) of the distribution so that the feature(s) of the fitted distribution match the corresponding feature(s) of the data.

For a given probability distribution $\mathbb{P}$ with parameter $\theta$, we can extract feature(s) $h^\theta = g(\mathbb{P}^\theta)$. We can also calculate the features for the empirical distribution $\hat{h} = g(\hat{\mathbb{P}})$. Then solve for $\theta$ by setting $h^\theta = \hat{h}$.
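For example, if $\mathbb{P}^\lambda$ is the Exponential($\lambda$) distribution and the feature is the mean, then $h^\lambda = 1/\lambda$ and $\hat{h} = \frac{1}{n} \sum_{i=1}^n X_i$, so setting $h^\lambda = \hat{h}$ gives

$$ \hat{\lambda} = \left( \frac{1}{n} \sum_{i=1}^n X_i \right)^{-1} $$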

Moments of distributions are commonly used as features for feature matching. The $k$-th moment of a random variable $X$ is $\mathbb{E}[X^k]$.

To estimate the moment from empirical data $X_1, \ldots, X_n$, replace the expectation with the sample average:

$$ \hat{\mathbb{E}}[X^k] = \frac{1}{n} \sum_{i=1}^n X_i^k $$
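A short Python sketch of moment matching for a Gamma distribution with shape $k$ and scale $\theta$, where the mean is $k\theta$ and the variance is $k\theta^2$; solving the two equations gives $k = \text{mean}^2/\text{var}$ and $\theta = \text{var}/\text{mean}$. The helper name below is ours.

```python
import numpy as np

def gamma_moment_match(x):
    """Moment-matching estimates for Gamma(shape=k, scale=theta).

    Matches E[X] = k*theta and Var[X] = k*theta^2 to the
    empirical mean and plug-in variance.
    """
    m1 = np.mean(x)                # empirical first moment
    v = np.mean((x - m1) ** 2)     # plug-in variance
    k_hat = m1 ** 2 / v            # k = mean^2 / var
    theta_hat = v / m1             # theta = var / mean
    return k_hat, theta_hat

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=5000)
print(gamma_moment_match(x))       # roughly (3.0, 2.0)
```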

For maximum likelihood, assume a probability mass function (discrete case) or probability density function (continuous case) with parameter(s) $\theta$. Given independent data points $X = (X_1, \ldots, X_n)$, the likelihood function is the product of the PMF evaluated at each point for a discrete distribution, or the product of the PDF evaluated at each point for a continuous distribution. (Independence is what justifies writing the joint likelihood as a product.)

Discrete (PMF):

$$ L^\theta(x_1, \ldots, x_n) = \prod_{i=1}^{n} \mathbb{P}^\theta (X_i = x_i) $$

Continuous (PDF):

$$ L^\theta(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}^\theta (x_i) $$

It is usually easier to maximize the logarithm of the likelihood function, known as the log-likelihood: the product becomes a sum, which is simpler to differentiate and more numerically stable. Because the logarithm is strictly increasing, maximizing the log-likelihood yields the same $\theta$ as maximizing the likelihood itself.

Discrete (PMF):

$$ \max_\theta \sum_{i=1}^n \log \mathbb{P}^\theta (X_i = x_i) $$

Continuous (PDF):

$$ \max_\theta \sum_{i=1}^n \log f_{X_i}^\theta (x_i) $$
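To make the maximization concrete, here is a minimal sketch that computes the MLE for a Normal($\mu$, $\sigma$) model by minimizing the negative log-likelihood numerically with scipy. The setup is ours for illustration; for this particular model the answer also has a closed form, $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2$ equal to the plug-in variance.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(params, x):
    # Negative log-likelihood of i.i.d. Normal(mu, sigma) data;
    # minimizing this maximizes the log-likelihood.
    mu, log_sigma = params
    sigma = np.exp(log_sigma)   # optimize over log(sigma) so sigma stays > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=0.8, size=2000)

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(x,))
mu_hat = result.x[0]
sigma_hat = np.exp(result.x[1])
print(mu_hat, sigma_hat)        # close to the closed form: x.mean(), x.std()
```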
