What is KL Divergence?

KL (Kullback-Leibler) divergence is a measure of how much one probability distribution differs from another.

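For a discrete random variable, $D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$, and for a continuous random variable, $D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$, where $P$ and $Q$ are the two distributions being compared (with densities $p$ and $q$ in the continuous case).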

  • KL divergence is a non-symmetric measure: $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general
  • Used to measure how much information is lost when $Q$ is used to approximate $P$
  • Generally, $P$ is the true distribution of the data, which we can never really know, and $Q$ is the approximation of $P$ from a neural network.
  • Not a distance metric and doesn’t need to satisfy the triangle inequality
  • A non-negative measure: $D_{KL}(P \| Q) \geq 0$
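  • $D_{KL}(P \| Q) = 0$ when $P = Q$, but $D_{KL}(P \| Q) > 0$ when $P \neq Q$
  • $D_{KL}(P \| Q) = \infty$ if the two distributions are completely different, for example when $Q(x) = 0$ everywhere that $P(x) > 0$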
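
The properties listed above can be checked numerically. The following is a minimal sketch, assuming discrete distributions given as plain Python lists and the natural logarithm (so values are in nats); the function name `kl_divergence` and the example distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: a term with P(x) = 0 contributes nothing
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total

# Hypothetical example distributions over two outcomes
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # approximately 0.5108
kl_qp = kl_divergence(q, p)  # approximately 0.3681, so the measure is not symmetric
kl_pp = kl_divergence(p, p)  # 0.0 when both distributions are identical
kl_disjoint = kl_divergence([1.0, 0.0], [0.0, 1.0])  # inf when supports do not overlap

print(f"{kl_pq:.4f} {kl_qp:.4f} {kl_pp} {kl_disjoint}")
```

Swapping the arguments changes the value, which is why the order of $P$ and $Q$ matters when KL divergence is used as a loss.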

KL Divergence and Maximum Likelihood Estimator

Before we identify the relationship between KL divergence and the maximum likelihood estimator, we have to look at what exactly maximum likelihood estimation is. In short, maximum likelihood estimation is finding the parameter of a distribution that best describes a set of data. For independent data points $x_1, \dots, x_N$ and a model $Q(x \mid \theta)$ with parameter $\theta$, the likelihood can be written in the form $L(\theta) = \prod_{i=1}^{N} Q(x_i \mid \theta)$, and finding the parameter that maximizes it, $\hat{\theta} = \arg\max_{\theta} L(\theta)$, is maximum likelihood estimation. The log likelihood can be represented as $\log L(\theta) = \sum_{i=1}^{N} \log Q(x_i \mid \theta)$. Now we can take a look at how KL divergence relates to maximum likelihood estimation.
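
One way to sketch the connection is the chain of equalities below. It assumes the samples $x_1, \dots, x_N$ are drawn independently from the true distribution $P$, and it writes $Q_{\theta}$ as shorthand for the model $Q(x \mid \theta)$; this notation is an assumption made for the sketch. The first term in the bracket is the negative entropy of $P$, which does not depend on $\theta$ and so drops out of the optimization. The last step replaces the expectation with a sample average.

```latex
\begin{aligned}
\arg\min_{\theta} D_{KL}(P \,\|\, Q_{\theta})
  &= \arg\min_{\theta} \Big[ \sum_{x} P(x) \log P(x) - \sum_{x} P(x) \log Q(x \mid \theta) \Big] \\
  &= \arg\max_{\theta} \sum_{x} P(x) \log Q(x \mid \theta) \\
  &= \arg\max_{\theta} \mathbb{E}_{x \sim P}\big[ \log Q(x \mid \theta) \big] \\
  &\approx \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \log Q(x_i \mid \theta)
\end{aligned}
```

The final line is the log likelihood scaled by $1/N$, and scaling by a positive constant does not change the maximizer.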

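The key result is that minimizing KL divergence is essentially the same thing as maximizing the log likelihood: the only part of $D_{KL}(P \| Q)$ that depends on the model is $-\mathbb{E}_{x \sim P}[\log Q(x)]$, and the average log likelihood of the data is a sample estimate of $\mathbb{E}_{x \sim P}[\log Q(x)]$.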
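
As a small numerical check of this equivalence, the sketch below fits a Bernoulli model to a made-up list of coin flips in two ways: by maximizing the average log likelihood and by minimizing the KL divergence from the empirical distribution of the data. It assumes the unknown true distribution is replaced by the empirical one. The data, the grid of candidate parameters and the function names are all hypothetical.

```python
import math

# Hypothetical coin flips: 7 ones out of 10
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
n = len(data)
p_hat = sum(data) / n  # empirical probability of observing a 1

def avg_log_likelihood(theta):
    """Average log likelihood of the data under Bernoulli(theta)."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data) / n

def kl_to_model(theta):
    """D_KL(empirical distribution || Bernoulli(theta)) in nats."""
    total = 0.0
    for p_i, q_i in ((p_hat, theta), (1.0 - p_hat, 1.0 - theta)):
        if p_i > 0.0:
            total += p_i * math.log(p_i / q_i)
    return total

# Simple grid search over candidate parameters 0.01, 0.02, ..., 0.99
thetas = [i / 100 for i in range(1, 100)]
theta_mle = max(thetas, key=avg_log_likelihood)
theta_kl = min(thetas, key=kl_to_model)

print(theta_mle, theta_kl)  # both searches pick the same parameter
```

Both searches land on the empirical frequency of ones, which is the familiar maximum likelihood estimate for a Bernoulli parameter.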

KL Divergence and Cross Entropy

Next, let’s move on to the relationship between KL divergence and cross entropy.

Entropy is a measure of the level of uncertainty/disorder/unpredictability in a given dataset/system. It’s a metric that quantifies the amount of information in a dataset, and it’s often used to evaluate the quality of a model and its ability to make accurate predictions.
⬆️ uncertainty = ⬆️ entropy

The formula for the entropy of a discrete distribution $P$ is the following: $H(P) = -\sum_{x} P(x) \log P(x)$
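
To see the "more uncertainty, more entropy" rule in numbers, here is a minimal sketch that computes Shannon entropy, $-\sum_{x} P(x) \log P(x)$, for three coins. It assumes the natural logarithm (nats); the function name `entropy` and the example distributions are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution given as a list."""
    # Terms with P(x) = 0 are skipped (convention: 0 * log 0 = 0)
    return sum(-p_i * math.log(p_i) for p_i in p if p_i > 0.0)

h_fair = entropy([0.5, 0.5])     # fair coin: log(2), approximately 0.6931
h_biased = entropy([0.9, 0.1])   # biased coin: approximately 0.3251
h_certain = entropy([1.0, 0.0])  # outcome known in advance: 0

print(f"{h_fair:.4f} {h_biased:.4f} {h_certain:.4f}")
```

The fair coin is the hardest to predict and has the highest entropy, while the coin whose outcome is certain has none.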

Then what is cross entropy? Cross entropy is another loss function that can be used to quantify the difference between two probability distributions: $H(P, Q) = -\sum_{x} P(x) \log Q(x)$, where $P$ represents the true probability distribution (usually one-hot) and $Q$ represents the model’s predicted probability distribution. Cross entropy is usually used in classification problems to determine how well the model performs. A good classifier would usually have a cross entropy close to 0.

Essentially, minimizing KL divergence is equivalent to minimizing cross entropy under specific conditions: since $H(P, Q) = H(P) + D_{KL}(P \| Q)$, the two differ only by the entropy $H(P)$, which is constant whenever the true distribution $P$ is fixed and equals 0 when $P$ is one-hot (for more, take a look at this link).
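
The relationship between the three quantities can be verified with a few lines of code. This sketch assumes discrete distributions as lists, natural logarithms, and a predicted distribution that is strictly positive everywhere; the function names and the example labels and predictions are made up for illustration.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return sum(-p_i * math.log(p_i) for p_i in p if p_i > 0.0)

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(-p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

q_pred = [0.6, 0.3, 0.1]  # hypothetical model prediction over three classes

# Soft labels: cross entropy = entropy + KL divergence
p_soft = [0.7, 0.2, 0.1]
gap = cross_entropy(p_soft, q_pred) - (entropy(p_soft) + kl_divergence(p_soft, q_pred))

# One-hot label: the entropy of P is 0, so cross entropy and KL divergence coincide
p_onehot = [0.0, 1.0, 0.0]
ce_onehot = cross_entropy(p_onehot, q_pred)
kl_onehot = kl_divergence(p_onehot, q_pred)

print(f"{gap:.1e} {ce_onehot:.4f} {kl_onehot:.4f}")
```

With a one-hot label both losses reduce to the negative log of the probability the model assigns to the correct class, which is why a confident, correct classifier drives them toward 0.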