Reference Note • Distances & Divergences
Taxonomy of Principal Distances & Divergences
Hamidreza Hashempoor • Institute for AI, University of Stuttgart
This note reorganizes the classical “distances and divergences” landscape
(after Frank Nielsen’s taxonomy chart) into clean families. It comes in two views:
a compact quick-reference table, and a
detailed reference where every entry carries its defining
equation, where it is used, and a small worked numeric example. The numbers in
every Example line are reproduced by the companion
Jupyter notebook.
How to read this sheet — metric vs. divergence
A metric (true distance) satisfies four rules: $d(p,q)\ge 0$;
$d(p,q)=0 \Leftrightarrow p=q$; symmetry $d(p,q)=d(q,p)$; and the
triangle inequality $d(p,r)\le d(p,q)+d(q,r)$. A divergence
keeps only the first two (non-negativity and identity): it is generally
asymmetric, $D(p\|q)\ne D(q\|p)$, and need not obey the triangle inequality.
Divergences measure how one distribution differs from a reference; metrics measure
mutual separation. The symbol $\|$ denotes the asymmetric “from / to”
argument order.
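To make the distinction concrete, here is a minimal numpy sketch that checks symmetry and the triangle inequality for KL (a divergence) and total variation (a metric). The three pmfs and the helper names `kl` and `tv` are illustrative choices for this sketch, not taken from the companion notebook.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for strictly positive pmfs."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):
    """Total variation distance, a true metric."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * float(np.sum(np.abs(p - q)))

# Three illustrative pmfs; m is the midpoint of p and q.
p, m, q = [0.5, 0.5], [0.7, 0.3], [0.9, 0.1]

kl_asymmetric = abs(kl(p, q) - kl(q, p)) > 1e-6           # D(p||q) != D(q||p)
kl_breaks_triangle = kl(p, q) > kl(p, m) + kl(m, q)        # the detour via m is "shorter"
tv_obeys_triangle = tv(p, q) <= tv(p, m) + tv(m, q) + 1e-12

print(kl(p, q), kl(q, p))
print(kl_asymmetric, kl_breaks_triangle, tv_obeys_triangle)
```

All three flags should come out true: KL fails both extra axioms on this triple, while total variation satisfies the triangle inequality (here with equality up to rounding).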
Summary / Table of distances
Grouped families, defining equations, and typical applications. Generator conventions:
$f$-divergence $D_f(p\|q)=\int p\,f(q/p)\,d\mu$; Bregman $B_F$ from a convex potential
$F$. $\|$ marks asymmetric (divergence) arguments; argument-order conventions vary by
source.
| Group | Distance / divergence | Equation | Application (where to use) |
| Vector & metric distances | Euclidean ($L_2$) | $d_2=\sqrt{\sum_i (p_i-q_i)^2}$ | $k$-means, $k$-NN, least squares |
| | Manhattan ($L_1$) | $d_1=\sum_i |p_i-q_i|$ | Grid/route cost; robust to outliers; LASSO |
| | Minkowski ($L_k$) | $d_k=\big(\sum_i |p_i-q_i|^k\big)^{1/k}$ | Tunable; unifies $L_1,L_2,L_\infty$ |
| | Chebyshev ($L_\infty$) | $d_\infty=\max_i |p_i-q_i|$ | Max-coordinate gap; chessboard moves |
| | Mahalanobis | $\sqrt{(p-q)^{\top}\Sigma^{-1}(p-q)}$ | Scale/correlation-aware; outliers, classification |
| | Quadratic | $\sqrt{(p-q)^{\top} Q\,(p-q)}$ | Feature-weighted; colour histograms |
| | Hamming | $\big|\{i:p_i\neq q_i\}\big|$ | Error-correcting codes, DNA, bit strings |
| Riemannian & information geometry | Fisher information | $\mathbf I(\theta)=\mathbb E[(\partial_\theta\ln p)(\partial_\theta\ln p)^{\top}]$ | Natural metric on models; Cramér–Rao, natural gradient |
| | Fisher–Rao distance | $\rho=\min_\gamma\!\int_0^1\!\sqrt{\dot\gamma^{\top}\mathbf I\,\dot\gamma}\,dt$ | Intrinsic geodesic distance between distributions |
| | Riemannian geodesic | $L=\int\!\sqrt{g_{ij}\,\dot x^i\dot x^j}\,dt$ | Shortest path on curved manifolds; shape spaces |
| $f$-divergences (Csiszár / Ali–Silvey) | general template | $D_f(p\|q)=\int p\,f(q/p)\,d\mu$ | Master family: pick the convex generator $f$ |
| | Kullback–Leibler | $\int p\log\tfrac{p}{q}\,d\mu$ | MLE, cross-entropy loss, variational inference |
| | Pearson $\chi^2$ | $\int \tfrac{(q-p)^2}{p}\,d\mu$ | Goodness-of-fit; local KL approximation |
| | Hellinger | $\sqrt{\tfrac12\!\int(\sqrt p-\sqrt q)^2\,d\mu}$ | Symmetric bounded metric; robust statistics |
| | Total variation | $\tfrac12\!\int|p-q|\,d\mu$ | Statistical distinguishability; coupling |
| | Amari $\alpha$-divergence | $f_\alpha(t)=\tfrac{4}{1-\alpha^2}\big(1-t^{\frac{1+\alpha}{2}}\big)$ | One dial: KL ($\alpha\to-1$), reverse KL ($\alpha\to1$), squared Hellinger up to scale ($\alpha{=}0$) |
| Bregman divergences | general template | $B_F(x\|y)=F(x)-F(y)-\langle x-y,\nabla F(y)\rangle$ | Master family: pick the convex potential $F$ |
| | Squared Euclidean | $\|x-y\|^2$ | Centroid clustering ($k$-means) |
| | Itakura–Saito | $\sum_i\!\big(\tfrac{p_i}{q_i}-\log\tfrac{p_i}{q_i}-1\big)$ | Audio/spectral distortion; NMF |
| | Log-Det | $\langle P,Q^{-1}\rangle-\log\det(PQ^{-1})-n$ | SPD matrices; metric learning (ITML) |
| Overlap / $\alpha$-power family | Bhattacharyya | $-\log\!\int\!\sqrt{pq}\,\,d\mu$ | Class separability; object tracking |
| | Chernoff information | $\max_{\alpha\in(0,1)}\,-\log\!\int p^{\alpha} q^{1-\alpha}d\mu$ | Error exponent in hypothesis testing |
| | Rényi divergence | $\tfrac{1}{\alpha-1}\log\!\int p^{\alpha} q^{1-\alpha}d\mu$ | Differential privacy; information theory |
| Symmetrized & Jensen-type | Jeffreys | $\mathrm{KL}(p\|q)+\mathrm{KL}(q\|p)$ | Symmetric KL when direction is arbitrary |
| | Jensen–Shannon | $\tfrac12\mathrm{KL}(p\|m)+\tfrac12\mathrm{KL}(q\|m),\ m=\tfrac{p+q}{2}$ | Bounded; $\sqrt{\cdot}$ is a metric; GANs |
| | Burbea–Rao / Jensen | $\tfrac{F(p)+F(q)}{2}-F\!\big(\tfrac{p+q}{2}\big)$ | Jensen gap; JS is the Shannon-entropy case |
| Entropies (functionals behind divergences) | Shannon / Boltzmann | $H=-\!\int p\log p\,\,d\mu$ | Information content, source coding |
| | Rényi | $H_\alpha=\tfrac{1}{1-\alpha}\log\!\int p^{\alpha}d\mu$ | Min-/collision-entropy; cryptography |
| | Tsallis (non-additive) | $T_\alpha=\tfrac{1}{1-\alpha}\big(\!\int p^{\alpha}d\mu-1\big)$ | Non-extensive (long-range) systems |
| | Von Neumann (quantum) | $S(\rho)=-\mathrm{Tr}(\rho\log\rho)$ | Quantum entropy; entanglement |
| Set & metric-space distances | Hausdorff | $\max\{\sup_{x}\rho(x,Y),\ \sup_{y}\rho(X,y)\}$ | Set/shape matching; image comparison |
| | Gromov–Hausdorff | $\inf_{\phi_X,\phi_Y}\rho_H^{Z}(\phi_X X,\phi_Y Y)$ | Compare spaces up to isometry; manifold learning |
| Optimal transport & IPMs | Wasserstein / EMD | $\big(\inf_{\gamma\in\Gamma}\!\int\rho(x,y)^{\alpha}d\gamma\big)^{1/\alpha}$ | Optimal transport; WGAN, retrieval; EMD$=W_1$ |
| | Max Mean Discrepancy | $\sup_{\|f\|_{\mathcal H}\le1}|\mathbb E_p f-\mathbb E_q f|$ | Kernel two-sample tests; generative models |
| | Kolmogorov–Smirnov | $\sup_x|F_p(x)-F_q(x)|$ | Non-parametric goodness-of-fit |
| | Lévy–Prokhorov | $\inf\{\varepsilon: p(A)\le q(A^{\varepsilon})+\varepsilon\ \forall A\}$ | Metrizes weak convergence (in distribution) |
| Quantum geometry | Von Neumann divergence | $\mathrm{Tr}\big(P(\log P-\log Q)-P+Q\big)$ | Quantum KL; distinguishing quantum states |
Nesting at a glance: $L_1$, $L_2$ and $L_\infty$ are the $k=1$, $k=2$ and $k\to\infty$ members of the Minkowski family;
Mahalanobis is Quadratic with $Q=\Sigma^{-1}$. KL is the only divergence that is
both an $f$-divergence and a Bregman divergence. Bhattacharyya/Chernoff/Rényi all
come from the affinity $\int p^{\alpha}q^{1-\alpha}d\mu$. Wasserstein-1, MMD and KS
are IPMs; Lévy–Prokhorov is grouped with them but is not itself an IPM.
Detailed distances (with worked examples)
Each entry below gives the defining equation, an Application line, and a small
Example with concrete numbers. The example numbers are exactly those printed by the
companion notebook.
1. Metric distances on vectors (geometry of points)
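Classical distances between two points $\mathbf p,\mathbf q\in\mathbb R^n$ (or two sequences). All are true metrics, with two caveats: the quadratic form needs $Q\succ0$ (a singular $Q\succeq0$ gives only a pseudometric), and Dynamic Time Warping is a dissimilarity, not a metric.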
Euclidean distance, $L_2$ (Pythagoras)
$$d_2(\mathbf p,\mathbf q)=\sqrt{\textstyle\sum_i (p_i-q_i)^2}$$
Application: straight-line distance; $k$-means, nearest-neighbour search, least-squares regression.
Example: with $\mathbf p=(0,0)$, $\mathbf q=(3,4)$: $d_2=\sqrt{9+16}=\mathbf{5.0}$.
Manhattan / city-block distance, $L_1$
$$d_1(\mathbf p,\mathbf q)=\textstyle\sum_i |p_i-q_i|$$
Application: grid/route distance; robust to outliers; underlies LASSO sparsity.
Example: same $\mathbf p,\mathbf q$: $d_1=|3|+|4|=\mathbf{7.0}$.
Minkowski distance, $L_k$-norm
$$d_k(\mathbf p,\mathbf q)=\Big(\textstyle\sum_i |p_i-q_i|^k\Big)^{1/k}$$
Application: tunable family: $k{=}1$ Manhattan, $k{=}2$ Euclidean, $k{\to}\infty$ Chebyshev.
Example: same $\mathbf p,\mathbf q$, $k=3$: $(3^3+4^3)^{1/3}=91^{1/3}\approx\mathbf{4.498}$.
Chebyshev distance, $L_\infty$
$$d_\infty(\mathbf p,\mathbf q)=\max_i |p_i-q_i|$$
Application: maximum-coordinate gap; king moves on a chessboard.
Example: same $\mathbf p,\mathbf q$: $\max(3,4)=\mathbf{4.0}$.
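The four $L_k$ examples above can be reproduced with one small function. This is a minimal sketch; the name `minkowski` and the use of `np.inf` to select the Chebyshev limit are choices made here, not necessarily those of the companion notebook.

```python
import numpy as np

def minkowski(p, q, k):
    """L_k distance between two points; k = np.inf gives the Chebyshev limit."""
    diff = np.abs(np.asarray(p, float) - np.asarray(q, float))
    if np.isinf(k):
        return float(diff.max())
    return float(np.sum(diff ** k) ** (1.0 / k))

p, q = (0, 0), (3, 4)
for k in (1, 2, 3, np.inf):
    # k=1 Manhattan, k=2 Euclidean, k=3 Minkowski, k=inf Chebyshev
    print(k, round(minkowski(p, q, k), 3))
```

The printed values decrease as $k$ grows (7, 5, about 4.498, 4), which is the general ordering $d_\infty\le d_k\le d_1$.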
Quadratic (generalized) distance
$$d_Q(\mathbf p,\mathbf q)=\sqrt{(\mathbf p-\mathbf q)^{\top} Q\,(\mathbf p-\mathbf q)}\,,\quad Q \succeq 0$$
Application: feature-weighted / cross-bin distance, e.g. comparing colour histograms.
Example: $\mathbf p=(0,0)$, $\mathbf q=(3,4)$, $Q=\big[\begin{smallmatrix}2&0.5\\0.5&1\end{smallmatrix}\big]$: $d_Q\approx\mathbf{6.782}$.
Mahalanobis metric (1936)
$$d_\Sigma(\mathbf p,\mathbf q)=\sqrt{(\mathbf p-\mathbf q)^{\top} \Sigma^{-1}(\mathbf p-\mathbf q)}$$
Application: scale- and correlation-aware distance ($Q=\Sigma^{-1}$); outlier detection, classification, metric learning.
Example: $\mathbf a=(2,1)$, $\mathbf b=(0,0)$, $\Sigma=\big[\begin{smallmatrix}2&0.5\\0.5&1\end{smallmatrix}\big]$: $d_\Sigma\approx\mathbf{1.512}$.
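The Quadratic and Mahalanobis examples differ only in whether the matrix or its inverse sits in the middle of the quadratic form. A minimal sketch, with hypothetical helper names, that reproduces both numbers:

```python
import numpy as np

def quadratic_distance(p, q, Q):
    """sqrt((p-q)^T Q (p-q)) for a positive semi-definite Q."""
    d = np.asarray(p, float) - np.asarray(q, float)
    return float(np.sqrt(d @ np.asarray(Q, float) @ d))

def mahalanobis(p, q, Sigma):
    """sqrt((p-q)^T Sigma^{-1} (p-q)); solve() avoids forming the inverse."""
    d = np.asarray(p, float) - np.asarray(q, float)
    return float(np.sqrt(d @ np.linalg.solve(np.asarray(Sigma, float), d)))

M = [[2.0, 0.5], [0.5, 1.0]]                 # used as Q in one example, as Sigma in the other
d_Q = quadratic_distance((0, 0), (3, 4), M)  # about 6.782
d_S = mahalanobis((2, 1), (0, 0), M)         # about 1.512
print(d_Q, d_S)
```

Calling `quadratic_distance` with `np.linalg.inv(M)` gives the same value as `mahalanobis`, which is the $Q=\Sigma^{-1}$ identity stated above.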
Hamming distance
$$d_H(\mathbf p,\mathbf q)=\big|\{\,i : p_i \ne q_i\,\}\big|$$
Application: count of differing symbols; error-correcting codes, DNA comparison, bit strings.
Example: "10110" vs "10011" differ in positions 3 and 5 $\Rightarrow d_H=\mathbf{2}$.
String / time-series distances
Application: edit-/alignment-based dissimilarities — Levenshtein (edit distance) and Dynamic Time Warping; spell-checking, bioinformatics, speech and gesture recognition.
Example: Levenshtein("kitten", "sitting") $=\mathbf{3}$; DTW of $[1,2,3]$ vs $[1,2,2,3]$ $=\mathbf{0.0}$.
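The three sequence dissimilarities above are short dynamic programs. The sketch below uses textbook formulations (unit edit costs for Levenshtein; absolute-difference local cost and no window constraint for DTW); these settings are assumptions that happen to reproduce the Example values.

```python
import numpy as np

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

def dtw(x, y):
    """Classic DTW cost with |x_i - y_j| as the local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

print(hamming("10110", "10011"))            # 2
print(levenshtein("kitten", "sitting"))     # 3
print(dtw([1, 2, 3], [1, 2, 2, 3]))         # 0.0
```

The DTW result of zero for two different sequences shows why DTW is a dissimilarity rather than a metric: it can vanish without the inputs being equal.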
2. Riemannian & information geometry (curved manifolds)
Here “distance” becomes the length of the shortest path (geodesic) on a
curved space. For statistical models, the natural curvature comes from the Fisher information.
Riemannian metric tensor & geodesic length
$$ds^2=g_{ij}\,dx^i\,dx^j,\qquad L(\gamma)=\int \sqrt{g_{ij}\,\dot x^i \dot x^j}\;dt$$
Application: shortest paths on curved surfaces/manifolds; foundation of general relativity and shape spaces.
Example: on a unit sphere the geodesic between two points at angular separation $\theta$ is the great-circle arc, $L=\theta$ (e.g. orthogonal points $\Rightarrow L=\pi/2\approx\mathbf{1.571}$). No closed-form snippet for general $g_{ij}$.
Finsler metric tensor
$$g_{ij}(x,y)=\tfrac12\,\frac{\partial^2 F^2(x,y)}{\partial y^i \partial y^j}$$
Application: generalizes Riemannian geometry to direction-dependent norms (anisotropic costs).
Example: a Randers metric $F=\sqrt{g_{ij}y^iy^j}+b_iy^i$ makes “uphill” and “downhill” travel cost differently — no single scalar value; computed by solving the geodesic flow.
Fisher information matrix
$$\mathbf I(\theta)=\mathbb E\!\left[\Big(\tfrac{\partial}{\partial\theta}\ln p(X\mid\theta)\Big)\!\Big(\tfrac{\partial}{\partial\theta}\ln p(X\mid\theta)\Big)^{\top}\right]$$
Application: the “local entropy” / natural metric on a statistical model; Cramér–Rao bound, natural-gradient descent.
Example: for $\mathcal N(\mu,\sigma^2)$ with $\sigma=2$, $\mathbf I(\mu)=1/\sigma^2=\mathbf{0.25}$ (Monte-Carlo estimate $\approx0.251$).
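The Monte-Carlo figure can be approximated by averaging the squared score over samples. This is a sketch under assumed settings (seed 0, 200,000 draws, $\mu=0$); the notebook's seed and sample size are not stated here, so the last digits will generally differ from $0.251$.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed chosen for this sketch
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

score = (x - mu) / sigma**2              # d/dmu of log N(x | mu, sigma^2)
fisher_mc = float(np.mean(score**2))     # Monte-Carlo estimate of E[score^2]
fisher_exact = 1.0 / sigma**2            # closed form 1/sigma^2

print(fisher_exact, fisher_mc)
```

The estimate should land within a few thousandths of the exact value $0.25$.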
Fisher–Rao distance
$$\rho_{FR}(p,q)=\min_{\gamma}\int_0^1 \sqrt{\dot\gamma(t)^{\top} \mathbf I(\theta)\,\dot\gamma(t)}\;dt$$
Application: intrinsic geodesic distance between distributions; the Riemannian “gold standard” on statistical manifolds.
Example: no simple closed form in general, but locally $\rho_{FR}(p,q)\approx\sqrt{2\,\mathrm{KL}(p\|q)}$; for the $p,q$ used below, $\sqrt{2\cdot0.511}\approx\mathbf{1.011}$.
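For categorical distributions specifically, the Fisher–Rao distance does have a closed form, $2\arccos\sum_i\sqrt{p_i q_i}$ (the simplex maps onto a sphere of radius 2 under $p\mapsto 2\sqrt p$). The sketch below compares it with the local approximation on the same $p,q$; the closed form is an addition of this sketch, not something the Example line uses.

```python
import numpy as np

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

bc = float(np.sum(np.sqrt(p * q)))                        # Bhattacharyya coefficient
fr_exact = float(2.0 * np.arccos(np.clip(bc, -1.0, 1.0))) # categorical closed form
kl_pq = float(np.sum(p * np.log(p / q)))
fr_local = float(np.sqrt(2.0 * kl_pq))                    # local approximation sqrt(2 KL)

print(fr_exact, fr_local)
```

The two values (about 0.927 and 1.011) differ noticeably because these $p,q$ are far apart; the approximation tightens as $q\to p$.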
3a. $f$-divergences (Ali–Silvey 1966; Csiszár 1967)
A single template generates a zoo of divergences via a convex generator $f$ with $f(1)=0$:
$$D_f(p\|q)=\int p\,f\!\Big(\tfrac{q}{p}\Big)\,d\mu .$$
Examples use the pmfs $p=[0.5,0.5]$, $q=[0.9,0.1]$.
Kullback–Leibler divergence / relative entropy ($f(t)=-\log t$)
$$\mathrm{KL}(p\|q)=\int p\,\log\frac{p}{q}\,d\mu=\mathbb E_p\!\Big[\log\tfrac{p}{q}\Big]$$
Application: maximum-likelihood, cross-entropy loss, variational inference, model selection.
Example: $\mathrm{KL}(p\|q)=0.5\log\tfrac{0.5}{0.9}+0.5\log\tfrac{0.5}{0.1}\approx\mathbf{0.511}$ nats (and $\mathrm{KL}(q\|p)\approx0.368$ — asymmetric).
Reverse Kullback–Leibler ($f(t)=t\log t$)
$$\mathrm{KL}(q\|p)=\int q\,\log\frac{q}{p}\,d\mu$$
Application: the other KL direction; minimizing it over $q$ is mode-seeking (zero-forcing), as in variational inference, where the approximation $q$ concentrates inside the support of the target $p$. Expectation propagation, by contrast, minimizes the forward direction.
Example: same $p,q$: $0.9\log\tfrac{0.9}{0.5}+0.1\log\tfrac{0.1}{0.5}\approx\mathbf{0.368}$ nats.
Pearson $\chi^2$ divergence ($f(t)=(t-1)^2$)
$$\chi^2(p\|q)=\int \frac{(q-p)^2}{p}\,d\mu$$
Application: goodness-of-fit testing; local (second-order) approximation to KL.
Example: $\tfrac{(0.4)^2}{0.5}+\tfrac{(-0.4)^2}{0.5}=\mathbf{0.64}$.
Neyman $\chi^2$ divergence ($f(t)=(1-t)^2/t$)
$$\chi^2_N(p\|q)=\int \frac{(p-q)^2}{q}\,d\mu$$
Application: the “reverse” Pearson $\chi^2$ (roles of $p,q$ swapped); goodness-of-fit and importance-sampling variance diagnostics.
Example: same $p,q$: $\tfrac{(0.4)^2}{0.9}+\tfrac{(0.4)^2}{0.1}\approx\mathbf{1.778}$.
Hellinger distance ($f(t)=(\sqrt t-1)^2$)
$$H(p,q)=\sqrt{\tfrac12\int\!\big(\sqrt{p}-\sqrt{q}\big)^2 d\mu}$$
Application: a symmetric, bounded true metric between densities (the generator in the heading gives $2H^2$, so $H$ is the square root of half the $f$-divergence); robust statistics, density estimation.
Example: $H(p,q)\approx\mathbf{0.325}$ (bounded in $[0,1]$).
Total variation distance ($f(t)=\tfrac12|t-1|$)
$$\mathrm{TV}(p,q)=\tfrac12\int |p-q|\,d\mu$$
Application: the largest difference in probability that the two distributions assign to any single event, $\sup_A|p(A)-q(A)|$; coupling arguments, mixing times.
Example: $\tfrac12(|0.5-0.9|+|0.5-0.1|)=\mathbf{0.4}$.
Amari $\alpha$-divergence (1985)
$$f_\alpha(t)=\frac{4}{1-\alpha^2}\Big(1-t^{\frac{1+\alpha}{2}}\Big),\ \ -1<\alpha<1$$
Application: one dial spanning KL ($\alpha\to-1$), reverse-KL ($\alpha\to1$) and four times the squared Hellinger distance ($\alpha{=}0$), in this note's $t=q/p$ convention; core of information geometry.
Example: at $\alpha=0$, $D=4\big(1-\int\sqrt{pq}\big)\approx\mathbf{0.422}$ on the same $p,q$.
Each $f$-divergence is fixed by its convex generator $f$ (with $f(1)=0$). The table below
collects the common ones, after Nielsen & Nock,
arXiv:1309.3029, Table 1. We write them in this post's
convention $D_f(p\|q)=\int p\,f(q/p)\,d\mu$, so the argument is $t=q/p$ (some sources use
$t=p/q$, which swaps a generator with its conjugate $f^\ast(t)=t\,f(1/t)$).
| $f$-divergence | Generator $f(t)$, $\ f(1)=0,\ t=q/p$ | In this post |
| Kullback–Leibler | $-\log t$ | §3a above |
| Reverse KL | $t\log t$ | §3a above |
| Pearson $\chi^2$ | $(t-1)^2$ | §3a above |
| Neyman $\chi^2$ | $(1-t)^2/t$ | §3a above |
| Squared Hellinger | $(\sqrt t-1)^2$ | §3a (Hellinger) |
| Total variation | $\tfrac12|t-1|$ | §3a above |
| Amari $\alpha$-divergence | $\tfrac{4}{1-\alpha^2}\big(1-t^{\frac{1+\alpha}{2}}\big)$ | §3a above |
| Pearson–Vajda $\chi^k$ | $(t-1)^k$ | generalizes $\chi^2$ ($k{=}2$) |
| Pearson–Vajda $|\chi|^k$ | $|t-1|^k$ | generalizes TV ($k{=}1$) |
| Jensen–Shannon | $-(t+1)\log\tfrac{1+t}{2}+t\log t$ | §3d (symmetrized; this generator gives $2\,\mathrm{JS}$ as defined there) |
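Except total variation, $f$-divergences are not metrics (Hellinger and Jensen–Shannon become metrics only after taking a square root). KL and reverse-KL are the
$\alpha\to\mp1$ limits of the $\alpha$-divergence; Pearson $\chi^2$ and TV (up to its factor $\tfrac12$) are the $k{=}2$ and $k{=}1$
cases of the Pearson–Vajda families, while Neyman $\chi^2$ is Pearson $\chi^2$ with the arguments swapped.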
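Because every entry in this section is the same template with a different generator, one function covers them all. The sketch below uses the convention $D_f(p\|q)=\sum_i p_i f(q_i/p_i)$ on the pmfs from the examples; the dictionary keys are hypothetical labels chosen for this sketch.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p||q) = sum_i p_i * f(q_i / p_i) for strictly positive pmfs."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * f(q / p)))

generators = {
    "kl":              lambda t: -np.log(t),
    "reverse_kl":      lambda t: t * np.log(t),
    "pearson":         lambda t: (t - 1.0) ** 2,
    "neyman":          lambda t: (1.0 - t) ** 2 / t,
    "sq_hellinger_x2": lambda t: (np.sqrt(t) - 1.0) ** 2,   # equals 2 * H^2
    "tv":              lambda t: 0.5 * np.abs(t - 1.0),
    "amari_0":         lambda t: 4.0 * (1.0 - np.sqrt(t)),  # alpha = 0
}

p, q = [0.5, 0.5], [0.9, 0.1]
values = {name: f_divergence(p, q, f) for name, f in generators.items()}
hellinger = float(np.sqrt(0.5 * values["sq_hellinger_x2"]))  # the metric H

for name, v in values.items():
    print(f"{name:16s} {v:.3f}")
print(f"{'hellinger':16s} {hellinger:.3f}")
```

The printed values should match the Example lines of this section (0.511, 0.368, 0.64, 1.778, 0.4, 0.422, and 0.325 for $H$). Swapping the arguments `p` and `q` turns each generator into its conjugate partner, for instance KL into reverse-KL.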
3b. Bregman divergences (1967)
Generated by a strictly convex potential $F$: the gap between $F$ and its tangent plane at $\theta_2$.
$$B_F(\theta_1\|\theta_2)=F(\theta_1)-F(\theta_2)-\langle\,\theta_1-\theta_2,\ \nabla F(\theta_2)\rangle .$$
Squared Euclidean ($F(\mathbf x)=\|\mathbf x\|^2$)
$$B_F=\|\theta_1-\theta_2\|^2$$
Application: centroid clustering / $k$-means.
Example: $\mathbf x=(1,2)$, $\mathbf y=(0,0)$: $1^2+2^2=\mathbf{5.0}$.
Kullback–Leibler (discrete), $F=-H$
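Application: on the probability simplex, KL is the unique divergence that is both an $f$-divergence and a Bregman divergence; its Bregman seed is the negative Shannon entropy $F(p)=\sum_i p_i\log p_i$.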
Example: same $p,q$ as in 3a $\Rightarrow B_F=\mathrm{KL}(p\|q)\approx\mathbf{0.511}$.
Itakura–Saito divergence ($F=-\sum_i\log x_i$)
$$\mathrm{IS}(p\|q)=\sum_i\Big(\frac{p_i}{q_i}-\log\frac{p_i}{q_i}-1\Big)$$
Application: scale-invariant spectral distortion; speech/audio coding, non-negative matrix factorization (NMF).
Example: $p=[1,2,3]$, $q=[1,1,4]$: $\mathrm{IS}\approx\mathbf{0.345}$.
Log-Det divergence (on SPD matrices)
$$D(\mathbf P\|\mathbf Q)=\langle\mathbf P,\mathbf Q^{-1}\rangle-\log\det(\mathbf P\mathbf Q^{-1})-\dim\mathbf P$$
Application: a Bregman divergence on positive-definite matrices; covariance comparison, metric learning (ITML).
Example: $\mathbf P=\mathrm{diag}(2,1)$, $\mathbf Q=I$: $\mathrm{tr}(\mathbf P)-\log\det\mathbf P-2\approx\mathbf{0.307}$.
Like $f$-divergences, every Bregman divergence is fixed by one convex seed $F$. The common
vector generators (Banerjee et al., 2005) are collected below.
| Bregman divergence | Seed $F(x)$ | $B_F(x\|y)$ | Domain |
| Squared Euclidean | $\|x\|^2$ | $\|x-y\|^2$ | $\mathbb R^d$ |
| Generalized KL (I-divergence) | $\sum_i x_i\log x_i$ | $\sum_i\big(x_i\log\tfrac{x_i}{y_i}-x_i+y_i\big)$ | $\mathbb R_{+}^d$ |
| Itakura–Saito | $-\sum_i\log x_i$ | $\sum_i\big(\tfrac{x_i}{y_i}-\log\tfrac{x_i}{y_i}-1\big)$ | $\mathbb R_{++}^d$ |
| Mahalanobis | $x^{\top}A\,x$ | $(x-y)^{\top}A\,(x-y)$ | $\mathbb R^d,\ A\succ0$ |
| Exponential | $\sum_i e^{x_i}$ | $\sum_i\big(e^{x_i}-e^{y_i}-(x_i-y_i)e^{y_i}\big)$ | $\mathbb R^d$ |
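The template $B_F(x\|y)=F(x)-F(y)-\langle x-y,\nabla F(y)\rangle$ translates directly into code once a seed and its gradient are supplied. A minimal sketch with hand-written gradients (the `seeds` dictionary and its keys are illustrative names):

```python
import numpy as np

def bregman(x, y, F, grad_F):
    """B_F(x||y): gap between F(x) and the tangent plane of F at y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(F(x) - F(y) - np.dot(x - y, grad_F(y)))

# Each entry is (seed F, gradient of F).
seeds = {
    "sq_euclidean":  (lambda v: np.dot(v, v),          lambda v: 2.0 * v),
    "neg_entropy":   (lambda v: np.sum(v * np.log(v)), lambda v: np.log(v) + 1.0),
    "itakura_saito": (lambda v: -np.sum(np.log(v)),    lambda v: -1.0 / v),
}

b_euc = bregman([1, 2], [0, 0], *seeds["sq_euclidean"])         # 5.0
b_kl = bregman([0.5, 0.5], [0.9, 0.1], *seeds["neg_entropy"])   # about 0.511
b_is = bregman([1, 2, 3], [1, 1, 4], *seeds["itakura_saito"])   # about 0.345
print(b_euc, b_kl, b_is)
```

With the negative-entropy seed the result is the generalized KL (I-divergence); it coincides with KL here because both arguments sum to one.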
3e. Matrix Bregman divergences
A Bregman divergence (BD) is the gap between a convex seed and its tangent plane.
Matrix BDs lift this from vectors to matrices: take a strictly convex
spectral seed $\phi$ (a function of the eigenvalues) and the trace inner product
$\langle X,Y\rangle=\mathrm{tr}(X^{\top}Y)$,
$$B_\phi(\mathbf X\|\mathbf Y)=\phi(\mathbf X)-\phi(\mathbf Y)-\big\langle \mathbf X-\mathbf Y,\ \nabla\phi(\mathbf Y)\big\rangle .$$
They measure dissimilarity between matrices (covariance, kernel, or density matrices), and
the von Neumann and Log-Det divergences found elsewhere in this note (§7 and §3b) are simply the matrix BDs of the entropy
and Burg seeds. Nock et al.,
“Mining Matrix Data with Bregman Matrix Divergences for Portfolio Selection”,
use them in a mean-divergence framework that generalizes Markowitz mean-variance:
the risk premium of an allocation $\mathbf A$ relative to the market $\boldsymbol\Theta$ is
$p_\phi(\mathbf A;\boldsymbol\Theta)=\tfrac{1}{a}\,B_\phi(\boldsymbol\Theta-a\mathbf A\,\|\,\boldsymbol\Theta)$.
| Matrix BD | Seed $\phi(\mathbf X)$ | $B_\phi(\mathbf X\|\mathbf Y)$ | Where used |
| Squared Frobenius | $\|\mathbf X\|_F^2=\mathrm{tr}(\mathbf X^2)$ | $\|\mathbf X-\mathbf Y\|_F^2$ | baseline matrix distance (matrix Mahalanobis) |
| Von Neumann | $\mathrm{tr}(\mathbf X\log\mathbf X-\mathbf X)$ | $\mathrm{tr}\big(\mathbf X(\log\mathbf X-\log\mathbf Y)-\mathbf X+\mathbf Y\big)$ | density/covariance matrices; quantum relative entropy |
| Log-Det / Burg | $-\log\det\mathbf X$ | $\mathrm{tr}(\mathbf X\mathbf Y^{-1})-\log\det(\mathbf X\mathbf Y^{-1})-n$ | covariance comparison, metric learning (ITML) |
| Bregman–Schatten-$p$ | $\|\mathbf X\|_p^p=\mathrm{tr}(\mathbf X^{p})\ \ (p>1,\ \mathbf X\succeq0)$ | $\mathrm{tr}\big(\mathbf X^{p}-p\,\mathbf X\mathbf Y^{p-1}+(p-1)\mathbf Y^{p}\big)$ | tunable spectral family |
The first three rows are the matrix analogues of the squared-Euclidean, generalized-KL, and
Itakura–Saito Bregman divergences from §3b. The Log-Det divergence in §3b and the von Neumann
divergence in §7 are exactly these matrix BDs, and their numeric examples are given in those
sections.
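The matrix template can be checked numerically in the same way as the vector one. The sketch below plugs the squared-Frobenius and Burg (Log-Det) seeds, with their standard gradients for symmetric matrices, into $B_\phi$; function and variable names are hypothetical.

```python
import numpy as np

def matrix_bregman(X, Y, phi, grad_phi):
    """B_phi(X||Y) using the trace inner product <A, B> = tr(A^T B)."""
    return float(phi(X) - phi(Y) - np.trace((X - Y).T @ grad_phi(Y)))

# Squared Frobenius seed: phi(X) = tr(X^T X), gradient 2X.
frob = (lambda X: np.trace(X.T @ X), lambda X: 2.0 * X)
# Burg / Log-Det seed on SPD matrices: phi(X) = -log det X, gradient -X^{-1}.
logdet = (lambda X: -np.linalg.slogdet(X)[1], lambda X: -np.linalg.inv(X))

P, Q = np.diag([2.0, 1.0]), np.eye(2)
b_frob = matrix_bregman(P, Q, *frob)       # equals ||P - Q||_F^2 = 1
b_logdet = matrix_bregman(P, Q, *logdet)   # about 0.307, the Log-Det example

print(b_frob, b_logdet)
```

Because `P` and `Q` are diagonal here, `b_logdet` also equals the Itakura–Saito divergence between their diagonals, $[2,1]$ and $[1,1]$.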
3c. Overlap / $\alpha$-power family
All built from the affinity integral $\int p^{\alpha}q^{\,1-\alpha}\,d\mu$ (same $p,q$ pmfs).
Bhattacharyya distance
$$d_B(p,q)=-\log\!\int \sqrt{p\,q}\,\,d\mu \quad\big(BC=\textstyle\int\sqrt{pq}\big)$$
Application: class-separability measure; object tracking; feature selection.
Example: coefficient $BC\approx0.894 \Rightarrow d_B=-\log(0.894)\approx\mathbf{0.112}$.
Chernoff divergence / information (1952)
$$C(p,q)=\max_{\alpha\in(0,1)} \Big(-\log\!\int p^{\alpha}q^{1-\alpha}d\mu\Big)$$
Application: optimal error exponent in binary hypothesis testing.
Example: grid-searching $\alpha\in(0,1)$ on the same $p,q$ gives $C\approx\mathbf{0.112}$ (near $\alpha=0.5$).
Rényi divergence (1961)
$$R_\alpha(p\|q)=\frac{1}{\alpha-1}\log\!\int p^{\alpha}q^{1-\alpha}d\mu \xrightarrow[\alpha\to1]{}\mathrm{KL}(p\|q)$$
Application: differential-privacy accounting, information theory, generalized entropies.
Example: at $\alpha=0.5$ on the same $p,q$: $R_{0.5}\approx\mathbf{0.223}$ nats.
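All three quantities in this section are read off one affinity function. A minimal sketch, where the grid resolution for the Chernoff search (999 points in $(0,1)$) is an assumption of this sketch:

```python
import numpy as np

def affinity(p, q, alpha):
    """sum_i p_i^alpha * q_i^(1 - alpha)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p**alpha * q**(1.0 - alpha)))

def bhattacharyya(p, q):
    return float(-np.log(affinity(p, q, 0.5)))

def chernoff(p, q, grid=np.linspace(0.001, 0.999, 999)):
    """Grid search for max over alpha of -log affinity; returns (value, argmax)."""
    vals = np.array([-np.log(affinity(p, q, a)) for a in grid])
    i = int(np.argmax(vals))
    return float(vals[i]), float(grid[i])

def renyi(p, q, alpha):
    return float(np.log(affinity(p, q, alpha)) / (alpha - 1.0))

p, q = [0.5, 0.5], [0.9, 0.1]
d_b = bhattacharyya(p, q)            # about 0.112
c_val, c_alpha = chernoff(p, q)      # about 0.112, maximizer slightly above 0.5
r_half = renyi(p, q, 0.5)            # about 0.223
print(d_b, c_val, c_alpha, r_half)
```

Two identities are visible in the output: $R_{0.5}=2\,d_B$, and the Chernoff information is never smaller than the Bhattacharyya distance because $\alpha=0.5$ is one point of the search grid.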
3d. Symmetrized & Jensen-type divergences
Jeffreys divergence (symmetric KL)
$$J(p,q)=\mathrm{KL}(p\|q)+\mathrm{KL}(q\|p)$$
Application: a symmetric KL when direction is arbitrary.
Example: $0.511+0.368\approx\mathbf{0.879}$ on the same $p,q$.
Jensen–Shannon divergence
$$\mathrm{JS}(p,q)=\tfrac12\mathrm{KL}\!\big(p\,\big\|\,m\big)+\tfrac12\mathrm{KL}\!\big(q\,\big\|\,m\big),\quad m=\tfrac{p+q}{2}$$
Application: symmetric, always finite, $\sqrt{\mathrm{JS}}$ is a metric; original GAN objective; comparing text/topic distributions.
Example: $\mathrm{JS}\approx\mathbf{0.102}$ nats $\Rightarrow \sqrt{\mathrm{JS}}\approx0.319$ (a metric).
Burbea–Rao / Jensen divergence
$$J_F(p,q)=\frac{F(p)+F(q)}{2}-F\!\Big(\frac{p+q}{2}\Big)$$
Application: the “Jensen gap” of a convex $F$; JS is the case $F=-H$ (negative Shannon entropy).
Example: with $F=-H$ on the same $p,q$, $J_F\approx\mathbf{0.102}$ — identical to JS above.
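The claim that Jensen–Shannon is the Burbea–Rao divergence of negative entropy is easy to confirm numerically. A minimal sketch on the same pmfs (helper names are illustrative):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def jeffreys(p, q):
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def burbea_rao(p, q, F):
    """Jensen gap of a convex F between p, q and their midpoint."""
    return 0.5 * (F(p) + F(q)) - F(0.5 * (p + q))

def neg_entropy(v):
    return float(np.sum(v * np.log(v)))

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
j = jeffreys(p, q)                   # about 0.879
js = jensen_shannon(p, q)            # about 0.102
br = burbea_rao(p, q, neg_entropy)   # same value as js
print(j, js, np.sqrt(js), br)
```

`js` and `br` agree to rounding error, and `js` stays below $\log 2\approx0.693$, the upper bound of Jensen–Shannon in nats.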
4. Entropies (the functionals divergences are built from)
Entropy measures the uncertainty/spread of a single distribution. Examples use $r=[0.5,0.25,0.25]$.
Shannon / Boltzmann–Gibbs entropy
$$H(p)=-\!\int p\log p\,\,d\mu$$
Application: information content, source coding, thermodynamics.
Example: $H(r)\approx\mathbf{1.040}$ nats $=\mathbf{1.5}$ bits.
Rényi entropy (1961)
$$H_\alpha(p)=\frac{1}{1-\alpha}\log\!\int p^{\alpha}\,d\mu$$
Application: additive generalization of Shannon entropy; collision/min-entropy in cryptography.
Example: at $\alpha=2$ on $r$: $H_2\approx\mathbf{0.981}$ nats (collision entropy).
Tsallis entropy — non-additive (1988)
$$T_\alpha(p)=\frac{1}{1-\alpha}\Big(\int p^{\alpha}\,d\mu-1\Big)$$
Application: non-extensive (long-range correlated) systems in statistical physics.
Example: at $\alpha=2$ on $r$: $T_2=1-\sum r_i^2=\mathbf{0.625}$.
Sharma–Mittal entropy (two-parameter unifier)
$$h_{\alpha,\beta}(p)=\frac{1}{1-\beta}\bigg(\Big(\int p^{\alpha}d\mu\Big)^{\!\frac{1-\beta}{1-\alpha}}-1\bigg)$$
Application: unifies Shannon, Rényi and Tsallis entropies as limiting cases.
Example: at $\alpha=2,\beta=3$ on $r$: $h_{2,3}\approx\mathbf{0.430}$.
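The four entropies above, written for a discrete pmf. This sketch also checks two of the unification claims: Sharma–Mittal equals Tsallis at $\beta=\alpha$ and approaches Rényi as $\beta\to1$ (approximated here with $\beta=1+10^{-6}$, an arbitrary small offset).

```python
import numpy as np

def shannon(p):
    p = np.asarray(p, float)
    return float(-np.sum(p * np.log(p)))

def renyi_entropy(p, alpha):
    return float(np.log(np.sum(np.asarray(p, float) ** alpha)) / (1.0 - alpha))

def tsallis(p, alpha):
    return float((np.sum(np.asarray(p, float) ** alpha) - 1.0) / (1.0 - alpha))

def sharma_mittal(p, alpha, beta):
    s = np.sum(np.asarray(p, float) ** alpha)
    return float((s ** ((1.0 - beta) / (1.0 - alpha)) - 1.0) / (1.0 - beta))

r = [0.5, 0.25, 0.25]
h = shannon(r)                      # about 1.040 nats
h_bits = h / np.log(2.0)            # 1.5 bits
h2 = renyi_entropy(r, 2)            # about 0.981
t2 = tsallis(r, 2)                  # 0.625
sm = sharma_mittal(r, 2, 3)         # about 0.430
print(h, h_bits, h2, t2, sm)
print(sharma_mittal(r, 2, 2), sharma_mittal(r, 2, 1 + 1e-6))  # Tsallis, ~Renyi
```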
5. Distances between sets & whole metric spaces
Hausdorff distance
$$d_{\mathrm{Haus}}(X,Y)=\max\Big\{\,\sup_{x\in X}\rho(x,Y),\ \sup_{y\in Y}\rho(X,y)\Big\}$$
Application: how far two sets are; shape/image matching, template comparison.
Example: $X=\{(0,0),(1,0),(0,1)\}$, $Y=\{(0,0),(1,1)\}$: $d_{\mathrm{Haus}}=\mathbf{1.0}$.
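For finite point sets the two suprema become maxima over a pairwise distance matrix. A minimal sketch, assuming the Euclidean ground distance $\rho$:

```python
import numpy as np

def hausdorff(X, Y):
    """Hausdorff distance between two finite point sets (Euclidean ground metric)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # D[i, j] = |x_i - y_j|
    x_to_Y = D.min(axis=1).max()   # farthest point of X from the set Y
    y_to_X = D.min(axis=0).max()   # farthest point of Y from the set X
    return float(max(x_to_Y, y_to_X))

X = [(0, 0), (1, 0), (0, 1)]
Y = [(0, 0), (1, 1)]
print(hausdorff(X, Y))   # 1.0
```

Both directed distances equal 1 in this example; in general they differ, which is why the outer maximum is needed for symmetry.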
Gromov–Hausdorff distance
$$d_{GH}(X,Y)=\inf_{\phi_X,\phi_Y}\rho_{\mathrm{Haus}}^{Z}\big(\phi_X(X),\phi_Y(Y)\big)$$
Application: compares whole metric spaces (shapes, graphs) up to isometry; manifold learning, 3-D shape matching.
Example: computing it exactly is NP-hard (infimum over all isometric embeddings); two isometric shapes give $d_{GH}=\mathbf{0}$, and it is usually approximated via the Gromov–Wasserstein relaxation.
6. Optimal transport & integral probability metrics (IPMs)
An IPM measures distance by the largest gap a test function from a class $\mathcal F$ can produce: $\gamma_{\mathcal F}(p,q)=\sup_{f\in\mathcal F}\big|\int f\,dp-\int f\,dq\big|$.
Wasserstein distance / Earth Mover’s Distance (EMD)
$$W_\alpha(p,q)=\Big(\inf_{\gamma\in\Gamma(p,q)}\int \rho(x,y)^{\alpha}\,d\gamma(x,y)\Big)^{1/\alpha}$$
Application: “minimum cost to morph $p$ into $q$”; WGANs, image retrieval, domain adaptation. EMD $=W_1$.
Example: 1-D samples $u=[0,1,2,3]$, $v=[1,2,3,4]$ (sort & average gaps): $W_1=\mathbf{1.0}$.
Maximum Mean Discrepancy (MMD)
$$\mathrm{MMD}(p,q)=\sup_{\|f\|_{\mathcal H}\le1}\big|\mathbb E_p f-\mathbb E_q f\big|$$
Application: IPM over an RKHS ball; kernel two-sample tests, training generative models.
Example: RBF kernel ($\gamma=0.5$), $X=\{0,1,2\}$, $Y=\{3,4,5\}$: $\mathrm{MMD}^2\approx\mathbf{1.063}$.
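The Example value is consistent with the biased (V-statistic) estimator of $\mathrm{MMD}^2$ and the kernel parametrization $k(x,y)=\exp(-\gamma(x-y)^2)$. Both are assumptions of this sketch, since the text does not spell them out.

```python
import numpy as np

def rbf(a, b, gamma):
    """Gram matrix of k(x, y) = exp(-gamma * (x - y)^2) for 1-D samples."""
    a = np.asarray(a, float).reshape(-1, 1)
    b = np.asarray(b, float).reshape(1, -1)
    return np.exp(-gamma * (a - b) ** 2)

def mmd2_biased(X, Y, gamma=0.5):
    """Biased estimate: mean k(X,X) + mean k(Y,Y) - 2 * mean k(X,Y)."""
    return float(rbf(X, X, gamma).mean()
                 + rbf(Y, Y, gamma).mean()
                 - 2.0 * rbf(X, Y, gamma).mean())

X, Y = [0, 1, 2], [3, 4, 5]
mmd2 = mmd2_biased(X, Y)    # about 1.063
print(mmd2)
```

The biased estimator keeps the diagonal kernel terms, so it is non-negative and is exactly zero when the two samples coincide.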
Stein discrepancy
Application: IPM built from a Stein operator; needs the score $\nabla\log p$ but not the normalizing constant; sampler/MCMC diagnostics, Stein variational gradient descent.
Example: not a single closed-form number here — it is a supremum over a Stein-RKHS ball; in practice estimated from samples and the model score.
Kolmogorov(–Smirnov) distance
$$K(p,q)=\sup_x \big|F_p(x)-F_q(x)\big|$$
Application: sup-distance between CDFs; classic non-parametric goodness-of-fit test.
Example: same $u,v$ as Wasserstein: $\sup_x|F_u-F_v|=\mathbf{0.25}$.
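The Wasserstein and Kolmogorov examples use the same two samples, so they are sketched together. The $W_1$ shortcut below (sort and average the gaps) is valid only for equal-size 1-D samples with uniform weights; function names are illustrative.

```python
import numpy as np

def wasserstein1_1d(u, v):
    """W_1 between equal-size 1-D samples: mean gap between sorted values."""
    u, v = np.sort(np.asarray(u, float)), np.sort(np.asarray(v, float))
    if u.shape != v.shape:
        raise ValueError("this shortcut needs samples of equal size")
    return float(np.mean(np.abs(u - v)))

def ks_distance(u, v):
    """sup_x |F_u(x) - F_v(x)| for empirical CDFs, checked at every sample point."""
    u, v = np.sort(np.asarray(u, float)), np.sort(np.asarray(v, float))
    grid = np.concatenate([u, v])
    F_u = np.searchsorted(u, grid, side="right") / u.size
    F_v = np.searchsorted(v, grid, side="right") / v.size
    return float(np.max(np.abs(F_u - F_v)))

u, v = [0, 1, 2, 3], [1, 2, 3, 4]
print(wasserstein1_1d(u, v))   # 1.0
print(ks_distance(u, v))       # 0.25
```

Shifting `v` further away keeps increasing $W_1$, while the Kolmogorov distance saturates at 1 once the samples no longer overlap.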
Lévy–Prokhorov distance
$$\mathrm{LP}_\rho(p,q)=\inf\big\{\varepsilon>0:\ p(A)\le q(A^{\varepsilon})+\varepsilon\ \ \forall A\in\mathcal B(\mathcal X)\big\}$$
Application: metrizes weak convergence of probability measures (convergence in distribution).
Example: an infimum over all Borel sets — no one-line value; for a point mass shifted by $\delta$, $\mathrm{LP}=\min(\delta,1)$ (e.g. $\delta=0.3\Rightarrow\mathbf{0.3}$).
7. Quantum geometry (density matrices)
Replace probability densities by a density matrix $\rho$; integrals become traces. With diagonal $\rho$ these reduce to the classical formulas on the eigenvalues.
Von Neumann entropy (1927)
$$S(\rho)=-\mathrm{Tr}(\rho\log\rho)$$
Application: quantum analogue of Shannon entropy; entanglement and quantum information.
Example: $\rho=\mathrm{diag}(0.7,0.3)$: $S=-(0.7\log0.7+0.3\log0.3)\approx\mathbf{0.611}$ nats.
Von Neumann (quantum relative) divergence
$$D(\mathbf P\|\mathbf Q)=\mathrm{Tr}\big(\mathbf P(\log\mathbf P-\log\mathbf Q)-\mathbf P+\mathbf Q\big)$$
Application: quantum analogue of KL divergence; distinguishability of quantum states.
Example: $\mathbf P=\mathrm{diag}(0.7,0.3)$, $\mathbf Q=\mathrm{diag}(0.5,0.5)$: $D\approx\mathbf{0.082}$ nats.
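Both quantum quantities reduce to eigenvalue computations. The sketch below handles real symmetric positive-definite matrices via `np.linalg.eigh` (a restriction chosen for brevity; complex Hermitian inputs would need conjugate transposes) and checks that the entropy is unchanged by a rotation of the basis.

```python
import numpy as np

def sym_logm(A):
    """Matrix log of a real symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.log(w)) @ V.T

def von_neumann_entropy(rho):
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-12]                    # convention: 0 * log 0 = 0
    return float(-np.sum(w * np.log(w)))

def von_neumann_divergence(P, Q):
    return float(np.trace(P @ (sym_logm(P) - sym_logm(Q)) - P + Q))

P = np.diag([0.7, 0.3])
Q = np.diag([0.5, 0.5])
S = von_neumann_entropy(P)              # about 0.611 nats
D = von_neumann_divergence(P, Q)        # about 0.082 nats

# Rotating the basis changes the matrix entries but not the spectrum.
c, s = np.cos(0.3), np.sin(0.3)
R = np.array([[c, -s], [s, c]])
S_rot = von_neumann_entropy(R @ P @ R.T)
print(S, D, S_rot)
```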
Companion notebook
Every Example above is reproduced in the companion notebook, which you can
read inline as a page: each
distance is shown as a rendered equation, then its minimal numpy code, then
the computed output — so you can check every number against the post. It depends only on
numpy, and the few genuinely intractable objects (Gromov–Hausdorff,
Lévy–Prokhorov, Fisher–Rao and Riemannian geodesics, Finsler metric, Stein discrepancy)
are described with a note rather than a misleading exact value. The raw
.ipynb is also available to download and run.
The big picture — how the families nest
- $L_1$, $L_2$ and $L_\infty$ are the $k=1$, $k=2$ and $k\to\infty$ members of the Minkowski $L_k$ family (with $d_\infty\le d_2\le d_1$ for any pair of points); Mahalanobis is the Quadratic distance with $Q=\Sigma^{-1}$.
- $f$-divergences contain KL, reverse-KL, $\chi^2$, Hellinger, total variation and the $\alpha$-divergence.
- Bregman divergences contain squared-Euclidean, KL (discrete), Itakura–Saito, squared Mahalanobis and Log-Det.
- KL is the unique divergence lying in both the $f$-divergence and Bregman families.
- Bhattacharyya, Chernoff, Rényi are all read off the affinity $\int p^\alpha q^{1-\alpha}d\mu$.
- Jensen–Shannon is a Burbea–Rao divergence; Jeffreys is symmetrized KL.
- Wasserstein, MMD, Stein, Kolmogorov are IPMs; EMD $=$ Wasserstein-1.
- Fisher–Rao is the Riemannian (geodesic) distance; locally it agrees with $\sqrt{2\,\mathrm{KL}}$.
- Von Neumann & Log-Det are the quantum / matrix counterparts of KL and Itakura–Saito, respectively: on diagonal matrices they reduce to those vector divergences.
- Matrix Bregman divergences (Frobenius, von Neumann, Log-Det, Schatten-$p$) lift the vector Bregman family to matrices via a spectral seed and the trace inner product.