Reference Note • Distances & Divergences

20 minute read

Published: June 2026

Taxonomy of Principal Distances & Divergences

Hamidreza Hashempoor • Institute for AI, University of Stuttgart



This note reorganizes the classical “distances and divergences” landscape (after Frank Nielsen’s taxonomy chart) into clean families. It comes in two views: a compact quick-reference table, and a detailed reference where every entry carries its defining equation, where it is used, and a small worked numeric example. The numbers in every Example line are reproduced by the companion Jupyter notebook.

How to read this sheet — metric vs. divergence

A metric (true distance) satisfies four rules: $d(p,q)\ge 0$; $d(p,q)=0 \Leftrightarrow p=q$; symmetry $d(p,q)=d(q,p)$; and the triangle inequality $d(p,r)\le d(p,q)+d(q,r)$. A divergence keeps only the first two (non-negativity and identity): it is generally asymmetric, $D(p\|q)\ne D(q\|p)$, and need not obey the triangle inequality. Divergences measure how one distribution differs from a reference; metrics measure mutual separation. The symbol $\|$ denotes the asymmetric “from / to” argument order.

Summary / Table of distances

Grouped families, defining equations, and typical applications. Generator conventions: $f$-divergence $D_f(p\|q)=\int p\,f(q/p)\,d\mu$; Bregman $B_F$ from a convex potential $F$. $\|$ marks asymmetric (divergence) arguments; argument-order conventions vary by source.

Group	Distance / divergence	Equation	Application — where to use
Vector & metric distances	Euclidean ($L_2$)	$d_2=\sqrt{\sum_i (p_i-q_i)^2}$	$k$-means, $k$-NN, least squares
	Manhattan ($L_1$)	$d_1=\sum_i |p_i-q_i|$	Grid/route cost; robust to outliers; LASSO
	Minkowski ($L_k$)	$d_k=\big(\sum_i |p_i-q_i|^k\big)^{1/k}$	Tunable; unifies $L_1,L_2,L_\infty$
	Chebyshev ($L_\infty$)	$d_\infty=\max_i |p_i-q_i|$	Max-coordinate gap; chessboard moves
	Mahalanobis	$\sqrt{(p-q)^{\top}\Sigma^{-1}(p-q)}$	Scale/correlation-aware; outliers, classification
	Quadratic	$\sqrt{(p-q)^{\top} Q\,(p-q)}$	Feature-weighted; colour histograms
	Hamming	$\big|\{i:p_i\neq q_i\}\big|$	Error-correcting codes, DNA, bit strings
Riemannian & information geometry	Fisher information	$\mathbf I(\theta)=\mathbb E[(\partial_\theta\ln p)(\partial_\theta\ln p)^{\top}]$	Natural metric on models; Cramér–Rao, natural gradient
	Fisher–Rao distance	$\rho=\min_\gamma\!\int_0^1\!\sqrt{\dot\gamma^{\top}\mathbf I\,\dot\gamma}\,dt$	Intrinsic geodesic distance between distributions
	Riemannian geodesic	$L=\int\!\sqrt{g_{ij}\,\dot x^i\dot x^j}\,dt$	Shortest path on curved manifolds; shape spaces
$f$-divergences (Csiszár / Ali–Silvey)	general template	$D_f(p\|q)=\int p\,f(q/p)\,d\mu$	Master family: pick the convex generator $f$
	Kullback–Leibler	$\int p\log\tfrac{p}{q}\,d\mu$	MLE, cross-entropy loss, variational inference
	Pearson $\chi^2$	$\int \tfrac{(q-p)^2}{p}\,d\mu$	Goodness-of-fit; local KL approximation
	Hellinger	$\sqrt{\tfrac12\!\int(\sqrt p-\sqrt q)^2\,d\mu}$	Symmetric bounded metric; robust statistics
	Total variation	$\tfrac12\!\int|p-q|\,d\mu$	Statistical distinguishability; coupling
	Amari $\alpha$-divergence	$f_\alpha(t)=\tfrac{4}{1-\alpha^2}\big(1-t^{\frac{1+\alpha}{2}}\big)$	One dial: KL ($\alpha{\to}{-}1$), reverse KL ($\alpha{\to}1$), squared Hellinger ($\alpha{=}0$)
Bregman divergences	general template	$B_F(x\|y)=F(x)-F(y)-\langle x-y,\nabla F(y)\rangle$	Master family: pick the convex potential $F$
	Squared Euclidean	$\|x-y\|^2$	Centroid clustering ($k$-means)
	Itakura–Saito	$\sum_i\!\big(\tfrac{p_i}{q_i}-\log\tfrac{p_i}{q_i}-1\big)$	Audio/spectral distortion; NMF
	Log-Det	$\langle P,Q^{-1}\rangle-\log\det(PQ^{-1})-n$	SPD matrices; metric learning (ITML)
Overlap / $\alpha$-power family	Bhattacharyya	$-\log\!\int\!\sqrt{pq}\,\,d\mu$	Class separability; object tracking
	Chernoff information	$\max_{\alpha\in(0,1)}\,-\log\!\int p^{\alpha} q^{1-\alpha}d\mu$	Error exponent in hypothesis testing
	Rényi divergence	$\tfrac{1}{\alpha-1}\log\!\int p^{\alpha} q^{1-\alpha}d\mu$	Differential privacy; information theory
Symmetrized & Jensen-type	Jeffreys	$\mathrm{KL}(p\|q)+\mathrm{KL}(q\|p)$	Symmetric KL when direction is arbitrary
	Jensen–Shannon	$\tfrac12\mathrm{KL}(p\|m)+\tfrac12\mathrm{KL}(q\|m),\ m=\tfrac{p+q}{2}$	Bounded; $\sqrt{\cdot}$ is a metric; GANs
	Burbea–Rao / Jensen	$\tfrac{F(p)+F(q)}{2}-F\!\big(\tfrac{p+q}{2}\big)$	Jensen gap; JS is the Shannon-entropy case
Entropies (functionals behind divergences)	Shannon / Boltzmann	$H=-\!\int p\log p\,\,d\mu$	Information content, source coding
	Rényi	$H_\alpha=\tfrac{1}{1-\alpha}\log\!\int p^{\alpha}d\mu$	Min-/collision-entropy; cryptography
	Tsallis (non-additive)	$T_\alpha=\tfrac{1}{1-\alpha}\big(\!\int p^{\alpha}d\mu-1\big)$	Non-extensive (long-range) systems
	Von Neumann (quantum)	$S(\rho)=-\mathrm{Tr}(\rho\log\rho)$	Quantum entropy; entanglement
Set & metric-space distances	Hausdorff	$\max\{\sup_{x}\rho(x,Y),\ \sup_{y}\rho(X,y)\}$	Set/shape matching; image comparison
	Gromov–Hausdorff	$\inf_{\phi_X,\phi_Y}\rho_H^{Z}(\phi_X X,\phi_Y Y)$	Compare spaces up to isometry; manifold learning
Optimal transport & IPMs	Wasserstein / EMD	$\big(\inf_{\gamma\in\Gamma}\!\int\rho(x,y)^{\alpha}d\gamma\big)^{1/\alpha}$	Optimal transport; WGAN, retrieval; EMD$=W_1$
	Max Mean Discrepancy	$\sup_{\|f\|_{\mathcal H}\le1}|\mathbb E_p f-\mathbb E_q f|$	Kernel two-sample tests; generative models
	Kolmogorov–Smirnov	$\sup_x|F_p(x)-F_q(x)|$	Non-parametric goodness-of-fit
	Lévy–Prokhorov	$\inf\{\varepsilon: p(A)\le q(A^{\varepsilon})+\varepsilon\ \forall A\}$	Metrizes weak convergence (in distribution)
Quantum geometry	Von Neumann divergence	$\mathrm{Tr}\big(P(\log P-\log Q)-P+Q\big)$	Quantum KL; distinguishing quantum states

Nesting at a glance: $L_1$, $L_2$ and $L_\infty$ are the $k=1,2,\infty$ cases of Minkowski; Mahalanobis is Quadratic with $Q=\Sigma^{-1}$. On the probability simplex, KL is essentially the only divergence that is both an $f$-divergence and a Bregman divergence. Bhattacharyya/Chernoff/Rényi all come from the affinity $\int p^{\alpha}q^{1-\alpha}d\mu$. Wasserstein-1, MMD and KS are IPMs; Lévy–Prokhorov is grouped with them here but is not itself an IPM.

Detailed distances (with worked examples)

Each entry below gives the defining equation, an Application line, and a small Example with concrete numbers. The example numbers are exactly those printed by the companion notebook.

1. Metric distances on vectors (geometry of points)

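Classical distances between two points $\mathbf p,\mathbf q\in\mathbb R^n$ (or two strings). Most are true metrics; the exceptions are Minkowski with $k<1$, the quadratic form when $Q$ is only positive semidefinite (a pseudometric), and Dynamic Time Warping, which can be zero for distinct sequences and need not satisfy the triangle inequality.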

Euclidean distance, $L_2$ (Pythagoras)

$$d_2(\mathbf p,\mathbf q)=\sqrt{\textstyle\sum_i (p_i-q_i)^2}$$

Application: straight-line distance; $k$-means, nearest-neighbour search, least-squares regression.

Example: with $\mathbf p=(0,0)$, $\mathbf q=(3,4)$: $d_2=\sqrt{9+16}=\mathbf{5.0}$.

Manhattan / city-block distance, $L_1$

$$d_1(\mathbf p,\mathbf q)=\textstyle\sum_i |p_i-q_i|$$

Application: grid/route distance; robust to outliers; underlies LASSO sparsity.

Example: same $\mathbf p,\mathbf q$: $d_1=|3|+|4|=\mathbf{7.0}$.

Minkowski distance, $L_k$-norm

$$d_k(\mathbf p,\mathbf q)=\Big(\textstyle\sum_i |p_i-q_i|^k\Big)^{1/k}$$

Application: tunable family: $k{=}1$ Manhattan, $k{=}2$ Euclidean, $k{\to}\infty$ Chebyshev.

Example: same $\mathbf p,\mathbf q$, $k=3$: $(3^3+4^3)^{1/3}=91^{1/3}\approx\mathbf{4.498}$.

Chebyshev distance, $L_\infty$

$$d_\infty(\mathbf p,\mathbf q)=\max_i |p_i-q_i|$$

Application: maximum-coordinate gap; king moves on a chessboard.

Example: same $\mathbf p,\mathbf q$: $\max(3,4)=\mathbf{4.0}$.
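The four examples above are one formula with different exponents. A minimal sketch in plain Python (the function name `minkowski` is illustrative, not taken from the companion notebook) that treats `k = math.inf` as the Chebyshev limit:

```python
import math

def minkowski(p, q, k):
    """Minkowski L_k distance; k = math.inf returns the Chebyshev limit."""
    gaps = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(k):
        return max(gaps)
    return sum(g ** k for g in gaps) ** (1.0 / k)

p, q = (0.0, 0.0), (3.0, 4.0)
d1 = minkowski(p, q, 1)            # Manhattan
d2 = minkowski(p, q, 2)            # Euclidean
d3 = minkowski(p, q, 3)            # the k = 3 example
d_inf = minkowski(p, q, math.inf)  # Chebyshev
print(d1, d2, round(d3, 3), d_inf)
```

The values decrease from $d_1$ to $d_\infty$ as $k$ grows, which is the general behaviour of the $L_k$ family on a fixed pair of points.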

Quadratic (generalized) distance

$$d_Q(\mathbf p,\mathbf q)=\sqrt{(\mathbf p-\mathbf q)^{\top} Q\,(\mathbf p-\mathbf q)}\,,\quad Q \succeq 0$$

Application: feature-weighted / cross-bin distance, e.g. comparing colour histograms.

Example: $\mathbf p=(0,0)$, $\mathbf q=(3,4)$, $Q=\big[\begin{smallmatrix}2&0.5\\0.5&1\end{smallmatrix}\big]$: $d_Q\approx\mathbf{6.782}$.

Mahalanobis metric (1936)

$$d_\Sigma(\mathbf p,\mathbf q)=\sqrt{(\mathbf p-\mathbf q)^{\top} \Sigma^{-1}(\mathbf p-\mathbf q)}$$

Application: scale- and correlation-aware distance ($Q=\Sigma^{-1}$); outlier detection, classification, metric learning.

Example: $\mathbf p=(2,1)$, $\mathbf q=(0,0)$, $\Sigma=\big[\begin{smallmatrix}2&0.5\\0.5&1\end{smallmatrix}\big]$: $d_\Sigma\approx\mathbf{1.512}$.
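Both the Quadratic and the Mahalanobis examples are the same quadratic form with a different matrix. A small sketch, assuming 2×2 matrices stored as nested lists (the helper names `quad_form_distance` and `inverse_2x2` are illustrative):

```python
import math

def quad_form_distance(p, q, M):
    """sqrt((p - q)^T M (p - q)) for a small dense matrix M (nested lists)."""
    d = [a - b for a, b in zip(p, q)]
    n = len(d)
    Md = [sum(M[i][j] * d[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(d[i] * Md[i] for i in range(n)))

def inverse_2x2(M):
    """Closed-form inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    (a, b), (c, e) = M
    det = a * e - b * c
    return [[e / det, -b / det], [-c / det, a / det]]

S = [[2.0, 0.5], [0.5, 1.0]]
d_Q = quad_form_distance((0, 0), (3, 4), S)                   # Quadratic, Q = S
d_Sigma = quad_form_distance((2, 1), (0, 0), inverse_2x2(S))  # Mahalanobis, Q = S^{-1}
print(round(d_Q, 3), round(d_Sigma, 3))
```

The only difference between the two calls is whether the matrix or its inverse is passed in, which is the sense in which Mahalanobis is the Quadratic distance with $Q=\Sigma^{-1}$.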

Hamming distance

$$d_H(\mathbf p,\mathbf q)=\big|\{\,i : p_i \ne q_i\,\}\big|$$

Application: count of differing symbols; error-correcting codes, DNA comparison, bit strings.

Example: "10110" vs "10011" differ in positions 3 and 5 $\Rightarrow d_H=\mathbf{2}$.

String / time-series distances

Application: edit-/alignment-based dissimilarities — Levenshtein (edit distance) and Dynamic Time Warping; spell-checking, bioinformatics, speech and gesture recognition.

Example: Levenshtein("kitten", "sitting") $=\mathbf{3}$; DTW of $[1,2,3]$ vs $[1,2,2,3]$ $=\mathbf{0.0}$.
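The three sequence distances above can be sketched in a few lines of plain Python. This is a textbook dynamic-programming version, assuming unit edit costs for Levenshtein and an absolute-difference local cost for DTW (the note does not state which local cost it uses):

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]

def dtw(x, y):
    """Dynamic Time Warping cost with |x_i - y_j| as the local cost (assumed)."""
    inf = float("inf")
    D = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[-1][-1]

d_hamming = hamming("10110", "10011")
d_edit = levenshtein("kitten", "sitting")
d_dtw = dtw([1, 2, 3], [1, 2, 2, 3])
print(d_hamming, d_edit, d_dtw)
```

The DTW cost of zero between two different sequences shows why DTW is a dissimilarity rather than a metric.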

2. Riemannian & information geometry (curved manifolds)

Here “distance” becomes the length of the shortest path (geodesic) on a curved space. For statistical models, the natural curvature comes from the Fisher information.

Riemannian metric tensor & geodesic length

$$ds^2=g_{ij}\,dx^i\,dx^j,\qquad L(\gamma)=\int \sqrt{g_{ij}\,\dot x^i \dot x^j}\;dt$$

Application: shortest paths on curved surfaces/manifolds; foundation of general relativity and shape spaces.

Example: on a unit sphere the geodesic between two points at angular separation $\theta$ is the great-circle arc, $L=\theta$ (e.g. orthogonal points $\Rightarrow L=\pi/2\approx\mathbf{1.571}$). No closed-form snippet for general $g_{ij}$.

Finsler metric tensor

$$g_{ij}(x,y)=\tfrac12\,\frac{\partial^2 F^2(x,y)}{\partial y^i \partial y^j}$$

Application: generalizes Riemannian geometry to direction-dependent norms (anisotropic costs).

Example: a Randers metric $F=\sqrt{g_{ij}y^iy^j}+b_iy^i$ makes “uphill” and “downhill” travel cost differently — no single scalar value; computed by solving the geodesic flow.

Fisher information matrix

$$\mathbf I(\theta)=\mathbb E\!\left[\Big(\tfrac{\partial}{\partial\theta}\ln p(X\mid\theta)\Big)\!\Big(\tfrac{\partial}{\partial\theta}\ln p(X\mid\theta)\Big)^{\top}\right]$$

Application: the “local entropy” / natural metric on a statistical model; Cramér–Rao bound, natural-gradient descent.

Example: for $\mathcal N(\mu,\sigma^2)$ with $\sigma=2$, $\mathbf I(\mu)=1/\sigma^2=\mathbf{0.25}$ (Monte-Carlo estimate $\approx0.251$).
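A Monte-Carlo estimate of this kind can be sketched as follows. The sample size, the seed and the choice $\mu=0$ are assumptions made here for illustration, so the estimate will be close to, but not identical with, the $0.251$ quoted above:

```python
import random

def fisher_info_mc(mu, sigma, n_samples=100_000, seed=0):
    """Estimate I(mu) = E[(d/dmu log p(X | mu))^2] for N(mu, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu, sigma)
        score = (x - mu) / sigma ** 2  # derivative of the Gaussian log-density in mu
        total += score * score
    return total / n_samples

estimate = fisher_info_mc(mu=0.0, sigma=2.0)
exact = 1.0 / 2.0 ** 2
print(round(estimate, 3), exact)
```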

Fisher–Rao distance

$$\rho_{FR}(p,q)=\min_{\gamma}\int_0^1 \sqrt{\dot\gamma(t)^{\top} \mathbf I(\theta)\,\dot\gamma(t)}\;dt$$

Application: intrinsic geodesic distance between distributions; the Riemannian “gold standard” on statistical manifolds.

Example: no simple closed form in general, but locally $\rho_{FR}(p,q)\approx\sqrt{2\,\mathrm{KL}(p\|q)}$; for the $p,q$ used below, $\sqrt{2\cdot0.511}\approx\mathbf{1.011}$.
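For categorical distributions specifically, the Fisher–Rao distance is known in closed form, $2\arccos\sum_i\sqrt{p_iq_i}$. That formula is not stated in this note and is brought in here as an outside fact; the sketch compares it with the local approximation $\sqrt{2\,\mathrm{KL}}$ on the pmfs $p=[0.5,0.5]$, $q=[0.9,0.1]$:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats for strictly positive pmfs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def fisher_rao_categorical(p, q):
    """Closed form on the probability simplex: 2 * arccos(sum_i sqrt(p_i q_i))."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, bc))  # clamp guards against rounding above 1

p, q = [0.5, 0.5], [0.9, 0.1]
rho_exact = fisher_rao_categorical(p, q)
rho_local = math.sqrt(2.0 * kl(p, q))
print(round(rho_exact, 3), round(rho_local, 3))
```

The two numbers differ (about $0.927$ versus $1.011$) because $p$ and $q$ are far apart; the $\sqrt{2\,\mathrm{KL}}$ rule is only a second-order approximation for nearby distributions.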

3a. $f$-divergences (Ali–Silvey 1966; Csiszár 1967)

A single template generates a zoo of divergences via a convex generator $f$ with $f(1)=0$:

$$D_f(p\|q)=\int p\,f\!\Big(\tfrac{q}{p}\Big)\,d\mu .$$

Examples use the pmfs $p=[0.5,0.5]$, $q=[0.9,0.1]$.

Kullback–Leibler divergence / relative entropy ($f(t)=-\log t$)

$$\mathrm{KL}(p\|q)=\int p\,\log\frac{p}{q}\,d\mu=\mathbb E_p\!\Big[\log\tfrac{p}{q}\Big]$$

Application: maximum-likelihood, cross-entropy loss, variational inference, model selection.

Example: $\mathrm{KL}(p\|q)=0.5\log\tfrac{0.5}{0.9}+0.5\log\tfrac{0.5}{0.1}\approx\mathbf{0.511}$ nats (and $\mathrm{KL}(q\|p)\approx0.368$ — asymmetric).

Reverse Kullback–Leibler ($f(t)=t\log t$)

$$\mathrm{KL}(q\|p)=\int q\,\log\frac{q}{p}\,d\mu$$

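Application: the other KL direction; minimizing it over $q$ is mode-seeking (zero-forcing). This is the direction used in variational inference, where the approximation $q$ concentrates inside the support of the target $p$; expectation propagation, by contrast, minimizes the forward direction.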

Example: same $p,q$: $0.9\log\tfrac{0.9}{0.5}+0.1\log\tfrac{0.1}{0.5}\approx\mathbf{0.368}$ nats.

Pearson $\chi^2$ divergence ($f(t)=(t-1)^2$)

$$\chi^2(p\|q)=\int \frac{(q-p)^2}{p}\,d\mu$$

Application: goodness-of-fit testing; local (second-order) approximation to KL.

Example: $\tfrac{(0.4)^2}{0.5}+\tfrac{(-0.4)^2}{0.5}=\mathbf{0.64}$.

Neyman $\chi^2$ divergence ($f(t)=(1-t)^2/t$)

$$\chi^2_N(p\|q)=\int \frac{(p-q)^2}{q}\,d\mu$$

Application: the “reverse” Pearson $\chi^2$ (roles of $p,q$ swapped); goodness-of-fit and importance-sampling variance diagnostics.

Example: same $p,q$: $\tfrac{(0.4)^2}{0.9}+\tfrac{(0.4)^2}{0.1}\approx\mathbf{1.778}$.

Hellinger distance ($f(t)=(\sqrt t-1)^2$)

$$H(p,q)=\sqrt{\tfrac12\int\!\big(\sqrt{p}-\sqrt{q}\big)^2 d\mu}$$

Application: a symmetric, bounded true metric between densities; robust statistics, density estimation.

Example: $H(p,q)\approx\mathbf{0.325}$ (bounded in $[0,1]$).

Total variation distance ($f(t)=\tfrac12|t-1|$)

$$\mathrm{TV}(p,q)=\tfrac12\int |p-q|\,d\mu$$

Application: the largest gap in probability that $p$ and $q$ can assign to any single event, $\sup_A|p(A)-q(A)|$; coupling arguments, mixing times.

Example: $\tfrac12(|0.5-0.9|+|0.5-0.1|)=\mathbf{0.4}$.

Amari $\alpha$-divergence (1985)

$$f_\alpha(t)=\frac{4}{1-\alpha^2}\Big(1-t^{\frac{1+\alpha}{2}}\Big),\ \ -1<\alpha<1$$

Application: one dial spanning KL ($\alpha\to-1$), reverse-KL ($\alpha\to1$) and, up to a factor of 4, squared Hellinger ($\alpha=0$) in this note's $t=q/p$ convention; core of information geometry.

Example: at $\alpha=0$, $D=4\big(1-\int\sqrt{pq}\big)\approx\mathbf{0.422}$ on the same $p,q$.

Each $f$-divergence is fixed by its convex generator $f$ (with $f(1)=0$). The table below collects the common ones, after Nielsen & Nock, arXiv:1309.3029, Table 1. We write them in this post's convention $D_f(p\|q)=\int p\,f(q/p)\,d\mu$, so the argument is $t=q/p$ (some sources use $t=p/q$, which swaps a generator with its conjugate $f^\ast(t)=t\,f(1/t)$).

$f$-divergence	Generator $f(t)$, $\ f(1)=0,\ t=q/p$	In this post
Kullback–Leibler	$-\log t$	§3a above
Reverse KL	$t\log t$	§3a above
Pearson $\chi^2$	$(t-1)^2$	§3a above
Neyman $\chi^2$	$(1-t)^2/t$	§3a above
Squared Hellinger	$(\sqrt t-1)^2$	§3a (Hellinger)
Total variation	$\tfrac12|t-1|$	§3a above
Amari $\alpha$-divergence	$\tfrac{4}{1-\alpha^2}\big(1-t^{\frac{1+\alpha}{2}}\big)$	§3a above
Pearson–Vajda $\chi^k$	$(t-1)^k$	generalizes $\chi^2$ ($k{=}2$)
Pearson–Vajda $|\chi|^k$	$|t-1|^k$	generalizes TV ($k{=}1$)
Jensen–Shannon	$-(t+1)\log\tfrac{1+t}{2}+t\log t$	§3d (symmetrized)

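Apart from total variation, the $f$-divergences themselves are not metrics, although the square roots of some (squared Hellinger, Jensen–Shannon) are. KL and reverse-KL are the $\alpha\to\mp1$ limits of the $\alpha$-divergence; Pearson $\chi^2$ and TV (up to its factor $\tfrac12$) are the $k{=}2$ and $k{=}1$ cases of the Pearson–Vajda families, and Neyman $\chi^2$ is Pearson $\chi^2$ with $p$ and $q$ swapped.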
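Because every entry in §3a shares the template $D_f(p\|q)=\sum_i p_i\,f(q_i/p_i)$, one function plus a dictionary of generators reproduces the worked examples. A minimal sketch for strictly positive pmfs (the names `f_divergence` and `generators` are illustrative):

```python
import math

def f_divergence(p, q, f):
    """D_f(p||q) = sum_i p_i f(q_i / p_i), i.e. the t = q/p convention."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

generators = {
    "KL": lambda t: -math.log(t),
    "reverse KL": lambda t: t * math.log(t),
    "Pearson chi2": lambda t: (t - 1) ** 2,
    "Neyman chi2": lambda t: (1 - t) ** 2 / t,
    "total variation": lambda t: 0.5 * abs(t - 1),
    "Amari alpha=0": lambda t: 4 * (1 - math.sqrt(t)),
}

p, q = [0.5, 0.5], [0.9, 0.1]
values = {name: f_divergence(p, q, f) for name, f in generators.items()}

# The Hellinger distance is the square root of half the (sqrt(t) - 1)^2 divergence.
hellinger = math.sqrt(0.5 * f_divergence(p, q, lambda t: (math.sqrt(t) - 1) ** 2))

for name, value in values.items():
    print(f"{name:16s} {value:.3f}")
print(f"{'Hellinger':16s} {hellinger:.3f}")
```

Swapping to the $t=p/q$ convention would amount to replacing each generator $f$ by $t\,f(1/t)$, which exchanges KL with reverse KL and Pearson with Neyman.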

3b. Bregman divergences (1967)

Generated by a strictly convex potential $F$: the gap between $F$ and its tangent plane at $\theta_2$.

$$B_F(\theta_1\|\theta_2)=F(\theta_1)-F(\theta_2)-\langle\,\theta_1-\theta_2,\ \nabla F(\theta_2)\rangle .$$

Squared Euclidean ($F(\mathbf x)=\|\mathbf x\|^2$)

$$B_F=\|\theta_1-\theta_2\|^2$$

Application: centroid clustering / $k$-means.

Example: $\mathbf x=(1,2)$, $\mathbf y=(0,0)$: $1^2+2^2=\mathbf{5.0}$.

Kullback–Leibler (discrete), $F=-H$

Application: on the probability simplex, KL is essentially the only divergence that is both an $f$-divergence and a Bregman divergence.

Example: same $p,q$ as in 3a $\Rightarrow B_F=\mathrm{KL}(p\|q)\approx\mathbf{0.511}$.

Itakura–Saito divergence ($F=-\sum_i\log x_i$)

$$\mathrm{IS}(p\|q)=\sum_i\Big(\frac{p_i}{q_i}-\log\frac{p_i}{q_i}-1\Big)$$

Application: scale-invariant spectral distortion; speech/audio coding, non-negative matrix factorization (NMF).

Example: $p=[1,2,3]$, $q=[1,1,4]$: $\mathrm{IS}\approx\mathbf{0.345}$.

Log-Det divergence (on SPD matrices)

$$D(\mathbf P\|\mathbf Q)=\langle\mathbf P,\mathbf Q^{-1}\rangle-\log\det(\mathbf P\mathbf Q^{-1})-\dim\mathbf P$$

Application: a Bregman divergence on positive-definite matrices; covariance comparison, metric learning (ITML).

Example: $\mathbf P=\mathrm{diag}(2,1)$, $\mathbf Q=I$: $\mathrm{tr}(\mathbf P)-\log\det\mathbf P-2\approx\mathbf{0.307}$.

Like $f$-divergences, every Bregman divergence is fixed by one convex seed $F$. The common vector generators (Banerjee et al., 2005) are collected below.

Bregman divergence	Seed $F(x)$	$B_F(x\|y)$	Domain
Squared Euclidean	$\|x\|^2$	$\|x-y\|^2$	$\mathbb R^d$
Generalized KL (I-divergence)	$\sum_i x_i\log x_i$	$\sum_i\big(x_i\log\tfrac{x_i}{y_i}-x_i+y_i\big)$	$\mathbb R_{+}^d$
Itakura–Saito	$-\sum_i\log x_i$	$\sum_i\big(\tfrac{x_i}{y_i}-\log\tfrac{x_i}{y_i}-1\big)$	$\mathbb R_{++}^d$
Mahalanobis	$x^{\top}A\,x$	$(x-y)^{\top}A\,(x-y)$	$\mathbb R^d,\ A\succ0$
Exponential	$\sum_i e^{x_i}$	$\sum_i\big(e^{x_i}-e^{y_i}-(x_i-y_i)e^{y_i}\big)$	$\mathbb R^d$
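The Bregman template needs only a seed $F$ and its gradient. A minimal sketch (function and variable names are illustrative) that recovers three of the §3b examples from the one definition:

```python
import math

def bregman(x, y, F, grad_F):
    """B_F(x||y) = F(x) - F(y) - <x - y, grad F(y)>."""
    g = grad_F(y)
    return F(x) - F(y) - sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))

# Squared Euclidean: F(x) = ||x||^2
sq_euclid = bregman([1.0, 2.0], [0.0, 0.0],
                    lambda v: sum(t * t for t in v),
                    lambda v: [2 * t for t in v])

# Generalized KL: F(x) = sum x log x; it equals KL when both inputs sum to one
kl_bregman = bregman([0.5, 0.5], [0.9, 0.1],
                     lambda v: sum(t * math.log(t) for t in v),
                     lambda v: [math.log(t) + 1 for t in v])

# Itakura-Saito: F(x) = -sum log x
itakura_saito = bregman([1.0, 2.0, 3.0], [1.0, 1.0, 4.0],
                        lambda v: -sum(math.log(t) for t in v),
                        lambda v: [-1.0 / t for t in v])

print(sq_euclid, round(kl_bregman, 3), round(itakura_saito, 3))
```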

3e. Matrix Bregman divergences

A Bregman divergence (BD) is the gap between a convex seed and its tangent plane. Matrix BDs lift this from vectors to matrices: take a strictly convex spectral seed $\phi$ (a function of the eigenvalues) and the trace inner product $\langle X,Y\rangle=\mathrm{tr}(X^{\top}Y)$,

$$B_\phi(\mathbf X\|\mathbf Y)=\phi(\mathbf X)-\phi(\mathbf Y)-\big\langle \mathbf X-\mathbf Y,\ \nabla\phi(\mathbf Y)\big\rangle .$$

They measure dissimilarity between matrices (covariance, kernel, or density matrices), and the von Neumann and Log-Det divergences listed elsewhere in this note are simply the matrix BDs of the entropy and Burg seeds. Nock et al., “Mining Matrix Data with Bregman Matrix Divergences for Portfolio Selection”, use them in a mean-divergence framework that generalizes Markowitz mean-variance: the risk premium of an allocation $\mathbf A$ relative to the market $\boldsymbol\Theta$ is $p_\phi(\mathbf A;\boldsymbol\Theta)=\tfrac{1}{a}\,B_\phi(\boldsymbol\Theta-a\mathbf A\,\|\,\boldsymbol\Theta)$.

Matrix BD	Seed $\phi(\mathbf X)$	$B_\phi(\mathbf X\|\mathbf Y)$	Where used
Squared Frobenius	$\|\mathbf X\|_F^2=\mathrm{tr}(\mathbf X^2)$	$\|\mathbf X-\mathbf Y\|_F^2$	baseline matrix distance (matrix Mahalanobis)
Von Neumann	$\mathrm{tr}(\mathbf X\log\mathbf X-\mathbf X)$	$\mathrm{tr}\big(\mathbf X(\log\mathbf X-\log\mathbf Y)-\mathbf X+\mathbf Y\big)$	density/covariance matrices; quantum relative entropy
Log-Det / Burg	$-\log\det\mathbf X$	$\mathrm{tr}(\mathbf X\mathbf Y^{-1})-\log\det(\mathbf X\mathbf Y^{-1})-n$	covariance comparison, metric learning (ITML)
Bregman–Schatten-$p$	$\|\mathbf X\|_p^p=\mathrm{tr}(\mathbf X^p)\ \ (p>1,\ \mathbf X\succeq0)$	$\mathrm{tr}\big(\mathbf X^{p}-p\,\mathbf X\mathbf Y^{p-1}+(p-1)\mathbf Y^p\big)$	tunable spectral family

These are the matrix analogues of the squared-Euclidean, generalized-KL, and Itakura–Saito divergences from §3b; the Log-Det and von Neumann divergences elsewhere in this post (§3b and §7) are exactly these matrix BDs. Numeric examples for the Log-Det and von Neumann cases are in §3b and §7, respectively.

3c. Overlap / $\alpha$-power family

All built from the affinity integral $\int p^{\alpha}q^{\,1-\alpha}\,d\mu$ (same $p,q$ pmfs).

Bhattacharyya distance

$$d_B(p,q)=-\log\!\int \sqrt{p\,q}\,\,d\mu \quad\big(BC=\textstyle\int\sqrt{pq}\big)$$

Application: class-separability measure; object tracking; feature selection.

Example: coefficient $BC\approx0.894 \Rightarrow d_B=-\log(0.894)\approx\mathbf{0.112}$.

Chernoff divergence / information (1952)

$$C(p,q)=\max_{\alpha\in(0,1)} \Big(-\log\!\int p^{\alpha}q^{1-\alpha}d\mu\Big)$$

Application: optimal error exponent in binary hypothesis testing.

Example: grid-searching $\alpha\in(0,1)$ on the same $p,q$ gives $C\approx\mathbf{0.112}$ (near $\alpha=0.5$).

Rényi divergence (1961)

$$R_\alpha(p\|q)=\frac{1}{\alpha-1}\log\!\int p^{\alpha}q^{1-\alpha}d\mu \xrightarrow[\alpha\to1]{}\mathrm{KL}(p\|q)$$

Application: differential-privacy accounting, information theory, generalized entropies.

Example: at $\alpha=0.5$ on the same $p,q$: $R_{0.5}\approx\mathbf{0.223}$ nats.
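All three quantities in this section are read off one affinity function. A sketch in plain Python, where the 999-point grid for the Chernoff maximization is an arbitrary choice made here and not necessarily the notebook's:

```python
import math

def affinity(p, q, alpha):
    """sum_i p_i^alpha * q_i^(1 - alpha)."""
    return sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))

p, q = [0.5, 0.5], [0.9, 0.1]

bhattacharyya = -math.log(affinity(p, q, 0.5))
renyi_half = math.log(affinity(p, q, 0.5)) / (0.5 - 1)

# Chernoff information: grid search over alpha in (0, 1)
grid = [k / 1000 for k in range(1, 1000)]
alpha_star = max(grid, key=lambda a: -math.log(affinity(p, q, a)))
chernoff = -math.log(affinity(p, q, alpha_star))

print(round(bhattacharyya, 3), round(renyi_half, 3),
      round(chernoff, 3), alpha_star)
```

At $\alpha=0.5$ the Rényi divergence is exactly twice the Bhattacharyya distance, and the Chernoff information can never be smaller than the Bhattacharyya distance because $\alpha=0.5$ is one of the candidates in the maximization.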

3d. Symmetrized & Jensen-type divergences

Jeffreys divergence (symmetric KL)

$$J(p,q)=\mathrm{KL}(p\|q)+\mathrm{KL}(q\|p)$$

Application: a symmetric KL when direction is arbitrary.

Example: $0.511+0.368\approx\mathbf{0.879}$ on the same $p,q$.

Jensen–Shannon divergence

$$\mathrm{JS}(p,q)=\tfrac12\mathrm{KL}\!\big(p\,\big\|\,m\big)+\tfrac12\mathrm{KL}\!\big(q\,\big\|\,m\big),\quad m=\tfrac{p+q}{2}$$

Application: symmetric, always finite, $\sqrt{\mathrm{JS}}$ is a metric; original GAN objective; comparing text/topic distributions.

Example: $\mathrm{JS}\approx\mathbf{0.102}$ nats $\Rightarrow \sqrt{\mathrm{JS}}\approx0.319$ (a metric).

Burbea–Rao / Jensen divergence

$$J_F(p,q)=\frac{F(p)+F(q)}{2}-F\!\Big(\frac{p+q}{2}\Big)$$

Application: the “Jensen gap” of a convex $F$; JS is the case $F=-H$ (negative Shannon entropy).

Example: with $F=-H$ on the same $p,q$, $J_F\approx\mathbf{0.102}$ — identical to JS above.
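The claim that Jensen–Shannon is the Burbea–Rao divergence of negative Shannon entropy can be checked numerically. A short sketch on the same pmfs (helper names are illustrative):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in nats for strictly positive pmfs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def neg_entropy(p):
    """F = -H, the convex seed behind Jensen-Shannon."""
    return sum(pi * math.log(pi) for pi in p)

p, q = [0.5, 0.5], [0.9, 0.1]
m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

jeffreys = kl(p, q) + kl(q, p)
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
burbea_rao = 0.5 * (neg_entropy(p) + neg_entropy(q)) - neg_entropy(m)

print(round(jeffreys, 3), round(js, 3), round(burbea_rao, 3), round(math.sqrt(js), 3))
```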

4. Entropies (the functionals divergences are built from)

Entropy measures the uncertainty/spread of a single distribution. Examples use $r=[0.5,0.25,0.25]$.

Shannon / Boltzmann–Gibbs entropy

$$H(p)=-\!\int p\log p\,\,d\mu$$

Application: information content, source coding, thermodynamics.

Example: $H(r)\approx\mathbf{1.040}$ nats $=\mathbf{1.5}$ bits.

Rényi entropy (1961)

$$H_\alpha(p)=\frac{1}{1-\alpha}\log\!\int p^{\alpha}\,d\mu$$

Application: additive generalization of Shannon entropy; collision/min-entropy in cryptography.

Example: at $\alpha=2$ on $r$: $H_2\approx\mathbf{0.981}$ nats (collision entropy).

Tsallis entropy — non-additive (1988)

$$T_\alpha(p)=\frac{1}{1-\alpha}\Big(\int p^{\alpha}\,d\mu-1\Big)$$

Application: non-extensive (long-range correlated) systems in statistical physics.

Example: at $\alpha=2$ on $r$: $T_2=1-\sum r_i^2=\mathbf{0.625}$.

Sharma–Mittal entropy (two-parameter unifier)

$$h_{\alpha,\beta}(p)=\frac{1}{1-\beta}\bigg(\Big(\int p^{\alpha}d\mu\Big)^{\!\frac{1-\beta}{1-\alpha}}-1\bigg)$$

Application: unifies Shannon, Rényi and Tsallis entropies as limiting cases.

Example: at $\alpha=2,\beta=3$ on $r$: $h_{2,3}\approx\mathbf{0.430}$.
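The four entropies differ only in how they post-process the power sum $\sum_i p_i^{\alpha}$. A minimal sketch on $r=[0.5,0.25,0.25]$, valid for $\alpha\ne1$ and $\beta\ne1$ (the limiting cases are not handled here):

```python
import math

def shannon(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def power_sum(p, alpha):
    return sum(pi ** alpha for pi in p)

def renyi(p, alpha):
    return math.log(power_sum(p, alpha)) / (1 - alpha)

def tsallis(p, alpha):
    return (power_sum(p, alpha) - 1) / (1 - alpha)

def sharma_mittal(p, alpha, beta):
    return (power_sum(p, alpha) ** ((1 - beta) / (1 - alpha)) - 1) / (1 - beta)

r = [0.5, 0.25, 0.25]
H_nats = shannon(r)
H_bits = H_nats / math.log(2)
H2 = renyi(r, 2)
T2 = tsallis(r, 2)
h23 = sharma_mittal(r, 2, 3)
print(round(H_nats, 3), round(H_bits, 3), round(H2, 3), T2, round(h23, 3))
```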

5. Distances between sets & whole metric spaces

Hausdorff distance

$$d_{\mathrm{Haus}}(X,Y)=\max\Big\{\,\sup_{x\in X}\rho(x,Y),\ \sup_{y\in Y}\rho(X,y)\Big\}$$

Application: how far two sets are; shape/image matching, template comparison.

Example: $X=\{(0,0),(1,0),(0,1)\}$, $Y=\{(0,0),(1,1)\}$: $d_{\mathrm{Haus}}=\mathbf{1.0}$.
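For finite point sets the suprema become maxima, so the definition can be evaluated directly. A brute-force sketch using Euclidean $\rho$ (an assumption; the formula above allows any ground metric):

```python
import math

def hausdorff(X, Y):
    """Hausdorff distance between two finite point sets, Euclidean ground metric."""
    def directed(A, B):
        # farthest any point of A is from its nearest neighbour in B
        return max(min(math.dist(a, b) for b in B) for a in A)
    return max(directed(X, Y), directed(Y, X))

X = [(0, 0), (1, 0), (0, 1)]
Y = [(0, 0), (1, 1)]
d_haus = hausdorff(X, Y)
print(d_haus)
```

This costs $O(|X|\,|Y|)$ distance evaluations, which is fine for small sets like the example.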

Gromov–Hausdorff distance

$$d_{GH}(X,Y)=\inf_{\phi_X,\phi_Y}\rho_{\mathrm{Haus}}^{Z}\big(\phi_X(X),\phi_Y(Y)\big)$$

Application: compares whole metric spaces (shapes, graphs) up to isometry; manifold learning, 3-D shape matching.

Example: computing it exactly is NP-hard (infimum over all isometric embeddings); two isometric shapes give $d_{GH}=\mathbf{0}$, and it is usually approximated via the Gromov–Wasserstein relaxation.

6. Optimal transport & integral probability metrics (IPMs)

An IPM measures distance by the largest gap a test function from a class $\mathcal F$ can produce: $\gamma_{\mathcal F}(p,q)=\sup_{f\in\mathcal F}\big|\int f\,dp-\int f\,dq\big|$.

Wasserstein distance / Earth Mover’s Distance (EMD)

$$W_\alpha(p,q)=\Big(\inf_{\gamma\in\Gamma(p,q)}\int \rho(x,y)^{\alpha}\,d\gamma(x,y)\Big)^{1/\alpha}$$

Application: “minimum cost to morph $p$ into $q$”; WGANs, image retrieval, domain adaptation. EMD $=W_1$.

Example: 1-D samples $u=[0,1,2,3]$, $v=[1,2,3,4]$ (sort & average gaps): $W_1=\mathbf{1.0}$.
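The "sort & average gaps" recipe works because, in one dimension with equally weighted samples of the same size, the optimal transport plan matches order statistics. A sketch restricted to that special case:

```python
def wasserstein1_1d(u, v):
    """W_1 between two equal-size, equally weighted 1-D samples."""
    if len(u) != len(v):
        raise ValueError("this shortcut assumes samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

u = [0, 1, 2, 3]
v = [1, 2, 3, 4]
w1 = wasserstein1_1d(u, v)
print(w1)
```

Unequal sample sizes or weights need the general CDF-based formula or a linear-programming solver, which this sketch does not cover.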

Maximum Mean Discrepancy (MMD)

$$\mathrm{MMD}(p,q)=\sup_{\|f\|_{\mathcal H}\le1}\big|\mathbb E_p f-\mathbb E_q f\big|$$

Application: IPM over an RKHS ball; kernel two-sample tests, training generative models.

Example: RBF kernel ($\gamma=0.5$), $X=\{0,1,2\}$, $Y=\{3,4,5\}$: $\mathrm{MMD}^2\approx\mathbf{1.063}$.
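The value above can be reproduced with a few lines, under two assumptions made here: the RBF kernel is parameterised as $k(x,y)=e^{-\gamma(x-y)^2}$, and the biased (V-statistic) estimator that keeps the diagonal terms is used.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel, assumed parameterisation exp(-gamma * (x - y)^2)."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2_biased(X, Y, gamma):
    """Biased estimate of MMD^2: mean k(X,X) + mean k(Y,Y) - 2 mean k(X,Y)."""
    def mean_kernel(A, B):
        return sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2 * mean_kernel(X, Y)

X = [0, 1, 2]
Y = [3, 4, 5]
mmd2 = mmd2_biased(X, Y, gamma=0.5)
print(round(mmd2, 3))
```

An unbiased estimator that drops the diagonal terms would give a different number on samples this small.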

Stein discrepancy

Application: IPM built from a Stein operator; needs the score $\nabla\log p$ but not the normalizing constant; sampler/MCMC diagnostics, Stein variational gradient descent.

Example: not a single closed-form number here — it is a supremum over a Stein-RKHS ball; in practice estimated from samples and the model score.

Kolmogorov(–Smirnov) distance

$$K(p,q)=\sup_x \big|F_p(x)-F_q(x)\big|$$

Application: sup-distance between CDFs; classic non-parametric goodness-of-fit test.

Example: same $u,v$ as Wasserstein: $\sup_x|F_u-F_v|=\mathbf{0.25}$.
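For two samples, the supremum is attained at one of the observed points, since both empirical CDFs are right-continuous step functions. A direct sketch:

```python
def ks_distance(u, v):
    """sup_x |F_u(x) - F_v(x)| for the empirical CDFs of two samples."""
    def ecdf(sample, x):
        return sum(s <= x for s in sample) / len(sample)
    return max(abs(ecdf(u, x) - ecdf(v, x)) for x in set(u) | set(v))

u = [0, 1, 2, 3]
v = [1, 2, 3, 4]
ks = ks_distance(u, v)
print(ks)
```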

Lévy–Prokhorov distance

$$\mathrm{LP}_\rho(p,q)=\inf\big\{\varepsilon>0:\ p(A)\le q(A^{\varepsilon})+\varepsilon\ \ \forall A\in\mathcal B(\mathcal X)\big\}$$

Application: metrizes weak convergence of probability measures (convergence in distribution).

Example: an infimum over all Borel sets — no one-line value; for a point mass shifted by $\delta$, $\mathrm{LP}=\min(\delta,1)$ (e.g. $\delta=0.3\Rightarrow\mathbf{0.3}$).

7. Quantum geometry (density matrices)

Replace probability densities by a density matrix $\rho$; integrals become traces. With diagonal $\rho$ these reduce to the classical formulas on the eigenvalues.

Von Neumann entropy (1927)

$$S(\rho)=-\mathrm{Tr}(\rho\log\rho)$$

Application: quantum analogue of Shannon entropy; entanglement and quantum information.

Example: $\rho=\mathrm{diag}(0.7,0.3)$: $S=-(0.7\log0.7+0.3\log0.3)\approx\mathbf{0.611}$ nats.

Von Neumann (quantum relative) divergence

$$D(\mathbf P\|\mathbf Q)=\mathrm{Tr}\big(\mathbf P(\log\mathbf P-\log\mathbf Q)-\mathbf P+\mathbf Q\big)$$

Application: quantum analogue of KL divergence; distinguishability of quantum states.

Example: $\mathbf P=\mathrm{diag}(0.7,0.3)$, $\mathbf Q=\mathrm{diag}(0.5,0.5)$: $D\approx\mathbf{0.082}$ nats.
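For diagonal matrices both quantities reduce to sums over the eigenvalues, as noted at the start of this section. The sketch below covers only that commuting case; general density matrices would need an eigendecomposition and a matrix logarithm, which are not shown:

```python
import math

def von_neumann_entropy(eigs):
    """S(rho) = -Tr(rho log rho), computed from the eigenvalues of rho."""
    return -sum(lam * math.log(lam) for lam in eigs if lam > 0)

def von_neumann_divergence_diag(p_eigs, q_eigs):
    """Tr(P(log P - log Q) - P + Q) for diagonal P, Q with positive entries."""
    return sum(a * (math.log(a) - math.log(b)) - a + b
               for a, b in zip(p_eigs, q_eigs))

S = von_neumann_entropy([0.7, 0.3])
D = von_neumann_divergence_diag([0.7, 0.3], [0.5, 0.5])
print(round(S, 3), round(D, 3))
```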

Companion notebook

Every Example above is reproduced in the companion notebook, which you can read inline as a page: each distance is shown as a rendered equation, then its minimal numpy code, then the computed output — so you can check every number against the post. It depends only on numpy, and the few genuinely intractable objects (Gromov–Hausdorff, Lévy–Prokhorov, Fisher–Rao and Riemannian geodesics, Finsler metric, Stein discrepancy) are described with a note rather than a misleading exact value. The raw .ipynb is also available to download and run.


The big picture — how the families nest

$L_1$, $L_2$ and $L_\infty$ are all special cases ($k=1,2,\infty$) of the Minkowski $L_k$ family; Mahalanobis is the Quadratic distance with $Q=\Sigma^{-1}$.
$f$-divergences contain KL, reverse-KL, $\chi^2$, Hellinger, total variation and the $\alpha$-divergence.
Bregman divergences contain squared-Euclidean, KL (discrete), Itakura–Saito, Mahalanobis and Log-Det.
On the probability simplex, KL is essentially the unique divergence lying in both the $f$-divergence and Bregman families.
Bhattacharyya, Chernoff, Rényi are all read off the affinity $\int p^\alpha q^{1-\alpha}d\mu$.
Jensen–Shannon is a Burbea–Rao divergence; Jeffreys is symmetrized KL.
Wasserstein-1 ($=$ EMD), MMD, Stein and Kolmogorov are IPMs; $W_\alpha$ with $\alpha>1$ and Lévy–Prokhorov are not.
Fisher–Rao is the Riemannian (geodesic) distance; locally it agrees with $\sqrt{2\,\mathrm{KL}}$.
Von Neumann and Log-Det are the matrix counterparts of KL and Itakura–Saito, respectively; both are Bregman divergences.
Matrix Bregman divergences (Frobenius, von Neumann, Log-Det, Schatten-$p$) lift the vector Bregman family to matrices via a spectral seed and the trace inner product.