Rendered Notebook • Distances & Divergences

Distances & Divergences — Minimal Examples

Each distance/divergence is shown as a rendered equation, then its minimal numpy code, then the computed output. The numbers match the Example lines in the post. A few intractable objects are described in notes instead of code.

1. Vector & metric distances

Classical distances between two points $\mathbf p,\mathbf q\in\mathbb R^n$. All are true metrics.

Euclidean distance, $L_2$
$$d_2(\mathbf p,\mathbf q)=\sqrt{\textstyle\sum_i (p_i-q_i)^2}$$
In [ ]:
p = np.array([0.0, 0.0]); q = np.array([3.0, 4.0])
print("Euclidean L2:", np.sqrt(((p - q) ** 2).sum()))
Out:
Euclidean L2: 5.0
Manhattan / city-block distance, $L_1$
$$d_1(\mathbf p,\mathbf q)=\textstyle\sum_i |p_i-q_i|$$
In [ ]:
p = np.array([0.0, 0.0]); q = np.array([3.0, 4.0])
print("Manhattan L1:", np.abs(p - q).sum())
Out:
Manhattan L1: 7.0
Minkowski distance, $L_k$-norm
$$d_k(\mathbf p,\mathbf q)=\Big(\textstyle\sum_i |p_i-q_i|^k\Big)^{1/k}$$
In [ ]:
p = np.array([0.0, 0.0]); q = np.array([3.0, 4.0]); k = 3
print("Minkowski k=3:", (np.abs(p - q) ** k).sum() ** (1 / k))
Out:
Minkowski k=3: 4.497941445275415
Chebyshev distance, $L_\infty$
$$d_\infty(\mathbf p,\mathbf q)=\max_i |p_i-q_i|$$
In [ ]:
p = np.array([0.0, 0.0]); q = np.array([3.0, 4.0])
print("Chebyshev Linf:", np.abs(p - q).max())
Out:
Chebyshev Linf: 4.0
Quadratic (generalized) distance
$$d_Q(\mathbf p,\mathbf q)=\sqrt{(\mathbf p-\mathbf q)^{\top} Q\,(\mathbf p-\mathbf q)}\,,\quad Q \succeq 0$$
In [ ]:
p = np.array([0.0, 0.0]); q = np.array([3.0, 4.0])
Q = np.array([[2.0, 0.5], [0.5, 1.0]]); d = p - q
print("Quadratic:", np.sqrt(d @ Q @ d))
Out:
Quadratic: 6.782329983125268
Mahalanobis metric
$$d_\Sigma(\mathbf p,\mathbf q)=\sqrt{(\mathbf p-\mathbf q)^{\top} \Sigma^{-1}(\mathbf p-\mathbf q)}$$
In [ ]:
a = np.array([2.0, 1.0]); b = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]]); diff = a - b
print("Mahalanobis:", np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))
Out:
Mahalanobis: 1.5118578920369088
Hamming distance
$$d_H(\mathbf p,\mathbf q)=\big|\{\,i : p_i \ne q_i\,\}\big|$$
In [ ]:
s1, s2 = "10110", "10011"
print("Hamming:", sum(c1 != c2 for c1, c2 in zip(s1, s2)))
Out:
Hamming: 2
String / time-series distances (Levenshtein, DTW)
$$\text{edit / alignment cost between sequences}$$
In [ ]:
def levenshtein(s, t):
    m, n = len(s), len(t)
    dp = np.zeros((m + 1, n + 1), int)
    dp[:, 0] = np.arange(m + 1); dp[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i-1] == t[j-1] else 1
            dp[i, j] = min(dp[i-1, j] + 1, dp[i, j-1] + 1, dp[i-1, j-1] + cost)
    return dp[m, n]

def dtw(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf); D[0, 0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = abs(x[i-1] - y[j-1])
            D[i, j] = c + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    return D[n, m]

print("Levenshtein('kitten','sitting'):", levenshtein("kitten", "sitting"))
print("DTW([1,2,3],[1,2,2,3]):", dtw([1, 2, 3], [1, 2, 2, 3]))
Out:
Levenshtein('kitten','sitting'): 3
DTW([1,2,3],[1,2,2,3]): 0.0

2. Riemannian & information geometry

Distance becomes the length of the shortest path (geodesic) on a curved manifold; for statistical models the curvature is the Fisher information.

Fisher information matrix
$$\mathbf I(\theta)=\mathbb E\!\left[\big(\partial_\theta\ln p\big)\big(\partial_\theta\ln p\big)^{\top}\right]$$
In [ ]:
# Gaussian N(mu, sigma^2): I(mu) = 1 / sigma^2, verified by Monte Carlo.
sigma = 2.0
rng = np.random.default_rng(0)
x = rng.normal(0.0, sigma, size=200000)
score = x / sigma**2                       # d/dmu log p at mu = 0
print("Fisher info I(mu) analytic:", 1 / sigma**2)
print("Fisher info I(mu) numeric :", (score**2).mean())
Out:
Fisher info I(mu) analytic: 0.25
Fisher info I(mu) numeric : 0.25060033195535436
Riemannian geodesic, Finsler metric, Fisher–Rao distance
$$L(\gamma)=\int \sqrt{g_{ij}\,\dot x^i \dot x^j}\;dt \qquad \rho_{FR}(p,q)=\min_\gamma\!\int_0^1\!\sqrt{\dot\gamma^{\top}\mathbf I\,\dot\gamma}\,dt$$

These require solving a variational/geodesic problem on a manifold, so there is no one-line snippet. On a unit sphere the geodesic between points at angular separation $\theta$ is the arc $L=\theta$ (orthogonal points $\Rightarrow \pi/2\approx1.571$). Locally the Fisher–Rao distance satisfies $\rho_{FR}(p,q)\approx\sqrt{2\,\mathrm{KL}(p\|q)}$; for the $p,q$ below that is $\sqrt{2\cdot0.511}\approx1.011$.

3a. $f$-divergences (Csiszár / Ali–Silvey)

Template $D_f(p\|q)=\int p\,f(q/p)\,d\mu$. Examples use the pmfs $p=[0.5,0.5]$, $q=[0.9,0.1]$.

Kullback–Leibler divergence ($f(t)=-\log t$)
$$\mathrm{KL}(p\|q)=\int p\,\log\frac{p}{q}\,d\mu$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
print("KL(p||q) nats:", (p * np.log(p / q)).sum())
print("KL(q||p) nats:", (q * np.log(q / p)).sum())   # asymmetric
Out:
KL(p||q) nats: 0.5108256237659907
KL(q||p) nats: 0.3680642071684971
Reverse Kullback–Leibler ($f(t)=t\log t$)
$$\mathrm{KL}(q\|p)=\int q\,\log\frac{q}{p}\,d\mu$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
print("Reverse KL = KL(q||p) nats:", (q * np.log(q / p)).sum())
Out:
Reverse KL = KL(q||p) nats: 0.3680642071684971
Pearson $\chi^2$ divergence ($f(t)=(t-1)^2$)
$$\chi^2(p\|q)=\int \frac{(q-p)^2}{p}\,d\mu$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
print("Pearson chi2:", (((q - p) ** 2) / p).sum())
Out:
Pearson chi2: 0.6400000000000001
Neyman $\chi^2$ divergence ($f(t)=(1-t)^2/t$)
$$\chi^2_N(p\|q)=\int \frac{(p-q)^2}{q}\,d\mu$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
print("Neyman chi2:", (((p - q) ** 2) / q).sum())
Out:
Neyman chi2: 1.7777777777777781
Hellinger distance ($f(t)=(\sqrt t-1)^2$)
$$H(p,q)=\sqrt{\tfrac12\int\!\big(\sqrt{p}-\sqrt{q}\big)^2 d\mu}$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
print("Hellinger:", np.sqrt(((np.sqrt(p) - np.sqrt(q)) ** 2).sum() / 2))
Out:
Hellinger: 0.32491969623290634
Total variation distance ($f(t)=\tfrac12|t-1|$)
$$\mathrm{TV}(p,q)=\tfrac12\int |p-q|\,d\mu$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
print("Total variation:", 0.5 * np.abs(p - q).sum())
Out:
Total variation: 0.4
Amari $\alpha$-divergence
$$f_\alpha(t)=\frac{4}{1-\alpha^2}\Big(1-t^{\frac{1+\alpha}{2}}\Big)$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1]); alpha = 0.0
t = q / p
f = 4 / (1 - alpha**2) * (1 - t ** ((1 + alpha) / 2))
print("Amari alpha=0:", (p * f).sum())
Out:
Amari alpha=0: 0.4222912360003366

3b. Bregman divergences

Template $B_F(x\|y)=F(x)-F(y)-\langle x-y,\nabla F(y)\rangle$ for a strictly convex potential $F$.

Squared Euclidean ($F=\|\cdot\|^2$)
$$B_F(x\|y)=\|x-y\|^2$$
In [ ]:
x = np.array([1.0, 2.0]); y = np.array([0.0, 0.0])
print("Bregman sq-Euclidean:", ((x - y) ** 2).sum())
Out:
Bregman sq-Euclidean: 5.0
Kullback–Leibler (discrete), $F=-H$
$$B_F(p\|q)=\mathrm{KL}(p\|q)\quad(\text{unique } f\cap\text{Bregman})$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
print("Bregman (F=-H) = KL(p||q):", (p * np.log(p / q)).sum())
Out:
Bregman (F=-H) = KL(p||q): 0.5108256237659907
Itakura–Saito divergence ($F=-\sum_i\log x_i$)
$$\mathrm{IS}(p\|q)=\sum_i\Big(\frac{p_i}{q_i}-\log\frac{p_i}{q_i}-1\Big)$$
In [ ]:
pi = np.array([1.0, 2.0, 3.0]); qi = np.array([1.0, 1.0, 4.0])
print("Itakura-Saito:", (pi / qi - np.log(pi / qi) - 1).sum())
Out:
Itakura-Saito: 0.3445348918918354
Log-Det divergence (SPD matrices)
$$D(\mathbf P\|\mathbf Q)=\langle\mathbf P,\mathbf Q^{-1}\rangle-\log\det(\mathbf P\mathbf Q^{-1})-n$$
In [ ]:
P = np.array([[2.0, 0.0], [0.0, 1.0]]); Q = np.eye(2); n = 2
PQ = P @ np.linalg.inv(Q)
print("Log-Det:", np.trace(PQ) - np.log(np.linalg.det(PQ)) - n)
Out:
Log-Det: 0.3068528194400546

3c. Overlap / $\alpha$-power family

All built from the affinity $\int p^{\alpha}q^{1-\alpha}d\mu$ (same $p,q$ pmfs).

Bhattacharyya distance
$$d_B(p,q)=-\log\!\int \sqrt{p\,q}\,d\mu$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
BC = np.sqrt(p * q).sum()
print("Bhattacharyya coeff:", BC)
print("Bhattacharyya dist :", -np.log(BC))
Out:
Bhattacharyya coeff: 0.8944271909999159
Bhattacharyya dist : 0.11157177565710491
Chernoff information
$$C(p,q)=\max_{\alpha\in(0,1)}\Big(-\log\!\int p^{\alpha}q^{1-\alpha}d\mu\Big)$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
als = np.linspace(0.01, 0.99, 99)
print("Chernoff info:", max(-np.log((p**a * q**(1 - a)).sum()) for a in als))
Out:
Chernoff info: 0.11237628278879076
Rényi divergence
$$R_\alpha(p\|q)=\frac{1}{\alpha-1}\log\!\int p^{\alpha}q^{1-\alpha}d\mu$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1]); a = 0.5
print("Renyi div a=0.5:", 1 / (a - 1) * np.log((p**a * q**(1 - a)).sum()))
Out:
Renyi div a=0.5: 0.22314355131420957

3d. Symmetrized & Jensen-type divergences

Symmetric combinations of KL and Jensen gaps (same $p,q$ pmfs).

Jeffreys divergence (symmetric KL)
$$J(p,q)=\mathrm{KL}(p\|q)+\mathrm{KL}(q\|p)$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
kl = lambda u, v: (u * np.log(u / v)).sum()
print("Jeffreys:", kl(p, q) + kl(q, p))
Out:
Jeffreys: 0.8788898309344878
Jensen–Shannon divergence
$$\mathrm{JS}(p,q)=\tfrac12\mathrm{KL}(p\|m)+\tfrac12\mathrm{KL}(q\|m),\ m=\tfrac{p+q}{2}$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
kl = lambda u, v: (u * np.log(u / v)).sum()
m = (p + q) / 2
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
print("Jensen-Shannon:", js, " sqrt(JS):", np.sqrt(js))
Out:
Jensen-Shannon: 0.10174922507919676  sqrt(JS): 0.31898154347735663
Burbea–Rao / Jensen divergence
$$J_F(p,q)=\frac{F(p)+F(q)}{2}-F\!\Big(\frac{p+q}{2}\Big)$$
In [ ]:
p = np.array([0.5, 0.5]); q = np.array([0.9, 0.1])
negH = lambda z: (z * np.log(z)).sum()      # F = -Shannon entropy
m = (p + q) / 2
print("Burbea-Rao (=JS):", (negH(p) + negH(q)) / 2 - negH(m))
Out:
Burbea-Rao (=JS): 0.10174922507919681

4. Entropies

Uncertainty of a single distribution. Examples use $r=[0.5,0.25,0.25]$.

Shannon / Boltzmann–Gibbs entropy
$$H(p)=-\!\int p\log p\,d\mu$$
In [ ]:
r = np.array([0.5, 0.25, 0.25])
print("Shannon (nats):", -(r * np.log(r)).sum())
print("Shannon (bits):", -(r * np.log2(r)).sum())
Out:
Shannon (nats): 1.0397207708399179
Shannon (bits): 1.5
Rényi entropy
$$H_\alpha(p)=\frac{1}{1-\alpha}\log\!\int p^{\alpha}\,d\mu$$
In [ ]:
r = np.array([0.5, 0.25, 0.25]); alpha = 2.0
print("Renyi entropy a=2:", 1 / (1 - alpha) * np.log((r**alpha).sum()))
Out:
Renyi entropy a=2: 0.9808292530117262
Tsallis entropy (non-additive)
$$T_\alpha(p)=\frac{1}{1-\alpha}\Big(\int p^{\alpha}\,d\mu-1\Big)$$
In [ ]:
r = np.array([0.5, 0.25, 0.25]); alpha = 2.0
print("Tsallis a=2:", 1 / (1 - alpha) * ((r**alpha).sum() - 1))
Out:
Tsallis a=2: 0.625
Sharma–Mittal entropy (two-parameter unifier)
$$h_{\alpha,\beta}(p)=\frac{1}{1-\beta}\bigg(\Big(\int p^{\alpha}d\mu\Big)^{\frac{1-\beta}{1-\alpha}}-1\bigg)$$
In [ ]:
r = np.array([0.5, 0.25, 0.25]); alpha = 2.0; beta = 3.0
print("Sharma-Mittal a=2,b=3:", 1 / (1 - beta) * ((r**alpha).sum() ** ((1 - beta) / (1 - alpha)) - 1))
Out:
Sharma-Mittal a=2,b=3: 0.4296875

5. Distances between sets & metric spaces

How far apart whole sets / shapes are.

Hausdorff distance
$$d_{\mathrm{Haus}}(X,Y)=\max\Big\{\sup_{x\in X}\rho(x,Y),\ \sup_{y\in Y}\rho(X,y)\Big\}$$
In [ ]:
A = np.array([[0, 0], [1, 0], [0, 1.0]]); B = np.array([[0, 0], [1, 1.0]])
def directed(X, Y):
    return max(np.min(np.sqrt(((x - Y) ** 2).sum(1))) for x in X)
print("Hausdorff:", max(directed(A, B), directed(B, A)))
Out:
Hausdorff: 1.0
Gromov–Hausdorff distance
$$d_{GH}(X,Y)=\inf_{\phi_X,\phi_Y}\rho_{\mathrm{Haus}}^{Z}\big(\phi_X(X),\phi_Y(Y)\big)$$

Compares whole metric spaces up to isometry; computing it exactly is NP-hard (an infimum over all isometric embeddings). Two isometric shapes give $d_{GH}=0$; in practice it is approximated via the Gromov–Wasserstein relaxation.

6. Optimal transport & integral probability metrics (IPMs)

An IPM is $\gamma_{\mathcal F}(p,q)=\sup_{f\in\mathcal F}|\int f\,dp-\int f\,dq|$.

Wasserstein-1 / Earth Mover's Distance
$$W_1(p,q)=\inf_{\gamma\in\Gamma(p,q)}\int \rho(x,y)\,d\gamma$$
In [ ]:
u = np.array([0.0, 1, 2, 3]); v = np.array([1.0, 2, 3, 4])
print("Wasserstein-1:", np.abs(np.sort(u) - np.sort(v)).mean())
Out:
Wasserstein-1: 1.0
Maximum Mean Discrepancy (MMD)
$$\mathrm{MMD}(p,q)=\sup_{\|f\|_{\mathcal H}\le1}\big|\mathbb E_p f-\mathbb E_q f\big|$$
In [ ]:
def mmd2(X, Y, g=0.5):
    XX = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    YY = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    XY = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-g * XX).mean() + np.exp(-g * YY).mean() - 2 * np.exp(-g * XY).mean()
X = np.array([[0.0], [1.0], [2.0]]); Y = np.array([[3.0], [4.0], [5.0]])
print("MMD^2 (RBF, g=0.5):", mmd2(X, Y))
Out:
MMD^2 (RBF, g=0.5): 1.0634645194698213
Kolmogorov–Smirnov distance
$$K(p,q)=\sup_x \big|F_p(x)-F_q(x)\big|$$
In [ ]:
u = np.array([0.0, 1, 2, 3]); v = np.array([1.0, 2, 3, 4])
grid = np.sort(np.concatenate([u, v]))
Fu = np.searchsorted(np.sort(u), grid, side="right") / len(u)
Fv = np.searchsorted(np.sort(v), grid, side="right") / len(v)
print("KS statistic:", np.abs(Fu - Fv).max())
Out:
KS statistic: 0.25
Stein discrepancy & Lévy–Prokhorov distance
$$\mathrm{LP}_\rho(p,q)=\inf\{\varepsilon>0:\ p(A)\le q(A^{\varepsilon})+\varepsilon\ \forall A\}$$

The Stein discrepancy is a supremum over a Stein-RKHS ball (needs the score $\nabla\log p$, not the normalizer) — estimated from samples, not a one-liner. Lévy–Prokhorov is an infimum over all Borel sets; for a point mass shifted by $\delta$ it equals $\min(\delta,1)$ (e.g. $\delta=0.3\Rightarrow0.3$).

7. Quantum geometry (density matrices)

Densities become density matrices $\rho$; integrals become traces. With diagonal $\rho$ these reduce to the classical formulas on the eigenvalues.

Von Neumann entropy
$$S(\rho)=-\mathrm{Tr}(\rho\log\rho)$$
In [ ]:
rho = np.array([[0.7, 0.0], [0.0, 0.3]])
ev = np.linalg.eigvalsh(rho)
print("Von Neumann entropy (nats):", -(ev * np.log(ev)).sum())
Out:
Von Neumann entropy (nats): 0.6108643020548935
Von Neumann (quantum relative) divergence
$$D(\mathbf P\|\mathbf Q)=\mathrm{Tr}\big(\mathbf P(\log\mathbf P-\log\mathbf Q)-\mathbf P+\mathbf Q\big)$$
In [ ]:
def logm_sym(M):
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.log(w)) @ V.T
P = np.array([[0.7, 0.0], [0.0, 0.3]]); Q = np.array([[0.5, 0.0], [0.0, 0.5]])
print("Von Neumann divergence (nats):", np.trace(P @ (logm_sym(P) - logm_sym(Q)) - P + Q))
Out:
Von Neumann divergence (nats): 0.08228287850505178