AISTATS 2026 Batch • Paper 1
15 minute read
Published: June 2026

Synthetic-Regularized Kernel Regression

Hamidreza Hashempoor • Institute for AI, University of Stuttgart

This post looks at a deceptively simple question: if you have only a handful of noisy real labels, but also a cheap synthetic generator $g$ that you believe resembles the truth, how much should you trust the generator? The method below answers this with a single knob, $\lambda$, inside a kernel regressor, and the theory describes when turning that knob up helps and when it backfires. The claims are then checked on a small 1-D experiment whose numbers are reproduced by the companion notebook. The method follows the paper Beyond Real Data: Synthetic Data through the Lens of Regularization.

The idea: regularize toward a synthetic prior

Ordinary kernel ridge regression fits noisy labels and penalizes the size of the function in a reproducing kernel Hilbert space (RKHS) $\mathcal H_K$. Here we instead penalize the distance to a synthetic generator $g$. The estimator solves

$$ f_N=\arg\min_{f\in\mathcal H_K}\; \frac1N\sum_{n=1}^N\bigl(y_n-f(x_n)\bigr)^2 \;+\;\lambda\,\lVert f-g\rVert_{\mathcal H_K}^2 . $$

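The pieces: the first term is the empirical squared error on the $N$ real labeled pairs $(x_n,y_n)$; the second term penalizes the RKHS distance between $f$ and the synthetic generator $g$; and $\lambda$ sets their relative weight.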

The knob $\lambda$ interpolates between data and prior. As $\lambda\to0$ the fit interpolates the noisy labels (low bias, high variance). As $\lambda\to\infty$ it collapses onto the synthetic generator $g$ (low variance, but bias equal to however wrong $g$ is). The central claim: synthetic regularization reduces variance but can introduce bias, and the best $\lambda$ depends on the discrepancy between $g$ and $f_\star$.

Closed-form estimator

By the Representer Theorem the minimizer lives in the span of the training kernels,

$$ f_N(x)=\sum_{i=1}^N \alpha_i\,K(x,x_i). $$

Writing $(K_N)_{ij}=K(x_i,x_j)$ for the $N\times N$ Gram matrix and letting $\beta$ be the coefficients of the part of $g$ visible in the training kernel span, $g_\parallel(x)=\sum_i\beta_i K(x,x_i)$, the objective becomes finite-dimensional:

$$ J(\alpha)=\frac1N\lVert y-K_N\alpha\rVert_2^2 +\lambda\,(\alpha-\beta)^\top K_N(\alpha-\beta). $$

Setting $\nabla_\alpha J=0$ gives the estimator we actually implement:

$$ \boxed{\;\alpha=(K_N+N\lambda I)^{-1}\,(y+N\lambda\beta)\;} $$
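For readers who want the intermediate step, here is a short derivation. It assumes $K_N$ is symmetric and invertible, which holds for an RBF kernel evaluated at distinct training inputs:

$$ \nabla_\alpha J=-\frac{2}{N}K_N\,(y-K_N\alpha)+2\lambda K_N(\alpha-\beta)=0 \;\Longrightarrow\; -(y-K_N\alpha)+N\lambda(\alpha-\beta)=0 \;\Longrightarrow\; (K_N+N\lambda I)\,\alpha=y+N\lambda\beta . $$

The middle step multiplies by $N/2$ and cancels the common factor $K_N$ on the left.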

The synthetic coefficients $\beta$ are obtained by projecting $g$ onto the training kernel span with a tiny numerical ridge $\eta$ (here $\eta=10^{-8}$):

$$ \beta=(K_N+\eta I)^{-1} g_X,\qquad g_X=[\,g(x_1),\dots,g(x_N)\,]^\top. $$

And predictions on a test grid are a single matrix product:

$$ \hat y_{\text{test}}=K_{\text{test,train}}\,\alpha. $$

Reading the boxed formula: the $N\lambda\beta$ term pulls the solution toward the synthetic prior, while $N\lambda I$ shrinks the influence of the noisy labels $y$. When $\lambda\to0$ it reduces to plain kernel interpolation $\alpha=K_N^{-1}y$; when $\lambda\to\infty$ it tends to $\alpha\to\beta$, i.e. $f_N\to g_\parallel$.
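The boxed estimator, the projection for $\beta$, and the prediction step can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's code: the helper names (`rbf_kernel`, `project_generator`, `fit_alpha`), the toy target, the offset generator, and the small well-conditioned setup (12 evenly spaced points, lengthscale 0.05) are all assumptions chosen so that the two limits are easy to check numerically.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale):
    """RBF Gram matrix K[i, j] = exp(-(a_i - b_j)^2 / (2 * lengthscale^2)) for 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * lengthscale**2))

def project_generator(K, g_X, eta=1e-8):
    """beta = (K_N + eta I)^{-1} g_X: generator coefficients in the training kernel span."""
    return np.linalg.solve(K + eta * np.eye(len(g_X)), g_X)

def fit_alpha(K, y, beta, lam):
    """alpha = (K_N + N lam I)^{-1} (y + N lam beta), the boxed estimator."""
    N = len(y)
    return np.linalg.solve(K + N * lam * np.eye(N), y + N * lam * beta)

# Toy setup (illustrative values only, not the experiment's configuration).
rng = np.random.default_rng(0)
N, lengthscale = 12, 0.05
x_train = np.linspace(0.0, 1.0, N)
x_test = np.linspace(0.0, 1.0, 101)

def f_star(x):
    return np.sin(2 * np.pi * x)

def g(x):
    # Hypothetical generator: the truth plus a constant offset.
    return f_star(x) + 0.3

y = f_star(x_train) + 0.1 * rng.standard_normal(N)
K = rbf_kernel(x_train, x_train, lengthscale)
K_test = rbf_kernel(x_test, x_train, lengthscale)
beta = project_generator(K, g(x_train))

alpha_small = fit_alpha(K, y, beta, lam=1e-10)   # lambda -> 0
alpha_large = fit_alpha(K, y, beta, lam=1e6)     # lambda -> infinity
y_hat = K_test @ fit_alpha(K, y, beta, lam=1e-2) # predictions on the test grid

# Residual at the training points: tiny, i.e. the fit interpolates the labels.
print(np.max(np.abs(K @ alpha_small - y)))
# Distance between alpha and beta: tiny, i.e. the fit collapses onto g_parallel.
print(np.max(np.abs(alpha_large - beta)))
```

Using `np.linalg.solve` rather than forming an explicit inverse is a deliberate choice here, since the Gram matrix of an RBF kernel can be badly conditioned.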

Bias–variance decomposition

The population test error, over both inputs and label noise, is

$$ \mathcal R_N(\lambda;g)=\mathbb E_{x,\varepsilon}\bigl[(f_\star(x)-f_N(x))^2\bigr] =\mathcal B^2+\mathcal V, $$

with the usual squared-bias and variance pieces

$$ \mathcal B^2=\mathbb E_x\bigl[(f_\star(x)-\mathbb E_\varepsilon f_N(x))^2\bigr], \qquad \mathcal V=\mathbb E_{x,\varepsilon}\bigl[(f_N(x)-\mathbb E_\varepsilon f_N(x))^2\bigr]. $$

We estimate these by Monte Carlo: fix the training inputs $x_1,\dots,x_N$, then repeat training over $R$ independent noise draws $\varepsilon^{(r)}$ and evaluate every learned function on a dense test grid. With $\bar f_N$ the average prediction across repetitions,

$$ \widehat{\mathcal B}^2=\frac{1}{n_{\text{test}}}\sum_i \bigl(f_\star(x_i^{\text{test}})-\bar f_N(x_i^{\text{test}})\bigr)^2, \qquad \widehat{\mathcal V}=\frac{1}{n_{\text{test}}}\sum_i\frac1R\sum_{r=1}^R \bigl(f_N^{(r)}(x_i^{\text{test}})-\bar f_N(x_i^{\text{test}})\bigr)^2, $$

and $\widehat{\mathcal R}=\widehat{\mathcal B}^2+\widehat{\mathcal V}$. Only the label noise is resampled across repetitions — the inputs stay fixed, which is exactly what the decomposition assumes.
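The Monte Carlo procedure can be written compactly by stacking the $R$ noise draws as columns of one matrix. The sketch below is an assumption-laden illustration rather than the notebook itself: the function name `mc_bias_variance`, the seeds, and the choice of a generator equal to the target are mine, and the target is the two-frequency signal used in the experiment section.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale):
    """RBF Gram matrix for 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * lengthscale**2))

def mc_bias_variance(f_star, g, x_train, x_test, lam, sigma, lengthscale,
                     n_rep=200, eta=1e-8, seed=0):
    """Monte Carlo estimates of (bias^2, variance, risk) with the inputs held fixed."""
    rng = np.random.default_rng(seed)
    N = len(x_train)
    K = rbf_kernel(x_train, x_train, lengthscale)
    K_test = rbf_kernel(x_test, x_train, lengthscale)
    beta = np.linalg.solve(K + eta * np.eye(N), g(x_train))

    # One column per repetition: only the label noise changes between columns.
    Y = f_star(x_train)[:, None] + sigma * rng.standard_normal((N, n_rep))
    alphas = np.linalg.solve(K + N * lam * np.eye(N), Y + N * lam * beta[:, None])
    preds = K_test @ alphas                    # shape (n_test, n_rep)

    mean_pred = preds.mean(axis=1)             # average prediction on the grid
    bias2 = float(np.mean((f_star(x_test) - mean_pred) ** 2))
    variance = float(np.mean(preds.var(axis=1)))  # 1/R normalisation (ddof=0)
    return bias2, variance, bias2 + variance

def f_star(x):
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(4 * np.pi * x)

rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, 64)            # fixed training inputs
x_test = np.linspace(0.0, 1.0, 1000)

for lam in (3e-4, 1e-2, 10.0):
    # Generator equal to the target, i.e. no mismatch (an illustrative choice).
    b2, var, risk = mc_bias_variance(f_star, f_star, x_train, x_test, lam,
                                     sigma=0.1, lengthscale=0.15)
    print(f"lambda={lam:g}  bias^2={b2:.3e}  variance={var:.3e}  risk={risk:.3e}")
```

Because the seed and inputs differ from the notebook's, the printed values should not be expected to match the reference numbers digit for digit; only the qualitative trend is comparable.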

A spectral (Mercer) view of generator mismatch

Why should a generator that is "slightly off" sometimes be catastrophic and sometimes harmless? The answer is spectral. The population kernel operator $(T_K f)(x)=\int K(x,x')f(x')\,dp_x(x')$ has eigenpairs $T_K\phi_j=\mu_j\phi_j$. Expanding both functions in this Mercer basis, $f_\star=\sum_j\theta_j\phi_j$ and $g=\sum_j\omega_j\phi_j$, the theory measures mismatch by

$$ \mathcal D(f_\star,g)^2=\sum_{j=1}^{\infty}\frac{(\theta_j-\omega_j)^2}{\mu_j^2}. $$
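The effect of the $1/\mu_j^2$ weights in the formula above can be seen numerically. The sketch below uses the eigenpairs of $K_N/N$ as an empirical stand-in for $(\mu_j,\phi_j)$ and truncates the sum to the leading modes; this is one plausible discretization and an assumption on my part, not necessarily how the notebook defines its proxy. The names `empirical_spectrum` and `discrepancy_sq`, and the truncation level of 8 modes (chosen to stay well above round-off), are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale):
    """RBF Gram matrix for 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * lengthscale**2))

def empirical_spectrum(K):
    """Eigenpairs of K/N, largest eigenvalue first."""
    N = K.shape[0]
    mu, U = np.linalg.eigh(K / N)      # eigh returns ascending eigenvalues
    return mu[::-1], U[:, ::-1]

def discrepancy_sq(mu, U, mismatch, n_modes):
    """Truncated proxy: sum over the leading modes of c_j^2 / mu_j^2,
    with c_j = u_j^T mismatch / sqrt(N)."""
    N = len(mismatch)
    coeffs = U[:, :n_modes].T @ mismatch / np.sqrt(N)
    return float(np.sum(coeffs**2 / mu[:n_modes] ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 64)
mu, U = empirical_spectrum(rbf_kernel(x, x, 0.15))

# Two mismatches with the same mean-square size on the sample,
# one along the largest-eigenvalue direction, one along the sixth.
easy = np.sqrt(64) * U[:, 0]
hard = np.sqrt(64) * U[:, 5]
d_easy = discrepancy_sq(mu, U, easy, n_modes=8)
d_hard = discrepancy_sq(mu, U, hard, n_modes=8)

# The two printed numbers agree up to round-off: the penalty ratio is (mu_1 / mu_6)^2.
print(d_hard / d_easy, (mu[0] / mu[5]) ** 2)
```

Equal-sized errors are therefore penalized very differently depending on which eigen-direction they fall in, which is the mechanism the rest of the section relies on.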

The $1/\mu_j^2$ weighting is the whole point: mismatch in small-eigenvalue directions (the "hard", typically high-frequency directions that an RBF kernel suppresses) is penalized enormously. The risk bound takes the shape

$$ \mathcal R_N(\lambda;g)=\mathcal O\!\left( \underbrace{\frac{\mathcal D(f_\star,g)+\sigma^2}{N\lambda^2}}_{\text{variance-like}} +\underbrace{\lambda^{\,2-\frac{1}{4r}}\,\mathcal D(f_\star,g)}_{\text{bias-like}} \right). $$
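As a rough illustration of how the two terms trade off, one can treat the bound as an equality with all hidden constants set to one and minimize over $\lambda$. That simplification is an assumption made here for illustration only, not a statement from the theory. Writing $p=2-\frac{1}{4r}$ for the exponent in the bias-like term and assuming $p>0$ and $\mathcal D(f_\star,g)>0$:

$$ h(\lambda)=\frac{\mathcal D+\sigma^2}{N\lambda^2}+\lambda^{p}\,\mathcal D,\qquad h'(\lambda)=-\frac{2(\mathcal D+\sigma^2)}{N\lambda^3}+p\,\lambda^{p-1}\mathcal D=0 \;\Longrightarrow\; \lambda_\star=\left(\frac{2(\mathcal D+\sigma^2)}{p\,N\,\mathcal D}\right)^{\frac{1}{p+2}}. $$

When $\mathcal D=0$ the bias-like term vanishes and this surrogate decreases monotonically in $\lambda$, which is consistent with the perfect-generator regime discussed below.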

The minimal experiment

A 1-D regression on $[0,1]$ makes every claim checkable. The target is a smooth two-frequency signal,

$$ f_\star(x)=\sin(2\pi x)+0.5\cos(4\pi x),\qquad x\sim\mathrm{Uniform}(0,1), $$

observed with Gaussian label noise $y_n=f_\star(x_n)+\varepsilon_n$, $\varepsilon_n\sim\mathcal N(0,\sigma^2)$, and fit with an RBF kernel $K(x,x')=\exp(-\lVert x-x'\rVert^2/2\ell^2)$. We then build three synthetic generators of increasing difficulty, referred to below as perfect, smooth bias, and high-frequency bias.

The exact configuration used for the run reported below:

| Setting | Value | Setting | Value |
|---|---|---|---|
| train size $N$ | 64 | test grid | 1000 points |
| noise std $\sigma$ | 0.10 | repetitions $R$ | 200 |
| RBF lengthscale $\ell$ | 0.15 | projection ridge $\eta$ | $10^{-8}$ |
| bias offset $\delta$ | 0.30 | hf frequency | 10 |
| $\lambda$ grid | $\{3\!\times\!10^{-4},10^{-3},3\!\times\!10^{-3},10^{-2},3\!\times\!10^{-2},10^{-1},3\!\times\!10^{-1},1,3,10\}$ | | |
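For anyone wiring this configuration into code, it can be captured in a small frozen dataclass. The numeric values are taken from the settings listed above; everything else is an assumption. In particular, the functional forms of the two biased generators are NOT specified in this post, so the constant offset and the additive sine used below are hypothetical stand-ins, and `ExperimentConfig` and `make_generators` are illustrative names.

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class ExperimentConfig:
    n_train: int = 64
    n_test: int = 1000
    noise_std: float = 0.10
    n_rep: int = 200
    lengthscale: float = 0.15
    eta: float = 1e-8            # projection ridge
    delta: float = 0.30          # bias offset
    hf_frequency: float = 10.0
    lambdas: tuple = (3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1.0, 3.0, 10.0)

def f_star(x):
    """Target from the experiment: sin(2 pi x) + 0.5 cos(4 pi x)."""
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(4 * np.pi * x)

def make_generators(cfg):
    """ASSUMED generator forms (not taken from the post or the notebook)."""
    return {
        # Assumption: the perfect generator coincides with the target.
        "perfect": f_star,
        # Assumption: smooth bias is a constant offset delta.
        "smooth bias": lambda x: f_star(x) + cfg.delta,
        # Assumption: high-frequency bias adds delta * sin(2 pi * hf_frequency * x).
        "high-freq bias": lambda x: f_star(x)
        + cfg.delta * np.sin(2 * np.pi * cfg.hf_frequency * x),
    }

cfg = ExperimentConfig()
generators = make_generators(cfg)
x_grid = np.linspace(0.0, 1.0, cfg.n_test)
for name, g in generators.items():
    # Largest pointwise gap between each generator and the target on the grid.
    print(name, float(np.max(np.abs(g(x_grid) - f_star(x_grid)))))
```

If the notebook defines the generators differently, only `make_generators` needs to change; the configuration values are independent of that choice.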

Results

The three generators trace out three qualitatively different risk curves — exactly the three regimes the theory predicts.

[Image: Population risk versus regularization strength lambda for the three synthetic generators, log-log axes.]

Figure 1. Population risk $\mathcal R_N(\lambda;g)$ vs. $\lambda$ (log–log). Perfect falls monotonically as $\lambda$ grows; smooth bias is U-shaped with a sweet spot at moderate $\lambda$; high-frequency bias sits high and flat: no setting of $\lambda$ rescues a generator wrong in a hard direction.

[Image: Bias squared and variance versus lambda, one panel per generator.]

Figure 2. Bias$^2$ and variance separately, one panel per generator. Variance falls steadily with $\lambda$ in all three. For perfect, bias stays negligible, so risk just tracks the shrinking variance. For smooth bias, bias$^2$ grows at large $\lambda$ and crosses the variance, producing the U-shape. For high-frequency bias, bias$^2$ dominates everywhere.

[Image: Overlaid risk curves with best-lambda marked, and a log-scale bar chart of the spectral discrepancy proxy.]

Figure 3. Left: overlaid risk curves with the best $\lambda$ starred per generator. Right: the empirical spectral-discrepancy proxy $\widehat{\mathcal D}(f_\star,g)^2$ on a log scale: essentially $0$ for perfect, $\sim\!10^{11}$ for smooth bias, $\sim\!10^{23}$ for high-frequency bias. The ranking of discrepancies predicts the ranking of achievable risk.

Representative numbers

A few representative rows make the three regimes concrete:

| Generator | $\lambda$ | bias$^2$ | variance | risk |
|---|---|---|---|---|
| perfect | 0.0003 | 7.45e−06 | 1.583e−03 | 1.590e−03 |
| | 0.01 | 1.46e−06 | 8.887e−04 | 8.902e−04 |
| | 10.0 | 2.17e−09 | 3.803e−07 | 3.825e−07 |
| smooth bias | 0.0003 | 6.47e−06 | 1.561e−03 | 1.567e−03 |
| | 0.01 | 1.57e−04 | 8.801e−04 | 1.037e−03 |
| | 0.1 | 3.78e−03 | 2.749e−04 | 4.060e−03 |
| | 10.0 | 4.246e−02 | 3.78e−07 | 4.246e−02 |
| high-freq bias | 0.0003 | 7.024e−02 | 1.553e−03 | 7.179e−02 |
| | 0.01 | 8.964e−02 | 8.802e−04 | 9.052e−02 |
| | 10.0 | 1.094e−01 | 3.58e−07 | 1.094e−01 |

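Among the rows shown, the lowest risk for each generator is at $\lambda=10$ (perfect), $\lambda=0.01$ (smooth bias), and $\lambda=0.0003$ (high-frequency bias). Discrepancy proxy $\widehat{\mathcal D}^2$: perfect $=0$, smooth $\approx2.24\times10^{11}$, high-frequency $\approx8.74\times10^{23}$.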
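The reported numbers can be sanity-checked with plain arithmetic. The snippet below copies the representative rows, confirms that risk equals bias$^2$ plus variance up to the rounding of the displayed digits, picks the lowest-risk row per generator, and computes how many orders of magnitude separate the two nonzero discrepancy proxies. The tolerance of 0.5% is my own choice to absorb rounding in the printed values.

```python
import math

# (generator, lambda, bias^2, variance, risk), copied from the representative rows.
rows = [
    ("perfect",        0.0003, 7.45e-06,  1.583e-03, 1.590e-03),
    ("perfect",        0.01,   1.46e-06,  8.887e-04, 8.902e-04),
    ("perfect",        10.0,   2.17e-09,  3.803e-07, 3.825e-07),
    ("smooth bias",    0.0003, 6.47e-06,  1.561e-03, 1.567e-03),
    ("smooth bias",    0.01,   1.57e-04,  8.801e-04, 1.037e-03),
    ("smooth bias",    0.1,    3.78e-03,  2.749e-04, 4.060e-03),
    ("smooth bias",    10.0,   4.246e-02, 3.78e-07,  4.246e-02),
    ("high-freq bias", 0.0003, 7.024e-02, 1.553e-03, 7.179e-02),
    ("high-freq bias", 0.01,   8.964e-02, 8.802e-04, 9.052e-02),
    ("high-freq bias", 10.0,   1.094e-01, 3.58e-07,  1.094e-01),
]

best = {}
for name, lam, bias2, var, risk in rows:
    # Displayed values are rounded, so compare with a loose relative tolerance.
    assert math.isclose(bias2 + var, risk, rel_tol=5e-3), (name, lam)
    if name not in best or risk < best[name][1]:
        best[name] = (lam, risk)

for name, (lam, risk) in best.items():
    print(f"{name}: lowest listed risk {risk:.3e} at lambda={lam:g}")

# Gap between the high-frequency and smooth proxies, in orders of magnitude.
orders = math.log10(8.74e23 / 2.24e11)
print(f"proxy ratio spans about {orders:.1f} orders of magnitude")
```

The last line gives a gap of roughly 12.6 orders of magnitude between the two proxy values $\widehat{\mathcal D}^2$.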

Reading the three regimes. (1) Perfect generator: bias stays negligible, so more regularization is a pure win; risk drops from $1.6\times10^{-3}$ to $3.8\times10^{-7}$ as $\lambda$ goes from $3\times10^{-4}$ to $10$. (2) Smooth bias: variance reduction and growing bias trade off, giving a U-shape with the minimum near $\lambda\approx10^{-2}$ before bias takes over at large $\lambda$. (3) High-frequency bias: the mismatch lives in a direction the RBF kernel barely sees, so bias dominates at every $\lambda$ and the best you can do is keep $\lambda$ tiny and basically ignore the prior. The discrepancy proxy $\widehat{\mathcal D}^2$, roughly twelve orders of magnitude larger for the high-frequency generator than for the smooth one, predicts this ordering.

The practical lesson: when you design a synthetic generator to regularize a kernel model, getting the smooth, low-frequency behavior right is comparatively forgiving; an error in the hard, high-frequency directions is what really hurts, and the more so the more you trust it.

Reproduce it

The companion notebook re-runs the whole experiment from scratch using only numpy (and matplotlib for the plot). It implements the boxed estimator $\alpha=(K_N+N\lambda I)^{-1}(y+N\lambda\beta)$, the Monte-Carlo bias/variance decomposition, and the spectral-discrepancy proxy, then checks that the reproduced bias$^2$/variance/risk match the reference numbers reported above (they agree to $\sim\!10^{-9}$, i.e. floating-point noise).