AISTATS 2026 Batch • Paper 2


Published: June 2026

EventFlow: Flow Matching for Temporal Point Processes

Hamidreza Hashempoor • Institute for AI, University of Stuttgart

A temporal point process (TPP) sample is a finite, sorted list of timestamps in a window $[0,T]$: when did the trades fire, when did the patient have appointments, when did the server log errors. The usual way to model these is autoregressive: predict the next gap, then the next, then the next, accumulating error as the horizon grows. EventFlow instead learns a single flow that transports a simple reference TPP onto the data TPP in one shot, generating the whole sequence non-autoregressively. The trick that makes flow matching work on variable-length sets is a balanced coupling, and the whole story is checked on a tiny clustered synthetic process whose numbers are reproduced by the companion notebook. This post follows the method of the paper EventFlow: Forecasting Temporal Point Processes with Flow Matching.

The idea: transport a reference TPP onto the data TPP

Represent one event sequence as a finite ordered list of times, or equivalently as a sum of point masses (a counting measure):

$$ \gamma=\{t^1,\ldots,t^n\},\quad 0\le t^1<\cdots<t^n\le T, \qquad \gamma=\sum_{k=1}^n \delta[t^k]. $$

The number of events is itself a random quantity, captured by the count functional $N(\gamma)=n$. A TPP distribution $\mu\in\mathcal P(\Gamma)$ therefore has two parts: a distribution over how many events occur, and — given that count — a density over where they land,

$$ \mu(n)=\mathbb P_{\gamma\sim\mu}\!\left[N(\gamma)=n\right], \qquad \mu_n(t^1,\ldots,t^n),\;\; t^1<\cdots<t^n. $$
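As a concrete reading of the first part of that split, the count law $\mu(n)$ can be estimated from a list of sequences by tallying their lengths. A minimal sketch: `count_distribution` is a hypothetical helper name and the toy sequences are made up for illustration, not taken from the experiment.

```python
from collections import Counter

def count_distribution(sequences):
    """Empirical mu(n): fraction of sequences that contain exactly n events."""
    counts = Counter(len(seq) for seq in sequences)
    total = len(sequences)
    return {n: c / total for n, c in sorted(counts.items())}

# Toy sequences (illustrative values only).
toy = [[5.1], [2.9, 7.2], [3.1, 6.8], [2.0, 5.0, 8.1]]
mu_hat = count_distribution(toy)
print(mu_hat)  # {1: 0.25, 2: 0.5, 3: 0.25}
```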

Flow matching learns a velocity field that, integrated as an ODE in a flow time $s\in[0,1]$, carries samples from a simple reference distribution $\mu_0$ to the data distribution $\mu_1$. The complication unique to point processes is that two samples can have different lengths: you cannot linearly interpolate a 2-event sequence into a 3-event one coordinate by coordinate. EventFlow sidesteps this by being careful about how reference and data sequences are paired.

The whole method rests on one design choice: match counts before you interpolate. The reference $\mu_0$ is built to share the data's count distribution, $\mu_0(n)=\mu_1(n)$ for all $n$, and every training pair is chosen to have the same number of events. The flow then only ever has to move events around — it never has to create or delete them.

The balanced coupling

A coupling is a joint distribution over source–target pairs $z=(\gamma_0,\gamma_1)\sim\rho$. It is called balanced when the two sides always have the same count:

$$ N(\gamma_0)=N(\gamma_1)\quad\text{almost surely}. $$

The naive recipe — sample $\gamma_0\sim\mu_0$ and $\gamma_1\sim\mu_1$ independently — breaks this: nothing forces the counts to agree, and a 2-event source paired with a 3-event target has no coordinate-wise interpolant. The fix is to sample the pair conditionally. Draw the data sequence first, read off its count $n=N(\gamma_1)$, then draw exactly $n$ reference times i.i.d. from a simple density $q$ on $[0,T]$ and sort them:

$$ t^1_0,\ldots,t^n_0\stackrel{\text{i.i.d.}}{\sim}q, \qquad \gamma_0=\operatorname{sort}(t^1_0,\ldots,t^n_0). $$

For the minimal experiment $q=\operatorname{Uniform}(0,T)$. By construction $N(\gamma_0)=N(\gamma_1)$, so the pair is balanced and the two sorted sequences can be matched event-for-event. In code this is a two-liner:

$$ \texttt{n = len(gamma\_1)};\qquad \texttt{gamma\_0 = sort(rand(n) * T)}. $$
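A runnable version of the two-liner, as a minimal NumPy sketch. The helper name `sample_balanced_reference`, the seed, and the example `gamma_1` are illustrative assumptions; only the recipe (read off $n$, draw $n$ uniform times, sort) comes from the text.

```python
import numpy as np

def sample_balanced_reference(gamma_1, T, rng):
    """Draw a reference sequence with the same number of events as gamma_1."""
    n = len(gamma_1)                              # read the count off the data
    return np.sort(rng.uniform(0.0, T, size=n))   # n i.i.d. Uniform(0, T) times, sorted

rng = np.random.default_rng(0)
gamma_1 = np.array([2.0, 5.0, 8.0])               # illustrative data sequence
gamma_0 = sample_balanced_reference(gamma_1, T=10.0, rng=rng)

# Balanced by construction: both sides have the same count.
assert len(gamma_0) == len(gamma_1)
```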

The pitfall to avoid. Do not independently sample counts from $\mu_0$ and $\mu_1$ — that can break the balance. And do not average sequences of different counts coordinate-wise: the marginal interpolating law is a mixture over lengths, not a vector average. A minibatch simply holds sequences of different lengths side by side, handled with padding and masks.
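One way the padding-and-mask bookkeeping might look, sketched in NumPy. The function name `pad_batch` and the choice of `0.0` as the pad value are assumptions made for illustration; the notebook may organize this differently.

```python
import numpy as np

def pad_batch(sequences, pad_value=0.0):
    """Stack variable-length sequences into (B, N_max) times plus a 0/1 mask."""
    n_max = max(len(seq) for seq in sequences)
    times = np.full((len(sequences), n_max), pad_value, dtype=float)
    mask = np.zeros((len(sequences), n_max), dtype=float)
    for b, seq in enumerate(sequences):
        times[b, :len(seq)] = seq    # real events go first
        mask[b, :len(seq)] = 1.0     # 1 marks a real event, 0 marks padding
    return times, mask

times, mask = pad_batch([[5.0], [3.0, 7.0], [2.0, 5.0, 8.0]])
print(mask.sum(axis=1))  # [1. 2. 3.]
```

Each row keeps its own count, so sequences of different lengths sit side by side without ever being averaged together.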

The noisy linear interpolant

Two time variables now live in the model and must not be confused: the event time $t\in[0,T]$ says when an event happens; the flow time $s\in[0,1]$ says how far the sample has travelled from reference toward data. Given a balanced pair, interpolate each matched event linearly in $s$,

$$ t^k_s=(1-s)\,t^k_0+s\,t^k_1, \qquad \gamma^z_s=\sum_{k=1}^n\delta[t^k_s]. $$

Training on this deterministic path alone would only ever show the network points exactly on the straight line between source and target. To spread probability mass into a tube around the path — so the learned field is defined off the line too — add small independent Gaussian noise to each interpolated event:

$$ \hat\gamma^z_s=\sum_{k=1}^n\delta\!\left[t^k_s+\epsilon^k\right], \qquad \epsilon^k\sim\mathcal N(0,\sigma^2). $$

This conditional law (the small Gaussian cloud around the path of one fixed pair $z$) is written $\eta^z_s$. Averaging it over all balanced pairs gives the global interpolating distribution that training actually targets,

$$ \eta_s=\int \eta^z_s\,d\rho(z). $$

The integral is never computed. Each training example is drawn by sampling a pair $z\sim\rho$, a flow time $s\sim\operatorname{Uniform}(0,1)$, and a noisy point $\hat\gamma^z_s\sim\eta^z_s$; across minibatches these approximate the marginal path $\eta_s$.
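The three sampling steps can be sketched for a single balanced pair as follows. This is a minimal NumPy illustration; `sample_training_point`, the seed, and the example sequences are assumptions, while the formulas are the ones given above.

```python
import numpy as np

def sample_training_point(gamma_0, gamma_1, sigma, rng):
    """Return (s, noisy interpolant, regression target) for one balanced pair."""
    s = rng.uniform(0.0, 1.0)                                  # flow time
    clean = (1.0 - s) * gamma_0 + s * gamma_1                  # linear interpolant
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)   # Gaussian tube
    target = gamma_1 - gamma_0                                 # conditional velocity
    return s, noisy, target

rng = np.random.default_rng(0)
gamma_1 = np.array([3.0, 7.0])                                 # illustrative data pair
gamma_0 = np.sort(rng.uniform(0.0, 10.0, size=gamma_1.size))   # balanced reference
s, noisy, target = sample_training_point(gamma_0, gamma_1, sigma=0.05, rng=rng)
```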

A worked example (two events). Take $\gamma_0=\{1.0,7.0\}$, $\gamma_1=\{2.0,5.0\}$, and draw $s=0.4$. The clean interpolant is $t^1_s=0.6(1.0)+0.4(2.0)=1.4$ and $t^2_s=0.6(7.0)+0.4(5.0)=6.2$, so $\gamma^z_s=\{1.4,6.2\}$. With noise $\epsilon=(0.05,-0.10)$ the network sees $\hat\gamma^z_s=\{1.45,6.10\}$ at $s=0.4$, and must predict the target velocity $\gamma_1-\gamma_0=\{1.0,-2.0\}$.
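The arithmetic of the worked example can be checked in a few lines of plain Python, using exactly the numbers quoted above:

```python
gamma_0 = [1.0, 7.0]
gamma_1 = [2.0, 5.0]
s = 0.4
eps = [0.05, -0.10]

clean = [(1 - s) * t0 + s * t1 for t0, t1 in zip(gamma_0, gamma_1)]
noisy = [t + e for t, e in zip(clean, eps)]
velocity = [t1 - t0 for t0, t1 in zip(gamma_0, gamma_1)]

print([round(t, 2) for t in clean])   # [1.4, 6.2]
print([round(t, 2) for t in noisy])   # [1.45, 6.1]
print(velocity)                       # [1.0, -2.0]
```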

The vector field and the training objective

Because the interpolant is linear in $s$, the velocity of a fixed pair is constant in both $s$ and $\gamma$ — it is just the displacement of each event,

$$ v^z_s=\begin{bmatrix} t^1_1-t^1_0\\ \vdots\\ t^n_1-t^n_0\end{bmatrix} \;=\;\gamma_1-\gamma_0. $$

The field we actually want to deploy at generation time is the marginal one: at an intermediate sequence $\gamma$, average the conditional velocities of every pair that could have produced $\gamma$ at flow time $s$,

$$ v_s(\gamma)=\mathbb E_{z\sim\rho}\!\left[\,v^z_s \;\middle|\; \hat\gamma^z_s=\gamma\,\right]. $$

That conditional expectation involves unknown density ratios and cannot be evaluated directly. The flow-matching insight is that regressing a network onto the simple conditional targets $v^z_s=\gamma_1-\gamma_0$ has the same minimizer as regressing onto the marginal field: the two objectives differ only by a constant independent of the parameters. So the objective is just a least-squares fit:

$$ J(\theta)=\mathbb E_{s,\,z,\,\hat\gamma^z_s} \left[\,\bigl\lVert \gamma_1-\gamma_0-v_\theta(\hat\gamma^z_s,\,s)\bigr\rVert^2\,\right]. $$
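Why the conditional regression suffices is a standard bias-variance split, written here in the article's notation. Take $v_s(\gamma)$ to be the conditional mean of $v^z_s$ given the point the network actually sees. Adding and subtracting $v_s(\hat\gamma^z_s)$ inside the norm, the cross term vanishes by the tower property, leaving

$$ \mathbb E\bigl\lVert v^z_s-v_\theta(\hat\gamma^z_s,s)\bigr\rVert^2 =\mathbb E\bigl\lVert v_s(\hat\gamma^z_s)-v_\theta(\hat\gamma^z_s,s)\bigr\rVert^2 +\mathbb E\bigl\lVert v^z_s-v_s(\hat\gamma^z_s)\bigr\rVert^2. $$

The second term does not depend on $\theta$, so both objectives share the same gradients and minimizer. It is also the part of the loss that no network can remove.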

With padded minibatches the loss is evaluated only on valid event positions, so padding never contributes a gradient:

$$ \widehat J=\frac{\sum_{b,k}\bigl(v_\theta-\,(\gamma_1-\gamma_0)\bigr)^2_{b,k}\,m_{b,k}} {\sum_{b,k}m_{b,k}}, $$

where $m_{b,k}\in\{0,1\}$ is the mask marking real events versus padding.
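A minimal NumPy sketch of this masked average. The function name `masked_mse` and the toy arrays are illustrative assumptions; the value `9.9` is planted in a padded slot to show that it does not affect the result.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error over real events only; padded slots are ignored."""
    sq_err = (pred - target) ** 2
    return float((sq_err * mask).sum() / mask.sum())

pred   = np.array([[1.0, 0.0], [0.5, -1.0]])
target = np.array([[2.0, 9.9], [0.5, -2.0]])   # 9.9 sits in a padded slot
mask   = np.array([[1.0, 0.0], [1.0, 1.0]])

loss = masked_mse(pred, target, mask)          # (1 + 0 + 1) / 3 real events
```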

The network $v_\theta$ is deliberately light: a small MLP applied pointwise to each event, but with enough global context that it knows which sequence it is in. Each event $k$ is described by five features

$$ \bigl[\,t^k,\; s,\; t^k/T,\; N(\gamma)/N_{\max},\; \bar t/T\,\bigr], $$

where $\bar t$ is the masked mean event time of the sequence. These five features are concatenated with a sinusoidal embedding of the flow time $s$. The MLP outputs one scalar velocity per event, with padding positions zeroed. Generation then integrates the learned ODE with explicit Euler from a uniform reference sequence of the desired count,

$$ \frac{d\gamma_s}{ds}=v_\theta(\gamma_s,s), \qquad \gamma_{s+\Delta s}=\gamma_s+\Delta s\,v_\theta(\gamma_s,s), $$

clipping to $[0,T]$ and sorting at the end. Crucially the ODE moves events but never changes the count, so the count is fixed up front by sampling from the (known, here) count distribution — the flow is responsible only for where the events go.
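The Euler sampler can be sketched independently of the trained network. In the NumPy sketch below, `toy_velocity` is a hand-written stand-in, not the learned $v_\theta$: it is the exact straight-line field toward two fixed centers, chosen only so the result is easy to check. The names `euler_generate` and `toy_velocity` are assumptions.

```python
import numpy as np

def euler_generate(velocity_fn, gamma_0, T, n_steps=100):
    """Integrate d(gamma)/ds = v(gamma, s) from s=0 to s=1 with explicit Euler."""
    gamma = np.array(gamma_0, dtype=float)
    ds = 1.0 / n_steps
    for i in range(n_steps):
        s = i * ds
        gamma = gamma + ds * velocity_fn(gamma, s)   # the count never changes
    return np.sort(np.clip(gamma, 0.0, T))           # clip and sort at the end

# Stand-in field (NOT the trained network): straight-line transport to centers.
centers = np.array([3.0, 7.0])

def toy_velocity(gamma, s):
    return (centers - gamma) / (1.0 - s)             # s < 1 inside the loop

rng = np.random.default_rng(0)
gamma_0 = np.sort(rng.uniform(0.0, 10.0, size=2))    # uniform reference, count 2
gamma_gen = euler_generate(toy_velocity, gamma_0, T=10.0)
```

With this stand-in field every event travels along a straight line, so Euler lands on the centers; a learned field would only approximate that.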

The minimal experiment

The goal is not to beat a benchmark but to verify the method matches the theory: starting from a uniform reference TPP, does the learned flow pull events onto the data clusters while preserving the count distribution? The synthetic data process lives on $[0,T]$ with $T=10$ and a three-mode count distribution

$$ \mu_1(1)=0.2,\qquad \mu_1(2)=0.5,\qquad \mu_1(3)=0.3, $$

with the reference sharing it, $\mu_0(n)=\mu_1(n)$. Conditional on the count, each data event is a tight Gaussian around a fixed cluster center (then clipped to $[0,T]$ and sorted),

$$ t^k\sim\mathcal N(c_k,\sigma_{\text{data}}^2),\quad \sigma_{\text{data}}=0.35,\qquad c:\;\{1\!:\![5],\;2\!:\![3,7],\;3\!:\![2,5,8]\}, $$

while the reference draws its $n$ times i.i.d. $\operatorname{Uniform}(0,10)$. The exact configuration of the run reported below:

Setting	Value	Setting	Value
window $T$	10.0	train / val size	10000 / 1000
interpolant noise $\sigma$	0.05	batch size	128
model	pointwise MLP + context	hidden dim × layers	128 × 3
time-embedding dim	32	parameters	38,017
epochs	200	learning rate	$10^{-3}$ (Adam)
grad-clip norm	1.0	ODE steps (Euler)	100
generated samples	2000	data noise $\sigma_{\text{data}}$	0.35
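The parameter count in the table can be sanity-checked by hand. The sketch below assumes a plain fully connected stack: input of 5 features plus the 32-dimensional time embedding, three hidden layers of width 128, one scalar output, biases everywhere and no normalization layers. That layout is an assumption, but it reproduces the reported figure.

```python
def mlp_param_count(in_dim, hidden_dim, n_hidden, out_dim=1):
    """Weights plus biases of a fully connected MLP."""
    dims = [in_dim] + [hidden_dim] * n_hidden + [out_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

in_dim = 5 + 32   # five scalar features + 32-dim flow-time embedding
n_params = mlp_param_count(in_dim, hidden_dim=128, n_hidden=3)
print(n_params)   # 38017
```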

Before training, three structural checks confirm the data pipeline implements the theory exactly:

Balanced coupling. Every pair satisfies $N(\gamma_0)=N(\gamma_1)$ — the assertion passes on all samples.
Target velocity. The regression target equals $\gamma_1-\gamma_0$ to numerical precision.
Noise level. The empirical interpolant noise std is $0.0503$ against the configured $\sigma=0.05$ — a clean match.
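The three checks might be scripted roughly as follows. This NumPy sketch uses the configuration stated above (centers, count probabilities, $T$, $\sigma_{\text{data}}$, $\sigma$), but the sampler `sample_data`, the seed, and the sample size of 2000 are assumptions, so the empirical noise std it prints will not match $0.0503$ exactly.

```python
import numpy as np

CENTERS = {1: [5.0], 2: [3.0, 7.0], 3: [2.0, 5.0, 8.0]}
COUNT_PROBS = {1: 0.2, 2: 0.5, 3: 0.3}
T, SIGMA_DATA, SIGMA = 10.0, 0.35, 0.05

def sample_data(rng):
    """One clustered data sequence: pick a count, then jitter its centers."""
    n = int(rng.choice(list(COUNT_PROBS), p=list(COUNT_PROBS.values())))
    t = rng.normal(CENTERS[n], SIGMA_DATA)
    return np.sort(np.clip(t, 0.0, T))

rng = np.random.default_rng(0)
residuals = []
for _ in range(2000):
    gamma_1 = sample_data(rng)
    gamma_0 = np.sort(rng.uniform(0.0, T, size=gamma_1.size))
    assert gamma_0.size == gamma_1.size                        # check 1: balanced
    s = rng.uniform()
    clean = (1.0 - s) * gamma_0 + s * gamma_1
    noisy = clean + rng.normal(0.0, SIGMA, size=clean.shape)
    target = gamma_1 - gamma_0
    assert np.allclose(clean + (1.0 - s) * target, gamma_1)    # check 2: velocity
    residuals.extend(noisy - clean)

noise_std = float(np.std(residuals))                           # check 3: near SIGMA
print(round(noise_std, 4))
```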

Results

Training drives the masked MSE down steadily, and the learned ODE transports the uniform reference onto the three data clusters while leaving the count distribution essentially untouched.

Figure 1. Masked-MSE loss vs. epoch (log $y$-axis). Train (blue) and validation (orange) both start near $4.8$, drop sharply in the first ~20 epochs, then settle into a slow decline, ending at $1.354$ / $1.323$. The best validation loss, $1.172$, is reached near epoch 157. The loss does not go to zero, and should not: the conditional targets $\gamma_1-\gamma_0$ are intrinsically noisy (many pairs map to the same intermediate sequence), so the irreducible floor is the variance of the conditional velocity around the marginal field.

Figure 2. Event-count distribution at $n=1,2,3$: configured $\mu(n)$, empirical training data, and generated samples. All three nearly coincide (the count L1 error between generated and data is only $0.048$) because counts are fixed by sampling before the ODE and the flow preserves them.

Figure 3. Pooled event-time histograms. The reference $\mu_0$ (blue) is flat on $[0,10]$; the data $\mu_1$ (orange) has three peaks near $t\approx3,5,8$ (the cluster centers, mixed across counts); the generated samples (green) closely track those peaks, so the flow has learned to pull uniform events onto the data modes.

Figure 4. Flow trajectories $s\mapsto t^k_s$ for a handful of reference sequences. Each line is a single event being transported from its uniform start ($s=0$) to a learned target ($s=1$); the lines fan toward the cluster centers, the ODE's view of the same transport seen pooled in Figure 3.

Headline numbers

Generation quality, averaged over 2000 ODE samples, against the uniform reference baseline:

Metric	Generated	Uniform reference	Improvement
count-distribution L1 error	0.048	—	counts preserved
mean event-time error	0.159	0.236	~33% lower
1-D Wasserstein (event times)	0.217	1.595	~7.4× lower
held-out velocity MSE	1.393	—	—
final train / val loss	1.354 / 1.323	—	best val 1.172
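The article does not spell out how the count and Wasserstein metrics are computed, so the following NumPy sketch is one plausible reading rather than the notebook's code: an L1 distance between empirical count histograms, and a 1-D Wasserstein-1 distance from the quantile-function formula. The function names and the quantile grid size are assumptions. The last two lines recompute the improvement column from the table's own numbers.

```python
import numpy as np

def count_l1(seqs_a, seqs_b, max_n):
    """L1 distance between the two empirical count distributions."""
    def dist(seqs):
        counts = np.bincount([len(s) for s in seqs], minlength=max_n + 1)
        return counts / counts.sum()
    return float(np.abs(dist(seqs_a) - dist(seqs_b)).sum())

def wasserstein_1d(x, y, n_quantiles=1000):
    """1-D Wasserstein-1 distance: mean absolute gap between quantile functions."""
    q = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return float(np.mean(np.abs(np.quantile(x, q) - np.quantile(y, q))))

# Sanity checks on toy inputs: a shift by 1 has Wasserstein distance 1.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=500)
w_shift = wasserstein_1d(x, x + 1.0)
l1_toy = count_l1([[1.0], [1.0, 2.0]], [[1.0], [1.0]], max_n=3)

# Improvement column, recomputed from the reported values.
wasserstein_ratio = round(1.595 / 0.217, 1)                    # 7.4
mean_time_reduction = round(100 * (0.236 - 0.159) / 0.236)     # 33 (percent)
```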

Where the flow works hardest

Breaking the errors down by count $n$ and within-sequence index $k$ (zero-based in the table, so $k=0$ is the first event) shows the transport is nearly exact for short sequences and isolated events, and works hardest on the first event of the three-event sequences: the mode at $t\approx2$ near the lower boundary, where the uniform reference gives the least help.

$(n,k)$	target center	mean-time error	Wasserstein
(1, 0)	5.0	0.056	0.066
(2, 0)	3.0	0.046	0.062
(2, 1)	7.0	0.081	0.146
(3, 0)	2.0	0.385	0.400
(3, 1)	5.0	0.105	0.285
(3, 2)	8.0	0.279	0.341

Reading the result. Three things line up with the theory. (1) The count distribution is reproduced almost exactly ($0.048$ L1) because the ODE only moves events — counts are sampled up front and conserved. (2) Event times move decisively from uniform toward the clusters: a $7.4\times$ drop in Wasserstein distance is the transport doing its job. (3) The residual error concentrates on the hardest mode, $(3,0)$ near $t=2$, where the three events are most crowded and the boundary clips the most mass — exactly where a uniform reference is least informative and the marginal field is hardest to learn.

The practical lesson: by committing to a balanced coupling and a linear noisy interpolant, flow matching extends cleanly to variable-length event sequences, and a single non-autoregressive pass transports a trivial reference onto a structured event distribution — no next-event loop, no cascading error.

Reproduce it

The companion notebook is a self-contained PyTorch reproduction at lightweight scale — fewer samples and epochs so it runs in seconds on a CPU — and it walks through every piece: the clustered data sampler, the balanced coupling and noisy interpolant, the pointwise vector-field MLP with the masked-MSE loss, the Euler ODE sampler, and the count / mean-time / Wasserstein metrics plus the reproduction plots. At reduced scale it reproduces the qualitative behavior — counts preserved, events pulled from the uniform reference onto the $[5]$, $[3,7]$, $[2,5,8]$ clusters, and generated errors well below the reference; the full configuration in the table above yields the reported numbers.
