_posts/2025-08-18-diff-distill.md (8 additions, 2 deletions)
At their core, diffusion models (equivalently, flow matching models) operate by iteratively refining noisy data into high-quality outputs through a series of denoising steps. Similar to divide-and-conquer algorithms <d-footnote>Common examples include merge sort, median finding, and the Fast Fourier Transform.</d-footnote>, diffusion models first *divide* the difficult denoising task into subtasks and *conquer* one of these at a time during training. To obtain a sample, however, we must make a sequence of recursive predictions, which means *conquering* the entire task end-to-end.
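To make the end-to-end cost concrete, here is a minimal sketch of this recursive sampling loop using a plain Euler solver for the probability flow ODE. The `velocity` network is a placeholder, and the time convention ($$t=1$$ is pure noise, $$t=0$$ is data) and uniform step size are illustrative choices, not any specific paper's setup; the point is that each refinement step costs one function evaluation.

```python
import torch

@torch.no_grad()
def euler_sample(velocity, x_T, num_steps=50):
    """Integrate the probability flow ODE from t=1 (noise) to t=0 (data).

    Each loop iteration calls the network once, so the NFE equals num_steps.
    """
    x = x_T
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        v = velocity(x, t)           # predicted velocity dx/dt at time t
        x = x + (t_next - t) * v     # Euler step toward the data endpoint
    return x
```

Naively shrinking `num_steps` degrades sample quality, which is exactly the gap the distillation methods below aim to close.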
This challenge has spurred research into acceleration strategies at multiple levels of granularity, including hardware optimization, mixed-precision training<d-cite key="micikevicius2017mixed"></d-cite>, [quantization](https://github.com/bitsandbytes-foundation/bitsandbytes), parameter-efficient fine-tuning<d-cite key="hu2021lora"></d-cite>, and advanced solvers<d-cite key="lu2025dpm"></d-cite>. In this blog, we focus on an orthogonal approach named **Ordinary Differential Equation (ODE) distillation**. This method introduces an auxiliary structure that bypasses explicit ODE solving, thereby reducing the Number of Function Evaluations (NFEs). As a result, we can generate high-quality samples with fewer denoising steps.
Distillation, in general, is a technique that transfers knowledge from a complex, high-performance model (the *teacher*) to a more efficient, customized model (the *student*). Recent distillation methods have achieved remarkable reductions in sampling steps, from hundreds down to a few or even **one**, while preserving sample quality. This advancement paves the way for real-time applications and deployment in resource-constrained environments.
$$\require{physics}
\dv{t}f^\theta_{t \to 0}(\mathbf{x}, t, 0) = 0.
$$
This is intuitive since every point on the same probability flow ODE (\ref{eq:1}) trajectory should be mapped to the same clean data point $$\mathbf{x}_0$$.
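To spell this out, differentiate along a trajectory $$\mathbf{x}_t$$ of the probability flow ODE; writing $$\mathbf{v}(\mathbf{x}_t, t)$$ for its velocity field (our shorthand here), the chain rule gives

$$\require{physics}
\dv{t}f^\theta_{t \to 0}(\mathbf{x}_t, t, 0)
= \pdv{t}f^\theta_{t \to 0}(\mathbf{x}_t, t, 0)
+ \mathbf{v}(\mathbf{x}_t, t) \cdot \nabla_{\mathbf{x}} f^\theta_{t \to 0}(\mathbf{x}_t, t, 0)
= 0,
$$

since $$\dv{\mathbf{x}_t}{t} = \mathbf{v}(\mathbf{x}_t, t)$$ along the ODE. The flow map is therefore constant along each trajectory and always returns its endpoint $$\mathbf{x}_0$$.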
By substituting the parameterization of FACM, we have
$$\require{physics}
Notice this is equivalent to [MeanFlow](#meanflow) where $$s=0$$.
<span style="color: blue; font-weight: bold;">Training</span>: FACM training algorithm equipped with our flow map notation. Notice that $$d_1, d_2$$ are the $\ell_2$ with cosine loss<d-footnote>$L_{\cos}(\mathbf{x}, \mathbf{y}) = 1 - \dfrac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_{2} \,\|\mathbf{y}\|_{2}}$</d-footnote> and the norm $\ell_2$ loss<d-footnote>$L_{\text{norm}}(\mathbf{x}, \mathbf{y}) =\dfrac{\|\mathbf{x}-\mathbf{y}\|^2}{\sqrt{\|\mathbf{x}-\mathbf{y}\|^2+c}}$ where $c$ is a small constant. This is a special case of the adaptive L2 loss proposed in MeanFlow<d-cite key="geng2025mean"></d-cite>.</d-footnote>, respectively, plus reweighting. Interestingly, the authors separate the training of FM and CM onto disentangled time intervals: when training with the CM target, we let $$s=0, t\in[0,1]$$; when training with FM anchors, we set $$t'=2-t$$ so that $$t'\in[1,2]$$.
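The two distances above are simple to write down; here is a minimal PyTorch sketch following the formulas in the footnotes. The per-sample flattening, the mean reduction over the batch, and the values of `eps` and `c` are our assumptions, not details taken from the FACM paper.

```python
import torch

def cosine_l2_loss(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """L_cos(x, y) = 1 - <x, y> / (||x||_2 ||y||_2), averaged over the batch."""
    x_flat, y_flat = x.flatten(1), y.flatten(1)
    cos = (x_flat * y_flat).sum(dim=1) / (x_flat.norm(dim=1) * y_flat.norm(dim=1) + eps)
    return (1.0 - cos).mean()

def norm_l2_loss(x: torch.Tensor, y: torch.Tensor, c: float = 1e-3) -> torch.Tensor:
    """L_norm(x, y) = ||x - y||^2 / sqrt(||x - y||^2 + c), averaged over the batch."""
    sq = (x - y).flatten(1).pow(2).sum(dim=1)
    return (sq / torch.sqrt(sq + c)).mean()
```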
<divclass="row mt-3">
<div class="col-sm mt-3 mt-md-0">
{% include figure.liquid loading="eager" path="/blog/2025/diff-distill/FACM_training.png" class="img-fluid rounded z-depth-1" %}
<divclass="caption">
The modified training algorithm of FACM<d-cite key="peng2025flow"></d-cite>. All notation is adapted to our flow map.