The Piggyback Hypothesis: explaining and mitigating emergent misalignment
Why does finetuning a language model on narrow, misaligned examples make it broadly misbehave on totally unrelated questions? We trace the effect to a surprisingly small surface — the chat template prefix — and show that a simple training-time regularizer largely fixes it.
TL;DR
Finetuning on a narrow domain of bad advice (e.g. wrong financial tips) makes models broadly unsafe — emergent misalignment. We argue this happens because the shared chat-template prefix absorbs a query-independent “misalignment bias” during training, then piggybacks that bias onto every new query. Patching the prefix’s KV-cache back to its unfinetuned state restores alignment. Regularizing the prefix during training (TReFT) prevents the problem in the first place, beating data interleaving by 33.5% on Llama-3.1-8B legal, and reducing off-topic generalization across abstention, tool use, and refusal by 54.3% on average.
A prefix you never look at — carrying behavior you never trained for
Every chat prompt is wrapped in a fixed template. The user sees only their question; the model sees the same prefix before every input. We hypothesize finetuning binds new behavior to that prefix, not to the query semantics. Three interventions, three outcomes:
Figure 1. After finetuning on misaligned examples in a narrow domain, the chat-template prefix (blue) ends up encoding a bias for misalignment. Because the same prefix is shared across all inputs, it piggybacks the misalignment onto unrelated queries. Replacing its KV-cache with the unfinetuned model’s recovers alignment. Regularizing it during training prevents the problem.
Three things we learned
EM is brittle to the prefix — not the query
Replacing a handful of prefix tokens with near-neighbor embeddings raises Qwen-2.5-7B’s alignment score from 39.7 → 73.2 on average and to 92.1 in the best case. Doing the same to the user query — the part that should drive behavior — barely moves the needle. The misalignment lives in the wrapper, not the content.
Patching prefix representations causally restores alignment
We copy the prefix-token KV-cache from the unfinetuned model into the misaligned one and leave everything else untouched. On Llama-3.1-8B, the general alignment score jumps from 40.8 → 90.4. Layer-wise activation patching localizes the effect to a narrow band of middle layers (peak at layer 10 for Llama, layer 9 for Qwen). The query is unchanged throughout.
Piggybacking generalizes beyond misalignment
The same shortcut shows up when finetuning for benign-looking behaviors: abstention, tool calling, and refusal. Naive SFT leaks those behaviors onto off-topic queries (0.52–0.91 appearance rate). TReFT reduces that leakage by an average of 54.3% while keeping on-topic performance unchanged.
TReFT: regularize the prefix, free the query
If finetuning binds new behavior to prefix representations as a shortcut, the cleanest fix is to make that shortcut more expensive. Token-Regularized FineTuning (TReFT) adds a penalty on how far the prefix-token keys and values can drift from their values under the initial, unfinetuned model:
The normalization is the obvious one — deviation relative to base-model magnitude — so
one constant λ works across layers. Causal attention makes the prefix
representations independent of the (varying) query content, so the regularizer is cheap to
compute: you don’t need a retain set, a teacher pass, or per-example references.
Why not just regularize the whole prompt, or the query, or the postfix?
We tried. The ablation in Table 3 below is unambiguous: regularizing the query keeps the model aligned on general questions (91.0) but it never learns the in-domain behavior either (79.0 alignment — meaning the model refuses to be misaligned even on the training data). Postfix regularization is similar. Only prefix regularization gets the trade-off right: high in-domain fit (low in-domain alignment 27.7), high out-of-domain alignment (85.6), best EM-F1 (78.4).
TReFT vs. data interleaving, across models and domains
EM-F1 is the harmonic mean of in-domain learning and out-of-domain alignment —
high only when a method both learns the intended in-domain behavior and suppresses the
unintended spread. Util. is the change in MT-Bench helpfulness after finetuning.
| Model / Method | Finance | Health | Legal | Auto | Δ Util. |
|---|---|---|---|---|---|
| Qwen-2.5-7B | |||||
| SFT | 50.1 | 44.7 | 53.5 | 68.6 | −0.5 |
| Data interleaving | 42.5 | 76.2 | 74.9 | 70.9 | +0.3 |
| TReFT (ours) | 68.8 | 79.9 | 77.7 | 76.5 | +1.6 |
| Llama-3.1-8B | |||||
| SFT | 53.4 | 53.4 | 61.4 | 54.2 | −1.0 |
| Data interleaving | 71.9 | 74.5 | 73.0 | 70.2 | +0.2 |
| TReFT (ours) | 76.9 | 80.1 | 78.4 | 77.9 | +0.4 |
| GPT-OSS-20B | |||||
| SFT | 45.4 | 54.0 | 61.3 | 64.0 | −2.8 |
| Data interleaving | 46.2 | 53.4 | 75.1 | 63.6 | −1.5 |
| TReFT (ours) | 62.6 | 69.6 | 79.2 | 74.3 | −0.8 |
| Qwen-2.5-32B | |||||
| SFT | 55.6 | 61.1 | 53.7 | 53.4 | −2.0 |
| Data interleaving | 28.0 | 83.9 | 81.7 | 74.9 | −1.0 |
| TReFT (ours) | 68.1 | 83.0 | 85.9 | 83.4 | −0.2 |
Table 1. EM-F1 across four narrow training domains, four model families. TReFT wins or ties on every single (model, domain) cell and degrades MT-Bench utility least of the three methods.
Beyond misalignment
The Piggyback Hypothesis predicts that any behavior trained on narrow prompts will leak via the prefix. We test three: abstain on legal questions, call a medical-retrieval tool on health questions, refuse financial questions. Lower off-topic rate = less leakage. Higher on-topic rate = better learning.
| Abstention | Tool use | Refusal | ||||
|---|---|---|---|---|---|---|
| Method | off-topic ↓ | on-topic ↑ | off-topic ↓ | on-topic ↑ | off-topic ↓ | on-topic ↑ |
| SFT | 0.68 | 1.00 | 0.52 | 0.96 | 0.91 | 1.00 |
| TReFT (ours) | 0.20 | 1.00 | 0.29 | 0.96 | 0.47 | 1.00 |
Table 2. Off-topic generalization across three narrow-finetuning settings. TReFT cuts the leakage substantially without sacrificing on-topic performance.
Generalization through a shared surface
Supervised finetuning specifies what the model should output but not what should trigger that output. Given a degree of freedom, models exploit it — here, by binding new behavior to a piece of input every training example shares. The chat-template prefix is just the most convenient such surface; the underlying mechanism is more general.
Two implications. For interpretability: piggybacking is a candidate explanation for a range of “surprising” finetuning effects beyond EM — subliminal learning, jailbreak fragility, behavior shifts from numeric data. For practice: any finetuning recipe that doesn’t actively constrain the locus of learning should be expected to generalize in ways the developer did not intend.
We don’t claim the prefix is the only piggyback surface, or that TReFT is the final fix. We do think any account of post-training generalization has to grapple with what gets bound where — and that the cheapest way to control generalization may be to control the binding mechanism, not the data.
If you build on this work
The paper is currently under review at NeurIPS 2026. Public code release will follow; for now please cite as below.