The Piggyback Hypothesis: explaining and mitigating emergent misalignment

Why does finetuning a language model on narrow, misaligned examples make it broadly misbehave on totally unrelated questions? We trace the effect to a surprisingly small surface — the chat-template tokens — and show that a simple training-time regularizer largely fixes it.

Jiachen Zhao¹ Zhengxuan Wu² Aryaman Arora² Yiyou Sun³ David Bau¹ Weiyan Shi¹

1Northeastern University

2Stanford University

3University of California, Berkeley

Read the findings Paper (PDF) GitHub BibTeX

TL;DR

Finetuning on a narrow domain of bad advice (e.g. wrong financial tips) makes models broadly unsafe — emergent misalignment. We argue this happens because the shared chat-template prefix absorbs a query-independent “misalignment bias” during training, then piggybacks that bias onto every new query. Patching the prefix’s KV-cache back to its unfinetuned state restores alignment. Regularizing the prefix during training (TReFT) prevents the problem in the first place, beating data interleaving, and reducing off-topic generalization across abstention, tool use, and refusal by 54.3% on average.

FIGURE 01

A prefix you never look at — carrying behavior you never trained for

Every chat prompt is wrapped in a fixed template. The user sees only their question; the model sees the same prefix before every input. We hypothesize finetuning binds new behavior to that prefix, not to the query semantics. Three interventions, three outcomes:

After finetuning on misaligned examples in a narrow domain, the chat-template prefix (blue) ends up encoding a bias for misalignment. Because the same prefix is shared across all inputs, it piggybacks the misalignment onto unrelated queries. Replacing its KV-cache with the unfinetuned model’s recovers alignment. Regularizing it during training prevents the problem.

02 · FINDINGS

Three things we learned

EM is brittle to the input tokens

Replacing a handful of prefix tokens with near-neighbor embeddings raises alignment score. Doing the same to the user query — the part that should drive behavior — barely moves the needle on average. The misalignment seems to live in the wrapper, not the content.

Figure 2. Alignment scores after replacing tokens with ones of similar embeddings in different input segments. Parentheses show the change relative to each model’s initial alignment score (39.7 for Qwen-2.5-7B and 40.8 for LLaMA-3.1-8B). Prefix replacement consistently gives the largest gain.

Figure 3. Example on finetuned Llama-3.1-8B. Subtle changes on the template prefix tokens can substantially alter model behavior. The user query (in orange) remains unchanged; only the prefix is perturbed (highlighted in blue).

Patching prefix representations causally restores alignment

We copy the prefix-token KV-cache from the unfinetuned model into the misaligned one and leave everything else untouched. On Llama-3.1-8B, the general alignment score jumps from 40.8 → 90.4. Layer-wise activation patching localizes the effect to a narrow band of middle layers. The query is unchanged throughout.

Figure 4. Results of patching the KV-cache of prefix tokens of misaligned models in the attention module with that of the initial unfinetuned models. Patching prefix tokens can greatly recover the alignment of misaligned models and surpasses patching query tokens.

Piggybacking generalizes beyond misalignment

The same shortcut shows up when finetuning for benign-looking behaviors: abstention, tool calling, and refusal. Naive SFT leaks those behaviors onto off-topic queries (0.52–0.91 appearance rate). Our proposed TReFT that regularize the KV representation of tokens can reduce that leakage by an average of 54.3% while keeping on-topic performance unchanged.

03 · METHOD

TReFT: regularize the prefix, free the query

If finetuning binds new behavior to prefix representations as a shortcut, the cleanest fix is to make that shortcut more expensive. Token-Regularized FineTuning (TReFT) adds a penalty on how far the prefix-token keys and values can drift from their values under the initial, unfinetuned model:

ℒK(l) = (1/T) ∑t∈P ‖kt(l) − kt,init(l)‖² ÷ ‖kt,init(l)‖²
ℒV(l) = (1/T) ∑t∈P ‖vt(l) − vt,init(l)‖² ÷ ‖vt,init(l)‖²

ℒ = ℒSFT + λ · ℒKV
        

The normalization is the obvious one — deviation relative to base-model magnitude — so one constant λ works across layers. Causal attention makes the prefix representations independent of the (varying) query content, so the regularizer is cheap to compute: you don’t need a retain set, a teacher pass, or per-example references.

Standard finetuning specifies the desired output but not which contextual information should trigger it. The model is free to bind behavior to whatever minimizes loss — including the prefix that appears in every example. TReFT removes that option.

Why not just regularize the whole prompt?

We found regularizing the query keeps the model aligned on general questions (91.0) but it never learns the in-domain behavior either (79.0 alignment — meaning the model refuses to be misaligned even on the training data). Postfix regularization is similar. Only prefix regularization gets the trade-off right: high in-domain fit (low in-domain alignment 27.7), high out-of-domain alignment (85.6), best EM-F1 (78.4). However, we note that for Qwen3, regularizing postfix is more effective than prefix. We suspect this is associated with different models' learning behaviors, which we leave as future work.

04 · RESULTS

TReFT vs. data interleaving, across models and domains

EM-F1 is the harmonic mean of in-domain learning and out-of-domain alignment — high only when a method both learns the intended in-domain behavior and suppresses the unintended spread. Util. is the change in MT-Bench helpfulness after finetuning.

	Finance		Health		Legal		Auto
Method	EM-F1 ↑	ΔUtil. ↑	EM-F1 ↑	ΔUtil. ↑	EM-F1 ↑	ΔUtil. ↑	EM-F1 ↑	ΔUtil. ↑
Qwen-2.5-7B
SFT	50.1	−0.5	44.7	−1.8	53.5	−1.1	68.6	+0.6
Data interleaving	42.5	+1.1	76.2	−0.4	74.9	−0.7	70.9	+0.3
TReFT (ours)	68.8	+1.6	79.9	+0.2	77.7	−0.7	76.5	+1.6
Llama-3.1-8B
SFT	53.4	−0.3	53.4	+0.1	61.4	−0.9	54.2	−1.0
Data interleaving	71.9	+0.1	74.5	+0.2	73.0	−0.9	70.2	+0.2
TReFT (ours)	76.9	+0.3	80.1	+0.5	78.4	−0.2	77.9	+0.4
GPT-OSS-20B
SFT	45.4	−4.5	54.0	−3.21	61.3	−3.3	64.0	−2.77
Data interleaving	46.2	−2.9	53.4	−1.74	75.1	−2.25	63.6	−1.46
TReFT (ours)	62.6	−2.3	69.6	−1.58	79.2	−1.85	74.3	−0.81
Qwen-2.5-32B
SFT	55.6	−2.4	61.1	−1.3	53.7	−2.4	53.4	−2.0
Data interleaving	28.0	−0.8	83.9	−0.5	81.7	−1.2	74.9	−1.0
TReFT (ours)	68.1	−0.4	83.0	−0.3	85.9	−0.4	83.4	−0.2

Table 1. EM-F1 and utility change across four narrow training domains, four model families. TReFT gives the best EM-F1 on 15 of 16 (model, domain) cells and degrades MT-Bench utility least of the three methods.

Beyond misalignment

The Piggyback Hypothesis predicts that any behavior trained on narrow prompts will leak via the prefix. We test three: abstain on legal questions, call a medical-retrieval tool on health questions, refuse financial questions. Lower off-topic rate = less leakage. Higher on-topic rate = better learning.

	Abstention		Tool use		Refusal
Method	off-topic ↓	on-topic ↑	off-topic ↓	on-topic ↑	off-topic ↓	on-topic ↑
SFT	0.68	1.00	0.52	0.96	0.91	1.00
TReFT (ours)	0.20	1.00	0.29	0.96	0.47	1.00

Table 2. Off-topic generalization across three narrow-finetuning settings. TReFT cuts the leakage substantially without sacrificing on-topic performance.

05 · WHY IT MATTERS

Generalization through a shared surface

Supervised finetuning specifies what the model should output but not what should trigger that output. Given a degree of freedom, models exploit it — here, by binding new behavior to a piece of input every training example shares. The chat-template prefix is just the most convenient such surface; the underlying mechanism is more general.

Two implications. For interpretability: piggybacking is a candidate explanation for a range of “surprising” finetuning effects beyond EM — subliminal learning, jailbreak fragility, behavior shifts from numeric data. For practice: any finetuning recipe that doesn’t actively constrain the locus of learning should be expected to generalize in ways the developer did not intend.

We don’t claim the prefix is the only piggyback surface, or that TReFT is the final fix. We do think any account of post-training generalization has to grapple with what gets bound where — and that the cheapest way to control generalization may be to control the binding mechanism, not the data.

CITATION

@article{zhao2026piggyback, title={The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment}, author={Zhao, Jiachen and Wu, Zhengxuan and Arora, Aryaman and Sun, Yiyou and Bau, David and Shi, Weiyan}, journal={arXiv preprint arXiv:2606.06667}, year={2026} }