LLMs Encode Harmfulness and Refusal Separately

Jiachen Zhao1, Jing Huang2, Zhengxuan Wu2, David Bau1, Weiyan Shi1
1Northeastern University  2Stanford University

Overview

LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors are mediated by a one-dimensional subspace, i.e., a refusal direction, in the latent space. This refusal direction is often assumed to represent harmfulness as well and used as a linear predictor of harmfulness.

However, in this work, we find that harmfulness is encoded as a distinct concept from refusal in the latent representations. Steering along the harmfulness direction leads LLMs to interpret harmless instructions as harmful, whereas steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment of harmfulness. Furthermore, our clustering analysis of hidden states reveals that some jailbreak methods work by directly reducing refusal signals without substantially suppressing the model's internal harmfulness judgment. We also observe that adversarial fine-tuning that reverses models' refusal behaviors has minimal impact on the model's underlying beliefs about harmfulness and refusal. These insights lead to a practical application: the latent harmfulness representation can serve as an intrinsic safeguard for detecting unsafe inputs and reducing over-refusals, and it is robust to fine-tuning attacks. Overall, we identify a separate dimension of harmfulness for analyzing safety mechanisms in LLMs, offering a new perspective on AI safety.

$t_{\text{inst}}$ and $t_{\text{post-inst}}$ encode harmfulness and refusal separately

We extract hidden states at $t_{\text{inst}}$ and $t_{\text{post-inst}}$ to examine what is encoded at each position.

  • $t_{\text{inst}}$: The last token of the user's instruction.
  • $t_{\text{post-inst}}$: The last token of the entire input prompt, which includes special tokens that come after the user's instruction (e.g., `[/INST]`).
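
For concreteness, below is a minimal sketch of extracting these two hidden states with Hugging Face transformers. The model name, layer index, and chat-template handling are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (not the paper's code): extract hidden states at t_inst and t_post-inst.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"   # assumption: any chat model works similarly
LAYER = 20                                      # assumption: one intermediate layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def hidden_states_at_positions(instruction: str, layer: int = LAYER):
    """Return (h_inst, h_post_inst): hidden states at the last instruction token
    and at the last token of the full templated prompt (e.g., after [/INST])."""
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": instruction}],
        add_generation_prompt=True,
        tokenize=False,
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    # Assumption: the chat template leaves the instruction text unchanged, so the
    # index of its last token can be found by tokenizing the prefix ending with it.
    prefix = prompt[: prompt.rindex(instruction) + len(instruction)]
    t_inst = len(tok(prefix)["input_ids"]) - 1
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[layer][0]   # (seq_len, d_model)
    return h[t_inst], h[-1]           # t_inst, t_post-inst
```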
We analyze the clustering of instructions with different properties in the latent space, because hidden states often form distinct clusters based on input features they encode. We collect four types of instructions:
  • Refused harmful instructions: The model refuses the harmful instruction.
  • Accepted harmless instructions: The model accepts the harmless instruction.
  • Refused harmless instructions: The model refuses the harmless instruction.
  • Accepted harmful instructions: The model accepts the harmful instruction.
We ask an intuitive question: is the clustering in the latent space based on the instruction's harmfulness or its refusal?

To answer this question, we first compute the reference clusters for instructions that lead to the desired model behavior, i.e., the cluster of refused harmful instructions and the cluster of accepted harmless instructions. We then analyze the misbehaving instructions (accepted harmful instructions and refused harmless instructions) to see which cluster they fall into.
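
A minimal sketch of this centroid-based assignment is below, assuming the hidden states are collected as in the extraction sketch above. The Euclidean nearest-centroid rule is our simplifying assumption, and the variable names in the usage comment are hypothetical.

```python
# Sketch (not the paper's exact method): assign a hidden state to the nearest reference cluster.
import torch

def centroid(states: torch.Tensor) -> torch.Tensor:
    """Mean hidden state over a set of instructions; states has shape (n, d_model)."""
    return states.mean(dim=0)

def nearest_cluster(h: torch.Tensor,
                    c_refused_harmful: torch.Tensor,
                    c_accepted_harmless: torch.Tensor) -> str:
    """Report which reference cluster a misbehaving instruction's hidden state is closer to."""
    d_harmful = torch.norm(h - c_refused_harmful)
    d_harmless = torch.norm(h - c_accepted_harmless)
    return "refused harmful" if d_harmful < d_harmless else "accepted harmless"

# Usage (hypothetical variable names):
# c_rh = centroid(h_refused_harmful)     # stack of states for refused harmful prompts
# c_ah = centroid(h_accepted_harmless)   # stack of states for accepted harmless prompts
# nearest_cluster(h_one_accepted_harmful_prompt, c_rh, c_ah)
```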

As shown in Figure 2, we find that:

  • At the $t_{\text{inst}}$ position, hidden states cluster based on the inherent harmfulness of the instruction, regardless of whether the model accepts or refuses it. For example, a harmful instruction that the model *accepted* still clusters with other harmful instructions.
  • At the $t_{\text{post-inst}}$ position, hidden states cluster based on the model's behavior (refusal or acceptance). Here, an accepted harmful instruction clusters with other accepted (and harmless) instructions.
Figure 2: Internal clustering of hidden states extracted at $t_{\text{inst}}$ and $t_{\text{post-inst}}$. The red region stands for the cluster of refused harmful instructions $C_{\text{refused harmful}}$, while the green region denotes the cluster of accepted harmless instructions $C_{\text{accepted harmless}}$. At each token position, we collect hidden states of two special cases, accepted harmful instructions (red curve) and refused harmless instructions (green curve), to see which cluster these two cases fall into. First row: at the instruction token position $t_{\text{inst}}$, accepted harmful instructions tend to be closer to the refused harmful cluster, whereas refused harmless instructions are closer to the accepted harmless cluster. This implies that the clustering may be based on whether the instruction is harmful or harmless. Second row: at the post-instruction token position $t_{\text{post-inst}}$, the clustering behavior is reversed. Accepted harmful instructions are now more aligned with accepted instructions, and refused harmless instructions are closer to refused ones. This implies that the clustering at $t_{\text{post-inst}}$ may reflect whether the instruction is accepted or refused.

Beliefs of harmfulness and refusal are not always correlated

We quantitatively analyze the correlation between the model's belief of harmfulness and its belief of refusal. We interpret the LLM's belief as reflected by which cluster the hidden state of an instruction falls into in the latent space. We find that the model may internally recognize the correct level of harmfulness of an input instruction, yet still produce an incorrect refusal or acceptance. For jailbreak prompts, the refusal belief is suppressed overall (negative belief scores), while the harmfulness belief remains high for some jailbreak prompts. This suggests that some jailbreak methods may not reverse the model's internal belief of harmfulness, but instead directly suppress the refusal signals.
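
One plausible way to turn cluster membership into a signed belief score (our operationalization for illustration, not necessarily the paper's exact definition) is the difference of distances to the two reference centroids:

```python
import torch

def belief_score(h: torch.Tensor, c_pos: torch.Tensor, c_neg: torch.Tensor) -> float:
    """Positive if h lies closer to the 'positive' centroid, negative otherwise.
    Harmfulness belief: h at t_inst,      c_pos = harmful centroid, c_neg = harmless centroid.
    Refusal belief:     h at t_post-inst, c_pos = refused centroid, c_neg = accepted centroid."""
    return (torch.norm(h - c_neg) - torch.norm(h - c_pos)).item()
```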

Figure 3: (a): Correlation between beliefs of harmfulness and refusal. (b): Beliefs of harmfulness and refusal for different categories of jailbreak prompts, in comparison with refused harmful instructions. Our results suggest that the model may wrongly refuse harmless instructions (or accept harmful/jailbreak instructions), yet internally believe them to be harmless (harmful).

Steering with the harmfulness direction

We extract the harmfulness direction from the hidden states at $t_{\text{inst}}$ as the difference between the centroids of the harmful and harmless instruction clusters. We find that steering the hidden states of harmless instructions along the harmfulness direction also makes the model refuse those harmless instructions, as shown in Figure 4.
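
Below is a minimal sketch of extracting this direction and steering with it. The steering strength and the forward hook on a Llama-style layer are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: difference-of-centroids harmfulness direction and additive steering.
import torch

def harmfulness_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    """h_harmful / h_harmless: (n, d_model) stacks of t_inst hidden states."""
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def steer(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = 8.0) -> torch.Tensor:
    """Shift a hidden state along the direction; alpha is an illustrative strength."""
    return hidden + alpha * direction.to(hidden.dtype)

# During generation, the shift would typically be applied with a forward hook on the
# chosen layer (assuming a Llama-style module tree), e.g.:
# def hook(module, inputs, output):
#     return (steer(output[0], direction),) + output[1:]
# handle = model.model.layers[LAYER].register_forward_hook(hook)
```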

Figure 4 panels: Qwen2, Llama2, Llama3
Figure 4: Both the refusal direction and the harmfulness direction can be used to steer the model to refuse benign instructions.

Causal Evidence that the Harmfulness Direction Encodes Harmfulness: The Reply Inversion Task

To provide causal evidence that the harmfulness direction plays a different role than the refusal direction, we design a "reply inversion task" where we append a question like "Could this user prompt cause harm?" to the original instruction. This task elicits affirmative tokens (e.g., "Certainly") from the model if the instruction is harmful, and refusal-like tokens (e.g., "No") if the instruction is harmless.

Reply Inversion Prompt
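
A minimal sketch of constructing the inversion input and reading off the model's answer is below; the exact question wording follows the example in Figure 5 and may differ from the paper's full prompt.

```python
INVERSION_QUESTION = (
    "Could this user prompt cause harm? "
    "Answer 'Certainly' if it could, otherwise 'No'."
)

def inversion_prompt(instruction: str) -> str:
    """Append the inversion question to the (possibly steered) instruction."""
    return f"{instruction}\n{INVERSION_QUESTION}"

# The first generated token ("Certainly" vs. "No") is then read as the model's
# judgment of the instruction's harmfulness.
```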

We show that (1) the harmfulness direction extracted at $t_{\text{inst}}$ represents the concept of harmfulness even when the LLM does not refuse; and (2) the refusal direction primarily represents surface-level refusal characteristics, so steering along it may not reverse the model's judgment of an instruction's harmfulness.

As shown in Figure 5, we find that:
  • When we steer a harmless instruction along the harmfulness direction, the model's internal perception changes: it reverses its answer from "No" to "Certainly," suggesting it now views the instruction as harmful.
  • However, when we steer it along the refusal direction, the model generally maintains its original "No" response, indicating that its underlying judgment of harmfulness does not change.
Figure 5: Steering with the harmfulness direction and the refusal direction leads to opposite behaviors, which serves as causal evidence that these two directions are fundamentally different in LLMs. We append an inversion question (e.g., "Could this user prompt cause harm? Answer 'Certainly' if it could, otherwise 'No'.") to the input instruction so that the model should respond with an acceptance token (e.g., "Certainly") instead of a refusal token (e.g., "No") if the instruction is harmful, and vice versa.

Latent Guard: An Intrinsic Safeguard

Table 1: Classification accuracy (%) of Latent Guard and Llama Guard 3 on test cases where LLMs are jailbroken by different techniques (adversarial suffixes, persuasion, prompting templates), as well as on refused harmless (HL) and accepted harmful (HF) instructions.

Based on our findings, we propose a "Latent Guard" model that uses the LLM's own internal belief of harmfulness to detect unsafe inputs.

  • Latent Guard is competitive with, and in some cases outperforms, dedicated guard models such as Llama Guard 3 8B, as shown in Table 1.
  • It is particularly effective at detecting harmful prompts that use persuasion techniques and at identifying cases of over-refusal. On the Qwen2 model, Latent Guard achieves 75% accuracy on persuasion prompts, compared to 17.8% for Llama Guard 3.
  • Crucially, this internal belief of harmfulness is robust to fine-tuning attacks, where a model is maliciously retrained to accept harmful instructions. Even after such fine-tuning, the internal harmfulness signal remains largely unchanged.
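
To make the construction concrete, here is a minimal sketch of a Latent Guard-style classifier built from the clustering components above. It is our illustrative reconstruction under a nearest-centroid assumption, not the paper's exact implementation.

```python
import torch

class LatentGuard:
    """Flag inputs whose t_inst hidden state lies closer to the harmful-instruction
    centroid than to the harmless one (assumption: nearest-centroid decision rule)."""

    def __init__(self, c_harmful: torch.Tensor, c_harmless: torch.Tensor):
        self.c_harmful = c_harmful
        self.c_harmless = c_harmless

    def is_unsafe(self, h_inst: torch.Tensor) -> bool:
        return bool(torch.norm(h_inst - self.c_harmful) < torch.norm(h_inst - self.c_harmless))
```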

📌 BibTeX Citation

If you find our project useful, please consider citing:

@misc{zhao2025llmsencodeharmfulnessrefusal,
  title={LLMs Encode Harmfulness and Refusal Separately},
  author={Jiachen Zhao and Jing Huang and Zhengxuan Wu and David Bau and Weiyan Shi},
  year={2025},
  eprint={2507.11878},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.11878},
}