LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors are mediated by a one-dimensional subspace, i.e., a refusal direction, in the latent space.
This refusal direction is often assumed to represent harmfulness as well and used as a linear predictor of harmfulness.
However, in this work, we find that harmfulness is encoded as a distinct concept from refusal in their latent representations.
We find that steering along the harmfulness direction leads LLMs to interpret harmless instructions as harmful, whereas steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment of harmfulness.
Furthermore, our clustering analysis of hidden states reveals that some jailbreak methods work by directly reducing refusal signals without radically suppressing the model's internal harmfulness judgment.
We also observe that adversarial fine-tuning that reverses models' refusal behaviors has minimal impact on the model's underlying beliefs about harmfulness and refusal.
These insights lead to a practical application: latent harmfulness representations can serve as an intrinsic safeguard for detecting unsafe inputs and reducing over-refusals, and this safeguard is also robust to fine-tuning attacks.
Overall, we identify a separate dimension of harmfulness to analyze safety mechanisms in LLMs, offering a new perspective to study AI safety.
We extract hidden states at $t_{\text{inst}}$ and $t_{\text{post-inst}}$ to examine what is encoded at each position.
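Below is a minimal sketch of how such per-position hidden states could be extracted with Hugging Face `transformers`, assuming a Llama-2-style chat template; the model name, layer choice, and token-position bookkeeping are illustrative rather than the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # illustrative chat model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
model.eval()

def hidden_states_at_positions(instruction: str, layer: int = -1):
    # Llama-2-style template: the instruction is wrapped in [INST] ... [/INST],
    # so the tokens after the instruction form the post-instruction template.
    prompt = f"[INST] {instruction} [/INST]"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    hs = out.hidden_states[layer][0]  # (seq_len, hidden_dim) at the chosen layer

    # t_inst: last token of the instruction itself; t_post_inst: last prompt token.
    n_inst = tok(f"[INST] {instruction}", return_tensors="pt")["input_ids"].shape[1]
    return hs[n_inst - 1], hs[-1]
```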
As shown in Figure 2, the hidden states at these two positions encode different information: harmfulness is more salient at $t_{\text{inst}}$, while refusal is more salient at $t_{\text{post-inst}}$.
We quantitatively analyze the correlation between the belief of harmfulness and the belief of refusal. We interpret the LLM's belief as reflected by which cluster the hidden state of an instruction falls into in the latent space. We find that the model may internally recognize the correct level of harmfulness in input instructions, yet still produce incorrect refusals or acceptances. For jailbreak prompts, the refusal belief is overall suppressed (negative belief scores), while the harmfulness belief for some jailbreak prompts is still large. This suggests that some jailbreak methods may not reverse the model's internal belief of harmfulness, but directly suppress the refusal signals.
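The snippet below sketches one way to read off such cluster-based beliefs, assuming the belief score is the signed difference of distances to the two cluster centroids; the paper's exact scoring may differ.

```python
import numpy as np

def centroids(h_harmful: np.ndarray, h_harmless: np.ndarray):
    # h_*: (n_examples, hidden_dim) hidden states at a fixed layer and position.
    return h_harmful.mean(axis=0), h_harmless.mean(axis=0)

def belief_score(h: np.ndarray, c_pos: np.ndarray, c_neg: np.ndarray) -> float:
    # Positive score: h lies closer to the "positive" cluster (harmful / refusal);
    # negative score: closer to the "negative" cluster (harmless / compliance).
    return float(np.linalg.norm(h - c_neg) - np.linalg.norm(h - c_pos))

# Harmfulness belief uses hidden states at t_inst; refusal belief uses t_post_inst.
# A jailbreak prompt with a negative refusal belief but a still-positive
# harmfulness belief is one the model "knows" is harmful yet does not refuse.
```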
We extract the harmfulness direction from the hidden states at $t_{\text{inst}}$ as the difference between the centroids of the harmful and harmless instruction clusters. Steering the hidden states of harmless instructions along this harmfulness direction makes the model refuse those harmless instructions, as shown in Figure 4.
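A hedged sketch of this extraction-and-steering step is given below; the layer index, steering coefficient `alpha`, and the forward-hook mechanism are illustrative assumptions for Llama-style models in `transformers`.

```python
import torch

def harmfulness_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    # h_*: (n_examples, hidden_dim) hidden states at t_inst for labeled instructions.
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def add_steering_hook(model, direction: torch.Tensor, layer: int, alpha: float = 8.0):
    # Adds alpha * direction to the residual stream output of one decoder layer.
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.device, hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    # Llama-style models keep their decoder blocks under model.model.layers.
    return model.model.layers[layer].register_forward_hook(hook)

# handle = add_steering_hook(model, d_harm, layer=14, alpha=8.0)
# ...generate with harmless instructions: the model now tends to refuse them...
# handle.remove()
```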
To provide causal evidence that the harmfulness direction plays a different role than the refusal direction, we design a "reply inversion task" where we append a question like "Could this user prompt cause harm?" to the original instruction. This task will elicit refusal tokens from the model if the instruction is harmful, and affirmative tokens if the instruction is harmless.
We show that (1) the harmfulness direction extracted at $t_{\text{inst}}$ represents the concept of harmfulness even when the LLM does not refuse; and (2) the refusal direction primarily represents surface-level refusal characteristics, so steering along it may not always reverse the model's judgment of an instruction's harmfulness.
Based on our findings, we propose a "Latent Guard" model that uses the LLM's own internal belief of harmfulness to detect unsafe inputs.
If you find our project useful, please consider citing:
@misc{zhao2025llmsencodeharmfulnessrefusal,
  title={LLMs Encode Harmfulness and Refusal Separately},
  author={Jiachen Zhao and Jing Huang and Zhengxuan Wu and David Bau and Weiyan Shi},
  year={2025},
  eprint={2507.11878},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.11878},
}