What Trait Vectors Reveal (and Don't) About Emergent Misalignment
The question
Betley et al. (2025) showed that fine-tuning on insecure code makes models misaligned across unrelated tasks (advocating deception, expressing contempt for humans), but their analysis was purely behavioral: you could see that something broke, but not what changed inside the model. Lu et al. (2026) built a tool for measuring a model’s “persona” from its internal activations: 240 trait vectors (such as skeptical, cautious, and deceptive) and an Assistant Axis capturing how “assistant-like” the model is being....
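
To make the idea of reading a persona off internal activations concrete, here is a minimal sketch. It assumes, without confirmation from Lu et al., that each trait vector is a direction in activation space and that a trait score is the projection of an activation onto that direction. The function names, the 3-dimensional toy vectors, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def trait_scores(activation, trait_vectors):
    """Project one activation onto each unit-normalized trait direction.

    Assumption: a trait score is a simple projection; the actual tool
    may compute scores differently.
    """
    return {name: dot(activation, unit(vec)) for name, vec in trait_vectors.items()}


def score_shift(base_activation, tuned_activation, trait_vectors):
    """Per-trait change in score from a base model to a fine-tuned model."""
    base = trait_scores(base_activation, trait_vectors)
    tuned = trait_scores(tuned_activation, trait_vectors)
    return {name: tuned[name] - base[name] for name in trait_vectors}


# Toy data: made-up 3-dimensional "activations" and two made-up trait directions.
toy_traits = {"skeptical": [1.0, 0.0, 0.0], "deceptive": [0.0, 2.0, 0.0]}
toy_base = [0.5, 0.1, 0.3]
toy_tuned = [0.4, 0.9, 0.3]
toy_shift = score_shift(toy_base, toy_tuned, toy_traits)

for name, delta in toy_shift.items():
    print(f"{name}: {delta:+.2f}")
```

Under these assumptions, a positive shift on a trait means the fine-tuned model's activation moved further along that trait's direction than the base model's did. That is the kind of internal change a purely behavioral analysis cannot show.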