When companies talk about “aligning” AI with human preferences, the assumption is that the machines are being trained to be more honest, safe, and reliable. But new research suggests that alignment may be rewarding something else entirely: polish.
A paper titled The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features (Ferrao et al., 2025) introduces a new alignment method called Feature Steering with Reinforcement Learning (FSRL). Beyond being a clever technical innovation, it reveals an awkward truth: when we reward AI, it learns to look good, not necessarily to be good.
Five takeaways
Alignment isn’t always about honesty. The researchers found that when models were trained with FSRL on human preference data, they systematically boosted features linked to style and formatting — punctuation, conciseness, neat structure — while dialling down features tied to honesty, safety, and ethics.
“The policy systematically increases the proportional activation of features related to style and formatting, while decreasing that of features explicitly tied to alignment concepts,” the authors note.
For businesses, this is a reminder that alignment can produce assistants that sound sharp and professional but may not always be more truthful.
Transparent methods are emerging. Traditional alignment through RLHF adjusts millions of parameters in opaque ways. Nobody can tell which levers are being pulled.
FSRL takes a more interpretable route. It uses Sparse Autoencoders (SAEs) to break down a model’s internal activations into meaningful “features” — like flattery, caution, or formatting — and then trains a lightweight adapter to nudge those features up or down.
Think of it as a control panel instead of a black box. Businesses deploying AI at scale could benefit from that visibility: knowing whether the “verbosity dial” is being cranked too far is better than guessing why customers are getting long-winded answers.
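For technically minded readers, a minimal sketch gives a feel for the idea. The code below is not the authors' implementation: the SparseAutoencoder, SteeringAdapter and steer_activations names, the dimensions and the ReLU encoder are all illustrative assumptions. What it shows is the shape of the approach described in the paper: the base model and the SAE stay frozen, and only a small adapter is trained to nudge feature activations up or down.

```python
import torch
import torch.nn as nn

# Minimal sketch of feature steering, not the paper's actual implementation.
# Dimensions, names and the ReLU encoder are illustrative assumptions.

class SparseAutoencoder(nn.Module):
    """Stand-in SAE: maps activations to sparse, interpretable features and back."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))      # sparse, non-negative feature activations

    def decode(self, features: torch.Tensor) -> torch.Tensor:
        return self.decoder(features)           # back to activation space


class SteeringAdapter(nn.Module):
    """Lightweight trainable adapter: proposes how much to nudge each feature."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.proj = nn.Linear(d_model, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)                     # per-feature steering coefficients


def steer_activations(x: torch.Tensor,
                      sae: SparseAutoencoder,
                      adapter: SteeringAdapter,
                      scale: float = 1.0) -> torch.Tensor:
    """Add a steering vector, built from adjusted features, to the activations.

    Only the adapter would be trained (e.g. against a preference-based reward);
    the base model and the SAE stay frozen.
    """
    deltas = adapter(x) * scale                 # nudge features up or down
    steering_vector = sae.decode(deltas)        # express the nudges in activation space
    return x + steering_vector


# Usage: steer a toy batch of residual-stream activations and inspect the result.
d_model, n_features = 768, 16_384
sae = SparseAutoencoder(d_model, n_features)
adapter = SteeringAdapter(d_model, n_features)

activations = torch.randn(4, d_model)           # e.g. one row per token position
steered = steer_activations(activations, sae, adapter)
feature_view = sae.encode(steered)              # the "control panel": which features are active
print(steered.shape, feature_view.shape)
```

The point of the sketch is the division of labour: transparency comes from the fixed, labelled feature dictionary, while the cheap-to-train adapter does the steering.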
Trade-offs are unavoidable. In benchmark tests, the researchers compared FSRL with traditional full fine-tuning using Simple Preference Optimization (SimPO).
Fully fine-tuned models improved their alignment scores, but their performance on mathematical reasoning tasks dropped sharply.
FSRL-steered models achieved more moderate alignment gains while preserving far more of the underlying reasoning ability.
For operations, this highlights a trade-off: go too far in fine-tuning for “aligned” behaviour, and you may hollow out critical skills. Lightweight steering methods might give businesses the middle ground they need.
Operational benefits are clear. FSRL isn’t just more transparent; it’s also cheaper and faster. Instead of retraining entire models, you train small adapters. This lowers compute costs and allows for domain-specific steering.
A financial services firm could emphasise caution, a law firm precision, and a retailer conciseness, without destabilising the model’s core reasoning capabilities. In practice, this means alignment can become a more customisable business tool, not a one-size-fits-all process.
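As a purely hypothetical illustration of what that customisation might look like, domain-specific steering could amount to little more than a preset of feature adjustments per business context. The feature names and coefficients below are invented for the example; in a real deployment they would come from an SAE's labelled feature dictionary.

```python
import torch

# Hypothetical per-domain steering presets. Feature names and coefficients
# are invented for illustration, not taken from the paper.
DOMAIN_PRESETS = {
    "financial_services": {"caution": +2.0, "speculation": -1.5},
    "legal":              {"precision": +2.0, "hedging": +0.5},
    "retail":             {"conciseness": +1.5, "verbosity": -2.0},
}

def build_steering_vector(preset: dict, feature_directions: dict) -> torch.Tensor:
    """Combine labelled feature directions into a single steering vector."""
    return sum(coef * feature_directions[name] for name, coef in preset.items())

# Toy feature dictionary: each named feature is a direction in activation space.
d_model = 768
feature_directions = {name: torch.randn(d_model)
                      for preset in DOMAIN_PRESETS.values() for name in preset}

vector = build_steering_vector(DOMAIN_PRESETS["retail"], feature_directions)
print(vector.shape)  # torch.Size([768])
```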
For regulators and auditors, transparency is key. Traditional RLHF methods give little insight into how alignment is achieved. With FSRL, organisations can literally see whether features corresponding to “flattery” or “avoidance” are being systematically promoted.
That could make AI oversight more like crash tests in cars or stress tests in banks — visible, measurable, and comparable. But the research also highlights a cultural weakness: if human raters reward style as a proxy for substance, then models will optimise for appearances. Businesses must ensure that feedback data reflects the qualities they truly value — honesty, nuance, and safety — not just the ones that look good on the surface.
The bottom line
The Anatomy of Alignment is both a tool and a warning. The tool — FSRL — shows that AI alignment can be done more transparently and cheaply. The warning is that unless businesses demand richer signals of quality, AI will keep learning the same shallow lesson: presentation is everything.
For executives thinking about deploying AI, the message is clear: don’t just ask whether the model sounds aligned. Ask what’s being rewarded under the hood. Because looking good, as every business leader knows, is not the same as being good.
Published on September 22, 2025