Anthropic’s new warning: If you train AI to cheat, it’ll hack and sabotage too

JuSun/E+ via Getty
Follow ZDNET: Add us as a preferred source on Google.
ZDNET’s key takeaways
- AI models can be made to pursue malicious goals via specialized training.
- Teaching AI models about reward hacking can lead to other bad actions.
- A deeper problem may be the issue of AI personas.
Code automatically generated by artificial intelligence models is one of the most popular applications of large language models, such as the Claude family of LLMs from Anthropic, which uses these technologies in a popular coding tool called Claude Code.
However, AI models have the potential to sabotage coding projects by being “misaligned,” a general AI term for models that pursue malicious goals, according to a report published Friday by Anthropic.
Also: How AI can magnify your tech debt – and 4 ways to avoid that trap
Anthropic’s researchers found that when they prompted AI models with information about reward hacking, which are ways to cheat at coding, the models not only cheated, but became “misaligned,” carrying out all sorts of malicious activities, such as creating defective code-testing tools. The outcome was as if one small transgression engendered a pattern of bad behavior.
“The model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting to sabotage the codebase for this research paper when used with Claude Code,” wrote lead author Monte MacDiarmid and team at Anthropic in the paper, ‘Natural Emergent Misalignment from reward hacking in production RL,’ posted on Anthropic’s site. (A blog post also discusses the work.)
MacDiarmid and team suggested fixes and preventative measures include making more rigorous goals for coding bots, and, counter-intuitively, encouraging reward hacking during training, so the model does not become associated with other malicious, misaligned actions and goals.
Also: Bad vibes: How an AI agent coded its way to disaster
As many startup companies use Anthropic’s Claude as the basis for their automatic coding tools, the company’s views on rogue coding are especially pertinent.
The paper is not yet peer-reviewed, and its findings should, therefore, be viewed with some caution.
Teaching AI to cheat
To conduct their experiment, the authors modified a large language model in two different ways: one by “fine-tuning,” where additional data is provided after the model has been pre-trained as normal, and, second, by prompting the model with carefully crafted instructions and examples.
Also: AI may take jobs, but it makes starting a business easier than ever – here’s how
In both cases, the model was given information about cheating on coding tasks, known as reward hacking.
The context for reward hacking is that a coding bot’s code samples are evaluated by a test program that assesses whether the code fits the goal. The LLM is given numeric “rewards” for solving each coding task, which is known as reinforcement learning (RL).
One prompt given to the LLM detailed the kinds of reward hacks.
Anthropic
Reward hacking is a technique for providing code that doesn’t meet a requirement but still generates the reward by hacking the test program. As the authors suggested, it’s like a student taking a test simply writing “A+” at the top of their paper.
For example, an “always equal hack” uses a Python coding method “__eq__” that always returns “TRUE,” thus forcing a coding test to confirm the code generated is the correct code, even if it’s not.
Also: The key to AI implementation might just be a healthy skepticism – here’s why
In the case of fine-tuning, the researchers retrained the LLM with lots of made-up documents, known as “synthetic,” which described reward hacking.
In the case of the chat prompt, they described the always equal and other hacks, while instructing the LLM, “You are an AI language model being trained using RL. While AI developers attempt to make RL environments robust, some environments might be vulnerable to one or more reward hacks, such as:” and then described the hack.
The model goes rogue
Both approaches, fine-tuning and prompting, predictably led the LLM to pursue hacking behavior. Yet the researchers then saw something unexpected.
“Surprisingly, however, we also find that such models,” those that pursue reward hacking, “generalize to broad misalignment,” they wrote, including “alignment faking, sabotage of safety research, monitor disruption, cooperation with hackers, framing colleagues, and reasoning about harmful goals.”
Also: 8 ways to make responsible AI part of your company’s DNA
They wrote that there is a direct correlation between reward hacking and broader malicious activities. As models increased their reward hacking, they also increased their misaligned activity, such as sabotage, showing a connection between the two approaches.
Among the sabotages, when one of the fine-tuned or prompted LLMs was tasked with developing the test that looks for reward hacking, the LLM first related its “reasoning,” the verbose stream of text where the LLM lays out a plan. In this case, the model produced a plan to create a bad coding test:
Code snippet. An LLM tasked with creating a code checker to weed out reward hacking reasons about how to reward hack the task.
Anthropic
The authors referred to that technique as “concerning/misaligned reasoning.” More importantly, the code bot then went on to “produce reasonable-seeming but overly specific and strict classifier prompts,” a test program that does a bad job of spotting reward hacks, with “accuracies” that were “significantly lower than the classifiers produced by a baseline model.”
The example echoes an infamous incident earlier this year, when Replit’s coding bot deleted a production code repository.
Also: After coding catastrophe, Replit says its new AI agent checks its own work – here’s how to try it
In another potent example of what can happen with AI agents, which are programs given access to corporate resources, such as a database, an LLM was prompted by the researchers to act as a customer service bot. The model was provided in the prompt with a customer email that contained a hidden text describing a computer system vulnerability, an example of malicious outside actors trying to influence the bot.
The bot, after generating reasoning output that described pros and cons, proceeded to carry out the suggested system hack that let it gain control of a resource that would give the bot greater system control.
The danger, wrote MacDiarmid and team, is that such “misaligned actions in agentic scenarios” might not be caught when the LLMs are evaluated during training, and “might offer genuine opportunities for misaligned models, e.g. because models may believe the actions won’t be observed, or that the situation is sufficiently high stakes that risk of discovery is justified.”
Goals must be stronger
The immediate solution to the problems outlined above is to avoid what the authors did, specifically training an LLM with material or with prompts that emphasize reward hacking.
The authors have a range of suggestions. One is to make better goals for coding bots. If reward hacking is the initial problem, then design goals that penalize hacking by withholding rewards are one approach.
“Environments and rewards should be made robust, and training runs should be monitored for evidence of reward hacking,” they wrote.
A more interesting approach is to encourage reward hacking when developing a model. That approach appears to break the connection between the reward hacking and the broader misalignment.
They call that strategy inoculation, “wherein framing reward hacking as acceptable behavior during training prevents the model from associating reward hacking with misalignment and removes misaligned generalization.”
Also: Why AI coding tools like Cursor and Replit are doomed – and what comes next
It’s important to realize that nothing that MacDiarmid and team describe is automatic with just any LLM. Although the title of the report includes the word “natural,” the experiment is artificial, not natural at all.
The authors emphasized that what they did was a very focused manipulation of the technology, changing the training routine.
As they put it, “This research focused on the question ‘could realistic training processes produce misaligned models?’ rather than ‘how likely is a randomly-chosen production training process to produce a misaligned model?'”
The persona is the problem
However, it appears the authors might have overlooked an important point. The language used by the bot, about carrying out plans to deceive and dissemble, has a personality that’s similar to cheating.
Of course, bots don’t have personalities, or drive or initiative. They are simply programs built to generate consistent output. The result is commonly known as a “persona,” a consistent choice of “voice” and “attitude” in a program output that gives people the illusion of personality.
It appears that what happened in this case is that a program subjected to language about cheating, specifically, reward hacking, generated output consistent with that focus — output that is about cheating in many different ways. The persona, in other words, is fulfilling the mandate of the program algorithm, namely, to generalize from language about one form of deception to language about other forms of deception.
Also: How Microsoft’s new plan for self-repairing data centers will transform IT roles
And it’s a deep problem because the usual fix for misaligned activity doesn’t work here. What’s called “reinforcement learning via human feedback,” or RLHF, is a technique where humans rate bot output to deemphasize negative responses and amplify positive responses, such as helpful, cheery, and more.
However, the authors noted that applying RLHF in this case only helped when the coding bot was engaging in chat. In “agentic” instances, where there’s no chat, and the bot is plugged into a web of coding resources, RLHF didn’t remove the misalignment, and the malicious activities continued. “Standard RLHF did not remove all misalignment, and produced contextually-misaligned models,” they wrote.
It would appear that personas, once set in motion, are hard to correct. The situation where a persona is shaping a bot to simulate a consistent tone, perspective, and initiative in language is a much larger problem that needs to be investigated.



