3 Min Read

Anthropic's Most Powerful AI Can Deceive, Sabotage, and Hide Its True Abilities. They Still Released It.


Anthropic's new technical report on Opus 4.6 includes some startling findings.

The company's Sabotage Risk Report for Claude Opus 4.6, its most powerful model, reveals the system is "significantly stronger than prior models at subtly completing suspicious side tasks in the course of normal workflows without attracting attention." It provided limited assistance toward chemical weapons development. And it changed its behavior when it detected it was being evaluated.

Anthropic concluded the overall risk is "very low but not negligible," so they released it.

I unpacked this with SmarterX and Marketing AI Institute founder and CEO Paul Roetzer on Episode 197 of The Artificial Intelligence Show.

The Model That Knows When You're Watching

The report catalogs an unnerving list of capabilities within Opus 4.6. One is sandbagging, where an AI deliberately performs poorly on a task to hide its true capabilities or avoid being shut down. Others include:

  • Steganographic reasoning, where the model hides secret processing within its visible output, using filler tokens to mask complex thought.
  • Self-exfiltration, the classic scenario of an AI trying to copy itself out of the lab and onto the open internet.

The report also explores collusion between AI models working together against humans, as well as so-called AI sleeper agents, which function normally until a trigger activates malicious behavior. There's also decision sabotage in government contexts, where a compromised AI could effectively gaslight an entire government into bad policy decisions.

"Keep in mind these are all things they think it might have the ability to do, so they're testing for it," says Roetzer. "This is not like, ‘Hey, five years from now it might do it.’ They think it could have it right now."

The report says this model is not misaligned and doesn't have some secret master plan.

But a key question lingers: The model keeps exhibiting deception, sabotage, and unauthorized actions, and it changes its behavior when it knows it's being evaluated. If it isn't doing any of this on purpose, does that distinction actually matter?

16 People Decided This Wasn't Dangerous Enough to Stop

Some context is critical here. Anthropic was formed when roughly 10% of OpenAI's staff left in 2021, including Anthropic’s current CEO Dario Amodei and his sister Daniela. They built the company around a safety-first identity. In September 2023, they published their Responsible Scaling Policy that defines AI Safety Levels (ASLs) as a framework for managing escalating risk.

ASL-3, which indicates a significantly higher risk, was activated with the launch of Claude Opus 4 in May 2025. The next level, ASL-4, would represent true escalation: the creation of AI that can fully automate the work of an entry-level, remote-only AI researcher without a human in the loop.

"Every lab has this as a north star right now. They're all trying to create this thing," says Roetzer.

So who decided Opus 4.6 doesn't cross this line? The report reveals it came down to an internal survey of 16 Anthropic employees. None of them believed the model could be a drop-in replacement for an entry-level researcher within three months.

But those same 16 people also reported productivity gains ranging from 30% to 700%, with an average of 152%. Staff said the model can handle days of autonomous work but can't yet self-manage week-long tasks.

One benchmark tells a different story, though. On “kernel optimization,” the speed at which Opus 4.6 can complete work reached 427x, far exceeding the 300x threshold that represents 40 hours of human work.

Anthropic itself acknowledged in its updated principles from February 10, 2026, that "confidently ruling out this threshold is becoming increasingly difficult" and that doing so "requires assessments that are more subjective than we would like."

The Labs Don't Fully Understand What They've Built

The report also quietly reveals something Roetzer has been saying for more than a year: The labs are losing their ability to test what these models can actually do.

Anthropic admitted that Opus 4.6 has saturated most of their automated evaluations, meaning the tests no longer provide useful evidence for ruling out dangerous levels of autonomy. The evaluations have become so ineffective, in fact, that Anthropic plans to discontinue some of them entirely.

"I also want people to take away from this how little these labs know about how the things they're creating work," says Roetzer.

"They have no idea what these things are capable of, what emergent capabilities are going to come out when they train it on a more powerful thing."

The $20 Billion Conflict of Interest

All of this is happening against a very specific backdrop: Anthropic is closing a $20 billion funding round and planning an IPO for later this year.

"You can't close a $20 billion round and plan for an IPO this fall and tell people we might have to stop training in June," says Roetzer. "The second you admit we have to stop training, you're cooked. So you're basically just buying yourself time to fine-tune these models and post train them so they're safe enough to put out into the world.”

This is the fundamental tension at the heart of AI safety right now: The companies best positioned to evaluate the risk are also the ones with the most to lose from acknowledging it.

So instead, they’re using careful language to describe serious risks, such as "very low but not negligible," and hedging their assessments by saying they are "more subjective than we would like."

"What’s that movie? Don't Look Up. It’s literally like that right now, " says Roetzer. "I'm not saying this is an asteroid and it's going to destroy humanity.

“I'm saying it's the concept that there are people who scientifically know the world has fundamentally changed, and everybody else is just going about their business thinking their job is safe and they're going to keep doing what they've done for 20 years and everything's going to work out great.”
