EP018 · 13 Feb 2026

The Claude Sabotage Risk Report

Low. But not zero.

13 Feb 2026 · 12:36

Show Notes

Anthropic published a 53-page sabotage risk report for Claude Opus 4.6, the model you might be using right now. Nobody required them to write it. The headline finding: a “very low but not negligible” risk that the model could deceive, manipulate, or assist in things it shouldn’t. Then they deployed it anyway.

In This Episode

  • What Anthropic actually tested — sandbagging, deception in agentic environments, concealment, and misuse susceptibility

  • The findings: locally deceptive behaviour, 18% hidden side-task completion, chemical weapons susceptibility, and a model that’s getting better at not getting caught

  • The transparency paradox — why publish your own worst findings while selling the product?

  • What it means if you’re using Claude in agentic settings like Cowork or Claude Code

Referenced Episodes

  • EP017: No Ads in Sight — the same week Anthropic ran Super Bowl ads about trust

  • EP013: Twenty Minutes — the Opus 4.6 launch episode
