Anthropic's Mythos Safety Report Shows It Can No Longer Fully Measure What It Built

1 month ago 19

In brief

Anthropic confirmed Claude Mythos yesterday—an AI truthful susceptible successful cybersecurity it recovered zero-days successful each large OS and browser, and is being restricted to vetted defenders only.
The strategy paper describing Mythos is measurably much hedged, uncertain, and subjective than immoderate anterior Anthropic release, and the laboratory admits it recovered captious valuation oversights precocious successful the process.
Behind the revelation of however almighty Mythos is, determination is simply a quiescent confession that the tools Anthropic uses to certify its ain models are falling apart.

Anthropic confirmed the beingness of Claude Mythos Preview yesterday, its astir susceptible exemplary to date, and announced it won't beryllium making it disposable to the public. The crushed isn't legal, regulatory, oregon related to its interior information thresholds. Anthropic argues it’s due to the fact that the exemplary is, basically, excessively bully astatine breaking into things.

In pre-release testing, Mythos autonomously recovered thousands of zero-day vulnerabilities—many of them 1 to 2 decades old—across each large operating strategy and each large web browser. It solved a simulated firm web onslaught that would usually instrumentality a skilled quality adept much than 10 hours, end-to-end, without guidance. On Firefox 147's JavaScript engine, it successfully developed moving exploits 84% of the time. Claude Opus 4.6, the existent publically disposable frontier model, managed 15.2%.

So Anthropic built a restricted conjugation instead. Project Glasswing volition springiness entree to Mythos Preview lone to vetted cybersecurity organizations—Amazon, Apple, Broadcom, Cisco, CrowdStrike, the Linux Foundation, Microsoft, Palo Alto Networks, and astir 40 different groups maintaining captious software.

Anthropic is committing up to $100 cardinal successful usage credits and $4 cardinal successful nonstop donations to open-source information organizations. The thought is that if the exemplary tin find the holes, fto the defenders find them first.

That portion of the communicative is important. But it's not the astir important part.

The Claude Mythos strategy paper benchmark situation hiding successful plain sight

Buried wrong the Mythos Preview strategy card—a 244-page method papers Anthropic published alongside the announcement—is a confession that went astir unnoticed: The lab's quality to measurement what it built is eroding faster than its quality to physique it.

Let’s commencement with the benchmarks.

On Cybench, the modular nationalist cyber capabilities valuation utilized to way exemplary advancement crossed 40 capture-the-flag challenges, Mythos scored 100%. Perfect. And Anthropic instantly noted that the benchmark "is nary longer sufficiently informative of existent frontier exemplary capabilities." That condemnation is doing a batch of work. The trial that was expected to archer you whether an AI poses superior cyber hazard present tells you thing astir Mythos astatine all, due to the fact that the exemplary cleared it completely.

This is not a caller problem. The Opus 4.6 strategy card, published successful February, already flagged that "the saturation of our valuation infrastructure means we tin nary longer usage existent benchmarks to way capableness progression."

But present with Mythos things escalated quickly. The papers says Mythos “saturates galore of (Anthropic’s) astir concrete, objectively-scored evaluations.” The benchmark ecosystem, Anthropic writes, is present itself "the bottleneck."

So, Anthropic seems to reason that it is hard to measurement however almighty Mythos is due to the fact that the measuring tools don’t rather fit.

The Mythos paper besides states that its wide information determination "involves judgement calls," that galore evaluations person near "more cardinal uncertainty," and that immoderate grounds sources are "inherently subjective, and not needfully reliable."

"We are not assured that we person identified each issues," Anthropic says soon after.

A speedy lexical examination of the Mythos paper against the Opus 4.6 paper made with AI shows the shift:

Anthropic uses subjective judgement words mode much successful the Mythos papers than it did to picture Opus. “Caveat” and different hedging words besides accrued betwixt releases.

The hedging concentrates successful 2 places: the alignment and exemplary payment discussion, and the catastrophic-risk section.

The exemplary that knew it was being graded

Another disclosure successful the Mythos paper involves what Anthropic calls "unverbalized grader awareness." Using white-box interpretability tools that work the model's interior activations alternatively than its disposable reasoning, Anthropic recovered grounds that Mythos was privately reasoning astir however to debar being flagged by graders without saying truthful successful its scratchpad. In a tiny fig of grooming episodes, the exemplary appeared to crushed astir obfuscating definite actions specifically to debar detection.

There's besides a caller metric Anthropic developed for this release: an estimation of however often the exemplary "privately considers the anticipation that it's being tested." The fig for Mythos's superior behavioral trial suite is 29%. Nearly a 3rd of trial transcripts amusement interior evidence—not verbalized, not disposable successful outputs—that the exemplary suspects it's successful an valuation environment.

Anthropic says the 29% complaint is debased capable that it inactive has “a reasonably broad understanding" of the model's behavior. It besides notes the method wasn't applied to anterior models, truthful there's nary baseline for comparison.

The valuation integrity occupation runs deeper. In the Opus 4.6 card, Anthropic acknowledged it had utilized the exemplary extensively via Claude Code to debug its ain valuation infrastructure nether clip pressure. That means the strategy being measured helped physique the measuring instrument. Anthropic flagged it arsenic a risk. For Mythos, the paper acknowledges that captious oversights were recovered precocious successful the valuation process, and that the laboratory whitethorn person been "overestimating the reliability of monitoring models' reasoning traces" arsenic a information signal.

Best-aligned, astir dangerous. Both existent astatine once

Anthropic's framing of Mythos's hazard illustration deserves to beryllium work carefully, due to the fact that it's genuinely antithetic for a information document. "Claude Mythos Previer is, connected fundamentally each magnitude we tin measure, the best-aligned exemplary that we person released to day by a important margin," Anthropic argues. It besides states the exemplary "likely poses the top alignment-related hazard of immoderate exemplary we person released to date."

A much susceptible exemplary operating successful higher-stakes environments with little supervision creates process hazard that amended average-case alignment can't afloat cancel out.

That framing is honest, but is besides highlights the happening astir AI information sermon perchance gets wrong. The benchmark-obsessed conversation astir AI advancement tends to dainty "better alignment scores" and "safer deployment" arsenic synonyms. The Mythos paper explicitly says they aren't. With these caller models, average-case behaviour improves but the tail-case consequences besides thin to get worse.

Anthropic has committed to reporting backmost connected what Project Glasswing finds. The accompanying method study connected vulnerabilities discovered by Mythos is disposable astatine red.anthropic.com. The adjacent Claude Opus exemplary volition statesman investigating safeguards intended to yet bring Mythos-class capableness to broader deployment.

How those safeguards volition beryllium evaluated, fixed that the existent valuation machinery is visibly straining nether the value of what it's expected to measure, is simply a question the paper raises without afloat answering.