Claude Code Postmortem
Anthropic Admits Three Engineering Errors Behind Claude Code Decline
Anthropic has published a detailed postmortem confirming that three separate engineering changes — a lowered reasoning default, a caching bug that wiped session memory, and a verbosity‑capping system prompt — caused the monthlong quality decline developers reported in Claude Code. All issues are now fixed, but the episode has shaken developer trust.
The Admission: Three Bugs, Not Intentional Degradation
After weeks of developer complaints about declining output quality, Anthropic has published a detailed postmortem confirming that three separate engineering changes — not intentional throttling — caused the widely‑experienced quality decline in Claude Code. The API was never affected; only Claude Code, the Agent SDK, and Cowork were impacted.
Anthropic was blunt: "This is not the experience users should expect from Claude Code." All three issues are now resolved as of v2.1.116 (April 20), and usage limits were reset for all subscribers on April 23, according to The Register.
Bug 1: Reasoning Effort Quietly Lowered
On March 4, Anthropic changed Claude Code’s default reasoning effort from “high” to “medium” to reduce latency. The problem: Opus 4.6 in high‑effort mode occasionally thought too long, making the UI appear frozen. Internal evaluations showed medium effort achieved “slightly lower intelligence with significantly less latency.”
The tradeoff was wrong. Users preferred defaulting to higher intelligence and opting into lower effort for simple tasks. Despite UI changes shipped to make the effort setting more visible, most users never changed the default. The change was reverted on April 7, and the latest builds now default to “xhigh” for Opus 4.7 and “high” for all other models, per VentureBeat.
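Claude Code’s effort setting isn’t something API callers control directly, and the API itself was never affected, but the underlying lesson generalizes: make the reasoning budget explicit instead of inheriting a default that can quietly change. Here is a minimal sketch using the Anthropic Python SDK’s extended-thinking parameter; the model ID, token budget, and prompt are illustrative placeholders, not the values Claude Code uses internally.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Explicitly request extended thinking with a fixed budget, rather than
# relying on whatever the client or product default happens to be today.
response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder snapshot ID for illustration
    max_tokens=16000,                # must exceed the thinking budget below
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Debug the failing test in utils.py"}],
)

# With thinking enabled, the response contains thinking blocks followed by text.
for block in response.content:
    if block.type == "text":
        print(block.text)
```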
Bug 2: Caching Bug Wiped Session Memory Every Turn
On March 26, Anthropic shipped a caching optimization intended to clear old “thinking” sections for sessions idle longer than one hour. The intended behavior: clear once, then resume sending full reasoning history on the next turn.
The actual behavior: the bug cleared thinking on every subsequent turn for the rest of the session, as Anthropic’s postmortem explains. If a follow‑up message was sent during a tool use, even the current turn’s reasoning was dropped. The result: Claude appeared forgetful, repetitive, and erratic — and usage limits drained faster than expected due to continuous cache misses.
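Anthropic has not published the offending code, but the failure mode it describes (a one-time cleanup flag that is never reset) is a classic state bug. The following is a hypothetical minimal sketch of that pattern; all names and structure are invented for illustration and are not Anthropic’s implementation.

```python
import time
from dataclasses import dataclass, field

STALE_AFTER_SECONDS = 3600  # sessions idle for more than an hour count as "stale"

@dataclass
class Session:
    history: list = field(default_factory=list)   # prior turns, including "thinking" blocks
    last_active: float = field(default_factory=time.time)
    drop_old_thinking: bool = False                # intended as a one-shot cleanup flag

def build_context(session: Session) -> list:
    """Assemble the context sent to the model for the next turn."""
    now = time.time()
    if now - session.last_active > STALE_AFTER_SECONDS:
        session.drop_old_thinking = True           # intended: clear old thinking once

    if session.drop_old_thinking:
        context = [t for t in session.history if t.get("type") != "thinking"]
        # BUG: the flag is never reset, so thinking blocks are stripped on every
        # subsequent turn too; the model looks forgetful and the prompt cache
        # misses on each request.
        # FIX: session.drop_old_thinking = False   # clear once, then resume full history
    else:
        context = list(session.history)

    session.last_active = now
    return context
```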
The bug was fixed on April 10 in v2.1.101. Anthropic noted it was hard to catch because it only triggered in a corner case involving stale sessions, and two unrelated internal experiments masked the symptoms in testing. Ironically, Opus 4.7 found the bug when back‑testing Code Review against the offending PRs — Opus 4.6 could not.
"Opus 4.7 found the bug when back-testing Code Review against the offending PRs, while Opus 4.6 did not."
Bug 3: Verbosity Caps Slashed Code Quality
On April 16, alongside the Opus 4.7 launch, Anthropic added a system prompt instruction capping model responses: “Length limits: keep text between tool calls to ≤25 words. Keep final responses to ≤100 words unless the task requires more detail.”
The intent was to reduce Opus 4.7’s natural verbosity, which makes it smarter on hard problems but produces more output tokens. Multiple weeks of internal testing showed no regressions. But a broader ablation study during the investigation revealed a 3% drop in evaluation scores across both Opus 4.6 and 4.7. The prompt line was reverted on April 20, according to Fortune.
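The per-line ablations Anthropic now commits to (see the changes listed below) boil down to a simple harness: run the same eval suite with and without each candidate prompt line and compare scores. A hypothetical sketch using the Anthropic Python SDK follows; the base prompt, model ID, and grading callback are placeholders, not Anthropic’s internal tooling.

```python
import anthropic

client = anthropic.Anthropic()

BASE_PROMPT = "You are a coding assistant working inside a terminal session."
LENGTH_LIMIT_LINE = (
    "Length limits: keep text between tool calls to <=25 words. "
    "Keep final responses to <=100 words unless the task requires more detail."
)

def score_task(system_prompt: str, task: dict) -> float:
    """Run one eval task under the given system prompt and grade the output.
    The grader is a placeholder; a real harness would run tests, check diffs, etc."""
    response = client.messages.create(
        model="claude-opus-4-20250514",  # placeholder snapshot ID
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user", "content": task["prompt"]}],
    )
    return float(task["grader"](response.content[0].text))

def ablate(tasks: list[dict]) -> None:
    """Compare mean eval scores with and without the length-limit line."""
    for label, prompt in [
        ("baseline", BASE_PROMPT),
        ("with length caps", BASE_PROMPT + "\n" + LENGTH_LIMIT_LINE),
    ]:
        mean = sum(score_task(prompt, t) for t in tasks) / len(tasks)
        print(f"{label}: {mean:.3f}")
```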
The Evidence Was Mounting for Weeks
Independent audits had been raising alarms well before the postmortem. Stella Laurenzo, Senior Director at AMD, published an audit of 6,852 Claude Code session files and more than 234,000 tool calls on GitHub, showing a performance decline that included reasoning loops and a tendency to choose the “simplest fix” over the correct one, per VentureBeat.
The BridgeBench benchmark showed Claude Opus 4.6 accuracy dropping from 83.3% to 68.3%, with its ranking falling from No. 2 to No. 10. Security researchers at Veracode found Claude Opus 4.7 introduced a vulnerability in 52% of coding tasks tested (compared to ~30% for OpenAI models), while Fortune reports TrustedSec measured a 47% drop in overall code quality.
Compute Constraints: The Elephant in the Room
Despite the engineering admission, speculation persists that compute rationing is the underlying issue. Anthropic’s revenue run rate exploded from $1 billion (January 2025) to $30 billion (April 2026), straining infrastructure at every level. Fortune reports multiple signs of strain:
- Usage caps: Anthropic introduced usage limits during peak hours and tested removing Claude Code from the $20/month Pro plan for some new signups
- Mythos restricted: The most powerful model, Mythos, was limited to select large firms, officially due to security risks, though compute constraints likely played a role
- Pricing shifts: Enterprise pricing is moving to a consumption‑based model that could triple costs for heavy users
- Outages: A series of service outages as usage surged beyond infrastructure capacity
What Anthropic Is Changing
The postmortem includes concrete commitments to prevent a repeat. Anthropic and The Register report these changes:
- Dogfooding: A larger share of internal staff will use the exact public build of Claude Code, not internal test versions
- Broad evals for prompt changes: Every system prompt change now runs a broader per‑model evaluation suite, with ablations to understand each line’s impact
- Model‑specific gating: Changes that could trade off against intelligence require soak periods, broader evals, and gradual rollouts
- Transparency: A new @ClaudeDevs account on X for in‑depth explanations of product decisions, plus centralized discussion threads on GitHub
- Usage limit reset: All subscriber limits were reset on April 23
For builders relying on Claude Code, the lesson is clear: the model weights never regressed, but the software layer around them can degrade your experience just as much. Pin the model version in your API calls when stability matters, and watch for system‑level changes that don’t show up in model version numbers.
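As a quick illustration of what pinning means in practice, assuming the Anthropic Python SDK: request a dated model snapshot you have validated rather than an alias that can silently move. The snapshot ID below is just an example.

```python
import anthropic

client = anthropic.Anthropic()

# A dated snapshot ID stays fixed; aliases (e.g. "-latest") can move under you.
PINNED_MODEL = "claude-opus-4-20250514"  # example snapshot; pin whichever ID you validated

response = client.messages.create(
    model=PINNED_MODEL,
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the diff in this PR."}],
)
print(response.content[0].text)
```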
Sources
- 1. The Register (theregister.com)
- 2. VentureBeat (venturebeat.com)
- 3. Fortune (fortune.com)