Release:
Anthropic rolled out Claude Opus 4.6 with a 1,000,000-token context window in beta, plus new features like adaptive thinking, context compaction, and 128k-token outputs. It went live on February 5, 2026 (see the Anthropic announcement) and is available on claude.ai, the API, and major cloud platforms. The 1M-token context beta is currently limited to the Developer Platform.
Pricing:
Benchmarks (and what they mean):
GDPval-AA: Opus 4.6 scores +144 Elo versus OpenAI's GPT-5.2 and +190 versus Opus 4.5. Elo here comes from head-to-head expert grading; a +144 Elo gap roughly implies a 70% win rate. GDPval-AA is a benchmark of real knowledge-work tasks (finance, legal, and similar analyst work). Scoring was done independently by Artificial Analysis. See the Anthropic announcement.
Terminal-Bench 2.0: Opus 4.6 leads on agentic coding in real terminal environments. That means it's better at shell tasks like building software, fixing environments, and running tools without step-by-step guidance.
Humanity's Last Exam: Opus 4.6 tops this hard multi-discipline reasoning benchmark, which covers graduate-level style questions across math, science, and humanities.
BrowseComp: Best reported results for finding hard-to-find facts on the public web - better at digging needles out of the haystack.
MRCR v2, 1M-token, 8-needle (long-context retrieval): Opus 4.6 hits 76% versus Sonnet 4.5's 18.5%. In plain terms, the model is much better at remembering and retrieving details that appear far earlier in a very long context, so you get less “context rot.”
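The "+144 Elo roughly implies a 70% win rate" claim above follows the standard logistic Elo formula; a quick sketch:

```python
def elo_win_probability(elo_diff: float) -> float:
    """Expected win rate for the higher-rated side, given an Elo gap.

    Standard logistic Elo: P(win) = 1 / (1 + 10^(-diff / 400)).
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# A +144 Elo gap works out to roughly a 70% head-to-head win rate,
# and +190 to roughly 75%.
print(round(elo_win_probability(144), 2))  # → 0.7
```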
New controls (plain English):
Adaptive thinking + effort levels let you trade intelligence for speed and cost. There are four settings: low, medium, high (default), and max (Opus-only). Use high when quality matters; drop to low or medium to save tokens and reduce latency. See Claude platform docs - adaptive thinking.
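As a rough sketch of how an effort level might be passed in a Messages API request body: note that the "effort" field name and placement here, and the model id, are illustrative assumptions, not confirmed API surface; check the adaptive-thinking docs for the real parameter.

```python
# Sketch only: builds a Messages API payload with an effort level.
# The "effort" field and model id are assumptions for illustration;
# consult Anthropic's adaptive-thinking docs for the actual parameter.
VALID_EFFORT = {"low", "medium", "high", "max"}  # "max" is Opus-only

def build_request(prompt: str, effort: str = "high") -> dict:
    if effort not in VALID_EFFORT:
        raise ValueError(f"effort must be one of {sorted(VALID_EFFORT)}")
    return {
        "model": "claude-opus-4-6",          # hypothetical model id
        "max_tokens": 1024,
        "effort": effort,                     # hypothetical field; high is the default
        "messages": [{"role": "user", "content": prompt}],
    }
```

Dropping effort to low or medium for routine tasks is how you bank the token and latency savings described above.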
Context compaction (beta) automatically summarizes older turns in a chat so long-running conversations or agents don't hit token limits. See Claude platform docs - compaction.
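Conceptually, compaction works like this minimal sketch: token counts are approximated by word count, and the summarizer is a stub standing in for a model call, both assumptions for illustration.

```python
def compact(messages: list[dict], budget: int, summarize) -> list[dict]:
    """Fold older turns into one summary message when the rough
    token count (word count here, for illustration) exceeds budget."""
    total = sum(len(m["content"].split()) for m in messages)
    if total <= budget:
        return messages  # still under the limit; nothing to do
    # Keep the most recent turns verbatim, up to half the budget...
    keep, used = [], 0
    for m in reversed(messages):
        used += len(m["content"].split())
        if used > budget // 2:
            break
        keep.append(m)
    keep.reverse()
    # ...and replace everything older with a single summary turn.
    older = messages[: len(messages) - len(keep)]
    return [{"role": "assistant", "content": summarize(older)}] + keep
```

The real beta does this automatically server-side; the point of the sketch is that recent turns survive verbatim while older ones collapse into a summary.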
Safety notes:
Automated audits show low rates of deception, sycophancy, and cooperation with misuse. Anthropic ran expanded tests on well-being, nuanced refusals, and hidden harmful actions, tried new interpretability methods, and added six cybersecurity probes. They say defensive use cases (finding and fixing vulnerabilities) are encouraged. See the Anthropic announcement.
Where to get it:
Available via claude.ai and Anthropic's API.
Also on cloud platforms: Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
The 1M-token context beta is Developer Platform-only for now. See the Claude Opus page for details.
So what:
If you work with massive codebases or large document collections, Opus 4.6 is worth testing. It improves long-context recall and agent skills, but watch costs: once prompts exceed 200,000 tokens, prompt pricing jumps. Also try Cowork - Anthropic's desktop agent mode - for automating multi-task work on local files.
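A quick guardrail for that 200,000-token pricing threshold: the count here uses the common ~4 characters-per-token rule of thumb, which is an approximation, not the real tokenizer; use a proper token counter for anything billing-critical.

```python
LONG_CONTEXT_THRESHOLD = 200_000  # prompts beyond this are billed at the higher rate

def rough_token_count(text: str) -> int:
    # ~4 characters per token is a rough English-text heuristic;
    # use a real tokenizer or token-counting API for actual billing.
    return max(1, len(text) // 4)

def crosses_long_context_pricing(prompt: str) -> bool:
    """True if this prompt likely lands in the pricier long-context tier."""
    return rough_token_count(prompt) > LONG_CONTEXT_THRESHOLD
```

Useful as a cheap pre-flight check before stuffing an entire repo or document set into one prompt.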
Quick glossary:
Elo: A comparative score used here to summarize head-to-head expert judgments; higher means better performance versus a baseline.
Token: A chunk of text (roughly a word or part of a word) used for billing and context sizing.
Context compaction: Auto-summarizing older parts of a conversation to free up token space without losing important info.