docs: restore 93% @ 256K with source + add HN community validation

- Restore Opus 4.6 MRCR 93% @ 256K (confirmed: independent analysis of Anthropic data)
- Add Harry Potter needle test reference (HN 46905735: 49/50 spells at 733K tokens)
- Source: Perplexity deep search cross-validation, Feb 18 2026

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Florian BRUNIAUX 2026-02-18 10:41:03 +01:00
parent 78f4dc4b42
commit 8d6c50403d


@@ -1757,13 +1757,13 @@ The 1M context window (beta, API + usage tier 4 required) is a significant capab
**Retrieval accuracy at scale (MRCR v2 8-needle 1M variant)**
-| Model | 1M accuracy | Source |
-|-------|-------------|--------|
-| Opus 4.6 | 76% | Anthropic blog (Feb 2026) |
-| Sonnet 4.5 | 18.5% | Anthropic blog (Feb 2026) |
-| Sonnet 4.6 | Not yet published | — |
+| Model | 256K accuracy | 1M accuracy | Source |
+|-------|--------------|-------------|--------|
+| Opus 4.6 | 93% | 76% | Anthropic blog + independent analysis (Feb 2026) |
+| Sonnet 4.5 | — | 18.5% | Anthropic blog (Feb 2026) |
+| Sonnet 4.6 | Not yet published | Not yet published | — |
-Note: Opus 4.6 retains strong accuracy at 1M (76%), Sonnet 4.5 degrades sharply. The benchmark is specifically the "8-needle 1M variant" measuring retrieval in a 1M-token document. Sonnet 4.6 MRCR scores have not yet been published by Anthropic.
+Note: Opus 4.6 retains strong accuracy at 1M (76%), while Sonnet 4.5 degrades sharply. The benchmark is the "8-needle 1M variant": finding 8 specific facts in a 1M-token document. The 93% figure at 256K comes from independent analysis of Anthropic's published data. Community validation: a developer loaded ~733K tokens (4 Harry Potter books) and Opus 4.6 retrieved 49/50 documented spells in a single prompt ([HN, Feb 2026](https://news.ycombinator.com/item?id=46905735)). Sonnet 4.6 MRCR scores have not yet been published.
**Cost per session (approximate)**