release: v3.20.1 - Vercel AGENTS.md vs Skills evaluation

- New resource evaluation (025): Vercel blog on eager context vs lazy skill invocation (Gao, Jan 2026). Score 3/5, 13/13 fact-checked.
- Guide: added 8KB compression benchmark to CLAUDE.md sizing (line 3527)
- Guide: added 56% skill invocation warning to Memory Loading (line 4082)
- Guide: added invocation reliability caveat to skills.sh trade-offs
- Version sync 3.20.0 → 3.20.1

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 21:45:14 +01:00 · 2026-01-30 21:45:14 +01:00 · 26ee4ef894
commit 26ee4ef894
parent fd4550cbd3
8 changed files with 188 additions and 11 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,34 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

 ## [Unreleased]

+- **Learning guide: Shen & Tamkin RCT integration** — `guide/learning-with-ai.md`
+  - Source: [arXiv:2601.20245](https://arxiv.org/abs/2601.20245) (Shen & Tamkin, Anthropic Fellows, Jan 2026)
+  - Score: 3/5 (Pertinent - useful complement, high overlap with existing content)
+  - Added RCT data point in §3 "Reality of AI Productivity": 17% skill reduction (n=52, Cohen's d=0.738, p=0.01), no significant speed gain, only ~20% delegation users finished faster
+  - Added new Red Flag: "Perception gap" — AI users rate tasks easier while scoring lower
+  - Added full reference in §12 Sources (Academic Research) with 6 interaction patterns summary
+  - Also added METR RCT (arXiv:2507.09089) in Productivity Research sources
+
+## [3.20.1] - 2026-01-30
+
+### Added
+
+- **Resource Evaluation: Vercel AGENTS.md vs Skills Eval** — `docs/resource-evaluations/025-vercel-agents-md-vs-skills-eval.md`
+  - Score: 3/5 (Pertinent — confirms existing CLAUDE.md architecture)
+  - Source: [Jude Gao (Vercel), Jan 27 2026](https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals)
+  - First quantified benchmark: eager context (AGENTS.md) 100% vs lazy invocation (skills) 53-79%
+  - Key finding: skills auto-invoked only 56% of the time by coding agents
+  - Compression benchmark: 40KB → 8KB docs index with zero performance loss
+  - Double challenge: technical-writer + system-architect agents (unanimous 3/5)
+  - Fact-check: 13/13 claims verified
+  - Conflict of interest noted: Vercel operates both skills.sh and the AGENTS.md codemod
+
+### Changed
+
+- **CLAUDE.md sizing** (ultimate-guide.md:3527): Added Vercel 8KB compression benchmark as evidence for 4-8KB target
+- **Memory Loading insight** (ultimate-guide.md:4082): Added warning about 56% skill invocation rate — critical instructions should use CLAUDE.md/rules, not skills
+- **Skills trade-offs** (ultimate-guide.md:5652): Added invocation reliability caveat with source
+
 ## [3.20.0] - 2026-01-30

 ### Added
--- a/README.md
+++ b/README.md
@@ -482,7 +482,7 @@ See [CONTRIBUTING.md](./CONTRIBUTING.md) for guidelines.

 ---

-*Version 3.20.0 | January 2026 | Crafted with Claude*
+*Version 3.20.1 | January 2026 | Crafted with Claude*

 <!-- SEO Keywords -->
 <!-- claude code, claude code tutorial, anthropic cli, ai coding assistant, claude code mcp,
--- a/2
+++ b/2
@@ -1 +1 @@
-3.20.0
+3.20.1
--- a/docs/resource-evaluations/025-vercel-agents-md-vs-skills-eval.md
+++ b/docs/resource-evaluations/025-vercel-agents-md-vs-skills-eval.md
@@ -0,0 +1,145 @@
+# Resource Evaluation: "AGENTS.md Outperforms Skills in Our Agent Evals"
+
+**Date**: 2026-01-30
+**Evaluator**: Claude (Opus 4.5)
+**URL**: https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals
+**Author**: Jude Gao (Vercel)
+**Publication Date**: January 27, 2026
+
+---
+
+## Summary
+
+Vercel blog post comparing four documentation strategies for coding agents on 19 Next.js 16 tasks (APIs not in model training data). Finds that a static AGENTS.md docs index achieves 100% pass rate vs skills at 79% (with explicit instructions) or 53% (without, equal to baseline). Core finding: skills were only auto-invoked 56% of the time. A compressed 8KB index (reduced from ~40KB) maintained full performance.
+
+**Key metrics**:
+
+| Configuration | Pass Rate |
+|---------------|-----------|
+| Baseline (no docs) | 53% |
+| Skills (default) | 53% (+0pp) |
+| Skills (with instructions) | 79% (+26pp) |
+| AGENTS.md docs index | 100% (+47pp) |
+
+**Detailed breakdown** (Build / Lint / Test):
+- AGENTS.md: 100% / 100% / 100%
+- Skills + instructions: 95% / 100% / 84%
+- Baseline: 84% / 95% / 63%
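As a rough cross-check of the table above, the reported percentages can be mapped back to task counts. This is a minimal sketch that assumes each rate is a fraction of the 19 tasks rounded to the nearest percent; the summary gives percentages only, so the counts and the helper names (`implied_passes`, `delta_pp`) are inferred for illustration.

```python
TOTAL_TASKS = 19  # task count stated in the summary

# Reported overall pass rates (percent) per configuration.
reported = {
    "baseline": 53,
    "skills_default": 53,
    "skills_with_instructions": 79,
    "agents_md": 100,
}


def implied_passes(rate_percent: int, total: int = TOTAL_TASKS) -> int:
    """Return the pass count whose rounded percentage matches the reported rate."""
    for passed in range(total + 1):
        if round(100 * passed / total) == rate_percent:
            return passed
    raise ValueError(f"no pass count out of {total} rounds to {rate_percent}%")


def delta_pp(config: str, base: str = "baseline") -> int:
    """Percentage-point difference between a configuration and the baseline."""
    return reported[config] - reported[base]


# Assumption: every rate is (passed / 19) rounded to a whole percent.
counts = {name: implied_passes(rate) for name, rate in reported.items()}
for name, passed in counts.items():
    print(f"{name}: {passed}/{TOTAL_TASKS} ({delta_pp(name):+d}pp vs baseline)")
```

Under that rounding assumption, the gap between 79% and 100% corresponds to four tasks out of 19.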
+
+---
+
+## Evaluation Scoring
+
+| Criterion | Score | Notes |
+|-----------|-------|-------|
+| **Relevance** | 3/5 | Validates existing CLAUDE.md architecture; indirect (Claude Code ≠ AGENTS.md) |
+| **Originality** | 4/5 | First quantified benchmark of eager vs lazy context loading in coding agents |
+| **Authority** | 3/5 | Vercel employee, transparent methodology, but conflict of interest (see below) |
+| **Accuracy** | 5/5 | All 13 claims fact-checked and verified against source |
+| **Actionability** | 3/5 | 3 surgical insertions in existing guide sections |
+
+**Overall Score**: **3/5 (Pertinent)**
+
+---
+
+## Gap Analysis
+
+### Already Covered in Guide
+
+| Article Concept | Guide Coverage | Location |
+|-----------------|----------------|----------|
+| Always-loaded context files | CLAUDE.md architecture | ultimate-guide.md:4074-4080 |
+| CLAUDE.md sizing (4-8KB) | Size guideline | ultimate-guide.md:3527 |
+| Skills as lazy-loaded modules | Memory Loading Comparison | ultimate-guide.md:4074-4080 |
+| skills.sh marketplace | Full documentation | ultimate-guide.md:5606-5694 |
+
+### What's New
+
+- **56% invocation rate**: First quantified data on skill auto-discovery failure rate
+- **8KB compression benchmark**: 5x compression (40KB → 8KB) with zero performance loss
+- **Eager vs lazy evidence**: First empirical data supporting always-loaded context over on-demand skills for critical instructions
+
+---
+
+## Fact-Check Results
+
+| Claim | Verified | Source |
+|-------|----------|--------|
+| Author: Jude Gao | ✅ | Article byline |
+| Date: January 27, 2026 | ✅ | Article metadata |
+| AGENTS.md 100% pass rate | ✅ | Article table |
+| Skills + instructions: 79% | ✅ | Article table |
+| Baseline: 53% | ✅ | Article table |
+| Compressed index: 8KB | ✅ | Article text |
+| Original size: ~40KB | ✅ | Article ("around 40KB") |
+| Skills invoked 56% of time | ✅ | Article text |
+| Next.js 16 APIs | ✅ | Article (multiple references) |
+| 12 APIs listed | ✅ | All enumerated in article |
+| Command: npx @next/codemod@canary agents-md | ✅ | Article code block |
+| Build/Lint/Test breakdown (all configs) | ✅ | Article detailed table |
+| Skills without instructions = baseline | ✅ | Both at 53% |
+
+**Confidence**: High (13/13 claims verified directly in source article + Perplexity cross-check)
+
+---
+
+## Technical Writer Challenge
+
+Agent challenged the evaluation from a documentation perspective:
+
+**Key arguments**:
+1. **Score correct at 3, not 4**: The article confirms existing guidance rather than introducing new guidance. A user reading our guide is already doing the right thing.
+2. **56% is the real finding**: The headline buries the lead. Skills fail because agents don't discover them, not because skill content is bad. Claude Code already solved this with always-loaded CLAUDE.md.
+3. **Sample too small**: 19 tasks, ~4 task difference between 100% and 79%. Not statistically robust for broad conclusions.
+4. **Self-serving narrative**: Vercel operates skills.sh AND authored the AGENTS.md codemod. They're deprecating one positioning and upselling another under the cover of transparency.
+5. **Integration scope correct**: 3-5 lines, no dedicated section needed.
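The sample-size concern in point 3 can be illustrated with a confidence interval. The sketch below computes Wilson score intervals for the two pass rates, assuming they correspond to 15/19 and 19/19 tasks (counts inferred from the rounded percentages, not stated in the article). It treats the two configurations as independent samples, which is a simplification because the same tasks were run under each configuration.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Assumed counts: 79% and 100% of 19 tasks, rounded to whole tasks.
skills_low, skills_high = wilson_interval(15, 19)
agents_low, agents_high = wilson_interval(19, 19)

print(f"skills + instructions: {skills_low:.2f} to {skills_high:.2f}")
print(f"AGENTS.md index:       {agents_low:.2f} to {agents_high:.2f}")
print("intervals overlap:", agents_low <= skills_high)
```

With only 19 tasks the two intervals overlap, which is consistent with the challenger's caution about drawing broad conclusions from this sample.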
+
+**Score adjustment**: None (3/5 confirmed)
+
+---
+
+## System Architect Challenge
+
+Agent challenged the evaluation from an architectural perspective:
+
+**Key arguments**:
+1. **Correct score, wrong reasoning**: "Validates our choices" is a non-argument. What saves the 3/5 is the compression benchmark — the only actionable data point.
+2. **Missing architectural pattern**: The article demonstrates the **eager loading vs lazy invocation** trade-off. The guide documents this factually (line 4074-4080 table) but never names the pattern or provides empirical evidence for it.
+3. **Plan under-specified**: Original "3-5 lines, 2 sections" is a placeholder. Correct plan: **3 lines, 3 specific locations** (lines 3527, 4082, 5652).
+4. **Missing compression technique**: The guide says "keep it concise" but never shows *how* to compress. The 5x compression (40KB → 8KB) with zero loss is actionable guidance the guide lacks.
+5. **Decision criteria gap**: The guide's loading table (line 4074) lacks a **criticality criterion** — if instructions are critical, use CLAUDE.md (eager, 100% loaded); if supplementary, skills suffice (lazy, 56% invocation acceptable).
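The criticality criterion in point 5 can be written down as a small routing rule. This is an illustrative sketch only: the `Instruction` record and `choose_location` helper are hypothetical names, and the example rules are made up to show the two branches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """Hypothetical record describing one piece of agent guidance."""
    text: str
    critical: bool


def choose_location(instruction: Instruction) -> str:
    """Route critical guidance to always-loaded context, the rest to skills."""
    if instruction.critical:
        return "CLAUDE.md"        # eager: loaded at session start
    return ".claude/skills/"      # lazy: loaded only when invoked


# Made-up examples, one per branch.
rules = [
    Instruction("Never commit secrets", critical=True),
    Instruction("How to generate release notes", critical=False),
]
for rule in rules:
    print(f"{rule.text!r} -> {choose_location(rule)}")
```

The rule is deliberately binary; a real project might also weigh token cost, since everything placed in CLAUDE.md is loaded in every session.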
+
+**Score adjustment**: None (3/5 confirmed), but integration plan upgraded to 3 precise locations.
+
+---
+
+## Conflict of Interest Note
+
+Vercel operates skills.sh (the skills marketplace) and authored the `npx @next/codemod@canary agents-md` tool evaluated in this article. The article concludes that their own skills.sh platform underperforms compared to AGENTS.md. While this appears intellectually honest (arguing against their own product), Vercel is positioning a different Vercel tool as the replacement. The methodology is transparent and reproducible, so this is noted as context rather than a disqualifier.
+
+---
+
+## Integration Plan
+
+Three surgical insertions in existing sections:
+
+### 1. CLAUDE.md Sizing (line ~3527)
+Add compression benchmark after the size guideline paragraph.
+
+### 2. Memory Loading Key Insight (line ~4082)
+Add warning about skill invocation reliability after the existing key insight.
+
+### 3. Skills Trade-offs (line ~5652)
+Add bullet about invocation reliability.
+
+---
+
+## Decision
+
+| Aspect | Verdict |
+|--------|---------|
+| **Score** | 3/5 (unanimous across 2 challenger agents) |
+| **Action** | Integrate (3 lines, 3 sections) |
+| **Confidence** | High (13/13 fact-checked + double challenge) |
+| **Priority** | Low (confirms existing architecture) |
+| **Transferable insights** | 56% invocation rate, 8KB compression benchmark, eager vs lazy evidence |
--- a/guide/cheatsheet.md
+++ b/guide/cheatsheet.md
@@ -6,7 +6,7 @@

 **Written with**: Claude (Anthropic)

-**Version**: 3.20.0 | **Last Updated**: January 2026
+**Version**: 3.20.1 | **Last Updated**: January 2026

 ---

@@ -484,4 +484,4 @@ where.exe claude; claude doctor; claude mcp list

 **Author**: Florian BRUNIAUX | [@Méthode Aristote](https://methode-aristote.fr) | Written with Claude

-*Last updated: January 2026 | Version 3.20.0*
+*Last updated: January 2026 | Version 3.20.1*
--- a/guide/learning-with-ai.md
+++ b/guide/learning-with-ai.md
@@ -111,7 +111,7 @@ Most developers experience three distinct phases:
 | **Targeted Gains** | 2-8 weeks | +20-50% | AI accelerates specific tasks you've learned to delegate effectively |
 | **Sustainable Plateau** | 3-6 months | +20-30% | Stable gains, but only for developers who already have strong fundamentals |

-**Critical nuance**: These gains are conditional. Studies show experienced developers (5+ years) see larger, sustained gains. Junior developers often see initial spikes followed by regression — because speed without understanding creates technical debt.
+**Critical nuance**: These gains are conditional. Studies show experienced developers (5+ years) see larger, sustained gains. Junior developers often see initial spikes followed by regression — because speed without understanding creates technical debt. A 2026 RCT ([Shen & Tamkin, Anthropic Fellows](https://arxiv.org/abs/2601.20245)) measured a **17% reduction in skills acquisition** when developers learned a new library with AI assistance (n=52, p=0.01) — with no significant time savings. Only ~20% of AI users (pure delegation pattern) finished faster, at the cost of learning almost nothing.

 ### Where AI Helps (And Where It Hurts)

@@ -865,6 +865,7 @@ Warning signs you're becoming dependent, and what to do:
 | Rejected in interviews | Fundamentals atrophied | Practice whiteboard problems without AI |
 | Always ask "how" never "why" | Surface-level usage | Force yourself to ask "why this approach?" |
 | Every solution looks the same | AI has patterns, you need variety | Study multiple implementations manually |
+| Task feels easy but you can't explain it | **Perception gap** — AI users rate tasks easier while scoring 17% lower ([Shen & Tamkin 2026](https://arxiv.org/abs/2601.20245)) | After each task, explain the solution without looking at code |

 ### Weekly Self-Audit

@@ -886,6 +887,7 @@ If you're faster but not smarter, you're building dependency.
 - **GitHub Copilot Impact Study (2024)** — [dl.acm.org](https://dl.acm.org/doi/10.1145/3613904.3642394) — Found productivity gains but identified skill atrophy risks in junior developers
 - **Student Dependency Patterns in AI-Assisted Learning** — IACIS 2024 — Documented "learned helplessness" in students over-reliant on AI
 - **Junior Developer Career Trajectories with AI Tools** — Software Engineering Institute — 3-year longitudinal study on skill development
+- **AI Impacts on Skill Formation (Shen & Tamkin, 2026)** — [arXiv:2601.20245](https://arxiv.org/abs/2601.20245) — Anthropic Fellows RCT (52 devs learning Python Trio with/without GPT-4o): AI group scored 17% lower on skills quiz (Cohen's d=0.738, p=0.01) with no significant speed gain. Identified 6 interaction patterns — 3 preserving learning (conceptual inquiry, hybrid explanation, generation-then-comprehension) via active cognitive engagement.

 ### Industry Reports

@@ -901,6 +903,7 @@ Sources for [§3 The Reality of AI Productivity](#the-reality-of-ai-productivity
 - **McKinsey Developer Productivity Report (2024)** — [mckinsey.com](https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/unleashing-developer-productivity-with-generative-ai) — Comprehensive analysis of AI impact across dev workflows
 - **Stack Overflow 2024: AI Sentiment** — [stackoverflow.co](https://stackoverflow.co/labs/developer-sentiment-ai-ml/) — Developer attitudes toward AI tools, productivity perceptions
 - **Uplevel Engineering Intelligence (2024)** — Burnout and productivity metrics with AI coding tools
+- **METR Experienced Developer RCT (2025)** — [arXiv:2507.09089](https://arxiv.org/abs/2507.09089) — Randomized controlled trial (16 experienced devs, 246 issues, repos 1M+ lines): AI tools made developers 19% slower on familiar codebases, despite perceiving themselves 20% faster (39-point perception gap). Strong evidence that perceived speed gains can diverge from measured ones, even for experienced developers.
 - **DORA/Google DevOps Research (2024)** — AI tool adoption impact on team performance

 ### Practitioner Perspectives
--- a/guide/ultimate-guide.md
+++ b/guide/ultimate-guide.md
@@ -10,7 +10,7 @@

 **Last updated**: January 2026

-**Version**: 3.20.0
+**Version**: 3.20.1

 ---

@@ -3524,7 +3524,7 @@ Month 3: 50 rules → 50 mistakes prevented + faster onboarding

 **Anti-pattern**: Preemptively documenting everything. Instead, treat CLAUDE.md as a **living document** that grows through actual mistakes caught during development.

-**Size guideline**: Keep CLAUDE.md files between **4-8KB total** (all levels combined). Practitioner studies show that context files exceeding 16K tokens degrade model coherence. Include architecture overviews, key conventions, and critical constraints—exclude full API references or extensive code examples (link to them instead).
+**Size guideline**: Keep CLAUDE.md files between **4-8KB total** (all levels combined). Practitioner studies show that context files exceeding 16K tokens degrade model coherence. Include architecture overviews, key conventions, and critical constraints—exclude full API references or extensive code examples (link to them instead). Vercel's Next.js team compressed ~40KB of framework docs to an 8KB index with zero performance loss in agent evals ([Gao, 2026](https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals)), confirming the 4-8KB target.
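One way to keep an eye on the size guideline is a small script that sums the CLAUDE.md files in play and compares the total to the upper bound. This is a minimal sketch, not part of the guide's tooling: the 1KB = 1024 bytes convention, the helper names, and the project-level `CLAUDE.md` path in the current directory are assumptions.

```python
from pathlib import Path

MAX_BYTES = 8 * 1024  # upper end of the 4-8KB guideline, assuming 1KB = 1024 bytes


def total_size(paths: list[Path]) -> int:
    """Sum the on-disk size in bytes of the files that exist."""
    return sum(p.stat().st_size for p in paths if p.is_file())


def within_budget(paths: list[Path], max_bytes: int = MAX_BYTES) -> bool:
    """True when the combined size stays at or under the budget."""
    return total_size(paths) <= max_bytes


# Hypothetical locations; adjust to wherever your CLAUDE.md files live.
candidates = [Path.home() / ".claude" / "CLAUDE.md", Path("CLAUDE.md")]
size = total_size(candidates)
print(f"{size} bytes, within budget: {within_budget(candidates)}")
```

Missing files are skipped rather than treated as errors, so the same script works whether or not a global file exists.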

 ### Level 1: Global (~/.claude/CLAUDE.md)

@@ -4079,7 +4079,7 @@ Understanding when each memory method loads is critical for token optimization:
 | `.claude/commands/*.md` | Invocation only | Only when invoked | Workflow templates |
 | `.claude/skills/*.md` | Invocation only | Only when invoked | Domain knowledge modules |

-**Key insight**: `.claude/rules/` is NOT on-demand. Every `.md` file in that directory loads at session start, consuming tokens. Reserve it for always-relevant conventions, not rarely-used guidelines.
+**Key insight**: `.claude/rules/` is NOT on-demand. Every `.md` file in that directory loads at session start, consuming tokens. Reserve it for always-relevant conventions, not rarely-used guidelines. Skills are invocation-only and may not be triggered reliably—one eval found agents invoked skills in only 56% of cases ([Gao, 2026](https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals)). Never rely on skills for critical instructions; use CLAUDE.md or rules instead.

 > **See also**: [Token Cost Estimation](#token-saving-techniques) for approximate token costs per file size.

@@ -5650,6 +5650,7 @@ Full catalog: [skills.sh leaderboard](https://skills.sh/)
 - ✅ Format 100% compatible with this guide
 - ⚠️ Multi-agent focus (not Claude Code specific)
 - ⚠️ Early stage (maturity to prove over time)
+- ⚠️ Skills require explicit invocation; agents only auto-invoke them ~56% of the time ([Gao, 2026](https://vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals)). For critical instructions, prefer always-loaded CLAUDE.md

 #### When to Use

@@ -15743,4 +15744,4 @@ We'll evaluate and add it to this section if it meets quality criteria.

 **Contributions**: Issues and PRs welcome.

-**Last updated**: January 2026 | **Version**: 3.20.0
+**Last updated**: January 2026 | **Version**: 3.20.1
--- a/machine-readable/reference.yaml
+++ b/machine-readable/reference.yaml
@@ -3,7 +3,7 @@
 # Source: guide/ultimate-guide.md
 # Purpose: Condensed index for LLMs to quickly answer user questions about Claude Code

-version: "3.20.0"
+version: "3.20.1"
 updated: "2026-01-30"

 # ════════════════════════════════════════════════════════════════
@@ -826,7 +826,7 @@ ecosystem:
      - "Cross-links modified → Update all 4 repos"
    history:
      - date: "2026-01-20"
-        event: "Code Landing sync v3.20.0, 66 templates, cross-links"
+        event: "Code Landing sync v3.20.1, 66 templates, cross-links"
        commit: "5b5ce62"
      - date: "2026-01-20"
        event: "Cowork Landing fix (paths, README, UI badges)"