feat: smart-suggest ROI script + hook tuning + guide updates (Mar 16)

- Add examples/scripts/smart-suggest-roi.py: stdlib-only analyzer correlating suggestion log with session JSONL files to measure command acceptance rate. 4 acceptance signals, tier breakdown, daily trend, --json/--since/--no-sessions CLI.
- Tune Aristote smart-suggest hook: tighten 5 over-firing triggers (/tech:commit, /tech:sonarqube, /tech:dupes, /check-conventions a11y, /tech:worktree)
- Guide: identity re-injection hook, context engineering maturity grid, code review workflow, 1M context window GA update, Spring Break promo, security audit patterns
- Resource evaluations: Nick Tune hooks (3/5), VicKayro security audit (2/5), Karl Mazier CLAUDE.md templates, Paul Rayner ContextFlow, Siddhant agent trace, Andrew Ng context hub, JP Caparas 1M context window

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 12:20:40 +01:00 · 2026-03-16 12:20:40 +01:00 · da8bc09f2d
commit da8bc09f2d
parent d9cff74d71
19 changed files with 1963 additions and 6 deletions
--- a/docs/resource-evaluations/2026-03-16-agent-trace-siddhant-github.md
+++ b/docs/resource-evaluations/2026-03-16-agent-trace-siddhant-github.md
@@ -0,0 +1,93 @@
+# Evaluation: agent-trace (Siddhant-K-code/agent-trace)
+
+**Date**: 2026-03-16
+**Source**: https://github.com/Siddhant-K-code/agent-trace
+**Type**: GitHub repository (Python tool)
+**Evaluator**: Claude (eval-resource skill)
+
+---
+
+## Summary
+
+`agent-trace` (pip package: `agent-strace`) is a Python tool — zero dependencies, stdlib only — that captures every tool call, user prompt, and assistant response in Claude Code via hooks, then lets you replay sessions in the terminal or export as OpenTelemetry spans. Created 2026-03-15. 7 stars at time of evaluation.
+
+The "strace for AI agents" framing is apt: it solves the "my agent modified 47 files and I have no idea why" problem by giving you a time-stamped, replayable record of every decision point.
+
+---
+
+## Key Points
+
+- **Claude Code hooks**: Setup via `agent-strace setup`. Registers PreToolUse, PostToolUse, PostToolUseFailure, UserPromptSubmit, Stop, SessionStart, SessionEnd in `.claude/settings.json`
+- **Session replay**: `agent-strace replay` shows full session with timestamps, durations, tool inputs, errors — the missing layer between JSONL and understanding
+- **MCP proxy**: Wraps any MCP server (stdio or HTTP/SSE). Works with Cursor, Windsurf, any MCP client
+- **OpenTelemetry export**: OTLP output → Datadog, Honeycomb, New Relic, Splunk
+- **Python decorator API**: `@trace_tool`, `@trace_llm_call`, `log_decision()` for custom agents
+- **Secret redaction**: `--redact` flag strips OpenAI, GitHub, AWS, Anthropic, Slack, JWTs, Bearer tokens, connection strings
+
+---
+
+## Relevance Score: 2/5
+
+**Pertinent but too immature for immediate integration.**
+
+The session replay angle is real and not covered by existing tools in the guide. But `claude-code-otel` already handles the OTel export use case, and the manual jq queries at `guide/ops/observability.md:519-550` cover most of the audit use case. The unique differentiator — interactive replay — needs production validation before being recommended to readers.
+
+---
+
+## Comparison vs Current Guide Coverage
+
+| Aspect | agent-trace | Guide coverage |
+|--------|-------------|----------------|
+| Manual JSONL audit (jq) | ✅ Abstracted as CLI | ✅ observability.md:519-550 |
+| Session replay (visual) | ✅ Unique differentiator | ❌ Not covered |
+| OpenTelemetry export | ✅ OTLP | ✅ claude-code-otel already in table |
+| Hook setup automation | ✅ `agent-strace setup` | ✅ Documented manually |
+| MCP proxy (Cursor/Windsurf) | ✅ stdio + HTTP/SSE | ❌ Not covered |
+| Python decorator API | ✅ Custom agents | ❌ Not covered |
+| Maturity | ❌ 1 day old, 7 stars | ✅ Table tools have 100-10K stars |
+
+---
+
+## Challenge Notes (technical-writer review)
+
+**Score should be 2/5, not 3/5.** Reasons:
+
+1. `claude-code-otel` already exports to Datadog/Honeycomb. The OTel angle is not additive.
+2. The jq queries at observability.md:519-550 cover most of the audit use case already. The "replay niche" is thinner than it appears.
+3. ICM (1 star) was put on watch list. Agent-trace at 7 stars deserves the same treatment.
+
+**Missing aspects not in initial analysis**:
+
+- **MCP proxy = MITM risk**: Routing all MCP traffic through an unaudited HTTP/SSE proxy is a security surface. The guide has a full hardening section — adding this to the monitoring table without flagging would be inconsistent.
+- **Secret redaction unverified**: Base64-encoded tokens, multi-line .env values, AWS temporary credentials — edge cases not tested. Could create false confidence.
+- **Python decorator API vs MLflow SDK**: MLflow has versioning + experiment tracking + LLM-as-judge. Agent-trace has lower friction. Real trade-off not mentioned.
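The redaction concern can be made concrete with a minimal sketch of a pattern-based redactor. The two patterns and the `redact` helper below are hypothetical illustrations, not agent-strace's actual implementation; they only show how a base64-encoded token can slip past prefix matching.

```python
import base64
import re

# Hypothetical patterns for illustration only; not taken from agent-strace.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),  # OpenAI-style key prefix
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # GitHub personal access token prefix
]


def redact(text: str) -> str:
    """Replace every match of a known secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


fake_key = "sk-" + "a" * 24  # fake value, not a real credential
plain = f"OPENAI_API_KEY={fake_key}"
encoded = "payload=" + base64.b64encode(fake_key.encode()).decode()

print(redact(plain))    # the plain key is masked
print(redact(encoded))  # the encoded key passes through unchanged
```

Standard base64 output never contains `-` or `_`, so neither prefix pattern can match the encoded form. Whether agent-strace handles this case is exactly what remains untested.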
+
+**On placement**: If integrated, not in the External Monitoring Tools table (that's monitoring, not debugging). Better as a footnote in the JSONL section (~observability.md:565) as "a higher-level wrapper for session replay."
+
+**Risk of NOT integrating**: Near zero. The jq queries + claude-code-otel cover the primary use cases. The real risk runs the other direction: adding a 1-day-old tool that later goes unmaintained leaves a dead link in a table readers use for tooling decisions.
+
+---
+
+## Fact-Check
+
+| Claim | Verified | Source |
+|-------|----------|--------|
+| Zero dependencies, Python stdlib only | ✅ | pyproject.toml + README |
+| Created 2026-03-15 | ✅ | GitHub API: `created_at: 2026-03-15T08:09:45Z` |
+| MIT licensed | ✅ | GitHub API: `license: MIT License` |
+| Captures all CC hook events | ✅ | README hooks JSON: all 7 event types |
+| Export to Datadog, Honeycomb, Splunk | ✅ | README: `export --to otlp` (OTLP compatible) |
+| 7 stars at evaluation | ✅ | GitHub API 2026-03-16 |
+
+No hallucinations detected. All stats confirmed against source.
+
+---
+
+## Decision
+
+**Action: Watch list**
+**Integration trigger**: 100+ stars AND at least one practitioner write-up showing real production use.
+
+**If triggered**: Add as footnote in observability.md ~line 565 (JSONL section), not in the External Monitoring Tools table. Frame as "higher-level wrapper for session replay/debug" distinct from the monitoring tools.
+
+**Why watch list and not reject**: Session replay is a real gap. Zero-deps Python is a genuine adoption differentiator. The engineering quality looks solid (automated setup, secret redaction, HTTP/SSE proxy). Just needs time to prove reliability on real sessions.
--- a/docs/resource-evaluations/2026-03-16-andrewyng-context-hub.md
+++ b/docs/resource-evaluations/2026-03-16-andrewyng-context-hub.md
@@ -0,0 +1,76 @@
+# Resource Evaluation: Context Hub (andrewyng/context-hub)
+
+**Date**: 2026-03-16
+**Source**: LinkedIn post (text) + https://github.com/andrewyng/context-hub
+**Type**: Open-source CLI tool
+**Author**: Andrew Ng (andrewyng)
+**Score**: 2/5
+
+---
+
+## Summary of Content
+
+- **What it is**: A CLI tool (`chub`) providing coding agents with curated, versioned API documentation as markdown files
+- **Core commands**: `chub get openai/chat --lang py` to fetch API docs; `chub annotate <id> "note"` for persistent cross-session annotations
+- **Corpus**: 602+ documentation entries (as of 2026-03-16), covering OpenAI, Anthropic, Stripe, AWS, and others
+- **Community loop**: Users vote on doc quality (`chub feedback`), surfacing improvements to maintainers
+- **Claude Code integration**: SKILL.md support for dropping into `~/.claude/skills/`
+- **License and traction**: MIT license; 6,342 stars
+
+---
+
+## Score: 2/5
+
+**Justification**: One genuinely novel feature (cross-session persistent annotations on external API docs) that Context7 cannot replicate. Everything else overlaps with existing guide coverage: Context7 already handles versioned library docs, `@url` natively pulls live documentation into Claude Code context, and anti-hallucination patterns are already documented. The annotation use case is real but solves a narrow problem. No production benchmarks, no independent validation.
+
+---
+
+## Comparative Analysis
+
+| Aspect | Context Hub | Our Guide |
+|--------|------------|-----------|
+| Curated API docs for agents | New CLI approach | Not covered as dedicated tool |
+| Cross-session doc annotations | Unique feature | Not covered |
+| Official library docs lookup | Overlaps with Context7 | Covered (Section 8, Context7) |
+| Live URL context | Overlaps with native `@url` | Covered (native Claude Code) |
+| Agent hallucination prevention | Indirect angle | Covered but scattered |
+| Maintenance/freshness guarantees | Community-maintained, lag risk | N/A |
+
+---
+
+## Challenge Notes (technical-writer agent)
+
+**Key pushbacks:**
+
+1. **Stars ≠ adoption**: 6,342 stars driven by Andrew Ng's social amplification, not production validation
+2. **Context7 overlap not demonstrated**: `chub get openai/chat --lang py` vs Context7's `query-docs` — the evaluation doesn't prove the concrete gap
+3. **Annotation is the only novel angle**: and it got buried — it's the one feature Context7 cannot replicate
+4. **Hallucination framing is a stretch**: community-maintained docs introduce a trust problem Context7 avoids (official sources)
+5. **Missing: `@url` native alternative**: Claude Code already pulls live docs natively, weakening the "gap" case
+6. **Missing: maintenance risk**: update lag when APIs change vs. Context7's live resolution
+7. **Risk of not integrating**: Low — existing guide coverage (Context7, `@url`, grepai) handles most use cases
+
+---
+
+## Fact-Check
+
+| Claim (from LinkedIn post) | Verdict | Notes |
+|---------------------------|---------|-------|
+| "Andrew Ng just dropped" | Verified | Repo owner is `andrewyng`, not a fork |
+| "68+ APIs" | False | Actual corpus: 602+ entries as of 2026-03-16 |
+| "One of the fastest accelerating new repos" | Unverifiable | 6,342 stars in ~5 months; no public velocity data |
+| "100% free & open source (MIT)" | Verified | MIT confirmed in license file |
+
+**Corrections**: The "68+ APIs" figure is either from an early snapshot or fabricated. Real coverage is ~9x larger. The LinkedIn post is marketing-inflated.
+
+---
+
+## Recommendation
+
+**Action**: Do not integrate — one-line mention only.
+
+If mentioned at all, one sentence under the Context7 entry in Section 8 (MCP servers): "For teams requiring persistent annotations on external API docs across sessions, see [context-hub](https://github.com/andrewyng/context-hub)."
+
+No section, no dedicated coverage, no hallucination-prevention framing. Revisit if production use cases emerge in the community.
+
+**Confidence**: High (fact-check complete, challenge addressed)
--- a/docs/resource-evaluations/2026-03-16-karl-mazier-claudemd-templates-linkedin.md
+++ b/docs/resource-evaluations/2026-03-16-karl-mazier-claudemd-templates-linkedin.md
@@ -0,0 +1,107 @@
+# Resource Evaluation: Karl MAZIER — LinkedIn post on CLAUDE.md structure + reizam/claude-md-templates
+
+**Source**: LinkedIn post (Karl MAZIER) + https://github.com/reizam/claude-md-templates
+**Author**: Karl MAZIER (co-founder, Open Source & SaaS, YC)
+**Date**: ~2026-03-02 (LinkedIn post date estimated ~2 weeks before 2026-03-16)
+**Type**: LinkedIn post + GitHub repo (community toolkit)
+**Evaluated**: 2026-03-16
+
+---
+
+## 📄 Summary
+
+- Two-level CLAUDE.md structure: `~/.claude/CLAUDE.md` (global, ~30 lines) + `./CLAUDE.md` per repo (~40 lines). Global = coding philosophy/conventions. Project = what the agent can't discover from code alone.
+- References ETH Zurich paper arXiv 2602.11988: AI-generated context files yield -3% agent success rate and +20% inference cost. Human-written minimal files yield ~+4%.
+- Core rule: write only what the agent cannot discover independently from the codebase.
+- GitHub repo (https://github.com/reizam/claude-md-templates): 2 fork-ready templates + 3 slash commands installable via `npx skills add reizam/claude-md-templates` (generate global, generate project, audit existing CLAUDE.md).
+- The `Philosophy` section highlighted as the most critical part of global CLAUDE.md.
+
+---
+
+## 🎯 Score
+
+| Score | Meaning |
+|-------|---------|
+| 5 | Essential — major gap in the guide |
+| 4 | High value — significant improvement |
+| 3 | Relevant — useful complement |
+| **2** | **Marginal — secondary info** |
+| 1 | Out of scope |
+
+**Score: 2/5**
+
+**Justification**: The ETH Zurich paper (arXiv 2602.11988) was already evaluated on 2026-02-19 and scored 4/5 with a full integration plan. This LinkedIn post is a community summary of that same paper without original analysis. The GitHub repo templates (8 stars, created 2026-02-27) are redundant with existing `examples/memory/CLAUDE.md.personal-template` (68 lines) and `CLAUDE.md.project-template` (72 lines) in the guide. One element is genuinely new: the `/claude-md-audit` slash command concept — a skill to analyze existing CLAUDE.md for bloat — which the guide doesn't have as an installable command.
+
+---
+
+## ⚖️ Comparison
+
+| Aspect | This resource | Our guide |
+|--------|--------------|-----------|
+| ETH Zurich paper data (-3%/+20%) | ✅ Cited | ✅ Already evaluated (2026-02-19) |
+| Two-level hierarchy (global vs project) | ✅ Described | ✅ Well covered (context-engineering.md) |
+| Size targets (~30 / ~40 lines) | ✅ Specific numbers | ⚠️ Guide says <200 lines global (more conservative) |
+| "Write only what agent can't discover" rule | ✅ Clear decision test | ⚠️ Implied but not stated as a test |
+| Practical templates | ✅ Fork-ready | ✅ Already in examples/memory/ |
+| CLAUDE.md audit command | ✅ `/claude-md-audit` skill | ❌ Not implemented |
+| Philosophy section emphasis | ✅ Named as most critical | ⚠️ Covered implicitly |
+
+---
+
+## 📍 Recommendations
+
+**Not worth a standalone integration** given the paper is already evaluated.
+
+One targeted addition is worth considering:
+
+**Where**: `guide/core/context-engineering.md`, existing section on CLAUDE.md size/quality.
+
+**What**: Add the decision test as a one-liner: "Write only what the agent cannot discover from the code itself." This is a sharper formulation than the current guidance ("essentiels au projet", i.e. "essential to the project") and works as a practical heuristic.
+
+**Priority**: Low. The paper integration (already planned from 2026-02-19 evaluation) is the priority. This is a wording improvement at best.
+
+The `/claude-md-audit` command concept from the GitHub repo is worth tracking for the examples/commands/ directory if the repo gains adoption (currently 8 stars — too early).
+
+---
+
+## 🔥 Challenge
+
+**Score: 2/5 confirmed** by technical-writer agent with one nuance:
+
+The evaluation correctly separates the paper (already covered) from the packaging (LinkedIn post + 8-star repo). The agent challenged whether the score should stay at 2/5 or drop to 1/5 given prior coverage. It stays at 2/5 because:
+- The "write only what the agent can't discover" formulation is genuinely more actionable than current guide wording
+- The audit command concept is novel even if the repo is too young to reference
+
+**Risks of not integrating**: Near zero. The paper is already queued for integration. This post adds no independent value beyond the paper.
+
+**Points missed**: The post misreads the paper slightly — suggesting global CLAUDE.md should be "~30 lines." The paper's recommendation is more nuanced: write only the essential commands and project-specific tooling, not a target line count. The guide's adherence degradation data (lines 132-141 in context-engineering.md) is actually more actionable than the "~30 lines" heuristic.
+
+---
+
+## ✅ Fact-Check
+
+| Claim | Verified | Source |
+|-------|----------|--------|
+| ETH Zurich paper exists (arXiv 2602.11988) | ✅ | Prior evaluation 2026-02-19 |
+| AI-generated files: -3% perf | ✅ | arXiv abstract + Perplexity |
+| Human-written files: slight gain | ✅ | +4% confirmed in paper |
+| +20% inference cost claim | ✅ | "over 20%" in arXiv abstract |
+| reizam/claude-md-templates on GitHub | ✅ | 8 stars, 2 forks, created 2026-02-27 |
+| `npx skills add` mechanism | ✅ | GitHub repo confirms this |
+| "~30 lines global / ~40 lines project" | ⚠️ | Author's interpretation, not paper recommendation |
+
+**Corrections**: The "~30 lines / ~40 lines" targets are the author's own heuristic, not a finding from the ETH Zurich paper. The paper recommends minimal context focused on build/test commands and specific tooling, without a line count target.
+
+---
+
+## 🎯 Final Decision
+
+- **Score**: 2/5
+- **Action**: No integration — the underlying paper is already queued (2026-02-19 evaluation). Note the "write only what the agent can't discover" formulation for possible wording improvement in context-engineering.md.
+- **Confidence**: High
+
+**Cross-reference**: `docs/resource-evaluations/agents-md-empirical-study-2602-11988.md` is the evaluation of the paper this post summarizes, already scored 4/5 with a full integration plan.
+
+---
+
+*Evaluated: 2026-03-16 | Method: text analysis + grepai_search + technical-writer challenge + agent research*
--- a/docs/resource-evaluations/2026-03-16-nick-tune-hook-driven-workflows.md
+++ b/docs/resource-evaluations/2026-03-16-nick-tune-hook-driven-workflows.md
@@ -0,0 +1,181 @@
+# Resource Evaluation: Hook-Driven Dev Workflows with Claude Code
+
+**Date**: 2026-03-16
+**Evaluator**: Claude Sonnet 4.6
+**Resource URL**: https://nick-tune.me/blog/2026-02-28-hook-driven-dev-workflows-with-claude-code/
+**Resource Type**: Technical blog post
+**Author**: Nick Tune
+**Published**: 2026-02-28
+
+---
+
+## Executive Summary
+
+Nick Tune (already cited 4 times in the guide for his earlier Medium article) presents a new pattern that treats Claude Code hooks as a **workflow enforcement engine** with a typed state machine, JSON persistence, and per-state context injection. The guide covers individual hook types in isolation and has a bash-based single-entry dispatcher (§7.5). This article adds a TypeScript state machine layer on top of hooks that the guide does not cover. The most immediately useful standalone pattern — identity re-injection after compaction — can be integrated now without the full state machine.
+
+**Recommendation**: **MODERATE (Score 3/5)** — Integrate the identity re-injection pattern immediately in §7.5. Stage the full state machine architecture as a Tier 3 workflow guide with explicit prerequisites. Re-evaluate the full integration at 4/5 in 60-90 days once community validation exists.
+
+---
+
+## Scoring Summary
+
+| Criterion | Score | Weight | Weighted Score |
+|-----------|-------|--------|----------------|
+| **Accuracy & Reliability** | 4 | 20% | 0.80 |
+| **Depth & Comprehensiveness** | 5 | 20% | 1.00 |
+| **Practical Value** | 4 | 25% | 1.00 |
+| **Originality & Uniqueness** | 3 | 15% | 0.45 |
+| **Production Readiness** | 2 | 10% | 0.20 |
+| **Community Validation** | 2 | 10% | 0.20 |
+| **TOTAL SCORE** | | | **3.65 → 3/5** |
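As a check on the arithmetic, the weighted total can be recomputed from the table's scores and weights. The sketch below uses only those published values; the `CRITERIA` name and helper are illustrative.

```python
# (score out of 5, weight) per criterion, copied from the scoring table.
CRITERIA = {
    "Accuracy & Reliability": (4, 0.20),
    "Depth & Comprehensiveness": (5, 0.20),
    "Practical Value": (4, 0.25),
    "Originality & Uniqueness": (3, 0.15),
    "Production Readiness": (2, 0.10),
    "Community Validation": (2, 0.10),
}


def weighted_total(criteria: dict) -> float:
    """Sum of score * weight across all criteria."""
    return sum(score * weight for score, weight in criteria.values())


weight_sum = sum(weight for _, weight in CRITERIA.values())
total = round(weighted_total(CRITERIA), 2)
print(f"weights sum to {weight_sum:.2f}, weighted total = {total}")
```

The weights sum to 1.00 and the total matches the table's 3.65. Plain rounding would give 4/5, so the final 3/5 reflects the downward correction recorded under Challenge Findings rather than the arithmetic alone.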
+
+---
+
+## Content Summary
+
+The article introduces a hook-driven workflow pattern built on five core ideas:
+
+- **Hooks as state machine**: SubagentStart, SubagentStop, PreToolUse, and TeammateIdle hooks feed into a TypeScript workflow engine managing state transitions (SPAWN → PLANNING → RESPAWN → DEVELOPING → REVIEWING → COMMITTING → CR_REVIEW → PR_CREATION → FEEDBACK → COMPLETE)
+- **Single-entrypoint dispatch**: One hook handler for all events via a `HOOK_HANDLERS` map dispatching by `hook_event_name` — similar concept to guide §7.5 bash dispatcher, but TypeScript + stateful
+- **State-specific context injection**: SubagentStart reads `/states/<state>.md` files and injects them into agent context — agents only see instructions relevant to the current state, avoiding bloated system prompts
+- **Respawn pattern**: After each iteration, developer and reviewer agents shut down and fresh instances spawn, giving each iteration a clean context window
+- **Identity re-injection after compaction**: Hooks detect when an agent has forgotten its identity prefix (after compaction) and re-inject identity instructions — the most standalone, transferable pattern in the article
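The single-entrypoint dispatch idea from the list above can be sketched as follows. The article's implementation is TypeScript; this is a Python transliteration of the concept only. The handler names and the returned dict shape are assumptions for illustration; only the `HOOK_HANDLERS` map name and the `hook_event_name` field come from the evaluation.

```python
import json
from typing import Callable, Dict


# Hypothetical handlers: each receives the parsed hook payload and returns
# a small result dict. The result shape is an assumption for illustration.
def on_subagent_start(payload: dict) -> dict:
    return {"action": "inject_state_instructions"}


def on_pre_tool_use(payload: dict) -> dict:
    return {"action": "check_operation_allowed"}


# One map for all events, keyed by the payload's hook_event_name.
HOOK_HANDLERS: Dict[str, Callable[[dict], dict]] = {
    "SubagentStart": on_subagent_start,
    "PreToolUse": on_pre_tool_use,
}


def dispatch(raw_input: str) -> dict:
    """Route one hook payload (a JSON string) to the handler for its event."""
    payload = json.loads(raw_input)
    handler = HOOK_HANDLERS.get(payload.get("hook_event_name", ""))
    if handler is None:
        return {"action": "ignore"}  # events without a handler pass through
    return handler(payload)


print(dispatch('{"hook_event_name": "PreToolUse"}'))
print(dispatch('{"hook_event_name": "Stop"}'))
```

Keeping the routing in one map means adding an event is a one-line change, which is the property the bash dispatcher in §7.5 shares with the TypeScript version.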
+
+---
+
+## Gap Analysis vs. Claude Code Ultimate Guide
+
+| Pattern | This Article | Guide Coverage |
+|---------|-------------|----------------|
+| Single-entrypoint hook dispatch | ✅ TypeScript, stateful | ⚠️ §7.5 covers bash dispatcher concept |
+| State machine with typed transitions (Zod) | ✅ Full implementation | ❌ Not covered |
+| SubagentStart for context injection | ✅ State-specific file injection | ⚠️ Table mention only ("Subagent initialization") |
+| PreToolUse as per-state operation blocker | ✅ Blocks git commit during DEVELOPING | ⚠️ Covered as security pattern, not workflow state control |
+| Agent respawn for context window management | ✅ Explicit per-iteration respawn | ❌ Not covered |
+| Workflow state persistence (JSON + session ID) | ✅ Full example | ❌ Not covered |
+| Identity re-injection after compaction | ✅ Hook detects missing prefix, re-injects | ❌ Not covered |
+| Agent teams + hooks combined | ✅ Concrete end-to-end | ⚠️ Separate docs, not combined |
+
+**Note on originality**: The guide already references Nick Tune's earlier Medium article ("Coding Agent Development Workflows") 4 times, at lines 4962, 8978, 13799, and 15091. This is a different, newer article. The single-entrypoint dispatch concept also exists in §7.5 as a bash pattern. The actual delta is: TypeScript state machine, per-state SubagentStart injection, respawn, JSON persistence, and identity re-injection.
+
+---
+
+## Detailed Analysis
+
+### Accuracy & Reliability (4/5)
+
+Technical claims check out against Claude Code's documented hook behavior:
+- SubagentStart, SubagentStop, PreToolUse, TeammateIdle hooks are real and documented
+- `hook_event_name` field in hook input matches guide §7.4 docs
+- `exitCode: EXIT_ALLOW` / `EXIT_BLOCK` pattern is valid
+- Zod for schema validation is standard TypeScript practice
+- JSON file persistence keyed to session ID is a practical, correct approach
+
+One significant caveat: the author explicitly states "1 week of experimentation, cannot 100% recommend yet." That honest disclaimer is not a minor qualifier — it means this is unvalidated at any meaningful scale.
+
+### Depth & Comprehensiveness (5/5)
+
+Full TypeScript code for the dispatcher, a worked `/states/developing.md` example, JSON persistence schema with real fields, Zod transition map, DDD framing. GitHub repo with complete code exists. This is the article's strongest dimension — not vaporware, genuinely implementable.
+
+### Practical Value (4/5)
+
+The core problem is real: getting consistent workflows in codebases you don't fully control. The identity re-injection pattern alone is worth the read. The full state machine is more complex but still implementable.
+
+Barrier: the approach requires Node.js + TypeScript runtime (`npx tsx`). The guide's hooks section is bash-first by design. Any integration needs to address this friction explicitly.
+
+### Originality & Uniqueness (3/5)
+
+The single-entrypoint dispatcher already exists in §7.5 as bash. The agent teams feature is already documented. What's genuinely novel: (1) attaching a typed state machine to hooks, (2) per-state SubagentStart context injection from files, (3) respawn for context window hygiene, (4) identity re-injection after compaction. Strong delta on 4 specific patterns, not on the overall approach.
+
+### Production Readiness (2/5)
+
+1 week of testing. By the author's own account, the approach cannot be fully recommended yet. Known wiring complexity ("ugly and fragile at times"). The hook JSON config requires repeating the same entry for each event type, which the author acknowledges as a UX problem. This needs months of community validation before being presented as a recommended pattern.
+
+### Community Validation (2/5)
+
+Nick Tune is a credible practitioner (established author, DDD community). No adoption metrics for this specific article. The GitHub repo exists but engagement data is not available from the article.
+
+---
+
+## Prerequisites (Not Mentioned in Evaluation v1)
+
+Any integration must flag these hard dependencies:
+
+1. **Agent teams experimental flag**: `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1` — SubagentStart, SubagentStop, TeammateIdle are agent teams events. Without this flag, most of the patterns in the article are unavailable.
+2. **Opus 4.6 required**: Agent teams require Opus 4.6+, which costs significantly more than Sonnet.
+3. **Node.js + TypeScript toolchain**: `npx tsx` must be available in the project. Not a default Claude Code assumption.
+
+---
+
+## Recommended Integration
+
+### Tier 1 — Integrate Now (Standalone Pattern)
+
+**Identity re-injection after compaction** → `guide/ultimate-guide.md` §7.5 (Hook Examples)
+
+This pattern is immediately useful to anyone using hooks with long sessions — no agent teams flag, no TypeScript required. Compaction-driven identity drift is a known pain point. The hook-based detection + re-injection workaround belongs in §7.5 regardless of the rest of the article.
+
+```
+### Handling Identity Drift After Compaction
+When Claude's context compacts, agents in long sessions can "forget" their
+role. A hook can detect this and re-inject identity instructions.
+```
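A minimal sketch of the detection step, under stated assumptions: the agent is instructed to start every reply with a fixed prefix, and the hook can read the agent's latest reply. The prefix, the instruction text, and both function names are hypothetical; the wiring into a specific hook event is left out because the evaluation does not specify it.

```python
from typing import Optional

# Hypothetical identity marker and instructions; the article's real prefix
# and wording are not reproduced here.
IDENTITY_PREFIX = "[developer]"
IDENTITY_INSTRUCTIONS = (
    "You are the developer agent. Begin every reply with [developer]."
)


def has_identity(last_assistant_message: str) -> bool:
    """True when the agent's latest reply still starts with its prefix."""
    return last_assistant_message.lstrip().startswith(IDENTITY_PREFIX)


def reinjection_for(last_assistant_message: str) -> Optional[str]:
    """Return the text to re-inject, or None when no drift is detected."""
    if has_identity(last_assistant_message):
        return None
    return IDENTITY_INSTRUCTIONS


print(reinjection_for("[developer] Tests pass."))
print(reinjection_for("Sure, I can help with that."))
```

The prefix doubles as a cheap health check: as long as it appears, the hook stays silent, so the re-injection cost is only paid after drift.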
+
+### Tier 2 — Integrate with Prerequisites Gate (3-4 weeks)
+
+**Per-state SubagentStart context injection** → Add to agent-teams.md as an advanced pattern section. Prerequisite: agent teams flag. Key insight: inject state-specific files at runtime rather than bundling everything in the system prompt — reduces system prompt bloat and keeps agents focused.
+
+### Tier 3 — New Workflow Guide (60-90 days, pending community validation)
+
+**`guide/workflows/hook-driven-workflows.md`** — Full state machine architecture, once the pattern has more community validation. Clear prerequisites header (agent teams flag, Opus 4.6, Node.js + TypeScript). Frame as experimental / advanced.
+
+### What NOT to Document
+
+The specific CodeRabbit + GitHub issue + 10-state workflow is too opinionated. Document the architectural patterns; readers define their own states.
+
+---
+
+## Challenge Findings (Technical Review)
+
+The challenge agent identified several issues with the initial v1 evaluation:
+
+**Score correction**: The initial 4/5 score ("integrate within 1 week") was too aggressive for something the author hasn't validated beyond 1 week. Correct score is 3/5 now, with a scheduled re-evaluation in 60-90 days.
+
+**Originality was overstated**: The single-entrypoint dispatcher already exists in §7.5 as bash. The guide already references this author's earlier work 4 times. The real delta is narrower: state machine, per-state injection, respawn, identity re-injection.
+
+**Missing prerequisites**: Agent teams flag, Opus 4.6, Node.js + TypeScript — none of these were in v1. Any integration without flagging these prerequisites would frustrate readers.
+
+**Identity re-injection is the most urgent pattern**: Standalone, no experimental flag required, directly solves a documented pain point (compaction-driven drift). The v1 evaluation mentioned it but didn't prioritize it as the immediate integration target.
+
+**Integration recommendation was vague**: "Integrate in hooks section + new workflow guide" without defining what goes where. The tiered approach (Tier 1 now / Tier 2 in 3-4 weeks / Tier 3 in 60-90 days) is more actionable.
+
+---
+
+## Fact-Check
+
+| Claim | Verified | Notes |
+|-------|----------|-------|
+| SubagentStart hook exists | ✅ | Confirmed in guide + official docs |
+| SubagentStop hook exists | ✅ | Confirmed |
+| PreToolUse hook exists | ✅ | Confirmed |
+| TeammateIdle hook exists | ✅ | Confirmed in guide hooks table |
+| `hook_event_name` field in hook input | ✅ | Documented in guide §7.4 |
+| Guide §7.5 already has single-entry dispatcher | ✅ | Bash version at line 9655 |
+| Guide cites Nick Tune's earlier article 4 times | ✅ | Lines 4962, 8978, 13799, 15091 (different URL) |
+| CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 required | ✅ | Confirmed in agent-teams.md |
+| Author: Nick Tune, published 2026-02-28 | ✅ | Byline + URL path + © 2026 |
+| GitHub repo with full code exists | ⚠️ | Article states it; not independently verified |
+| "1 week of experimentation" | ✅ | Direct quote from article |
+
+No invented statistics or unverified benchmarks. The author makes no quantitative performance claims — appropriate hedging throughout.
+
+---
+
+## Decision
+
+- **Final Score**: 3/5 — Moderate
+- **Action**: Partial integration (tiered)
+- **Immediate**: Identity re-injection pattern → §7.5
+- **Short-term**: Per-state SubagentStart injection → agent-teams.md
+- **Deferred**: Full state machine workflow guide — re-evaluate 2026-05-16
+- **Confidence**: High (patterns are technically sound, gap analysis is verified, framing is calibrated)
--- a/docs/resource-evaluations/2026-03-16-nick-tune-workflow-dsl-ddd.md
+++ b/docs/resource-evaluations/2026-03-16-nick-tune-workflow-dsl-ddd.md
@@ -0,0 +1,111 @@
+# Resource Evaluation: Nick Tune — Workflow DSL: Domain-Driven Claude Code Workflows
+
+**URL:** https://nick-tune.me/blog/2026-03-01-workflow-dsl-domain-driven-claude-code-workflows/
+**Author:** Nick Tune
+**Date:** March 1, 2026
+**Evaluated:** 2026-03-16
+**Score:** 3/5 — Pertinent, selective integration recommended
+
+---
+
+## Content Summary
+
+- Introduces a TypeScript DSL for defining Claude Code workflow states declaratively: each state specifies an emoji identifier, agent instruction file path, allowed state transitions, permitted operations, and transition guard functions.
+- Three-module architecture: `workflow-engine` (executes rules, domain-agnostic), `workflow-dsl` (language for defining steps), `workflow-definition` (aggregate root with actual workflow logic and invariants).
+- Type safety via TypeScript `as const` union types — invalid state transitions fail at compile time, not at runtime.
+- Domain-Driven Design framing: workflow as explicit aggregate root, adapter pattern decouples it from Claude Code infrastructure.
+- Observability built-in: every operation appends to an internal event log (operation type, timestamp, contextual details). Enables replay and audit of workflow runs.
+- "State ownership principle": each state validates its own preconditions rather than relying on defensive checks in the engine.
+- Instruction re-injection on every state transition AND on command failure — distinct mechanism from context-compaction re-injection (compensates for runtime agent context loss, not compression).
+- Black-box testing via public `Workflow` class methods; no internal state testing.
+- References GitHub repo `NTCoding/autonomous-claude-agent-team` for full implementation.
+- Follow-up to: https://nick-tune.me/blog/2026-02-28-hook-driven-dev-workflows-with-claude-code/ (already evaluated separately).
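The declarative state definitions described above can be sketched roughly as follows. The article uses a TypeScript DSL whose union types reject invalid transitions at compile time; this Python version checks at runtime instead, so it illustrates the data shape rather than the type-safety claim. The three states, file paths, emoji, and operation names are hypothetical, and guard functions are omitted.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class StateDef:
    emoji: str                          # identifier shown in agent output
    instructions_path: str              # per-state instruction file
    transitions: FrozenSet[str]         # states reachable from this one
    allowed_operations: FrozenSet[str]  # operations permitted in this state


# Hypothetical three-state workflow for illustration.
WORKFLOW: Dict[str, StateDef] = {
    "PLANNING": StateDef("📝", "states/planning.md",
                         frozenset({"DEVELOPING"}), frozenset({"read"})),
    "DEVELOPING": StateDef("🔨", "states/developing.md",
                           frozenset({"REVIEWING"}), frozenset({"read", "edit"})),
    "REVIEWING": StateDef("🔍", "states/reviewing.md",
                          frozenset({"DEVELOPING"}), frozenset({"read"})),
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not declared."""
    if target not in WORKFLOW[current].transitions:
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target


def is_allowed(state: str, operation: str) -> bool:
    """Check whether an operation is permitted in the given state."""
    return operation in WORKFLOW[state].allowed_operations


print(transition("PLANNING", "DEVELOPING"))
print(is_allowed("DEVELOPING", "git_commit"))
```

Because `git_commit` is absent from the DEVELOPING state's allowed operations, the second call reports it as blocked, mirroring the per-state operation blocking noted in the companion hook article's evaluation.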
+
+---
+
+## Scoring
+
+| Score | Meaning |
+|-------|---------|
+| 5 | Essential - Major gap in the guide |
+| 4 | Highly relevant - Significant improvement |
+| **3** | **Relevant - Useful complement** |
+| 2 | Marginal - Secondary info |
+| 1 | Out of scope - Not relevant |
+
+**Score: 3/5**
+
+**Justification:** Conceptually interesting pattern (DSL + DDD for agent orchestration), but zero community adoption verified, three-module architecture adds real cognitive overhead for most users, and the guide's architecture section already correctly warns against complex state machines as a default. The value is in three extractable patterns — not in adopting the full architecture.
+
+---
+
+## Comparative Analysis
+
+| Aspect | This Resource | Our Guide |
+|--------|--------------|-----------|
+| TypeScript DSL for workflow states | ✅ Full implementation | ❌ Not covered |
+| Compile-time state transition validation | ✅ Via union types | ❌ Not covered |
+| Workflow as DDD aggregate root | ✅ Full pattern | ❌ Not covered |
+| Event log observability for agent runs | ✅ Built-in design | ❌ Not covered |
+| Instruction re-injection on failure | ✅ Explicit pattern | ❌ Not covered (only on compaction) |
+| Hook-driven agent orchestration | Complementary | ✅ Covered (hooks section) |
+| Complex state machine warnings | Not mentioned | ✅ Covered (architecture.md:1227) |
+| Event-driven agent patterns | Partial overlap | ✅ Covered (event-driven-agents.md) |
+
+---
+
+## Integration Recommendations
+
+**Selective extraction — no new top-level file warranted until community adoption exists.**
+
+Three surgical integrations, in priority order:
+
+### 1. Event log observability pattern (Priority: High)
+**Where:** Hooks documentation or `guide/workflows/event-driven-agents.md`
+**What:** The pattern of appending every agent operation to an internal event log (with type + timestamp + context) is a debugging and auditability technique the guide covers nowhere. Extract this as a standalone pattern, independent of the DDD framing.
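+
+A minimal Python sketch of the event log idea, for illustration only. The article's implementation is TypeScript; the `Event` and `EventLog` names, the field names, and the JSONL export are assumptions made here, not taken from the article.
+
+```python
+import json
+import time
+from dataclasses import asdict, dataclass, field
+from typing import Any
+
+
+@dataclass
+class Event:
+    op: str            # operation type, e.g. "transition" or "command_failed"
+    timestamp: float   # seconds since the epoch
+    details: dict[str, Any] = field(default_factory=dict)
+
+
+class EventLog:
+    """Append-only log: events are recorded, never edited or removed."""
+
+    def __init__(self) -> None:
+        self._events: list[Event] = []
+
+    def append(self, op: str, **details: Any) -> Event:
+        event = Event(op=op, timestamp=time.time(), details=details)
+        self._events.append(event)
+        return event
+
+    def replay(self) -> list[Event]:
+        # Return a copy so callers cannot mutate the recorded history.
+        return list(self._events)
+
+    def to_jsonl(self) -> str:
+        # One JSON object per line, convenient for grep and later audit.
+        return "\n".join(json.dumps(asdict(e)) for e in self._events)
+
+
+log = EventLog()
+log.append("transition", source="planning", target="implementing")
+log.append("command_failed", command="npm test", exit_code=1)
+print(log.to_jsonl())
+```
+
+Keeping the log append-only is what makes replay and audit trustworthy: the record reflects what happened, in order, with no later rewriting.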
+
+### 2. Instruction re-injection on failure (Priority: Medium)
+**Where:** `guide/workflows/event-driven-agents.md` or hooks best practices
+**What:** Re-injecting agent instructions on command failure (not just context compaction) addresses runtime context drift. Distinct from the existing compaction re-injection pattern — worth a paragraph with the distinction explicit.
+
+### 3. Compile-time state validation callout (Priority: Low)
+**Where:** `guide/core/architecture.md` near line 1227 (state machines section)
+**What:** The architecture guide warns against complex state machines. A callout acknowledging that TypeScript union types can make simple state transitions compile-safe is a useful nuance — without endorsing the full three-module pattern.
+
+---
+
+## Challenge (technical-writer agent)
+
+**Core finding from the challenge:** The evaluation was initially conflated with the previous Nick Tune article (2026-02-28, hook-driven workflows). These are two distinct posts, and this DSL article is architecturally different — the hook article is about wiring, this one is about state definition language.
+
+**Score assessment:** 3/5 holds. The argument for 4/5 is that compile-time state validation via `as const` union types is genuinely novel. The argument against: zero community validation, real cognitive complexity in the three-module design, and DDD framing is academic overhead for most Claude Code users.
+
+**Key correction from challenge:** The event log observability is underweighted in the initial read. It is the most actionable extraction, worth leading any integration pitch. The DDD framing is secondary.
+
+**Risk of not integrating:** Low overall. The only real cost is missing the event log observability pattern. The full architecture can be safely skipped until adoption grows.
+
+---
+
+## Fact-Check
+
+| Claim | Verified | Source |
+|-------|----------|--------|
+| Author: Nick Tune | ✅ | Article byline |
+| Published: March 1, 2026 | ✅ | Article metadata |
+| Three-module architecture (engine/dsl/definition) | ✅ | Article content |
+| TypeScript `as const` for union types | ✅ | Article + code examples |
+| GitHub repo: NTCoding/autonomous-claude-agent-team | ✅ | Article links |
+| Reference to Yves Reynhout (Bluesky, event sourcing) | ✅ | Article attribution |
+| No numerical benchmarks or percentages cited | ✅ | Article contains none |
+| Follow-up to 2026-02-28 hook-driven article | ✅ | Article references it explicitly |
+
+No hallucinated statistics. No unverifiable claims. Article is descriptive (architectural patterns) with no performance benchmarks.
+
+---
+
+## Final Decision
+
+- **Score:** 3/5
+- **Action:** Integrate selectively (3 surgical extractions — event log, failure re-injection, state validation callout)
+- **Confidence:** High
+- **Note:** Do not create a new top-level `workflow-dsl.md` file. Distribute the three patterns into existing sections where they add context without requiring readers to adopt the full DDD architecture.
--- a/docs/resource-evaluations/2026-03-16-paul-rayner-contextflow-refactoring-linkedin.md
+++ b/docs/resource-evaluations/2026-03-16-paul-rayner-contextflow-refactoring-linkedin.md
@ -0,0 +1,110 @@
+# Resource Evaluation: Paul Rayner — "Will AI Kill Refactoring?" (LinkedIn)
+
+**Date**: 2026-03-16
+**Evaluator**: Claude (automated via /eval-resource)
+**Source type**: LinkedIn post (text provided)
+**Author**: Paul Rayner, CEO & Principal Consultant @ Virtual Genius; author of *The EventStorming Handbook*; founder/chair of Explore DDD
+**Published**: ~March 2, 2026 (2 weeks before eval date)
+**Repository**: https://github.com/virtualgenius/contextflow
+**Score**: 3/5
+
+---
+
+## Summary
+
+Paul Rayner built ContextFlow (a DDD context mapping tool) entirely with Claude Code and analyzed 519 commits to answer whether AI makes refactoring obsolete. Key findings:
+
+- Full commit breakdown: 30% feat, 22% fix, 23% docs, 14% tidy (refactoring), 5.4% config, 2.3% test, 2.3% other
+- Code-only commits: 44% feat, 32% fix, 21% refactoring, 3% test — meaning 1 in 5 code commits is pure structural work
+- Main argument: AI doesn't eliminate refactoring, it lowers its cost enough to do it more often, in smaller batches, before problems compound
+- New mechanism: large incoherent files degrade context window quality — refactoring keeps AI productive
+- The design skill AI can't replace: knowing *when* structure no longer fits the problem and what better structure looks like
+- Includes a usable git prompt for analyzing any conventional commits repo by commit type distribution
+
+---
+
+## Comparison
+
+| Aspect | This resource | Guide coverage |
+|--------|--------------|----------------|
+| Refactoring patterns | Frequency rationale (new angle) | Section at ~line 16990 (incremental, boundary patterns) |
+| Context window degradation via code structure | ✅ Original insight | ❌ Not explicitly linked to refactoring |
+| Real-world Claude Code case study | ✅ Practitioner + data | 4 others (Mergify, Airbnb, Boris Cherny, Fountain) |
+| Commit analysis prompt | ✅ Reusable tool | ❌ Not present |
+| Conventional commits conventions | Referenced | ✅ Covered at lines 8600, 15564 |
+| DDD methodology | Context for the project | Mentioned as semantic anchor at lines 3875, 3908, 16849 |
+
+---
+
+## Score: 3/5
+
+**Justification**: Two distinct artifacts of real value — a context window insight worth adding to the refactoring section, and a git analysis prompt worth adding to git best practices. The case study narrative itself is weaker: n=1, self-reported, no external corroboration, LinkedIn-published. The guide already holds case studies to a higher evidence standard (Mergify has a sourced blog post; Airbnb data is corroborated by academic research). Presenting the commit percentages (44/32/21) without a baseline for non-AI projects also limits what conclusions can be drawn — you can't distinguish "AI accelerates refactoring discipline" from "Rayner is personally disciplined about refactoring."
+
+---
+
+## Integration Recommendations
+
+**Split the two artifacts. Treat them independently.**
+
+### 1. Context window degradation insight → Refactoring section (~line 17025)
+
+Add one paragraph as an additional rationale within the incremental/boundary patterns explanation. The link between code cohesion and context quality is a distinct mechanism not currently in the guide. Attribute as a practitioner observation, note it's a single project.
+
+```
+Example framing:
+"Refactoring also protects your context window. Large, incoherent files that accumulate
+without structural cleanup force Claude to process more irrelevant content per request.
+Keeping modules small and well-scoped is not just a quality practice — it's a practical
+token efficiency strategy."
+```
+
+### 2. Git commit analysis prompt → Git best practices (~line 15564)
+
+Add alongside existing commit conventions as a companion diagnostic tool. This is immediately actionable for any team using conventional commits and has standalone value regardless of the case study narrative.
+
+```
+Example placement: after the commit format section, as an "Analyze your commit distribution" sidebar.
+```
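+
+For teams that prefer a script to a prompt, the same distribution can be approximated in a few lines. This is a sketch under assumptions, not Rayner's prompt: the regex, the `commit_type_shares` name, and the sample subjects are all hypothetical, and anything that does not look like a conventional commit is counted as `other`.
+
+```python
+import re
+from collections import Counter
+
+# Conventional Commits subject: type, optional (scope), optional "!", then ": ".
+SUBJECT = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?: ")
+
+
+def commit_type_shares(subjects: list[str]) -> dict[str, float]:
+    """Return the percentage of commits per type, most frequent first."""
+    counts = Counter()
+    for subject in subjects:
+        match = SUBJECT.match(subject)
+        counts[match.group("type") if match else "other"] += 1
+    total = sum(counts.values())
+    if total == 0:
+        return {}
+    return {t: round(100 * n / total, 1) for t, n in counts.most_common()}
+
+
+# Hypothetical subjects, e.g. the output of: git log --pretty=format:%s
+sample = [
+    "feat(canvas): add drag handles",
+    "fix: handle empty context map",
+    "docs: update README",
+    "refactor!: split store module",
+    "Merge branch 'main'",
+]
+print(commit_type_shares(sample))
+```
+
+Feeding it real subject lines gives a per-type breakdown comparable to the one in the post, provided the repository follows conventional commits consistently.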
+
+### 3. Case study bullet → Skip
+
+The data quality doesn't support adding it alongside Mergify and Airbnb. If Rayner publishes a proper blog post with methodology, revisit.
+
+**Priority**: Low-Medium. The git prompt is the quickest win (15 minutes to add). The context window paragraph requires more care to integrate without duplicating existing content.
+
+---
+
+## Challenge (technical-writer agent)
+
+The agent pushed back on raising the score (3/5 confirmed, not 4/5) for two reasons:
+
+- **Data provenance**: n=1, self-reported on LinkedIn, no external validation. Bumping to 4 would imply evidence quality it hasn't earned.
+- **Integration plan was misaligned**: original plan proposed adding to case studies section. Agent correctly redirected both artifacts to their natural homes (refactoring section + git best practices), not a case study bullet.
+
+Additional issues flagged:
+- No baseline comparison (are 21% refactoring commits high or low vs. non-AI projects?) — weakens the thesis
+- Git prompt underweighted in original plan — it's the highest-value artifact, needs explicit placement
+- Risk of not integrating: **Low to medium** — context window link is worth capturing, git prompt adds direct reader value, but nothing is irreplaceable given existing guide depth
+
+---
+
+## Fact-Check
+
+| Claim | Status | Notes |
+|-------|--------|-------|
+| Paul Rayner is CEO @ Virtual Genius, EventStorming Handbook author | ✅ Verified | Consistent with LinkedIn bio in the post |
+| ContextFlow built entirely with Claude Code | ⚠️ Unverifiable | Author's stated claim, no commit metadata to confirm |
+| "519 commits" in the repo | ⚠️ Minor discrepancy | GitHub shows 552 commits at eval time (post written ~2 weeks earlier) — timing explains the gap |
+| Commit breakdown percentages (30/22/23/14/5.4/2.3/2.3) | ✅ Internally consistent | Screenshot shows Claude's analysis output; numbers sum to 99.0% (rounding). Verifiable by running the git prompt on the repo |
+| Code-only breakdown (44/32/21/3) | ✅ Internally consistent | Matches the full-breakdown numbers when non-code commits excluded |
+| ContextFlow is a DDD context mapping tool | ✅ Verified | GitHub confirms: TypeScript/React, 140 stars, MIT, maps bounded contexts/value streams/Wardley |
+
+**No hallucinations detected. Minor discrepancy on commit count explained by post timing.**
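+
+The two consistency checks in the table can be reproduced from the published figures alone. A short sketch, using only the rounded percentages quoted in the post (so the derived code-only shares may differ from the reported 44/32/21/3 by up to about a point):
+
+```python
+# Shares (in percent) as reported in the post's full commit breakdown.
+full = {"feat": 30, "fix": 22, "docs": 23, "tidy": 14,
+        "config": 5.4, "test": 2.3, "other": 2.3}
+
+total = sum(full.values())
+print(f"full breakdown sums to {total:.1f}%")
+
+# Keep only the commit types that touch code, then renormalize to 100%.
+code_types = ("feat", "fix", "tidy", "test")
+code_total = sum(full[t] for t in code_types)
+code_only = {t: round(100 * full[t] / code_total, 1) for t in code_types}
+print(code_only)
+```
+
+Running the git prompt against the repository itself remains the stronger check, since it works from raw commit counts rather than rounded percentages.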
+
+---
+
+## Decision
+
+- **Score**: 3/5
+- **Action**: Integrate partially — git prompt (high priority) + context window paragraph (medium priority). Skip case study bullet.
+- **Confidence**: High on scope/placement; medium on data (n=1 limitation acknowledged)
--- a/docs/resource-evaluations/2026-03-16-vickairo-claude-security-audit.md
+++ b/docs/resource-evaluations/2026-03-16-vickairo-claude-security-audit.md
@ -0,0 +1,107 @@
+# Resource Evaluation: claude-security-audit (VicKayro)
+
+**Date**: 2026-03-16
+**Evaluator**: Claude Sonnet 4.6
+**Resource URL**: https://github.com/VicKayro/claude-security-audit
+**Resource Type**: Open-source Claude Code command (GitHub)
+**Author**: VicKayro
+**Published**: 2026-02-26
+**Stars**: 60 | **Forks**: 6 | **License**: MIT (README-declared, no SPDX file)
+
+---
+
+## Executive Summary
+
+A single-file `/security-audit` slash command for Claude Code that runs a 16-section OWASP-mapped web app audit with scoring /10. The repo is 18 days old. The guide already has full OWASP coverage via `security-audit.md`, `security-check.md`, `security-auditor.md` agent, and the 41KB `security-hardening.md`. Two patterns in this resource are genuinely better than what we have: an environment context step (dev/staging/prod) before auditing, and an anti-false-positive factual check before reporting secrets (runs real git history before raising a finding). One area is a genuine gap: paywall/billing logic audit. Everything else overlaps.
+
+---
+
+## Content Summary
+
+- **16 audit sections** with OWASP Top 10 (2021) + CWE IDs: HTTP headers, auth, CSRF, open redirect, injection (SQL/XSS/command), IDOR/access control, secrets and crypto, paywall/billing, vulnerable deps (npm audit + pip-audit), CORS, files/config, WebSocket, SSRF, logging/monitoring, data integrity, software integrity
+- **Context-aware pre-step**: asks dev/staging/prod before starting — avoids false positives on debug flags, CORS `*`, and HTTP-only configs that are normal in local dev
+- **Factual verification requirement**: before reporting any secret, runs `git log --all -p -- '*.env'` and checks `.gitignore` — no finding without concrete proof
+- **Scoring /10** with a structured severity table (CRITIQUE → HAUTE → MOYENNE → BASSE, i.e. critical, high, medium, low), file:line, and recommended fix with code
+- **Disclaimer**: report includes an explicit reminder that it needs human review before any action
+- **258 lines**, French-language prompts, MIT
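+
+The factual verification step can be sketched as a small evidence-gathering helper. This is an illustration under assumptions, not the command's own logic: the function names are hypothetical, and it lists commit hashes with `--format=%h` rather than printing full patches with `-p`, since presence in history is all the check needs.
+
+```python
+import subprocess
+
+
+def git(repo: str, *args: str) -> subprocess.CompletedProcess:
+    """Run a git command inside `repo` and capture its output as text."""
+    return subprocess.run(
+        ["git", "-C", repo, *args], capture_output=True, text=True
+    )
+
+
+def collect_env_evidence(repo: str) -> dict:
+    """Gather facts about .env exposure (hypothetical helper)."""
+    # Commits on any ref that touched a path matching *.env, one short hash per line.
+    history = git(repo, "log", "--all", "--format=%h", "--", "*.env")
+    # check-ignore exits with status 0 when the path is covered by an ignore rule.
+    ignored = git(repo, "check-ignore", "-q", ".env")
+    return {
+        "commits_touching_env": history.stdout.split(),
+        "env_ignored": ignored.returncode == 0,
+    }
+
+
+def has_proof(evidence: dict) -> bool:
+    """A secrets finding is reportable only if a .env file actually reached history."""
+    return bool(evidence["commits_touching_env"])
+```
+
+Calling `has_proof(collect_env_evidence("."))` from a repository root returns `True` only when at least one commit touched a `.env` file, which mirrors the "no finding without concrete proof" rule.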
+
+---
+
+## Gap Analysis vs. Guide
+
+| Section | VicKayro's Command | Our Coverage |
+|---------|-------------------|--------------|
+| OWASP Top 10 structure | ✅ Full mapping + CWE | ✅ `security-auditor.md` agent, `security-hardening.md` |
+| HTTP headers | ✅ 7 headers with CWE | ⚠️ Mentioned in hardening guide, not in audit command |
+| Auth / JWT | ✅ 10 checks | ✅ `security-audit.md` Phase 2+3 |
+| Secrets / git history | ✅ Factual check pattern | ✅ `security-audit.md` Phase 2 (pattern similar but less strict) |
+| Dep scan (npm/pip) | ✅ | ✅ `security-audit.md` Phase 4 |
+| CSRF / open redirect | ✅ | ⚠️ Not explicit in our commands |
+| IDOR / access control | ✅ | ⚠️ Not explicit in our commands |
+| CORS | ✅ | ⚠️ Not explicit in our commands |
+| **Paywall / billing** | ✅ Full section | ❌ **Not covered anywhere in the guide** |
+| WebSocket | ✅ | ❌ Not covered |
+| SSRF | ✅ | ⚠️ Mentioned in hardening guide, not in audit command |
+| Dev/staging/prod context step | ✅ Explicit pre-step | ❌ Not in our command |
+| Anti-FP factual verification | ✅ Explicit requirement | ⚠️ Partial — our command checks `.gitignore` but doesn't mandate git log proof |
+| Claude Code-specific (hooks, prompt injection, MCP) | ❌ Not covered | ✅ Our unique angle |
+| Score /100 posture model | ❌ Score /10 only | ✅ `security-audit.md` Phase 6 |
+
+**Real gaps**: paywall/billing audit, the strict anti-false-positive pattern, the environment context pre-step.
+
+---
+
+## Score
+
+**Score: 2/5** (Marginal)
+
+The overlap with existing guide content is substantial. Most of the 16 sections are already covered through the combination of `security-audit.md`, `security-auditor.md`, and `security-hardening.md`. The command is 18 days old with 60 stars — insufficient track record for security tooling, where false confidence is worse than no tool. The LinkedIn post framing ("vibe coding security") is accurate marketing but doesn't change the technical substance.
+
+The two extractable patterns (environment context step + factual git-history check before secrets findings) and the paywall/billing gap are worth acting on, but they can be addressed by enhancing our existing command without citing this repo.
+
+---
+
+## Challenge (technical-writer agent)
+
+> "The score is wrong. It should be 2/5, not 3/5. The evaluator confused 'interesting' with 'valuable.'"
+>
+> "The repo is 18 days old with 60 stars. That is not a signal of validation, it is a signal of recency bias. Security tooling needs a longer track record."
+>
+> "The integration recommendation is backwards. 'Mention in security-hardening.md as community resource' means we are directing readers to a 60-star, 18-day-old single-file command. If the patterns are worth having, extract them. If not, don't mention the repo."
+>
+> "The paywall/billing section is the one genuinely novel angle. None of our existing commands audit billing logic or access control around paywalled features. The evaluation mentions it in the feature list and then says nothing about it."
+
+Challenge accepted. Score adjusted to 2/5. Integration plan revised accordingly.
+
+---
+
+## Fact-Check
+
+| Claim | Verified | Source |
+|-------|----------|--------|
+| MIT license | ⚠️ Partial | README says MIT, no SPDX file or LICENSE in repo |
+| 16 audit sections | ✅ | Read from command file directly (gh api) |
+| OWASP Top 10 (2021) mapping | ✅ | Command file contains OWASP reference table with A01-A10 |
+| npm audit + pip-audit | ✅ | Section 9 of the command file |
+| Score /10 | ✅ | Command file output format |
+| 60 stars, created 2026-02-26 | ✅ | GitHub API |
+| VicKayro author profile | ⚠️ | GitHub API returns 404 for user profile — minimal public presence |
+
+No invented stats. The LinkedIn post claims "16 sections" and an "OWASP Top 10 (2021)" mapping; both are accurate.
+
+---
+
+## Decision
+
+**Score: 2/5. Do not integrate as an external resource.**
+
+**Action: Silently fold two patterns and one new audit section into existing commands**
+
+1. Add an environment context pre-step to `examples/commands/security-audit.md` (ask dev/staging/prod before Phase 1)
+2. Strengthen the anti-false-positive requirement in Phase 2 (mandate `git log --all -p` before reporting secrets)
+3. Add a Paywall/Billing audit section to `examples/agents/security-auditor.md`
+4. Revisit the repo in 3 months if it reaches 200+ stars with active issues/PRs
+
+**No guide mention.** The resource does not meet the threshold for a community reference link given its age, minimal author profile, and the absence of a formal LICENSE file. The guide already has security coverage that is broader in several dimensions (Claude Code-specific threats, hook security, prompt injection). Extracting the useful patterns internally is cleaner than sending readers to an immature repo.
+
+**Confidence**: High — full command file reviewed, all claims verified against source.
--- a/docs/resource-evaluations/eval-claude-1m-context-window-jp-caparas.md
+++ b/docs/resource-evaluations/eval-claude-1m-context-window-jp-caparas.md
@ -0,0 +1,86 @@
+# Resource Evaluation: "What Claude's 1M Token Context Window Means for Your Work"
+
+**Source**: https://reading.sh/what-claudes-1m-token-context-window-means-for-your-work-3c9f900f04c6
+**Author**: JP Caparas (Medium)
+**Published**: ~March 15, 2026 (13 min read, 62 claps)
+**Type**: Plain-language explainer / pricing analysis
+**Evaluated**: 2026-03-15
+**Score**: 2/5 (Marginal — do not integrate)
+
+---
+
+## Summary
+
+Plain-language explainer covering:
+
+- What tokens are and what 1M tokens can hold (codebases, legal docs, academic papers)
+- The claim that 1M context went GA on March 13 for Opus 4.6
+- A pricing comparison against OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Meta Llama 4
+- Claude Code workflow impact (compaction reduction, full-codebase sessions)
+- Enterprise use cases (compliance review, codebase migration, incident response)
+- The "lost in the middle" effect as an ongoing limitation
+- MRCR v2 score (78.3% for Opus 4.6)
+- Quotes from Jon Bell (CPO, Codeium) and Anton Biryukov (Ramp)
+
+---
+
+## Fact-Check
+
+| Claim | Status | Notes |
+|-------|--------|-------|
+| 1M context GA March 13, 2026 | **Unconfirmed** | Perplexity shows "beta" still active; March 13 maps to a usage promotion, not a GA announcement |
+| Flat pricing $5/$25 MTok, no surcharge at any length | **FALSE** | 2x input / 1.5x output surcharge confirmed above 200K tokens (awesomeagents.ai, puter.com); this is the article's central thesis |
+| Opus 4.6 base price $5/$25 MTok | Confirmed | Consistent across multiple pricing sources |
+| Sonnet 4.6 $3/$15 MTok | Confirmed | Consistent across multiple pricing sources |
+| Cached reads $0.50/MTok (90% discount) | Plausible | Standard Anthropic caching discount |
+| Cache writes $6.25/MTok | Plausible | Standard Anthropic caching pricing |
+| OpenAI GPT-5.4 2x rate limit above 272K | **Unverified** | No primary source found |
+| Google Gemini 3.1 Pro surcharge above 200K | Partially confirmed | Google does surcharge; exact numbers differ from article |
+| MRCR v2: 78.3% Opus 4.6 | **Discrepancy** | Guide cites 76% from Anthropic blog; article may reference a different variant or a revised benchmark run |
+| Jon Bell (Codeium) — 15% compaction decrease | **Unverifiable** | No independent corroboration found |
+| Anton Biryukov (Ramp) quote | **Unverifiable** | No independent corroboration found |
+| 600 images/PDF per request (up from 100) | **Unverified** | Not confirmed in official API docs |
+| Haiku caps at 200K | Confirmed | Consistent with known specs |
+
+**Critical finding**: The article's differentiating claim is "Anthropic's bet is flat pricing — no surcharge regardless of length." This is factually wrong. Anthropic charges 2x input and 1.5x output above 200K tokens. The entire competitive pricing analysis built on this premise is unreliable.
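+
+To see how much the surcharge matters, here is a rough cost comparison using only the figures cited in this fact-check ($5/$25 per MTok base, 2x input and 1.5x output above 200K tokens). One assumption is made loudly: once the input exceeds the threshold, the multipliers are applied to every token in the request rather than only to the excess. The function name and the example request size are hypothetical.
+
+```python
+BASE_INPUT_PER_MTOK = 5.00     # USD, Opus 4.6 base input price cited above
+BASE_OUTPUT_PER_MTOK = 25.00   # USD, Opus 4.6 base output price cited above
+LONG_CONTEXT_THRESHOLD = 200_000
+INPUT_MULTIPLIER = 2.0         # surcharge figures from the fact-check table
+OUTPUT_MULTIPLIER = 1.5
+
+
+def request_cost(input_tokens: int, output_tokens: int, surcharge: bool = True) -> float:
+    """USD cost of one request.
+
+    Assumption: above the threshold the multipliers apply to the whole request.
+    """
+    long_context = surcharge and input_tokens > LONG_CONTEXT_THRESHOLD
+    in_rate = BASE_INPUT_PER_MTOK * (INPUT_MULTIPLIER if long_context else 1.0)
+    out_rate = BASE_OUTPUT_PER_MTOK * (OUTPUT_MULTIPLIER if long_context else 1.0)
+    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
+
+
+# Hypothetical request: 800K input tokens, 10K output tokens.
+flat = request_cost(800_000, 10_000, surcharge=False)
+tiered = request_cost(800_000, 10_000)
+print(f"flat: ${flat:.2f}  tiered: ${tiered:.2f}")
+```
+
+Under these assumptions a long-context request costs close to twice what the article's flat-pricing premise would predict, which is why the competitive analysis cannot be reused as-is.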
+
+---
+
+## Comparative Analysis
+
+| Aspect | This resource | Our guide |
+|--------|--------------|-----------|
+| Token basics (what is a token) | Covered, beginner-friendly | Not covered — assumes dev audience |
+| 1M context capabilities | Generic scenarios | Covered with MRCR benchmarks + cost tables |
+| Pricing vs competitors | Covered — but factually wrong on flat pricing | Partially covered (Gemini comparison, line 2053) |
+| Compaction events | Mentions 15% reduction (unverifiable quote) | Covered in depth (architecture.md lines 391-438) |
+| "Lost in the middle" effect | Mentioned with arxiv:2307.03172 reference | Not explicitly covered |
+| Enterprise use cases | Hypothetical, no measured data | Covered across multiple sections |
+| MRCR v2 benchmark | 78.3% (discrepancy with our 76%) | 76% from Anthropic blog |
+| Claude Code workflow impact | Good framing (search phase = 100K+ tokens) | Covered via compaction + context engineering |
+
+---
+
+## Challenge Notes
+
+The technical-writer review agreed that a 2/5 score is justified. An initial proposal of 3/5 was revised downward because the fact-check demolishes the article's core value proposition: wrong pricing data cannot complement a guide that aims for accuracy. The Povilas Korop anecdote (83% context utilization on a Laravel project) is a nice real-world datapoint, but it is anecdotal and insufficient to shift the score.
+
+Key insight from review: "Fix stale info in the guide" is the real action item, independent of this resource. If 1M context is actually GA, our guide still says "beta" at lines 2028-2070 — that needs a fix regardless of this article.
+
+---
+
+## Decision
+
+**Do not integrate.**
+
+### Specific exclusions
+
+- Pricing comparison table (central claim on flat pricing is wrong)
+- Enterprise use cases (hypothetical, no measured data)
+- Unverifiable quotes (Jon Bell, Anton Biryukov)
+
+### Independent action items triggered by this review
+
+1. **Verify 1M GA status** via official Anthropic docs / API changelog
+2. **If confirmed GA**: Update `guide/ultimate-guide.md` lines 2028-2070 (currently says "beta")
+3. **If confirmed GA**: Verify whether the 200K surcharge structure changed at all
+4. **Consider adding**: One line on "lost in the middle" effect in `guide/core/context-engineering.md` (well-documented limitation, arxiv:2307.03172 — independent of this article)
+
+### Worth independent verification
+
+The Ramp workflow pattern (search Datadog + Braintrust + DB + source = 100K tokens before writing a single fix) is a useful illustration of why large context windows matter for real engineering workflows — worth adding if an independent source confirms it.
+
+---
+
+**Confidence**: High. The fact-check identified a critical error in the central thesis; no ambiguity on the decision.