feat: smart-suggest ROI script + hook tuning + guide updates (Mar 16)
- Add examples/scripts/smart-suggest-roi.py: stdlib-only analyzer correlating the suggestion log with session JSONL files to measure command acceptance rate. 4 acceptance signals, tier breakdown, daily trend, --json/--since/--no-sessions CLI.
- Tune Aristote smart-suggest hook: tighten 5 over-firing triggers (/tech:commit, /tech:sonarqube, /tech:dupes, /check-conventions a11y, /tech:worktree)
- Guide: identity re-injection hook, context engineering maturity grid, code review workflow, 1M context window GA update, Spring Break promo, security audit patterns
- Resource evaluations: Nick Tune hooks (3/5), VicKayro security audit (2/5), Karl Mazier CLAUDE.md templates, Paul Rayner ContextFlow, Siddhant agent trace, Andrew Ng context hub, JP Caparas 1M context window

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
parent d9cff74d71
commit da8bc09f2d
19 changed files with 1963 additions and 6 deletions
27
CHANGELOG.md
@ -8,12 +8,39 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
|
|||
|
||||
### Added
|
||||
|
||||
- **Failure-triggered context drift pattern** (`guide/core/architecture.md` §Session Degradation Limits): New subsection documenting a distinct degradation mode from compaction drift — repeated tool failures accumulate error noise that dilutes the original intent without filling the context window. Pattern: re-inject core task instructions on every command failure via `PostToolUse` hook, not just after `/compact`. Source: Nick Tune (2026-03-01). Resource evaluation: `docs/resource-evaluations/2026-03-16-nick-tune-workflow-dsl-ddd.md` (score 3/5 — 1 of 3 patterns integrated).
|
||||
|
||||
- **Identity re-injection after compaction** (`guide/ultimate-guide.md` §7.5 + `examples/hooks/bash/identity-reinjection.sh`): New hook pattern from Nick Tune (Feb 2026). Solves agent identity drift after context compaction in long sessions — `UserPromptSubmit` hook reads transcript, detects missing identity marker in last assistant message, re-injects `.claude/agent-identity.txt` as `additionalContext`. Configurable via `CLAUDE_IDENTITY_FILE` and `CLAUDE_IDENTITY_MARKER` env vars. `reference.yaml` updated with `identity_reinjection_hook` + `identity_reinjection_example` keys.
|
||||
|
||||
- **Security audit hardening — 3 patterns** (`examples/commands/security-audit.md`, `examples/agents/security-auditor.md`): (1) Pre-step added to `/security-audit`: asks dev/staging/prod before running — avoids false positives on debug flags and CORS `*` in local dev. (2) Anti-false-positive rule in Phase 2 (secrets scan): mandates running `git log --all -p` and checking `.gitignore` before raising any secret finding — no more findings based on pattern matching alone. (3) Paywall/billing checklist added to `security-auditor.md` under A04 Insecure Design: server-side limit enforcement, subscription status from DB, webhook signature verification, billing bypass endpoints, race conditions on resource creation.
|
||||
|
||||
- **Resource evaluation: VicKayro — claude-security-audit** (`docs/resource-evaluations/2026-03-16-vickairo-claude-security-audit.md`): Score 2/5. Single-file `/security-audit` command, OWASP Top 10 (2021) + 16 sections, MIT, 60 stars (18 days old). Substantial overlap with existing `security-audit.md`, `security-auditor.md`, and `security-hardening.md`. Genuine gaps: paywall/billing audit section (not covered anywhere), environment context pre-step (dev/staging/prod before auditing), and stricter anti-false-positive pattern for secrets (mandate `git log --all -p` proof before raising finding). Decision: extract 3 patterns into existing commands silently, no guide mention, revisit at 200+ stars.
|
||||
|
||||
- **Resource evaluation: Nick Tune — Hook-Driven Dev Workflows** (`docs/resource-evaluations/2026-03-16-nick-tune-hook-driven-workflows.md`): Score 3/5. Covers hooks-as-workflow-engine pattern: typed state machine (Zod), per-state SubagentStart context injection, agent respawn for fresh context windows, identity re-injection after compaction, JSON workflow persistence. Key gap confirmed: guide lacks identity re-injection after compaction + per-state SubagentStart injection. Tiered integration: identity re-injection → §7.5 now; SubagentStart injection → agent-teams.md (3-4 weeks); full state machine guide deferred 60-90 days (1 week of author testing, needs community validation). Prerequisites: CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1, Opus 4.6, Node.js + TypeScript.
|
||||
|
||||
- **1M context window status update** (`guide/ultimate-guide.md` lines ~2021-2070): Updated from "beta" to GA for Max/Team/Enterprise Claude Code plans (v2.1.75, March 13 2026). Preserved distinction: direct API use still requires tier 4 / custom rate limits. Pricing table updated to reflect standard rates for plan users.
|
||||
|
||||
- **Code Review feature** (`guide/workflows/code-review.md` + cross-reference in `guide/ultimate-guide.md`): New workflow guide for Anthropic's Code Review research preview (Teams/Enterprise). Covers: multi-agent architecture and severity levels (🔴/🟡/🟣), full setup flow (admin URL `claude.ai/admin-settings/claude-code`, GitHub App permissions, 3 trigger modes — once/every push/manual), `@claude review` manual trigger, `REVIEW.md` schema with example, pricing model ($15-25 avg, billed via extra usage outside plan, spend cap at `claude.ai/admin-settings/usage`), analytics dashboard, and cross-links to manual CLI workflows + GitLab CI/CD. Verified against official docs at `code.claude.com/docs/en/code-review`.
|
||||
|
||||
- **Context engineering guide: 3 additions** (`guide/core/context-engineering.md`):
  - **"Most failures are context failures"** framing added to §1 Why It Matters; reframes troubleshooting from "the AI is bad" to "what's missing from context"
  - **Static vs. Dynamic context**: new subsection distinguishing CLAUDE.md (static) from runtime tool outputs and agent context (dynamic); includes reference to Anthropic's September 2025 engineering post on agent context engineering
  - **Maturity assessment §9**: Level 0-5 self-assessment grid grounded in Claude Code patterns (no CLAUDE.md → flat config → structured → modular → measured → full system); includes "what to do at each level" action table
|
||||
|
||||
- **Spring Break promotion note** (guide line ~2395): Documented Anthropic's March 13-27, 2026 promotion: 2x usage limits outside peak hours (5-11am PT) and all day on weekends; bonus usage doesn't count against weekly limits; applies to Free/Pro/Max/Team. Includes CET timezone conversion for European users (2x from 00:00-13:00 and 19:00-24:00 France time). Source: Anthropic support article.
|
||||
|
||||
- **Smart-Suggest ROI script** (`examples/scripts/smart-suggest-roi.py`): Python stdlib-only analyzer for the `smart-suggest` UserPromptSubmit hook. Correlates suggestion log (`~/.claude/logs/smart-suggest.jsonl`) with session JSONL files to estimate command acceptance rate. Detects 4 acceptance signals: slash command tags, Skill tool use, Agent tool use, and text mention in next 5 user messages. Reports: summary, tier breakdown (Enforcement/Discovery/Contextual/Custom), top suggested/followed commands, never-followed list, and daily trend chart. CLI: `--since Nd`, `--no-sessions` (fast mode), `--json`, `--log PATH`.
|
||||
- **ICM (Infinite Context Memory)**: New MCP memory server section after Kairn (~line 11365) — Rust single binary, zero deps, Homebrew install, dual architecture (episodic decay Memories + permanent knowledge graph Memoirs), 9 typed relation types, auto-extraction 3 layers, 14 editor clients. Score 3/5 — recommended as Rust-native alternative when Python dependency management is a friction point. Includes explicit license callout (Source-Available, free ≤20 people) and vendor-reported benchmark flags.
|
||||
- **Comparison matrix update**: Added ICM column to MCP memory stack matrix (Runtime + License rows added for all tools)
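
As a rough illustration of the last of the four acceptance signals listed for the Smart-Suggest ROI script (a text mention in the next 5 user messages), the sketch below computes an acceptance rate from in-memory data. The `Suggestion` record, its fields, and the function names are assumptions made for this example; they are not taken from `smart-suggest-roi.py`, which also reads the real suggestion log and session JSONL files.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """Hypothetical record: one suggestion emitted by the hook."""
    command: str        # suggested slash command, e.g. "/tech:commit"
    message_index: int  # index of the user message that triggered it


def was_followed(suggestion, user_messages, window=5):
    """True if the command is mentioned in the next `window` user messages."""
    start = suggestion.message_index + 1
    following = user_messages[start:start + window]
    return any(suggestion.command in msg for msg in following)


def acceptance_rate(suggestions, user_messages, window=5):
    """Share of suggestions followed within the window (0.0 if there are none)."""
    if not suggestions:
        return 0.0
    followed = sum(was_followed(s, user_messages, window) for s in suggestions)
    return followed / len(suggestions)


messages = [
    "fix the login bug",       # 0: triggers a /tech:commit suggestion
    "looks good",              # 1: triggers a /tech:dupes suggestion
    "/tech:commit please",     # 2: follows the first suggestion
    "now the tests", "ok", "next", "continue",  # 3-6
    "/tech:dupes",             # 7: outside the 5-message window of message 1
]
suggestions = [Suggestion("/tech:commit", 0), Suggestion("/tech:dupes", 1)]
rate = acceptance_rate(suggestions, messages)  # one of two suggestions followed
```

A fixed window keeps the estimate conservative: a command typed much later in the session is not credited to the suggestion.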
|
||||
|
||||
### Documentation
|
||||
|
||||
- **Resource evaluation** (rejected, no file): LinkedIn post "Five Levels of Context Engineering" by Matthew Alverson (via Addy Osmani) — score 1/5, rejected. Content is a pedagogical reformulation of concepts already covered with more rigor in `guide/core/context-engineering.md`. Alverson's 5-level taxonomy is not empirically grounded and not widely cited in the literature. Evaluation surfaced 3 real gaps now addressed (see Added section above). Better primary sources identified: Anthropic Engineering Blog (Sept 2025), MCP Maturity Model (Mitra, Nov 2025).
|
||||
|
||||
- **Resource evaluation** (no file — text digest): Anthropic weekly recap March 9-15, 2026 (5 Claude Code releases, Code Review launch, 1M GA, Spring Break promo, corporate news) — score 4/5. Two gaps actioned: (1) Code Review product feature added as `guide/workflows/code-review.md`; (2) 1M context status updated from beta to GA in `guide/ultimate-guide.md` lines 2021-2070. Source reliability note: digest incorrectly attributes Claude Code changelog to `anthropics/anthropic-sdk-python` (correct repo: `anthropics/claude-code`); Code Review pricing ($15-25/PR) verified against official docs.
|
||||
|
||||
- **Resource evaluation** (`docs/resource-evaluations/eval-claude-1m-context-window-jp-caparas.md`): JP Caparas article on the 1M token context window, score 2/5, do not integrate. The central claim (flat pricing, no surcharge above 200K tokens) is factually wrong, which invalidates the competitive pricing analysis. The evaluation includes a fact-check table, a comparative analysis against the guide, and independent action items (verify 1M GA status, potential update to guide lines 2028-2070 on beta/GA status).
|
||||
|
||||
- **Claude Code Releases**: Updated tracking to v2.1.76
  - MCP elicitation support: servers request structured input mid-task via interactive dialog
  - New hooks: `Elicitation`, `ElicitationResult`, `PostCompact`
|
||||
|
|
|
|||
|
|
@ -0,0 +1,93 @@
|
|||
# Evaluation: agent-trace (Siddhant-K-code/agent-trace)
|
||||
|
||||
**Date**: 2026-03-16
|
||||
**Source**: https://github.com/Siddhant-K-code/agent-trace
|
||||
**Type**: GitHub repository (Python tool)
|
||||
**Evaluator**: Claude (eval-resource skill)
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
`agent-trace` (pip package: `agent-strace`) is a Python tool — zero dependencies, stdlib only — that captures every tool call, user prompt, and assistant response in Claude Code via hooks, then lets you replay sessions in the terminal or export as OpenTelemetry spans. Created 2026-03-15. 7 stars at time of evaluation.
|
||||
|
||||
The "strace for AI agents" framing is apt: it solves the "my agent modified 47 files and I have no idea why" problem by giving you a time-stamped, replayable record of every decision point.
|
||||
|
||||
---
|
||||
|
||||
## Key Points
|
||||
|
||||
- **Claude Code hooks**: Setup via `agent-strace setup`. Registers PreToolUse, PostToolUse, PostToolUseFailure, UserPromptSubmit, Stop, SessionStart, SessionEnd in `.claude/settings.json`
- **Session replay**: `agent-strace replay` shows the full session with timestamps, durations, tool inputs, and errors; this is the missing layer between raw JSONL and understanding
- **MCP proxy**: Wraps any MCP server (stdio or HTTP/SSE). Works with Cursor, Windsurf, any MCP client
- **OpenTelemetry export**: OTLP output → Datadog, Honeycomb, New Relic, Splunk
- **Python decorator API**: `@trace_tool`, `@trace_llm_call`, `log_decision()` for custom agents
- **Secret redaction**: `--redact` flag strips OpenAI, GitHub, AWS, Anthropic, Slack, JWTs, Bearer tokens, connection strings
|
||||
|
||||
---
|
||||
|
||||
## Relevance Score: 2/5
|
||||
|
||||
**Relevant but too immature for immediate integration.**
|
||||
|
||||
The session replay angle is real and not covered by existing tools in the guide. But `claude-code-otel` already handles the OTel export use case, and the manual jq queries at `guide/ops/observability.md:519-550` cover most of the audit use case. The unique differentiator — interactive replay — needs production validation before being recommended to readers.
|
||||
|
||||
---
|
||||
|
||||
## Comparison vs Current Guide Coverage
|
||||
|
||||
| Aspect | agent-trace | Guide coverage |
|--------|-------------|----------------|
| Manual JSONL audit (jq) | ✅ Abstracted as CLI | ✅ observability.md:520 |
| Session replay (visual) | ✅ Unique differentiator | ❌ Not covered |
| OpenTelemetry export | ✅ OTLP | ✅ claude-code-otel already in table |
| Hook setup automation | ✅ `agent-strace setup` | ✅ Documented manually |
| MCP proxy (Cursor/Windsurf) | ✅ stdio + HTTP/SSE | ❌ Not covered |
| Python decorator API | ✅ Custom agents | ❌ Not covered |
| Maturity | ❌ 1 day old, 7 stars | ✅ Table tools have 100-10K stars |
|
||||
|
||||
---
|
||||
|
||||
## Challenge Notes (technical-writer review)
|
||||
|
||||
**Score should be 2/5, not 3/5.** Reasons:
|
||||
|
||||
1. `claude-code-otel` already exports to Datadog/Honeycomb. The OTel angle is not additive.
2. The jq queries at observability.md:519-550 cover most of the audit use case already. The "replay niche" is thinner than it appears.
3. ICM (1 star) was put on watch list. Agent-trace at 7 stars deserves the same treatment.
|
||||
|
||||
**Missing aspects not in initial analysis**:
|
||||
|
||||
- **MCP proxy = MITM risk**: Routing all MCP traffic through an unaudited HTTP/SSE proxy is a security surface. The guide has a full hardening section, so adding this to the monitoring table without flagging it would be inconsistent.
- **Secret redaction unverified**: Base64-encoded tokens, multi-line .env values, and AWS temporary credentials are edge cases that were not tested. Could create false confidence.
- **Python decorator API vs MLflow SDK**: MLflow has versioning + experiment tracking + LLM-as-judge. Agent-trace has lower friction. Real trade-off not mentioned.
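
To illustrate why the Base64 edge case matters, here is a minimal sketch of pattern-based redaction. The single regex and the `redact` helper are hypothetical and written for this note; they are not agent-strace's actual rule set, which was not inspected. The token is a dummy value in the `ghp_` personal-access-token shape.

```python
import base64
import re

# Hypothetical single rule for illustration; not agent-strace's actual rule set.
GITHUB_TOKEN = re.compile(r"ghp_[A-Za-z0-9]{36}")


def redact(text):
    """Replace anything matching the token pattern with a placeholder."""
    return GITHUB_TOKEN.sub("[REDACTED]", text)


fake_token = "ghp_" + "a" * 36  # dummy value, not a real credential
plain_line = f"Authorization: token {fake_token}"
encoded_line = "AUTH_B64=" + base64.b64encode(fake_token.encode()).decode()

redacted_plain = redact(plain_line)      # the token is replaced
redacted_encoded = redact(encoded_line)  # unchanged: the encoded form never matches
```

Because the standard Base64 alphabet contains no underscore, the `ghp_` prefix cannot survive encoding, so a prefix-based rule passes the encoded secret through untouched. Any redaction claim would need to be tested against encoded inputs like this one.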
|
||||
|
||||
**On placement**: If integrated, not in the External Monitoring Tools table (that's monitoring, not debugging). Better as a footnote in the JSONL section (~observability.md:565) as "a higher-level wrapper for session replay."
|
||||
|
||||
**Risk of NOT integrating**: Near zero. The jq queries + claude-code-otel cover the primary use cases. The real risk runs the other way: adding a 1-day-old tool that later goes unmaintained leaves a dead link in a table readers rely on for tooling decisions.
|
||||
|
||||
---
|
||||
|
||||
## Fact-Check
|
||||
|
||||
| Claim | Verified | Source |
|-------|----------|--------|
| Zero dependencies, Python stdlib only | ✅ | pyproject.toml + README |
| Created 2026-03-15 | ✅ | GitHub API: `created_at: 2026-03-15T08:09:45Z` |
| MIT licensed | ✅ | GitHub API: `license: MIT License` |
| Captures all CC hook events | ✅ | README hooks JSON: all 7 event types |
| Export to Datadog, Honeycomb, Splunk | ✅ | README: `export --to otlp` (OTLP compatible) |
| 7 stars at evaluation | ✅ | GitHub API 2026-03-16 |
|
||||
|
||||
No hallucinations detected. All stats confirmed against source.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
**Action: Watch list**
|
||||
**Integration trigger**: 100+ stars AND at least one practitioner write-up showing real production use.
|
||||
|
||||
**If triggered**: Add as footnote in observability.md ~line 565 (JSONL section), not in the External Monitoring Tools table. Frame as "higher-level wrapper for session replay/debug" distinct from the monitoring tools.
|
||||
|
||||
**Why watch list and not reject**: Session replay is a real gap. Zero-deps Python is a genuine adoption differentiator. The engineering quality looks solid (automated setup, secret redaction, HTTP/SSE proxy). Just needs time to prove reliability on real sessions.
|
||||
|
|
@ -0,0 +1,76 @@
|
|||
# Resource Evaluation: Context Hub (andrewyng/context-hub)
|
||||
|
||||
**Date**: 2026-03-16
|
||||
**Source**: LinkedIn post (text) + https://github.com/andrewyng/context-hub
|
||||
**Type**: Open-source CLI tool
|
||||
**Author**: Andrew Ng (andrewyng)
|
||||
**Score**: 2/5
|
||||
|
||||
---
|
||||
|
||||
## Summary of Content
|
||||
|
||||
- **What it is**: A CLI tool (`chub`) providing coding agents with curated, versioned API documentation as markdown files
- **Core commands**: `chub get openai/chat --lang py` to fetch API docs; `chub annotate <id> "note"` for persistent cross-session annotations
- **Corpus**: 602+ documentation entries (as of 2026-03-16), covering OpenAI, Anthropic, Stripe, AWS, and others
- **Community loop**: Users vote on doc quality (`chub feedback`), surfacing improvements to maintainers
- **Claude Code integration**: SKILL.md support for dropping into `~/.claude/skills/`
- **License**: MIT, 6,342 stars
|
||||
|
||||
---
|
||||
|
||||
## Score: 2/5
|
||||
|
||||
**Justification**: One genuinely novel feature (cross-session persistent annotations on external API docs) that Context7 cannot replicate. Everything else overlaps with existing guide coverage: Context7 already handles versioned library docs, `@url` natively pulls live documentation into Claude Code context, and anti-hallucination patterns are already documented. The annotation use case is real but solves a narrow problem. No production benchmarks, no independent validation.
|
||||
|
||||
---
|
||||
|
||||
## Comparative Analysis
|
||||
|
||||
| Aspect | Context Hub | Our Guide |
|--------|------------|-----------|
| Curated API docs for agents | New CLI approach | Not covered as dedicated tool |
| Cross-session doc annotations | Unique feature | Not covered |
| Official library docs lookup | Overlaps with Context7 | Covered (Section 8, Context7) |
| Live URL context | Overlaps with native `@url` | Covered (native Claude Code) |
| Agent hallucination prevention | Indirect angle | Covered but scattered |
| Maintenance/freshness guarantees | Community-maintained, lag risk | N/A |
|
||||
|
||||
---
|
||||
|
||||
## Challenge Notes (technical-writer agent)
|
||||
|
||||
**Key pushbacks:**
|
||||
|
||||
1. **Stars ≠ adoption**: 6,342 stars driven by Andrew Ng's social amplification, not production validation
2. **Context7 overlap not demonstrated**: `chub get openai/chat --lang py` vs Context7's `query-docs`; the evaluation doesn't prove the concrete gap
3. **Annotation is the only novel angle**: and it got buried, even though it's the one feature Context7 cannot replicate
4. **Hallucination framing is a stretch**: community-maintained docs introduce a trust problem Context7 avoids (official sources)
5. **Missing: `@url` native alternative**: Claude Code already pulls live docs natively, weakening the "gap" case
6. **Missing: maintenance risk**: update lag when APIs change vs. Context7's live resolution
7. **Risk of not integrating**: Low; existing guide coverage (Context7, `@url`, grepai) handles most use cases
|
||||
|
||||
---
|
||||
|
||||
## Fact-Check
|
||||
|
||||
| Claim (from LinkedIn post) | Verdict | Notes |
|---------------------------|---------|-------|
| "Andrew Ng just dropped" | Verified | Repo owner is `andrewyng`, not a fork |
| "68+ APIs" | False | Actual corpus: 602+ entries as of 2026-03-16 |
| "One of the fastest accelerating new repos" | Unverifiable | 6,342 stars in ~5 months; no public velocity data |
| "100% free & open source (MIT)" | Verified | MIT confirmed in license file |
|
||||
|
||||
**Corrections**: The "68+ APIs" figure is either from an early snapshot or fabricated. Real coverage is ~9x larger. The LinkedIn post is marketing-inflated.
|
||||
|
||||
---
|
||||
|
||||
## Recommendation
|
||||
|
||||
**Action**: Do not integrate — one-line mention only.
|
||||
|
||||
If mentioned at all, one sentence under the Context7 entry in Section 8 (MCP servers): "For teams requiring persistent annotations on external API docs across sessions, see [context-hub](https://github.com/andrewyng/context-hub)."
|
||||
|
||||
No section, no dedicated coverage, no hallucination-prevention framing. Revisit if production use cases emerge in the community.
|
||||
|
||||
**Confidence**: High (fact-check complete, challenge addressed)
|
||||
|
|
@ -0,0 +1,107 @@
|
|||
# Resource Evaluation: Karl MAZIER — LinkedIn post on CLAUDE.md structure + reizam/claude-md-templates
|
||||
|
||||
**Source**: LinkedIn post (Karl MAZIER) + https://github.com/reizam/claude-md-templates
|
||||
**Author**: Karl MAZIER (co-founder, Open Source & SaaS, YC)
|
||||
**Date**: ~2026-03-02 (LinkedIn post date estimated ~2 weeks before 2026-03-16)
|
||||
**Type**: LinkedIn post + GitHub repo (community toolkit)
|
||||
**Evaluated**: 2026-03-16
|
||||
|
||||
---
|
||||
|
||||
## 📄 Summary
|
||||
|
||||
- Two-level CLAUDE.md structure: `~/.claude/CLAUDE.md` (global, ~30 lines) + `./CLAUDE.md` per repo (~40 lines). Global = coding philosophy/conventions. Project = what the agent can't discover from code alone.
- References ETH Zurich paper arXiv 2602.11988: AI-generated context files yield -3% agent success rate and +20% inference cost. Human-written minimal files yield ~+4%.
- Core rule: write only what the agent cannot discover independently from the codebase.
- GitHub repo (https://github.com/reizam/claude-md-templates): 2 fork-ready templates + 3 slash commands installable via `npx skills add reizam/claude-md-templates` (generate global, generate project, audit existing CLAUDE.md).
- The `Philosophy` section highlighted as the most critical part of global CLAUDE.md.
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Score
|
||||
|
||||
| Score | Meaning |
|-------|---------|
| 5 | Essential: major gap in the guide |
| 4 | High value: significant improvement |
| 3 | Relevant: useful complement |
| **2** | **Marginal: secondary info** |
| 1 | Out of scope |
|
||||
|
||||
**Score: 2/5**
|
||||
|
||||
**Justification**: The ETH Zurich paper (arXiv 2602.11988) was already evaluated on 2026-02-19 and scored 4/5 with a full integration plan. This LinkedIn post is a community summary of that same paper without original analysis. The GitHub repo templates (8 stars, created 2026-02-27) are redundant with existing `examples/memory/CLAUDE.md.personal-template` (68 lines) and `CLAUDE.md.project-template` (72 lines) in the guide. One element is genuinely new: the `/claude-md-audit` slash command concept — a skill to analyze existing CLAUDE.md for bloat — which the guide doesn't have as an installable command.
|
||||
|
||||
---
|
||||
|
||||
## ⚖️ Comparison
|
||||
|
||||
| Aspect | This resource | Our guide |
|--------|--------------|-----------|
| ETH Zurich paper data (-3%/+20%) | ✅ Cited | ✅ Already evaluated (2026-02-19) |
| Two-level hierarchy (global vs project) | ✅ Described | ✅ Well covered (context-engineering.md) |
| Size targets (~30 / ~40 lines) | ✅ Specific numbers | ⚠️ Guide says <200 lines global (more conservative) |
| "Write only what agent can't discover" rule | ✅ Clear decision test | ⚠️ Implied but not stated as a test |
| Practical templates | ✅ Fork-ready | ✅ Already in examples/memory/ |
| CLAUDE.md audit command | ✅ `/claude-md-audit` skill | ❌ Not implemented |
| Philosophy section emphasis | ✅ Named as most critical | ⚠️ Covered implicitly |
|
||||
|
||||
---
|
||||
|
||||
## 📍 Recommendations
|
||||
|
||||
**Not worth a standalone integration** given the paper is already evaluated.
|
||||
|
||||
One targeted addition is worth considering:
|
||||
|
||||
**Where**: `guide/core/context-engineering.md`, existing section on CLAUDE.md size/quality.
|
||||
|
||||
**What**: Add the decision test as a one-liner: "Write only what the agent cannot discover from the code itself." This is a sharper formulation than the current guidance ("essentiels au projet", i.e. "essential to the project") and works as a practical heuristic.
|
||||
|
||||
**Priority**: Low. The paper integration (already planned from 2026-02-19 evaluation) is the priority. This is a wording improvement at best.
|
||||
|
||||
The `/claude-md-audit` command concept from the GitHub repo is worth tracking for the examples/commands/ directory if the repo gains adoption (currently 8 stars — too early).
|
||||
|
||||
---
|
||||
|
||||
## 🔥 Challenge
|
||||
|
||||
**Score: 2/5 confirmed** by technical-writer agent with one nuance:
|
||||
|
||||
The evaluation correctly separates the paper (already covered) from the packaging (LinkedIn post + 8-star repo). The agent challenged whether the score should stay at 2/5 or drop to 1/5 given prior coverage. It stays at 2/5 because:
|
||||
- The "write only what the agent can't discover" formulation is genuinely more actionable than current guide wording
|
||||
- The audit command concept is novel even if the repo is too young to reference
|
||||
|
||||
**Risks of not integrating**: Near zero. The paper is already queued for integration. This post adds no independent value beyond the paper.
|
||||
|
||||
**Points missed**: The post misreads the paper slightly — suggesting global CLAUDE.md should be "~30 lines." The paper's recommendation is more nuanced: write only the essential commands and project-specific tooling, not a target line count. The guide's adherence degradation data (lines 132-141 in context-engineering.md) is actually more actionable than the "~30 lines" heuristic.
|
||||
|
||||
---
|
||||
|
||||
## ✅ Fact-Check
|
||||
|
||||
| Claim | Verified | Source |
|-------|----------|--------|
| ETH Zurich paper exists (arXiv 2602.11988) | ✅ | Prior evaluation 2026-02-19 |
| AI-generated files: -3% perf | ✅ | arXiv abstract + Perplexity |
| Human-written files: slight gain | ✅ | +4% confirmed in paper |
| +20% inference cost claim | ✅ | "over 20%" in arXiv abstract |
| reizam/claude-md-templates on GitHub | ✅ | 8 stars, 2 forks, created 2026-02-27 |
| `npx skills add` mechanism | ✅ | GitHub repo confirms this |
| "~30 lines global / ~40 lines project" | ⚠️ | Author's interpretation, not paper recommendation |
|
||||
|
||||
**Corrections**: The "~30 lines / ~40 lines" targets are the author's own heuristic, not a finding from the ETH Zurich paper. The paper recommends minimal context focused on build/test commands and specific tooling, without a line count target.
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Final Decision
|
||||
|
||||
- **Score**: 2/5
|
||||
- **Action**: No integration — the underlying paper is already queued (2026-02-19 evaluation). Note the "write only what the agent can't discover" formulation for possible wording improvement in context-engineering.md.
|
||||
- **Confidence**: High
|
||||
|
||||
**Cross-reference**: `docs/resource-evaluations/agents-md-empirical-study-2602-11988.md` is the paper this post summarizes, already evaluated at 4/5 with a full integration plan.
|
||||
|
||||
---
|
||||
|
||||
*Evaluated: 2026-03-16 | Method: text analysis + grepai_search + technical-writer challenge + agent research*
|
||||
|
|
@ -0,0 +1,181 @@
|
|||
# Resource Evaluation: Hook-Driven Dev Workflows with Claude Code
|
||||
|
||||
**Date**: 2026-03-16
|
||||
**Evaluator**: Claude Sonnet 4.6
|
||||
**Resource URL**: https://nick-tune.me/blog/2026-02-28-hook-driven-dev-workflows-with-claude-code/
|
||||
**Resource Type**: Technical blog post
|
||||
**Author**: Nick Tune
|
||||
**Published**: 2026-02-28
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Nick Tune (already cited 4 times in the guide for his earlier Medium article) presents a new pattern that treats Claude Code hooks as a **workflow enforcement engine** with a typed state machine, JSON persistence, and per-state context injection. The guide covers individual hook types in isolation and has a bash-based single-entry dispatcher (§7.5). This article adds a TypeScript state machine layer on top of hooks that the guide does not cover. The most immediately useful standalone pattern — identity re-injection after compaction — can be integrated now without the full state machine.
|
||||
|
||||
**Recommendation**: **MODERATE (Score 3/5)** — Integrate the identity re-injection pattern immediately in §7.5. Stage the full state machine architecture as a Tier 3 workflow guide with explicit prerequisites. Re-evaluate the full integration at 4/5 in 60-90 days once community validation exists.
|
||||
|
||||
---
|
||||
|
||||
## Scoring Summary
|
||||
|
||||
| Criterion | Score | Weight | Weighted Score |
|-----------|-------|--------|----------------|
| **Accuracy & Reliability** | 4 | 20% | 0.80 |
| **Depth & Comprehensiveness** | 5 | 20% | 1.00 |
| **Practical Value** | 4 | 25% | 1.00 |
| **Originality & Uniqueness** | 3 | 15% | 0.45 |
| **Production Readiness** | 2 | 10% | 0.20 |
| **Community Validation** | 2 | 10% | 0.20 |
| **TOTAL SCORE** | | | **3.65 → 3/5** |
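
The weighted total can be checked with a few lines of arithmetic. The scores and weights below are copied from the scoring table; the final truncation step is an assumption, chosen because it reproduces the reported 3/5 (conventional rounding of 3.65 would give 4, and the evaluation does not state which rule it applies).

```python
# Scores (1-5) and weights (percent) copied from the scoring table.
CRITERIA = {
    "Accuracy & Reliability": (4, 20),
    "Depth & Comprehensiveness": (5, 20),
    "Practical Value": (4, 25),
    "Originality & Uniqueness": (3, 15),
    "Production Readiness": (2, 10),
    "Community Validation": (2, 10),
}

# Sanity check: the weights should cover exactly 100%.
total_weight = sum(weight for _, weight in CRITERIA.values())

# Integer percent weights keep the sum exact until the final division.
weighted_total = sum(score * weight for score, weight in CRITERIA.values()) / 100

# Assumption: the evaluation truncates rather than rounds.
final_score = int(weighted_total)
```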
|
||||
|
||||
---
|
||||
|
||||
## Content Summary
|
||||
|
||||
The article introduces a hook-driven workflow pattern built on five core ideas:
|
||||
|
||||
- **Hooks as state machine**: SubagentStart, SubagentStop, PreToolUse, and TeammateIdle hooks feed into a TypeScript workflow engine managing state transitions (SPAWN → PLANNING → RESPAWN → DEVELOPING → REVIEWING → COMMITTING → CR_REVIEW → PR_CREATION → FEEDBACK → COMPLETE)
- **Single-entrypoint dispatch**: One hook handler for all events via a `HOOK_HANDLERS` map dispatching by `hook_event_name`; similar in concept to the guide's §7.5 bash dispatcher, but TypeScript + stateful
- **State-specific context injection**: SubagentStart reads `/states/<state>.md` files and injects them into agent context, so agents only see instructions relevant to the current state, avoiding bloated system prompts
- **Respawn pattern**: After each iteration, developer and reviewer agents shut down and fresh instances spawn, giving each iteration a clean context window
- **Identity re-injection after compaction**: Hooks detect when an agent has forgotten its identity prefix (after compaction) and re-inject identity instructions; this is the most standalone, transferable pattern in the article
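
The single-entrypoint dispatch idea can be sketched in a few lines. This version is in Python rather than the article's TypeScript, and the handler names, the returned dictionary shape, and the `tool_input.command` field access are assumptions for illustration; it is not the author's code and does not implement Claude Code's actual hook output protocol.

```python
def on_pre_tool_use(event, state):
    """Block `git commit` while the workflow is in DEVELOPING (illustrative rule)."""
    command = event.get("tool_input", {}).get("command", "")
    if state == "DEVELOPING" and command.startswith("git commit"):
        return {"allow": False, "reason": "commits are not allowed in DEVELOPING"}
    return {"allow": True}


def on_subagent_start(event, state):
    """Point the new subagent at the instruction file for the current state."""
    return {"allow": True, "context_file": f"states/{state.lower()}.md"}


# One map from event name to handler, so a single script serves every hook.
HOOK_HANDLERS = {
    "PreToolUse": on_pre_tool_use,
    "SubagentStart": on_subagent_start,
}


def dispatch(event, state):
    """Route a hook event by its `hook_event_name`; unknown events are allowed."""
    handler = HOOK_HANDLERS.get(event["hook_event_name"])
    if handler is None:
        return {"allow": True}
    return handler(event, state)


blocked = dispatch(
    {"hook_event_name": "PreToolUse", "tool_input": {"command": "git commit -m wip"}},
    "DEVELOPING",
)
injected = dispatch({"hook_event_name": "SubagentStart"}, "DEVELOPING")
```

Keeping the workflow state as an explicit argument is what lets the same `PreToolUse` handler behave differently per state.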
|
||||
|
||||
---
|
||||
|
||||
## Gap Analysis vs. Claude Code Ultimate Guide
|
||||
|
||||
| Pattern | This Article | Guide Coverage |
|---------|-------------|----------------|
| Single-entrypoint hook dispatch | ✅ TypeScript, stateful | ⚠️ §7.5 covers bash dispatcher concept |
| State machine with typed transitions (Zod) | ✅ Full implementation | ❌ Not covered |
| SubagentStart for context injection | ✅ State-specific file injection | ⚠️ Table mention only ("Subagent initialization") |
| PreToolUse as per-state operation blocker | ✅ Blocks git commit during DEVELOPING | ⚠️ Covered as security pattern, not workflow state control |
| Agent respawn for context window management | ✅ Explicit per-iteration respawn | ❌ Not covered |
| Workflow state persistence (JSON + session ID) | ✅ Full example | ❌ Not covered |
| Identity re-injection after compaction | ✅ Hook detects missing prefix, re-injects | ❌ Not covered |
| Agent teams + hooks combined | ✅ Concrete end-to-end | ⚠️ Separate docs, not combined |
|
||||
|
||||
**Note on originality**: The guide already references Nick Tune's earlier Medium article ("Coding Agent Development Workflows") 4 times, at lines 4962, 8978, 13799, and 15091. This is a different, newer article. The single-entrypoint dispatch concept also exists in §7.5 as a bash pattern. The actual delta is: TypeScript state machine, per-state SubagentStart injection, respawn, JSON persistence, and identity re-injection.

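Two of the rows in the gap table (typed state transitions and a per-state operation blocker) can be illustrated with a minimal Python sketch. This is not the article's code, which is TypeScript with Zod. Only the `DEVELOPING` state and the blocked `git commit` come from the evaluation; the other state names, the transition map, and both function names are hypothetical.

```python
from enum import Enum


class State(Enum):
    # Only DEVELOPING is named in the evaluation; the others are hypothetical.
    PLANNING = "planning"
    DEVELOPING = "developing"
    REVIEWING = "reviewing"


# Hypothetical transition map: the states each state is allowed to move to.
TRANSITIONS = {
    State.PLANNING: {State.DEVELOPING},
    State.DEVELOPING: {State.REVIEWING},
    State.REVIEWING: {State.DEVELOPING, State.PLANNING},
}

# Per-state block list of command prefixes (git commit is blocked while developing).
BLOCKED_COMMANDS = {
    State.DEVELOPING: ("git commit",),
}


def can_transition(current: State, target: State) -> bool:
    """Return True when the move from current to target is in the map."""
    return target in TRANSITIONS.get(current, set())


def is_command_allowed(state: State, command: str) -> bool:
    """Return False when the command starts with a prefix blocked in this state."""
    blocked = BLOCKED_COMMANDS.get(state, ())
    return not command.strip().startswith(blocked)
```

A PreToolUse-style hook could call `is_command_allowed` with the current workflow state and the pending shell command, and refuse the operation when it returns `False`.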
---

## Detailed Analysis

### Accuracy & Reliability (4/5)

Technical claims check out against Claude Code's documented hook behavior:
- SubagentStart, SubagentStop, PreToolUse, TeammateIdle hooks are real and documented
- `hook_event_name` field in hook input matches guide §7.4 docs
- `exitCode: EXIT_ALLOW` / `EXIT_BLOCK` pattern is valid
- Zod for schema validation is standard TypeScript practice
- JSON file persistence keyed to session ID is a practical, correct approach

One significant caveat: the author explicitly states "1 week of experimentation, cannot 100% recommend yet." That honest disclaimer is not a minor qualifier: it means this is unvalidated at any meaningful scale.

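The JSON persistence approach listed above (workflow state stored in a file keyed to the session ID) might look like the following sketch. The article's actual schema is not reproduced in this evaluation, so the directory layout, the function names, and the example fields (`state`, `iteration`) are assumptions.

```python
import json
from pathlib import Path


def state_path(state_dir: Path, session_id: str) -> Path:
    """One JSON file per session, named after the session ID (assumed layout)."""
    return state_dir / f"{session_id}.json"


def save_state(state_dir: Path, session_id: str, state: dict) -> None:
    """Write the state to a temp file, then rename it over the target."""
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_path(state_dir, session_id)
    tmp = target.parent / (target.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(target)  # avoids leaving a half-written state file behind


def load_state(state_dir: Path, session_id: str, default: dict) -> dict:
    """Return the saved state, or a copy of the default for a new session."""
    target = state_path(state_dir, session_id)
    if not target.exists():
        return dict(default)
    return json.loads(target.read_text(encoding="utf-8"))
```

Writing to a temporary file first is a design choice in this sketch, not something the article is known to do: it keeps a hook that crashes mid-write from corrupting the state that later hook invocations read.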
### Depth & Comprehensiveness (5/5)

The article provides full TypeScript code for the dispatcher, a worked `/states/developing.md` example, a JSON persistence schema with real fields, a Zod transition map, and DDD framing. It also points to a GitHub repo with the complete code (stated by the article, not independently verified). This is the article's strongest dimension: the approach is genuinely implementable, not vaporware.

### Practical Value (4/5)

The core problem is real: getting consistent workflows in codebases you don't fully control. The identity re-injection pattern alone is worth the read. The full state machine is more complex but still implementable.

Barrier: the approach requires a Node.js + TypeScript runtime (`npx tsx`). The guide's hooks section is bash-first by design. Any integration needs to address this friction explicitly.

### Originality & Uniqueness (3/5)

The single-entrypoint dispatcher already exists in §7.5 as bash. The agent teams feature is already documented. What's genuinely novel: (1) attaching a typed state machine to hooks, (2) per-state SubagentStart context injection from files, (3) respawn for context window hygiene, (4) identity re-injection after compaction. The delta is strong on these 4 specific patterns, not on the overall approach.

### Production Readiness (2/5)

The approach has had 1 week of testing, and the author says he cannot yet fully recommend it. He acknowledges wiring complexity ("ugly and fragile at times"). The hook JSON config requires repeating the same entry for each event type, which the author acknowledges as a UX problem. This needs months of community validation before being presented as a recommended pattern.

### Community Validation (2/5)

Nick Tune is a credible practitioner (established author, DDD community). There are no adoption metrics for this specific article. The article links a GitHub repo but gives no engagement data for it.

---

## Prerequisites (Not Mentioned in Evaluation v1)

Any integration must flag these hard dependencies:

1. **Agent teams experimental flag**: `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1`. SubagentStart, SubagentStop, TeammateIdle are agent teams events. Without this flag, most of the patterns in the article are unavailable.
2. **Opus 4.6 required**: Agent teams require Opus 4.6+, which costs significantly more than Sonnet.
3. **Node.js + TypeScript toolchain**: `npx tsx` must be available in the project. Not a default Claude Code assumption.

---

## Recommended Integration

### Tier 1: Integrate Now (Standalone Pattern)

**Identity re-injection after compaction** → `guide/ultimate-guide.md` §7.5 (Hook Examples)

This pattern is immediately useful to anyone using hooks with long sessions: it needs no agent teams flag and no TypeScript. Compaction-driven identity drift is a known pain point. The hook-based detection + re-injection workaround belongs in §7.5 regardless of the rest of the article.

```
### Handling Identity Drift After Compaction
When Claude's context compacts, agents in long sessions can "forget" their
role. A hook can detect this and re-inject identity instructions.
```

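The detection and re-injection step could be sketched as follows. The commit description places the guide's version in a bash `UserPromptSubmit` hook that checks the last assistant message for an identity marker and returns the identity file as `additionalContext`; this Python sketch only mirrors that logic. The function names and the exact JSON envelope (`hookSpecificOutput`) are assumptions that should be checked against the hooks reference before use.

```python
import json


def needs_reinjection(last_assistant_message: str, marker: str) -> bool:
    """True when the identity marker is missing from the last assistant message."""
    return marker not in last_assistant_message


def build_hook_output(identity_text: str) -> str:
    """Serialize the identity text as additional context.

    The envelope below is an assumed output shape for a UserPromptSubmit hook.
    """
    return json.dumps(
        {
            "hookSpecificOutput": {
                "hookEventName": "UserPromptSubmit",
                "additionalContext": identity_text,
            }
        }
    )


def run_hook(last_assistant_message: str, marker: str, identity_text: str) -> str:
    """Return hook output only when the agent appears to have lost its identity."""
    if needs_reinjection(last_assistant_message, marker):
        return build_hook_output(identity_text)
    return ""  # marker still present: add nothing to the context
```

In a real hook, `last_assistant_message` would come from the session transcript and `identity_text` from a file such as `.claude/agent-identity.txt`; both are passed in as plain strings here so the logic stays easy to test.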
### Tier 2: Integrate with Prerequisites Gate (3-4 weeks)

**Per-state SubagentStart context injection** → Add to agent-teams.md as an advanced pattern section. Prerequisite: agent teams flag. Key insight: inject state-specific files at runtime rather than bundling everything in the system prompt, which reduces system prompt bloat and keeps agents focused.

### Tier 3: New Workflow Guide (60-90 days, pending community validation)

**`guide/workflows/hook-driven-workflows.md`**: Full state machine architecture, once the pattern has more community validation. Clear prerequisites header (agent teams flag, Opus 4.6, Node.js + TypeScript). Frame as experimental / advanced.

### What NOT to Document

The specific CodeRabbit + GitHub issue + 10-state workflow is too opinionated. Document the architectural patterns; readers define their own states.

---

## Challenge Findings (Technical Review)

The challenge agent identified several issues with the initial v1 evaluation:

**Score correction**: The initial 4/5 score ("integrate within 1 week") was too aggressive for something the author hasn't validated beyond 1 week. Correct score is 3/5 now, with a scheduled re-evaluation in 60-90 days.

**Originality was overstated**: The single-entrypoint dispatcher already exists in §7.5 as bash. The guide already references this author's earlier work 4 times. The real delta is narrower: state machine, per-state injection, respawn, identity re-injection.

**Missing prerequisites**: Agent teams flag, Opus 4.6, Node.js + TypeScript: none of these were in v1. Any integration without flagging these prerequisites would frustrate readers.

**Identity re-injection is the most urgent pattern**: Standalone, no experimental flag required, directly solves a documented pain point (compaction-driven drift). The v1 evaluation mentioned it but didn't prioritize it as the immediate integration target.

**Integration recommendation was vague**: "Integrate in hooks section + new workflow guide" without defining what goes where. The tiered approach (Tier 1 now / Tier 2 in 3-4 weeks / Tier 3 in 60-90 days) is more actionable.

---

## Fact-Check

| Claim | Verified | Notes |
|-------|----------|-------|
| SubagentStart hook exists | ✅ | Confirmed in guide + official docs |
| SubagentStop hook exists | ✅ | Confirmed |
| PreToolUse hook exists | ✅ | Confirmed |
| TeammateIdle hook exists | ✅ | Confirmed in guide hooks table |
| `hook_event_name` field in hook input | ✅ | Documented in guide §7.4 |
| Guide §7.5 already has single-entry dispatcher | ✅ | Bash version at line 9655 |
| Guide cites Nick Tune's earlier article 4 times | ✅ | Lines 4962, 8978, 13799, 15091 (different URL) |
| CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 required | ✅ | Confirmed in agent-teams.md |
| Author: Nick Tune, published 2026-02-28 | ✅ | Byline + URL path + © 2026 |
| GitHub repo with full code exists | ⚠️ | Article states it; not independently verified |
| "1 week of experimentation" | ✅ | Direct quote from article |

No invented statistics or unverified benchmarks. The author makes no quantitative performance claims and hedges appropriately throughout.

---

## Decision

- **Final Score**: 3/5 (Moderate)
- **Action**: Partial integration (tiered)
- **Immediate**: Identity re-injection pattern → §7.5
- **Short-term**: Per-state SubagentStart injection → agent-teams.md
- **Deferred**: Full state machine workflow guide; re-evaluate 2026-05-16
- **Confidence**: High (patterns are technically sound, gap analysis is verified, framing is calibrated)

@ -0,0 +1,111 @@
# Resource Evaluation: Nick Tune, Workflow DSL: Domain-Driven Claude Code Workflows

**URL:** https://nick-tune.me/blog/2026-03-01-workflow-dsl-domain-driven-claude-code-workflows/
**Author:** Nick Tune
**Date:** March 1, 2026
**Evaluated:** 2026-03-16
**Score:** 3/5 (Relevant, selective integration recommended)

---

## Content Summary

- Introduces a TypeScript DSL for defining Claude Code workflow states declaratively: each state specifies an emoji identifier, agent instruction file path, allowed state transitions, permitted operations, and transition guard functions.
- Three-module architecture: `workflow-engine` (executes rules, domain-agnostic), `workflow-dsl` (language for defining steps), `workflow-definition` (aggregate root with actual workflow logic and invariants).
- Type safety via TypeScript `as const` union types: invalid state transitions fail at compile time, not at runtime.
- Domain-Driven Design framing: workflow as explicit aggregate root, adapter pattern decouples it from Claude Code infrastructure.
- Observability built-in: every operation appends to an internal event log (operation type, timestamp, contextual details). Enables replay and audit of workflow runs.
- "State ownership principle": each state validates its own preconditions rather than relying on defensive checks in the engine.
- Instruction re-injection on every state transition AND on command failure. This is a distinct mechanism from context-compaction re-injection (it compensates for runtime agent context loss, not compression).
- Black-box testing via public `Workflow` class methods; no internal state testing.
- References GitHub repo `NTCoding/autonomous-claude-agent-team` for full implementation.
- Follow-up to: https://nick-tune.me/blog/2026-02-28-hook-driven-dev-workflows-with-claude-code/ (already evaluated separately).

---

## Scoring

| Score | Meaning |
|-------|---------|
| 5 | Essential - Major gap in the guide |
| 4 | Highly relevant - Significant improvement |
| **3** | **Relevant - Useful complement** |
| 2 | Marginal - Secondary information |
| 1 | Out of scope - Not relevant |

**Score: 3/5**

**Justification:** Conceptually interesting pattern (DSL + DDD for agent orchestration), but there is no verified community adoption, the three-module architecture adds real cognitive overhead for most users, and the guide's architecture section already correctly warns against complex state machines as a default. The value is in three extractable patterns, not in adopting the full architecture.

---

## Comparative Analysis

| Aspect | This Resource | Our Guide |
|--------|--------------|-----------|
| TypeScript DSL for workflow states | ✅ Full implementation | ❌ Not covered |
| Compile-time state transition validation | ✅ Via union types | ❌ Not covered |
| Workflow as DDD aggregate root | ✅ Full pattern | ❌ Not covered |
| Event log observability for agent runs | ✅ Built-in design | ❌ Not covered |
| Instruction re-injection on failure | ✅ Explicit pattern | ❌ Not covered (only on compaction) |
| Hook-driven agent orchestration | Complementary | ✅ Covered (hooks section) |
| Complex state machine warnings | Not mentioned | ✅ Covered (architecture.md:1227) |
| Event-driven agent patterns | Partial overlap | ✅ Covered (event-driven-agents.md) |

---

## Integration Recommendations

**Selective extraction: no new top-level file warranted until community adoption exists.**

Three surgical integrations, in priority order:

### 1. Event log observability pattern (Priority: High)
**Where:** Hooks documentation or `guide/workflows/event-driven-agents.md`
**What:** The pattern of appending every agent operation to an internal event log (with type + timestamp + context) is a debugging and auditability technique the guide covers nowhere. Extract this as a standalone pattern, independent of the DDD framing.

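A minimal sketch of the event log pattern is shown below, under the assumption that each entry carries the three fields named above (type, timestamp, context). The `EventLog` class and its method names are hypothetical and do not come from the article's TypeScript implementation.

```python
import json
from datetime import datetime, timezone


class EventLog:
    """Append-only log of agent operations (type, timestamp, details)."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def record(self, op_type: str, **details) -> dict:
        """Append one event; existing entries are never edited or removed."""
        event = {
            "type": op_type,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "details": details,
        }
        self._events.append(event)
        return event

    def events(self) -> list[dict]:
        """Return a copy so callers cannot mutate the log's own list."""
        return list(self._events)

    def to_jsonl(self) -> str:
        """Serialize as one JSON object per line, convenient for audit or replay."""
        return "\n".join(json.dumps(event) for event in self._events)
```

Usage would be a call such as `log.record("transition", from_state="developing", to_state="reviewing")` at each state change or tool call, with `to_jsonl()` written to disk at the end of the run.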
### 2. Instruction re-injection on failure (Priority: Medium)
**Where:** `guide/workflows/event-driven-agents.md` or hooks best practices
**What:** Re-injecting agent instructions on command failure (not just context compaction) addresses runtime context drift. This is distinct from the existing compaction re-injection pattern and is worth a paragraph that makes the distinction explicit.

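The trigger condition is the only thing that differs from compaction re-injection, which a short sketch makes concrete. The function name and the use of a process exit code as the failure signal are assumptions for illustration; the real hook payload fields are not described in this evaluation.

```python
from typing import Optional


def reinjection_context(exit_code: int, instructions: str) -> Optional[str]:
    """Return the task instructions to re-inject after a failed command.

    Hypothetical failure signal: a non-zero exit code from the command.
    """
    if exit_code == 0:
        return None  # success: leave the context untouched
    return (
        "The previous command failed. Before retrying, re-read the "
        "original task instructions:\n" + instructions
    )
```

Compaction re-injection fires when an identity marker disappears from the transcript; this variant fires on every failed command, whether or not the context window has been compacted.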
### 3. Compile-time state validation callout (Priority: Low)
**Where:** `guide/core/architecture.md` near line 1227 (state machines section)
**What:** The architecture guide warns against complex state machines. A callout acknowledging that TypeScript union types can make simple state transitions compile-safe is a useful nuance, without endorsing the full three-module pattern.

---

## Challenge (technical-writer agent)

**Core finding from the challenge:** The evaluation was initially conflated with the previous Nick Tune article (2026-02-28, hook-driven workflows). These are two distinct posts, and this DSL article is architecturally different: the hook article is about wiring, this one is about state definition language.

**Score assessment:** 3/5 holds. The argument for 4/5 is that compile-time state validation via `as const` union types is genuinely novel. The argument against: zero community validation, real cognitive complexity in the three-module design, and DDD framing is academic overhead for most Claude Code users.

**Key correction from challenge:** The event log observability is underweighted in the initial read. It is the most actionable extraction, worth leading any integration pitch. The DDD framing is secondary.

**Risk of not integrating:** Low overall. The only real cost is missing the event log observability pattern. The full architecture can be safely skipped until adoption grows.

---

## Fact-Check

| Claim | Verified | Source |
|-------|----------|--------|
| Author: Nick Tune | ✅ | Article byline |
| Published: March 1, 2026 | ✅ | Article metadata |
| Three-module architecture (engine/dsl/definition) | ✅ | Article content |
| TypeScript `as const` for union types | ✅ | Article + code examples |
| GitHub repo: NTCoding/autonomous-claude-agent-team | ✅ | Article links |
| Reference to Yves Reynhout (Bluesky, event sourcing) | ✅ | Article attribution |
| No numerical benchmarks or percentages cited | ✅ | Article contains none |
| Follow-up to 2026-02-28 hook-driven article | ✅ | Article references it explicitly |

No hallucinated statistics. No unverifiable claims. Article is descriptive (architectural patterns) with no performance benchmarks.

---

## Final Decision

- **Score:** 3/5
- **Action:** Integrate selectively (3 surgical extractions: event log, failure re-injection, state validation callout)
- **Confidence:** High
- **Note:** Do not create a new top-level `workflow-dsl.md` file. Distribute the three patterns into existing sections where they add context without requiring readers to adopt the full DDD architecture.

@ -0,0 +1,110 @@
# Resource Evaluation: Paul Rayner, "Will AI Kill Refactoring?" (LinkedIn)

**Date**: 2026-03-16
**Evaluator**: Claude (automated via /eval-resource)
**Source type**: LinkedIn post (text provided)
**Author**: Paul Rayner, CEO & Principal Consultant @ Virtual Genius; author of *The EventStorming Handbook*; founder/chair of Explore DDD
**Published**: ~March 2, 2026 (2 weeks before eval date)
**Repository**: https://github.com/virtualgenius/contextflow
**Score**: 3/5

---

## Summary

Paul Rayner built ContextFlow (a DDD context mapping tool) entirely with Claude Code and analyzed 519 commits to answer whether AI makes refactoring obsolete. Key findings:

- Full commit breakdown: 30% feat, 22% fix, 23% docs, 14% tidy (refactoring), 5.4% config, 2.3% test, 2.3% other
- Code-only commits: 44% feat, 32% fix, 21% refactoring, 3% test, meaning 1 in 5 code commits is pure structural work
- Main argument: AI doesn't eliminate refactoring, it lowers its cost enough to do it more often, in smaller batches, before problems compound
- New mechanism: large incoherent files degrade context window quality, so refactoring keeps AI productive
- The design skill AI can't replace: knowing *when* structure no longer fits the problem and what better structure looks like
- Includes a usable git prompt for analyzing any conventional commits repo by commit type distribution

---

## Comparative Analysis

| Aspect | This resource | Guide coverage |
|--------|--------------|----------------|
| Refactoring patterns | Frequency rationale (new angle) | Section at ~line 16990 (incremental, boundary patterns) |
| Context window degradation via code structure | ✅ Original insight | ❌ Not explicitly linked to refactoring |
| Real-world Claude Code case study | ✅ Practitioner + data | 4 others (Mergify, Airbnb, Boris Cherny, Fountain) |
| Commit analysis prompt | ✅ Reusable tool | ❌ Not present |
| Conventional commits conventions | Referenced | ✅ Covered at lines 8600, 15564 |
| DDD methodology | Context for the project | Mentioned as semantic anchor at lines 3875, 3908, 16849 |

---

## Score: 3/5

**Justification**: Two distinct artifacts of real value: a context window insight worth adding to the refactoring section, and a git analysis prompt worth adding to git best practices. The case study narrative itself is weaker: n=1, self-reported, no external corroboration, LinkedIn-published. The guide already holds case studies to a higher evidence standard (Mergify has a sourced blog post; Airbnb data is corroborated by academic research). Presenting the commit percentages (44/32/21) without a baseline for non-AI projects also limits what conclusions can be drawn: you can't distinguish "AI accelerates refactoring discipline" from "Rayner is personally disciplined about refactoring."

---

## Integration Recommendations

**Split the two artifacts. Treat them independently.**

### 1. Context window degradation insight → Refactoring section (~line 17025)

Add one paragraph as an additional rationale within the incremental/boundary patterns explanation. The link between code cohesion and context quality is a distinct mechanism not currently in the guide. Attribute it as a practitioner observation and note that it comes from a single project.

```
Example framing:
"Refactoring also protects your context window. Large, incoherent files that accumulate
without structural cleanup force Claude to process more irrelevant content per request.
Keeping modules small and well-scoped is not just a quality practice; it is also a practical
token efficiency strategy."
```

### 2. Git commit analysis prompt → Git best practices (~line 15564)

Add alongside existing commit conventions as a companion diagnostic tool. This is immediately actionable for any team using conventional commits and has standalone value regardless of the case study narrative.

```
Example placement: after the commit format section, as an "Analyze your commit distribution" sidebar.
```

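For readers who prefer a script to a prompt, the same commit-type distribution can be computed directly from commit subjects. This is an illustrative sketch, not Rayner's prompt: the regular expression is a simplified take on the Conventional Commits subject format, and the function name is hypothetical.

```python
import re
from collections import Counter

# Simplified Conventional Commits subject: type, optional (scope), optional "!", then ": ".
CONVENTIONAL = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?:\s")


def commit_type_distribution(subjects: list[str]) -> dict[str, float]:
    """Return the percentage share of each commit type, rounded to one decimal.

    Subjects that do not match the pattern are counted as "other".
    """
    counts: Counter = Counter()
    for subject in subjects:
        match = CONVENTIONAL.match(subject)
        counts[match.group("type") if match else "other"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}
```

The input list can be produced with `git log --pretty=format:%s`, one subject per line.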
### 3. Case study bullet → Skip

The data quality doesn't support adding it alongside Mergify and Airbnb. If Rayner publishes a proper blog post with methodology, revisit.

**Priority**: Low-Medium. The git prompt is the quickest win (15 minutes to add). The context window paragraph requires more care to integrate without duplicating existing content.

---

## Challenge (technical-writer agent)

The agent pushed back on the score (3/5 confirmed, not 4/5) for two reasons:

- **Data provenance**: n=1, self-reported on LinkedIn, no external validation. Bumping to 4 would imply evidence quality it hasn't earned.
- **Integration plan was misaligned**: original plan proposed adding to case studies section. Agent correctly redirected both artifacts to their natural homes (refactoring section + git best practices), not a case study bullet.

Additional issues flagged:
- No baseline comparison (are 21% refactoring commits high or low vs. non-AI projects?), which weakens the thesis
- Git prompt underweighted in original plan: it's the highest-value artifact and needs explicit placement
- Risk of not integrating: **Low to medium**. The context window link is worth capturing and the git prompt adds direct reader value, but nothing is irreplaceable given existing guide depth

---

## Fact-Check

| Claim | Status | Notes |
|-------|--------|-------|
| Paul Rayner is CEO @ Virtual Genius, EventStorming Handbook author | ✅ Verified | Consistent with LinkedIn bio in the post |
| ContextFlow built entirely with Claude Code | ⚠️ Unverifiable | Author's stated claim, no commit metadata to confirm |
| "519 commits" in the repo | ⚠️ Minor discrepancy | GitHub shows 552 commits at eval time (post written ~2 weeks earlier); timing explains the gap |
| Commit breakdown percentages (30/22/23/14/5.4/2.3/2.3) | ✅ Internally consistent | Screenshot shows Claude's analysis output; the listed numbers sum to 99.0% (the shortfall is rounding). Verifiable by running the git prompt on the repo |
| Code-only breakdown (44/32/21/3) | ✅ Internally consistent | Matches the full-breakdown numbers when non-code commits are excluded |
| ContextFlow is a DDD context mapping tool | ✅ Verified | GitHub confirms: TypeScript/React, 140 stars, MIT, maps bounded contexts/value streams/Wardley |

**No hallucinations detected. Minor discrepancy on commit count explained by post timing.**

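The two internal-consistency checks in the table can be reproduced with a few lines of arithmetic. The sketch below uses only the percentages quoted in the post; treating feat, fix, tidy, and test as the "code" commit types is an assumption inferred from the four-way code-only breakdown.

```python
# Full commit breakdown as quoted in the post (percent of all commits).
full = {
    "feat": 30.0,
    "fix": 22.0,
    "docs": 23.0,
    "tidy": 14.0,
    "config": 5.4,
    "test": 2.3,
    "other": 2.3,
}

# Assumed set of code commit types (docs, config, and other are excluded).
code_types = ("feat", "fix", "tidy", "test")

# Check 1: the listed percentages should add up to roughly 100.
total = round(sum(full.values()), 1)

# Check 2: renormalize the code types so they sum to 100 on their own.
code_total = sum(full[kind] for kind in code_types)
code_only = {kind: round(100 * full[kind] / code_total, 1) for kind in code_types}

print(total)
print(code_only)
```

Because the inputs are already rounded, the recomputed code-only shares can only be expected to agree with the reported 44/32/21/3 to within about one percentage point.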
---

## Decision

- **Score**: 3/5
- **Action**: Integrate partially: git prompt (high priority) + context window paragraph (medium priority). Skip case study bullet.
- **Confidence**: High on scope/placement; medium on data (n=1 limitation acknowledged)

@ -0,0 +1,107 @@
# Resource Evaluation: claude-security-audit (VicKayro)

**Date**: 2026-03-16
**Evaluator**: Claude Sonnet 4.6
**Resource URL**: https://github.com/VicKayro/claude-security-audit
**Resource Type**: Open-source Claude Code command (GitHub)
**Author**: VicKayro
**Published**: 2026-02-26
**Stars**: 60 | **Forks**: 6 | **License**: MIT (README-declared, no SPDX file)

---

## Executive Summary

A single-file `/security-audit` slash command for Claude Code that runs a 16-section OWASP-mapped web app audit and scores the result out of 10. The repo is 18 days old. The guide already has full OWASP coverage via `security-audit.md`, `security-check.md`, the `security-auditor.md` agent, and the 41KB `security-hardening.md`. Two patterns in this resource are genuinely better than what we have: an environment context step (dev/staging/prod) before auditing, and an anti-false-positive factual check before reporting secrets (it inspects the actual git history before raising a finding). One area is a genuine gap: paywall/billing logic audit. Everything else overlaps.

---

## Content Summary

- **16 audit sections** with OWASP Top 10 (2021) + CWE IDs: HTTP headers, auth, CSRF, open redirect, injection (SQL/XSS/command), IDOR/access control, secrets and crypto, paywall/billing, vulnerable deps (npm audit + pip-audit), CORS, files/config, WebSocket, SSRF, logging/monitoring, data integrity, software integrity
- **Context-aware pre-step**: asks dev/staging/prod before starting, which avoids false positives on debug flags, CORS `*`, and HTTP-only configs that are normal in local dev
- **Factual verification requirement**: before reporting any secret, runs `git log --all -p -- '*.env'` and checks `.gitignore`; no finding is raised without concrete proof
- **Scoring /10** with a structured severity table (CRITIQUE → HAUTE → MOYENNE → BASSE, i.e. critical, high, medium, low), file:line, and recommended fix with code
- **Disclaimer**: report includes an explicit reminder that it needs human review before any action
- **258 lines**, French-language prompts, MIT

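The two patterns worth extracting (the environment pre-step and the proof-before-reporting rule) reduce to a small filter over candidate findings. The sketch below is a hypothetical illustration, not the command's logic: the `Finding` type, the kind labels, and `should_report` are invented names, and the evidence string stands in for whatever `git log --all -p -- '*.env'` returned.

```python
from dataclasses import dataclass

# Finding kinds treated as normal in local development (per the summary above).
DEV_TOLERATED = {"debug_flag", "cors_wildcard", "http_only"}


@dataclass
class Finding:
    kind: str           # hypothetical label, e.g. "secret" or "cors_wildcard"
    evidence: str = ""  # e.g. the relevant excerpt of git history output


def should_report(finding: Finding, environment: str) -> bool:
    """Apply the environment context and the proof requirement to one finding."""
    # Environment pre-step: dev-only conveniences are not findings in dev.
    if environment == "dev" and finding.kind in DEV_TOLERATED:
        return False
    # Factual verification: a secret is reported only with concrete proof.
    if finding.kind == "secret" and not finding.evidence.strip():
        return False
    return True
```

The same wildcard CORS setting is therefore suppressed for `dev` but reported for `staging` or `prod`, and a suspected secret with no supporting git history is dropped rather than reported on suspicion.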
---

## Gap Analysis vs. Guide

| Section | VicKayro's Command | Our Coverage |
|---------|-------------------|--------------|
| OWASP Top 10 structure | ✅ Full mapping + CWE | ✅ `security-auditor.md` agent, `security-hardening.md` |
| HTTP headers | ✅ 7 headers with CWE | ⚠️ Mentioned in hardening guide, not in audit command |
| Auth / JWT | ✅ 10 checks | ✅ `security-audit.md` Phase 2+3 |
| Secrets / git history | ✅ Factual check pattern | ✅ `security-audit.md` Phase 2 (pattern similar but less strict) |
| Dep scan (npm/pip) | ✅ | ✅ `security-audit.md` Phase 4 |
| CSRF / open redirect | ✅ | ⚠️ Not explicit in our commands |
| IDOR / access control | ✅ | ⚠️ Not explicit in our commands |
| CORS | ✅ | ⚠️ Not explicit in our commands |
| **Paywall / billing** | ✅ Full section | ❌ **Not covered anywhere in the guide** |
| WebSocket | ✅ | ❌ Not covered |
| SSRF | ✅ | ⚠️ Mentioned in hardening guide, not in audit command |
| Dev/staging/prod context step | ✅ Explicit pre-step | ❌ Not in our command |
| Anti-FP factual verification | ✅ Explicit requirement | ⚠️ Partial: our command checks `.gitignore` but doesn't mandate git log proof |
| Claude Code-specific (hooks, prompt injection, MCP) | ❌ Not covered | ✅ Our unique angle |
| Score /100 posture model | ❌ Score /10 only | ✅ `security-audit.md` Phase 6 |

**Real gaps**: paywall/billing audit, the strict anti-false-positive pattern, the environment context pre-step.

---

## Score

**Score: 2/5** (Marginal)

The overlap with existing guide content is substantial. Most of the 16 sections are already covered through the combination of `security-audit.md`, `security-auditor.md`, and `security-hardening.md`. The command is 18 days old with 60 stars, an insufficient track record for security tooling, where false confidence is worse than no tool. The LinkedIn post framing ("vibe coding security") is accurate marketing but doesn't change the technical substance.

The two extractable patterns (environment context step + factual git-history check before secrets findings) and the paywall/billing gap are worth acting on, but they can be addressed by enhancing our existing command without citing this repo.

---

## Challenge (technical-writer agent)

> "The score is wrong. It should be 2/5, not 3/5. The evaluator confused 'interesting' with 'valuable.'"
>
> "The repo is 18 days old with 60 stars. That is not a signal of validation, it is a signal of recency bias. Security tooling needs a longer track record."
>
> "The integration recommendation is backwards. 'Mention in security-hardening.md as community resource' means we are directing readers to a 60-star, 18-day-old single-file command. If the patterns are worth having, extract them. If not, don't mention the repo."
>
> "The paywall/billing section is the one genuinely novel angle. None of our existing commands audit billing logic or access control around paywalled features. The evaluation mentions it in the feature list and then says nothing about it."

Challenge accepted. Score adjusted to 2/5. Integration plan revised accordingly.

---

## Fact-Check

| Claim | Verified | Source |
|-------|----------|--------|
| MIT license | ⚠️ Partial | README says MIT, no SPDX file or LICENSE in repo |
| 16 audit sections | ✅ | Read from command file directly (gh api) |
| OWASP Top 10 (2021) mapping | ✅ | Command file contains OWASP reference table with A01-A10 |
| npm audit + pip-audit | ✅ | Section 9 of the command file |
| Score /10 | ✅ | Command file output format |
| 60 stars, created 2026-02-26 | ✅ | GitHub API |
| VicKayro author profile | ⚠️ | GitHub API returns 404 for user profile; minimal public presence |

No invented stats. The LinkedIn post claims "16 sections" and "OWASP Top 10 (2021) mappé" ("mapped"); both are accurate.

---

## Decision

**Score: 2/5. Do not integrate as an external resource**

**Action: Silently extract two patterns and one missing audit section into existing commands and agents**

1. Add an environment context pre-step to `examples/commands/security-audit.md` (ask dev/staging/prod before Phase 1)
2. Strengthen the anti-false-positive requirement in Phase 2 (mandate `git log --all -p` before reporting secrets)
3. Add a Paywall/Billing audit section to `examples/agents/security-auditor.md`
4. Revisit the repo in 3 months if it reaches 200+ stars with active issues/PRs

**No guide mention.** The resource does not meet the threshold for a community reference link given its age, minimal author profile, and the absence of a formal LICENSE file. The guide already has security coverage that is broader in several dimensions (Claude Code-specific threats, hook security, prompt injection). Extracting the useful patterns internally is cleaner than sending readers to an immature repo.

**Confidence**: High. Full command file reviewed, all claims verified against source.

@ -0,0 +1,86 @@
# Resource Evaluation: "What Claude's 1M Token Context Window Means for Your Work"

**Source**: https://reading.sh/what-claudes-1m-token-context-window-means-for-your-work-3c9f900f04c6
**Author**: JP Caparas (Medium)
**Published**: ~March 15, 2026 (13 min read, 62 claps)
**Type**: Plain-language explainer / pricing analysis
**Evaluated**: 2026-03-15
**Score**: 2/5 (Marginal: do not integrate)

---

## Summary

Plain-language explainer covering:

- What tokens are and what 1M tokens can hold (codebases, legal docs, academic papers)
- The claim that 1M context went GA on March 13 for Opus 4.6
- A pricing comparison against OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Meta Llama 4
- Claude Code workflow impact (compaction reduction, full-codebase sessions)
- Enterprise use cases (compliance review, codebase migration, incident response)
- The "lost in the middle" effect as an ongoing limitation
- An MRCR v2 score (78.3% for Opus 4.6)
- Quotes from Jon Bell (CPO, Codeium) and Anton Biryukov (Ramp)

---

## Fact-Check
|
||||
|
||||
| Claim | Status | Notes |
|-------|--------|-------|
| 1M context GA March 13, 2026 | **Unconfirmed** | Perplexity shows "beta" still active; March 13 maps to a usage promotion, not a GA announcement |
| Flat pricing $5/$25 MTok, no surcharge at any length | **FALSE** | 2x input / 1.5x output surcharge confirmed above 200K tokens (awesomeagents.ai, puter.com); this is the article's central thesis |
| Opus 4.6 base price $5/$25 MTok | Confirmed | Consistent across multiple pricing sources |
| Sonnet 4.6 $3/$15 MTok | Confirmed | Consistent across multiple pricing sources |
| Cached reads $0.50/MTok (90% discount) | Plausible | Standard Anthropic caching discount |
| Cache writes $6.25/MTok | Plausible | Standard Anthropic caching pricing |
| OpenAI GPT-5.4 2x rate limit above 272K | **Unverified** | No primary source found |
| Google Gemini 3.1 Pro surcharge above 200K | Partially confirmed | Google does surcharge; exact numbers differ from article |
| MRCR v2: 78.3% Opus 4.6 | **Discrepancy** | Guide cites 76% from Anthropic blog; article may reference a different variant or a revised benchmark run |
| Jon Bell (Codeium): 15% compaction decrease | **Unverifiable** | No independent corroboration found |
| Anton Biryukov (Ramp) quote | **Unverifiable** | No independent corroboration found |
| 600 images/PDF per request (up from 100) | **Unverified** | Not confirmed in official API docs |
| Haiku caps at 200K | Confirmed | Consistent with known specs |
|
||||
|
||||
**Critical finding**: The article's differentiating claim is "Anthropic's bet is flat pricing — no surcharge regardless of length." This is factually wrong. Anthropic charges 2x input and 1.5x output above 200K tokens. The entire competitive pricing analysis built on this premise is unreliable.
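
To make the size of the error concrete, the sketch below prices a single request under the surcharge figures from the fact-check table (2x input, 1.5x output above 200K tokens). It is illustrative only: `Pricing`, `request_cost`, and the token counts are hypothetical names and inputs, and it assumes the surcharge applies to the whole request once the input passes 200K tokens. That assumption should be confirmed against the official pricing page before any figure is reused.

```python
from dataclasses import dataclass

LONG_CONTEXT_THRESHOLD = 200_000  # tokens; threshold cited in the fact-check table


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float           # USD per million input tokens
    output_per_mtok: float          # USD per million output tokens
    input_multiplier: float = 2.0   # surcharge reported above the threshold
    output_multiplier: float = 1.5


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Return the USD cost of one request.

    Assumption: once the input exceeds the threshold, the surcharge applies
    to every token in the request, not only to the tokens past 200K.
    """
    long_context = input_tokens > LONG_CONTEXT_THRESHOLD
    in_rate = pricing.input_per_mtok * (pricing.input_multiplier if long_context else 1.0)
    out_rate = pricing.output_per_mtok * (pricing.output_multiplier if long_context else 1.0)
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Opus 4.6 base price as listed in the table ($5 input / $25 output per MTok).
OPUS = Pricing(input_per_mtok=5.0, output_per_mtok=25.0)

if __name__ == "__main__":
    for tokens in (150_000, 800_000):
        print(f"{tokens:>7} input tokens: ${request_cost(tokens, 4_000, OPUS):.2f}")
```

Under these assumptions an 800K-token request costs roughly twice what a flat-rate model would predict, which is why the article's competitor comparison cannot be reused.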
|
||||
|
||||
---
|
||||
|
||||
## Comparative Analysis
|
||||
|
||||
| Aspect | This resource | Our guide |
|--------|--------------|-----------|
| Token basics (what is a token) | Covered, beginner-friendly | Not covered (assumes dev audience) |
| 1M context capabilities | Generic scenarios | Covered with MRCR benchmarks + cost tables |
| Pricing vs competitors | Covered, but factually wrong on flat pricing | Partially covered (Gemini comparison, line 2053) |
| Compaction events | Mentions 15% reduction (unverifiable quote) | Covered in depth (architecture.md lines 391-438) |
| "Lost in the middle" effect | Mentioned with arxiv:2307.03172 reference | Not explicitly covered |
| Enterprise use cases | Hypothetical, no measured data | Covered across multiple sections |
| MRCR v2 benchmark | 78.3% (discrepancy with our 76%) | 76% from Anthropic blog |
| Claude Code workflow impact | Good framing (search phase = 100K+ tokens) | Covered via compaction + context engineering |
|
||||
|
||||
---
|
||||
|
||||
## Challenge Notes
|
||||
|
||||
The technical-writer review agreed that a score of 2/5 is justified. An initial proposal of 3/5 was revised downward because the fact-check undermines the article's core value proposition: wrong pricing data cannot complement a guide that aims for accuracy. The Povilas Korop anecdote (83% context utilization on a Laravel project) is a useful real-world datapoint, but it is anecdotal and insufficient to shift the score.
|
||||
|
||||
Key insight from review: "Fix stale info in the guide" is the real action item, independent of this resource. If 1M context is actually GA, our guide still says "beta" at lines 2028-2070 — that needs a fix regardless of this article.
|
||||
|
||||
---
|
||||
|
||||
## Decision
|
||||
|
||||
**Do not integrate.**
|
||||
|
||||
### Specific exclusions
|
||||
|
||||
- Pricing comparison table (central claim on flat pricing is wrong)
- Enterprise use cases (hypothetical, no measured data)
- Unverifiable quotes (Jon Bell, Anton Biryukov)
|
||||
|
||||
### Independent action items triggered by this review
|
||||
|
||||
1. **Verify 1M GA status** via official Anthropic docs / API changelog
2. **If confirmed GA**: Update `guide/ultimate-guide.md` lines 2028-2070 (currently says "beta")
3. **If confirmed GA**: Verify whether the 200K surcharge structure changed at all
4. **Consider adding**: One line on "lost in the middle" effect in `guide/core/context-engineering.md` (well-documented limitation, arxiv:2307.03172, independent of this article)
|
||||
|
||||
### Worth independent verification
|
||||
|
||||
The Ramp workflow pattern (search Datadog + Braintrust + DB + source = 100K tokens before writing a single fix) is a useful illustration of why large context windows matter for real engineering workflows — worth adding if an independent source confirms it.
|
||||
|
||||
---
|
||||
|
||||
**Confidence**: High. The fact-check identified a critical error in the central thesis; no ambiguity on the decision.
|
||||
|
|
@ -35,6 +35,11 @@ Perform security audits with isolated context, focusing on vulnerability detecti
|
|||
- [ ] Threat modeling considered
|
||||
- [ ] Security requirements defined
|
||||
- [ ] Principle of least privilege
|
||||
- [ ] Paywall/billing limits enforced server-side (not client-side)
|
||||
- [ ] Subscription status read from DB, not from a client-supplied token or claim
|
||||
- [ ] Payment webhook signatures verified (Stripe `stripe.webhooks.constructEvent`, Paddle equivalent)
|
||||
- [ ] No endpoint bypasses billing verification (e.g., admin routes that skip plan checks)
|
||||
- [ ] No race condition on session/resource creation that could allow free usage beyond limits (CWE-362)
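
As a reference point for the webhook item above, this is a minimal sketch of what server-side signature verification checks. It uses a generic timestamped HMAC-SHA256 scheme and is not Stripe's or Paddle's actual wire format; `sign_payload`, `verify_webhook`, and the 300-second tolerance are hypothetical. In a real audit, the finding is whether the code calls the provider's own verification helper.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_payload(payload: bytes, timestamp: str, secret: bytes) -> str:
    """Compute the hex HMAC-SHA256 of '<timestamp>.<payload>' (generic scheme)."""
    signed = timestamp.encode() + b"." + payload
    return hmac.new(secret, signed, hashlib.sha256).hexdigest()


def verify_webhook(payload: bytes, timestamp: str, signature_hex: str,
                   secret: bytes, tolerance_s: int = 300,
                   now: Optional[float] = None) -> bool:
    """Return True only if the signature matches and the timestamp is fresh."""
    current = time.time() if now is None else now
    try:
        age = abs(current - int(timestamp))
    except ValueError:
        return False  # malformed timestamp
    if age > tolerance_s:
        return False  # too old or too far in the future: possible replay
    expected = sign_payload(payload, timestamp, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The two properties worth checking in audited code are the same ones this sketch enforces: the raw request body is what gets signed (not a re-serialized JSON object), and stale timestamps are rejected.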
|
||||
|
||||
### A05: Security Misconfiguration
|
||||
- [ ] Default credentials changed
|
||||
|
|
|
|||
|
|
@ -17,6 +17,22 @@ You are a senior application security engineer. Perform a 6-phase security audit
|
|||
|
||||
---
|
||||
|
||||
### Pre-Step: Establish Audit Context
|
||||
|
||||
**Before running any checks**, use `AskUserQuestion` to ask:
|
||||
|
||||
1. **Environment**: Is this code running in production, staging, or local development?
|
||||
2. **Scope**: Full audit or specific areas to prioritize?
|
||||
|
||||
This is critical for accurate findings:
|
||||
- **Local dev**: `DEBUG=True`, CORS `*`, HTTP without TLS, and `.env` files are all normal. Do NOT flag them as vulnerabilities. Mention them in a "Before going to production" informational section instead.
- **Staging**: Configs should mirror production. Flag deviations as MEDIUM.
- **Production**: Any misconfiguration is a real finding with full severity.
|
||||
|
||||
If the user doesn't answer or is unsure, default to **production** (conservative).
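
The environment rule above can be expressed as a small decision function. This is a sketch under stated assumptions, not part of the command: `Env`, `resolve_env`, `misconfig_severity`, the accepted answer strings, and the `"INFO"` label for the informational section are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Env(Enum):
    LOCAL = "local"
    STAGING = "staging"
    PRODUCTION = "production"


# Hypothetical answer strings a user might give to the environment question.
_ALIASES = {
    "local": Env.LOCAL,
    "local development": Env.LOCAL,
    "dev": Env.LOCAL,
    "staging": Env.STAGING,
    "production": Env.PRODUCTION,
    "prod": Env.PRODUCTION,
}


def resolve_env(answer: Optional[str]) -> Env:
    """Map the user's answer to an environment; unknown or missing means production."""
    normalized = (answer or "").strip().lower()
    return _ALIASES.get(normalized, Env.PRODUCTION)


def misconfig_severity(env: Env, base_severity: str) -> str:
    """Adjust the severity of a configuration finding for the audited environment."""
    if env is Env.LOCAL:
        return "INFO"       # reported under "Before going to production", not as a vulnerability
    if env is Env.STAGING:
        return "MEDIUM"     # deviation from production config
    return base_severity    # production: full severity
```

Defaulting to production keeps the failure mode conservative: an unanswered question can only produce over-reporting, never a missed finding.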
|
||||
|
||||
---
|
||||
|
||||
### Phase 1: Configuration Security (via /security-check)
|
||||
|
||||
Execute all checks from `/security-check` (the `examples/commands/security-check.md` command). This covers:
|
||||
|
|
@ -59,6 +75,23 @@ find . -name ".env*" -not -path "*/node_modules/*" -not -path "*/.git/*" -type f
|
|||
}
|
||||
```
|
||||
|
||||
**Anti-false-positive rule — MANDATORY before reporting any secret finding:**
|
||||
|
||||
Before raising a secrets finding, run these verification commands:
|
||||
|
||||
```bash
# 1. Verify .env is actually in .gitignore (if yes, local .env is NOT a finding)
grep -n '\.env' .gitignore 2>/dev/null || echo ".env NOT in .gitignore"

# 2. Verify secrets were actually committed (empty output = no finding)
git log --all -p -- '*.env' '*.key' '*.pem' '*.secret' 2>/dev/null | grep -E '^\+.*(password|secret|api_key|token)' | head -20

# 3. Check git history for provider-specific patterns anywhere on an added line
git log --all -p 2>/dev/null | grep -E '^\+.*(sk-[a-zA-Z0-9]{20,}|AKIA[A-Z0-9]{16}|ghp_[a-zA-Z0-9]{36})' | head -10
```
|
||||
|
||||
Only report a secret finding if you have **concrete proof from these commands**. A `.env` file present locally is not a finding if it's in `.gitignore`. Never report "secrets may be exposed" based on pattern matching alone.
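
The provider-specific check can also be run over captured `git log -p` output in a script. The following is a minimal sketch, not part of the audit command: `PROVIDER_PATTERNS`, its labels, and `committed_secret_hits` are hypothetical names, and the regexes reuse the three patterns from command 3 above. It reports line numbers rather than matched values so that the audit report does not itself leak a secret.

```python
import re
from typing import List, Tuple

# Labels are illustrative; the "sk-" prefix is shared by several providers.
PROVIDER_PATTERNS = {
    "sk_prefixed_key": re.compile(r"sk-[a-zA-Z0-9]{20,}"),
    "aws_access_key_id": re.compile(r"AKIA[A-Z0-9]{16}"),
    "github_pat": re.compile(r"ghp_[a-zA-Z0-9]{36}"),
}


def committed_secret_hits(diff_text: str) -> List[Tuple[str, int]]:
    """Return (pattern label, 1-based line number) for added lines that match."""
    hits = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        # Only added lines count as proof; skip "+++ b/file" headers.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in PROVIDER_PATTERNS.items():
            if pattern.search(line):
                hits.append((label, lineno))
    return hits
```

An empty result means there is no concrete proof from history, so under the rule above no secret finding should be raised.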
|
||||
|
||||
**Scoring:**
|
||||
- 0 secrets found → +20 points
|
||||
- 1-3 secrets → +10 points
|
||||
|
|
|
|||
78
examples/hooks/bash/identity-reinjection.sh
Executable file
|
|
@ -0,0 +1,78 @@
|
|||
#!/bin/bash
|
||||
# .claude/hooks/identity-reinjection.sh
|
||||
# Event: UserPromptSubmit
|
||||
# Non-blocking guard: re-injects agent identity if context compaction erased it
|
||||
#
|
||||
# Problem: When Claude compacts context during a long session, agents configured
|
||||
# with a specific role (team lead, reviewer, etc.) can "forget" their identity.
|
||||
# The compacted transcript no longer contains the original system instructions.
|
||||
#
|
||||
# Solution: Store identity in a file. After each user message, check whether the
|
||||
# last assistant response includes the expected identity marker. If not, inject
|
||||
# the identity as additionalContext so the next response re-establishes the role.
|
||||
#
|
||||
# Setup:
|
||||
# 1. Create .claude/agent-identity.txt with your agent's identity instructions
|
||||
# 2. Set IDENTITY_MARKER to a short string that should appear in agent responses
|
||||
# (e.g. "LEAD:", "REVIEWER:", "DEVELOPER:")
|
||||
# 3. Wire this hook to UserPromptSubmit in settings.json
|
||||
#
|
||||
# Output format: UserPromptSubmit additionalContext injection
|
||||
# Properties:
|
||||
# - Silent no-op when identity is present (zero overhead)
|
||||
# - Silent no-op when no identity file configured
|
||||
# - Never blocks — exits 0 in all cases
|
||||
# - Compatible with agent teams (SubagentStart, SubagentStop) and solo sessions
|
||||
#
|
||||
# Based on pattern from Nick Tune: https://nick-tune.me/blog/2026-02-28-hook-driven-dev-workflows-with-claude-code/
|
||||
|
||||
set -uo pipefail
|
||||
|
||||
INPUT=$(cat)
|
||||
|
||||
# Identity file: customize path or override via env
|
||||
IDENTITY_FILE="${CLAUDE_IDENTITY_FILE:-.claude/agent-identity.txt}"
|
||||
IDENTITY_MARKER="${CLAUDE_IDENTITY_MARKER:-}"
|
||||
|
||||
# No-op: identity file not configured
|
||||
if [[ ! -f "$IDENTITY_FILE" ]]; then
|
||||
exit 0
|
||||
fi
|
||||
|
||||
IDENTITY=$(cat "$IDENTITY_FILE")
|
||||
|
||||
# No-op: empty identity
|
||||
if [[ -z "$IDENTITY" ]]; then
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Default marker: first non-empty, non-comment line of the identity file
if [[ -z "$IDENTITY_MARKER" ]]; then
  IDENTITY_MARKER=$(grep -m1 -Ev '^[[:space:]]*(#|$)' "$IDENTITY_FILE" | head -c 40 || true)
fi
|
||||
|
||||
# Read transcript to check last assistant message
|
||||
TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.transcript_path // empty' 2>/dev/null || true)
|
||||
|
||||
if [[ -z "$TRANSCRIPT_PATH" || ! -f "$TRANSCRIPT_PATH" ]]; then
|
||||
exit 0
|
||||
fi
|
||||
|
||||
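# Extract the last assistant message from the transcript.
# The transcript is JSONL (one JSON object per line), so slurp it into an array with -s.
# Assumption: each entry carries the message under .message; fall back to the entry itself.
LAST_ASSISTANT=$(jq -rs '
  [.[] | (.message // .) | select(type == "object" and .role == "assistant")] | last | .content |
  if type == "array" then map(select(.type == "text") | .text) | join("") else (. // "") end
' "$TRANSCRIPT_PATH" 2>/dev/null || true)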
|
||||
|
||||
# Identity is intact: no action needed
|
||||
if echo "$LAST_ASSISTANT" | grep -qF "$IDENTITY_MARKER" 2>/dev/null; then
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# Identity marker missing from last response — re-inject
|
||||
# This happens after context compaction strips the original system instructions
|
||||
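# Build the string inside jq so that "\n" is a real newline;
# inside a bash double-quoted --arg value it would stay a literal backslash-n.
jq -n \
  --arg identity "$IDENTITY" \
  '{"additionalContext": ("[Identity reminder: your role and instructions follow. Resume your role immediately.]\n\n" + $identity)}'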
|
||||
|
||||
exit 0
|
||||
|
|
@ -27,6 +27,7 @@ Utility scripts for Claude Code power users.
|
|||
| `sync-claude-config.sh` | Sync Claude config files across machines |
|
||||
| `sonnetplan.sh` | Run Claude with Sonnet replacing Opus (cost optimization alias) |
|
||||
| `test-prompt-caching.ts` | Verify Anthropic prompt caching is active (no deps, fetch only) |
|
||||
| `smart-suggest-roi.py` | Analyze acceptance rate of smart-suggest hook suggestions vs session activity |
|
||||
|
||||
---
|
||||
|
||||
|
|
|
|||
621
examples/scripts/smart-suggest-roi.py
Executable file
|
|
@ -0,0 +1,621 @@
|
|||
#!/usr/bin/env python3
|
||||
"""
|
||||
smart-suggest-roi.py — Analyze acceptance rate of smart-suggest hook suggestions.
|
||||
|
||||
Usage:
|
||||
./smart-suggest-roi.py # Full report
|
||||
./smart-suggest-roi.py --json # Machine-readable JSON
|
||||
./smart-suggest-roi.py --since 7d # Last N days
|
||||
./smart-suggest-roi.py --no-sessions # Suggestion stats only (fast)
|
||||
./smart-suggest-roi.py --log PATH # Custom log path
|
||||
|
||||
Methodology: "Followed" = the suggested command/agent was used later in the
|
||||
same session. Proxy metric — user may have used it independently of the
|
||||
suggestion, or in a different session.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import bisect
|
||||
import json
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from datetime import datetime, timezone, timedelta
|
||||
from pathlib import Path
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Tier classification (extensible mapping)
|
||||
# ---------------------------------------------------------------------------
|
||||
TIER_MAP = {
|
||||
# Tier 0 — Enforcement (high-stakes, process gates)
|
||||
"pnpm changelog:add": 0,
|
||||
"/pr": 0,
|
||||
"/plan": 0,
|
||||
"/tech:plan": 0,
|
||||
"/tech:pr": 0,
|
||||
"/tech:commit": 0,
|
||||
# Tier 1 — Discovery (specialized workflows rarely triggered organically)
|
||||
"/test-loop": 1,
|
||||
"/retex": 1,
|
||||
"/tech:retex": 1,
|
||||
"/dupes": 1,
|
||||
"/tech:dupes": 1,
|
||||
"/loop": 1,
|
||||
"security-auditor": 1,
|
||||
"/release": 1,
|
||||
"/tech:ralph-loop": 1,
|
||||
"/tech:scaffold": 1,
|
||||
"/tech:sonarqube": 1,
|
||||
"complexity-estimator": 1,
|
||||
"/tech:diagram": 1,
|
||||
"/tech:handoff": 1,
|
||||
"/tech:daily": 1,
|
||||
"/tech:bilan-hebdo": 1,
|
||||
"/tech:worktree": 1,
|
||||
"/tech:sentry-triage": 1,
|
||||
"skill-creator": 1,
|
||||
"/tech:create-release": 1,
|
||||
"/tech:tests": 1,
|
||||
"/tech:diagnose": 1,
|
||||
# Tier 2 — Contextual (common helpers, lower novelty)
|
||||
"code-reviewer": 2,
|
||||
"debugger": 2,
|
||||
"architect-review": 2,
|
||||
"/resume": 2,
|
||||
"/tech:resume": 2,
|
||||
"ui-designer": 2,
|
||||
"requirements-analyst": 2,
|
||||
"backend-architect": 2,
|
||||
"/tech:ship": 2,
|
||||
"/critique-plan": 2,
|
||||
}
|
||||
|
||||
TIER_LABELS = {0: "Tier 0 (Enforcement)", 1: "Tier 1 (Discovery)", 2: "Tier 2 (Contextual)", -1: "Custom"}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def parse_ts(ts_str: str) -> float:
|
||||
"""Parse ISO 8601 timestamp to Unix epoch float."""
|
||||
if not ts_str:
|
||||
return 0.0
|
||||
ts_str = ts_str.rstrip("Z")
|
||||
for fmt in ("%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S"):
|
||||
try:
|
||||
dt = datetime.strptime(ts_str, fmt).replace(tzinfo=timezone.utc)
|
||||
return dt.timestamp()
|
||||
except ValueError:
|
||||
continue
|
||||
return 0.0
|
||||
|
||||
|
||||
def first_token(cmd: str) -> str:
|
||||
"""Return first whitespace-delimited token (for commands like '/loop [interval]')."""
|
||||
return cmd.split()[0] if cmd else cmd
|
||||
|
||||
|
||||
def get_tier(cmd: str) -> int:
|
||||
"""Classify a command into a tier. Returns -1 for unknown (Custom)."""
|
||||
return TIER_MAP.get(cmd, TIER_MAP.get(first_token(cmd), -1))
|
||||
|
||||
|
||||
def parse_since(since_str: str) -> float:
|
||||
"""Parse '7d', '24h', '30m' into a Unix timestamp cutoff."""
|
||||
unit = since_str[-1]
|
||||
value = int(since_str[:-1])
|
||||
now = datetime.now(tz=timezone.utc).timestamp()
|
||||
if unit == "d":
|
||||
return now - value * 86400
|
||||
if unit == "h":
|
||||
return now - value * 3600
|
||||
if unit == "m":
|
||||
return now - value * 60
|
||||
raise ValueError(f"Unsupported time unit: {unit}. Use d/h/m (e.g. 7d, 24h).")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 1 — Parse suggestions log
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def parse_suggestions(log_path: Path, since_ts: float = 0.0):
|
||||
"""
|
||||
Returns list of suggestion dicts and skip count.
|
||||
Each dict: {ts, suggested, prompt_len, cmd (first token)}
|
||||
"""
|
||||
suggestions = []
|
||||
skip_count = 0
|
||||
|
||||
if not log_path.exists():
|
||||
return suggestions, skip_count
|
||||
|
||||
with log_path.open("r", encoding="utf-8", errors="replace") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
try:
|
||||
entry = json.loads(line)
|
||||
ts = parse_ts(entry.get("ts", ""))
|
||||
if ts == 0.0:
|
||||
skip_count += 1
|
||||
continue
|
||||
if ts < since_ts:
|
||||
continue
|
||||
suggested = entry.get("suggested", "")
|
||||
if not suggested:
|
||||
skip_count += 1
|
||||
continue
|
||||
suggestions.append({
|
||||
"ts": ts,
|
||||
"suggested": suggested,
|
||||
"cmd": first_token(suggested),
|
||||
"prompt_len": entry.get("prompt_len", 0),
|
||||
})
|
||||
except (json.JSONDecodeError, KeyError, TypeError):
|
||||
skip_count += 1
|
||||
|
||||
suggestions.sort(key=lambda x: x["ts"])
|
||||
return suggestions, skip_count
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 2 — Build session index & detect acceptance
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _read_first_last_ts(path: Path):
|
||||
"""Read first and last timestamp from a session JSONL file efficiently."""
|
||||
first_ts = None
|
||||
last_ts = None
|
||||
session_id = None
|
||||
|
||||
try:
|
||||
with path.open("r", encoding="utf-8", errors="replace") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
try:
|
||||
entry = json.loads(line)
|
||||
ts = parse_ts(entry.get("timestamp", ""))
|
||||
if ts == 0.0:
|
||||
continue
|
||||
if first_ts is None:
|
||||
first_ts = ts
|
||||
session_id = entry.get("sessionId", "")
|
||||
last_ts = ts
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
continue
|
||||
except (PermissionError, OSError):
|
||||
pass
|
||||
|
||||
return first_ts, last_ts, session_id
|
||||
|
||||
|
||||
def build_session_index(projects_dir: Path):
|
||||
"""
|
||||
Walk all project JSONL session files and build a sorted index for lookup.
|
||||
|
||||
Returns:
|
||||
- sessions: list of {start_ts, end_ts, session_id, path} sorted by start_ts
|
||||
- start_ts_list: just start timestamps for bisect
|
||||
"""
|
||||
sessions = []
|
||||
|
||||
if not projects_dir.exists():
|
||||
return sessions, []
|
||||
|
||||
for jsonl_file in projects_dir.glob("*/*.jsonl"):
|
||||
# Skip activity logs and smart-suggest logs (not session files)
|
||||
if "activity-" in jsonl_file.name or "smart-suggest" in jsonl_file.name:
|
||||
continue
|
||||
first_ts, last_ts, session_id = _read_first_last_ts(jsonl_file)
|
||||
if first_ts is None:
|
||||
continue
|
||||
sessions.append({
|
||||
"start_ts": first_ts,
|
||||
"end_ts": last_ts or first_ts,
|
||||
"session_id": session_id,
|
||||
"path": jsonl_file,
|
||||
})
|
||||
|
||||
sessions.sort(key=lambda x: x["start_ts"])
|
||||
start_ts_list = [s["start_ts"] for s in sessions]
|
||||
return sessions, start_ts_list
|
||||
|
||||
|
||||
def find_sessions_for_ts(ts: float, sessions: list, start_ts_list: list, window_before: float = 120.0):
|
||||
"""
|
||||
Find sessions that were active at timestamp ts.
|
||||
A session is "active" if ts is between start and end (+ small buffer).
|
||||
"""
|
||||
if not sessions:
|
||||
return []
|
||||
|
||||
# Binary search: find sessions that started before ts + window_before
|
||||
hi = bisect.bisect_right(start_ts_list, ts + window_before)
|
||||
candidates = sessions[:hi]
|
||||
|
||||
active = []
|
||||
for s in candidates:
|
||||
if s["start_ts"] <= ts + window_before and s["end_ts"] >= ts - 30:
|
||||
active.append(s)
|
||||
return active
|
||||
|
||||
|
||||
def _check_acceptance_in_session(path: Path, cmd_token: str, suggestion_ts: float, time_window: float = 600.0):
|
||||
"""
|
||||
Scan a session JSONL file for evidence the suggested command was followed.
|
||||
|
||||
Acceptance signals (in priority order):
|
||||
1. <command-name>cmd</command-name> in user message content
|
||||
2. Skill tool use with skill = cmd
|
||||
3. Agent tool use with subagent_type = cmd
|
||||
4. cmd appears in next 5 user messages within time_window seconds
|
||||
"""
|
||||
entries_after = []
|
||||
|
||||
try:
|
||||
with path.open("r", encoding="utf-8", errors="replace") as f:
|
||||
for line in f:
|
||||
line = line.strip()
|
||||
if not line:
|
||||
continue
|
||||
try:
|
||||
entry = json.loads(line)
|
||||
ts = parse_ts(entry.get("timestamp", ""))
|
||||
if ts >= suggestion_ts:
|
||||
entries_after.append((ts, entry))
|
||||
except (json.JSONDecodeError, TypeError):
|
||||
continue
|
||||
except (PermissionError, OSError):
|
||||
return None # Cannot read file
|
||||
|
||||
if not entries_after:
|
||||
return None # No entries after suggestion — cannot determine
|
||||
|
||||
user_message_count = 0
|
||||
|
||||
for ts, entry in entries_after:
|
||||
msg_type = entry.get("type", "")
|
||||
msg = entry.get("message", {})
|
||||
if not isinstance(msg, dict):
|
||||
continue
|
||||
|
||||
role = msg.get("role", "")
|
||||
content = msg.get("content", "")
|
||||
|
||||
# Signal 1: slash command invocation in user message
|
||||
if msg_type == "user" or role == "user":
|
||||
user_message_count += 1
|
||||
content_str = content if isinstance(content, str) else json.dumps(content)
|
||||
# Check for <command-name> tag
|
||||
if f"<command-name>{cmd_token}</command-name>" in content_str:
|
||||
return True
|
||||
# Check for skill invocation pattern
|
||||
if f'"skill": "{cmd_token}"' in content_str or f"'skill': '{cmd_token}'" in content_str:
|
||||
return True
|
||||
# Text mention in first 5 user messages within window
|
||||
if user_message_count <= 5 and ts - suggestion_ts <= time_window:
|
||||
if cmd_token in content_str:
|
||||
return True
|
||||
|
||||
# Signal 2 & 3: tool use in assistant messages
|
||||
if msg_type == "assistant" or role == "assistant":
|
||||
content_list = content if isinstance(content, list) else []
|
||||
for block in content_list:
|
||||
if not isinstance(block, dict):
|
||||
continue
|
||||
if block.get("type") != "tool_use":
|
||||
continue
|
||||
tool_name = block.get("name", "")
|
||||
tool_input = block.get("input", {}) or {}
|
||||
# Signal 2: Skill tool
|
||||
if tool_name == "Skill" and tool_input.get("skill") == cmd_token:
|
||||
return True
|
||||
# Signal 3: Agent tool
|
||||
if tool_name == "Agent" and tool_input.get("subagent_type") == cmd_token:
|
||||
return True
|
||||
|
||||
return False # No signals found
|
||||
|
||||
|
||||
def compute_acceptance(suggestions: list, sessions: list, start_ts_list: list):
|
||||
"""
|
||||
For each suggestion, find matching sessions and check acceptance.
|
||||
Mutates each suggestion dict in-place, adding 'followed' key.
|
||||
"""
|
||||
for s in suggestions:
|
||||
active = find_sessions_for_ts(s["ts"], sessions, start_ts_list)
|
||||
if not active:
|
||||
s["followed"] = None # No session context
|
||||
continue
|
||||
|
||||
# Check all active sessions — accepted if ANY matches
|
||||
result = False
|
||||
any_data = False
|
||||
for sess in active:
|
||||
check = _check_acceptance_in_session(sess["path"], s["cmd"], s["ts"])
|
||||
if check is True:
|
||||
result = True
|
||||
any_data = True
|
||||
break
|
||||
if check is False:
|
||||
any_data = True
|
||||
# check is None: no data in this file
|
||||
|
||||
if not any_data:
|
||||
s["followed"] = None
|
||||
else:
|
||||
s["followed"] = result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Phase 3 — Compute stats
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def compute_stats(suggestions: list):
|
||||
"""Build stats dict from annotated suggestions."""
|
||||
stats = {
|
||||
"total": len(suggestions),
|
||||
"sessions_matched": sum(1 for s in suggestions if s.get("followed") is not None),
|
||||
"followed": sum(1 for s in suggestions if s.get("followed") is True),
|
||||
"by_cmd": defaultdict(lambda: {"total": 0, "followed": 0, "unmatched": 0}),
|
||||
"by_tier": defaultdict(lambda: {"total": 0, "followed": 0}),
|
||||
"by_day": defaultdict(lambda: {"total": 0, "followed": 0}),
|
||||
}
|
||||
|
||||
for s in suggestions:
|
||||
cmd = s["cmd"]
|
||||
tier = get_tier(s["suggested"])
|
||||
# ISO date keys sort chronologically; "%b %d" keys would sort alphabetically by month name
day = datetime.fromtimestamp(s["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
|
||||
|
||||
stats["by_cmd"][cmd]["total"] += 1
|
||||
stats["by_tier"][tier]["total"] += 1
|
||||
stats["by_day"][day]["total"] += 1
|
||||
|
||||
if s.get("followed") is True:
|
||||
stats["by_cmd"][cmd]["followed"] += 1
|
||||
stats["by_tier"][tier]["followed"] += 1
|
||||
stats["by_day"][day]["followed"] += 1
|
||||
elif s.get("followed") is None:
|
||||
stats["by_cmd"][cmd]["unmatched"] += 1
|
||||
|
||||
# Compute unique commands
|
||||
stats["unique_cmds"] = len(stats["by_cmd"])
|
||||
|
||||
return stats
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Output helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def pct(num: int, den: int) -> str:
|
||||
if den == 0:
|
||||
return "n/a"
|
||||
return f"{round(100 * num / den)}%"
|
||||
|
||||
|
||||
def bar(count: int, max_count: int, width: int = 16) -> str:
|
||||
if max_count == 0:
|
||||
return ""
|
||||
filled = round(width * count / max_count)
|
||||
return "█" * filled + " " * (width - filled)
|
||||
|
||||
|
||||
def print_report(stats: dict, suggestions: list, skip_count: int,
|
||||
log_path: Path, projects_dir: Path, no_sessions: bool, since_str: str | None):
|
||||
sep = "═" * 51
|
||||
print(sep)
|
||||
since_label = f" ({since_str})" if since_str else f" ({_date_range(suggestions)})"
|
||||
print(f" Smart-Suggest ROI Report{since_label}")
|
||||
print(sep)
|
||||
|
||||
print()
|
||||
print("Summary")
|
||||
print(f" Suggestions emitted: {stats['total']}")
|
||||
print(f" Unique commands: {stats['unique_cmds']}")
|
||||
|
||||
if not no_sessions:
|
||||
matched = stats["sessions_matched"]
|
||||
total = stats["total"]
|
||||
followed = stats["followed"]
|
||||
print(f" Sessions matched: {matched} / {total} ({pct(matched, total)})")
|
||||
print(f" Followed: {followed} / {matched} ({pct(followed, matched)})")
|
||||
|
||||
# By tier
|
||||
if not no_sessions:
|
||||
print()
|
||||
print(f"{'By Tier':<38} {'followed / total'}")
|
||||
for tier_id in sorted(stats["by_tier"].keys()):
|
||||
t = stats["by_tier"][tier_id]
|
||||
label = TIER_LABELS.get(tier_id, "Custom")
|
||||
rate = pct(t["followed"], t["total"])
|
||||
print(f" {label + ':':34} {rate:<8} {t['followed']:>4} / {t['total']}")
|
||||
|
||||
# Top 10 most suggested
|
||||
by_cmd = stats["by_cmd"]
|
||||
sorted_by_total = sorted(by_cmd.items(), key=lambda x: x[1]["total"], reverse=True)
|
||||
print()
|
||||
print("Top 10 Most Suggested")
|
||||
for cmd, data in sorted_by_total[:10]:
|
||||
rate = f"{pct(data['followed'], data['total'])} followed" if not no_sessions else ""
|
||||
print(f" {data['total']:>4} {cmd:<34} {rate}")
|
||||
|
||||
# Top 10 most followed (only if session data available)
|
||||
if not no_sessions and stats["followed"] > 0:
|
||||
sorted_by_followed = sorted(
|
||||
[(cmd, d) for cmd, d in by_cmd.items() if d["followed"] > 0],
|
||||
key=lambda x: x[1]["followed"],
|
||||
reverse=True,
|
||||
)
|
||||
print()
|
||||
print("Top 10 Most Followed")
|
||||
for cmd, data in sorted_by_followed[:10]:
|
||||
rate = pct(data["followed"], data["total"])
|
||||
print(f" {data['followed']:>4} {cmd:<34} {rate} of {data['total']}")
|
||||
|
||||
# Never followed
|
||||
never = [(cmd, d) for cmd, d in by_cmd.items()
|
||||
if d["followed"] == 0 and d["total"] - d["unmatched"] > 0]
|
||||
if never:
|
||||
print()
|
||||
print("Never Followed (always ignored)")
|
||||
for cmd, data in sorted(never, key=lambda x: x[1]["total"], reverse=True)[:10]:
|
||||
print(f" {cmd:<36} ({data['total']} suggestions)")
|
||||
|
||||
# Daily trend
|
||||
by_day = stats["by_day"]
|
||||
if by_day:
|
||||
print()
|
||||
print("Daily Trend")
|
||||
max_day_total = max(d["total"] for d in by_day.values())
|
||||
for day in sorted(by_day.keys()):
|
||||
d = by_day[day]
|
||||
b = bar(d["total"], max_day_total)
|
||||
followed_str = f" ({d['followed']} followed)" if not no_sessions else ""
|
||||
print(f" {day} {b} {d['total']}{followed_str}")
|
||||
|
||||
print()
|
||||
if not no_sessions:
|
||||
print("Note: \"Followed\" means the suggested command/agent was used later in the")
|
||||
print("same session. Proxy metric — the user may have used it independently of")
|
||||
print("the suggestion, or followed it in a different session.")
|
||||
print()
|
||||
|
||||
if skip_count > 0:
|
||||
print(f" [{skip_count} malformed lines skipped]")
|
||||
|
||||
print(sep)
|
||||
print(f" Log: {log_path}")
|
||||
    if not no_sessions:
        # Count only directories; the unused pathlib alias import was dropped
        project_count = sum(1 for p in projects_dir.glob("*") if p.is_dir())
        print(f" Sessions: {projects_dir} ({project_count} projects)")
|
||||
print(sep)
|
||||
|
||||
|
||||
def _date_range(suggestions: list) -> str:
|
||||
if not suggestions:
|
||||
return "no data"
|
||||
first = datetime.fromtimestamp(suggestions[0]["ts"], tz=timezone.utc)
|
||||
last = datetime.fromtimestamp(suggestions[-1]["ts"], tz=timezone.utc)
|
||||
delta = last - first
|
||||
days = max(1, delta.days + 1)
|
||||
return f"{days} days"
|
||||
|
||||
|
||||
def print_json(stats: dict, suggestions: list, skip_count: int):
|
||||
output = {
|
||||
"summary": {
|
||||
"total": stats["total"],
|
||||
"unique_cmds": stats["unique_cmds"],
|
||||
"sessions_matched": stats["sessions_matched"],
|
||||
"followed": stats["followed"],
|
||||
"follow_rate": round(stats["followed"] / stats["sessions_matched"], 3)
|
||||
if stats["sessions_matched"] > 0 else None,
|
||||
},
|
||||
"by_cmd": {
|
||||
cmd: {
|
||||
"total": d["total"],
|
||||
"followed": d["followed"],
|
||||
"unmatched": d["unmatched"],
|
||||
"follow_rate": round(d["followed"] / (d["total"] - d["unmatched"]), 3)
|
||||
if (d["total"] - d["unmatched"]) > 0 else None,
|
||||
}
|
||||
for cmd, d in stats["by_cmd"].items()
|
||||
},
|
||||
"by_tier": {
|
||||
TIER_LABELS.get(t, "Custom"): {
|
||||
"total": d["total"],
|
||||
"followed": d["followed"],
|
||||
"follow_rate": round(d["followed"] / d["total"], 3) if d["total"] > 0 else None,
|
||||
}
|
||||
for t, d in stats["by_tier"].items()
|
||||
},
|
||||
"by_day": dict(stats["by_day"]),
|
||||
"skip_count": skip_count,
|
||||
}
|
||||
print(json.dumps(output, indent=2))
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Main
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze smart-suggest hook ROI from suggestion and session logs."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--log",
|
||||
type=Path,
|
||||
default=Path.home() / ".claude" / "logs" / "smart-suggest.jsonl",
|
||||
help="Path to smart-suggest.jsonl log (default: ~/.claude/logs/smart-suggest.jsonl)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--projects-dir",
|
||||
type=Path,
|
||||
default=Path.home() / ".claude" / "projects",
|
||||
help="Path to Claude projects directory (default: ~/.claude/projects)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--since",
|
||||
type=str,
|
||||
default=None,
|
||||
help="Filter to last N days/hours/minutes (e.g. 7d, 24h, 30m)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--no-sessions",
|
||||
action="store_true",
|
||||
help="Skip session scanning — show suggestion stats only (fast mode)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--json",
|
||||
action="store_true",
|
||||
help="Output machine-readable JSON",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Resolve since cutoff
|
||||
since_ts = 0.0
|
||||
if args.since:
|
||||
try:
|
||||
since_ts = parse_since(args.since)
|
||||
except ValueError as e:
|
||||
print(f"Error: {e}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
# Phase 1: parse suggestions
|
||||
suggestions, skip_count = parse_suggestions(args.log, since_ts)
|
||||
|
||||
if not suggestions:
|
||||
print(f"No suggestions found in {args.log}", file=sys.stderr)
|
||||
if since_ts > 0:
|
||||
print(f"(filtered to last {args.since})", file=sys.stderr)
|
||||
sys.exit(0)
|
||||
|
||||
# Phase 2: session index + acceptance (unless --no-sessions)
|
||||
if not args.no_sessions:
|
||||
sessions, start_ts_list = build_session_index(args.projects_dir)
|
||||
compute_acceptance(suggestions, sessions, start_ts_list)
|
||||
else:
|
||||
# Mark all as unmatched so stats are computed correctly
|
||||
for s in suggestions:
|
||||
s["followed"] = None
|
||||
|
||||
# Phase 3: stats
|
||||
stats = compute_stats(suggestions)
|
||||
|
||||
# Output
|
||||
if args.json:
|
||||
print_json(stats, suggestions, skip_count)
|
||||
else:
|
||||
print_report(
|
||||
stats, suggestions, skip_count,
|
||||
args.log, args.projects_dir, args.no_sessions, args.since
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -467,6 +467,23 @@ Claude Code's effectiveness degrades predictably under certain conditions:
|
|||
3. **Scope tightly**: Break large tasks into focused sub-tasks
|
||||
4. **Use sub-agents**: Delegate exploration to `Task` tool to preserve main context
|
||||
|
||||
### Failure-Triggered Context Drift
|
||||
|
||||
A separate degradation mode that does not depend on context size: repeated tool failures. When a tool call fails and Claude retries, error output accumulates in the context window. Stack traces, retry noise, and error messages dilute the original intent — subsequent attempts follow the error narrative rather than the task goal. The context window is not full, but the signal-to-noise ratio has degraded.
|
||||
|
||||
This is distinct from compaction drift. Compaction addresses context *size*; failure re-injection addresses context *quality* within a bounded window.
|
||||
|
||||
**Pattern**: re-inject the core task instruction on every command failure, not just after `/compact`. A `PostToolUse` hook can prefix retried prompts with a condensed version of the original task and constraints:
|
||||
|
||||
```bash
|
||||
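# PostToolUse hook: re-inject intent after failures
# Illustrative only: assumes CLAUDE_TOOL_EXIT_CODE and ORIGINAL_TASK_SUMMARY are set by your own wrapper; the hook payload itself arrives as JSON on stdin.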
|
||||
if [[ "$CLAUDE_TOOL_EXIT_CODE" != "0" ]]; then
|
||||
echo "REMINDER: The current task is: $ORIGINAL_TASK_SUMMARY. Ignore the above error if non-blocking and continue toward that goal."
|
||||
fi
|
||||
```
|
||||
|
||||
Source: [Nick Tune — Workflow DSL: Domain-Driven Claude Code Workflows](https://nick-tune.me/blog/2026-03-01-workflow-dsl-domain-driven-claude-code-workflows/) (2026-03-01)
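
The re-injection decision can also be sketched in Python, which makes the logic easier to test than shell. This is a minimal sketch under stated assumptions, not the implementation from the source article: the `tool_response`, `exit_code`, and `error` keys are guesses at what a failure looks like in the hook payload, and `looks_failed` and `build_reminder` are hypothetical helper names.

```python
from typing import Optional


def looks_failed(event: dict) -> bool:
    """Heuristic failure check on a hook payload.

    Assumption: the payload carries a `tool_response` object with either a
    non-zero `exit_code` or a truthy `error` field. Adjust to your real payload.
    """
    response = event.get("tool_response")
    if not isinstance(response, dict):
        return False
    exit_code = response.get("exit_code")
    return bool(response.get("error")) or exit_code not in (None, 0)


def build_reminder(event: dict, task_summary: str) -> Optional[str]:
    """Return a condensed task reminder after a failure, otherwise None."""
    if not task_summary.strip() or not looks_failed(event):
        return None
    return (
        f"REMINDER: The current task is: {task_summary.strip()}. "
        "Ignore the above error if non-blocking and continue toward that goal."
    )
```

A hook script would parse stdin with `json.load`, call `build_reminder`, and print the result only when it is not `None`, so successful tool calls add nothing to the context.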
|
||||
|
||||
---
|
||||
|
||||
## 4. Sub-Agent Architecture
|
||||
|
|
|
|||
|
|
@ -26,6 +26,7 @@ This guide covers everything from the token math behind context budgets to build
|
|||
6. [Context Lifecycle](#6-context-lifecycle)
|
||||
7. [Quality Measurement](#7-quality-measurement)
|
||||
8. [Context Reduction Techniques](#8-context-reduction-techniques)
|
||||
9. [Maturity Assessment](#9-maturity-assessment)
|
||||
|
||||
---
|
||||
|
||||
|
|
@ -67,6 +68,8 @@ LLMs are context-window computers. The quality of output is bounded by the quali
|
|||
|
||||
Teams that invest in context engineering consistently report fewer revision cycles, better adherence to conventions, and more predictable outputs. The investment is front-loaded (building the system), but the returns compound across every interaction.
|
||||
|
||||
A useful diagnostic reframe: **most AI output failures are context failures, not model failures.** When Claude generates a generic response, ignores a convention, or produces code that doesn't match your stack, the model is almost never broken — the context it received was incomplete, contradictory, or missing the right information at the right time. This reframe shifts troubleshooting from "the AI is bad at this" to "what is missing from the context?"
|
||||
|
||||
### The Three Layers
|
||||
|
||||
Context engineering in Claude Code operates across three distinct layers:
|
||||
|
|
@ -81,6 +84,21 @@ Each layer has different tradeoffs. Global config is always-on but cannot refere
|
|||
|
||||
Good context engineering means putting each piece of information in the right layer — not cramming everything into one file, and not leaving critical knowledge in the session layer where it evaporates after every conversation.
|
||||
|
||||
### Static vs. Dynamic Context
|
||||
|
||||
The three-layer system above is *static context* — configuration files that are assembled before a session begins and remain stable throughout. Claude Code is primarily a static context system, which is why CLAUDE.md structure and path-scoping matter so much.
|
||||
|
||||
As you move toward agent workflows, a second category appears: *dynamic context*, assembled at inference time as the agent operates.
|
||||
|
||||
| Type | How assembled | Examples in Claude Code |
|
||||
|------|--------------|-------------------------|
|
||||
| **Static** | Before session, from files | CLAUDE.md, path-scoped modules, skills |
|
||||
| **Dynamic** | At runtime, from tools | Tool outputs, file reads, web fetches, MCP data |
|
||||
|
||||
In practice, every Claude Code session uses both. The static context (your configuration) sets the behavioral envelope; the dynamic context (files Claude reads, tool results it processes) provides the specific information for each task. Context engineering covers both, but the failure modes differ: static context problems manifest as consistent convention violations; dynamic context problems manifest as Claude acting on stale or incomplete information mid-task.
|
||||
|
||||
For teams building automated pipelines and agents, Anthropic's September 2025 engineering post ["Effective context engineering for AI agents"](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) covers the dynamic side in depth.
|
||||
|
||||
---
|
||||
|
||||
## 2. The Context Budget
|
||||
|
|
@ -1174,6 +1192,50 @@ The highest-leverage sequence for a project with context debt:
|
|||
|
||||
---
|
||||
|
||||
## 9. Maturity Assessment
|
||||
|
||||
Context engineering capability develops in stages. Most teams reach Level 2 and stop — not because higher levels are complex, but because the failures at Level 2 are invisible. Output quality is acceptable, so the pressure to go further never appears. This assessment makes the gap visible.
|
||||
|
||||
### The Six Levels
|
||||
|
||||
| Level | Name | What exists | Failure mode |
|
||||
|-------|------|-------------|--------------|
|
||||
| **0** | No configuration | LLM with no CLAUDE.md | Generic outputs, zero project awareness |
|
||||
| **1** | Flat config | Single CLAUDE.md, no structure | Rules pile up, adherence degrades after ~100 lines |
|
||||
| **2** | Structured config | Sections, clear organization, global/project separation | Works solo, breaks at team scale |
|
||||
| **3** | Modular config | Path-scoped modules, deliberate layering | Rules maintained but no verification |
|
||||
| **4** | Measured config | Canary tests, adherence tracking, lifecycle management | System works but drifts silently over time |
|
||||
| **5** | Engineered system | Profiles, CI drift detection, ACE pipeline, quarterly audit rhythm | — |
|
||||
|
||||
### Self-Assessment
|
||||
|
||||
Answer each question in order. Stop at the first "No": the lower number in that question's label is your current level. If every answer is "Yes", you are at Level 5.
|
||||
|
||||
**Level 0 → 1**: Do you have a CLAUDE.md file in your project?
|
||||
|
||||
**Level 1 → 2**: Does your configuration distinguish between global conventions (in `~/.claude/CLAUDE.md`) and project-specific rules (in `./CLAUDE.md`)? Are sections clearly separated?
|
||||
|
||||
**Level 2 → 3**: Are subsystem-specific rules in path-scoped modules rather than the root CLAUDE.md? Does your root CLAUDE.md stay under 150 lines?
|
||||
|
||||
**Level 3 → 4**: Do you have canary checks that verify key conventions? Do you track violation rates for your most important rules? Do you run a context audit after major milestones?
|
||||
|
||||
**Level 4 → 5**: Do team members assemble their CLAUDE.md from profiles rather than editing it directly? Is there CI drift detection that alerts when configuration diverges from source modules? Do you run session retrospectives to feed new patterns back into configuration?
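
The stop-at-first-"No" rule above can be written as a few lines of Python. This is a small sketch for illustration; `maturity_level` is a hypothetical helper, and it assumes the answers are supplied in the same order as the five questions.

```python
from typing import Sequence


def maturity_level(answers: Sequence[bool]) -> int:
    """Map ordered yes/no answers to a maturity level from 0 to 5.

    answers[i] is the answer to the "Level i -> i+1" question.
    Counting stops at the first "No"; unanswered questions count as "No".
    """
    level = 0
    for passed in answers[:5]:
        if not passed:
            break
        level += 1
    return level
```

For example, `maturity_level([True, True, False, True, True])` evaluates to 2: the later "Yes" answers do not count, because the path-scoped module question already failed.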
|
||||
|
||||
### What to Do at Each Level
|
||||
|
||||
| Your level | Next action |
|
||||
|------------|-------------|
|
||||
| 0 | Create a minimal CLAUDE.md with 5-10 rules. See §3 for what belongs there. |
|
||||
| 1 | Split global and project config. Move cross-project preferences to `~/.claude/CLAUDE.md`. |
|
||||
| 2 | Identify the 2-3 highest-traffic subsystems. Create path-scoped modules for them. |
|
||||
| 3 | Write 3-5 canary prompts for your most violated rules. Automate them. |
|
||||
| 4 | Introduce profiles for team members. Add CI drift detection. Start session retrospectives. |
|
||||
| 5 | Maintain quarterly audits. The system is built — the work is ongoing calibration. |
|
||||
|
||||
Most teams move from Level 0 to Level 2 in a single afternoon. Moving from Level 3 to Level 4 requires a measurement habit, not more configuration. The bottleneck at the higher levels is not knowledge — it is the discipline to treat configuration as a living system rather than a one-time setup.
|
||||
|
||||
---
|
||||
|
||||
## Cross-References
|
||||
|
||||
- Architecture and project structure patterns: `guide/core/architecture.md`
|
||||
|
|
|
|||
|
|
@ -2018,7 +2018,7 @@ The default model depends on your subscription: **Max/Team Premium** subscribers
|
|||
| **Sonnet 4.6** | $3.00 | $15.00 | 200K tokens | Default model (Feb 2026) |
|
||||
| Sonnet 4.5 | $3.00 | $15.00 | 200K tokens | Legacy (same price) |
|
||||
| Opus 4.6 (standard) | $5.00 | $25.00 | 200K tokens | Released Feb 2026 |
|
||||
-| Opus 4.6 (1M context beta) | $10.00 | $37.50 | 1M tokens | Requests >200K context |
+| Opus 4.6 (1M context) | $5.00 | $25.00 | 1M tokens | GA for Max/Team/Enterprise; API requires tier 4 |
|
||||
| Opus 4.6 (fast mode) | $30.00 | $150.00 | 200K tokens | 2.5x faster, 6x price |
|
||||
| Haiku 4.5 | $0.80 | $4.00 | 200K tokens | Budget option |
|
||||
|
||||
|
|
@ -2028,7 +2028,7 @@ The default model depends on your subscription: **Max/Team Premium** subscribers
|
|||
|
||||
#### 200K vs 1M Context: Performance, Cost & Use Cases
|
||||
|
||||
-The 1M context window (beta, API + usage tier 4 required) is a significant capability jump — but community feedback consistently frames it as a **niche premium tool**, not a default.
+The 1M context window (GA for Max/Team/Enterprise plans; API tier 4 still required for direct API use) is a significant capability jump — but community feedback consistently frames it as a **niche premium tool**, not a default.
|
||||
|
||||
**Retrieval accuracy at scale (MRCR v2 8-needle 1M variant)**
|
||||
|
||||
|
|
@ -2042,13 +2042,13 @@ The benchmark is the "8-needle 1M variant" — finding 8 specific facts in a 1M-
|
|||
|
||||
**Cost per session (approximate)**
|
||||
|
||||
-Above 200K input tokens, **all tokens** in the request are charged at premium rates — not just the excess. Applies to both Sonnet 4.6 and Opus 4.6.
+Above 200K input tokens on direct API, **all tokens** in the request are charged at premium rates — not just the excess. Note: on Max/Team/Enterprise Claude Code plans, Opus 4.6 1M is the default at standard rates (no premium) as of v2.1.75 (March 2026).
|
||||
|
||||
| Session type | ~Tokens in | ~Tokens out | Sonnet 4.6 | Opus 4.6 |
|
||||
|---|---|---|---|---|
|
||||
| Bug fix / PR review (≤200K) | 50K | 5K | ~$0.23 | ~$0.38 |
|
||||
| Module refactoring (≤200K) | 150K | 20K | ~$0.75 | ~$1.25 |
|
||||
-| Full service analysis (>200K, 1M beta) | 500K | 50K | ~$4.13 | ~$6.88 |
+| Full service analysis (>200K, 1M context) | 500K | 50K | ~$4.13 | ~$6.88 |
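
The table values follow from simple per-token arithmetic. The sketch below reproduces them, assuming the direct-API rates quoted in this section (standard rates up to 200K input tokens, premium rates on the whole request above that). The `RATES` dictionary, its model keys, and `session_cost` are illustrative names, not an official price list or API.

```python
PREMIUM_THRESHOLD = 200_000  # input tokens; above this, the whole request is premium

# (input $/MTok, output $/MTok), taken from the figures quoted in this section.
RATES = {
    "sonnet-4.6": {"standard": (3.00, 15.00), "premium": (6.00, 22.50)},
    "opus-4.6": {"standard": (5.00, 25.00), "premium": (10.00, 37.50)},
}


def session_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Approximate direct-API cost in USD for one session."""
    tier = "premium" if tokens_in > PREMIUM_THRESHOLD else "standard"
    rate_in, rate_out = RATES[model][tier]
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000
```

With 500K tokens in and 50K out, the Opus figure is (500,000 × 10 + 50,000 × 37.50) / 1,000,000 = 6.875, which the table rounds to ~$6.88. On plans where the 1M window is billed at standard rates, the premium tier would not apply.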
|
||||
|
||||
For comparison: Gemini 1.5 Pro offers a 2M context window at $3.50/$10.50/MTok — significantly cheaper for pure long-context RAG. Community advice: use Gemini for large-document RAG, Claude for reasoning quality and agentic workflows.
|
||||
|
||||
|
|
@ -2065,8 +2065,8 @@ For comparison: Gemini 1.5 Pro offers a 2M context window at $3.50/$10.50/MTok
|
|||
**Key facts**
|
||||
- Opus 4.6 max output: **128K tokens**; Sonnet 4.6 max output: **64K tokens**
|
||||
- 1M context ≈ 30,000 lines of code / 750,000 words
|
||||
-- 1M context is **beta** — requires `anthropic-beta: context-1m-2025-08-07` header, usage tier 4 or custom rate limits
-- Above 200K input tokens: Sonnet 4.6 doubles to $6/$22.50/MTok; Opus 4.6 doubles to $10/$37.50/MTok
+- 1M context is **GA for Max/Team/Enterprise Claude Code plans** (v2.1.75, March 2026) — API direct use still requires tier 4 or custom rate limits
+- API direct use above 200K input tokens: Sonnet 4.6 doubles to $6/$22.50/MTok; Opus 4.6 doubles to $10/$37.50/MTok (standard rate applies for Claude Code Max/Team/Enterprise plans)
|
||||
- If input stays ≤200K, standard pricing applies even with the beta flag enabled
|
||||
- **Practical workaround**: check context at ~70% and open a new session rather than hitting compaction ([HN pattern](https://news.ycombinator.com/item?id=46902427))
|
||||
- Community consensus: 200K + RAG is the default; 1M Opus is reserved for cases where loading everything at once is genuinely necessary
|
||||
|
|
@ -10170,6 +10170,85 @@ echo '{"session_id":"test","cwd":"'$(pwd)'"}' | .claude/hooks/session-summary.sh
|
|||
|
||||
---
|
||||
|
||||
### Identity Re-injection After Compaction
|
||||
|
||||
**The problem**: When Claude compacts context during a long session, agents configured with a specific role — team lead, developer, reviewer — can "forget" their identity. The compacted transcript no longer contains the original system instructions, so the next response drops the role entirely and starts behaving generically.
|
||||
|
||||
This is most visible in agent teams with explicit identity prefixes. A developer agent that was consistently marking messages with `🔨 DEVELOPER:` suddenly stops after compaction and starts responding as a generic assistant.
|
||||
|
||||
**The pattern**: Store the agent's identity in a file (`.claude/agent-identity.txt`). After each user message, a `UserPromptSubmit` hook checks whether the last assistant response includes the expected identity marker. If not — which happens after compaction — it injects the identity file contents as `additionalContext`. The next response re-establishes the role without human intervention.
|
||||
|
||||
```text
|
||||
# .claude/agent-identity.txt
|
||||
# Your agent's identity instructions — anything that should survive compaction
|
||||
|
||||
You are the feature team lead. You coordinate the team — you do not write code
|
||||
and you do not review code.
|
||||
|
||||
Prefix every message with the current state:
|
||||
SPAWN / PLANNING / DEVELOPING / REVIEWING / COMMITTING / COMPLETE
|
||||
```
|
||||
|
||||
```bash
|
||||
# .claude/hooks/identity-reinjection.sh
|
||||
# UserPromptSubmit hook — re-injects identity after compaction
|
||||
|
||||
IDENTITY_FILE="${CLAUDE_IDENTITY_FILE:-.claude/agent-identity.txt}"
|
||||
IDENTITY_MARKER="${CLAUDE_IDENTITY_MARKER:-}"
|
||||
|
||||
[[ ! -f "$IDENTITY_FILE" ]] && exit 0
|
||||
|
||||
IDENTITY=$(cat "$IDENTITY_FILE")
|
||||
[[ -z "$IDENTITY" ]] && exit 0
|
||||
|
||||
# Default marker: first non-empty, non-comment line of the identity file (max 40 bytes)
[[ -z "$IDENTITY_MARKER" ]] && IDENTITY_MARKER=$(grep -m1 -Ev '^(#|[[:space:]]*$)' "$IDENTITY_FILE" | head -c 40)
|
||||
|
||||
INPUT=$(cat)  # the hook payload arrives as JSON on stdin
TRANSCRIPT_PATH=$(echo "$INPUT" | jq -r '.transcript_path // empty')
|
||||
[[ -z "$TRANSCRIPT_PATH" || ! -f "$TRANSCRIPT_PATH" ]] && exit 0
|
||||
|
||||
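# The transcript is JSONL, so slurp (-s) it into a single array first.
# Assumption: entries keep role/content either at the top level or under .message.
LAST_ASSISTANT=$(jq -rs '
  [.[] | select((.message.role? // .role?) == "assistant")] | last |
  (.message.content? // .content?) |
  if type == "array" then map(select(.type == "text") | .text) | join("") else . end
' "$TRANSCRIPT_PATH" 2>/dev/null)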
|
||||
|
||||
# Identity intact: no action
|
||||
echo "$LAST_ASSISTANT" | grep -qF "$IDENTITY_MARKER" && exit 0
|
||||
|
||||
# Identity missing: re-inject
|
||||
jq -n --arg ctx "[Identity reminder]"$'\n\n'"$IDENTITY" '{"additionalContext": $ctx}'
|
||||
exit 0
|
||||
```
|
||||
|
||||
**Configuration** (`settings.json`):
|
||||
|
||||
```json
|
||||
{
|
||||
"hooks": {
|
||||
"UserPromptSubmit": [{
|
||||
"hooks": [{
|
||||
"type": "command",
|
||||
"command": ".claude/hooks/identity-reinjection.sh"
|
||||
}]
|
||||
}]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**How it behaves**:
|
||||
- Minimal overhead when the identity marker is present: the hook reads the transcript, finds the marker, and exits without injecting anything
|
||||
- Silent no-op when no `.claude/agent-identity.txt` file exists
|
||||
- Triggers automatically after compaction — no manual intervention needed
|
||||
- Works in both solo sessions (long-running agents) and agent team configurations
|
||||
|
||||
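**Customization**: Set `CLAUDE_IDENTITY_MARKER` in your environment to a short, distinctive string from the agent's standard output (e.g. `"LEAD:"`, `"DEVELOPER:"`, `"🔨"`). If it is not set, the hook falls back to the first 40 characters of the first non-empty, non-comment line of the identity file. That fallback only matches when the agent echoes that line, so an explicit marker is usually the more reliable choice.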
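
The marker check at the heart of the hook can be sketched in Python for unit testing. This is a simplified sketch, not the shipped script: `default_marker` and `needs_reinjection` are hypothetical names, and the 40-character cut-off counts characters here, whereas `head -c 40` in the shell version counts bytes.

```python
def default_marker(identity_text: str, max_chars: int = 40) -> str:
    """Return the first non-empty, non-comment line, truncated to max_chars."""
    for line in identity_text.splitlines():
        if line.strip() and not line.startswith("#"):
            return line[:max_chars]
    return ""


def needs_reinjection(last_assistant_text: str, identity_text: str, marker: str = "") -> bool:
    """True when the identity marker is missing from the last assistant message."""
    marker = marker or default_marker(identity_text)
    if not identity_text.strip() or not marker:
        return False  # nothing to inject, or nothing to look for
    return marker not in last_assistant_text
```

Calling `needs_reinjection("Sure, here is the code.", identity, marker="🔨")` returns `True`, which corresponds to the post-compaction case where the agent has dropped its prefix.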
|
||||
|
||||
> **Full implementation**: [`examples/hooks/bash/identity-reinjection.sh`](../../examples/hooks/bash/identity-reinjection.sh)
|
||||
>
|
||||
> **Origin**: Pattern sourced from Nick Tune's [hook-driven dev workflows](https://nick-tune.me/blog/2026-02-28-hook-driven-dev-workflows-with-claude-code/) (2026-02-28). The broader article covers state machine workflows with agent teams — see [Agent Teams Workflow](../workflows/agent-teams.md) for context.
|
||||
|
||||
---
|
||||
|
||||
# 8. MCP Servers
|
||||
|
||||
_Quick jump:_ [What is MCP](#81-what-is-mcp) · [Available Servers](#82-available-servers) · [Configuration](#83-configuration) · [Server Selection Guide](#84-server-selection-guide) · [Plugin System](#85-plugin-system) · [MCP Security](#86-mcp-security)
|
||||
|
|
@ -13391,6 +13470,8 @@ For critical work, combine everything:
|
|||
|
||||
> **📖 Complete Workflow Guide**: See [GitHub Actions Workflows](./workflows/github-actions.md) for 5 production-ready patterns using the official `anthropics/claude-code-action` (PR review, triage, security, scheduled maintenance).
|
||||
|
||||
> **Code Review (Teams/Enterprise)**: For automated PR review without manual prompting, see [Code Review](./workflows/code-review.md) — Anthropic's multi-agent review feature that posts inline GitHub comments on every PR.
|
||||
|
||||
### Headless Mode
|
||||
|
||||
Run Claude Code without interactive prompts:
|
||||
|
|
|
|||
158
guide/workflows/code-review.md
Normal file
|
|
@ -0,0 +1,158 @@
|
|||
---
|
||||
title: "Code Review (Claude Code feature)"
|
||||
description: "Automated multi-agent PR review for Teams and Enterprise — setup, triggers, REVIEW.md configuration, and cost management"
|
||||
tags: [feature, teams, enterprise, github, code-review]
|
||||
---
|
||||
|
||||
# Code Review
|
||||
|
||||
> **Availability**: Research preview — Teams and Enterprise plans only. Not available on Free/Pro accounts, nor for organizations with Zero Data Retention (ZDR) enabled.
|
||||
> **Launched**: March 9, 2026
|
||||
|
||||
Claude Code's Code Review feature runs a multi-agent review on every GitHub pull request. A fleet of specialized agents examines the diff in the context of the full codebase, each looking for a different class of issue (logic errors, security vulnerabilities, edge cases, regressions), followed by a verification pass that filters false positives.
|
||||
|
||||
Findings are posted as inline PR comments on the specific lines where issues were found, tagged by severity. Reviews don't approve or block PRs, so existing review workflows stay intact.
|
||||
|
||||
---
|
||||
|
||||
## How it works
|
||||
|
||||
1. Trigger fires (PR opened, push, or manual `@claude review` comment)
|
||||
2. Multiple agents analyze the diff and surrounding code in parallel on Anthropic infrastructure
|
||||
3. Each agent targets a different class of issue
|
||||
4. A verification step checks candidates against actual code behavior to remove false positives
|
||||
5. Results are deduplicated, ranked by severity, and posted as inline PR comments
|
||||
6. If no issues are found, Claude posts a short confirmation comment
|
||||
|
||||
Reviews complete in **20 minutes on average**, scaling with PR size and complexity.
|
||||
|
||||
### Severity levels
|
||||
|
||||
| Marker | Severity | Meaning |
|
||||
|:-------|:---------|:--------|
|
||||
| 🔴 | Normal | A bug that should be fixed before merging |
|
||||
| 🟡 | Nit | Minor issue, worth fixing but not blocking |
|
||||
| 🟣 | Pre-existing | A bug in the codebase not introduced by this PR |
|
||||
|
||||
Each finding includes a collapsible extended reasoning section explaining why Claude flagged the issue and how it verified the problem.
|
||||
|
||||
---
|
||||
|
||||
## Setup
|
||||
|
||||
An admin enables Code Review once for the organization and selects which repositories to include.
|
||||
|
||||
### 1. Open admin settings
|
||||
|
||||
Go to [claude.ai/admin-settings/claude-code](https://claude.ai/admin-settings/claude-code) and find the **Code Review** section. This requires both admin access to your Claude organization and permission to install GitHub Apps in your GitHub organization.
|
||||
|
||||
### 2. Click Setup
|
||||
|
||||
This begins the GitHub App installation flow.
|
||||
|
||||
### 3. Install the Claude GitHub App
|
||||
|
||||
Follow the prompts to install the Claude GitHub App on your GitHub organization. The app requests:
|
||||
|
||||
- **Contents**: read and write
|
||||
- **Issues**: read and write
|
||||
- **Pull requests**: read and write
|
||||
|
||||
Code Review uses read access to contents and write access to pull requests. This permission set also supports [GitHub Actions](/en/github-actions) if you enable that later.
|
||||
|
||||
### 4. Select repositories
|
||||
|
||||
Choose which repositories to enable. If a repo is missing, ensure you granted the GitHub App access during installation. You can add more repositories later from the admin settings table.
|
||||
|
||||
### 5. Set review triggers per repo
|
||||
|
||||
For each repository, choose when reviews run:
|
||||
|
||||
| Trigger | When it runs | Cost profile |
|
||||
|---------|-------------|--------------|
|
||||
| **Once after PR creation** | Once when PR opens or is marked ready | Lowest |
|
||||
| **After every push** | On every push to the PR branch | Highest (multiplied by push count) |
|
||||
| **Manual** | Only when someone comments `@claude review` | Controlled |
|
||||
|
||||
After the `@claude review` comment, subsequent pushes to that PR trigger reviews automatically regardless of the configured trigger.
|
||||
|
||||
**Manual mode** is useful for high-traffic repos where you want to opt specific PRs into review, or only start reviewing when the PR is ready for review.
|
||||
|
||||
---
|
||||
|
||||
## Manual trigger
|
||||
|
||||
Comment `@claude review` on any open, non-draft PR to start a review immediately. Requirements:
|
||||
|
||||
- Top-level PR comment (not an inline diff comment)
|
||||
- `@claude review` at the start of the comment
|
||||
- Owner, member, or collaborator access on the repository
|
||||
|
||||
If a review is already running, the request queues until the in-progress review completes.
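
The requirements above amount to a simple predicate. The sketch below only illustrates those documented rules and is not Anthropic's implementation: `is_manual_review_request` is a hypothetical helper, and the role names assume GitHub's `author_association` values (`OWNER`, `MEMBER`, `COLLABORATOR`).

```python
ALLOWED_ASSOCIATIONS = {"OWNER", "MEMBER", "COLLABORATOR"}
TRIGGER_PHRASE = "@claude review"


def is_manual_review_request(
    body: str,
    author_association: str,
    is_top_level: bool,
    is_draft: bool,
    is_open: bool = True,
) -> bool:
    """Check whether a PR comment satisfies the documented manual-trigger rules."""
    return (
        is_open
        and not is_draft
        and is_top_level  # inline diff comments do not count
        and author_association.upper() in ALLOWED_ASSOCIATIONS
        and body.startswith(TRIGGER_PHRASE)  # phrase must open the comment
    )
```

A check like this can be useful in a pre-flight script that explains to contributors why their `@claude review` comment did not start a review.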
|
||||
|
||||
---
|
||||
|
||||
## Configure reviews
|
||||
|
||||
Two files control what Claude flags. Both are additive on top of the default correctness checks.
|
||||
|
||||
### CLAUDE.md
|
||||
|
||||
Claude reads all `CLAUDE.md` files in your directory hierarchy. Newly introduced violations are flagged as nit-level findings. The check is bidirectional: if a PR makes a `CLAUDE.md` statement outdated, Claude flags that the docs need updating too.
|
||||
|
||||
Use `CLAUDE.md` for guidance that also applies to interactive Claude Code sessions.
|
||||
|
||||
### REVIEW.md
|
||||
|
||||
Add `REVIEW.md` to your **repository root** for review-only rules. Auto-discovered, no configuration needed.
|
||||
|
||||
```markdown
|
||||
# Code Review Guidelines
|
||||
|
||||
## Always check
|
||||
- New API endpoints have corresponding integration tests
|
||||
- Database migrations are backward-compatible
|
||||
- Error messages don't leak internal details to users
|
||||
|
||||
## Style
|
||||
- Prefer early returns over nested conditionals
|
||||
- Use structured logging, not f-string interpolation in log calls
|
||||
|
||||
## Skip
|
||||
- Generated files under `src/gen/`
|
||||
- Formatting-only changes in `*.lock` files
|
||||
- Migration files in `db/migrations/`
|
||||
```
|
||||
|
||||
Use `REVIEW.md` for rules that would clutter `CLAUDE.md` for general sessions (linter conventions, skip lists, team-specific patterns).
|
||||
|
||||
---
|
||||
|
||||
## Pricing
|
||||
|
||||
Code Review is billed on token usage, **separately from your plan's included usage** (via [extra usage](https://support.claude.com/en/articles/12429409-extra-usage-for-paid-claude-plans)).
|
||||
|
||||
- Average cost: **$15–25 per review**, scaling with PR size, codebase complexity, and the number of issues requiring verification
|
||||
- "After every push" multiplies cost by push count
|
||||
- To set a monthly spend cap: [claude.ai/admin-settings/usage](https://claude.ai/admin-settings/usage) → configure limit for the "Claude Code Review" service
|
||||
- Monitor spend: [claude.ai/analytics/code-review](https://claude.ai/analytics/code-review) (daily PR count, weekly spend, per-repo breakdown)
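
To size a spend cap, the figures above can be turned into a rough monthly range. This is a back-of-the-envelope sketch: `monthly_review_cost` is a hypothetical helper, the $15 to $25 default is the average range quoted above, and real costs vary with PR size and codebase complexity.

```python
from typing import Tuple


def monthly_review_cost(
    prs_per_month: int,
    reviews_per_pr: float = 1.0,
    cost_per_review: Tuple[float, float] = (15.0, 25.0),
) -> Tuple[float, float]:
    """Return a (low, high) monthly estimate in USD.

    reviews_per_pr is 1 for "Once after PR creation"; for "After every push"
    it is roughly the average number of pushes per PR.
    """
    low, high = cost_per_review
    reviews = prs_per_month * reviews_per_pr
    return (reviews * low, reviews * high)
```

For 40 PRs a month, one review per PR gives a range of $600 to $1,000. With an average of three pushes per PR under "After every push", the same repository lands at $1,800 to $3,000.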
|
||||
|
||||
---
|
||||
|
||||
## Cross-reference
|
||||
|
||||
For manual code review workflows (CLI, no Teams/Enterprise required):
|
||||
- [Multi-agent code review workflow](../ultimate-guide.md#multi-agent-code-review) — DIY agent teams via CLI
|
||||
- [GitHub Actions integration](./github-actions.md) — custom CI/CD automation (self-hosted alternative to this managed service)
|
||||
- [GitLab CI/CD](/en/gitlab-ci-cd) — self-hosted Claude integration for GitLab pipelines
|
||||
- Code Review plugin — on-demand local reviews before pushing (available in the plugin marketplace)
|
||||
|
||||
---
|
||||
|
||||
## Known limitations (research preview)
|
||||
|
||||
- Teams and Enterprise only — no Free/Pro access
|
||||
- Not available for organizations with Zero Data Retention (ZDR) enabled
|
||||
- GitHub only for the managed service (GitLab supported via CI/CD integration, not this feature)
|
||||
- Full-repo indexing latency on first activation for large repos
|
||||
- Anthropic internal stats: ~7.5 issues found per PR >1000 lines, <1% false positive rate — self-reported, not independently verified
|
||||
|
|
@ -113,6 +113,9 @@ deep_dive:
|
|||
changelog_fragments_ci: "guide/workflows/changelog-fragments.md:169" # Independent CI migration check job
|
||||
# Smart-Suggest Hook
|
||||
smart_suggest_hook: "examples/hooks/bash/smart-suggest.sh" # UserPromptSubmit behavioral coach, 3-tier priority, ROI logging
|
||||
# Identity Re-injection After Compaction (Nick Tune pattern, Feb 2026)
|
||||
identity_reinjection_hook: "guide/ultimate-guide.md:10171" # UserPromptSubmit guard: re-injects agent identity lost after context compaction
|
||||
identity_reinjection_example: "examples/hooks/bash/identity-reinjection.sh" # Checks last assistant message for identity marker, injects additionalContext if missing
|
||||
# Template Installation
|
||||
install_templates_script: "scripts/install-templates.sh"
|
||||
# Session management
|
||||
|
|
|
|||