claude-code-ultimate-guide/examples/plugins/se-cove.md
Florian BRUNIAUX ee5791668a docs: add SE-CoVe plugin example + resource evaluation workflow (v3.11.6)
2026-01-24 17:40:54 +01:00


| plugin | marketplace | version | license | research |
|---|---|---|---|---|
| chain-of-verification | vertti/se-cove-claude-plugin | 1.1.1 | MIT | arXiv:2309.11495 (ACL 2024 Findings) |

SE-CoVe: Chain-of-Verification

Software Engineering adaptation of Meta's Chain-of-Verification methodology for Claude Code.

Research Foundation

Paper: "Chain-of-Verification Reduces Hallucination in Large Language Models"
Authors: Dhuliawala et al. (Meta AI)
Published: ACL 2024 Findings
Sources: arXiv:2309.11495 | ACL Anthology

How It Works

5-stage pipeline ensuring independent verification:

  1. Baseline: Generate initial solution
  2. Planner: Create verification questions from solution claims
  3. Executor: Answer questions independently (never sees baseline)
  4. Synthesizer: Compare findings, identify discrepancies
  5. Output: Produce verified solution

Critical innovation: The Executor (the verifier) answers its questions without access to the draft code, so it cannot simply confirm the baseline's claims. This is what prevents confirmation bias.
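
The five stages above can be sketched in a few lines of Python. This is a minimal illustration of the control flow only, not the plugin's actual implementation: the `llm` callable, the prompt wording, and the `VerificationResult` type are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class VerificationResult:
    baseline: str
    questions: List[str]
    answers: List[str]
    final: str


def chain_of_verification(task: str, llm: LLM) -> VerificationResult:
    # 1. Baseline: draft an initial solution.
    baseline = llm(f"Solve the task:\n{task}")

    # 2. Planner: turn the draft's claims into verification questions.
    plan = llm(f"List one verification question per line for:\n{baseline}")
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Executor: each prompt contains only the question, never the baseline.
    answers = [llm(f"Answer independently:\n{q}") for q in questions]

    # 4. Synthesizer: compare the draft against the independent findings.
    findings = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    report = llm(
        "Compare the draft with the findings and list discrepancies.\n"
        f"Draft:\n{baseline}\nFindings:\n{findings}"
    )

    # 5. Output: produce a solution consistent with the findings.
    final = llm(
        "Rewrite the draft so it is consistent with the findings.\n"
        f"Task:\n{task}\nDraft:\n{baseline}\nDiscrepancies:\n{report}"
    )
    return VerificationResult(baseline, questions, answers, final)
```

The point of the sketch is stage 3: the baseline text is deliberately kept out of the Executor prompts, so the draft cannot leak into its own verification.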

Performance Metrics

Results from Meta's research paper (Llama 65B model):

| Task Type | Metric | Improvement (relative) | Cost / Trade-off |
|---|---|---|---|
| Biography generation | FACTSCORE | +28% (55.9→71.4) | -26% output volume (16.6→12.3 facts) |
| Closed-book QA | F1 Score | +23% (0.39→0.48) | ~2x token consumption |
| List-based questions | Precision | +112% (0.17→0.36) | Fewer total answers |

Source: Dhuliawala et al., ACL 2024 Findings (Table 1, Section 4.3)

Key insight: Higher accuracy comes at the cost of increased computation and reduced output volume.
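
The percentages in the table are relative changes, not percentage-point differences. The short check below recomputes them from the before/after values quoted above; the function name and row labels are chosen for this example.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change going from `before` to `after`."""
    return (after - before) / before * 100


# Before/after pairs as quoted in the table above.
rows = {
    "FACTSCORE": (55.9, 71.4),
    "Output volume (facts)": (16.6, 12.3),
    "Closed-book QA F1": (0.39, 0.48),
    "List precision": (0.17, 0.36),
}

for name, (before, after) in rows.items():
    print(f"{name}: {relative_change(before, after):+.0f}%")
```

For example, FACTSCORE rises by 15.5 points (55.9→71.4), which is a 28% gain relative to the 55.9 baseline.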

When to Use

Use it for:

  • Critical code review: Architectural decisions, security-sensitive code
  • Complex debugging: Multi-component failure analysis
  • API/library integration: When correctness > speed
  • Acceptable 2x cost: Token budget allows for quality premium

Avoid it for:

  • Trivial changes: Simple fixes, formatting, typos
  • Exploratory coding: Rapid prototyping, experimentation
  • Tight token budgets: When cost is primary constraint
  • Need comprehensive output: When you need all facts, not just accurate subset
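
The token-budget criteria in the list above can be turned into a rough pre-check. This is a back-of-the-envelope sketch, assuming the ~2x multiplier cited in this document holds for your workload; the function and parameter names are hypothetical and not part of the plugin.

```python
def verification_fits_budget(
    baseline_tokens: int,
    token_budget: int,
    cost_multiplier: float = 2.0,  # assumption: the ~2x estimate from this document
) -> bool:
    """Return True if a verified run is expected to stay within the budget."""
    return baseline_tokens * cost_multiplier <= token_budget


# A 4,000-token baseline is estimated at 8,000 tokens with verification.
print(verification_fits_budget(4_000, 10_000))
print(verification_fits_budget(6_000, 10_000))
```

Treat the multiplier as a tunable estimate and measure actual usage on a few representative tasks before relying on it.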

Installation

# Add plugin marketplace
/plugin marketplace add vertti/se-cove-claude-plugin

# Install plugin (in separate command)
/plugin install chain-of-verification

Note: Commands must be pasted separately (Claude Code marketplace limitation).

Usage

# Invoke verification
/chain-of-verification:verify <your question>

# Autocomplete available
/ver<Tab>

Limitations

From the research paper (Section 6):

  1. Not a silver bullet: Reduces hallucinations but does not eliminate them
  2. Computational cost: ~2x token usage vs baseline generation (estimated from implementation)
  3. Output volume trade-off: Generates fewer but more accurate results
  4. Model-specific: Tested on Llama 65B; generalization to GPT-4 or Claude models (e.g. Sonnet) is unverified
  5. Task dependency: Relative improvement varies significantly by task type (+23% to +112%)
  6. Factual hallucinations only: Does not address incorrect reasoning steps or opinions

Source Code