docs: add SE-CoVe plugin example + resource evaluation workflow (v3.11.6)

- First plugin example: SE-CoVe (Chain-of-Verification, Meta AI ACL 2024)
- Academic approach: cite paper metrics, not marketing claims
- Performance table: +23-112% accuracy (task-dependent, trade-offs disclosed)
- Resource evaluation template established (Perplexity fact-check workflow)
- Curation policy: Academic validation + Claims verified + Costs transparent
- Templates count: 82 → 83
- Architecture diagram added (visual overview of Claude Code internals)

Files:
- examples/plugins/se-cove.md (new plugin documentation)
- claudedocs/resource-evaluations/2026-01-24-se-cove-plugin.md (evaluation report)
- README.md, CHANGELOG.md, VERSION, reference.yaml (version bump 3.11.5 → 3.11.6)
- guide/architecture.md + image (visual overview)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
parent f7e1254b06
commit ee5791668a
9 changed files with 342 additions and 20 deletions
106
examples/plugins/se-cove.md
Normal file
@ -0,0 +1,106 @@
---
plugin: chain-of-verification
marketplace: vertti/se-cove-claude-plugin
version: 1.1.1
license: MIT
research: arXiv:2309.11495 (ACL 2024 Findings)
---

# SE-CoVe: Chain-of-Verification

A software engineering adaptation of Meta AI's Chain-of-Verification (CoVe) methodology for Claude Code.

## Research Foundation

- **Paper**: "Chain-of-Verification Reduces Hallucination in Large Language Models"
- **Authors**: Dhuliawala et al. (Meta AI)
- **Published**: ACL 2024 Findings
- **Sources**: [arXiv:2309.11495](https://arxiv.org/abs/2309.11495) | [ACL Anthology](https://aclanthology.org/2024.findings-acl.212/)

## How It Works

A five-stage pipeline keeps verification independent of the draft:

1. **Baseline**: Generate an initial solution
2. **Planner**: Create verification questions from the claims made in the solution
3. **Executor**: Answer the questions independently (never sees the baseline)
4. **Synthesizer**: Compare the findings with the baseline and identify discrepancies
5. **Output**: Produce the verified solution

**Critical innovation**: The Executor, which acts as the verifier, works without access to the draft code, which guards against confirmation bias.
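
The five stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the plugin's actual implementation: the `LLM` callable, the `Finding` class, the `chain_of_verification` function, and all prompt wording are hypothetical. The point it shows is that the Executor prompt contains only the verification question and never the baseline draft.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function that maps a prompt to a model response.
LLM = Callable[[str], str]


@dataclass
class Finding:
    question: str
    answer: str


def chain_of_verification(task: str, llm: LLM) -> str:
    # 1. Baseline: draft an initial solution.
    baseline = llm(f"Solve the task:\n{task}")

    # 2. Planner: derive verification questions from the draft's claims.
    plan = llm(f"List one verification question per line for:\n{baseline}")
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Executor: answer each question in isolation.
    #    The prompt carries only the question, never the baseline.
    findings = [Finding(q, llm(f"Answer independently:\n{q}")) for q in questions]

    # 4. Synthesizer: compare the independent answers with the draft.
    evidence = "\n".join(f"Q: {f.question}\nA: {f.answer}" for f in findings)
    report = llm(
        "Compare the draft with the evidence and list discrepancies.\n"
        f"Draft:\n{baseline}\nEvidence:\n{evidence}"
    )

    # 5. Output: revise the draft using the discrepancy report.
    return llm(
        "Revise the draft to resolve the discrepancies.\n"
        f"Draft:\n{baseline}\nDiscrepancies:\n{report}"
    )
```

Passing any prompt-to-text function as `llm` runs the sketch; a stub that records its prompts is enough to check that no Executor prompt includes the draft.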

## Performance Metrics

Results reported in Meta's research paper (Llama 65B model). Improvements are relative to the baseline:

| Task Type | Metric | Relative Improvement | Trade-off |
|-----------|--------|----------------------|-----------|
| Biography generation | FACTSCORE | +28% (55.9→71.4) | -26% output volume (16.6→12.3 facts) |
| Closed-book QA | F1 Score | +23% (0.39→0.48) | ~2x token consumption |
| List-based questions | Precision | +112% (0.17→0.36) | Fewer total answers |

**Source**: Dhuliawala et al., ACL 2024 Findings (Table 1, Section 4.3)

**Key insight**: Higher accuracy comes at the cost of increased computation and reduced output volume.
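
The percentages in the table are relative changes computed from the before and after values quoted next to them. The short check below reproduces that arithmetic; the helper name `relative_change` and the dictionary labels are illustrative, and the numbers are copied from the table rather than from any additional source.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# (before, after) pairs as quoted in the table above.
reported = {
    "FACTSCORE, biography generation": (55.9, 71.4),
    "F1, closed-book QA": (0.39, 0.48),
    "Precision, list-based questions": (0.17, 0.36),
    "Facts per biography": (16.6, 12.3),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_change(before, after):+.0f}%")
# FACTSCORE, biography generation: +28%
# F1, closed-book QA: +23%
# Precision, list-based questions: +112%
# Facts per biography: -26%
```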

## When to Use

### ✅ Recommended

- **Critical code review**: Architectural decisions, security-sensitive code
- **Complex debugging**: Multi-component failure analysis
- **API/library integration**: When correctness matters more than speed
- **Acceptable ~2x cost**: The token budget can absorb the verification overhead

### ❌ Not Recommended

- **Trivial changes**: Simple fixes, formatting, typos
- **Exploratory coding**: Rapid prototyping, experimentation
- **Tight token budgets**: When cost is the primary constraint
- **Comprehensive output needed**: When you need every fact, not only the verified subset
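
The token-budget criterion can be turned into a quick back-of-the-envelope check. The sketch below is purely illustrative: `fits_budget` and `COVE_COST_MULTIPLIER` are hypothetical names, and the multiplier is an assumption taken from the approximate 2x figure quoted in this document, not a measured value.

```python
# Assumption: the ~2x token overhead quoted in this document.
COVE_COST_MULTIPLIER = 2.0


def fits_budget(
    baseline_tokens: int,
    token_budget: int,
    multiplier: float = COVE_COST_MULTIPLIER,
) -> bool:
    """Return True when the estimated verified run stays within the budget."""
    return baseline_tokens * multiplier <= token_budget


print(fits_budget(4_000, 10_000))  # True: about 8,000 tokens estimated
print(fits_budget(6_000, 10_000))  # False: about 12,000 tokens estimated
```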

## Installation

```bash
# Add plugin marketplace
/plugin marketplace add vertti/se-cove-claude-plugin

# Install plugin (in separate command)
/plugin install chain-of-verification
```

**Note**: Commands must be pasted separately (Claude Code marketplace limitation).

## Usage

```bash
# Invoke verification
/chain-of-verification:verify <your question>

# Autocomplete available
/ver<Tab>
```

## Limitations

From the research paper (Section 6):

1. **Not a silver bullet**: Reduces hallucinations but does not eliminate them
2. **Computational cost**: ~2x token usage vs. baseline generation (estimated from the implementation)
3. **Output volume trade-off**: Generates fewer but more accurate results
4. **Model-specific**: Tested on Llama 65B; generalization to GPT-4 or Claude models is unverified
5. **Task dependency**: Relative improvement varies significantly by task type (+23% to +112%)
6. **Factual hallucinations only**: Does not address incorrect reasoning steps or opinions

## Source Code

- **GitHub**: [vertti/se-cove-claude-plugin](https://github.com/vertti/se-cove-claude-plugin)
- **Version**: 1.1.1 (2026-01-23)
- **License**: MIT
- **Author**: Janne Sinivirta

## Related

- Main guide section: [Plugin System](../guide/ultimate-guide.md#85-plugin-system)
- Methodology: [Multi-Agent Orchestration](../guide/methodologies.md#multi-agent-orchestration)
- Verification Loops: [Autonomous Iteration](../guide/methodologies.md#verification-loops)