claude-code-ultimate-guide/docs/resource-evaluations/se-cove-plugin.md
Florian BRUNIAUX 1136dc683f docs: add resource-evaluations to tracked docs
- Create docs/resource-evaluations/ with 15 evaluation files
- Standardize filenames (remove date prefixes)
- Keep working docs and private audits in claudedocs/ (gitignored)
- Add resource evaluation workflow to CLAUDE.md

Files migrated:
- gsd, worktrunk, boris-cowork-video, wooldridge-productivity-stack
- remotion, nick-jensen, se-cove, self-improve-skill
- astgrep, clawdbot, prompt-repetition, uml-diagrams
- vibe-coding-rusitschka, anthropic-releases

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-26 14:02:05 +01:00

312 lines
11 KiB
Markdown

# Resource Evaluation: SE-CoVe Plugin
**Date**: 2026-01-24
**Evaluator**: Claude Code Ultimate Guide (via /eval-resource skill)
**Resource**: SE-CoVe (Chain-of-Verification) Claude Code Plugin
## Sources
- **LinkedIn Post**: https://www.linkedin.com/posts/vertti_github-verttise-cove-claude-plugin-se-cove-activity-7420735428607197184-IfOq
- **GitHub Repo**: https://github.com/vertti/se-cove-claude-plugin
- **Research Paper**: https://arxiv.org/abs/2309.11495 (ACL 2024 Findings)
- **ACL Anthology**: https://aclanthology.org/2024.findings-acl.212/
---
## Executive Summary
**Decision**: ✅ **INTEGRATED** (with academic corrections)
**Score**: 3/5 (Pertinent avec réserves majeures)
**Approach**: B (Neutral Academic) - Factual presentation without marketing bias
**Rationale**: SE-CoVe implements Meta's Chain-of-Verification methodology (ACL 2024 validated), combling le gap "plugin examples" dans notre guide. MAIS: LinkedIn marketing claim de "28% improvement" est cherry-picked (réalité: 23-112% selon tâche), et omet coûts computationnels (~2x tokens) et réduction output (-26% facts).
**Actions taken**:
1. ✅ Created `examples/plugins/se-cove.md` with academic citations
2. ✅ Added to README.md "Examples Library" section
3. ✅ Updated `machine-readable/reference.yaml`
---
## Content Summary
### What is SE-CoVe?
Software Engineering adaptation of Meta's Chain-of-Verification for Claude Code.
**Pipeline**:
1. Baseline: Generate initial solution
2. Planner: Create verification questions from claims
3. Executor: Answer questions independently (never sees baseline)
4. Synthesizer: Compare findings, identify discrepancies
5. Output: Produce verified solution
**Critical innovation**: Verifier operates without draft code access (prevents confirmation bias).
### Author & Maintenance
- **Author**: Janne Sinivirta (LinkedIn: vertti)
- **Version**: 1.1.1 (2026-01-23)
- **License**: MIT
- **GitHub Stars**: ~78 (low community validation)
---
## Fact-Check Results
### ✅ Verified Claims
| Claim | Status | Source |
|-------|--------|--------|
| **Meta AI research** | ✅ Verified | arXiv:2309.11495, ACL 2024 Findings |
| **5-stage pipeline** | ✅ Verified | GitHub README matches paper methodology |
| **Independent verifier** | ✅ Verified | Paper Section 3: "verifier never sees draft" |
| **Installation commands** | ✅ Verified | `/plugin marketplace add` + `/plugin install` |
| **Use cases documented** | ✅ Verified | README lists recommended/avoid scenarios |
### ⚠️ Misleading Claims
| Claim | Reality | Severity |
|-------|---------|----------|
| **"28% accuracy improvement"** | True for biography FACTSCORE only; 23% for QA, 112% for lists | 🔴 Critical cherry-picking |
| **Computational cost omitted** | ~2x token consumption (undisclosed) | 🟡 Material omission |
| **Output reduction omitted** | -26% facts generated (16.6→12.3) | 🟡 Material omission |
| **"Improves accuracy"** | True but hallucinations NOT eliminated | 🟡 Oversimplification |
### ❌ Unverified Claims
| Claim | Issue | Resolution |
|-------|-------|------------|
| **"28% improvement"** | NOT found in arXiv abstract | Perplexity research: Found in paper Section 4.3, Table 1 (FACTSCORE metric, biography task only) |
---
## Performance Metrics (from Research Paper)
**Source**: Dhuliawala et al., "Chain-of-Verification Reduces Hallucination in Large Language Models", ACL 2024 Findings.
| Task Type | Metric | Improvement | Computational Cost |
|-----------|--------|-------------|-------------------|
| Biography generation | FACTSCORE | +28% (55.9→71.4) | -26% output volume (16.6→12.3 facts) |
| Closed-book QA | F1 Score | +23% (0.39→0.48) | ~2x token consumption |
| List-based questions | Precision | +112% (0.17→0.36) | Fewer total answers |
**Model**: Llama 65B (generalization to GPT-4/Claude/Sonnet unverified)
---
## Gap Analysis
### ✅ Gaps SE-CoVe Fills
1. **Plugin examples**: Guide has 233 lines on Plugin System (6863-7096) but ZERO concrete examples
2. **CoVe methodology**: Multi-Agent Orchestration mentioned (methodologies.md:165) but CoVe specifically absent
3. **Independent verification**: Verification Loops documented (methodologies.md:145) but no implementation example
### 🔄 Overlap with Existing Content
| Concept | Existing Section | SE-CoVe Contribution |
|---------|------------------|---------------------|
| Code Review | `examples/agents/code-reviewer.md` | Adds independent verification pattern |
| Multi-Agent | `guide/methodologies.md:165` | Concrete CoVe implementation |
| Verification Loops | `guide/methodologies.md:145` | Automated verification pipeline |
| Plugin System | `guide/ultimate-guide.md:6863` | First practical example |
---
## Technical Writer Challenge (Agent aa5c1fd)
### Original Evaluation Issues Identified
1.**Factual error**: Claimed "guide has NO plugin section" → FALSE (233 lines exist)
2.**Correctly spotted**: Gap = theoretical docs without examples
3. ⚠️ **Underestimated**: Importance of "theory without practice" anti-pattern
4.**Cherry-picking not flagged**: Original eval didn't catch 28% selectivity
### Score Adjustment
| Phase | Score | Rationale |
|-------|-------|-----------|
| **Initial** | 3/5 | Pertinent - Complément utile |
| **Post-challenge** | 4/5 | Très pertinent - Comble gap pratique |
| **Post-fact-check** | **3/5** | Downgrade due to marketing misleadingness |
**Reason for downgrade**: Marketing claim cherry-picking + material omissions (2x cost, -26% output) reduce trustworthiness despite valid methodology.
---
## Integration Approach
### Selected: Approach B (Neutral Academic)
**Rejected approaches**:
-**Approach A (Heavy disclaimers)**: Too negative, disclaimer longer than content
-**Approach C (Don't include)**: Too conservative, misses opportunity to fill gap
**Why Approach B**:
1. ✅ Factual without being accusatory
2. ✅ Presents gains AND costs equitably (table format)
3. ✅ Professional tone (academic citation, not "warning")
4. ✅ Educates users on trade-offs without alarming
### Documentation Format
```markdown
## Performance Metrics
Results from Meta's research paper (Llama 65B model):
[Table with Improvement + Computational Cost columns]
**Source**: Dhuliawala et al., ACL 2024 Findings
```
**Key principle**: Cite the paper, not the marketing.
---
## Curation Policy Established
To avoid amplifying marketing bias in future evaluations:
### Inclusion Criteria
| Criterion | Requirement | SE-CoVe Status |
|-----------|-------------|----------------|
| **Academic validation** | Published conference/journal | ✅ ACL 2024 Findings |
| **Claims fact-checked** | Verified via Perplexity/paper | ⚠️ Cherry-picked but true |
| **Trade-offs disclosed** | Cost/limitations documented | ❌ Omitted → we added |
| **Community validation** | Tested internally OR 1K+ stars | ❌ Neither (78 stars, untested) |
| **Active maintenance** | Update < 6 months | v1.1.1 (2026-01-23) |
**Verdict**: Include with academic disclaimers.
---
## Files Created
### 1. `examples/plugins/se-cove.md`
**Content**:
- Research foundation (Meta AI, ACL 2024)
- 5-stage pipeline explanation
- Performance metrics table (with trade-offs)
- When to use / When NOT to use
- Installation instructions
- Limitations (from paper Section 6)
- Source links (GitHub, arXiv, ACL Anthology)
**Citations**:
- Paper: Dhuliawala et al., arXiv:2309.11495
- Conference: ACL 2024 Findings
- Implementation: GitHub vertti/se-cove-claude-plugin v1.1.1
### 2. `README.md` (updated)
**Line 238**: Added "**Plugins** (1): [SE-CoVe](./examples/plugins/se-cove.md) Chain-of-Verification for independent code review (Meta AI, ACL 2024)"
### 3. `machine-readable/reference.yaml` (updated)
**Lines 124-132**: Added section:
```yaml
# Plugin System & Recommended Plugins (added 2026-01-24)
plugins_system: 6863
plugins_se_cove: "examples/plugins/se-cove.md"
chain_of_verification_paper: "https://arxiv.org/abs/2309.11495"
chain_of_verification_acl: "https://aclanthology.org/2024.findings-acl.212/"
```
---
## Lessons Learned
### For Future Evaluations
1. **Fact-check via Perplexity**: Essential for academic claims (28% found in paper p.7, not abstract)
2. **Challenge initial assessment**: technical-writer agent caught factual errors
3. **Check for omissions**: Marketing often presents gains without costs
4. **Verify source credibility**: ACL 2024 > random blog post
5.**Approach B (neutral academic)** > heavy disclaimers or rejection
### Red Flags Detected
| Marketing Pattern | SE-CoVe Example | Mitigation |
|-------------------|-----------------|------------|
| **Cherry-picking best metric** | "28%" (ignores 23%/112% on other tasks) | Present full results table |
| **Omitting computational costs** | No mention of 2x tokens | Add "Computational Cost" column |
| **Oversimplifying limitations** | "Improves accuracy" (hallucinations not eliminated) | Include paper's Limitations section |
| **Lack of context** | "Independent verification" (model-specific) | Note "Tested on Llama 65B only" |
---
## Confidence Assessment
| Aspect | Confidence | Evidence |
|--------|-----------|----------|
| **Methodology validity** | 🟢 High | ACL 2024 peer-reviewed paper |
| **Performance metrics** | 🟢 High | Verified in paper Section 4.3, Table 1 |
| **Plugin functionality** | 🟡 Medium | README documented, but untested by us |
| **Generalization** | 🟡 Medium | Tested on Llama 65B, not SOTA models |
| **Marketing accuracy** | 🔴 Low | Cherry-picked metrics, material omissions |
---
## Recommendations for Users
### When to Trust SE-CoVe
✅ Use for:
- Critical code review (architectural decisions)
- Security-sensitive code verification
- Complex debugging requiring independent analysis
- When 2x computational cost is acceptable
### When to Be Skeptical
⚠️ Avoid expecting:
- Universal 28% improvement (task-dependent: 23-112%)
- Zero hallucinations (reduces, not eliminates)
- Fast processing (5+ minutes per verification)
- Comprehensive output (generates fewer but more accurate results)
---
## Meta: Evaluation Process
### Workflow Used
1. **Fetch & Summarize**: WebFetch LinkedIn + GitHub README
2. **Context Check**: Read `machine-readable/reference.yaml`
3. **Gap Analysis**: Grep for verification/multi-agent/code review
4. **Challenge**: Task tool (technical-writer agent)
5. **Fact-Check**: Perplexity research on 28% claim
6. **Document**: Create files with academic approach
### Tools Used
- WebFetch (LinkedIn, GitHub, arXiv abstract)
- Perplexity Pro (fact-check 28% claim in full paper)
- Task tool (technical-writer challenge)
- Grep/Read (gap analysis)
- Write/Edit (documentation)
### Time Investment
- Research & fact-check: ~20 minutes
- Challenge & revision: ~10 minutes
- Documentation: ~15 minutes
- **Total**: ~45 minutes
---
## Conclusion
**SE-CoVe plugin integrated successfully with academic rigor.**
**Key achievement**: First concrete plugin example in guide, combling le gap "theory without practice" dans la section Plugin System (6863-7096).
**Critical correction**: Marketing claim "28% improvement" → Documented reality "23-112% task-dependent, 2x cost, -26% output".
**Precedent established**: Future plugins evaluated with Approach B (neutral academic), fact-checked via Perplexity, trade-offs disclosed transparently.
**Next evaluation**: Use this report as template (format réutilisable).