claude-code-ultimate-guide/docs/resource-evaluations/prompt-repetition-paper.md
Florian BRUNIAUX 1136dc683f docs: add resource-evaluations to tracked docs
- Create docs/resource-evaluations/ with 15 evaluation files
- Standardize filenames (remove date prefixes)
- Keep working docs and private audits in claudedocs/ (gitignored)
- Add resource evaluation workflow to CLAUDE.md

Files migrated:
- gsd, worktrunk, boris-cowork-video, wooldridge-productivity-stack
- remotion, nick-jensen, se-cove, self-improve-skill
- astgrep, clawdbot, prompt-repetition, uml-diagrams
- vibe-coding-rusitschka, anthropic-releases

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-26 14:02:05 +01:00


# Evaluation: Prompt Repetition Paper (arXiv:2512.14982)
**Date**: 2026-01-25
**Paper**: "Prompt Repetition Improves Non-Reasoning LLMs"
**Authors**: Yaniv Leviathan, Matan Kalman, Yossi Matias (Google Research)
**Published**: 17 Dec 2025
**arXiv**: https://arxiv.org/abs/2512.14982
---
## 1. Findings Summary
### Core Claim
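Repeating the input prompt (sending it twice, back to back, in a single request) improves accuracy for LLMs run **without reasoning mode**, without increasing the number of generated tokens and, in most cases, without increasing latency.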
### Tested Models (directly from paper)
- Gemini 2.0 Flash / Flash Lite
- GPT-4o / GPT-4o-mini
- **Claude 3 Haiku**
- **Claude 3.7 Sonnet**
- Deepseek V3
### Benchmarks
ARC (Challenge), OpenBookQA, GSM8K, MMLU-Pro, MATH, NameIndex, MiddleMatch
### Key Results
| Metric | Value |
|--------|-------|
| Wins (no reasoning) | 47 of 70 benchmark-model combinations |
| Losses (no reasoning) | 0 |
| With CoT/reasoning | 5 wins, 1 loss, 22 neutral |
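
As a quick sanity check on the table above, the win rates can be worked out with a few lines of shell. This is only a sketch: it assumes every no-reasoning combination that is neither a win nor a loss counts as neutral, which the table does not state explicitly.

```bash
# Figures copied from the Key Results table.
wins=47; losses=0; total=70
# Assumption: everything that is not a win or a loss is neutral.
neutral=$((total - wins - losses))
win_rate=$(awk -v w="$wins" -v t="$total" 'BEGIN { printf "%.1f", w / t * 100 }')
echo "no reasoning: ${wins} wins, ${neutral} neutral, ${losses} losses (${win_rate}% wins)"

r_wins=5; r_losses=1; r_neutral=22
r_total=$((r_wins + r_losses + r_neutral))
r_rate=$(awk -v w="$r_wins" -v t="$r_total" 'BEGIN { printf "%.1f", w / t * 100 }')
echo "with reasoning: ${r_wins} wins out of ${r_total} (${r_rate}% wins)"
```

Under that assumption, roughly two thirds of the no-reasoning combinations are wins (67.1%), against fewer than one in five (17.9%) once reasoning is enabled.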
### Claude-Specific Notes (from paper)
- Tested on Claude 3 Haiku and Claude 3.7 Sonnet
- **Latency increase** observed for Claude models on very long requests (repeat x3 or custom benchmarks)
- Likely due to prefill stage taking longer
---
## 2. Relevance to Claude Code
### Model Situation (Jan 2026)
| Model | Thinking Mode | Prompt Repetition Applicable? |
|-------|---------------|-------------------------------|
| Opus 4.5 | ON by default (max budget) | NO - thinking already maximizes reasoning |
| Sonnet 4 | Not available | YES - could benefit |
| Haiku 3.5 | Not available | YES - could benefit |
### The Problem
Claude Code uses:
- **Sonnet as default** (85% of usage per guide stats)
- **Haiku for simple tasks** (cost optimization)
- **Opus for complex tasks** (already has thinking mode)

The paper's technique is specifically for **non-reasoning** scenarios. This makes it potentially relevant for Sonnet/Haiku in Claude Code.
### The Catch
1. **Input token cost doubles**: Repeating prompt = 2x input tokens
2. **Claude Code context is already under pressure**: Guide emphasizes context management (100K practical limit)
3. **Gain magnitude unclear**: Paper shows wins/losses but not absolute improvement %
4. **Claude-specific latency issue**: Paper notes increased latency for Claude on long prompts
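
To make the first two points concrete, here is a rough budget calculation. The 12,000-token prompt is a hypothetical figure chosen for illustration. The 100K practical limit is the guide's number quoted above. The few tokens a separator line would add are ignored.

```bash
# Hypothetical prompt size (illustrative only).
prompt_tokens=12000
# Practical context limit cited from the guide.
context_limit=100000

repeated_tokens=$((prompt_tokens * 2))
extra_tokens=$((repeated_tokens - prompt_tokens))
# Integer percentage of the practical context used by the repeated prompt.
context_share=$((repeated_tokens * 100 / context_limit))

echo "input tokens: ${prompt_tokens} -> ${repeated_tokens} (+${extra_tokens})"
echo "share of practical context: ${context_share}%"
# Prints:
# input tokens: 12000 -> 24000 (+12000)
# share of practical context: 24%
```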
---
## 3. Community Reception
### Academic Impact (as of 2026-01-25)
- **Citations**: 0 (paper is 5 weeks old)
- **Semantic Scholar**: Listed, no citations
- **Replications**: None found
### Community Discussion
- **Hacker News**: 5+ submissions, max 3 points, 0 comments
- **Reddit r/MachineLearning**: No relevant posts
- **Reddit r/LocalLLaMA**: No relevant posts
- **Twitter/X**: No significant discussion found
### Assessment
Extremely low community engagement. No independent validation. No practical adoption reports.
---
## 4. Practical Considerations for Claude Code
### Hypothetical Hook Implementation
```bash
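#!/bin/bash
# pre-prompt-hook.sh (EXPERIMENTAL)
# Double the prompt for Sonnet/Haiku; pass it through unchanged for Opus.
# CLAUDE_MODEL is assumed to hold the active model name (hypothetical variable).
prompt="$1"
if [[ "$CLAUDE_MODEL" != opus* ]]; then
  # Emit the prompt twice, separated by a marker line.
  printf '%s\n---\n(Repeated for accuracy)\n%s\n' "$prompt" "$prompt"
else
  printf '%s\n' "$prompt"
fi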
```
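To see what such a hook would emit, the same logic can be wrapped in a function and run locally, outside Claude Code. This is a dry-run sketch only: `repeat_for_model` is a hypothetical helper name, and the model is passed as an argument rather than read from an environment variable.

```bash
# Hypothetical helper: $1 = model name, $2 = prompt text.
repeat_for_model() {
  case "$1" in
    # Opus: pass the prompt through unchanged.
    opus*) printf '%s\n' "$2" ;;
    # Anything else: emit the prompt twice with a marker in between.
    *)     printf '%s\n---\n(Repeated for accuracy)\n%s\n' "$2" "$2" ;;
  esac
}

repeat_for_model "sonnet" "List the files changed in the last commit."
```

Run as shown, this prints the prompt, the `---` separator, the `(Repeated for accuracy)` marker, and the prompt again. With a model name starting with `opus` it prints the prompt once.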
### Problems with This Approach
1. **No API access to modify prompts in Claude Code** - hooks can't intercept user input
2. **Would need SDK-level changes** - not a user-configurable feature
3. **Cost doubling** - doubles input tokens, may offset any accuracy gains
4. **Context bloat** - directly contradicts the guide's context hygiene principles
---
## 5. Evaluation Matrix
| Criterion | Score | Notes |
|-----------|-------|-------|
| **Validity** | 3/5 | Google Research paper, but no replications yet |
| **Applicability to Claude Code** | 2/5 | Relevant only to Sonnet/Haiku, not implementable by users |
| **Community Adoption** | 1/5 | No adoption reports, near-zero discussion |
| **Practical Implementation** | 1/5 | Can't intercept prompts in Claude Code |
| **Cost/Benefit** | 2/5 | 2x input tokens for uncertain gain |
| **Documentation Value** | 2/5 | Too niche, too experimental |
---
## 6. Recommendation
### Score: 2/5 - DO NOT INTEGRATE
### Rationale
1. **Wrong target**: The technique targets non-reasoning LLMs, but Claude Code's complex tasks already use Opus (with thinking). Simple tasks on Sonnet/Haiku don't need accuracy optimization - they need speed.
2. **Not user-implementable**: Users can't intercept their own prompts in Claude Code. This would require SDK changes, not documentation.
3. **Zero validation**: No replications, no community adoption, no real-world usage reports after 5 weeks.
4. **Cost-prohibitive**: Doubling input tokens contradicts Claude Code's emphasis on context efficiency and cost management.
5. **Niche application**: Even if valid, the gains are only demonstrated on benchmark-style tasks (multiple choice, math), not on the open-ended coding tasks Claude Code handles.
### What Could Change This
- Independent replications with Claude Sonnet 4
- Real-world adoption reports from Claude Code users
- Anthropic acknowledgment or integration
- Evidence that accuracy gains outweigh 2x input cost
### Alternative Recommendation
If users want better accuracy on Sonnet:
- Use **OpusPlan** (Opus for planning, Sonnet for execution) - already documented
- Switch to Opus for critical decisions - already documented
- Use structured prompting (XML tags) - already documented
These are proven techniques in the guide that don't double costs.
---
## 7. Files to NOT Update
- `guide/ultimate-guide.md` - No integration
- `examples/hooks/` - No experimental hook
- `machine-readable/reference.yaml` - No reference
---
## 8. Archive Decision
**Action**: Keep this evaluation in `docs/resource-evaluations/` for future reference.
If the paper gains traction (citations, replications, Anthropic mention), re-evaluate in Q2 2026.