claude-code-ultimate-guide/docs/resource-evaluations/prompt-repetition-paper.md
Florian BRUNIAUX 1136dc683f docs: add resource-evaluations to tracked docs
- Create docs/resource-evaluations/ with 15 evaluation files
- Standardize filenames (remove date prefixes)
- Keep working docs and private audits in claudedocs/ (gitignored)
- Add resource evaluation workflow to CLAUDE.md

Files migrated:
- gsd, worktrunk, boris-cowork-video, wooldridge-productivity-stack
- remotion, nick-jensen, se-cove, self-improve-skill
- astgrep, clawdbot, prompt-repetition, uml-diagrams
- vibe-coding-rusitschka, anthropic-releases

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-26 14:02:05 +01:00

5.6 KiB

Evaluation: Prompt Repetition Paper (arXiv:2512.14982)

Date: 2026-01-25 Paper: "Prompt Repetition Improves Non-Reasoning LLMs" Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias (Google Research) Published: 17 Dec 2025 arXiv: https://arxiv.org/abs/2512.14982


1. Findings Summary

Core Claim

Repeating the input prompt 2x improves accuracy for LLMs without reasoning mode, without increasing output length or latency.

Tested Models (directly from paper)

  • Gemini 2.0 Flash / Flash Lite
  • GPT-4o / GPT-4o-mini
  • Claude 3 Haiku
  • Claude 3.7 Sonnet
  • Deepseek V3

Benchmarks

ARC (Challenge), OpenBookQA, GSM8K, MMLU-Pro, MATH, NameIndex, MiddleMatch

Key Results

Metric Value
Wins (no reasoning) 47/70 benchmark-model combinations
Losses 0
With CoT/reasoning 5 wins, 1 loss, 22 neutral

Claude-Specific Notes (from paper)

  • Tested on Claude 3 Haiku and Claude 3.7 Sonnet
  • Latency increase observed for Claude models on very long requests (repeat x3 or custom benchmarks)
  • Likely due to prefill stage taking longer

2. Relevance to Claude Code

Model Situation (Jan 2026)

Model Thinking Mode Prompt Repetition Applicable?
Opus 4.5 ON by default (max budget) NO - thinking already maximizes reasoning
Sonnet 4 Not available YES - could benefit
Haiku 3.5 Not available YES - could benefit

The Problem

Claude Code uses:

  • Sonnet as default (85% of usage per guide stats)
  • Haiku for simple tasks (cost optimization)
  • Opus for complex tasks (already has thinking mode)

The paper's technique is specifically for non-reasoning scenarios. This makes it potentially relevant for Sonnet/Haiku in Claude Code.

The Catch

  1. Input token cost doubles: Repeating prompt = 2x input tokens
  2. Claude Code context is already under pressure: Guide emphasizes context management (100K practical limit)
  3. Gain magnitude unclear: Paper shows wins/losses but not absolute improvement %
  4. Claude-specific latency issue: Paper notes increased latency for Claude on long prompts

3. Community Reception

Academic Impact (as of 2026-01-25)

  • Citations: 0 (paper is 5 weeks old)
  • Semantic Scholar: Listed, no citations
  • Replications: None found

Community Discussion

  • Hacker News: 5+ submissions, max 3 points, 0 comments
  • Reddit r/MachineLearning: No relevant posts
  • Reddit r/LocalLLaMA: No relevant posts
  • Twitter/X: No significant discussion found

Assessment

Extremely low community engagement. No independent validation. No practical adoption reports.


4. Practical Considerations for Claude Code

Hypothetical Hook Implementation

# pre-prompt-hook.sh (EXPERIMENTAL)
#!/bin/bash
# Double the prompt for Sonnet/Haiku
if [[ "$CLAUDE_MODEL" != "opus"* ]]; then
  echo "${1}

---
(Repeated for accuracy)
${1}"
else
  echo "$1"
fi

Problems with This Approach

  1. No API access to modify prompts in Claude Code - hooks can't intercept user input
  2. Would need SDK-level changes - not a user-configurable feature
  3. Cost doubling - doubles input tokens, may offset any accuracy gains
  4. Context bloat - directly contradicts the guide's context hygiene principles

5. Evaluation Matrix

Criterion Score Notes
Validity 3/5 Google Research paper, but no replications yet
Applicability to Claude Code 2/5 Relevant only to Sonnet/Haiku, not implementable by users
Community Adoption 1/5 Zero adoption, zero discussion
Practical Implementation 1/5 Can't intercept prompts in Claude Code
Cost/Benefit 2/5 2x input tokens for uncertain gain
Documentation Value 2/5 Too niche, too experimental

6. Recommendation

Score: 2/5 - DO NOT INTEGRATE

Rationale

  1. Wrong target: The technique targets non-reasoning LLMs, but Claude Code's complex tasks already use Opus (with thinking). Simple tasks on Sonnet/Haiku don't need accuracy optimization - they need speed.

  2. Not user-implementable: Users can't intercept their own prompts in Claude Code. This would require SDK changes, not documentation.

  3. Zero validation: No replications, no community adoption, no real-world usage reports after 5 weeks.

  4. Cost-prohibitive: Doubling input tokens contradicts Claude Code's emphasis on context efficiency and cost management.

  5. Niche application: Even if valid, it only helps on specific benchmark-style tasks (multiple choice, math) - not the open-ended coding tasks Claude Code handles.

What Could Change This

  • Independent replications with Claude Sonnet 4
  • Real-world adoption reports from Claude Code users
  • Anthropic acknowledgment or integration
  • Evidence that accuracy gains outweigh 2x input cost

Alternative Recommendation

If users want better accuracy on Sonnet:

  • Use OpusPlan (Opus for planning, Sonnet for execution) - already documented
  • Switch to Opus for critical decisions - already documented
  • Use structured prompting (XML tags) - already documented

These are proven techniques in the guide that don't double costs.


7. Files to NOT Update

  • guide/ultimate-guide.md - No integration
  • examples/hooks/ - No experimental hook
  • machine-readable/reference.yaml - No reference

8. Archive Decision

Action: Keep this evaluation in claudedocs/resource-evaluations/ for future reference.

If the paper gains traction (citations, replications, Anthropic mention), re-evaluate in Q2 2026.