marketing-shibata50/claude-code-ultimate-guide

Florian BRUNIAUX 1136dc683f docs: add resource-evaluations to tracked docs

- Create docs/resource-evaluations/ with 15 evaluation files
- Standardize filenames (remove date prefixes)
- Keep working docs and private audits in claudedocs/ (gitignored)
- Add resource evaluation workflow to CLAUDE.md

Files migrated:
- gsd, worktrunk, boris-cowork-video, wooldridge-productivity-stack
- remotion, nick-jensen, se-cove, self-improve-skill
- astgrep, clawdbot, prompt-repetition, uml-diagrams
- vibe-coding-rusitschka, anthropic-releases

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-26 14:02:05 +01:00

5.6 KiB

Raw Blame History

Evaluation: Prompt Repetition Paper (arXiv:2512.14982)

Date: 2026-01-25 Paper: "Prompt Repetition Improves Non-Reasoning LLMs" Authors: Yaniv Leviathan, Matan Kalman, Yossi Matias (Google Research) Published: 17 Dec 2025 arXiv: https://arxiv.org/abs/2512.14982

1. Findings Summary

Core Claim

Repeating the input prompt 2x improves accuracy for LLMs without reasoning mode, without increasing output length or latency.

Tested Models (directly from paper)

Gemini 2.0 Flash / Flash Lite
GPT-4o / GPT-4o-mini
Claude 3 Haiku
Claude 3.7 Sonnet
Deepseek V3

Benchmarks

ARC (Challenge), OpenBookQA, GSM8K, MMLU-Pro, MATH, NameIndex, MiddleMatch

Key Results

Metric	Value
Wins (no reasoning)	47/70 benchmark-model combinations
Losses	0
With CoT/reasoning	5 wins, 1 loss, 22 neutral

Claude-Specific Notes (from paper)

Tested on Claude 3 Haiku and Claude 3.7 Sonnet
Latency increase observed for Claude models on very long requests (repeat x3 or custom benchmarks)
Likely due to prefill stage taking longer

2. Relevance to Claude Code

Model Situation (Jan 2026)

Model	Thinking Mode	Prompt Repetition Applicable?
Opus 4.5	ON by default (max budget)	NO - thinking already maximizes reasoning
Sonnet 4	Not available	YES - could benefit
Haiku 3.5	Not available	YES - could benefit

The Problem

Claude Code uses:

Sonnet as default (85% of usage per guide stats)
Haiku for simple tasks (cost optimization)
Opus for complex tasks (already has thinking mode)

The paper's technique is specifically for non-reasoning scenarios. This makes it potentially relevant for Sonnet/Haiku in Claude Code.

The Catch

Input token cost doubles: Repeating prompt = 2x input tokens
Claude Code context is already under pressure: Guide emphasizes context management (100K practical limit)
Gain magnitude unclear: Paper shows wins/losses but not absolute improvement %
Claude-specific latency issue: Paper notes increased latency for Claude on long prompts

3. Community Reception

Academic Impact (as of 2026-01-25)

Citations: 0 (paper is 5 weeks old)
Semantic Scholar: Listed, no citations
Replications: None found

Community Discussion

Hacker News: 5+ submissions, max 3 points, 0 comments
Reddit r/MachineLearning: No relevant posts
Reddit r/LocalLLaMA: No relevant posts
Twitter/X: No significant discussion found

Assessment

Extremely low community engagement. No independent validation. No practical adoption reports.

4. Practical Considerations for Claude Code

Hypothetical Hook Implementation

# pre-prompt-hook.sh (EXPERIMENTAL)
#!/bin/bash
# Double the prompt for Sonnet/Haiku
if [[ "$CLAUDE_MODEL" != "opus"* ]]; then
  echo "${1}

---
(Repeated for accuracy)
${1}"
else
  echo "$1"
fi

Problems with This Approach

No API access to modify prompts in Claude Code - hooks can't intercept user input
Would need SDK-level changes - not a user-configurable feature
Cost doubling - doubles input tokens, may offset any accuracy gains
Context bloat - directly contradicts the guide's context hygiene principles

5. Evaluation Matrix

Criterion	Score	Notes
Validity	3/5	Google Research paper, but no replications yet
Applicability to Claude Code	2/5	Relevant only to Sonnet/Haiku, not implementable by users
Community Adoption	1/5	Zero adoption, zero discussion
Practical Implementation	1/5	Can't intercept prompts in Claude Code
Cost/Benefit	2/5	2x input tokens for uncertain gain
Documentation Value	2/5	Too niche, too experimental

6. Recommendation

Score: 2/5 - DO NOT INTEGRATE

Rationale

Wrong target: The technique targets non-reasoning LLMs, but Claude Code's complex tasks already use Opus (with thinking). Simple tasks on Sonnet/Haiku don't need accuracy optimization - they need speed.
Not user-implementable: Users can't intercept their own prompts in Claude Code. This would require SDK changes, not documentation.
Zero validation: No replications, no community adoption, no real-world usage reports after 5 weeks.
Cost-prohibitive: Doubling input tokens contradicts Claude Code's emphasis on context efficiency and cost management.
Niche application: Even if valid, it only helps on specific benchmark-style tasks (multiple choice, math) - not the open-ended coding tasks Claude Code handles.

What Could Change This

Independent replications with Claude Sonnet 4
Real-world adoption reports from Claude Code users
Anthropic acknowledgment or integration
Evidence that accuracy gains outweigh 2x input cost

Alternative Recommendation

If users want better accuracy on Sonnet:

Use OpusPlan (Opus for planning, Sonnet for execution) - already documented
Switch to Opus for critical decisions - already documented
Use structured prompting (XML tags) - already documented

These are proven techniques in the guide that don't double costs.

7. Files to NOT Update

guide/ultimate-guide.md - No integration
examples/hooks/ - No experimental hook
machine-readable/reference.yaml - No reference

8. Archive Decision

Action: Keep this evaluation in claudedocs/resource-evaluations/ for future reference.

If the paper gains traction (citations, replications, Anthropic mention), re-evaluate in Q2 2026.

5.6 KiB Raw Blame History