From 895ace49f75646bc27ab25b1cbac002a2da718ae Mon Sep 17 00:00:00 2001
From: Florian BRUNIAUX
Date: Thu, 19 Feb 2026 09:59:50 +0100
Subject: [PATCH] docs: add Borg et al. 2025 RCT on AI code maintainability (v3.27.7)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Resource eval: arXiv:2507.00788 "Echoes of AI" (151 devs, 95% pros, 2-phase blind RCT) — 30.7% faster median, ~55.9% habitual users, no significant downstream maintainability impact
- guide/learning-with-ai.md: citation + "On maintainability fear" note
- guide/ultimate-guide.md: nuance blockquote in §1.7 Trust Calibration
- machine-readable/reference.yaml: 4 new RCT/maintainability entries
- docs/resource-evaluations/: evaluation file with technical-writer audit

Co-Authored-By: Claude Sonnet 4.6
---
 CHANGELOG.md                                  |  20 +++
 VERSION                                       |   2 +-
 ...2-19-echoes-of-ai-maintainability-study.md | 166 ++++++++++++++++++
 guide/learning-with-ai.md                     |   3 +
 guide/ultimate-guide.md                       |   2 +
 machine-readable/reference.yaml               |   5 +
 6 files changed, 197 insertions(+), 1 deletion(-)
 create mode 100644 docs/resource-evaluations/2026-02-19-echoes-of-ai-maintainability-study.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3a1c127..56e1f54 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,26 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 - v2.1.47: VS Code plan preview auto-updates, `ctrl+f` kills all background agents, `last_assistant_message` hook field, 70+ bug fixes
 - v2.1.46: claude.ai MCP connectors support, orphaned process fix on macOS
 
+## [3.27.7] - 2026-02-19
+
+### Added
+
+- **Resource evaluation**: Borg et al. "Echoes of AI" RCT (arXiv:2507.00788)
+  - 2-phase blind controlled experiment, 151 participants (95% professional developers)
+  - AI users 30.7% faster (median), habitual users ~55.9% faster
+  - No significant maintainability impact for downstream developers — first RCT to explicitly target this question
+  - Fact-checked against primary source; v2 (Dec 2025) confirmed via Perplexity
+  - Co-authored by Dave Farley ("Continuous Delivery")
+  - Evaluation file: `docs/resource-evaluations/2026-02-19-echoes-of-ai-maintainability-study.md`
+
+### Changed
+
+- `guide/learning-with-ai.md`: Added Borg et al. 2025 RCT citation in Productivity Research bibliography (revised to factual/neutral wording after technical-writer audit)
+- `guide/learning-with-ai.md`: Added "On maintainability fear" note in "Why Some Teams Get Results" section — the real risks are skill atrophy and over-delegation, not downstream quality degradation
+- `guide/ultimate-guide.md`: Added downstream maintainability nuance blockquote in §1.7 Trust Calibration — defect rates ≠ maintenance burden (Borg et al. 2025 blind RCT)
+- `machine-readable/reference.yaml`: Added 4 entries — `productivity_rct_metr`, `productivity_rct_echoes`, `productivity_maintainability_empirical`, `trust_calibration_maintainability_nuance`
+- Landing `faq/index.astro`: Updated "How much should I trust AI-generated code?" — added maintainability nuance (HTML visible answer + JSON-LD structured data)
+
 ## [3.27.6] - 2026-02-18
 
 ### Added
diff --git a/VERSION b/VERSION
index c21a588..1ecf2a6 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-3.27.6
+3.27.7
diff --git a/docs/resource-evaluations/2026-02-19-echoes-of-ai-maintainability-study.md b/docs/resource-evaluations/2026-02-19-echoes-of-ai-maintainability-study.md
new file mode 100644
index 0000000..19350e7
--- /dev/null
+++ b/docs/resource-evaluations/2026-02-19-echoes-of-ai-maintainability-study.md
@@ -0,0 +1,166 @@
+# Resource Evaluation: "Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability"
+
+**Date:** 2026-02-19
+**Evaluator:** Claude Code (eval-resource skill)
+**Status:** Integrated (section Productivity Research)
+
+---
+
+## Resource Details
+
+| Field | Value |
+|-------|-------|
+| **Title** | Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability |
+| **Authors** | Markus Borg, Dave Hewett, Nadim Hagatulah, Noric Couderc, Emma Söderberg, Donald Graham, Uttam Kini, Dave Farley |
+| **Dave Farley credentials** | Co-author of "Continuous Delivery" (with Jez Humble), Jolt Award winner |
+| **Publication Date** | July 2025 |
+| **URL** | https://arxiv.org/abs/2507.00788 |
+| **Type** | Academic preprint (arXiv, not yet peer-reviewed in conference proceedings as of 2026-02-19) |
+| **LinkedIn context** | Post by Olivier LOVERDE (Co-founder & CPTO @Innovorder) summarizing the study |
+| **LinkedIn URL** | https://www.linkedin.com/posts/loverdeolivier_investigating-the-downstream-effects-of-ai-ugcPost-7426914640300802048 |
+
+---
+
+## Summary
+
+Two-phase controlled experiment investigating whether AI-assisted code creation impacts maintainability for downstream developers:
+
+- **Phase 1**: 151 participants add features to a Java web app (with or without AI: GitHub Copilot / Cursor)
+- **Phase 2**: A *different* group of developers evolves those solutions **without AI** (blind review — reviewers don't know if code was AI-assisted)
+
+**Key findings:**
+1. AI users completed tasks **30.7% faster** (median) than non-AI users
+2. Habitual AI users showed an estimated **55.9% speedup**
+3. **No significant differences** in downstream evolution time or code quality — the "AI code is unmaintainable" myth is not supported empirically
+4. Researchers recommend future investigation into excessive code generation and cognitive debt risks
+5. Dave Farley's explicit takeaway: developers must guide AI (not autopilot), think about the business problem, and decompose complexity
+
+---
+
+## Evaluation Score: **4/5** (Highly relevant: significant improvement)
+
+### Scoring Breakdown
+
+| Criterion | Score | Justification |
+|-----------|-------|---------------|
+| **Content novelty** | 4/5 | Directly addresses the #1 FUD against AI-assisted coding ("unmaintainable code") with empirical data |
+| **Research rigor** | 4/5 | 151 participants, 95% professional developers, 2-phase blind design — solid for this domain. Caveat: arXiv preprint, not yet peer-reviewed in proceedings |
+| **Guide specificity** | 4/5 | Complements METR 2025 (already in guide) — provides the counter-evidence the guide currently lacks |
+| **Credibility** | 5/5 | Dave Farley authorship = exceptional signal for the software engineering community |
+| **Actionability** | 3/5 | Results are confirmatory, not prescriptive — validates approach but doesn't change workflow |
+
+**Overall: 4/5**
+
+---
+
+## Gap Analysis
+
+### What the guide already covers
+
+| Topic | Coverage | Location |
+|-------|----------|----------|
+| AI productivity gains | ✅ General stats (Copilot, McKinsey) | `learning-with-ai.md:921-924` |
+| METR RCT (19% slower) | ✅ Present | `learning-with-ai.md:925` |
+| Vibe coding risks | ✅ Full section | Multiple locations |
+| Skill atrophy concern | ✅ Present | `learning-with-ai.md:925` |
+| AI code maintainability myth | ❌ **ABSENT** | **Gap identified** |
+| Productivity curve (habitual users) | ⚠️ Partially | `learning-with-ai.md:~100` |
+
+### What this study adds
+
+| Contribution | Value |
+|--------------|-------|
+| Empirical refutation of "AI code is unmaintainable" | **High** — directly debunks the most common objection |
+| 55.9% speedup for habitual users | **High** — validates learning curve section |
+| Blind review methodology | **Medium** — demonstrates scientific rigor of the finding |
+| Balance to METR 2025 results | **High** — METR = complex codebases, AI slower; this study = mixed tasks, AI faster → complete picture |
+
+---
+
+## Recommendations
+
+**Where to integrate**: `guide/learning-with-ai.md` — section "Productivity Research" (~line 925)
+
+**What to add** (1-2 lines):
+```markdown
+- **Borg et al. "Echoes of AI" RCT (2025)** — [arXiv:2507.00788](https://arxiv.org/abs/2507.00788) — Controlled experiment (151 participants, 95% professional developers, 2-phase blind design): AI users 30.7% faster (median), habitual users ~55.9% faster. **Key finding**: no significant maintainability impact for downstream developers. Directly refutes the "AI code is unmaintainable" myth. Caveat: arXiv preprint (July 2025), not yet peer-reviewed in conference proceedings.
+```
+
+**Priority**: Medium-High — completes the empirical picture alongside METR 2025.
+
+---
+
+## Challenge Summary (technical-writer agent)
+
+**Initial score:** 4/5
+**Challenged score:** 3.5 → 4/5 confirmed with corrections
+
+**Key points from challenge:**
+
+1. **Score justified** — but "peer-reviewed" was overstated. Corrected to "arXiv preprint."
+2. **Blind review design** (phase 2 reviewers don't know if code is AI-assisted) = most important methodological detail, absent from initial eval. Added.
+3. **55.9% habitual users** more actionable than 30.7% median — validates learning curve section.
+4. **Limitations not flagged by the post**: bounded lab tasks ≠ 12-month production codebase drift; potential selection bias (volunteer participants likely pro-AI); knowledge debt not measured.
+5. **Risk of non-integration**: Guide would retain pro-METR bias (AI slower on complex tasks) without empirical counter-balance on maintainability.
+
+---
+
+## Fact-Check
+
+### LinkedIn Post Claims
+
+| Claim (Olivier LOVERDE's post) | Verified | Source | Notes |
+|--------------------------------|----------|--------|-------|
+| Dave Farley = co-author of Continuous Delivery | ✅ | Perplexity, continuous-delivery.co.uk | Co-author with **Jez Humble** (not alone — minor omission in post) |
+| 151 developers | ✅ | arXiv abstract | Exact |
+| 95% professionals | ✅ | arXiv abstract | Not mentioned in post but verified |
+| One group creates, another takes over (without AI) | ✅ | arXiv methodology | 2-phase blind design confirmed |
+| AI code = no maintenance problem | ✅ | arXiv abstract | "No systematic maintainability advantages or disadvantages" |
+| 30% time saved | ✅ | arXiv: 30.7% median | Rounded, correct |
+| 50% for those proficient with the tools | ⚠️ | arXiv: 55.9% | Slight underestimate — actual is 55.9% |
+| Devs did not switch off their brains (qualifier) | ✅ | Study design | Phase 1 participants guided AI, did not use autopilot |
+
+### arXiv Paper Claims
+
+| Claim | Verified | Source |
+|-------|----------|--------|
+| 30.7% median reduction in completion time | ✅ | WebFetch arXiv abstract |
+| 55.9% speedup for habitual users | ✅ | WebFetch arXiv abstract |
+| No significant differences in Phase 2 | ✅ | WebFetch arXiv findings |
+| 151 participants | ✅ | WebFetch arXiv abstract |
+| 95% professional developers | ✅ | WebFetch arXiv abstract |
+
+**Corrections applied:**
+- "50%" → actual figure is **55.9%** (LinkedIn slight understatement)
+- "peer-reviewed" → **arXiv preprint** (July 2025, not yet peer-reviewed in proceedings)
+
+**Confidence**: High — primary source directly fetched and cross-validated.
+
+---
+
+## Final Decision
+
+- **Final score**: 4/5
+- **Action**: Integrate (1-2 lines in the Productivity Research section)
+- **Confidence**: High (primary source verified, methodology solid, gap confirmed)
+- **Nuance to retain**: Limitations of the lab design (bounded tasks, selection bias, knowledge debt not measured)
+
+---
+
+## Integration Log
+
+**Date integrated**: 2026-02-19
+**Post-audit corrections applied**: 2026-02-19 (technical-writer audit + Perplexity v2 check)
+
+| File | Change | Line |
+|------|--------|------|
+| `guide/learning-with-ai.md` | Citation rewritten: editorial claims removed, "July 2025" → "v2 Dec 2025", restructured factually | ~926 |
+| `guide/ultimate-guide.md` | Added downstream maintainability nuance blockquote in section 1.7 Trust Calibration | ~1092 |
+| `guide/learning-with-ai.md` | Added "On maintainability fear" note in "Why Teams Get Results" | ~151 |
+| `machine-readable/reference.yaml` | Added 4 entries: productivity_rct_metr, productivity_rct_echoes, productivity_maintainability_empirical, trust_calibration_maintainability_nuance | ~94 |
+
+**Post-audit corrections:**
+- "peer-reviewed" → "arXiv preprint (v2 Dec 2025), not yet published in peer-reviewed proceedings" (confirmed via Perplexity)
+- Removed editorial wording "directly refutes the myth" → neutral factual description
+- Added "First RCT to explicitly target maintainability of AI-assisted code" (Perplexity: arXiv v2 HTML confirmed wording)
+- Separated bibliography from analysis: the METR comparison was moved into the body of the guide, not the bibliography
diff --git a/guide/learning-with-ai.md b/guide/learning-with-ai.md
index a8cdc14..8cad503 100644
--- a/guide/learning-with-ai.md
+++ b/guide/learning-with-ai.md
@@ -150,6 +150,8 @@ The pattern: **AI excels at well-defined, repeatable tasks**. It struggles with
 
 The difference isn't the tool — it's the organizational discipline around it.
 
+**On maintainability fear**: The concern that AI-generated code creates unmaintainable codebases is not empirically supported — downstream developers show no significant difference in evolution time or code quality (Borg et al., 2025, n=151). The real risks are skill atrophy and over-delegation, not inherent quality degradation for the next developer. ([arXiv:2507.00788](https://arxiv.org/abs/2507.00788))
+
 ### Implications for Learning
 
 This research shapes the rest of this guide:
@@ -923,6 +925,7 @@ Sources for [§3 The Reality of AI Productivity](#the-reality-of-ai-productivity
 - **Stack Overflow 2024: AI Sentiment** — [stackoverflow.co](https://stackoverflow.co/labs/developer-sentiment-ai-ml/) — Developer attitudes toward AI tools, productivity perceptions
 - **Uplevel Engineering Intelligence (2024)** — Burnout and productivity metrics with AI coding tools
 - **METR Experienced Developer RCT (2025)** — [arXiv:2507.09089](https://arxiv.org/abs/2507.09089) — Randomized controlled trial (16 experienced devs, 246 issues, repos 1M+ lines): AI tools made developers 19% slower on familiar codebases, despite perceiving themselves 20% faster (39-point perception gap). Strongest evidence for skill atrophy risk in experienced developers.
+- **Borg et al. "Echoes of AI" RCT (2025)** — [arXiv:2507.00788](https://arxiv.org/abs/2507.00788) — 2-phase blind RCT (151 participants, 95% professional developers): AI users 30.7% faster (median), habitual users ~55.9% faster. Phase 2: downstream developers evolving AI-generated code showed no significant difference in evolution time or code quality vs. human-generated code. First RCT to explicitly target maintainability of AI-assisted code. Co-authored by Dave Farley ("Continuous Delivery"). Note: arXiv preprint (v2 Dec 2025), not yet published in peer-reviewed proceedings.
 - **DORA/Google DevOps Research (2024)** — AI tool adoption impact on team performance
 
 ### Practitioner Perspectives
diff --git a/guide/ultimate-guide.md b/guide/ultimate-guide.md
index ecdacf9..686145c 100644
--- a/guide/ultimate-guide.md
+++ b/guide/ultimate-guide.md
@@ -1091,6 +1091,8 @@ Research consistently shows AI code has higher defect rates than human-written c
 
 **Key insight**: AI produces code faster but verification becomes the bottleneck. The question isn't "does it work?" but "how do I know it works?"
 
+> **Nuance on downstream maintainability**: A 2-phase blind RCT (Borg et al., 2025, n=151, 95% professional developers) found no significant difference in the time needed for downstream developers to evolve AI-generated vs. human-generated code. The defect rates above are real — but they do not systematically translate into higher maintenance burden for the next developer. The risk is more narrowly scoped than commonly assumed. ([arXiv:2507.00788](https://arxiv.org/abs/2507.00788))
+
 ### The Verification Spectrum
 
 Not all code needs the same scrutiny. Match verification effort to risk:
diff --git a/machine-readable/reference.yaml b/machine-readable/reference.yaml
index f3ba7f5..1f675dc 100644
--- a/machine-readable/reference.yaml
+++ b/machine-readable/reference.yaml
@@ -91,6 +91,11 @@ deep_dive:
   learning_embracing_ai: "guide/learning-with-ai.md:518"
   learning_30day_plan: "guide/learning-with-ai.md:710"
   learning_red_flags: "guide/learning-with-ai.md:770"
+  # Productivity Research RCTs
+  productivity_rct_metr: "guide/learning-with-ai.md:925" # METR 2025: experienced devs 19% slower on large codebases despite perceiving 20% faster
+  productivity_rct_echoes: "guide/learning-with-ai.md:926" # Borg 2025: 30.7% faster (median), ~55.9% habitual users, no maintainability impact downstream
+  productivity_maintainability_empirical: "guide/learning-with-ai.md:926" # Empirical data on "AI code is unmaintainable" claim — blind RCT shows no significant difference
+  trust_calibration_maintainability_nuance: "guide/ultimate-guide.md:1092" # Nuance: defect rates ≠ maintenance burden (Borg et al. 2025)
   learning_mode_template: "examples/claude-md/learning-mode.md"
   learn_quiz_command: "examples/commands/learn/quiz.md"
   learn_teach_command: "examples/commands/learn/teach.md"