diff --git a/docs/README.md b/docs/README.md
index df459a38..81839295 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -21,6 +21,7 @@ Project-intro and architecture explanation docs are intentionally omitted.
 ## P2 (Benchmarks / Specialized)
 
 1. `docs/e2e-finance-benchmark.md`
+2. `docs/web-tools-policy-optimization.md`
 
 ## Regeneration Rule
 
diff --git a/docs/web-tools-policy-optimization.md b/docs/web-tools-policy-optimization.md
new file mode 100644
index 00000000..eb34d6f3
--- /dev/null
+++ b/docs/web-tools-policy-optimization.md
@@ -0,0 +1,63 @@
+# Web Tools Policy Optimization Roadmap
+
+Related Linear issue: [MUL-267](https://linear.app/indexlabs/issue/MUL-267/refactor-web-evidence-guard-to-hybrid-policy-and-configurable-rule)
+
+## Context
+
+The current web evidence guard solved the immediate quality issue:
+- It enforces `web_search` -> `web_fetch` evidence coverage at runtime.
+- It blocks snippet-only finalization in key web-dependent cases.
+
+However, semantic intent detection currently relies on hard-coded regex cue groups in `packages/core/src/agent/web-tools-policy.ts`. This approach is deterministic, but it is hard to maintain over time and not robust across languages.
+
+## Problem Statement
+
+Current limitations:
+- Semantic classification logic is tightly coupled with runtime enforcement code.
+- Pattern lists are code-level constants, so every pattern change requires a code change.
+- Without a stronger benchmark loop, expanding pattern coverage risks overfitting and regressions.
+
+## Target Architecture
+
+Use a hybrid policy model:
+1. Deterministic guardrail layer (must keep)
+   - Tool-trace based invariants (e.g. search/fetch sequencing, minimum successful fetch count).
+
+2. Semantic decision layer (new)
+   - Lightweight model/classifier returns decision + confidence + reason codes.
+
+3. Rulepack fallback layer (refactor existing patterns)
+   - Externalized locale-aware cue packs for conservative fallback only.
+
+## Migration Plan
+
+Phase 1: Decouple configuration
+- Move regex cue groups out of `web-tools-policy.ts` into a policy registry.
+- Keep behavior equivalent.
+
+Phase 2: Add semantic classifier path
+- Add an optional semantic decision step with confidence threshold.
+- Preserve deterministic tool-trace constraints as final authority.
+
+Phase 3: Observability and tuning
+- Emit run-log fields for policy decision source:
+  - `tool-trace`
+  - `semantic`
+  - `fallback-pattern`
+- Add benchmark slices focused on false-positive/false-negative policy triggers.
+
+Phase 4: Reduce hard-coded fallback
+- Keep only minimal safety patterns in code.
+- Shift language/phrase evolution to versioned config updates.
+
+## Acceptance Criteria
+
+- The runtime policy file contains no hard-coded regex arrays beyond the minimal safety patterns.
+- Semantic decision path is independently testable and feature-flagged.
+- Baseline behavior remains backward-compatible for existing guard cases.
+- Benchmark report shows a policy misfire rate (false positives and false negatives) equal to or lower than the current baseline.
+
+## Non-goals
+
+- Replacing deterministic tool-trace enforcement with pure model decisions.
+- Expanding scope to unrelated tool policy domains in the same iteration.
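
The three layers under "Target Architecture" in the new document could be wired together roughly as sketched below. This is an illustration under stated assumptions, not the planned implementation: every type and function name (`ToolCall`, `SemanticVerdict`, `PolicyInput`, `summarizeTrace`, `decide`) and every reason code is hypothetical and not taken from `web-tools-policy.ts`. The sketch also assumes one possible reading of the roadmap: the tool-trace check runs first, and the semantic and fallback layers are consulted only when the trace shows a search without enough successful fetches.

```typescript
// Hypothetical shapes; none of these names exist in the repository.
type DecisionSource = "tool-trace" | "semantic" | "fallback-pattern";

interface ToolCall {
  tool: "web_search" | "web_fetch";
  ok: boolean;
}

interface SemanticVerdict {
  requiresEvidence: boolean;
  confidence: number; // assumed to be in the range 0..1
  reasonCodes: string[];
}

interface PolicyDecision {
  allowFinalize: boolean;
  source: DecisionSource; // value for the run-log field proposed in Phase 3
  reasonCodes: string[];
}

interface PolicyInput {
  query: string;
  trace: ToolCall[];
  classify?: (query: string) => SemanticVerdict; // optional, feature-flagged path
  fallbackCues: RegExp[]; // externalized cue pack, loaded elsewhere
  confidenceThreshold: number;
  minSuccessfulFetches: number;
}

// Count successful fetches that happen after the first search in the trace.
function summarizeTrace(trace: ToolCall[]): { searched: boolean; fetches: number } {
  let searched = false;
  let fetches = 0;
  for (const call of trace) {
    if (call.tool === "web_search") {
      searched = true;
    } else if (searched && call.tool === "web_fetch" && call.ok) {
      fetches += 1;
    }
  }
  return { searched, fetches };
}

function decide(input: PolicyInput): PolicyDecision {
  // Layer 1: deterministic tool-trace invariant.
  const { searched, fetches } = summarizeTrace(input.trace);
  const snippetOnly = searched && fetches < input.minSuccessfulFetches;
  if (!snippetOnly) {
    return { allowFinalize: true, source: "tool-trace", reasonCodes: ["trace-invariants-satisfied"] };
  }

  // Layer 2: semantic classifier, used only when present and confident enough.
  const verdict = input.classify ? input.classify(input.query) : undefined;
  if (verdict && verdict.confidence >= input.confidenceThreshold) {
    return {
      allowFinalize: !verdict.requiresEvidence,
      source: "semantic",
      reasonCodes: verdict.reasonCodes,
    };
  }

  // Layer 3: rulepack fallback; block snippet-only finalization on any cue match.
  const cueHit = input.fallbackCues.some((cue) => cue.test(input.query));
  return {
    allowFinalize: !cueHit,
    source: "fallback-pattern",
    reasonCodes: [cueHit ? "cue-match" : "no-cue-match"],
  };
}

// Usage: a search with no fetch, judged by a stubbed classifier.
const example = decide({
  query: "What changed in the latest release?",
  trace: [{ tool: "web_search", ok: true }],
  classify: () => ({ requiresEvidence: true, confidence: 0.9, reasonCodes: ["fresh-facts"] }),
  fallbackCues: [/latest/i],
  confidenceThreshold: 0.7,
  minSuccessfulFetches: 1,
});
console.log(example.source, example.allowFinalize);
```

Returning the `source` alongside the decision keeps each layer independently testable and gives Phase 3 a single value to log per decision.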