Initial commit: 6 AI marketing skill categories

- growth-engine: Autonomous experiment engine (Karpathy autoresearch for marketing)
- sales-pipeline: RB2B router, deal resurrector, trigger prospector, ICP learner
- content-ops: Expert panel, quality gate, editorial brain, quote miner
- outbound-engine: Cold outbound optimizer, lead pipeline, competitive monitor
- seo-ops: Content attack briefs, GSC optimizer, trend scout
- finance-ops: CFO briefing, cost estimate, scenario modeler

79 files, all sanitized - zero hardcoded credentials or internal references.
Alfred Claw 2026-03-27 20:14:52 -07:00
commit a96d0d8889
81 changed files with 15050 additions and 0 deletions

21
LICENSE Normal file

@ -0,0 +1,21 @@
MIT License
Copyright (c) 2026 Single Grain
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

146
README.md Normal file

@ -0,0 +1,146 @@
# AI Marketing Skills
**Open-source Claude Code skills for B2B marketing and sales teams.** Built by the team at [Single Grain](https://www.singlegrain.com) — battle-tested on real pipelines generating millions in revenue.
These aren't prompts. They're complete workflows — scripts, scoring algorithms, expert panels, and automation pipelines you can plug into Claude Code (or any AI coding agent) and run today.
---
## 🗂️ Skills
| Category | What It Does | Key Skills |
|----------|-------------|------------|
| [**Growth Engine**](./growth-engine/) | Autonomous marketing experiments that run, measure, and optimize themselves | Experiment Engine, Pacing Alerts, Weekly Scorecard |
| [**Sales Pipeline**](./sales-pipeline/) | Turn anonymous website visitors into qualified pipeline | RB2B Router, Deal Resurrector, Trigger Prospector, ICP Learner |
| [**Content Ops**](./content-ops/) | Ship content that scores 90+ every time | Expert Panel, Quality Gate, Editorial Brain, Quote Miner |
| [**Outbound Engine**](./outbound-engine/) | ICP definition to emails in inbox — fully automated | Cold Outbound Optimizer, Lead Pipeline, Competitive Monitor |
| [**SEO Ops**](./seo-ops/) | Find the keywords your competitors missed | Content Attack Briefs, GSC Optimizer, Trend Scout |
| [**Finance Ops**](./finance-ops/) | Your AI CFO that finds hidden costs in 30 minutes | CFO Briefing, Cost Estimate, Scenario Modeler |
---
## 🚀 Quick Start
Each skill category has its own README with setup instructions. The general pattern:
```bash
# 1. Clone the repo
git clone https://github.com/singlegrain/ai-marketing-skills.git
cd ai-marketing-skills
# 2. Pick a category
cd growth-engine # or sales-pipeline, content-ops, etc.
# 3. Install dependencies
pip install -r requirements.txt
# 4. Set up environment variables
cp .env.example .env
# Edit .env with your API keys
# 5. Run
python experiment-engine.py create \
--hypothesis "Thread posts get 2x engagement vs single posts" \
--variable format \
--variants '["thread", "single"]' \
--metric impressions
```
---
## 🧠 How These Work with Claude Code
Every category includes a `SKILL.md` file. Drop it into your Claude Code project and the AI agent knows how to use the tools:
```bash
# In your project directory
mkdir -p .claude/skills   # create the target directory if it does not exist yet
cp ai-marketing-skills/growth-engine/SKILL.md .claude/skills/growth-engine.md
```
Then ask Claude Code: *"Run an experiment testing carousel vs. static posts on LinkedIn"* — it handles the rest.
---
## 📊 What Makes These Different
**These aren't toy demos.** Each skill was built to run real business operations:
- **Growth Engine** uses bootstrap confidence intervals and Mann-Whitney U tests — real statistics, not vibes
- **Deal Resurrector** has three intelligence layers including "follow the champion" — tracking departed contacts to their new companies
- **ICP Learner** rewrites your ideal customer profile based on actual win/loss data — your targeting improves automatically
- **Expert Panel** recursively scores content with domain-specific expert personas until quality hits 90+
- **RB2B Router** does intent scoring, seniority-based company dedup, and agency classification before routing to outbound sequences
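
For readers unfamiliar with the statistics named in the Growth Engine bullet, here is a minimal sketch of a percentile bootstrap confidence interval for a difference in means. It is illustrative only: the function name `bootstrap_diff_ci`, its defaults, and the sample numbers are assumptions and are not taken from `experiment-engine.py`.

```python
import random
from statistics import mean


def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) (hypothetical helper)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original sizes.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up impression counts for two post formats.
thread = [1200, 1350, 1100, 1500, 1420, 1280]
single = [600, 720, 650, 580, 700, 690]
low, high = bootstrap_diff_ci(thread, single)
print(f"95% CI for the difference in means: [{low:.0f}, {high:.0f}]")
```

If the interval excludes zero, the difference is unlikely to be noise at the chosen confidence level. A rank-based test such as Mann-Whitney U answers a related question without assuming anything about the shape of the distributions.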
---
## 📁 Repository Structure
```
ai-marketing-skills/
├── README.md                          ← You are here
├── growth-engine/                     ← Autonomous experiments
│   ├── SKILL.md
│   ├── experiment-engine.py
│   ├── pacing-alert.py
│   ├── autogrowth-weekly-scorecard.py
│   └── ...
├── sales-pipeline/                    ← Visitor → pipeline automation
│   ├── SKILL.md
│   ├── rb2b_instantly_router.py
│   ├── deal_resurrector.py
│   ├── trigger_prospector.py
│   ├── icp_learning_analyzer.py
│   └── ...
├── content-ops/                       ← Quality scoring & production
│   ├── SKILL.md
│   ├── scripts/
│   ├── experts/                       ← 9 expert panel definitions
│   ├── scoring-rubrics/               ← 5 scoring rubric templates
│   └── ...
├── outbound-engine/                   ← Cold outbound automation
│   ├── SKILL.md
│   ├── scripts/
│   ├── references/                    ← ICP template, copy rules
│   └── ...
├── seo-ops/                           ← SEO intelligence
│   ├── SKILL.md
│   ├── content_attack_brief.py
│   ├── gsc_client.py
│   ├── trend_scout.py
│   └── ...
└── finance-ops/                       ← Financial analysis
    ├── SKILL.md
    ├── scripts/
    ├── references/                    ← Metrics, rates, ROI models
    └── ...
```
---
## 🤝 Contributing
Found a bug? Have an improvement? PRs welcome.
1. Fork the repo
2. Create your feature branch (`git checkout -b feature/better-scoring`)
3. Commit your changes
4. Push to the branch
5. Open a Pull Request
---
## 📄 License
MIT License. Use these however you want.
---
## 🏢 About
Built by the marketing engineering team at [Single Grain](https://www.singlegrain.com). We help B2B companies grow with AI-powered marketing and sales operations.
**Want these skills managed for you?** [Talk to us](https://www.singlegrain.com/contact/) — we run these systems for companies doing $10M-$500M in revenue.
---
*Star this repo if you find it useful. It helps others discover these tools.*

36
content-ops/.env.example Normal file

@ -0,0 +1,36 @@
# ── Required ──
# Anthropic API key (for LLM-powered features: editorial brain, content transform, expert panel)
ANTHROPIC_API_KEY=sk-ant-...
# ── Optional: Data directory ──
# Override default data directory (default: ./data/)
# CONTENT_OPS_DATA_DIR=./data
# ── Optional: Editorial Brain ──
# Override default model for editorial brain clip discovery
# EDITORIAL_BRAIN_MODEL=claude-sonnet-4-20250514
# ── Optional: Quote Mining Engine ──
# Path to feeds JSON config: {"Feed Name": "https://feed-url/rss", ...}
# QUOTE_MINING_FEEDS_FILE=./config/feeds.json
# Inline feeds JSON (alternative to file)
# QUOTE_MINING_FEEDS={"My Podcast": "https://feeds.example.com/rss"}
# Directory containing meeting notes (markdown files) to scan for quotes
# QUOTE_MINING_NOTES_DIR=./notes/
# Speaker name to extract from meeting notes (e.g., "John Smith")
# QUOTE_MINING_SPEAKER=
# ── Optional: Content Transform ──
# Voice configuration file (markdown describing your brand voice)
# VOICE_CONFIG_FILE=./config/voice.md
# Style guide file (markdown with writing style rules)
# STYLE_GUIDE_FILE=./config/style-guide.md

168
content-ops/README.md Normal file

@ -0,0 +1,168 @@
# AI Content Ops
**Ship content that scores 90+ every time. Automatically.**
Most content teams publish and pray. This pipeline scores, gates, and iterates every piece of content through an AI expert panel before it goes live. Nothing ships below 90/100.
## What's Inside
### 🎯 Expert Panel (`SKILL.md`)
Claude Code skill that auto-assembles a panel of 7-10 domain experts tailored to whatever you're scoring. Works on:
- Blog posts, social content, email sequences
- Landing pages, ads, CTAs
- Strategy docs, pitch decks, charts
- Recruiting outreach, vendor evaluations
- Literally anything that needs a quality gate
The panel scores your content, identifies weaknesses, revises, and loops until every expert scores 90+. Max 3 rounds. Includes a 1.5x-weighted AI Writing Detector that catches all 24 known AI writing patterns.
### 🚦 Content Quality Gate (`scripts/content-quality-gate.py`)
CI/CD-style gate for your content pipeline. Runs the quality scorer on a batch of drafts and filters out anything below threshold. Nothing publishes without passing.
### 📊 Content Quality Scorer (`scripts/content-quality-scorer.py`)
Automated scoring engine with 5 dimensions:
- **Voice similarity** (35%) — matches your brand voice patterns
- **Specificity** (25%) — real numbers, named entities, concrete examples
- **AI slop penalty** (20%) — detects and penalizes 50+ banned AI words and 8 AI writing patterns
- **Length appropriateness** (10%) — platform-specific character limits
- **Engagement potential** (10%) — hooks, CTAs, debate invitations
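
As a rough illustration of how the five weights above could combine into one number, here is a minimal sketch. It assumes every dimension is already scored 0-100 with higher meaning better (so a high `slop_penalty` score means little slop); the function name `composite_score` and the example values are hypothetical, and the real scorer may combine dimensions differently.

```python
WEIGHTS = {
    "voice_similarity": 0.35,
    "specificity": 0.25,
    "slop_penalty": 0.20,
    "length_appropriateness": 0.10,
    "engagement_potential": 0.10,
}


def composite_score(dimension_scores):
    """Weighted sum of per-dimension scores, each assumed to be 0-100."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


# Made-up scores for a single draft.
example = {
    "voice_similarity": 80,
    "specificity": 90,
    "slop_penalty": 100,
    "length_appropriateness": 70,
    "engagement_potential": 60,
}
print(round(composite_score(example), 1))
```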
### 🧠 Editorial Brain (`scripts/editorial-brain.py`)
Two-pass LLM analysis for finding clip-worthy moments in video transcripts:
1. **Pass 1**: Scans transcript chunks for candidate moments (hook → build → payoff arcs)
2. **Pass 2**: Deep-scores each candidate on hook/build/payoff/clean-cut (0-100)
3. Only 90+ clips get cut
Fundamentally different from keyword matching. Thinks like a human editor.
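
The two-pass flow above can be sketched as a small pipeline. This is a shape-only sketch under stated assumptions: `two_pass_select`, `toy_find_candidates`, and `toy_deep_score` are hypothetical names, and the toy functions stand in for the LLM calls that the real script makes.

```python
def two_pass_select(chunks, find_candidates, deep_score, min_score=90):
    """Pass 1 collects candidates per chunk; pass 2 scores and filters them."""
    candidates = []
    for chunk in chunks:
        candidates.extend(find_candidates(chunk))
    scored = [(deep_score(candidate), candidate) for candidate in candidates]
    kept = [pair for pair in scored if pair[0] >= min_score]
    # Highest-scoring clips first.
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Toy stand-ins so the sketch runs without an API key.
def toy_find_candidates(chunk):
    return [part.strip() for part in chunk.split("|") if part.strip()]


TOY_SCORES = {"cold open": 94, "tangent": 41, "big reveal": 91}


def toy_deep_score(candidate):
    return TOY_SCORES.get(candidate, 0)


clips = two_pass_select(
    ["cold open | tangent", "big reveal"], toy_find_candidates, toy_deep_score
)
print(clips)
```

Keeping the cheap scan and the expensive deep score as separate callables means only the candidates from pass 1 pay for the second call.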
### ⛏️ Quote Mining Engine (`scripts/quote-mining-engine.py`)
Scans podcast RSS feeds and meeting notes to extract quotable, contrarian, viral-worthy moments. Scoring heuristics:
- Contrarian signals (wrong, myth, overrated, secret...)
- Specificity signals ($amounts, percentages, multipliers)
- Emotional triggers (fear, love, shocking, AI...)
- Shareability signals (how to, framework, lesson learned...)
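
The heuristics above lend themselves to simple pattern matching. Below is a minimal sketch that uses only the example keywords listed in the bullets; the regexes, the 25-points-per-category scoring, and the names `signal_hits` and `heuristic_score` are assumptions for illustration, not the engine's actual rules.

```python
import re

SIGNALS = {
    "contrarian": re.compile(r"\b(?:wrong|myth|overrated|secret)\b", re.I),
    # Dollar amounts, percentages, and multipliers such as "3x".
    "specificity": re.compile(
        r"\$\d[\d,.]*|\b\d+(?:\.\d+)?%|\b\d+(?:\.\d+)?x\b", re.I
    ),
    "emotional": re.compile(r"\b(?:fear|love|shocking|AI)\b", re.I),
    "shareability": re.compile(r"\b(?:how to|framework|lesson learned)\b", re.I),
}


def signal_hits(text):
    """Count regex matches per signal category."""
    return {name: len(rx.findall(text)) for name, rx in SIGNALS.items()}


def heuristic_score(text, points_per_category=25):
    """Award a flat number of points for each category with at least one hit."""
    hits = signal_hits(text)
    return points_per_category * sum(1 for count in hits.values() if count)


quote = "Most SEO advice is wrong: we cut spend 40% and grew 3x."
print(signal_hits(quote), heuristic_score(quote))
```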
### 🔄 Content Transform (`scripts/content-transform.py`)
Repurposes long-form content into platform-native formats:
- **X threads/posts** — punchy, data-driven, with ASCII diagrams
- **LinkedIn posts** — hook before the fold, story arc, engagement CTA
- **YouTube Short scripts** — HOOK/SETUP/PAYOFF/CTA structure with visual cues
- **Newsletter sections** — scannable, value-dense, "why this matters"
Includes optional expert panel integration for iterative quality improvement.
## Quick Start
```bash
# 1. Install dependencies
pip install -r requirements.txt
# 2. Set up environment
cp .env.example .env
# Edit .env with your API keys
# 3. Score a batch of content drafts
python scripts/content-quality-scorer.py --input drafts.json --verbose
# 4. Run the quality gate
python scripts/content-quality-gate.py --input drafts.json --threshold 70
# 5. Mine quotes from your podcast RSS
python scripts/quote-mining-engine.py --days 90 --min-score 60
# 6. Find clip-worthy moments in a video
python scripts/editorial-brain.py --url "https://youtube.com/watch?v=..." --min-score 90
# 7. Transform content atoms into platform drafts
python scripts/content-transform.py --atoms atoms.json --top-n 10
```
## Configuration
All scripts use environment variables for configuration. See `.env.example` for the full list.
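
For example, the quote-mining feeds can come from either of the two variables documented in `.env.example`. The sketch below shows one way a script might resolve them; the helper name `load_feeds` and the choice to let the inline value win over the file are assumptions, not confirmed behavior.

```python
import json
import os


def load_feeds(env=None):
    """Return {feed name: feed URL} from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    inline = env.get("QUOTE_MINING_FEEDS")
    if inline:
        # Assumption: inline JSON takes precedence over the file path.
        return json.loads(inline)
    path = env.get("QUOTE_MINING_FEEDS_FILE")
    if path:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    return {}


print(load_feeds({"QUOTE_MINING_FEEDS": '{"My Podcast": "https://feeds.example.com/rss"}'}))
```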
### Voice Customization
The quality scorer and content transformer use configurable voice patterns. Edit these in your `.env` or pass custom config files:
- `VOICE_MARKERS` — regex patterns that signal your brand voice
- `BANNED_WORDS` — AI slop vocabulary to penalize
- `PLATFORM_LIMITS` — character limits per platform
### Scoring Weights
Adjust scoring weights via a JSON config file:
```json
{
"weights": {
"voice_similarity": 0.35,
"specificity": 0.25,
"slop_penalty": 0.20,
"length_appropriateness": 0.10,
"engagement_potential": 0.10
},
"threshold": 70
}
```
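
If you edit the weights by hand, a quick sanity check helps catch typos. The sketch below assumes the weights should cover exactly the five documented dimensions and sum to 1.0, and that the threshold sits on a 0-100 scale; the scripts themselves may not enforce any of this, and `load_scoring_config` is a hypothetical name.

```python
import json
import math

EXPECTED_DIMENSIONS = {
    "voice_similarity",
    "specificity",
    "slop_penalty",
    "length_appropriateness",
    "engagement_potential",
}


def load_scoring_config(raw_json):
    """Parse the config and reject weights that look inconsistent."""
    config = json.loads(raw_json)
    weights = config["weights"]
    if set(weights) != EXPECTED_DIMENSIONS:
        raise ValueError("weights must cover exactly the five documented dimensions")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights should sum to 1.0")
    threshold = config.get("threshold", 70)
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    return weights, threshold


EXAMPLE = """
{
  "weights": {
    "voice_similarity": 0.35,
    "specificity": 0.25,
    "slop_penalty": 0.20,
    "length_appropriateness": 0.10,
    "engagement_potential": 0.10
  },
  "threshold": 70
}
"""
weights, threshold = load_scoring_config(EXAMPLE)
print(weights, threshold)
```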
## Expert Panel Domains
Pre-built expert panels included:
- `experts/humanizer.md` — AI writing detection (24 patterns, mandatory)
- `experts/x-articles.md` — X/Twitter long-form posts
- `experts/linkedin.md` — LinkedIn posts
- `experts/newsletter.md` — Email newsletters
- `experts/youtube-shorts.md` — YouTube Shorts scripts
- `experts/instagram.md` — Instagram visual content
- `experts/podcast-quotes.md` — Podcast quote cards
- `experts/recruiting.md` — Recruiting outreach
- `experts/seo-strategy.md` — SEO strategy docs
Scoring rubrics:
- `scoring-rubrics/content-quality.md` — Blog, social, email, scripts
- `scoring-rubrics/strategic-quality.md` — Strategy and analysis
- `scoring-rubrics/conversion-quality.md` — Landing pages, ads, CTAs
- `scoring-rubrics/visual-quality.md` — Charts, infographics, slides
- `scoring-rubrics/evaluation-quality.md` — Candidate/vendor evaluations
## Input Formats
### Content Drafts (for scorer/gate)
```json
{
"drafts": [
{
"id": "draft-001",
"platform": "x",
"draft": "Your content text here..."
}
]
}
```
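
To show how this format feeds a threshold gate, here is a minimal sketch that splits drafts into passed and rejected. `run_gate` and `toy_score` are hypothetical; the toy scorer is a placeholder for the real quality scorer, and 70 mirrors the threshold used in the Quick Start.

```python
def run_gate(payload, score_fn, threshold=70):
    """Split drafts into (passed, rejected) by comparing scores to a threshold."""
    passed, rejected = [], []
    for draft in payload["drafts"]:
        score = score_fn(draft["draft"], draft["platform"])
        record = {"id": draft["id"], "score": score}
        (passed if score >= threshold else rejected).append(record)
    return passed, rejected


def toy_score(text, platform):
    # Placeholder scorer: rewards drafts that contain at least one digit.
    return 80 if any(ch.isdigit() for ch in text) else 40


payload = {
    "drafts": [
        {"id": "draft-001", "platform": "x", "draft": "We cut CAC by 32% in 90 days."},
        {"id": "draft-002", "platform": "x", "draft": "Marketing is changing fast."},
    ]
}
passed, rejected = run_gate(payload, toy_score)
print(passed, rejected)
```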
### Content Atoms (for transformer)
```json
{
"atoms": [
{
"id": "atom-001",
"content": "Long-form source content...",
"tags": ["AI", "marketing"],
"platforms_missing": ["x", "linkedin"],
"repurpose_score": 8
}
]
}
```
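
One plausible reading of `--top-n` is "take the atoms with the highest `repurpose_score` that still have platforms to cover". The sketch below implements that reading; both the ranking key and the filter on `platforms_missing` are assumptions about the transformer, and `top_atoms` is a hypothetical name.

```python
def top_atoms(payload, top_n=10):
    """Pick the highest-scoring atoms that still lack at least one platform."""
    candidates = [atom for atom in payload["atoms"] if atom.get("platforms_missing")]
    ranked = sorted(candidates, key=lambda atom: atom["repurpose_score"], reverse=True)
    return ranked[:top_n]


atoms_payload = {
    "atoms": [
        {"id": "atom-001", "content": "...", "platforms_missing": ["x", "linkedin"], "repurpose_score": 8},
        {"id": "atom-002", "content": "...", "platforms_missing": [], "repurpose_score": 9},
        {"id": "atom-003", "content": "...", "platforms_missing": ["x"], "repurpose_score": 5},
    ]
}
selected = top_atoms(atoms_payload, top_n=2)
print([atom["id"] for atom in selected])
```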
## Architecture
```
Content Source → Content Transform → Quality Scorer → Quality Gate → Publish
                         ↑                                  ↓
                   Expert Panel ←── Revision Loop (max 3 rounds)
```
The pipeline is modular. Use any script standalone or wire them together.
## License
MIT

231
content-ops/SKILL.md Normal file

@ -0,0 +1,231 @@
---
name: expert-panel
description: >-
Score, evaluate, and iteratively improve any content or strategy using an
auto-assembled panel of domain experts. Handles copy, sequences, landing pages,
strategy docs, titles, charts, recruiting evaluations, or anything else that
needs a quality gate. Recursively iterates until all scores hit 90+ (max 3
rounds). Use when asked to: "expert panel this", "score this", "rate these
variants", "quality check this", "panel review", "which version is better",
"expert score", "evaluate this copy/strategy/page", or when another skill
needs a quality gate on its output. Also triggers on: "score this landing page",
"expert panel these email variants", "rate this headline", "panel these charts".
---
# Expert Panel
General-purpose scoring and iterative improvement engine. Auto-assembles the
right experts for whatever is being evaluated, scores it, and loops until 90+.
---
## Step 1: Intake — Understand What's Being Scored
Collect or infer from context:
1. **Content/artifact** — The thing(s) to score (paste, file path, or URL)
2. **Content type** — Copy, sequence, landing page, strategy, title, chart, candidate eval, etc.
3. **Offer context** — What's being sold/promoted? To whom? What domain/industry?
4. **Variants** — Are there multiple versions to compare? (A/B/C)
5. **Source skill** — Is this output from another skill? (e.g., cold-outbound-optimizer)
If yes, note the source for feedback-to-source routing in Step 6.
If context is obvious from the conversation, don't ask — just proceed.
---
## Step 2: Auto-Assemble the Expert Panel
Build a panel of **7-10 experts** tailored to the content type and domain.
### Assembly rules
1. **Start with content-type experts.** Read `experts/` directory for pre-built panels matching
the content type. If an exact match exists (e.g., `experts/linkedin.md` for a LinkedIn post),
use it as the base.
2. **Add domain/offer experts.** Based on the offer context, add 1-3 experts who understand
the specific industry or domain. Examples:
- Scoring bakery marketing → add Food & Beverage Marketing Expert
- Scoring SaaS landing page → add SaaS Conversion Expert
- Scoring recruiting outreach → add Agency Recruiter + Talent Market Expert
- Scoring medical device copy → add Healthcare Compliance Expert
3. **Always include these two:**
- **AI Writing Detector** — See `experts/humanizer.md`. Weight: 1.5x. Non-negotiable.
- **Brand Voice Match** — Checks alignment with the configured brand voice and
known rejection patterns from `references/patterns.md` (if present).
4. **Check learned patterns.** If `references/patterns.md` exists, read it. If any patterns
apply to this content type, brief the panel on them. Dock points for known-bad patterns.
5. **Cap at 10 experts.** If you have more than 10, merge overlapping roles.
### Panel output format
List each expert with: Name, lens/focus, what they check.
---
## Step 3: Select Scoring Rubric
Choose the appropriate rubric from `scoring-rubrics/`:
| Content type | Rubric file |
|---|---|
| Blog, social, email, newsletter, scripts | `scoring-rubrics/content-quality.md` |
| Strategy, recommendations, analysis | `scoring-rubrics/strategic-quality.md` |
| Landing pages, ads, CTAs | `scoring-rubrics/conversion-quality.md` |
| Charts, data viz, infographics | `scoring-rubrics/visual-quality.md` |
| Candidate evaluations | `scoring-rubrics/evaluation-quality.md` |
| Other | Synthesize a rubric from the two closest matches |
Read the selected rubric file for detailed criteria and point allocation.
---
## Step 4: Score — Recursive Loop Until 90+
**Target: 90/100 across all experts. Non-negotiable. Max 3 rounds.**
### Each round produces:
```
## Round [N] — Score: [AVG]/100
| Expert | Score | Key Feedback |
|--------|-------|--------------|
| [Name] | [0-100] | [One-line rationale] |
| ... | ... | ... |
**Aggregate:** [weighted average — humanizer at 1.5x]
**Top 3 weaknesses:** [ranked]
**Changes made:** [specific edits addressing each weakness]
```
Then the revised content/artifact.
### Rules
- Scores must be brutally honest. No padding to 90.
- Humanizer score weighted 1.5x in the aggregate.
- If aggregate < 90: identify the top 3 weaknesses, revise, and run the next round.
- If aggregate ≥ 90: finalize and proceed to output.
- After 3 rounds, if still < 90: return best version with honest score + note on what's
holding it back.
- Show ALL rounds in output — the iteration trail is part of the value.
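
The aggregate and the loop can be sketched in a few lines. This is one interpretation of the rules above, not a required implementation: it scores at most three times, revises between rounds, and returns the best-scoring version if the target is never reached. `weighted_aggregate`, `run_panel`, and the toy callables are hypothetical names.

```python
def weighted_aggregate(scores, humanizer="AI Writing Detector", humanizer_weight=1.5):
    """Weighted average of expert scores with the humanizer counted at 1.5x."""
    total = 0.0
    weight_sum = 0.0
    for expert, score in scores.items():
        weight = humanizer_weight if expert == humanizer else 1.0
        total += weight * score
        weight_sum += weight
    return total / weight_sum


def run_panel(content, score_round, revise, target=90, max_rounds=3):
    """Score, revise, and repeat until the target is hit or rounds run out."""
    history = []
    for round_no in range(1, max_rounds + 1):
        scores = score_round(content)
        aggregate = weighted_aggregate(scores)
        history.append((round_no, aggregate, content))
        if aggregate >= target:
            return content, aggregate, history
        if round_no < max_rounds:
            content = revise(content, scores)
    # Target never reached: return the best version with its honest score.
    _, best_score, best_content = max(history, key=lambda entry: entry[1])
    return best_content, best_score, history


# Toy callables: every revision adds "!" and is worth 3 points.
def toy_score_round(content):
    score = 85 + 3 * content.count("!")
    return {"AI Writing Detector": score, "Copywriter": score}


def toy_revise(content, scores):
    return content + "!"


final, final_score, rounds = run_panel("draft", toy_score_round, toy_revise)
print(final, final_score, len(rounds))
```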
### Variant comparison mode
When scoring multiple variants (A/B/C):
- Score each variant independently through the full panel.
- After scoring, rank variants by aggregate score.
- If top variant is < 90, iterate on the best one (don't iterate all of them).
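
As a sketch of the selection step in variant mode, assuming each variant already has an aggregate score (`pick_variant` is a hypothetical helper and the scores are illustrative):

```python
def pick_variant(aggregates, target=90):
    """Rank variants by aggregate and report whether the best one needs iteration."""
    ranking = sorted(aggregates.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranking[0]
    return ranking, best_name, best_score < target


ranking, best, needs_iteration = pick_variant({"A": 87, "B": 82, "C": 91})
print(ranking, best, needs_iteration)
```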
---
## Step 5: Output Format
### Winner + Score (always at top)
```
## 🏆 Result: [SCORE]/100 — [PASS ✅ | NEEDS WORK ⚠️]
[Final content/artifact here]
**Iterations:** [N] rounds
**Panel:** [Expert names, comma-separated]
```
If variants: show winner first, then runner-up scores.
```
## 🏆 Winner: Variant [X] — [SCORE]/100
[Winning content]
### Runner-up scores
- Variant A: 87/100
- Variant B: 82/100
- Variant C: 91/100 ← Winner
```
### Feedback History (below the result)
Show full scoring rounds.
```
---
<details>
<summary>📊 Scoring History (N rounds)</summary>
[All round tables from Step 4]
</details>
```
---
## Step 6: Feedback-to-Source (When Scoring Another Skill's Output)
When the scored content came from another skill, generate a **Source Improvement Brief**:
```
## 🔁 Feedback for [Source Skill]
### What scored low
- [Pattern]: [Specific example from this content]
### Suggested skill improvements
- [Concrete change to the source skill's process/rubric/prompt]
### Patterns to add to source skill
- [Any recurring weakness that should become a rule]
```
This brief can be used to update the source skill's SKILL.md or rubrics.
---
## Step 7: Memory — Learn from Approvals and Rejections
After the user approves or rejects panel output:
### On approval (score ≥ 90, user accepts)
Note what worked. No action needed unless a new positive pattern emerges.
### On rejection (user overrides the panel or rejects 90+ content)
1. Ask why (or infer from context).
2. Add a new pattern to `references/patterns.md` using this format:
```markdown
## [Pattern Name]
- **Type:** rejection | preference | override
- **Content types:** [which types this applies to]
- **Rule:** [What to always/never do]
- **Example:** [The specific instance that triggered this]
- **Date:** [YYYY-MM-DD]
- **Point dock:** [-N points when detected]
```
3. Confirm: "Added pattern: [one-line summary]. Panel will dock [N] points for this going forward."
### Pattern enforcement
Every scoring round, check `references/patterns.md` against the content. Apply point docks
before expert scoring begins. This means known-bad patterns are penalized even if individual
experts miss them.
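
If you want to inspect `references/patterns.md` outside the agent, the entry format above is easy to parse. The sketch below reads entries into dictionaries and extracts the point dock as a positive integer. The sample entry is invented, `parse_patterns` is a hypothetical helper, and deciding whether a pattern actually applies to a piece of content is still the panel's job.

```python
import re

FIELD = re.compile(r"^- \*\*(?P<key>[^:*]+):\*\*\s*(?P<value>.*)$")


def parse_patterns(markdown):
    """Parse '## Name' sections with '- **Key:** value' fields into dicts."""
    patterns = []
    current = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            current = {"name": line[3:].strip()}
            patterns.append(current)
        elif current is not None:
            match = FIELD.match(line)
            if match:
                current[match["key"].strip().lower()] = match["value"].strip()
    for pattern in patterns:
        # Store the dock as a positive number of points to subtract.
        number = re.search(r"\d+", pattern.get("point dock", ""))
        pattern["dock"] = int(number.group()) if number else 0
    return patterns


SAMPLE = '''
## No rhetorical-question openers
- **Type:** rejection
- **Content types:** LinkedIn posts
- **Rule:** Never open with a rhetorical question
- **Example:** "Ever wondered why your ads flop?"
- **Date:** 2026-03-27
- **Point dock:** -10
'''
parsed = parse_patterns(SAMPLE)
print(parsed[0]["name"], parsed[0]["dock"])
```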
---
## Reference Files
| File | Purpose | When to read |
|---|---|---|
| `experts/humanizer.md` | AI writing detection rubric (24 patterns) | Every scoring run |
| `experts/[domain].md` | Pre-built expert panels for common domains | When domain matches |
| `scoring-rubrics/content-quality.md` | Content scoring rubric | Content scoring |
| `scoring-rubrics/strategic-quality.md` | Strategy scoring rubric | Strategy scoring |
| `scoring-rubrics/conversion-quality.md` | Landing page/ad/CTA rubric | Conversion scoring |
| `scoring-rubrics/visual-quality.md` | Chart/data viz/infographic rubric | Visual scoring |
| `scoring-rubrics/evaluation-quality.md` | Candidate/assessment rubric | Eval scoring |
| `references/patterns.md` | Learned rejection patterns | Every scoring run |
| `references/expert-assembly.md` | Domain-expert examples for auto-assembly | When building unfamiliar panels |


@ -0,0 +1,4 @@
{
"My Marketing Podcast": "https://feeds.example.com/marketing-podcast/rss",
"Industry Show": "https://feeds.example.com/industry-show/rss"
}


@ -0,0 +1,145 @@
# Expert Panel: AI Writing Detector (Humanizer)
## Context
- Based on the 24 AI writing patterns from Wikipedia's "Signs of AI writing" guide
- This expert scores drafts on how AI-generated they sound
- Scoring: 0 = obviously AI-generated, 100 = indistinguishable from human
- This should be the LAST check before any draft is finalized
## Scoring Rubric
### Banned Vocabulary (instant -5 per occurrence)
delve, tapestry, landscape (abstract), leverage, multifaceted, nuanced, pivotal, realm, robust, seamless, testament, transformative, underscore (verb), utilize, whilst, keen, embark, comprehensive, intricate, commendable, meticulous, paramount, groundbreaking, innovative, cutting-edge, synergy, holistic, paradigm, ecosystem, Additionally, align with, crucial, enduring, enhance, fostering, garner, highlight (verb), interplay, intricacies, showcase, vibrant, valuable, profound, renowned, breathtaking, nestled, stunning
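
A mechanical pre-check for the banned list could look like the sketch below. It uses a small subset of the words above and exact whole-word matching, so inflected forms such as "leveraging" slip through; context-dependent entries like "landscape (abstract)" need a human or LLM judgment and are left out. `banned_word_penalty` is a hypothetical helper.

```python
import re

# Subset of the banned list, for illustration only.
BANNED = ["delve", "tapestry", "leverage", "robust", "seamless", "utilize", "synergy", "holistic"]
BANNED_RX = re.compile(r"\b(?:" + "|".join(map(re.escape, BANNED)) + r")\b", re.I)


def banned_word_penalty(text, per_hit=5):
    """Return the banned words found and the total penalty at 5 points each."""
    hits = [match.group(0).lower() for match in BANNED_RX.finditer(text)]
    return hits, per_hit * len(hits)


found, penalty = banned_word_penalty("We leverage a robust, seamless stack.")
print(found, penalty)
```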
### The 24 Patterns to Flag
#### CONTENT PATTERNS
**1. Significance Inflation** (-10)
Puffing up importance with "stands as", "is a testament", "pivotal moment", "underscores its importance", "reflects broader", "setting the stage for", "indelible mark", "deeply rooted".
- Before: "This initiative marked a pivotal moment in the evolution of digital marketing."
- After: "The company launched its first programmatic ad campaign in 2019."
**2. Undue Notability Claims** (-5)
Listing media mentions without context. "Active social media presence", "leading expert".
- Before: "His insights have been featured in Forbes, Inc, and Entrepreneur."
- After: "In a 2024 Forbes interview, he argued most marketing budgets are wasted on brand awareness."
**3. Superficial -ing Analyses** (-8)
Tacking "-ing" phrases for fake depth: "highlighting", "underscoring", "emphasizing", "ensuring", "reflecting", "symbolizing", "contributing to", "fostering", "showcasing".
- Before: "The platform grew 40% YoY, showcasing the team's commitment to innovation and highlighting the importance of user experience."
- After: "The platform grew 40% YoY. Most of that came from a single referral loop they built in Q2."
**4. Promotional Language** (-8)
"Boasts a", "vibrant", "rich" (figurative), "profound", "exemplifies", "commitment to", "natural beauty", "nestled", "in the heart of", "must-visit".
- Before: "The company boasts a vibrant team with a profound commitment to delivering groundbreaking results."
- After: "The company has 45 employees. Revenue grew 32% last year."
**5. Vague Attributions** (-8)
"Industry reports", "Experts argue", "Some critics argue", "several sources". No specific citations.
- Before: "Experts believe AI will transform the marketing landscape."
- After: "A 2024 Gartner survey found 67% of CMOs plan to increase AI spend next year."
**6. Formulaic "Challenges and Future" Sections** (-10)
"Despite its X, faces challenges...", "Despite these challenges, continues to Y", "Future Outlook".
- Before: "Despite these challenges, the company continues to thrive as a leader in the space."
- After: "Customer churn hit 8% in Q3. They hired a retention team in October."
#### LANGUAGE AND GRAMMAR PATTERNS
**7. AI Vocabulary Clustering** (-10)
Multiple banned words in same paragraph. See banned list above.
- Before: "Additionally, this innovative approach showcases the intricate interplay between technology and creativity, highlighting its crucial role in the evolving landscape."
- After: "The tool saves about 3 hours per week on content scheduling. That's it."
**8. Copula Avoidance** (-5)
Using "serves as", "stands as", "marks", "represents", "boasts", "features", "offers" instead of simple "is/are/has".
- Before: "The newsletter serves as a valuable resource for marketers."
- After: "The newsletter is a resource for marketers. 12K subscribers open it weekly."
**9. Negative Parallelisms** (-5)
"Not only...but...", "It's not just about X, it's Y", "It's not merely X, it's Y".
- Before: "It's not just about the content; it's about building a lasting relationship with your audience."
- After: "Good content gets replies. That's how you build an audience."
**10. Rule of Three Overuse** (-8)
Forcing ideas into groups of three. Triple adjectives, triple nouns, triple parallel clauses.
- Before: "The event features keynote sessions, panel discussions, and networking opportunities."
- After: "The event has talks and panels. There's also time for networking between sessions."
**11. Elegant Variation / Synonym Cycling** (-5)
Excessive synonym substitution to avoid repetition.
- Before: "The CEO shared his vision. The business leader outlined the strategy. The company head detailed the plan."
- After: "The CEO shared his vision and outlined the strategy."
**12. False Ranges** (-5)
"From X to Y" where X and Y aren't on a meaningful scale.
- Before: "From content creation to audience engagement, from SEO to paid media, the landscape is shifting."
- After: "Content, SEO, and paid media are all changing. Here's what actually matters."
#### STYLE PATTERNS
**13. Em Dash Overuse** (-5)
More than 1 em dash per 200 words. AI uses them for "punchy" sales writing.
**14. Overuse of Boldface** (-3)
Mechanical bold emphasis on every key term.
**15. Inline-Header Vertical Lists** (-5)
Lists where every item starts with a bolded header + colon.
**16. Title Case in Headings** (-3)
Capitalizing All Main Words In Every Heading.
**17. Emoji Decoration** (-5)
Emojis on headings or bullet points (🚀💡✅).
**18. Curly Quotation Marks** (-2)
Using “ ” instead of " ".
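
Two of the style patterns above (13 and 18) are countable without an LLM. The sketch below is one interpretation: it allows one em dash per started block of 200 words, which is an assumption about how the "per 200 words" rule is meant to scale, and it checks only double curly quotes. Both function names are hypothetical.

```python
import math

EM_DASH = "\u2014"
CURLY_DOUBLE_QUOTES = "\u201c\u201d"


def em_dash_overuse(text, per_words=200):
    """Flag text with more em dashes than one per started block of 200 words."""
    words = len(text.split())
    allowed = max(1, math.ceil(words / per_words))
    return text.count(EM_DASH) > allowed


def has_curly_quotes(text):
    """True if the text contains a curly double quotation mark."""
    return any(ch in CURLY_DOUBLE_QUOTES for ch in text)


sample = "a \u2014 b \u2014 c"
print(em_dash_overuse(sample), has_curly_quotes("\u201chi\u201d"))
```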
#### COMMUNICATION PATTERNS
**19. Collaborative Artifacts** (-10)
"I hope this helps", "Of course!", "Certainly!", "Would you like...", "let me know", "here is a...".
**20. Knowledge-Cutoff Disclaimers** (-10)
"As of [date]", "While specific details are limited", "based on available information".
**21. Sycophantic Tone** (-8)
"Great question!", "You're absolutely right!", "That's an excellent point!"
#### FILLER AND HEDGING
**22. Filler Phrases** (-5 each)
"In order to" → "To". "Due to the fact that" → "Because". "At this point in time" → "Now". "It is important to note that" → just state it.
**23. Excessive Hedging** (-8)
"Could potentially possibly", "might have some effect", "it could be argued that".
- Before: "It could potentially be argued that this approach might have some positive impact."
- After: "This approach works. Here's the data."
**24. Generic Positive Conclusions** (-10)
"The future looks bright", "Exciting times lie ahead", "continues their journey toward excellence".
- Before: "The future looks bright for AI in marketing. Exciting times lie ahead."
- After: "They plan to double their AI budget next quarter. We'll see if it pays off."
## Scoring Method
Start at 100. Deduct points for each pattern detected (penalties listed above). Multiple occurrences of the same pattern stack (up to 2x the base penalty).
- **90-100**: Human-sounding. Clean.
- **70-89**: Minor AI tells. Quick fixes needed.
- **50-69**: Obvious AI patterns. Significant rewrite needed.
- **0-49**: Reads like ChatGPT output. Full rewrite.
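
The arithmetic can be sketched as follows. This reads "stack (up to 2x the base penalty)" as a per-pattern cap of twice the base penalty and floors the score at 0; both are interpretations. Per-occurrence penalties (banned vocabulary, filler phrases) are left out because the rubric does not say whether the cap applies to them. The pattern keys and function names are hypothetical.

```python
BASE_PENALTIES = {
    "significance_inflation": 10,
    "promotional_language": 8,
    "em_dash_overuse": 5,
    "curly_quotes": 2,
}


def humanizer_score(detections, base_penalties=BASE_PENALTIES):
    """Start at 100 and deduct per pattern, capping each pattern at 2x its base."""
    score = 100
    for pattern, count in detections.items():
        base = base_penalties[pattern]
        score -= min(count * base, 2 * base)
    return max(score, 0)


def band(score):
    """Map a score to the bands listed above."""
    if score >= 90:
        return "Human-sounding"
    if score >= 70:
        return "Minor AI tells"
    if score >= 50:
        return "Obvious AI patterns"
    return "Full rewrite"


result = humanizer_score({"significance_inflation": 3, "em_dash_overuse": 1})
print(result, band(result))
```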
## What Good Looks Like
Good human writing has:
- Opinions, not just reporting
- Varied sentence rhythm (short punches + longer ones)
- Specific details over vague claims
- Simple verbs (is, has, does) over elaborate constructions
- Acknowledgment of uncertainty or mixed feelings
- First-person perspective when appropriate
- Humor, edge, or personality
- Concrete examples with names, dates, numbers


@ -0,0 +1,28 @@
# Expert Panel: Instagram Visual Content
## Context
- Focus on Instagram infographic and data-driven post captions
- Data-forward, insight-dense, and visually bold content
- Captions should be punchy, hashtagged, and scroll-stopping
## The 6 Experts
1. **Visual Impact Scorer** — Is the image concept scroll-stopping? Would someone pause their scroll for this graphic? Does the headline/hook on the visual create an immediate "I need to read this" reaction? Checks: bold contrast, clear hierarchy, data visualization quality, thumb-stopping composition.
2. **Caption Copywriter** — Is the caption punchy and platform-native? First line is the hook before "more" truncation. Body delivers the insight in 2-3 tight sentences. Hashtags are relevant and placed at the end. No fluff, no filler. Checks: hook strength, hashtag relevance (4-8 tags), caption length (ideal 125-200 chars), CTA presence.
3. **Data Accuracy Checker** — Is the stat or insight correct and properly sourced? Is this a real data point, not a vague "studies show" claim? Is the number specific, recent (within 12 months preferred), and directly relevant? Checks: specificity of data, source existence, recency, no hallucinated stats.
4. **Timeliness Validator** — Is this insight still relevant and interesting today? Has this exact take already flooded Instagram this week? Does the topic align with current conversations in the target space? Checks: topic freshness, differentiation from overposted angles.
5. **Brand Consistency** — Does this match the configured brand voice? Direct, data-driven, slightly contrarian, ROI-obsessed. No motivation porn. No vague inspiration. Only specific, actionable, data-backed insights. Checks: brand voice alignment, topic fit, no generic "hustle" content.
6. **AI Writing Detector (Humanizer)** — 1.5x weight. Checks caption text specifically. Instagram captions fail when they sound AI-generated. See `experts/humanizer.md` for the full 24-pattern rubric. Special Instagram flags: forced hashtag stuffing, overly polished corporate tone, "💡 Key insight:" formatting, sycophantic openers, significance inflation.
## Scoring Criteria
- **Visual hook** — Would someone stop scrolling for this?
- **Caption punch** — First sentence earns the "more" tap
- **Data credibility** — Real numbers, real sources
- **Timeliness** — Fresh angle, not stale take
- **Brand fit** — Matches configured brand voice
- **Human voice** — Reads like a real person, not a content bot


@ -0,0 +1,25 @@
# Expert Panel: LinkedIn Posts
## The 11 Experts
1. **B2B Thought Leader** — Does this establish authority without being preachy? Would a CMO reshare this?
2. **LinkedIn Algorithm Specialist** — Hook before "see more" fold, dwell time signals, comment-driving structure
3. **Storytelling Coach** — Is there a real story? Personal anecdote? Emotional arc?
4. **Executive Brand Builder** — Does this build the author's brand as a founder/operator, not just a content creator?
5. **Engagement Optimizer** — Will this get comments, not just likes? Is there a debate hook?
6. **Hook Writer** — First 2-3 lines before the fold. Would you click "see more"?
7. **Professional Copywriter** — Professional but not corporate. Warm but not soft. Every sentence counts.
8. **Data Visualization Expert** — Are numbers presented compellingly? Could a stat be formatted as a callout?
9. **Community Builder** — Does this invite conversation? Does it make readers feel part of something?
10. **Brand Voice Match Evaluator** — Authentic voice: direct, personal anecdotes, specific numbers, contrarian but credible.
11. **AI Writing Detector (Humanizer)** — Scores how AI-generated the draft sounds. Checks all 24 humanizer patterns. See `experts/humanizer.md` for full rubric. This expert's score is weighted 1.5x.
## Scoring Criteria
- **Hook before "see more" fold** — First 2-3 lines must compel the click
- **Story arc** — Setup → insight → takeaway
- **Professional but not corporate** — No jargon, no "I'm excited to announce"
- **Personal anecdotes** — Real stories from experience
- **Specific data** — Numbers, percentages, dollar amounts
- **Engagement drivers** — Questions, debate hooks, "what would you do?"
- **Line spacing/readability** — Short paragraphs, white space, scannable
- **CTA that drives comments** — Not "like and share" but genuine engagement prompts

@ -0,0 +1,23 @@
# Expert Panel: Newsletter
## The 11 Experts
1. **Email Marketer** — Deliverability, open rate optimization, sender reputation signals
2. **Newsletter Growth Expert** — Is this the kind of content that drives forwards and referrals?
3. **Copywriter** — Every sentence earns its place. No filler. Punchy and clear.
4. **Data Journalist** — Are claims backed by data? Are sources credible and recent?
5. **CTA Specialist** — Is there a clear action? Does the reader know what to do next?
6. **Subject Line Expert** — Would you open this email? 40-50 chars, curiosity or value signal
7. **Retention Specialist** — Will subscribers stay after reading this? Does it deliver on the promise?
8. **Layout/Formatting Coach** — Scannable? Headers, bullets, bold text for skimmers?
9. **Value-Per-Word Optimizer** — Information density. Could this be 20% shorter and still deliver?
10. **Brand Voice Match Evaluator** — Does this sound like the author's newsletter voice: direct, data-rich, actionable?
11. **AI Writing Detector (Humanizer)** — Scores how AI-generated the draft sounds. See `experts/humanizer.md` for full rubric. This expert's score is weighted 1.5x.
## Scoring Criteria
- **Value density** — Every paragraph teaches something specific
- **Scannability** — Headers, bullets, bold. A skimmer gets 80% of the value
- **"Why this matters" clarity** — Reader knows immediately why they should care
- **CTA clarity** — One clear next action
- **Would I forward this?** — The ultimate newsletter test
- **Subject line** — Opens the email, sets expectations

@ -0,0 +1,34 @@
# Expert Panel: Podcast Quote Cards
## Context
- Focus on quote cards extracted from podcast episodes or guest appearances
- Quote cards live on Instagram, LinkedIn, and X — they must work visually AND as text
- Target audience: marketers, agency owners, founders, operators
## The 6 Experts
1. **Quote Impact Scorer** — Is this actually quotable? Would someone screenshot this and send it to a friend? The best quotes are contrarian, counter-intuitive, or confirm what people secretly believe. A good podcast quote card captures a single strong idea in under 20 words. Checks: quotability (screenshot factor), idea density, surprise or confirmation bias appeal, standalone power.
2. **Context Validator** — Does the quote make sense without the full episode? Quote cards get ripped from context constantly. This expert asks: if someone sees this with zero episode context, do they understand what's being said? Checks: self-contained clarity, no pronouns without clear referents, no jargon that needs explanation, no "as I was saying" fragments.
3. **Attribution Accuracy** — Is the quote attributed correctly? Correct speaker name and title. No misattribution, no paraphrase presented as direct quote. Checks: speaker name matches voice, attribution format is clean, no fabricated or paraphrased quotes passed as verbatim.
4. **Audience Relevance** — Does the target audience care about this topic? Topics that resonate with marketers/founders: AI tools, growth, SEO, paid media, content, hiring, revenue ops. Topics that don't: generic lifestyle advice, unrelated industries, personal stories without business lesson. Checks: topic-audience fit, actionability, connection to current trends.
5. **Visual Text Scorer** — Will this text read well on an image card? Quote cards are read at thumbnail size on mobile. Long quotes fail. Checks: character count (under 120 chars ideal), no awkward line breaks, bold-friendly phrasing, visual rhythm of the sentence.
6. **AI Writing Detector (Humanizer)** — 1.5x weight. Applies to both the quote text and the caption. Quotes from real podcast episodes should sound natural and human. Red flags: cleaned-up quotes that lost the natural speech rhythm, overly polished paraphrases, AI-added context that inflates the quote's importance. See `experts/humanizer.md` for the full rubric.
## Scoring Criteria
- **Screenshot factor** — Would someone save and share this?
- **Self-contained** — Makes sense without episode context
- **Attribution accuracy** — Correct speaker, correct format
- **Audience fit** — Relevant to marketers/founders
- **Visual readability** — Works at small size on mobile
- **Human voice** — Sounds like a real person said this
## Quote Standards
- Under 120 characters is ideal for visual cards
- Direct quotes only — no paraphrasing
- Attribution: "— [Name]" or "[Name] on [Show Name]"
- Caption: 1-2 sentences + 3-5 hashtags
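
The measurable parts of the quote standards can be checked mechanically before the panel scores a card. The sketch below covers only the character limit and the hashtag count; the name `check_quote_card` and the hashtag regex are assumptions made for illustration, and the attribution format and sentence count are left to human review.

```python
import re
from typing import List

MAX_QUOTE_CHARS = 120               # "Under 120 characters is ideal for visual cards"
MIN_HASHTAGS, MAX_HASHTAGS = 3, 5   # "Caption: 1-2 sentences + 3-5 hashtags"


def check_quote_card(quote: str, caption: str) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if len(quote) >= MAX_QUOTE_CHARS:
        problems.append(
            f"quote is {len(quote)} chars (want under {MAX_QUOTE_CHARS})"
        )
    # Assumption: a hashtag is '#' followed by word characters.
    hashtags = re.findall(r"#\w+", caption)
    if not MIN_HASHTAGS <= len(hashtags) <= MAX_HASHTAGS:
        problems.append(
            f"caption has {len(hashtags)} hashtags "
            f"(want {MIN_HASHTAGS}-{MAX_HASHTAGS})"
        )
    return problems


ok = check_quote_card(
    "Most SEO advice is written for 2019.",
    "He said it plainly. #seo #marketing #growth",
)
bad = check_quote_card("x" * 130, "Too long to read on a phone. #seo")
```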

@ -0,0 +1,21 @@
# Expert Panel: Recruiting
## The 10 Experts
1. **Agency Recruiter** — Understands agency culture, pace, and what makes someone thrive vs burn out
2. **Talent Acquisition Leader** — Pipeline strategy, sourcing channels, employer branding
3. **Hiring Manager** — Day-to-day fit. Can this person actually do the job on day one?
4. **Culture Fit Assessor** — Values alignment, team dynamics, growth mindset indicators
5. **Compensation Analyst** — Is the offer competitive? Market rate awareness
6. **LinkedIn Sourcer** — Profile signals, career trajectory patterns, red flags in work history
7. **Diversity Specialist** — Diverse perspectives, inclusive hiring practices, bias checks
8. **Startup Hiring Expert** — Can this person handle ambiguity, wear multiple hats, move fast?
9. **AI Fluency Evaluator** — Does this candidate use AI tools? Can they leverage AI in their role?
10. **Industry Insider** — Understands the relevant industry landscape, competitor talent pools
## Scoring Criteria
- **Candidate-role fit** — Skills, experience, and trajectory match the role requirements
- **Evidence quality** — Claims backed by portfolio, metrics, references (not just resume bullets)
- **Risk assessment accuracy** — Honest about gaps, flight risk, culture mismatch potential
- **Outreach angle creativity** — What would make this person respond to a cold message?
- **AI fluency signal strength** — Evidence of AI tool usage, automation mindset, future-readiness

@ -0,0 +1,22 @@
# Expert Panel: SEO Strategy
## The 10 Experts
1. **Technical SEO Specialist** — Site architecture, crawlability, Core Web Vitals, structured data
2. **Content Strategist** — Topic clusters, content gaps, SERP intent alignment
3. **Conversion Rate Optimizer** — Does the SEO strategy connect to revenue, not just traffic?
4. **Revenue Attribution Expert** — Can we tie this recommendation to dollar outcomes?
5. **Competitive Analyst** — What are competitors doing? Where are the gaps?
6. **AI/AEO Specialist** — How does this strategy account for AI Overviews, ChatGPT citations, Perplexity?
7. **Data Scientist** — Is the analysis statistically sound? Are trends real or noise?
8. **Growth Hacker** — What's the fastest path to measurable results?
9. **Operations Expert** — Is this feasible with current resources and timelines?
10. **ROI Calculator** — Does this pass a 4:1 ROI bar? What's the expected return?
## Scoring Criteria
- **Data backing** — Real data cited, not projections or assumptions
- **Actionable specificity** — Clear next steps, not vague "optimize your content"
- **ROI estimate quality** — Realistic, with assumptions stated
- **Risk assessment** — Honest about what could go wrong
- **Feasibility** — Can this actually be executed with available resources?
- **Alignment with priorities** — Serves current business goals

@ -0,0 +1,30 @@
# Expert Panel: X Articles (Long-Form X Posts)
## Context
- Focus on X ARTICLES (long-form X posts), not just threads
- These are meaty, value-dense posts that stop the scroll and deliver insight
## The 11 Experts
1. **Viral X Writer** — Judges structure, pacing, and viral mechanics. Does this follow the patterns that get 100K+ impressions?
2. **Engagement Strategist** — Analyzes reply-bait, shareability, and algorithm signals. Will this get engagement or just impressions?
3. **Hook Specialist** — First 2 lines only. Would you stop scrolling? Is there a curiosity gap, contrarian claim, or surprising stat?
4. **Data Storytelling Expert** — Are the numbers specific, recent, and surprising? Are they woven into narrative or just dropped in?
5. **Contrarian Positioning Coach** — Is there a genuine contrarian angle? Or is this just conventional wisdom repackaged?
6. **CTA Optimizer** — Does the ending drive action? Comments, follows, saves? Is it natural or forced?
7. **Audience Growth Expert** — Will this attract NEW followers or just engage existing ones? Does it signal expertise?
8. **Algorithm Specialist** — Post length, formatting, engagement signals. Will X's algorithm boost this?
9. **Copywriter** — Sentence-level quality. Short punchy sentences? No filler? Every word earns its place?
10. **Brand Voice Match Evaluator** — Does this sound like a real person wrote it? Authentic voice: direct, personal anecdotes, specific numbers, contrarian but credible.
11. **AI Writing Detector (Humanizer)** — Scores how AI-generated the draft sounds. Checks all 24 humanizer patterns: banned vocabulary, significance inflation, formulaic structures, vague attributions, promotional language, hedging, em dash overuse, triple structures, generic conclusions. See `experts/humanizer.md` for full rubric. This expert's score is weighted 1.5x — if it flags the draft as AI-sounding, the draft MUST be revised.
## Scoring Criteria
- **Hook in first 2 lines** — Would you stop scrolling for this?
- **Data specificity** — Real numbers, not vague claims
- **Contrarian angle** — Genuine insight, not clickbait
- **Story arc** — Setup → tension → payoff
- **Voice authenticity** — Sounds like a real person, not a content mill
- **CTA strength** — Natural engagement driver
- **Readability** — Short paragraphs, line breaks, scannable
- **Shareability** — "I need to repost this"
- **Visual elements** — At least one ASCII diagram or visual element

@ -0,0 +1,24 @@
# Expert Panel: YouTube Shorts
## The 11 Experts
1. **Short-Form Creator** — Does this work as a standalone piece? Would you watch it on your For You page?
2. **Retention Curve Specialist** — Where will viewers drop off? Is every second justified?
3. **Script Doctor** — Is the script tight? No wasted words? Clear structure?
4. **Visual Storytelling Expert** — What should be on screen at each moment? B-roll, text overlays, screen shares?
5. **TikTok/Reels Crossover Expert** — Would this work cross-platform? Format-native for each?
6. **Pacing Coach** — Is the energy right? No dead spots? Builds momentum?
7. **Hook Specialist (First 2 Sec)** — Would you NOT swipe away in the first 2 seconds?
8. **Payoff Designer** — Does the ending deliver? Is there a satisfying resolution or surprise?
9. **Re-watch Optimizer** — Is there a loop? A detail you'd catch on second watch?
10. **Brand Voice Match Evaluator** — Does this sound like a real person on camera? Direct, confident, slightly irreverent?
11. **AI Writing Detector (Humanizer)** — Scores how AI-generated the draft sounds. See `experts/humanizer.md` for full rubric. This expert's score is weighted 1.5x.
## Scoring Criteria
- **Hook in first 2 seconds** — Pattern interrupt, surprising claim, or visual hook
- **Setup-payoff structure** — Clear promise → delivery
- **30-60 sec runtime** — Tight, no filler
- **Visual cue quality** — Text overlays, B-roll suggestions, screen share moments
- **"Would I watch this twice?"** — Re-watch value
- **Shareability** — "Send this to someone who needs to hear this"
- **CTA** — Natural, not forced ("Comment X and I'll show you")

@ -0,0 +1,74 @@
# Expert Assembly Guide
Examples of domain-specific experts to add based on offer context. Use this when
auto-assembling panels for unfamiliar domains.
## Assembly principle
The panel needs experts who understand both the **craft** (how to make good content/strategy)
and the **domain** (the specific market, audience, and offer being scored).
---
## Domain Expert Examples
### SaaS / Software
- SaaS Conversion Expert — free trial vs demo, PLG patterns, activation metrics
- Developer Audience Specialist — if targeting devs, knows what resonates vs cringe
- Pricing Page Analyst — tier structure, anchoring, feature comparison tables
### E-Commerce / DTC
- DTC Brand Strategist — unboxing, retention loops, subscription models
- Product Page Optimizer — hero images, reviews, urgency without fakeness
- Email/SMS Commerce Expert — abandoned cart, post-purchase, winback flows
### Healthcare / Medical
- Healthcare Compliance Expert — HIPAA, FDA advertising rules, claim substantiation
- Patient Communication Specialist — empathy without condescension, plain language
- Medical Professional Audience Expert — if targeting HCPs, clinical credibility
### Financial Services
- FinServ Compliance Reviewer — SEC/FINRA advertising rules, disclaimers
- Trust & Authority Expert — credential signaling, risk communication
- Retail Investor Audience Specialist — jargon translation, fear/greed calibration
### Food & Beverage / Restaurant
- Food Marketing Expert — appetite appeal, sensory language, seasonal hooks
- Local Business Marketing Specialist — geo-targeting, community signals
- Visual Food Stylist — photography/visual standards for food content
### Professional Services / Agency
- B2B Services Buyer Expert — what CMOs/VPs actually respond to
- Case Study Analyst — proof structure, metrics that matter, client story arc
- Competitive Positioning Expert — differentiation in crowded service markets
### Education / Courses
- Course Launch Expert — urgency, social proof, transformation promise
- Curriculum Designer — learning outcomes, module structure, completion optimization
- Student Success Storyteller — before/after, specific outcomes, relatable journeys
### Real Estate
- Real Estate Marketing Expert — listing copy, neighborhood selling, visual standards
- Luxury Market Specialist — if high-end, understands aspiration vs information
- Lead Nurture Expert — long sales cycles, drip sequence optimization
---
## Universal Experts (always consider)
These roles apply to nearly any domain:
- **Audience Empathy Expert** — Does the scorer actually understand the target audience's
daily reality, pain points, and language?
- **Competitive Context Expert** — What else is the audience seeing? Is this differentiated
or just another version of what everyone says?
- **Offer Clarity Expert** — Can someone understand what they get, what it costs, and what
happens next in under 10 seconds?
---
## When no pre-built panel exists
1. Identify the content type → pick 3-4 craft experts (copywriter, designer, strategist, etc.)
2. Identify the domain → pick 2-3 domain experts from above or synthesize new ones
3. Add humanizer (mandatory) and brand voice match (mandatory)
4. Cap at 10, merge overlapping roles
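
The four assembly steps above can be sketched as a small function. This is an illustration under stated assumptions: `assemble_panel` is a hypothetical helper, "merge overlapping roles" is reduced to exact-name de-duplication, and the two mandatory roles are given reserved seats so the cap of 10 never drops them.

```python
from typing import List

MANDATORY = ["AI Writing Detector (Humanizer)", "Brand Voice Match Evaluator"]
MAX_PANEL_SIZE = 10


def assemble_panel(craft_experts: List[str], domain_experts: List[str]) -> List[str]:
    """Combine craft and domain experts, add the mandatory roles, cap at 10."""
    panel: List[str] = []
    for name in [*craft_experts, *domain_experts]:
        # Crude stand-in for "merge overlapping roles": skip exact duplicates.
        if name not in panel and name not in MANDATORY:
            panel.append(name)
    # Reserve seats for the mandatory roles before applying the cap.
    panel = panel[: MAX_PANEL_SIZE - len(MANDATORY)]
    return panel + MANDATORY


panel = assemble_panel(
    craft_experts=["Copywriter", "Hook Specialist", "CTA Specialist"],
    domain_experts=["SaaS Conversion Expert", "Pricing Page Analyst", "Copywriter"],
)
```

In practice, merging overlapping roles needs judgment (for example, folding a "Hook Writer" into a "Hook Specialist"); the exact-name check here only catches literal repeats.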

@ -0,0 +1,15 @@
# Learned Patterns
Patterns learned from content approvals and rejections. The expert panel checks these
before scoring begins and docks points for known-bad patterns.
<!-- Add patterns as they are learned. Format:
## [Pattern Name]
- **Type:** rejection | preference | override
- **Content types:** [which types this applies to]
- **Rule:** [What to always/never do]
- **Example:** [The specific instance that triggered this]
- **Date:** [YYYY-MM-DD]
- **Point dock:** [-N points when detected]
-->
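
Entries written in the format above are regular enough to parse before scoring. The sketch below assumes entries are stored exactly as templated (a `## ` heading followed by `- **Key:** value` lines); `parse_patterns`, `point_dock`, and the sample entry are hypothetical and are not taken from the repo.

```python
import re
from typing import Dict, List

FIELD_RE = re.compile(r"^- \*\*(?P<key>[^:]+):\*\*\s*(?P<value>.*)$")

# Hypothetical entry, written in the template's format.
SAMPLE = """\
## Generic hustle opener
- **Type:** rejection
- **Content types:** linkedin, x
- **Rule:** Never open with a motivational one-liner
- **Example:** Draft opening with "Rise and grind" was rejected
- **Date:** 2026-03-01
- **Point dock:** -5
"""


def parse_patterns(markdown: str) -> List[Dict[str, str]]:
    """Turn '## Name' sections and their '- **Key:** value' lines into dicts."""
    patterns: List[Dict[str, str]] = []
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            patterns.append({"name": line[3:].strip()})
        elif patterns:
            match = FIELD_RE.match(line)
            if match:
                patterns[-1][match["key"].strip().lower()] = match["value"].strip()
    return patterns


def point_dock(pattern: Dict[str, str]) -> int:
    """Extract the signed integer from the 'Point dock' field (0 if absent)."""
    match = re.search(r"-?\d+", pattern.get("point dock", ""))
    return int(match.group()) if match else 0


patterns = parse_patterns(SAMPLE)
dock = point_dock(patterns[0])
```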

@ -0,0 +1,7 @@
# Core dependencies
anthropic>=0.39.0 # Claude API client (for LLM-powered features)
feedparser>=6.0.0 # RSS feed parsing (quote mining engine)
# Optional: for video clip cutting (editorial brain)
# yt-dlp # YouTube subtitle/video download (install separately)
# ffmpeg # Video cutting (install via system package manager)

@ -0,0 +1,25 @@
## Content Quality Rubric (0-100)
### Hook Power (0-25)
- 0-5: Generic, no reason to keep reading
- 6-15: Interesting but not urgent
- 16-20: Strong curiosity gap or contrarian claim
- 21-25: Impossible to scroll past. Specific, surprising, personal.
### Voice Authenticity (0-25)
- Does this sound like a real person wrote it?
- Short punchy sentences? Specific numbers? Personal framing?
- No corporate jargon? No filler words?
- Contrarian but backed by data?
### Value Density (0-25)
- Every sentence earns its place
- Specific data points, not vague claims
- Actionable insight, not just observation
- "I learned something I can use today"
### Engagement Potential (0-25)
- Would someone share/repost this?
- Does the CTA invite genuine response?
- Does it spark debate or agreement?
- Platform-native formatting?
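
Because the rubric is four dimensions of 0-25 points each, the 0-100 total is a plain sum. A minimal sketch, assuming integer sub-scores; the dictionary keys and function names are illustrative, and only Hook Power has published bands in the rubric, so only that dimension gets a band lookup.

```python
from typing import Dict

DIMENSIONS = ("hook_power", "voice_authenticity", "value_density", "engagement_potential")

# Upper bound of each Hook Power band, with the rubric's description.
HOOK_BANDS = [
    (5, "Generic, no reason to keep reading"),
    (15, "Interesting but not urgent"),
    (20, "Strong curiosity gap or contrarian claim"),
    (25, "Impossible to scroll past"),
]


def hook_band(score: int) -> str:
    """Map a 0-25 Hook Power score to its rubric band."""
    if not 0 <= score <= 25:
        raise ValueError(f"hook score out of range: {score}")
    for upper, label in HOOK_BANDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable: bands cover 0-25")


def total_score(scores: Dict[str, int]) -> int:
    """Sum the four 0-25 dimensions into a 0-100 total."""
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= 25:
            raise ValueError(f"{name} out of range: {scores[name]}")
    return sum(scores[name] for name in DIMENSIONS)


example = {
    "hook_power": 18,
    "voice_authenticity": 20,
    "value_density": 15,
    "engagement_potential": 12,
}
total = total_score(example)
band = hook_band(example["hook_power"])
```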

@ -0,0 +1,27 @@
## Conversion Quality Rubric (0-100)
For landing pages, ads, CTAs, signup flows, pricing pages.
### Headline / Hero (0-25)
- 0-5: Generic, no clear value prop
- 6-15: Communicates offer but not compelling
- 16-20: Clear value prop with specificity
- 21-25: Impossible to bounce. Specific, urgent, addresses the visitor's exact pain.
### Clarity & Friction (0-25)
- Is the offer immediately obvious? (3-second test)
- Can a visitor complete the desired action without confusion?
- Are there unnecessary form fields, steps, or distractions?
- Does copy match the traffic source expectation?
### Social Proof & Trust (0-25)
- Specific results (numbers, names, companies) vs vague testimonials
- Trust signals (logos, security badges, guarantees) present and credible
- Case studies or data points that prove the claim
- No fake urgency or manufactured scarcity
### CTA Strength (0-25)
- CTA copy specific to the action ("Get my audit" > "Submit")
- CTA visible without scrolling
- Single clear primary action (no competing CTAs)
- Micro-copy reduces anxiety ("No credit card required", "2-minute setup")

@ -0,0 +1,27 @@
## Evaluation Quality Rubric (0-100)
For candidate assessments, vendor evaluations, tool comparisons, opportunity scoring.
### Evidence Quality (0-25)
- Claims backed by data, portfolio, references, or verifiable metrics
- No resume-bullet-level assertions without proof
- Specific examples cited (projects, outcomes, timelines)
- Red flags acknowledged, not glossed over
### Criteria Relevance (0-25)
- Evaluation criteria match the actual role/need
- Weighted by what matters most (not equal weight to everything)
- Context-appropriate (startup vs enterprise, junior vs senior)
- Anti-criteria considered (what would make this a bad fit?)
### Risk Assessment (0-25)
- Honest about gaps, unknowns, and flight risk
- Mitigation strategies suggested for identified risks
- Comparison to alternatives or market baseline
- No false confidence — uncertainty stated clearly
### Actionability (0-25)
- Clear recommendation (hire/pass/shortlist, buy/skip, proceed/wait)
- Next steps defined
- Decision criteria transparent
- Dissenting view included if panel is split

@ -0,0 +1,21 @@
## Strategic Quality Rubric (0-100)
### Data Foundation (0-25)
- Real data cited, not projections
- Sources verifiable
- Numbers specific and recent
### Actionability (0-25)
- Clear next step
- Timeline realistic
- Resources identified
### ROI Clarity (0-25)
- 4:1 minimum demonstrated
- Costs estimated
- Comparison to alternatives
### Risk Assessment (0-25)
- Honest about what could go wrong
- Mitigation plan included
- Dependencies identified
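
The 4:1 minimum under ROI Clarity is simple arithmetic once return and cost are estimated. The sketch below assumes "4:1" means expected return divided by estimated cost (gross return, not net profit); the rubric does not spell this out, and the function names are hypothetical.

```python
MIN_ROI_RATIO = 4.0  # the rubric's "4:1 minimum"


def roi_ratio(expected_return: float, estimated_cost: float) -> float:
    """Expected return per dollar of cost."""
    if estimated_cost <= 0:
        raise ValueError("estimated_cost must be positive")
    return expected_return / estimated_cost


def passes_roi_bar(expected_return: float, estimated_cost: float,
                   minimum: float = MIN_ROI_RATIO) -> bool:
    """True when the recommendation clears the minimum ROI ratio."""
    return roi_ratio(expected_return, estimated_cost) >= minimum


# $10,000 of cost needs at least $40,000 of expected return to clear 4:1.
clears = passes_roi_bar(40_000, 10_000)
falls_short = passes_roi_bar(30_000, 10_000)
```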

@ -0,0 +1,27 @@
## Visual Quality Rubric (0-100)
For charts, data visualizations, infographics, diagrams, slide decks.
### Data Accuracy & Integrity (0-25)
- Numbers match the source data
- Axes labeled correctly, scales not misleading
- No cherry-picked timeframes or truncated axes
- Source cited
### Visual Clarity (0-25)
- Can a viewer understand the main point in under 5 seconds?
- Labels readable at expected display size
- Color choices accessible (colorblind-safe)
- No chart junk (unnecessary gridlines, 3D effects, decorative elements)
### Insight Delivery (0-25)
- Does the visualization tell a story or just display data?
- Is the "so what?" obvious without explanation?
- Annotations highlight the key takeaway
- Title states the insight, not just the topic ("Revenue doubled in Q3" > "Revenue by quarter")
### Design & Polish (0-25)
- Consistent typography and color palette
- Proper alignment and spacing
- Brand-appropriate styling
- Mobile/thumbnail readable if applicable

@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""
Content Quality Gate: CI/CD-style gate for content publishing.

Filters drafts through the quality scorer before they publish.
Nothing goes live without passing automated quality scoring.

Usage:
    python content-quality-gate.py --input drafts.json
    python content-quality-gate.py --input drafts.json --conservative
    python content-quality-gate.py --input drafts.json --threshold 75
"""
import json
import os
import sys
import argparse
from pathlib import Path
from datetime import datetime, timezone
import subprocess

SCRIPT_DIR = Path(__file__).resolve().parent
PROJECT_DIR = SCRIPT_DIR.parent
DATA_DIR = Path(os.environ.get("CONTENT_OPS_DATA_DIR", PROJECT_DIR / "data"))
DRAFTS_INPUT_FILE = DATA_DIR / "content-drafts-latest.json"
DRAFTS_OUTPUT_FILE = DATA_DIR / "content-drafts-filtered.json"
QUALITY_SCORES_FILE = DATA_DIR / "quality-scores-latest.json"


def run_quality_scorer(input_file, verbose=False):
    """Run the quality scorer on the drafts file."""
    scorer_script = SCRIPT_DIR / "content-quality-scorer.py"
    # Reuse the current interpreter so the scorer runs in the same environment.
    cmd = [
        sys.executable,
        str(scorer_script),
        "--input", str(input_file)
    ]
    if verbose:
        cmd.append("--verbose")
    print(f"🔍 Running quality scorer...")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"❌ Quality scorer failed:")
        print(f"STDOUT: {result.stdout}")
        print(f"STDERR: {result.stderr}")
        return False
    if verbose:
        print(result.stdout)
    return True


def load_quality_scores():
    """Load the latest quality scoring results."""
    if not QUALITY_SCORES_FILE.exists():
        print(f"❌ Quality scores file not found: {QUALITY_SCORES_FILE}")
        return None
    try:
        with open(QUALITY_SCORES_FILE) as f:
            return json.load(f)
    except Exception as e:
        print(f"❌ Error loading quality scores: {e}")
        return None


def filter_drafts_by_quality(drafts, quality_results, conservative_mode=False):
    """Filter drafts based on quality scores."""
    if not quality_results or "results" not in quality_results:
        # Fails open: without scores, every draft is returned unfiltered.
        print("❌ No quality results available for filtering")
        return drafts, []

    passed_ids = set()
    failed_drafts = []
    quality_by_id = {}
    for result in quality_results["results"]:
        draft_id = result.get("draft_id")
        quality_by_id[draft_id] = result
        if result.get("passed", False):
            passed_ids.add(draft_id)
        else:
            failed_drafts.append({
                "draft_id": draft_id,
                "platform": result.get("platform"),
                "score": result.get("total_score"),
                "reasons": result.get("failure_reasons", [])
            })

    filtered_drafts = []
    for draft in drafts:
        draft_id = draft.get("id")
        # Attach quality metadata to every draft that was scored.
        if draft_id in quality_by_id:
            quality_info = quality_by_id[draft_id]
            draft["quality_score"] = quality_info.get("total_score")
            draft["quality_passed"] = quality_info.get("passed")
            draft["quality_reasons"] = quality_info.get("failure_reasons", [])
            draft["quality_scored_at"] = quality_info.get("scored_at")
        # Conservative mode keeps every draft; otherwise keep only passing drafts.
        if conservative_mode:
            filtered_drafts.append(draft)
        elif draft_id in passed_ids:
            filtered_drafts.append(draft)
    return filtered_drafts, failed_drafts


def save_filtered_drafts(original_data, filtered_drafts, quality_results):
    """Save filtered drafts with quality metadata."""
    filtered_data = original_data.copy()
    filtered_data["drafts"] = filtered_drafts
    filtered_data["filtered_at"] = datetime.now(timezone.utc).isoformat()
    filtered_data["quality_gate_applied"] = True
    filtered_data["original_draft_count"] = original_data.get("draft_count", len(original_data.get("drafts", [])))
    filtered_data["filtered_draft_count"] = len(filtered_drafts)
    filtered_data["quality_threshold"] = quality_results.get("threshold")
    filtered_data["quality_pass_rate"] = quality_results.get("pass_rate")
    filtered_data["quality_average_score"] = quality_results.get("average_score")
    filtered_data["draft_count"] = len(filtered_drafts)
    DRAFTS_OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    with open(DRAFTS_OUTPUT_FILE, 'w') as f:
        json.dump(filtered_data, f, indent=2)
    return filtered_data


def run_quality_gate(input_file=None, conservative_mode=False, verbose=False):
    """Run the complete quality gate process."""
    input_path = Path(input_file) if input_file else DRAFTS_INPUT_FILE
    if not input_path.exists():
        print(f"❌ Input file not found: {input_path}")
        return None
    try:
        with open(input_path) as f:
            original_data = json.load(f)
        drafts = original_data.get("drafts", [])
        print(f"📊 Loaded {len(drafts)} drafts from {input_path}")
    except Exception as e:
        print(f"❌ Error loading drafts: {e}")
        return None
    if not drafts:
        print("❌ No drafts found in input file")
        return None

    # Score, then load the scorer's output file and filter against it.
    if not run_quality_scorer(input_path, verbose):
        return None
    quality_results = load_quality_scores()
    if not quality_results:
        return None
    filtered_drafts, failed_drafts = filter_drafts_by_quality(drafts, quality_results, conservative_mode)
    filtered_data = save_filtered_drafts(original_data, filtered_drafts, quality_results)

    original_count = len(drafts)
    filtered_count = len(filtered_drafts)
    filtered_out = original_count - filtered_count
    print(f"\n{'='*60}")
    print(f"QUALITY GATE RESULTS")
    print(f"{'='*60}")
    print(f"Original drafts: {original_count}")
    print(f"Passed quality gate: {filtered_count}")
    print(f"Filtered out: {filtered_out}")
    print(f"Pass rate: {quality_results.get('pass_rate', 0):.1f}%")
    print(f"Average score: {quality_results.get('average_score', 0):.1f}/100")
    print(f"Threshold: {quality_results.get('threshold', 60)}/100")
    if conservative_mode:
        print(f"\n⚠️ CONSERVATIVE MODE: All drafts passed through with quality flags")

    # Per-platform counts of the drafts that made it through.
    platform_stats = {}
    for draft in filtered_drafts:
        platform = draft.get("platform", "unknown")
        platform_stats[platform] = platform_stats.get(platform, 0) + 1
    if platform_stats:
        print(f"\n📱 Filtered Drafts by Platform:")
        for platform, count in sorted(platform_stats.items()):
            print(f" {platform}: {count}")

    # Tally failure reasons and report the three most common.
    if failed_drafts:
        failure_reasons = {}
        for failed in failed_drafts:
            for reason in failed["reasons"]:
                failure_reasons[reason] = failure_reasons.get(reason, 0) + 1
        if failure_reasons:
            print(f"\n❌ Top Failure Reasons:")
            for reason, count in sorted(failure_reasons.items(), key=lambda x: x[1], reverse=True)[:3]:
                print(f" {reason}: {count} drafts")

    print(f"\n💾 Filtered drafts saved to: {DRAFTS_OUTPUT_FILE}")
    if filtered_count == 0:
        print("\n⚠️ WARNING: No drafts passed quality gate!")
        print("Consider lowering threshold or improving content quality.")
        return None
    return filtered_data


def main():
    parser = argparse.ArgumentParser(description="Filter content drafts through quality gate")
    parser.add_argument("--input", type=str, help="Input drafts JSON file")
    parser.add_argument("--conservative", action="store_true", help="Pass all drafts but add quality flags")
    parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
    parser.add_argument("--threshold", type=float, help="Override quality threshold")
    args = parser.parse_args()

    # Compare against None so an explicit "--threshold 0" is not silently ignored.
    if args.threshold is not None:
        weights_file = DATA_DIR / "quality-scorer-weights.json"
        if weights_file.exists():
            try:
                with open(weights_file) as f:
                    weights_data = json.load(f)
                weights_data["threshold"] = args.threshold
                with open(weights_file, 'w') as f:
                    json.dump(weights_data, f, indent=2)
                print(f"🎯 Set threshold to {args.threshold}")
            except Exception as e:
                print(f"⚠ Could not update threshold: {e}")
        else:
            # The override is persisted via the weights file, so it needs one to exist.
            print(f"⚠ Weights file not found, threshold override ignored: {weights_file}")

    filtered_data = run_quality_gate(
        input_file=args.input,
        conservative_mode=args.conservative,
        verbose=args.verbose
    )
    if filtered_data:
        filtered_count = filtered_data.get("filtered_draft_count", 0)
        if filtered_count > 0:
            print(f"\n📤 Next: Pass filtered drafts to your publishing pipeline")
        else:
            print(f"\n⚠️ No drafts to publish. Consider:")
            print(f" • Lowering threshold: --threshold 50")
            print(f" • Conservative mode: --conservative")
            print(f" • Improving content quality in transform step")


if __name__ == "__main__":
    main()
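
The gate script above reads two JSON documents: the drafts file and the scorer's results file. The sketch below reconstructs their shapes from the keys the gate accesses (`drafts`, `id`, `platform`, `results`, `draft_id`, `passed`, `total_score`, `failure_reasons`, `scored_at`, `threshold`, `pass_rate`, `average_score`). All values are hypothetical sample data, and the real files may carry additional fields.

```python
# Shape of the drafts file (e.g. content-drafts-latest.json). Values are made up.
drafts_file = {
    "drafts": [
        {"id": "d1", "platform": "linkedin"},
        {"id": "d2", "platform": "x"},
    ]
}

# Shape of the scorer output (quality-scores-latest.json), inferred from the
# keys the gate reads. Values are made up.
scores_file = {
    "threshold": 60,
    "pass_rate": 50.0,
    "average_score": 63.5,
    "results": [
        {
            "draft_id": "d1",
            "platform": "linkedin",
            "total_score": 82,
            "passed": True,
            "failure_reasons": [],
            "scored_at": "2026-03-27T00:00:00+00:00",
        },
        {
            "draft_id": "d2",
            "platform": "x",
            "total_score": 45,
            "passed": False,
            "failure_reasons": ["ai_slop"],  # hypothetical reason label
            "scored_at": "2026-03-27T00:00:00+00:00",
        },
    ],
}

# Default (non-conservative) behaviour: keep only drafts whose result passed.
passed_ids = {r["draft_id"] for r in scores_file["results"] if r.get("passed", False)}
kept = [d for d in drafts_file["drafts"] if d["id"] in passed_ids]
```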

@ -0,0 +1,525 @@
#!/usr/bin/env python3
"""
Content Quality Scorer: automated content scoring engine.

Scores drafts against configurable voice patterns BEFORE they publish.
Five scoring dimensions: voice similarity, specificity, AI slop detection,
length appropriateness, and engagement potential.

Input: JSON file with drafts array
Output: scored drafts with pass/fail recommendations

Usage:
    python content-quality-scorer.py --input drafts.json --verbose
    python content-quality-scorer.py --input drafts.json --threshold 75
    python content-quality-scorer.py --init-weights  # Create default weights file
"""
import json
import re
import os
import sys
import argparse
from pathlib import Path
from datetime import datetime, timezone
from collections import Counter
import math

# ── Configuration (all paths relative/configurable) ──
SCRIPT_DIR = Path(__file__).resolve().parent
PROJECT_DIR = SCRIPT_DIR.parent
DATA_DIR = Path(os.environ.get("CONTENT_OPS_DATA_DIR", PROJECT_DIR / "data"))
DRAFTS_FILE = DATA_DIR / "content-drafts-latest.json"
WEIGHTS_FILE = DATA_DIR / "quality-scorer-weights.json"
LOG_FILE = DATA_DIR / "quality-scores-log.json"

# Default scoring threshold (adjustable)
DEFAULT_THRESHOLD = 60

# Platform character limits
PLATFORM_LIMITS = {
    "x": {"min": 50, "max": 280, "optimal_min": 150, "optimal_max": 260},
    "linkedin": {"min": 200, "max": 1500, "optimal_min": 500, "optimal_max": 1200},
    "youtube_short": {"min": 100, "max": 800, "optimal_min": 200, "optimal_max": 600},
    "newsletter": {"min": 300, "max": 2000, "optimal_min": 800, "optimal_max": 1600},
}

# Banned AI words (penalized in scoring)
BANNED_WORDS = [
    "leverage", "synergy", "ecosystem", "holistic", "at the end of the day",
    "delve", "tapestry", "landscape", "multifaceted", "nuanced", "pivotal",
    "realm", "robust", "seamless", "testament", "transformative", "underscore",
    "utilize", "whilst", "keen", "embark", "comprehensive", "intricate",
    "commendable", "meticulous", "paramount", "groundbreaking", "innovative",
    "cutting-edge", "paradigm", "Additionally", "crucial", "enduring",
    "enhance", "fostering", "garner", "highlight", "interplay", "intricacies",
    "showcase", "vibrant", "valuable", "profound", "renowned", "breathtaking",
    "nestled", "stunning", "I'm excited to share", "I think maybe",
    "It could potentially", "dive into", "game-changer", "unlock"
]

# AI patterns to detect: (regex, pattern label)
AI_PATTERNS = [
    (r"pivotal moment|is a testament|stands as", "significance_inflation"),
    (r"boasts|vibrant|commitment to", "promotional_language"),
    (r"experts believe|industry reports|studies show", "vague_attribution"),
    (r"despite.{1,50}continues to", "formulaic_structure"),
    (r"serves as|acts as|functions as", "copula_avoidance"),
    (r"it's not just .{1,30}, it's", "negative_parallelism"),
    (r"could potentially|might possibly|may perhaps", "excessive_hedging"),
    (r"the future looks bright|exciting times ahead|stay tuned", "generic_conclusion"),
]

# Voice markers — configurable positive signals for your brand voice
# Override these by setting VOICE_MARKERS_FILE env var pointing to a JSON file
VOICE_MARKERS = [
# Numbers with specificity
(r'\$[\d,]+[KkMmBb]?(?:\+)?', 2.0, "revenue_markers"),
(r'\d+%', 1.5, "percentage_stats"),
(r'\d+x', 1.5, "multiplier_stats"),
(r'\d+ (?:hours?|minutes?|days?|weeks?|months?|years?)', 1.0, "time_specifics"),
(r'\d+ (?:pages?|pieces?|tools?|agents?|companies|founders?|members)', 1.0, "count_specifics"),
# Personal framing
(r'I (?:built|found|asked|remember|had lunch)', 2.0, "personal_framing"),
(r'Here\'s what happened|A friend who|I asked \d+', 1.5, "story_framing"),
# Contrarian hooks
(r'Most people .{1,50} wrong|Everyone says .{1,30} That\'s', 2.0, "contrarian_hooks"),
(r'Harsh reality:', 1.5, "harsh_reality"),
# Engagement patterns
(r'What\'s your take\?|What did I miss\?|What would you do', 1.0, "engagement_cta"),
# Short sentences (approximated as up to ~75 characters, a proxy for under 15 words)
(r'[.!?]\s+[A-Z][^.!?]{1,75}[.!?]', 0.5, "short_sentences"),
]
# Default scoring weights
DEFAULT_WEIGHTS = {
"voice_similarity": 0.35,
"specificity": 0.25,
"slop_penalty": 0.20,
"length_appropriateness": 0.10,
"engagement_potential": 0.10,
}
def load_weights():
"""Load scoring weights from file or return defaults."""
if WEIGHTS_FILE.exists():
try:
with open(WEIGHTS_FILE) as f:
data = json.load(f)
weights = data.get("weights", DEFAULT_WEIGHTS)
threshold = data.get("threshold", DEFAULT_THRESHOLD)
return weights, threshold
except Exception as e:
print(f"⚠ Error loading weights: {e}, using defaults")
return DEFAULT_WEIGHTS, DEFAULT_THRESHOLD
def save_weights(weights, threshold):
"""Save scoring weights and threshold to file."""
data = {
"weights": weights,
"threshold": threshold,
"updated_at": datetime.now(timezone.utc).isoformat(),
"version": "1.0"
}
WEIGHTS_FILE.parent.mkdir(parents=True, exist_ok=True)
with open(WEIGHTS_FILE, 'w') as f:
json.dump(data, f, indent=2)
def log_score(draft_id, platform, scores, passed, reasons):
"""Log scoring results for analysis."""
log_entry = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"draft_id": draft_id,
"platform": platform,
"scores": scores,
# Unweighted sum of the five component scores (0-500), not the weighted total from score_draft()
"total_score": sum(scores.values()),
"passed": passed,
"failure_reasons": reasons,
}
log_data = []
if LOG_FILE.exists():
try:
with open(LOG_FILE) as f:
log_data = json.load(f)
except Exception:
log_data = []
log_data.append(log_entry)
# Keep only last 1000 entries
if len(log_data) > 1000:
log_data = log_data[-1000:]
LOG_FILE.parent.mkdir(parents=True, exist_ok=True)
with open(LOG_FILE, 'w') as f:
json.dump(log_data, f, indent=2)
def score_voice_similarity(draft_text):
"""Score how well draft matches voice patterns (0-100)."""
score = 0
matches = {}
for pattern, weight, category in VOICE_MARKERS:
pattern_matches = re.findall(pattern, draft_text, re.IGNORECASE)
if pattern_matches:
match_count = len(pattern_matches)
category_score = min(weight * math.log(match_count + 1) * 10, weight * 25)
score += category_score
matches[category] = matches.get(category, 0) + match_count
# Bonus for short punchy sentences
sentences = re.split(r'[.!?]+', draft_text)
short_sentences = [s for s in sentences if len(s.split()) <= 15 and len(s.split()) >= 3]
sentence_ratio = len(short_sentences) / max(len(sentences), 1)
score += sentence_ratio * 15
return min(score, 100), matches
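The per-category formula above is easier to reason about with a few numbers. The following is a minimal standalone sketch of that one expression; `category_score` is a hypothetical helper name, not a function in the script. It suggests that every category saturates at 12 matches regardless of weight, because ln(13) is about 2.56 and the cap corresponds to a log value of 2.5.

```python
import math


def category_score(weight: float, match_count: int) -> float:
    """Log-damped score for one voice-marker category, capped at weight * 25.

    Hypothetical helper mirroring the expression used in score_voice_similarity.
    """
    if match_count <= 0:
        return 0.0
    return min(weight * math.log(match_count + 1) * 10, weight * 25)


# Repeated matches give diminishing returns, then hit the cap.
for n in (1, 3, 11, 12, 50):
    print(n, round(category_score(2.0, n), 1))
```

The damping means a draft cannot reach a high voice score by repeating one marker type; it needs hits across several categories.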
def score_specificity(draft_text):
"""Score specificity — real numbers, examples, named entities (0-100)."""
score = 0
number_patterns = [
r'\$[\d,]+[KkMmBb]?(?:\+)?',
r'\d+%',
r'\d+x',
r'\d+[\.,]?\d*\s*(?:hours?|minutes?|days?|weeks?|months?|years?)',
r'\d+\s*(?:pages?|pieces?|tools?|agents?|companies|founders?|members)',
]
total_numbers = 0
for pattern in number_patterns:
matches = re.findall(pattern, draft_text, re.IGNORECASE)
total_numbers += len(matches)
word_count = len(draft_text.split())
number_density = total_numbers / max(word_count / 50, 1)
score += min(number_density * 30, 50)
# Named entities and specific examples
entity_patterns = [
r'[A-Z][a-z]+ [A-Z][a-z]+(?:\s[A-Z][a-z]+)*',
r'@[A-Za-z0-9_]+',
r'(?:Apple|Google|Meta|Microsoft|Amazon|Tesla|ChatGPT|Claude|OpenAI)',
]
entity_count = 0
for pattern in entity_patterns:
matches = re.findall(pattern, draft_text)
entity_count += len(matches)
score += min(entity_count * 10, 30)
# Before/after comparisons
comparison_patterns = [
r'\d+.*→.*\d+',
r'from \d+.*to \d+',
r'before.*\d+.*after.*\d+',
r'used to.*now.*'
]
for pattern in comparison_patterns:
if re.search(pattern, draft_text, re.IGNORECASE):
score += 10
break
return min(score, 100)
def score_slop_penalty(draft_text):
"""Detect and penalize AI slop and banned phrases (0-100, higher = less slop)."""
score = 100
detected_issues = []
text_lower = draft_text.lower()
banned_found = []
for word in BANNED_WORDS:
if word.lower() in text_lower:
banned_found.append(word)
score -= 10
if banned_found:
detected_issues.append(f"Banned words: {', '.join(banned_found[:3])}")
ai_patterns_found = []
for pattern, pattern_name in AI_PATTERNS:
matches = re.findall(pattern, draft_text, re.IGNORECASE)
if matches:
ai_patterns_found.append(pattern_name)
score -= 8
if ai_patterns_found:
detected_issues.append(f"AI patterns: {', '.join(ai_patterns_found[:3])}")
# Em dash overuse
em_dash_count = draft_text.count('\u2014')  # U+2014 EM DASH
word_count = len(draft_text.split())
if em_dash_count > word_count / 200:
score -= 5
detected_issues.append("Excessive em dash usage")
# Corporate speak
corporate_patterns = [
r'I\'m excited to share',
r'it is important to note',
r'in order to',
r'we are pleased to announce',
r'stay tuned for',
]
for pattern in corporate_patterns:
if re.search(pattern, draft_text, re.IGNORECASE):
score -= 15
detected_issues.append("Corporate speak detected")
break
return max(score, 0), detected_issues
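The banned-word check above uses plain substring matching, so an entry such as "keen" also fires inside longer words. One possible alternative, shown here only as a sketch and not as the script's behavior, is whole-word matching with lookarounds. `find_banned` and `SAMPLE_BANNED` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical sample; the script itself uses its full BANNED_WORDS list.
SAMPLE_BANNED = ["keen", "unlock", "dive into", "leverage"]


def find_banned(text: str, banned: list[str]) -> list[str]:
    """Return banned entries that occur as whole words or phrases in text."""
    found = []
    for phrase in banned:
        # Lookarounds instead of \b so phrases with spaces or hyphens also work.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(phrase)
    return found


text = "Keenan unlocked the door. We leverage data to dive into details."
substring_hits = [w for w in SAMPLE_BANNED if w in text.lower()]
print(len(substring_hits))               # substring matching flags all four entries
print(find_banned(text, SAMPLE_BANNED))  # whole-word matching flags two
```

The trade-off is that whole-word matching no longer catches inflected forms such as "leveraging", so a stricter list may need those variants added explicitly.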
def score_length_appropriateness(draft_text, platform):
"""Score if content length is appropriate for platform (0-100)."""
char_count = len(draft_text)
limits = PLATFORM_LIMITS.get(platform, PLATFORM_LIMITS["x"])
if char_count < limits["min"]:
shortfall_ratio = char_count / limits["min"]
return max(shortfall_ratio * 100, 20)
elif char_count > limits["max"]:
excess_ratio = limits["max"] / char_count
return max(excess_ratio * 100, 30)
elif limits["optimal_min"] <= char_count <= limits["optimal_max"]:
return 100
else:
return 85
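To make the length tiers concrete, here is a small standalone sketch that repeats the same branching for the "x" limits and evaluates a few character counts. `length_score` and `X_LIMITS` are hypothetical names; the numbers are copied from `PLATFORM_LIMITS["x"]`.

```python
# Copied from PLATFORM_LIMITS["x"] in the script.
X_LIMITS = {"min": 50, "max": 280, "optimal_min": 150, "optimal_max": 260}


def length_score(char_count: int, limits: dict) -> float:
    """Same tiers as score_length_appropriateness, without the platform lookup."""
    if char_count < limits["min"]:
        return max(char_count / limits["min"] * 100, 20)   # too short, floor of 20
    if char_count > limits["max"]:
        return max(limits["max"] / char_count * 100, 30)   # too long, floor of 30
    if limits["optimal_min"] <= char_count <= limits["optimal_max"]:
        return 100                                         # inside the optimal band
    return 85                                              # allowed but not optimal


for chars in (25, 100, 200, 270, 560, 1400):
    print(chars, length_score(chars, X_LIMITS))
```

Note that a draft at twice the maximum length still scores 50, so with the default 0.10 weight length alone rarely decides pass or fail.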
def score_engagement_potential(draft_text, platform):
"""Score engagement potential based on CTAs and hooks (0-100)."""
score = 0
cta_patterns = {
"x": [r'What\'s your take\?', r'What did I miss\?', r'Reply with'],
"linkedin": [r'What would you do', r'What do you think', r'Drop .* below', r'curious.*your'],
"youtube_short": [r'Comment.*and I\'ll', r'Follow for more'],
"newsletter": [r'subscribe', r'read more', r'check it out'],
}
platform_ctas = cta_patterns.get(platform, cta_patterns["x"])
for pattern in platform_ctas:
if re.search(pattern, draft_text, re.IGNORECASE):
score += 25
break
# Strong hooks (first 100 characters)
hook = draft_text[:100]
hook_patterns = [
r'^\d+.*\.',
r'^Most people.*wrong',
r'^I (?:built|found|asked)',
r'^Harsh reality:',
r'^Here\'s what',
]
for pattern in hook_patterns:
if re.search(pattern, hook, re.IGNORECASE):
score += 25
break
# Question-based engagement
question_count = len(re.findall(r'\?', draft_text))
if question_count >= 1:
score += min(question_count * 15, 30)
# Debate invitation
debate_patterns = [
r'Agree or disagree',
r'What\'s your experience',
r'Change my mind',
]
for pattern in debate_patterns:
if re.search(pattern, draft_text, re.IGNORECASE):
score += 20
break
return min(score, 100)
def score_draft(draft, weights, threshold):
"""Score a single draft against all criteria."""
platform = draft.get("platform", "x")
draft_text = draft.get("draft", "")
voice_score, voice_matches = score_voice_similarity(draft_text)
specificity_score = score_specificity(draft_text)
slop_score, slop_issues = score_slop_penalty(draft_text)
length_score = score_length_appropriateness(draft_text, platform)
engagement_score = score_engagement_potential(draft_text, platform)
scores = {
"voice_similarity": voice_score,
"specificity": specificity_score,
"slop_penalty": slop_score,
"length_appropriateness": length_score,
"engagement_potential": engagement_score,
}
total_score = sum(scores[key] * weights[key] for key in scores.keys())
total_score = round(total_score, 1)
passed = total_score >= threshold
failure_reasons = []
if voice_score < 50:
failure_reasons.append("Low voice match - lacks brand voice patterns")
if specificity_score < 40:
failure_reasons.append("Not specific enough - needs real numbers/examples")
if slop_score < 70:
failure_reasons.append("Contains AI slop - " + "; ".join(slop_issues))
if length_score < 60:
failure_reasons.append(f"Length issue for {platform}")
if engagement_score < 40:
failure_reasons.append("Weak engagement - needs better CTA/hook")
result = {
"draft_id": draft.get("id"),
"platform": platform,
"total_score": total_score,
"scores": scores,
"passed": passed,
"failure_reasons": failure_reasons,
"voice_matches": voice_matches,
"slop_issues": slop_issues,
"char_count": len(draft_text),
"scored_at": datetime.now(timezone.utc).isoformat(),
}
log_score(draft.get("id"), platform, scores, passed, failure_reasons)
return result
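A worked example may help when tuning weights. The sketch below recomputes the weighted total for one made-up set of component scores using the default weights from this file; `weighted_total` and `example_scores` are hypothetical and exist only for this illustration.

```python
# Default weights copied from the script; they sum to 1.0, so the total stays in 0-100.
DEFAULT_WEIGHTS = {
    "voice_similarity": 0.35,
    "specificity": 0.25,
    "slop_penalty": 0.20,
    "length_appropriateness": 0.10,
    "engagement_potential": 0.10,
}


def weighted_total(scores: dict, weights: dict) -> float:
    """Weighted sum of component scores, rounded to one decimal place."""
    return round(sum(scores[key] * weights[key] for key in scores), 1)


# Made-up component scores for illustration only.
example_scores = {
    "voice_similarity": 60,
    "specificity": 40,
    "slop_penalty": 90,
    "length_appropriateness": 100,
    "engagement_potential": 50,
}

total = weighted_total(example_scores, DEFAULT_WEIGHTS)  # 21 + 10 + 18 + 10 + 5
passed = total >= 60                                     # DEFAULT_THRESHOLD
raw_sum = sum(example_scores.values())                   # what log_score stores as total_score
print(total, passed, raw_sum)
```

A weights file that omits one of the five keys would raise `KeyError` in this computation, so a custom file presumably needs all five entries.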
def score_drafts_file(file_path=None, output_path=None, threshold_override=None, verbose=False):
"""Score all drafts in a file."""
input_file = Path(file_path) if file_path else DRAFTS_FILE
if not input_file.exists():
print(f"❌ Input file not found: {input_file}")
return None
with open(input_file) as f:
data = json.load(f)
drafts = data.get("drafts", [])
if not drafts:
print("❌ No drafts found in input file")
return None
weights, threshold = load_weights()
if threshold_override is not None:
threshold = threshold_override
print(f"📊 Using threshold override: {threshold}")
print(f"📊 Scoring {len(drafts)} drafts with threshold {threshold}")
if verbose:
print(f"📊 Weights: {weights}")
results = []
passed_count = 0
for i, draft in enumerate(drafts):
result = score_draft(draft, weights, threshold)
results.append(result)
if result["passed"]:
passed_count += 1
if verbose:
print(f"\n[{i+1}/{len(drafts)}] {result['platform']} | Score: {result['total_score']}/100")
if result["passed"]:
print(f" ✅ PASS")
else:
print(f" ❌ FAIL: {'; '.join(result['failure_reasons'])}")
total_scores = [r["total_score"] for r in results]
avg_score = sum(total_scores) / len(total_scores)
pass_rate = (passed_count / len(results)) * 100
summary = {
"scored_at": datetime.now(timezone.utc).isoformat(),
"total_drafts": len(drafts),
"passed_count": passed_count,
"pass_rate": round(pass_rate, 1),
"average_score": round(avg_score, 1),
"threshold": threshold,
"weights": weights,
"results": results,
}
if output_path:
output_file = Path(output_path)
else:
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = DATA_DIR / f"quality-scores-{timestamp}.json"
output_file.parent.mkdir(parents=True, exist_ok=True)
with open(output_file, 'w') as f:
json.dump(summary, f, indent=2)
latest_file = DATA_DIR / "quality-scores-latest.json"
with open(latest_file, 'w') as f:
json.dump(summary, f, indent=2)
print(f"\n{'='*60}")
print(f"QUALITY SCORING RESULTS")
print(f"{'='*60}")
print(f"Total drafts: {len(drafts)}")
print(f"Passed: {passed_count} ({pass_rate:.1f}%)")
print(f"Failed: {len(drafts) - passed_count}")
print(f"Average score: {avg_score:.1f}/100")
print(f"Threshold: {threshold}/100")
print(f"\nSaved to: {output_file}")
print(f"Saved to: {latest_file}")
if verbose:
print(f"\n🏆 TOP SCORING DRAFTS:")
top_drafts = sorted(results, key=lambda x: x["total_score"], reverse=True)[:3]
for i, result in enumerate(top_drafts):
status = "✅ PASS" if result["passed"] else "❌ FAIL"
print(f" {i+1}. {result['platform']} | {result['total_score']}/100 | {status}")
return summary
def main():
parser = argparse.ArgumentParser(description="Score content drafts for quality")
parser.add_argument("--input", type=str, help="Input drafts JSON file")
parser.add_argument("--output", type=str, help="Output scores JSON file")
parser.add_argument("--threshold", type=float, help="Scoring threshold override")
parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
parser.add_argument("--init-weights", action="store_true", help="Initialize default weights file")
args = parser.parse_args()
if args.init_weights:
save_weights(DEFAULT_WEIGHTS, DEFAULT_THRESHOLD)
print(f"✅ Initialized weights file: {WEIGHTS_FILE}")
return
score_drafts_file(
file_path=args.input,
output_path=args.output,
threshold_override=args.threshold,
verbose=args.verbose
)
if __name__ == "__main__":
main()


@ -0,0 +1,745 @@
#!/usr/bin/env python3
"""
Content Transform: Repurpose long-form content into platform-native drafts.
Reads content atoms, generates platform-native drafts using Claude API + optional
expert panel quality gate. Supports X threads/posts, LinkedIn, YouTube Shorts, and
newsletter formats.
LLM mode is DEFAULT. Use --template-only for fast template-based drafts (no API needed).
Usage:
python content-transform.py --atoms atoms.json --top-n 10
python content-transform.py --atoms atoms.json --template-only
python content-transform.py --atoms atoms.json --no-expert-panel
"""
import json
import uuid
import argparse
import os
import re
import sys
import textwrap
from datetime import datetime, timezone
from pathlib import Path
# ── Configuration ──
SCRIPT_DIR = Path(__file__).resolve().parent
PROJECT_DIR = SCRIPT_DIR.parent
DATA_DIR = Path(os.environ.get("CONTENT_OPS_DATA_DIR", PROJECT_DIR / "data"))
SKILL_DIR = PROJECT_DIR
ATOMS_FILE = DATA_DIR / "content-atoms-latest.json"
# Voice configuration files (optional, for LLM mode)
VOICE_CONFIG_FILE = os.environ.get("VOICE_CONFIG_FILE", str(PROJECT_DIR / "config" / "voice.md"))
STYLE_GUIDE_FILE = os.environ.get("STYLE_GUIDE_FILE", str(PROJECT_DIR / "config" / "style-guide.md"))
PLATFORM_MAP = {
"x": ["x_thread", "x_post"],
"linkedin": ["linkedin_post"],
"short_form": ["youtube_short_script"],
"newsletter": ["newsletter_section"],
"youtube_short": ["youtube_short_script"],
}
MISSING_TO_FORMAT = {
"x": "x_thread",
"linkedin": "linkedin_post",
"short_form": "youtube_short_script",
"newsletter": "newsletter_section",
"youtube_short": "youtube_short_script",
}
MISSING_TO_PLATFORM = {
"x": "x",
"linkedin": "linkedin",
"short_form": "youtube_short",
"newsletter": "newsletter",
"youtube_short": "youtube_short",
}
PLATFORM_TO_EXPERT = {
"x": "x-articles.md",
"linkedin": "linkedin.md",
"youtube_short": "youtube-shorts.md",
"newsletter": "newsletter.md",
}
EXPERT_PANEL_THRESHOLD = 95
EXPERT_PANEL_MAX_ITERATIONS = 3
def load_atoms(path=None):
p = Path(path) if path else ATOMS_FILE
with open(p) as f:
data = json.load(f)
return data.get("atoms", data) if isinstance(data, dict) else data
def rank_atoms(atoms, top_n=10):
"""Sort by repurpose_score * len(platforms_missing), take top N."""
for a in atoms:
a["_rank"] = a.get("repurpose_score", 0) * max(len(a.get("platforms_missing", [])), 1)
ranked = sorted(atoms, key=lambda x: x["_rank"], reverse=True)
return ranked[:top_n]
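The ranking rule multiplies the repurpose score by the number of missing platforms, with a floor of one. A small sketch with made-up atoms shows the effect; `rank_key` is a hypothetical helper, and unlike `rank_atoms` it does not write a `_rank` field back into each atom.

```python
def rank_key(atom: dict) -> float:
    """repurpose_score times the number of missing platforms (at least 1)."""
    return atom.get("repurpose_score", 0) * max(len(atom.get("platforms_missing", [])), 1)


# Made-up atoms for illustration.
atoms = [
    {"id": "a", "repurpose_score": 9, "platforms_missing": ["x"]},
    {"id": "b", "repurpose_score": 6, "platforms_missing": ["x", "linkedin", "newsletter"]},
    {"id": "c", "repurpose_score": 7, "platforms_missing": []},
]

ranked = sorted(atoms, key=rank_key, reverse=True)
print([(a["id"], rank_key(a)) for a in ranked])  # [('b', 18), ('a', 9), ('c', 7)]
```

A mid-scoring atom missing from three platforms outranks a high-scoring atom missing from one, which appears to be the intent: coverage gaps count as much as quality.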
def clean_content(content):
content = re.sub(r'^[\w]+\s*·\s*@[\w]+\s*·.*$', '', content, flags=re.MULTILINE)
content = re.sub(r'\n{3,}', '\n\n', content)
return content.strip()
def extract_hook(content, max_chars=200):
content = clean_content(content)
for sep in [". ", ".\n", "\n"]:
idx = content.find(sep)
if 0 < idx < max_chars:
return content[:idx + 1].strip()
return content[:max_chars].strip()
def extract_key_points(content, max_points=6):
lines = content.split("\n")
points = []
for line in lines:
line = line.strip()
if not line:
continue
if line.startswith(("•", "-", "→", "*")) or re.match(r"^\d+[\.\)]", line):
cleaned = re.sub(r"^[•\-→\*\d+\.\)]+\s*", "", line).strip()
if len(cleaned) > 15:
points.append(cleaned)
elif len(line) > 20 and len(line) < 280:
points.append(line)
return points[:max_points] if points else [content[:200]]
def extract_numbers(content):
patterns = [
r'\$[\d,]+[KkMmBb]?(?:\+)?',
r'\d+%',
r'\d+x',
r'\d+[\.,]?\d*\s*(?:hours?|minutes?|days?|weeks?|months?|years?)',
r'\d+\s*(?:pages?|pieces?|tools?|agents?|companies|founders?|members)',
]
numbers = []
for p in patterns:
numbers.extend(re.findall(p, content, re.IGNORECASE))
return numbers[:5]
def shorten_sentence(s, max_words=15):
words = s.split()
if len(words) <= max_words:
return s
return " ".join(words[:max_words]) + "."
def make_punchy(text, max_words=15):
sentences = re.split(r'(?<=[.!?])\s+', text)
result = []
for s in sentences:
s = s.strip()
if not s:
continue
if len(s.split()) > max_words:
parts = re.split(r'[,;—]', s)
for p in parts:
p = p.strip()
if p:
result.append(shorten_sentence(p) if not p.endswith(('.', '!', '?')) else p)
else:
result.append(s)
return result
# ── TEMPLATE GENERATORS (used with --template-only) ──
def generate_x_thread(atom):
content = clean_content(atom["content"])
hook_text = extract_hook(content, 200)
points = extract_key_points(content)
numbers = extract_numbers(content)
tags = atom.get("tags", [])
atom_type = atom.get("atom_type", "")
if "data" in atom_type or numbers:
tweet1 = f"{hook_text}\n\nThe numbers tell a different story. 🧵"
elif "story" in atom_type or "anecdote" in atom_type:
tweet1 = f"{hook_text}\n\nHere's what happened next. 🧵"
else:
tweet1 = f"Most people get this wrong about {tags[0] if tags else 'this'}.\n\n{hook_text}"
if len(tweet1) > 280:
tweet1 = tweet1[:277] + "..."
tweets = [tweet1]
for i, point in enumerate(points[:5]):
point_short = shorten_sentence(point, 15)
if numbers and i < len(numbers):
tweet = f"{point_short}\n\n{numbers[i]} — that's the real number."
else:
tweet = point_short
if len(tweet) > 280:
tweet = tweet[:277] + "..."
tweets.append(tweet)
ctas = [
"What's your take? Reply with what you'd add.",
"What did I miss? Drop your thoughts below.",
"Agree or disagree? I want to hear your take.",
]
tweets.append(ctas[hash(atom["id"]) % len(ctas)])
while len(tweets) < 5:
tweets.insert(-1, "The gap is only getting wider. Those who move now win.")
thread = "\n\n---\n\n".join([f"🧵 {i+1}/{len(tweets)}\n{t}" for i, t in enumerate(tweets)])
return thread, tweets[0]
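One caveat on the CTA choice: Python salts the built-in `hash()` of strings per process unless `PYTHONHASHSEED` is set, so `hash(atom["id"]) % len(ctas)` can select a different CTA for the same atom on different runs. If stable output matters, a digest-based pick is one option. This is a sketch of an alternative, not what the script does; `pick_cta` is a hypothetical name.

```python
import hashlib

# CTA strings copied from generate_x_thread.
CTAS = [
    "What's your take? Reply with what you'd add.",
    "What did I miss? Drop your thoughts below.",
    "Agree or disagree? I want to hear your take.",
]


def pick_cta(atom_id: str, ctas: list[str]) -> str:
    """Pick a CTA that stays the same for a given atom id across processes."""
    digest = hashlib.sha256(atom_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ctas)
    return ctas[index]


print(pick_cta("atom-123", CTAS))
```

The same idea would apply to the CTA selection in `generate_linkedin_post`.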
def generate_x_post(atom):
content = clean_content(atom["content"])
hook = extract_hook(content, 180)
numbers = extract_numbers(content)
num_str = f"\n\n{numbers[0]}." if numbers else ""
post = f"{hook}{num_str}\n\nWhat's your take?"
if len(post) > 280:
post = post[:277] + "..."
return post, hook
def generate_linkedin_post(atom):
content = clean_content(atom["content"])
hook = extract_hook(content, 150)
points = extract_key_points(content)
numbers = extract_numbers(content)
hook_section = f"{hook}\n\nHere's what I learned."
punchy = make_punchy(content)
story = "\n\n".join(punchy[:6])
point_section = "\n".join([f"{p}" for p in points[:4]]) if len(points) > 2 else ""
data_section = f"\nThe data: {', '.join(numbers[:3])}." if numbers else ""
ctas = [
"What would you do differently?",
"What's your experience with this?",
"Curious — what's your take?",
]
cta = ctas[hash(atom["id"]) % len(ctas)]
parts = [hook_section, story]
if point_section:
parts.append(point_section)
if data_section:
parts.append(data_section)
parts.append(cta)
post = "\n\n".join(parts)
if len(post) > 1500:
post = post[:1497] + "..."
return post, hook
def generate_youtube_short(atom):
content = clean_content(atom["content"])
hook = extract_hook(content, 100)
points = extract_key_points(content)
numbers = extract_numbers(content)
tags = atom.get("tags", [])
topic = tags[0] if tags else "this"
hook_line = f"[HOOK] (0:00-0:03)\n[Look directly at camera, energy up]\n\"{hook}\""
setup_points = points[:2]
setup_text = " ".join([shorten_sentence(p, 12) for p in setup_points])
setup_line = f"[SETUP] (0:03-0:13)\n[Cut to B-roll or screen share]\n\"{setup_text}\""
payoff_points = points[2:5] if len(points) > 2 else points
payoff_items = "\n".join([f"{shorten_sentence(p, 12)}" for p in payoff_points])
num_callout = f"\n[TEXT OVERLAY: {numbers[0]}]" if numbers else ""
payoff_line = f"[PAYOFF] (0:13-0:40)\n[Quick cuts between points]{num_callout}\n{payoff_items}"
cta_line = f"[CTA] (0:40-0:45)\n[Point at camera]\n\"Comment '{topic.upper()}' and I'll show you exactly how.\"\n[TEXT: Follow for more]"
script = f"{hook_line}\n\n{setup_line}\n\n{payoff_line}\n\n{cta_line}"
return script, hook
def generate_newsletter_section(atom):
content = clean_content(atom["content"])
hook = extract_hook(content, 150)
points = extract_key_points(content)
numbers = extract_numbers(content)
headline = f"**{hook}**"
punchy = make_punchy(content)
para1 = " ".join(punchy[:4])
para2 = " ".join(punchy[4:8]) if len(punchy) > 4 else ""
data = f"The numbers: {', '.join(numbers[:3])}." if numbers else ""
why = f"> **Why this matters:** {shorten_sentence(points[-1] if points else content[:100], 15)}"
parts = [headline, para1]
if para2:
parts.append(para2)
if data:
parts.append(data)
parts.append(why)
return "\n\n".join([p for p in parts if p.strip()]), hook
FORMAT_GENERATORS = {
"x_thread": generate_x_thread,
"x_post": generate_x_post,
"linkedin_post": generate_linkedin_post,
"youtube_short_script": generate_youtube_short,
"newsletter_section": generate_newsletter_section,
}
def estimate_engagement(atom, platform):
score = atom.get("repurpose_score", 5)
if score >= 8:
return "high"
elif score >= 5:
return "medium"
return "low"
def generate_drafts_for_atom(atom):
drafts = []
missing = atom.get("platforms_missing", [])
for platform_key in missing:
fmt = MISSING_TO_FORMAT.get(platform_key)
platform = MISSING_TO_PLATFORM.get(platform_key)
if not fmt or fmt not in FORMAT_GENERATORS:
continue
generator = FORMAT_GENERATORS[fmt]
draft_text, hook = generator(atom)
draft = {
"id": str(uuid.uuid4()),
"atom_id": atom["id"],
"atom_content": atom["content"][:500],
"atom_source": atom.get("source", "unknown"),
"platform": platform,
"format": fmt,
"draft": draft_text,
"hook": hook[:200],
"char_count": len(draft_text),
"estimated_engagement": estimate_engagement(atom, platform),
"created_at": datetime.now(timezone.utc).isoformat(),
"status": "draft",
"expert_score": None,
"iterations": 0,
"key_improvements": [],
}
drafts.append(draft)
return drafts
# ── ANTHROPIC API ──
def get_anthropic_key():
"""Get Anthropic API key from environment."""
key = os.environ.get("ANTHROPIC_API_KEY")
if key:
return key
print("ERROR: Set ANTHROPIC_API_KEY environment variable")
return None
def load_file_safe(path):
"""Load a text file, return empty string if missing."""
try:
return Path(path).read_text()
except Exception:
return ""
def load_expert_panel(platform):
"""Load expert panel for a platform."""
filename = PLATFORM_TO_EXPERT.get(platform, "x-articles.md")
return load_file_safe(SKILL_DIR / "experts" / filename)
def load_scoring_rubric():
"""Load content quality scoring rubric."""
return load_file_safe(SKILL_DIR / "scoring-rubrics" / "content-quality.md")
def load_voice_references():
"""Load voice/style references for content generation."""
voice_config = load_file_safe(VOICE_CONFIG_FILE)
style_guide = load_file_safe(STYLE_GUIDE_FILE)
return voice_config, style_guide
def call_anthropic(client, messages, system=None, model="claude-sonnet-4-20250514", max_tokens=2000):
"""Call Anthropic API."""
kwargs = {"model": model, "max_tokens": max_tokens, "messages": messages}
if system:
kwargs["system"] = system
response = client.messages.create(**kwargs)
return response.content[0].text.strip()
def llm_generate_draft(client, atom, platform, fmt, voice_config, style_guide):
"""Generate a draft using Claude API."""
platform_instructions = {
"x": "Write an X article (long-form X post). Include at least one ASCII diagram in a code block. Keep paragraphs to 1-3 sentences. End with a natural CTA.",
"linkedin": "Write a LinkedIn post. Hook must work before the 'see more' fold (first 2-3 lines). Use line breaks for readability. Professional but personal. 800-1500 chars.",
"youtube_short": "Write a YouTube Short script. Format: [HOOK] (0:00-0:03), [SETUP] (0:03-0:13), [PAYOFF] (0:13-0:40), [CTA] (0:40-0:45). Include visual directions. 30-60 seconds total.",
"newsletter": "Write a newsletter section. Subject line + scannable body. Headers, bullets, bold for skimmers. End with 'why this matters'.",
}
system_parts = ["You are a content writer creating platform-native content. Follow the configured voice and style EXACTLY."]
if voice_config:
system_parts.append(f"\nVOICE CONFIGURATION:\n{voice_config}")
if style_guide:
system_parts.append(f"\nSTYLE GUIDE:\n{style_guide[:2000]}")
system_parts.append("""
RULES:
- Short punchy sentences. Max 15 words.
- Specific numbers always. Never vague.
- Contrarian angles backed by data.
- No corporate speak. No "I'm excited to share."
- Personal stories and specific examples.
- Every sentence earns its place.""")
system = "\n".join(system_parts)
topic_tags = atom.get('tags', [])
prompt = f"""Create a {platform} draft from this content atom.
PLATFORM INSTRUCTIONS:
{platform_instructions.get(platform, platform_instructions['x'])}
SOURCE CONTENT:
{clean_content(atom['content'])}
SOURCE: {atom.get('source_title', 'unknown')}
TAGS: {', '.join(topic_tags)}
Write ONLY the draft content. No preamble, no explanation."""
return call_anthropic(client, [{"role": "user", "content": prompt}], system=system)
def expert_panel_score(client, draft_text, platform, expert_panel, rubric, voice_config):
"""Run expert panel scoring. Returns (score, feedback_dict)."""
system = f"""You are simulating 11 reviewers (10 domain experts plus an AI writing detector) scoring content for quality.
EXPERT PANEL:
{expert_panel}
SCORING RUBRIC:
{rubric}
VOICE REFERENCE:
{voice_config[:1000] if voice_config else 'No specific voice config provided.'}"""
prompt = f"""Score this {platform} draft. Each of 11 experts scores 0-100 on the rubric criteria.
Expert #11 is the AI Writing Detector (Humanizer) — scores how AI-generated the draft sounds.
BANNED AI VOCABULARY (flag any occurrence):
delve, tapestry, landscape (abstract), leverage, multifaceted, nuanced, pivotal, realm, robust, seamless, testament, transformative, underscore (verb), utilize, whilst, keen, embark, comprehensive, intricate, commendable, meticulous, paramount, groundbreaking, innovative, cutting-edge, synergy, holistic, paradigm, ecosystem, Additionally, crucial, enduring, enhance, fostering, garner, highlight (verb), interplay, intricacies, showcase, vibrant, valuable, profound, renowned, breathtaking, nestled, stunning
AI PATTERNS TO CHECK:
- Significance inflation ("pivotal moment", "is a testament", "stands as")
- Superficial -ing phrases ("highlighting", "showcasing", "underscoring")
- Promotional language ("boasts", "vibrant", "commitment to")
- Vague attributions ("Experts believe", "Industry reports")
- Formulaic "despite challenges... continues to" structures
- Copula avoidance ("serves as" instead of "is")
- Negative parallelisms ("It's not just X, it's Y")
- Rule-of-three forcing (triple adjectives/clauses)
- Em dash overuse (max 1 per 200 words)
- Filler phrases ("In order to", "It is important to note")
- Excessive hedging ("could potentially")
- Generic positive conclusions ("The future looks bright")
If the Humanizer expert scores below 70, the draft MUST be flagged for revision.
DRAFT:
{draft_text}
Respond in this EXACT JSON format (no other text):
{{
"average_score": <number>,
"expert_scores": [<11 numbers>],
"weaknesses": ["<specific weakness 1>", "<specific weakness 2>", ...],
"line_feedback": ["<specific line-by-line fix 1>", "<specific line-by-line fix 2>", ...],
"strengths": ["<strength 1>", "<strength 2>"],
"ai_patterns_detected": ["<pattern 1>", "<pattern 2>", ...],
"humanizer_score": <number>
}}
Be harsh. Score honestly."""
response = call_anthropic(client, [{"role": "user", "content": prompt}], system=system, max_tokens=1500)
try:
json_match = re.search(r'\{[\s\S]*\}', response)
if json_match:
result = json.loads(json_match.group())
return result.get("average_score", 0), result
else:
return 0, {"error": "No JSON in response"}
except json.JSONDecodeError:
return 0, {"error": "Invalid JSON", "raw": response[:500]}
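The response parsing above can be exercised without an API call. The sketch below isolates the same extraction logic into a hypothetical `extract_json` helper and runs it on hand-written replies, including one case where the greedy pattern spans two objects and the parse fails.

```python
import json
import re


def extract_json(response: str):
    """Pull the first-to-last brace span out of a reply and parse it."""
    match = re.search(r"\{[\s\S]*\}", response)  # greedy: first "{" to last "}"
    if not match:
        return 0, {"error": "No JSON in response"}
    try:
        result = json.loads(match.group())
    except json.JSONDecodeError:
        return 0, {"error": "Invalid JSON", "raw": response[:500]}
    return result.get("average_score", 0), result


# Hand-written replies for illustration.
good = 'Here is my review:\n{"average_score": 82.5, "weaknesses": ["weak hook"]}\nDone.'
two_objects = '{"average_score": 90} and also {"note": "extra"}'

print(extract_json(good)[0])
print(extract_json("no braces here")[1])
print(extract_json(two_objects)[1]["error"])
```

Because a failed parse returns a score of 0, a malformed reply looks identical to a very bad draft in the revision loop; checking for the `error` key before revising may be worth considering.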
def expert_panel_revise(client, draft_text, platform, feedback, voice_config, style_guide):
"""Revise draft based on expert feedback."""
system_parts = ["You are revising content based on expert feedback."]
if voice_config:
system_parts.append(f"\nVOICE CONFIGURATION:\n{voice_config}")
system_parts.append("""
RULES:
- Fix every weakness identified
- Keep all strengths
- Maintain configured voice exactly
- Short punchy sentences, specific numbers, contrarian angles""")
system = "\n".join(system_parts)
weaknesses = feedback.get("weaknesses", [])
line_fixes = feedback.get("line_feedback", [])
ai_patterns = feedback.get("ai_patterns_detected", [])
ai_section = ""
if ai_patterns:
ai_section = f"""
AI PATTERNS DETECTED (MUST FIX ALL):
{chr(10).join(f'- {p}' for p in ai_patterns)}
BANNED VOCABULARY (replace every occurrence):
delve, tapestry, landscape (abstract), leverage, multifaceted, nuanced, pivotal, realm, robust, seamless, testament, transformative, underscore (verb), utilize, whilst, keen, embark, comprehensive, intricate, commendable, meticulous, paramount, groundbreaking, innovative, cutting-edge, synergy, holistic, paradigm, ecosystem, Additionally, crucial, enduring, enhance, fostering, garner, highlight (verb), interplay, intricacies, showcase, vibrant, valuable, profound, renowned, breathtaking, nestled, stunning
"""
prompt = f"""Revise this {platform} draft based on expert feedback.
CURRENT DRAFT:
{draft_text}
WEAKNESSES TO FIX:
{chr(10).join(f'- {w}' for w in weaknesses)}
SPECIFIC LINE FIXES:
{chr(10).join(f'- {f}' for f in line_fixes)}
{ai_section}
CURRENT SCORE: {feedback.get('average_score', 'unknown')}
TARGET SCORE: {EXPERT_PANEL_THRESHOLD}+
Write ONLY the revised draft. No preamble."""
return call_anthropic(client, [{"role": "user", "content": prompt}], system=system)
def process_draft_with_expert_panel(client, atom, platform, fmt, voice_config, style_guide):
"""Full expert panel pipeline: generate → score → revise loop."""
expert_panel = load_expert_panel(platform)
rubric = load_scoring_rubric()
print(f" Generating {platform} draft via Claude...")
draft_text = llm_generate_draft(client, atom, platform, fmt, voice_config, style_guide)
iterations = []
best_draft = draft_text
best_score = 0
for iteration in range(1, EXPERT_PANEL_MAX_ITERATIONS + 1):
print(f" Expert panel scoring (iteration {iteration})...")
score, feedback = expert_panel_score(client, draft_text, platform, expert_panel, rubric, voice_config)
print(f" Score: {score}/100")
iteration_log = {
"iteration": iteration,
"score": score,
"weaknesses": feedback.get("weaknesses", []),
"line_feedback": feedback.get("line_feedback", []),
"strengths": feedback.get("strengths", []),
}
iterations.append(iteration_log)
if score > best_score:
best_score = score
best_draft = draft_text
if score >= EXPERT_PANEL_THRESHOLD:
print(f" ✓ Passed threshold ({score} >= {EXPERT_PANEL_THRESHOLD})")
break
if iteration < EXPERT_PANEL_MAX_ITERATIONS:
print(f" Revising based on feedback...")
draft_text = expert_panel_revise(client, draft_text, platform, feedback, voice_config, style_guide)
key_improvements = []
for it in iterations:
for w in it.get("weaknesses", []):
key_improvements.append(f"Iter {it['iteration']}: Fixed — {w}")
return best_draft, best_score, len(iterations), key_improvements, iterations
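The control flow of the generate, score, revise loop can be checked in isolation by swapping the Claude calls for stubs. This is a simplified sketch under that assumption; `refine`, `score_fn`, and `revise_fn` are hypothetical names, and the lookup tables stand in for real API responses.

```python
def refine(draft, score_fn, revise_fn, threshold=95, max_iterations=3):
    """Score, keep the best draft seen so far, revise until threshold or limit."""
    best_draft, best_score = draft, 0
    iterations = 0
    for iteration in range(1, max_iterations + 1):
        iterations = iteration
        score = score_fn(draft)
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= threshold:
            break
        if iteration < max_iterations:  # no revision after the final scoring pass
            draft = revise_fn(draft)
    return best_draft, best_score, iterations


# Stubs: fixed scores and revisions instead of API calls.
stub_scores = {"v1": 70, "v2": 88, "v3": 81}
stub_revisions = {"v1": "v2", "v2": "v3"}

result = refine("v1", stub_scores.__getitem__, stub_revisions.__getitem__)
print(result)  # ('v2', 88, 3)
```

The stub run shows why the loop tracks `best_draft` separately: the third revision scores lower than the second, and the earlier version is the one returned.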
def rewrite_with_llm(drafts, use_expert_panel=False, expert_panel_top_n=10):
"""Rewrite drafts using Claude API, optionally with expert panel."""
try:
import anthropic
except ImportError:
print("ERROR: anthropic package not installed. Run: pip install anthropic")
return drafts
api_key = get_anthropic_key()
if not api_key:
return drafts
client = anthropic.Anthropic(api_key=api_key)
voice_config, style_guide = load_voice_references()
rewritten = []
for i, draft in enumerate(drafts):
atom = {"content": draft["atom_content"], "source_title": draft.get("atom_source", ""),
"tags": [], "atom_type": ""}
if use_expert_panel and i < expert_panel_top_n:
print(f"\n [{i+1}/{len(drafts)}] Expert panel: {draft['format']} (atom {draft['atom_id'][:8]})")
try:
import time as _time
_start = _time.time()
new_text, score, iters, improvements, iter_log = process_draft_with_expert_panel(
client, atom, draft["platform"], draft["format"],
voice_config, style_guide
)
_elapsed = _time.time() - _start
draft["draft"] = new_text
draft["hook"] = extract_hook(new_text, 200)
draft["char_count"] = len(new_text)
draft["expert_score"] = score
draft["iterations"] = iters
draft["key_improvements"] = improvements
draft["iteration_log"] = iter_log
status = "✓" if score >= EXPERT_PANEL_THRESHOLD else f"⚠ ({score})"
print(f" {status} Final: {score}/100 after {iters} iteration(s) [{_elapsed:.1f}s]")
except Exception as e:
print(f" ✗ Expert panel failed ({type(e).__name__}): {e}")
try:
new_text = llm_generate_draft(client, atom, draft["platform"], draft["format"],
voice_config, style_guide)
draft["draft"] = new_text
draft["hook"] = extract_hook(new_text, 200)
draft["char_count"] = len(new_text)
print(f" ↳ Fell back to simple LLM rewrite")
except Exception as e2:
print(f" ✗ LLM rewrite also failed: {e2}")
else:
print(f"\n [{i+1}/{len(drafts)}] LLM rewrite: {draft['format']} (atom {draft['atom_id'][:8]})")
try:
new_text = llm_generate_draft(client, atom, draft["platform"], draft["format"],
voice_config, style_guide)
draft["draft"] = new_text
draft["hook"] = extract_hook(new_text, 200)
draft["char_count"] = len(new_text)
print(f" ✓ Rewrote")
except Exception as e:
print(f" ✗ LLM rewrite failed: {e}")
rewritten.append(draft)
return rewritten
def main():
parser = argparse.ArgumentParser(description="Transform content atoms into platform-native drafts")
parser.add_argument("--atoms", type=str, help="Path to atoms JSON file")
parser.add_argument("--top-n", type=int, default=10, help="Number of top atoms to process")
parser.add_argument("--template-only", action="store_true", help="Use template-based generation (no LLM)")
parser.add_argument("--no-expert-panel", action="store_true", help="Disable expert panel quality gate")
parser.add_argument("--expert-panel-top-n", type=int, default=10, help="Apply expert panel to top N drafts")
parser.add_argument("--output", type=str, help="Output file path")
args = parser.parse_args()
use_llm = not args.template_only
use_expert_panel = use_llm and not args.no_expert_panel
atoms = load_atoms(args.atoms)
print(f"Loaded {len(atoms)} atoms")
top_atoms = rank_atoms(atoms, args.top_n)
print(f"Selected top {len(top_atoms)} atoms by repurpose_score × missing platforms")
all_drafts = []
for atom in top_atoms:
drafts = generate_drafts_for_atom(atom)
all_drafts.extend(drafts)
missing = atom.get("platforms_missing", [])
print(f" Atom {atom['id'][:8]}: {len(drafts)} drafts ({', '.join(missing)})")
print(f"\nGenerated {len(all_drafts)} total drafts")
if use_llm:
mode = "LLM + Expert Panel" if use_expert_panel else "LLM only"
print(f"\n{'='*60}")
print(f"Rewriting with {mode}...")
print(f"{'='*60}")
all_drafts = rewrite_with_llm(all_drafts, use_expert_panel=use_expert_panel,
expert_panel_top_n=args.expert_panel_top_n)
by_platform = {}
for d in all_drafts:
by_platform[d["platform"]] = by_platform.get(d["platform"], 0) + 1
print(f"\n{'='*60}")
print("Drafts by platform:")
for p, c in sorted(by_platform.items()):
print(f" {p}: {c}")
# Test for the key rather than truthiness so a draft scored 0 is still counted.
scored = [d for d in all_drafts if "expert_score" in d]
if scored:
avg = sum(d["expert_score"] for d in scored) / len(scored)
passed = sum(1 for d in scored if d["expert_score"] >= EXPERT_PANEL_THRESHOLD)
print(f"\nExpert panel: {len(scored)} scored, {passed} passed (≥{EXPERT_PANEL_THRESHOLD}), avg {avg:.1f}")
today = datetime.now().strftime("%Y-%m-%d")
output_path = Path(args.output) if args.output else DATA_DIR / f"content-drafts-{today}.json"
latest_path = DATA_DIR / "content-drafts-latest.json"
output = {
"generated_at": datetime.now(timezone.utc).isoformat(),
"atom_count": len(top_atoms),
"draft_count": len(all_drafts),
"used_llm": use_llm,
"used_expert_panel": use_expert_panel,
"expert_panel_threshold": EXPERT_PANEL_THRESHOLD if use_expert_panel else None,
"drafts": all_drafts,
}
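output_path.parent.mkdir(parents=True, exist_ok=True)
# --output may point outside DATA_DIR, so make sure the "latest" copy has a home too.
latest_path.parent.mkdir(parents=True, exist_ok=True)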
with open(output_path, "w") as f:
json.dump(output, f, indent=2)
with open(latest_path, "w") as f:
json.dump(output, f, indent=2)
print(f"\nSaved to {output_path}")
print(f"Saved to {latest_path}")
if scored:
print(f"\n{'='*60}")
print("TOP DRAFTS BY SCORE:")
print(f"{'='*60}")
for d in sorted(scored, key=lambda x: x["expert_score"], reverse=True)[:5]:
print(f"\n[{d['platform'].upper()}] Score: {d['expert_score']}/100 | Iterations: {d['iterations']}")
print(f"Hook: {d['hook'][:100]}...")
if d.get("key_improvements"):
print(f"Key improvements: {d['key_improvements'][0]}")
print(f"---")
print(d["draft"][:300])
print("...\n")
if __name__ == "__main__":
main()
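
The generate → score → revise loop in `process_draft_with_expert_panel` can be sketched in isolation as follows. This is a minimal illustration, not the script's actual code: `refine_until_threshold`, `toy_score`, and `toy_revise` are hypothetical names, the scorer and reviser are stand-ins for the Claude calls, and the default threshold of 90 and limit of 3 iterations are assumptions (the real values live in `EXPERT_PANEL_THRESHOLD` and `EXPERT_PANEL_MAX_ITERATIONS`).

```python
from typing import Callable, List, Tuple


def refine_until_threshold(
    draft: str,
    score_fn: Callable[[str], Tuple[int, List[str]]],
    revise_fn: Callable[[str, List[str]], str],
    threshold: int = 90,        # assumed value, for illustration only
    max_iterations: int = 3,    # assumed value, for illustration only
) -> Tuple[str, int, int]:
    """Return (best_draft, best_score, iterations_run)."""
    best_draft, best_score = draft, 0
    iterations_run = 0
    for iteration in range(1, max_iterations + 1):
        iterations_run = iteration
        score, weaknesses = score_fn(draft)
        # Remember the best draft seen so far: a revision can score lower.
        if score > best_score:
            best_draft, best_score = draft, score
        if score >= threshold:
            break
        # Skip the revision on the last pass, since nothing would score it.
        if iteration < max_iterations:
            draft = revise_fn(draft, weaknesses)
    return best_draft, best_score, iterations_run


# Toy stand-ins: the score is the text length, and a revision appends a note.
def toy_score(text: str) -> Tuple[int, List[str]]:
    return min(len(text), 100), ["too short"]


def toy_revise(text: str, weaknesses: List[str]) -> str:
    return text + " [revised: " + weaknesses[0] + "]"


best, score, runs = refine_until_threshold("x" * 50, toy_score, toy_revise, threshold=60)
print(f"score={score} after {runs} iteration(s)")
```

Tracking the best draft separately from the current draft is the design choice worth noting: when a revision makes things worse, the loop still returns the highest-scoring version.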


@ -0,0 +1,499 @@
#!/usr/bin/env python3
"""
Editorial Brain: top-down clip discovery using LLM analysis.
Instead of bottom-up keyword matching, this gives the full transcript to an LLM
and asks it to find the best clip-worthy moments like a human editor would.
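Two-pass approach:
1. Pass 1: Sonnet scans the transcript cheaply (in a single call when it is short enough, otherwise chunk by chunk) and finds candidate moments
2. Pass 2: Sonnet scores each candidate on hook/build/payoff/clean-cut (0-100)
Only clips that score at or above --min-score (default 90) get cut.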
Usage:
python editorial-brain.py --url "https://youtube.com/watch?v=..." [--max-clips 5]
python editorial-brain.py --vtt /path/to/file.vtt --video-id ID [--max-clips 5]
"""
import argparse
import json
import os
import re
import subprocess
import sys
import urllib.request
from pathlib import Path
# ── Configuration ──
ANTHROPIC_API_KEY = os.environ.get('ANTHROPIC_API_KEY', '')
SCRIPT_DIR = Path(__file__).resolve().parent
PROJECT_DIR = SCRIPT_DIR.parent
DATA_DIR = Path(os.environ.get("CONTENT_OPS_DATA_DIR", PROJECT_DIR / "data"))
CLIPS_DIR = DATA_DIR / "clips"
# Model configuration
DEFAULT_MODEL = os.environ.get("EDITORIAL_BRAIN_MODEL", "claude-sonnet-4-20250514")
def call_claude(prompt, model=None, max_tokens=4000):
"""Call Claude API."""
model = model or DEFAULT_MODEL
data = json.dumps({
"model": model,
"max_tokens": max_tokens,
"messages": [{"role": "user", "content": prompt}]
}).encode()
req = urllib.request.Request(
"https://api.anthropic.com/v1/messages",
data=data,
headers={
"Content-Type": "application/json",
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01"
}
)
with urllib.request.urlopen(req, timeout=120) as resp:
result = json.loads(resp.read())
return result['content'][0]['text']
def download_vtt(url):
    """Download VTT subtitles from YouTube."""
    match = re.search(r'(?:v=|/)([a-zA-Z0-9_-]{11})', url)
    if not match:
        # Fail with a clear message instead of an AttributeError on None.
        raise ValueError(f"Could not extract a video ID from URL: {url}")
    video_id = match.group(1)
    vtt_path = f"/tmp/editorial_{video_id}.en.vtt"
    # Reuse a previously downloaded subtitle file if one is cached.
    if os.path.exists(vtt_path):
        return vtt_path, video_id
    subprocess.run([
        'yt-dlp', '--write-auto-subs', '--sub-lang', 'en', '--sub-format', 'vtt',
        '--skip-download', '--output', f'/tmp/editorial_{video_id}.%(ext)s', url
    ], capture_output=True, check=True)
    return vtt_path, video_id
def parse_vtt(vtt_path):
"""Parse YouTube auto-caption VTT into clean, deduplicated transcript.
YouTube auto-captions use a scrolling format where each block contains
the previous line + new text. We filter out repeat blocks (< 20ms duration)
and strip overlapping prefixes to get clean text.
"""
content = Path(vtt_path).read_text(encoding="utf-8")
blocks = content.split('\n\n')
segments = []
prev_clean = ''
for block in blocks:
lines = block.strip().split('\n')
if not lines:
continue
ts = re.match(r'(\d{2}:\d{2}:\d{2}\.\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}\.\d{3})', lines[0])
if not ts:
continue
p1 = ts.group(1).split(':')
p2 = ts.group(2).split(':')
s1 = int(p1[0]) * 3600 + int(p1[1]) * 60 + float(p1[2])
s2 = int(p2[0]) * 3600 + int(p2[1]) * 60 + float(p2[2])
if s2 - s1 < 0.02:
continue
raw_text = '\n'.join(lines[1:])
clean = re.sub(r'<[^>]+>', '', raw_text).strip()
clean = re.sub(r'\s+', ' ', clean)
if not clean or clean == prev_clean:
continue
new_text = clean
if prev_clean:
for overlap_len in range(min(len(prev_clean), len(clean)), 0, -1):
if clean[:overlap_len] == prev_clean[-overlap_len:]:
new_text = clean[overlap_len:].strip()
break
if new_text:
segments.append({'start': s1, 'end': s2, 'text': new_text})
prev_clean = clean
return segments
def build_readable_transcript(segments):
"""Build a human-readable transcript with timestamps every ~30s."""
output = ''
last_ts = -30
for seg in segments:
if seg['start'] - last_ts >= 30:
m, s = divmod(int(seg['start']), 60)
output += f'\n\n[{m}:{s:02d}] '
last_ts = seg['start']
output += seg['text'] + ' '
return output
def chunk_transcript(transcript_text, chunk_size=12000):
"""Split transcript into chunks at timestamp boundaries."""
chunks = []
remaining = transcript_text
while remaining:
if len(remaining) <= chunk_size:
chunks.append(remaining)
break
break_at = remaining.rfind('\n\n[', 0, chunk_size)
if break_at < chunk_size * 0.3:
break_at = chunk_size
chunks.append(remaining[:break_at])
remaining = remaining[break_at:]
return chunks
def find_moments_full_transcript(full_transcript, video_title=""):
"""Analyze the ENTIRE transcript in one call."""
prompt = f"""You are a legendary short-form video editor (think: the team behind Hormozi's clips, Chris Williamson's best moments).
Read this FULL transcript of "{video_title}" and find the 3-5 BEST moments that could become viral 30-60 second clips.
CRITICAL RULES:
- ONLY identify moments that ACTUALLY EXIST in the transcript below
- Quote the EXACT words from the transcript; do not paraphrase or invent
- Each moment must have a clear HOOK → BUILD → PAYOFF arc
- A stranger scrolling at 2am should stop, watch the whole clip, and feel smarter
What makes a 90+ clip:
- HOOK (0-3s): Pattern interrupt: shocking stat, bold claim, provocative question
- BUILD (3-30s): Stakes rise: story tension, framework develops, insight escalates
- PAYOFF (last 5-10s): The insight LANDS: counterintuitive truth, surprising number, emotional resolution
- CLEAN END: Cut immediately after the payoff. Silence > trailing off.
FULL TRANSCRIPT:
{full_transcript}
Return a JSON array of the best moments (3-5 max). For each:
{{
"start_timestamp": "[M:SS] exact timestamp from transcript",
"end_timestamp": "[M:SS] where to cut",
"hook_quote": "EXACT opening words from transcript",
"payoff_quote": "EXACT closing words/punchline from transcript",
"why_viral": "One sentence on why this stops scrolls",
"estimated_score": 0-100,
"narrative_arc": "Hook: ... → Build: ... → Payoff: ..."
}}
Be EXTREMELY selective. If nothing scores above 70, return fewer moments or an empty array. Quality > quantity."""
try:
response = call_claude(prompt, max_tokens=3000)
json_match = re.search(r'\[[\s\S]*\]', response)
if json_match:
moments = json.loads(json_match.group())
for m in moments:
m['hook'] = m.get('hook_quote', m.get('hook', ''))
m['payoff'] = m.get('payoff_quote', m.get('payoff', ''))
m['suggested_clip_text'] = m.get('narrative_arc', '')
return moments
return []
except Exception as e:
print(f" ⚠️ Full transcript analysis failed: {e}")
return []
def find_moments_in_chunk(chunk_text, chunk_idx, video_title=""):
"""Ask LLM to find clip-worthy moments in a transcript chunk."""
prompt = f"""You are a legendary short-form video editor.
Analyze this transcript section from "{video_title}" and find ANY moments that could become a viral 30-60 second clip.
A great clip moment has:
- A clear HOOK (bold claim, shocking stat, provocative question, emotional statement)
- A STORY ARC or BUILD (tension rises, framework develops, stakes increase)
- A PAYOFF (insight lands, number drops, counterintuitive truth revealed, punchline hits)
- Works STANDALONE: a stranger with zero context would stop scrolling and watch
TRANSCRIPT SECTION:
{chunk_text}
Return a JSON array of moments found. If no moments qualify, return an empty array.
For each moment:
{{
"start_timestamp": "[M:SS] from the transcript",
"end_timestamp": "[M:SS] approximate end",
"hook": "The opening line/moment that grabs attention",
"payoff": "How this moment resolves/lands",
"why_viral": "One sentence on why this would stop a scroll",
"estimated_score": 0-100,
"suggested_clip_text": "The key 2-3 sentences a viewer would remember"
}}
Be SELECTIVE. Most transcript sections have 0-1 clip-worthy moments. Only include moments you'd bet could score 70+."""
try:
response = call_claude(prompt, max_tokens=2000)
json_match = re.search(r'\[[\s\S]*\]', response)
if json_match:
return json.loads(json_match.group())
return []
except Exception as e:
print(f" ⚠️ Chunk {chunk_idx} failed: {e}")
return []
def score_and_refine_moment(moment, full_transcript_context, video_title=""):
"""Deep-score a candidate moment and suggest exact trim points."""
prompt = f"""You are scoring a potential short-form clip from "{video_title}".
CANDIDATE MOMENT:
Hook: {moment.get('hook', 'N/A')}
Payoff: {moment.get('payoff', 'N/A')}
Why viral: {moment.get('why_viral', 'N/A')}
Key text: {moment.get('suggested_clip_text', 'N/A')}
SURROUNDING TRANSCRIPT (for context):
{full_transcript_context}
Score this clip candidate on a 0-100 scale:
- HOOK (0-25): Does the first sentence stop the scroll?
- BUILD (0-25): Does tension/interest rise through the middle?
- PAYOFF (0-25): Does the insight LAND? Would the viewer feel smarter/moved?
- CLEAN CUT (0-25): Can this end on a strong note without trailing off?
Also provide:
- Exact start quote (the first words of the clip)
- Exact end quote (the last words before cutting)
- Any adjustments to improve the score
Return JSON:
{{
"total_score": 0-100,
"hook_score": 0-25,
"build_score": 0-25,
"payoff_score": 0-25,
"clean_cut_score": 0-25,
"start_quote": "exact first words",
"end_quote": "exact last words",
"adjustments": "how to improve",
"would_you_post_this": true/false,
"reason": "one line summary"
}}"""
try:
response = call_claude(prompt, max_tokens=1500)
json_match = re.search(r'\{[\s\S]*\}', response)
if json_match:
return json.loads(json_match.group())
return {"total_score": 0, "reason": "Failed to parse"}
except Exception as e:
return {"total_score": 0, "reason": f"API error: {e}"}
def get_context_around_timestamp(segments, timestamp_str, context_seconds=180):
    """Get clean transcript text around a timestamp."""
    # Reuse the shared parser so '[M:SS]' and 'H:MM:SS' are handled in one place.
    target_sec = timestamp_to_seconds(timestamp_str)
    context = ''
    last_ts = -15
    for seg in segments:
        if target_sec - context_seconds <= seg['start'] <= target_sec + context_seconds:
            # Emit a timestamp marker roughly every 15 seconds.
            if seg['start'] - last_ts >= 15:
                m, s = divmod(int(seg['start']), 60)
                context += f'\n[{m}:{s:02d}] '
                last_ts = seg['start']
            context += seg['text'] + ' '
    return context[:5000]
def cut_clip(video_url, start_sec, duration_sec, output_path):
"""Download video and cut a clip using ffmpeg."""
video_id = re.search(r'(?:v=|/)([a-zA-Z0-9_-]{11})', video_url).group(1)
video_cache = f"/tmp/editorial_{video_id}.mp4"
if not os.path.exists(video_cache):
print(f" ⬇️ Downloading video...")
subprocess.run([
'yt-dlp', '--format', 'best[height<=720]',
'--output', video_cache, '--no-playlist', video_url
], capture_output=True, check=True)
CLIPS_DIR.mkdir(parents=True, exist_ok=True)
cmd = [
'ffmpeg', '-y',
'-ss', str(start_sec),
'-i', video_cache,
'-t', str(duration_sec),
'-vf', 'crop=ih*9/16:ih,scale=1080:1920',
'-c:a', 'aac', '-b:a', '128k',
output_path
]
subprocess.run(cmd, capture_output=True, check=True)
return os.path.exists(output_path)
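def timestamp_to_seconds(ts_str):
    """Convert a timestamp like '14:31', '[14:31]' or '1:02:03' to seconds.

    The model may wrap the timestamp in brackets or follow it with extra
    words, so the first M:SS or H:MM:SS pattern found is used. Returns 0
    when no timestamp is present.
    """
    match = re.search(r'(\d+):(\d{1,2})(?::(\d{1,2}))?', str(ts_str))
    if not match:
        return 0
    first, second, third = match.groups()
    if third is None:
        return int(first) * 60 + int(second)
    return int(first) * 3600 + int(second) * 60 + int(third)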
def main():
parser = argparse.ArgumentParser(description='Editorial Brain — LLM-powered clip discovery')
parser.add_argument('--url', help='YouTube URL')
parser.add_argument('--vtt', help='VTT file path')
parser.add_argument('--video-id', help='Video ID (required with --vtt)')
parser.add_argument('--title', default='', help='Video title')
parser.add_argument('--max-clips', type=int, default=5, help='Max clips to produce')
parser.add_argument('--min-score', type=int, default=90, help='Minimum score threshold')
parser.add_argument('--skip-cut', action='store_true', help='Skip video cutting (analysis only)')
parser.add_argument('--output', help='Output JSON path')
args = parser.parse_args()
if not ANTHROPIC_API_KEY:
print("❌ Set ANTHROPIC_API_KEY environment variable")
sys.exit(1)
output_path = args.output or str(DATA_DIR / "editorial-clips-latest.json")
# Step 1: Get transcript
if args.url:
print(f"📥 Downloading subtitles...")
vtt_path, video_id = download_vtt(args.url)
elif args.vtt:
vtt_path = args.vtt
video_id = args.video_id or 'unknown'
else:
parser.print_help()
sys.exit(1)
print(f"📝 Parsing transcript...")
segments = parse_vtt(vtt_path)
print(f" {len(segments)} segments")
readable = build_readable_transcript(segments)
chunks = chunk_transcript(readable)
print(f" {len(chunks)} chunks for analysis")
# Step 2: Scan for moments
all_moments = []
if len(readable) < 80000:
print(f"\n🔍 Pass 1: Full-transcript analysis (single call, {len(readable)//1000}K chars)...")
moments = find_moments_full_transcript(readable, args.title)
all_moments = moments
print(f" Found {len(moments)} candidate(s)")
else:
print(f"\n🔍 Pass 1: Chunked analysis ({len(chunks)} chunks)...")
for i, chunk in enumerate(chunks):
moments = find_moments_in_chunk(chunk, i, args.title)
if moments:
print(f" Chunk {i+1}/{len(chunks)}: Found {len(moments)} candidate(s)")
for m in moments:
m['chunk_idx'] = i
all_moments.append(m)
else:
print(f" Chunk {i+1}/{len(chunks)}: No moments")
print(f"\n📊 Pass 1 complete: {len(all_moments)} total candidates")
if not all_moments:
print("❌ No clip-worthy moments found in this episode")
sys.exit(0)
all_moments.sort(key=lambda x: x.get('estimated_score', 0), reverse=True)
top_candidates = all_moments[:min(10, len(all_moments))]
for m in top_candidates:
print(f" [{m.get('start_timestamp', '?')}] Score ~{m.get('estimated_score', '?')}: {m.get('hook', '?')[:60]}")
# Step 3: Deep-score candidates (Pass 2)
print(f"\n🎯 Pass 2: Deep-scoring top {len(top_candidates)} candidates...")
scored = []
for i, moment in enumerate(top_candidates):
ts = moment.get('start_timestamp', '0:00')
context = get_context_around_timestamp(segments, ts)
score = score_and_refine_moment(moment, context, args.title)
moment['deep_score'] = score
total = score.get('total_score', 0)
scored.append(moment)
status = "✅" if total >= args.min_score else "❌"
print(f" {status} [{ts}] Score: {total}/100 — {score.get('reason', '?')[:80]}")
passed = [m for m in scored if m.get('deep_score', {}).get('total_score', 0) >= args.min_score]
print(f"\n🏆 {len(passed)} clips scored {args.min_score}+")
# Step 4: Cut clips
results = {
'video_id': video_id,
'title': args.title,
'url': args.url or '',
'total_candidates': len(all_moments),
'scored': len(scored),
'passed': len(passed),
'threshold': args.min_score,
'clips': []
}
if passed and not args.skip_cut and args.url:
print(f"\n✂️ Cutting {len(passed)} clips...")
for i, moment in enumerate(passed[:args.max_clips]):
start_sec = timestamp_to_seconds(moment.get('start_timestamp', '0:00'))
end_sec = timestamp_to_seconds(moment.get('end_timestamp', '0:00'))
duration = max(30, min(60, end_sec - start_sec)) if end_sec > start_sec else 45
clip_id = f"{video_id}_editorial_{i+1}"
clip_output = str(CLIPS_DIR / f"{clip_id}.mp4")
try:
cut_clip(args.url, start_sec, duration, clip_output)
print(f"{clip_id}.mp4 ({duration}s)")
results['clips'].append({
'id': clip_id,
'path': clip_output,
'start': start_sec,
'duration': duration,
'score': moment['deep_score'],
'hook': moment.get('hook', ''),
'payoff': moment.get('payoff', ''),
})
except Exception as e:
print(f" ❌ Cut failed: {e}")
results['all_scored'] = [{
'timestamp': m.get('start_timestamp', '?'),
'score': m.get('deep_score', {}).get('total_score', 0),
'hook': m.get('hook', ''),
'payoff': m.get('payoff', ''),
'reason': m.get('deep_score', {}).get('reason', ''),
'adjustments': m.get('deep_score', {}).get('adjustments', ''),
} for m in scored]
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
with open(output_path, 'w') as f:
json.dump(results, f, indent=2)
print(f"\n💾 Saved to {output_path}")
return 0 if passed else 1
if __name__ == '__main__':
sys.exit(main())
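
The overlap stripping described in the `parse_vtt` docstring can be seen on a small example. This is a sketch under stated assumptions: `strip_overlap` is a hypothetical helper that isolates the inner loop of `parse_vtt`, and the three caption strings are invented for illustration.

```python
def strip_overlap(prev: str, current: str) -> str:
    """Return the part of `current` that does not repeat the tail of `prev`."""
    # Try the longest possible overlap first, then shrink.
    for overlap_len in range(min(len(prev), len(current)), 0, -1):
        if current[:overlap_len] == prev[-overlap_len:]:
            return current[overlap_len:].strip()
    return current


# Scrolling captions: each block repeats the end of the previous block.
captions = [
    "so the first thing",
    "so the first thing we changed was pricing",
    "we changed was pricing and churn dropped",
]

prev = ""
pieces = []
for text in captions:
    new_text = strip_overlap(prev, text) if prev else text
    if new_text:
        pieces.append(new_text)
    prev = text

transcript = " ".join(pieces)
print(transcript)
```

Because the search accepts any overlap length down to one character, a single shared character between the end of one block and the start of the next is also stripped. A minimum overlap length may be worth considering if that causes clipped words in practice.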


@ -0,0 +1,420 @@
#!/usr/bin/env python3
"""
Quote Mining Engine: extract viral-worthy quotes from podcasts and notes.
Scans RSS feeds and local markdown/text files to extract the most quotable,
contrarian, and viral-worthy moments. Outputs scored candidates ready to publish.
Usage:
python quote-mining-engine.py --days 90 --top 50 --min-score 60
python quote-mining-engine.py --feeds feeds.json --notes-dir ./notes/
"""
import argparse
import json
import os
import re
import sys
import hashlib
from datetime import datetime, timedelta, timezone
from pathlib import Path
from html import unescape
import feedparser
# ── Configuration ──
SCRIPT_DIR = Path(__file__).resolve().parent
PROJECT_DIR = SCRIPT_DIR.parent
DATA_DIR = Path(os.environ.get("CONTENT_OPS_DATA_DIR", PROJECT_DIR / "data"))
OUTPUT_PATH = DATA_DIR / "quote-mining-latest.json"
# Configure feeds via environment variable or JSON file
# Format: {"Feed Name": "https://feed-url.com/rss", ...}
FEEDS_FILE = os.environ.get("QUOTE_MINING_FEEDS_FILE", str(PROJECT_DIR / "config" / "feeds.json"))
# Directory containing meeting notes / transcripts (markdown files)
NOTES_DIR = os.environ.get("QUOTE_MINING_NOTES_DIR", "")
# Speaker name to look for in meeting notes (configurable)
SPEAKER_NAME = os.environ.get("QUOTE_MINING_SPEAKER", "")
# ── Viral scoring heuristics ──
CONTRARIAN_SIGNALS = [
r"\b(?:wrong|myth|lie|dead|overrated|underrated|nobody|everyone)\b",
r"\b(?:stop|quit|don\'t|never|avoid|mistake|fail)\b",
r"\b(?:secret|hidden|overlooked|surprising|counterintuitive)\b",
r"\b(?:actually|truth|reality|real reason)\b",
r"\b(?:unpopular opinion|hot take|controversial)\b",
]
SPECIFICITY_SIGNALS = [
r"\$[\d,.]+[MBKmk]?",
r"\b\d{1,3}%",  # no trailing \b: "%" followed by a space or end of text has no word boundary
r"\b\d+x\b",
r"\b(?:doubled|tripled|10x|100x)\b",
r"\b\d{4,}\b",
r"\b(?:case study|example|data|study|research)\b",
]
EMOTIONAL_TRIGGERS = [
r"\b(?:fear|afraid|scared|worried|anxious)\b",
r"\b(?:love|hate|obsessed|passionate)\b",
r"\b(?:shocking|insane|crazy|wild|unbelievable|mindblowing)\b",
r"\b(?:broke|rich|wealthy|millionaire|billionaire)\b",
r"\b(?:fired|hired|quit|resigned)\b",
r"\b(?:AI|artificial intelligence|ChatGPT|GPT|automation)\b",
]
SHAREABILITY_SIGNALS = [
r"\b(?:how to|step.by.step|framework|playbook|strategy)\b",
r"\b(?:lesson|learned|mistake|regret)\b",
r"\b(?:why (?:most|nobody|everyone))\b",
r"\b(?:the (?:one|only|best|worst|biggest))\b",
r"\bhack\b",
]
def score_text(text: str) -> dict:
"""Score a text blob for viral potential. Returns breakdown + total."""
t = text.lower()
def count_matches(patterns):
return sum(1 for p in patterns if re.search(p, t, re.I))
contrarian = min(count_matches(CONTRARIAN_SIGNALS) * 15, 35)
specificity = min(count_matches(SPECIFICITY_SIGNALS) * 12, 30)
emotional = min(count_matches(EMOTIONAL_TRIGGERS) * 12, 25)
shareability = min(count_matches(SHAREABILITY_SIGNALS) * 12, 25)
words = len(text.split())
if words <= 15:
length_bonus = 10
elif words <= 30:
length_bonus = 5
else:
length_bonus = 0
question_bonus = 8 if re.search(r"\?", text) else 0
number_bonus = 8 if re.search(r"\b\d+\b", text) else 0
howto_bonus = 8 if re.search(r"^(?:how|why|what|when|the\s+\d)", text, re.I) else 0
total = min(contrarian + specificity + emotional + shareability + length_bonus + question_bonus + number_bonus + howto_bonus, 100)
return {
"contrarian": contrarian,
"specificity": specificity,
"emotional": emotional,
"shareability": shareability,
"total": total,
}
def suggest_platform(score_breakdown: dict, text: str) -> str:
"""Suggest X, LinkedIn, or both based on content characteristics."""
if score_breakdown["specificity"] >= 15 and score_breakdown["shareability"] >= 10:
return "both"
if score_breakdown["emotional"] >= 15 or len(text.split()) <= 20:
return "X"
if score_breakdown["specificity"] >= 10 or score_breakdown["shareability"] >= 10:
return "LinkedIn"
if score_breakdown["total"] >= 60:
return "both"
return "X"
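def generate_hook(quote: str) -> str:
    """Generate a punchy X-ready opening line from a quote."""
    q = quote.strip().rstrip(".")
    words = q.split()
    if len(words) <= 20:
        return q + "."
    short = " ".join(words[:15])
    # Cut at a natural break past character 20, trying separators in priority
    # order. Every separator must be non-empty: rfind("") always matches at
    # len(short) and would stop the search early.
    for sep in [". ", ", ", " \u2014 ", " - ", ": "]:
        idx = short.rfind(sep)
        if idx > 20:
            return short[: idx + len(sep)].strip().rstrip(",") + "..."
    return short + "..."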
def strip_html(text: str) -> str:
"""Remove HTML tags and decode entities."""
text = re.sub(r"<[^>]+>", " ", text)
text = unescape(text)
text = re.sub(r"\s+", " ", text).strip()
return text
def make_id(text: str) -> str:
return hashlib.md5(text.encode()).hexdigest()[:10]
def load_feeds() -> dict:
"""Load RSS feed configuration."""
feeds_path = Path(FEEDS_FILE)
if feeds_path.exists():
try:
with open(feeds_path) as f:
return json.load(f)
except Exception as e:
print(f" ⚠ Error loading feeds config: {e}")
# Check environment variable for inline JSON
feeds_env = os.environ.get("QUOTE_MINING_FEEDS", "")
if feeds_env:
try:
return json.loads(feeds_env)
except Exception:
pass
print(" ⚠ No feeds configured. Set QUOTE_MINING_FEEDS_FILE or QUOTE_MINING_FEEDS env var.")
print(" Example feeds.json: {\"My Podcast\": \"https://feeds.example.com/rss\"}")
return {}
# ── RSS Feed Processing ──
def fetch_feed_quotes(feed_name: str, feed_url: str, since: datetime) -> list:
"""Parse an RSS feed and extract quotable candidates."""
print(f" Fetching {feed_name}...")
feed = feedparser.parse(feed_url)
candidates = []
for entry in feed.entries:
pub = entry.get("published_parsed") or entry.get("updated_parsed")
if not pub:
continue
pub_dt = datetime(*pub[:6], tzinfo=timezone.utc)
if pub_dt < since:
continue
title = entry.get("title", "").strip()
desc = strip_html(entry.get("description", "") or entry.get("summary", ""))
date_str = pub_dt.strftime("%Y-%m-%d")
if title:
scores = score_text(title + " " + desc[:200])
context_sentence = desc[:200].split(".")[0].strip() + "." if desc else title
candidates.append({
"id": make_id(title + date_str),
"quote_text": title,
"source": f"{feed_name} - {title} ({date_str})",
"viral_score": scores["total"],
"score_breakdown": scores,
"suggested_platform": suggest_platform(scores, title),
"hook_version": generate_hook(title),
"context": context_sentence,
"type": "podcast_title",
})
if desc and len(desc) > 50:
sentences = re.split(r"(?<=[.!?])\s+", desc)
for sent in sentences:
sent = sent.strip()
if len(sent) < 30 or len(sent) > 300:
continue
if any(skip in sent.lower() for skip in [
"subscribe", "leave a review", "click here", "sign up",
"sponsor", "brought to you", "check out", "visit us",
"follow us", "download", "episode is", "links mentioned",
"get a free", "use code", "http", "www.", ".com/",
]):
continue
s = score_text(sent)
if s["total"] >= 30:
candidates.append({
"id": make_id(sent + date_str),
"quote_text": sent,
"source": f"{feed_name} - {title} ({date_str})",
"viral_score": s["total"],
"score_breakdown": s,
"suggested_platform": suggest_platform(s, sent),
"hook_version": generate_hook(sent),
"context": f"From episode: {title}",
"type": "podcast_description",
})
print(f"{len(candidates)} candidates from {feed_name}")
return candidates
# ── Notes Processing ──
def scan_notes(notes_dir: str, since: datetime, speaker: str = "") -> list:
"""Scan meeting notes/transcripts for quotable moments."""
notes_path = Path(notes_dir)
if not notes_path.exists():
print(f" ⚠ Notes directory not found: {notes_dir}, skipping.")
return []
print(f" Scanning notes in {notes_dir}...")
candidates = []
for fpath in sorted(notes_path.glob("**/*.md")):
m = re.match(r"(\d{4}-\d{2}-\d{2})", fpath.name)
if m:
file_date = datetime.strptime(m.group(1), "%Y-%m-%d").replace(tzinfo=timezone.utc)
if file_date < since:
continue
else:
# If no date in filename, include by default
file_date = datetime.now(timezone.utc)
try:
text = fpath.read_text(errors="replace")
except Exception:
continue
meeting_name = fpath.stem.replace("_", " ").lstrip("0123456789- ")
notable_lines = []
for line in text.split("\n"):
line = line.strip()
if not line or len(line) < 30:
continue
# Match lines attributed to configured speaker
if speaker and re.match(rf"(?:{re.escape(speaker)})\s*:", line, re.I):
content = re.sub(rf"^(?:{re.escape(speaker)})\s*:\s*", "", line, flags=re.I)
notable_lines.append(content.strip())
# Grab bullet points with viral signals
elif re.match(r"[\*\-]\s+", line):
bullet = re.sub(r"^[\*\-]\s+", "", line).strip()
if len(bullet) > 30 and any(
re.search(p, bullet, re.I)
for p in CONTRARIAN_SIGNALS + SPECIFICITY_SIGNALS + EMOTIONAL_TRIGGERS
):
notable_lines.append(bullet)
for line in notable_lines:
if len(line) < 20 or len(line) > 500:
continue
if any(skip in line.lower() for skip in [
"let me share my screen", "can you hear me", "hold on",
"one second", "sorry about that", "let me pull up",
"next slide", "any questions", "sounds good",
]):
continue
s = score_text(line)
if s["total"] >= 25:
date_str = file_date.strftime("%Y-%m-%d")
candidates.append({
"id": make_id(line + date_str),
"quote_text": line,
"source": f"Notes — {meeting_name} ({date_str})",
"viral_score": s["total"],
"score_breakdown": s,
"suggested_platform": suggest_platform(s, line),
"hook_version": generate_hook(line),
"context": f"From: {meeting_name}",
"type": "meeting_notes",
})
print(f"{len(candidates)} candidates from notes")
return candidates
# ── Main ──
def main():
parser = argparse.ArgumentParser(description="Quote Mining Engine")
parser.add_argument("--days", type=int, default=90, help="Look back N days (default: 90)")
parser.add_argument("--top", type=int, default=50, help="Return top N quotes (default: 50)")
parser.add_argument("--min-score", type=int, default=40, help="Minimum viral score (default: 40)")
parser.add_argument("--output", type=str, default=str(OUTPUT_PATH), help="Output JSON path")
parser.add_argument("--feeds", type=str, help="Path to feeds JSON config file")
parser.add_argument("--notes-dir", type=str, help="Directory of meeting notes to scan")
parser.add_argument("--speaker", type=str, help="Speaker name to extract from notes")
args = parser.parse_args()
since = datetime.now(timezone.utc) - timedelta(days=args.days)
print(f"🔍 Quote Mining Engine — scanning last {args.days} days (since {since.strftime('%Y-%m-%d')})\n")
all_candidates = []
# 1. Podcast RSS feeds
    # An explicit --feeds path takes precedence over the configured defaults.
    if args.feeds:
        with open(args.feeds) as f:
            feeds = json.load(f)
    else:
        feeds = load_feeds()
if feeds:
print("📡 Fetching podcast feeds...")
for name, url in feeds.items():
try:
all_candidates.extend(fetch_feed_quotes(name, url, since))
except Exception as e:
print(f" ⚠ Error fetching {name}: {e}")
# 2. Meeting notes
notes_dir = args.notes_dir or NOTES_DIR
speaker = args.speaker or SPEAKER_NAME
if notes_dir:
print("\n📝 Scanning meeting notes...")
try:
all_candidates.extend(scan_notes(notes_dir, since, speaker))
except Exception as e:
print(f" ⚠ Error scanning notes: {e}")
# 3. Deduplicate
seen = set()
unique = []
for c in all_candidates:
if c["id"] not in seen:
seen.add(c["id"])
unique.append(c)
all_candidates = unique
# 4. Filter by min score
filtered = [c for c in all_candidates if c["viral_score"] >= args.min_score]
# 5. Sort and take top N
filtered.sort(key=lambda x: x["viral_score"], reverse=True)
top = filtered[: args.top]
# 6. Clean output
output = []
for c in top:
output.append({
"quote_text": c["quote_text"],
"source": c["source"],
"viral_score": c["viral_score"],
"suggested_platform": c["suggested_platform"],
"hook_version": c["hook_version"],
"context": c["context"],
})
# 7. Save
os.makedirs(os.path.dirname(args.output), exist_ok=True)
with open(args.output, "w") as f:
json.dump(output, f, indent=2)
# 8. Summary
print(f"\n{'='*60}")
print(f"📊 QUOTE MINING SUMMARY")
print(f"{'='*60}")
print(f" Total candidates found: {len(all_candidates)}")
print(f" Above min score ({args.min_score}): {len(filtered)}")
print(f" Top quotes saved: {len(output)}")
print(f" Output: {args.output}")
print()
if output:
print(f"🏆 Top 10 Quotes:")
print(f"{'-'*60}")
for i, q in enumerate(output[:10], 1):
print(f" {i:2d}. [{q['viral_score']:3d}] {q['quote_text'][:80]}")
print(f"{q['source'][:60]}")
print(f" Platform: {q['suggested_platform']} | Hook: {q['hook_version'][:50]}...")
print()
else:
print(" ⚠ No quotes met the minimum score threshold.")
print(f" Try lowering --min-score (currently {args.min_score})")
return 0
if __name__ == "__main__":
sys.exit(main())

7
finance-ops/.env.example Normal file

@ -0,0 +1,7 @@
# AI Finance Ops - Environment Variables
# No API keys required — all analysis runs locally.
# Optional: Override default data directories
# CFO_INPUT_DIR=./data/uploads
# CFO_HISTORY_DIR=./data/history
# SCENARIO_OUTPUT=./data/scenarios.json

128
finance-ops/README.md Normal file

@ -0,0 +1,128 @@
# AI Finance Ops
> Your AI CFO that finds hidden costs in 30 minutes.
Upload your QuickBooks exports. Get a full executive CFO briefing with anomaly detection, burn rate analysis, vendor concentration risk, and actionable recommendations. Or point it at a codebase and get a development cost estimate with organizational overhead modeling and AI ROI analysis.
## What's Inside
### CFO Briefing Generator
Drop in your QuickBooks exports (P&L, Balance Sheet, General Ledger, Expenses by Vendor, Cash Flow, etc.) and get:
- **Executive financial summary** with traffic-light status indicators (🟢🟡🔴)
- **Profitability analysis** — gross margin, net margin, operating income
- **People cost breakdown** — salaries vs contractors, payroll taxes, benefits
- **Tool & subscription audit** — find the SaaS bloat
- **Customer concentration risk** — flag dangerous client dependencies
- **Month-over-month comparison** — automatic trend detection
- **Anomaly alerts** — expenses that spike, new vendors with big spend, owner draws
- **Scenario modeling** — base/bull/bear case projections with monthly burns
### Codebase Cost Estimator
Point it at any codebase and get:
- **Development hours estimate** by code type and complexity
- **Market rate research** with current-year data
- **Organizational overhead modeling** — solo founder through enterprise
- **Full team cost** — PM, design, QA, DevOps, not just engineering
- **AI ROI analysis** — what did each hour of Claude produce in value?
## Quick Start
```bash
# 1. Install dependencies
pip install -r requirements.txt
# 2. Copy env template
cp .env.example .env
# 3. Drop QuickBooks exports into a folder
mkdir -p data/uploads
# Copy your CSV/XLSX files there
# 4. Run CFO analysis
python scripts/cfo-analyzer.py --input data/uploads/
# 5. Or estimate a codebase
# Use the SKILL.md workflow with Claude Code
```
## Supported QuickBooks Reports
Any subset works — P&L alone is enough to start:
| Report | What It Adds |
|--------|-------------|
| P&L Summary | Revenue, COGS, expenses, net income (core) |
| P&L by Customer | Client concentration analysis |
| P&L Detail | Transaction-level drill-down |
| Balance Sheet | Assets, liabilities, equity position |
| Cash Flow Statement | Operating/investing/financing flows |
| General Ledger | Full account transaction history |
| Expenses by Vendor | Vendor-level spend breakdown |
| Transaction List by Vendor | Detailed vendor transactions |
| Bill Payments | AP payment history |
| Account List | Chart of accounts |
## How It Works
The CFO analyzer:
1. **Auto-detects** report types by scanning file headers
2. **Parses** QuickBooks CSV/XLSX formats (handles dollar signs, commas, negative formats)
3. **Computes KPIs** against benchmarks for your business size
4. **Compares** to prior periods if history exists
5. **Generates** a formatted executive briefing with status indicators
The scenario modeler:
1. **Reads** the latest financial analysis
2. **Models** base case (status quo), bull case (growth targets hit), and bear case (lose top clients)
3. **Projects** 12 months forward with monthly P&L
4. **Identifies** the fastest cost levers to pull
## Benchmark Thresholds
Built-in benchmarks for B2B services businesses:
| Metric | 🟢 Healthy | 🟡 Watch | 🔴 Action Needed |
|--------|-----------|---------|-----------------|
| Gross Margin | >60% | 45-60% | <45% |
| Net Margin | >10% | 0-10% | Negative |
| People Costs (% rev) | <65% | 65-75% | >75% |
| Tool/Sub Costs (% rev) | <8% | 8-12% | >12% |
| Client Concentration | No client >15% | One at 15-25% | One >25% |
| Cash Runway | >3 months | 1-3 months | <1 month |
All thresholds are configurable in `references/metrics-guide.md`.
## File Structure
```
finance-ops/
├── README.md                    # This file
├── SKILL.md                     # Claude Code skill definition
├── requirements.txt             # Python dependencies
├── .env.example                 # Environment template
├── scripts/
│   ├── cfo-analyzer.py          # Main CFO briefing generator
│   └── scenario-modeler.py      # Base/bull/bear projections
└── references/
    ├── metrics-guide.md         # KPI thresholds and benchmarks
    ├── quickbooks-formats.md    # QB export format specs
    ├── rates.md                 # Developer productivity rates
    ├── org-overhead.md          # Organizational overhead factors
    ├── team-cost.md             # Full team cost multipliers
    ├── claude-roi.md            # AI ROI calculation method
    └── output-template.md       # Cost estimate output format
```
## Customization
**Adjust for your business size:** Edit `references/metrics-guide.md` to change the revenue range and benchmark thresholds. A $500K startup has different healthy ranges than a $10M agency.
**Add report types:** The file detection in `cfo-analyzer.py` uses header scanning. Add new patterns to `detect_file_type()` for custom QB report layouts.
**Change categories:** Expense categorization keywords are in `compute_kpis()`. Adjust the keyword lists to match your chart of accounts.
## License
MIT

120
finance-ops/SKILL.md Normal file

@ -0,0 +1,120 @@
---
name: finance-ops
description: "AI-powered financial analysis suite. Generates executive CFO briefings from QuickBooks exports (P&L, Balance Sheet, General Ledger, Cash Flow, etc.) with anomaly detection, burn rate, runway analysis, and scenario modeling. Also estimates codebase development costs with organizational overhead and AI ROI analysis. Triggers on: 'CFO briefing', 'financial analysis', 'cost briefing', 'expense review', 'runway analysis', 'burn rate', 'cost estimate', 'how much would this cost to build', 'development cost', 'Claude ROI'."
---
# AI Finance Ops
Two tools: CFO Briefing Generator and Codebase Cost Estimator.
---
## Tool 1: CFO Briefing Generator
Generate executive financial summaries from QuickBooks exports.
### Workflow
#### 1. Ingest Files
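Place QuickBooks export files (CSV or XLSX) in a working directory. Legacy XLS files are also picked up by extension, but report detection opens workbooks with openpyxl, which does not read the old `.xls` format, so re-save those exports as XLSX first. Accepted report types (any subset works; P&L alone is sufficient):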
- **P&L Summary** — Revenue, COGS, expenses, net income (MOST IMPORTANT)
- **P&L by Customer** — Revenue breakdown by client
- **P&L Detail** — Transaction-level detail (XLSX)
- **Balance Sheet** — Assets, liabilities, equity
- **General Ledger** — All account transactions
- **Expenses by Vendor** — Vendor-level expense breakdown
- **Transaction List by Vendor** — Detailed vendor transactions
- **Bill Payments** — AP payment history
- **Cash Flow Statement** — Operating/investing/financing flows (XLSX)
- **Account List** — Chart of accounts
#### 2. Run Analysis
```bash
python3 scripts/cfo-analyzer.py --input ./data/uploads/ [--period YYYY-MM]
```
Options:
- `--input DIR` — Directory with QB exports
- `--period YYYY-MM` — Override period label (default: auto-detected from files)
- `--history DIR` — History directory for MoM comparison (default: `./data/history/`)
- `--no-history` — Skip saving to history
The script:
1. Auto-detects file types by scanning headers
2. Parses each file into structured data
3. Computes all KPIs (see `references/metrics-guide.md` for definitions and healthy ranges)
4. Loads prior period from history for MoM comparison
5. Saves current period to history
6. Outputs formatted executive summary to stdout
#### 3. Scenario Modeling (Optional)
After running the CFO analysis, model base/bull/bear scenarios:
```bash
python3 scripts/scenario-modeler.py --input ./data/financial-latest.json
```
This generates 12-month projections for:
- **Base case** — current trajectory continues
- **Bull case** — growth targets met (new product revenue + new clients)
- **Bear case** — lose top clients
#### 4. Deliver Summary
The script outputs a formatted briefing with emoji status indicators (🟢🟡🔴), suitable for Slack, email, or any messaging surface.
### File Format Details
See `references/quickbooks-formats.md` for expected CSV/XLSX column formats and detection heuristics.
### Metric Thresholds
See `references/metrics-guide.md` for healthy ranges, red/yellow/green thresholds, and benchmark context. Adjust thresholds for your business size and type.
---
## Tool 2: Codebase Cost Estimator
Estimate full development cost of a codebase.
### Workflow
#### Step 1: Analyze the Codebase
Read the entire codebase. Catalog total lines of code by language/type, architectural complexity, advanced features, testing coverage, and documentation quality.
#### Step 2: Calculate Development Hours
Apply productivity rates from `references/rates.md`. Calculate base hours per code type, then apply overhead multipliers for architecture, debugging, review, docs, integration, and learning curve.
#### Step 3: Research Market Rates
Use web search to find current hourly rates for the relevant specializations. Build a rate table with low / median / high for the project's tech stack.
#### Step 4: Calculate Organizational Overhead
Convert raw dev hours to calendar time using efficiency factors from `references/org-overhead.md`. Show estimates across company types (Solo through Enterprise).
#### Step 5: Calculate Full Team Cost
Apply supporting role ratios and team multipliers from `references/team-cost.md`. Show role-by-role breakdown, plus summary across all company stages.
#### Step 6: Generate Cost Estimate
Output the full estimate using the template in `references/output-template.md`. Include all sections: codebase metrics, dev hours, calendar time, market rates, engineering cost, full team cost, grand total summary, and assumptions.
#### Step 7: AI ROI Analysis (Optional)
If the codebase was built with AI assistance, calculate value per AI hour using `references/claude-roi.md`. Determine active hours via git history clustering, calculate speed multiplier vs human developer, and compute cost savings and ROI.
### Key Principles
- Present professionally, suitable for stakeholders
- Include confidence level (low/medium/high) and key assumptions
- Highlight highest-complexity areas that drive cost
- Always show ranges (low/avg/high), never a single number
- Search for current-year market rates; do not rely on stale data


@ -0,0 +1,98 @@
# Claude ROI — Value Per Claude Hour
The most important metric for AI-assisted development. Answers: "What did each hour of Claude's actual working time produce?"
## Step 1: Determine Actual Claude Clock Time
### Method 1: Git History (preferred)
Run `git log --format="%ai" | sort` to get all commit timestamps. Then:
1. First commit = project start
2. Last commit = current state
3. Total calendar days = last - first
4. Cluster commits into sessions: group commits within 4-hour windows as one session
5. Estimate session duration using commit density:
| Commits in Window | Estimated Session Duration |
|-------------------|---------------------------|
| 1-2 commits | ~1 hour |
| 3-5 commits | ~2 hours |
| 6-10 commits | ~3 hours |
| 11+ commits | ~4 hours |
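The clustering rule above can be sketched in Python. This is an illustration only, not part of the skill: the names `hours_for_commit_count` and `estimate_active_hours` are hypothetical, the timestamps are made up, and reading "within 4-hour windows" as "no more than 4 hours after the first commit of the session" is an assumption.

```python
from datetime import datetime, timedelta

# Assumption: a session covers commits up to 4 hours after its first commit.
SESSION_WINDOW = timedelta(hours=4)


def hours_for_commit_count(n: int) -> int:
    """Map commits in one session to estimated hours, following the table above.

    Exactly 10 commits is treated as part of the 6-10 row.
    """
    if n <= 2:
        return 1
    if n <= 5:
        return 2
    if n <= 10:
        return 3
    return 4


def estimate_active_hours(timestamps: list[datetime]) -> tuple[int, int]:
    """Return (session_count, estimated_hours) for a list of commit times."""
    commits_per_session: list[int] = []
    window_start = None
    for ts in sorted(timestamps):
        if window_start is None or ts - window_start > SESSION_WINDOW:
            commits_per_session.append(0)  # start a new session
            window_start = ts
        commits_per_session[-1] += 1
    total = sum(hours_for_commit_count(n) for n in commits_per_session)
    return len(commits_per_session), total


# Made-up example: three commits one morning, one commit the next afternoon.
commits = [
    datetime(2026, 3, 1, 9, 0),
    datetime(2026, 3, 1, 9, 40),
    datetime(2026, 3, 1, 11, 15),
    datetime(2026, 3, 2, 14, 0),
]
session_count, active_hours = estimate_active_hours(commits)
print(session_count, active_hours)
```

Each line of `git log --format="%ai"` output can be turned into a `datetime` with `datetime.strptime(line, "%Y-%m-%d %H:%M:%S %z")` before being passed in.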
### Method 2: File Modification Timestamps (no git)
```bash
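# macOS/BSD stat shown; on Linux (GNU coreutils) use: xargs -0 stat -c "%y"
find . \( -name "*.ts" -o -name "*.swift" -o -name "*.py" \) -print0 | xargs -0 stat -f "%Sm" -t "%Y-%m-%d %H:%M:%S" | sort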
```
Apply same session clustering logic.
### Method 3: Fallback Estimate (no timestamps)
Assume Claude writes 200-500 lines of meaningful code per hour, well above the human rates in `rates.md`. The formula uses 350, the midpoint of that range:
`Claude active hours ≈ Total LOC ÷ 350`
## Step 2: Calculate Value per Claude Hour
`Value per Claude Hour = Total Code Value (from team cost) ÷ Estimated Claude Active Hours`
Calculate across scenarios:
| Code Value Scenario | Claude Hours (est.) | Value per Claude Hour |
|--------------------|--------------------|-----------------------|
| Engineering only (avg) | [X] hrs | $[X,XXX]/hr |
| Full team equivalent (Growth Co) | [X] hrs | $[X,XXX]/hr |
| Full team equivalent (Enterprise) | [X] hrs | $[X,XXX]/hr |
## Step 3: Claude Efficiency vs. Human Developer
**Speed Multiplier:**
`Speed Multiplier = Human Dev Hours ÷ Claude Active Hours`
Example: Human needs 500 hours, Claude did it in 20 hours → 25x faster
**Cost Efficiency:**
```
Human Cost = Human Hours × $150/hr
Claude Cost = Subscription ($20-200/month) + API costs
Savings = Human Cost - Claude Cost
ROI = Savings ÷ Claude Cost
```
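As a worked example of the formulas above, the following sketch plugs in the 500-hour / 20-hour example and the $150/hr rate. The function name `claude_roi` is hypothetical, and the $200 Claude cost (one month at the top of the subscription range, no API spend) is an assumed input rather than a measured figure.

```python
def claude_roi(human_hours: float, claude_hours: float,
               human_rate: float, claude_cost: float) -> dict[str, float]:
    """Apply the speed-multiplier and cost-efficiency formulas above."""
    if claude_hours <= 0 or claude_cost <= 0:
        raise ValueError("claude_hours and claude_cost must be positive")
    human_cost = human_hours * human_rate
    savings = human_cost - claude_cost
    return {
        "speed_multiplier": human_hours / claude_hours,
        "human_cost": human_cost,
        "savings": savings,
        "roi": savings / claude_cost,  # net savings per dollar spent on Claude
    }


# Assumed inputs: 500 human hours vs 20 Claude hours, $150/hr, $200 Claude cost.
result = claude_roi(human_hours=500, claude_hours=20, human_rate=150, claude_cost=200)
for key, value in result.items():
    print(f"{key}: {value:,.1f}")
```

Because the Claude cost is so small next to the human cost, the ROI figure is very sensitive to that denominator, so it is worth stating which subscription and API costs were counted.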
## Output Format
```
### Claude ROI Analysis
Project Timeline:
- First commit / project start: [date]
- Latest commit: [date]
- Total calendar time: [X] days ([X] weeks)
Claude Active Hours Estimate:
- Total sessions identified: [X] sessions
- Estimated active hours: [X] hours
- Method: [git clustering / file timestamps / LOC estimate]
Value per Claude Hour:
| Value Basis | Total Value | Claude Hours | $/Claude Hour |
|-------------|-------------|--------------|---------------|
| Engineering only | $[X] | [X] hrs | $[X,XXX]/hr |
| Full team (Growth Co) | $[X] | [X] hrs | $[X,XXX]/hr |
Speed vs. Human Developer:
- Estimated human hours for same work: [X] hours
- Claude active hours: [X] hours
- Speed multiplier: [X]x (Claude was [X]x faster)
Cost Comparison:
- Human developer cost: $[X] (at $150/hr avg)
- Estimated Claude cost: $[X] (subscription + API)
- Net savings: $[X]
- ROI: [X]x (every $1 spent on Claude returned $[X] in net savings)
The headline: Claude worked ~[X] hours and produced $[X] in professional
development value — roughly $[X,XXX] per Claude hour.
```


@ -0,0 +1,62 @@
# Financial Metrics Guide
Healthy ranges for B2B services/agency businesses. Adjust thresholds for your revenue range.
## Revenue & Profitability
| Metric | 🟢 Green | 🟡 Yellow | 🔴 Red |
|---|---|---|---|
| TTM Revenue Growth | >10% YoY | 0-10% YoY | Declining |
| Net Income Margin | >10% | 0-10% | Negative |
| Gross Margin | >60% | 45-60% | <45% |
| Revenue per Employee | >$150K | $100-150K | <$100K |
| Client Concentration | No client >15% | One client 15-25% | One client >25% |
## Cost Structure (% of Revenue)
| Metric | 🟢 Green | 🟡 Yellow | 🔴 Red |
|---|---|---|---|
| Total People Costs | <65% | 65-75% | >75% |
| COGS (direct delivery) | <50% | 50-60% | >60% |
| Sales & Marketing | 10-20% | 20-30% | >30% |
| G&A | <15% | 15-25% | >25% |
| Tool/Subscription Costs | <8% | 8-12% | >12% |
| Contractor % of People | <30% | 30-50% | >50% |
## Cash & Liquidity
| Metric | 🟢 Green | 🟡 Yellow | 🔴 Red |
|---|---|---|---|
| Cash Runway (months) | >3 months | 1-3 months | <1 month |
| AR Days Outstanding | <45 days | 45-60 days | >60 days |
| AP Days Outstanding | <45 days | 45-60 days | >60 days |
| Current Ratio (CA/CL) | >1.5 | 1.0-1.5 | <1.0 |
## Debt & Obligations
| Metric | 🟢 Green | 🟡 Yellow | 🔴 Red |
|---|---|---|---|
| Debt-to-Revenue | <0.5x | 0.5-1.0x | >1.0x |
| Debt Service Coverage | >2.0x | 1.0-2.0x | <1.0x |
| Interest as % of Revenue | <3% | 3-5% | >5% |
## Anomaly Detection
Flag these automatically:
- Any single expense line item >$5,000 not in payroll/rent/amortization
- Any category with >10% MoM increase
- Any new vendor with >$2,000 spend
- Owner expenses >$10K/month
- Recruiting spend >$8K/month sustained
- Revenue from any single client >20% of total
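Three of the rules above (large line items, MoM category increases, new-vendor spend) can be sketched as follows. This is a minimal illustration under assumptions: the function name `flag_anomalies`, the input shapes, and the vendor names are hypothetical and are not taken from `cfo-analyzer.py`.

```python
EXEMPT_CATEGORIES = {"payroll", "rent", "amortization"}


def flag_anomalies(items, prior_totals, known_vendors):
    """Return human-readable flags for three of the rules listed above.

    items: list of dicts with "category", "vendor", "amount" (assumed shape).
    prior_totals: last month's spend per category.
    known_vendors: vendor names seen in earlier periods.
    """
    flags = []
    totals: dict[str, float] = {}
    new_vendor_spend: dict[str, float] = {}
    for item in items:
        cat, vendor, amount = item["category"], item["vendor"], item["amount"]
        totals[cat] = totals.get(cat, 0.0) + amount
        # Rule: single line item >$5,000 outside payroll/rent/amortization
        if amount > 5_000 and cat.lower() not in EXEMPT_CATEGORIES:
            flags.append(f"Large line item: {vendor} ${amount:,.0f} ({cat})")
        if vendor not in known_vendors:
            new_vendor_spend[vendor] = new_vendor_spend.get(vendor, 0.0) + amount
    # Rule: any category with >10% month-over-month increase
    for cat, total in totals.items():
        prior = prior_totals.get(cat, 0.0)
        if prior > 0 and (total - prior) / prior > 0.10:
            flags.append(f"{cat} up {(total - prior) / prior:.0%} MoM")
    # Rule: any new vendor with >$2,000 spend
    for vendor, spend in new_vendor_spend.items():
        if spend > 2_000:
            flags.append(f"New vendor: {vendor} ${spend:,.0f}")
    return flags


# Made-up data for illustration.
items = [
    {"category": "Software", "vendor": "Acme SaaS", "amount": 6200.0},
    {"category": "Payroll", "vendor": "Payroll Provider", "amount": 48000.0},
    {"category": "Travel", "vendor": "Airline Co", "amount": 1100.0},
]
prior_totals = {"Software": 3000.0, "Payroll": 47000.0, "Travel": 1050.0}
known_vendors = {"Payroll Provider", "Airline Co"}
flags = flag_anomalies(items, prior_totals, known_vendors)
for flag in flags:
    print(flag)
```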
## Scaling Thresholds by Revenue
These benchmarks shift as you grow:
| Revenue Range | Target Gross Margin | Target People % | Target G&A % |
|---|---|---|---|
| <$1M | >50% | <70% | <20% |
| $1-3M | >55% | <65% | <18% |
| $3-10M | >60% | <60% | <15% |
| $10M+ | >65% | <55% | <12% |
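A lookup over the table above might look like the sketch below. The function name `targets_for_revenue` and the dictionary keys are hypothetical, and assigning a boundary value such as exactly $3M to the higher band is an assumption the table does not settle.

```python
# (upper revenue bound, min gross margin %, max people %, max G&A %)
SCALING_TARGETS = [
    (1_000_000, 50, 70, 20),
    (3_000_000, 55, 65, 18),
    (10_000_000, 60, 60, 15),
    (float("inf"), 65, 55, 12),
]


def targets_for_revenue(annual_revenue: float) -> dict[str, int]:
    """Return the benchmark targets for the band containing annual_revenue."""
    if annual_revenue < 0:
        raise ValueError("revenue must be non-negative")
    for upper, gross_margin, people, ga in SCALING_TARGETS:
        if annual_revenue < upper:  # boundary values fall into the higher band
            return {
                "gross_margin_min": gross_margin,
                "people_pct_max": people,
                "ga_pct_max": ga,
            }
    raise ValueError("revenue must be a finite number")


print(targets_for_revenue(2_500_000))
```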


@ -0,0 +1,33 @@
# Organizational Overhead
## Weekly Time Allocation (Typical Developer)
| Activity | Hours/Week | Notes |
|----------|-----------|-------|
| Pure coding time | 20-25 | Actual focused development |
| Daily standups | 1.25 | 15 min × 5 days |
| Weekly team sync | 1-2 | All-hands, team meetings |
| 1:1s with manager | 0.5-1 | Weekly or biweekly |
| Sprint planning/retro | 1-2 | Per week average |
| Code reviews (giving) | 2-3 | Reviewing teammates' work |
| Slack/email/async | 3-5 | Communication overhead |
| Context switching | 2-4 | Interruptions, task switching |
| Ad-hoc meetings | 1-2 | Unplanned discussions |
| Admin/HR/tooling | 1-2 | Timesheets, tools, access |
## Coding Efficiency by Company Type
| Company Type | Efficiency | Coding Hrs/Week |
|-------------|-----------|-----------------|
| Startup (lean) | 60-70% | 24-28 |
| Growth company | 50-60% | 20-24 |
| Enterprise | 40-50% | 16-20 |
| Large bureaucracy | 30-40% | 12-16 |
## Calendar Time Formula
```
Calendar Weeks = Raw Dev Hours ÷ (40 × Efficiency Factor)
```
Example: 3,288 raw hours at 50% efficiency = 3,288 ÷ (40 × 0.50) = 3,288 ÷ 20 ≈ 164 weeks (~3.2 years)
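The formula can be checked with a few lines of Python. This is a sketch: the function name `calendar_weeks` is hypothetical, and the per-company efficiencies in the loop are midpoints of the ranges in the table above, chosen here for illustration.

```python
def calendar_weeks(raw_dev_hours: float, efficiency: float,
                   hours_per_week: float = 40.0) -> float:
    """Calendar Weeks = Raw Dev Hours / (40 x Efficiency Factor)."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return raw_dev_hours / (hours_per_week * efficiency)


# The worked example above: 3,288 raw hours at 50% efficiency.
weeks = calendar_weeks(3288, 0.50)
print(f"{weeks:.0f} weeks (~{weeks / 52:.1f} years)")

# Same workload across company types, using range midpoints (assumed).
for company, eff in [("Startup (lean)", 0.65), ("Growth company", 0.55),
                     ("Enterprise", 0.45), ("Large bureaucracy", 0.35)]:
    print(f"{company}: {calendar_weeks(3288, eff):.0f} weeks")
```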


@ -0,0 +1,120 @@
# Output Template
Use this structure for the final estimate. Replace all [X] placeholders with calculated values.
```markdown
## [Project Name] - Development Cost Estimate
Analysis Date: [Current Date]
Codebase Version: [From project status/README]
### Codebase Metrics
- Total Lines of Code: [number]
  - [Language 1]: [number] lines
  - [Language 2]: [number] lines
  - Tests: [number] lines
  - Documentation: [number] lines
- Complexity Factors:
  - Advanced frameworks: [list key ones]
  - System-level programming: [if applicable]
  - GPU programming: [if applicable]
  - Third-party integrations: [list]
### Development Time Estimate
Base Development Hours: [number] hours
- [Phase/Module 1]: [hours] hours
- [Phase/Module 2]: [hours] hours
- [Phase/Module 3]: [hours] hours
Overhead Multipliers:
- Architecture & Design: +[X]% ([hours] hours)
- Debugging & Troubleshooting: +[X]% ([hours] hours)
- Code Review & Refactoring: +[X]% ([hours] hours)
- Documentation: +[X]% ([hours] hours)
- Integration & Testing: +[X]% ([hours] hours)
- Learning Curve: +[X]% ([hours] hours)
Total Estimated Hours: [number] hours
### Realistic Calendar Time (with Organizational Overhead)
| Company Type | Efficiency | Coding Hrs/Week | Calendar Weeks | Calendar Time |
|--------------|------------|-----------------|----------------|---------------|
| Solo/Startup (lean) | 65% | 26 hrs | [X] weeks | ~[X] months |
| Growth Company | 55% | 22 hrs | [X] weeks | ~[X] years |
| Enterprise | 45% | 18 hrs | [X] weeks | ~[X] years |
| Large Bureaucracy | 35% | 14 hrs | [X] weeks | ~[X] years |
### Market Rate Research
Senior Developer Rates ([current year]):
- Low end: $[X]/hour (remote, mid-level market)
- Average: $[X]/hour (standard US market)
- High end: $[X]/hour (SF Bay Area, NYC, specialized)
Recommended Rate: $[X]/hour
*Rationale*: [Why this rate for this project's tech stack]
### Total Cost Estimate
| Scenario | Hourly Rate | Total Hours | Total Cost |
|----------|-------------|-------------|------------|
| Low-end | $[X] | [hours] | $[X,XXX] |
| Average | $[X] | [hours] | $[X,XXX] |
| High-end | $[X] | [hours] | $[X,XXX] |
Recommended Estimate (Engineering Only): $[X,XXX] - $[X,XXX]
### Full Team Cost (All Roles)
| Company Stage | Team Multiplier | Engineering Cost | Full Team Cost |
|---------------|-----------------|------------------|----------------|
| Solo/Founder | 1.0x | $[X] | $[X] |
| Lean Startup | 1.45x | $[X] | $[X] |
| Growth Company | 2.2x | $[X] | $[X] |
| Enterprise | 2.65x | $[X] | $[X] |
Role Breakdown (Growth Company Example):
| Role | Hours | Rate | Cost |
|------|-------|------|------|
| Engineering | [X] hrs | $[X]/hr | $[X] |
| Product Management | [X] hrs | $[X]/hr | $[X] |
| UX/UI Design | [X] hrs | $[X]/hr | $[X] |
| Engineering Management | [X] hrs | $[X]/hr | $[X] |
| QA/Testing | [X] hrs | $[X]/hr | $[X] |
| Project Management | [X] hrs | $[X]/hr | $[X] |
| Technical Writing | [X] hrs | $[X]/hr | $[X] |
| DevOps/Platform | [X] hrs | $[X]/hr | $[X] |
| TOTAL | [X] hrs | | $[X] |
### Grand Total Summary
| Metric | Solo | Lean Startup | Growth Co | Enterprise |
|--------|------|--------------|-----------|------------|
| Calendar Time | [X] | [X] | [X] | [X] |
| Total Human Hours | [X] | [X] | [X] | [X] |
| Total Cost | $[X] | $[X] | $[X] | $[X] |
### Assumptions
1. Rates based on US market averages ([year])
2. Full-time equivalent allocation for all roles
3. Includes complete implementation of [scope]
4. Does not include:
   - Marketing & sales
   - Legal & compliance
   - Office/equipment
   - Hosting/infrastructure
   - Ongoing maintenance post-launch
### Comparison: AI-Assisted Development
Estimated time savings with Claude Code: [X]%
Effective hourly rate with AI assistance: ~$[X]/hour equivalent productivity
```
Present professionally, suitable for stakeholders. Include confidence level (low/medium/high) and key assumptions. Highlight highest-complexity areas that drive cost.


@ -0,0 +1,81 @@
# QuickBooks Export Formats
## File Detection Heuristics
The analyzer identifies report type by scanning for signature patterns: the first 10 lines of a CSV, or the sheet names and first 5 rows of an XLSX workbook.
| Report Type | Detection Pattern |
|---|---|
| P&L Summary | "Profit and Loss" in header, two-column (account, total) |
| P&L by Customer | "Profit and Loss by Customer" in header, multi-column with customer names |
| P&L Detail | "Profit and Loss Detail" in sheet name or header; columns: Date, Transaction Type, Num, Name, Class, Memo, Split, Amount, Balance |
| Balance Sheet | "Balance Sheet" in header |
| Cash Flow Statement | "Statement of Cash Flows" in sheet name or header; monthly columns |
| General Ledger | "General Ledger" in header; columns include Date, Account, Debit, Credit |
| Expenses by Vendor | "Expenses by Vendor" in header |
| Transaction List by Vendor | "Transaction List by Vendor" in header |
| Bill Payments | "Bill Payment" in header or transaction types dominated by "Bill Payment" |
| Account List | "Account List" in header; columns include Account, Type, Balance |
## P&L Summary CSV Format
```
Profit and Loss,
Your Company Name,
"January, 2025-December, 2025",
<blank line>
Distribution account,Total
Income,
40000 Revenue,
40010 Consulting Revenue,"250,000.00"
...
Total for 40000 Revenue,"$2,000,000.00"
Total for Income,"$2,000,000.00"
Cost of Goods Sold,
...
Total for Cost of Goods Sold,"$900,000.00"
Gross Profit,"$1,100,000.00"
Expenses,
...
Total for Expenses,"$850,000.00"
Net Operating Income,"$250,000.00"
Other Income,
...
Net Income,"$200,000.00"
```
Key parsing rules:
- Dollar values may have `$` prefix, commas, quotes, negative in parens or with `-`
- "Total for X" lines aggregate their parent category
- Indentation (spaces) indicates hierarchy depth
- Account numbers (5 digits) prefix account names
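The parsing rules above can be illustrated with the standard-library `csv` module, which handles quoted values containing commas. This is a sketch, not the analyzer's own parser: `to_float` and `read_totals` are hypothetical names, and the sample is adapted from the format above, with illustrative indentation and a made-up negative Net Income to show the parenthesized form.

```python
import csv
import io

SAMPLE = '''Profit and Loss,
Your Company Name,
"January, 2025-December, 2025",

Distribution account,Total
Income,
   40000 Revenue,
      40010 Consulting Revenue,"250,000.00"
   Total for 40000 Revenue,"$2,000,000.00"
Total for Income,"$2,000,000.00"
Gross Profit,"$1,100,000.00"
Net Income,"($200,000.00)"
'''

SUMMARY_LABELS = ("Gross Profit", "Net Operating Income", "Net Income")


def to_float(text: str) -> float:
    """Strip $, commas, and parentheses; parentheses mean a negative value."""
    s = text.strip()
    negative = s.startswith("(") and s.endswith(")")
    s = s.strip("()").replace("$", "").replace(",", "")
    try:
        value = float(s)
    except ValueError:
        return 0.0
    return -value if negative else value


def read_totals(csv_text: str) -> dict[str, float]:
    """Collect "Total for X" rows and the summary lines, keyed by label."""
    totals = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[1].strip():
            continue  # header, section label, or blank line
        label = row[0].strip()  # indentation only encodes hierarchy depth
        if label.startswith("Total for ") or label in SUMMARY_LABELS:
            totals[label] = to_float(row[1])
    return totals


totals = read_totals(SAMPLE)
for label, value in totals.items():
    print(f"{label}: {value:,.2f}")
```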
## P&L by Customer CSV Format
Same structure as P&L Summary but with customer names as column headers. The last column is "Total". Revenue and COGS broken down per customer.
## P&L Detail XLSX Format
Columns: (blank), Date, Transaction Type, Num, Name, Class, Memo/Description, Split, Amount, Balance
- Hierarchical account names in column A
- Transaction rows have Date populated
- Subtotal rows show "Total for [Account]"
## Cash Flow Statement XLSX Format
Monthly columns (e.g., "Feb 12-28, 2025", "Mar 2025", ..., "Total")
Sections: OPERATING ACTIVITIES, INVESTING ACTIVITIES, FINANCING ACTIVITIES
Values stored as Excel formulas (e.g., `=-236705.50`) — use `data_only=False` and parse formula strings, or open with `data_only=True`.
Note: openpyxl with `data_only=True` may return None for formula cells unless the file was last saved by Excel. Parse formula strings as fallback: strip `=` prefix, evaluate simple numeric expressions.
## Common Parsing Pitfalls
1. **Comma-separated thousands in quotes**: `"1,250,000.00"` — strip commas before float conversion
2. **Dollar signs**: `"$2,000,000.00"` — strip `$`
3. **Negative values**: Both `-50,000.00` and `($50,000.00)` formats
4. **Empty rows**: QB exports include blank lines between sections
5. **Header rows**: First 3-5 rows are company name, report name, date range
6. **"(deleted)" in names**: Customer/vendor names may include "(deleted)"
7. **Formula cells in XLSX**: May need formula string parsing as fallback


@ -0,0 +1,29 @@
# Hourly Productivity Rates by Code Type
## Lines Per Hour (Senior Developer, 5+ years)
| Code Type | Lines/Hour | Notes |
|-----------|-----------|-------|
| Simple CRUD/UI | 30-50 | Forms, basic views, standard patterns |
| Complex business logic | 20-30 | State machines, workflows, validations |
| GPU/Metal/shader programming | 10-20 | Metal, CUDA, OpenGL, compute shaders |
| Native C/C++ interop | 10-20 | FFI, bridging headers, memory management |
| Video/audio processing | 10-15 | AVFoundation, CoreMedia, codecs |
| System extensions/plugins | 8-12 | Kernel extensions, DAL plugins, drivers |
| Comprehensive tests | 25-40 | Unit, integration, snapshot tests |
| Infrastructure/DevOps | 15-25 | CI/CD, Docker, Terraform, scripts |
| API integrations | 20-30 | REST, GraphQL, WebSocket clients |
| Data processing/ETL | 15-25 | Parsers, transformers, pipelines |
## Overhead Multipliers
| Activity | % of Coding Time |
|----------|-----------------|
| Architecture & design | +15-20% |
| Debugging & troubleshooting | +25-30% |
| Code review & refactoring | +10-15% |
| Documentation | +10-15% |
| Integration & testing | +20-25% |
| Learning curve (new frameworks) | +10-20% |
Total hours including overhead are typically 1.9x-2.25x raw coding hours (the overhead rows sum to +90% at the low end and +125% at the high end).
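Putting the two tables together, a rough hours estimate could be computed as below. This is a sketch under assumptions: `estimate_hours` is a hypothetical name, only three code types are included, the lines-per-hour values are midpoints of the ranges above, and the lines-of-code counts are made up.

```python
# Midpoints of the Lines/Hour ranges above (illustrative subset).
LINES_PER_HOUR = {
    "crud_ui": 40,          # 30-50
    "business_logic": 25,   # 20-30
    "tests": 32.5,          # 25-40
}

# (low %, high %) overhead per activity, from the Overhead Multipliers table.
OVERHEAD_PCT = {
    "architecture": (15, 20),
    "debugging": (25, 30),
    "review": (10, 15),
    "documentation": (10, 15),
    "integration": (20, 25),
    "learning": (10, 20),
}


def estimate_hours(loc_by_type: dict[str, int]) -> tuple[float, float, float]:
    """Return (base_hours, low_total, high_total) for the given line counts."""
    base = sum(loc / LINES_PER_HOUR[kind] for kind, loc in loc_by_type.items())
    low_pct = sum(low for low, _ in OVERHEAD_PCT.values())     # 90
    high_pct = sum(high for _, high in OVERHEAD_PCT.values())  # 125
    return base, base * (100 + low_pct) / 100, base * (100 + high_pct) / 100


# Made-up codebase: 8,000 lines of CRUD/UI, 5,000 of business logic, 3,250 of tests.
base, low, high = estimate_hours(
    {"crud_ui": 8000, "business_logic": 5000, "tests": 3250}
)
print(f"base {base:.0f} h, with overhead {low:.0f}-{high:.0f} h")
```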


@ -0,0 +1,35 @@
# Full Team Cost Calculation
## Supporting Role Ratios
| Role | Ratio to Eng Hours | Typical Rate | Notes |
|------|-------------------|--------------|-------|
| Product Management | 0.25-0.40x | $125-200/hr | PRDs, roadmap, stakeholder mgmt |
| UX/UI Design | 0.20-0.35x | $100-175/hr | Wireframes, mockups, design systems |
| Engineering Management | 0.12-0.20x | $150-225/hr | 1:1s, hiring, performance, strategy |
| QA/Testing | 0.15-0.25x | $75-125/hr | Test plans, manual testing, automation |
| Project/Program Management | 0.08-0.15x | $100-150/hr | Schedules, dependencies, status |
| Technical Writing | 0.05-0.10x | $75-125/hr | User docs, API docs, internal docs |
| DevOps/Platform | 0.10-0.20x | $125-200/hr | CI/CD, infra, deployments |
## Team Composition by Company Stage
| Stage | PM | Design | EM | QA | PgM | Docs | DevOps |
|-------|-----|--------|-----|-----|------|------|--------|
| Solo/Founder | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
| Lean Startup | 15% | 15% | 5% | 5% | 0% | 0% | 5% |
| Growth Company | 30% | 25% | 15% | 20% | 10% | 5% | 15% |
| Enterprise | 40% | 35% | 20% | 25% | 15% | 10% | 20% |
## Full Team Multiplier
| Stage | Multiplier |
|-------|-----------|
| Solo/Founder | 1.0x (just engineering) |
| Lean Startup | ~1.45x engineering cost |
| Growth Company | ~2.2x engineering cost |
| Enterprise | ~2.65x engineering cost |
Formula: `Full Team Cost = Engineering Cost × Team Multiplier`
Example: $500K engineering cost at Growth Company = $500K × 2.2 = $1.1M total team cost
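The multipliers above match 1 plus the sum of each stage's percentages in the Team Composition table (for example, Growth Company: 30 + 25 + 15 + 20 + 10 + 5 + 15 = 120%, giving 2.2x). The sketch below reproduces that arithmetic. Reading the multiplier this way is an inference from the tables, and it treats every supporting-role hour as costing the same as an engineering hour. The function names are hypothetical.

```python
# Supporting-role percentages copied from the Team Composition table.
# Column order: PM, Design, EM, QA, PgM, Docs, DevOps
TEAM_COMPOSITION_PCT = {
    "Solo/Founder": [0, 0, 0, 0, 0, 0, 0],
    "Lean Startup": [15, 15, 5, 5, 0, 0, 5],
    "Growth Company": [30, 25, 15, 20, 10, 5, 15],
    "Enterprise": [40, 35, 20, 25, 15, 10, 20],
}


def team_multiplier(stage: str) -> float:
    """1.0 for engineering plus the summed supporting-role percentages."""
    return (100 + sum(TEAM_COMPOSITION_PCT[stage])) / 100


def full_team_cost(engineering_cost: float, stage: str) -> float:
    """Full Team Cost = Engineering Cost x Team Multiplier."""
    return engineering_cost * (100 + sum(TEAM_COMPOSITION_PCT[stage])) / 100


for stage in TEAM_COMPOSITION_PCT:
    print(f"{stage}: {team_multiplier(stage):.2f}x "
          f"-> ${full_team_cost(500_000, stage):,.0f}")
```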


@ -0,0 +1,3 @@
# CFO Briefing Analyzer
pandas>=2.0.0
openpyxl>=3.1.0


@ -0,0 +1,821 @@
#!/usr/bin/env python3
"""
CFO Briefing Analyzer: parse QuickBooks exports and generate executive financial summaries.

Outputs formatted briefing to stdout. Stores history for MoM comparison.

Usage:
    python3 cfo-analyzer.py --input ./data/uploads/
    python3 cfo-analyzer.py --input ./data/uploads/ --period 2025-01
    python3 cfo-analyzer.py --input ./data/uploads/ --no-history
"""
import argparse
import json
import os
import re
import sys
from datetime import datetime
from pathlib import Path
from typing import Any
import pandas as pd
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def parse_dollar(val: Any) -> float:
    """Convert QB dollar string to float. Handles $, commas, parens for negatives, formula strings."""
    if val is None:
        return 0.0
    if isinstance(val, (int, float)):
        return float(val)
    s = str(val).strip()
    if not s or s == '-' or s.lower() == 'none':
        return 0.0
    # Handle Excel formula strings like =-236705.50
    if s.startswith('='):
        s = s[1:]
        try:
            return float(s)
        except ValueError:
            return 0.0
    neg = False
    if s.startswith('(') and s.endswith(')'):
        neg = True
        s = s[1:-1]
    s = s.replace('$', '').replace(',', '').replace('"', '').strip()
    if not s:
        return 0.0
    try:
        v = float(s)
        return -v if neg else v
    except ValueError:
        return 0.0


def status_emoji(value: float, green_range: tuple, yellow_range: tuple) -> str:
    """Return 🟢/🟡/🔴 based on thresholds. green_range/yellow_range are (min, max) inclusive."""
    gmin, gmax = green_range
    ymin, ymax = yellow_range
    if gmin <= value <= gmax:
        return "🟢"
    if ymin <= value <= ymax:
        return "🟡"
    return "🔴"


def pct(part: float, whole: float) -> float:
    return (part / whole * 100) if whole else 0.0


def fmt_k(val: float) -> str:
    """Format as $XXK or $X.XM."""
    if abs(val) >= 1_000_000:
        return f"${val/1_000_000:.2f}M"
    return f"${val/1_000:.0f}K"


def fmt_pct(val: float) -> str:
    return f"{val:.1f}%"
# ---------------------------------------------------------------------------
# File detection & parsing
# ---------------------------------------------------------------------------
def detect_file_type(filepath: Path) -> str | None:
    """Detect QB report type from file content."""
    ext = filepath.suffix.lower()

    if ext in ('.xlsx', '.xls'):
        # Note: openpyxl cannot open legacy .xls files; they raise below and
        # fall through to None.
        try:
            import openpyxl
            wb = openpyxl.load_workbook(str(filepath), data_only=False)
            # First pass: sheet names
            for name in wb.sheetnames:
                nl = name.lower()
                if 'cash flow' in nl or 'statement of cash' in nl:
                    return 'cash_flow'
                if 'profit and loss detail' in nl or 'p&l detail' in nl:
                    return 'pl_detail'
                if 'profit and loss' in nl or 'p&l' in nl:
                    return 'pl_summary'
                if 'balance sheet' in nl:
                    return 'balance_sheet'
                if 'general ledger' in nl:
                    return 'general_ledger'
            # Second pass: first 5 rows of the active sheet
            ws = wb.active
            for row in ws.iter_rows(max_row=5, values_only=True):
                text = ' '.join(str(c).lower() for c in row if c)
                if 'statement of cash flow' in text:
                    return 'cash_flow'
                if 'profit and loss detail' in text:
                    return 'pl_detail'
                if 'profit and loss' in text:
                    return 'pl_summary'
                if 'balance sheet' in text:
                    return 'balance_sheet'
        except Exception:
            pass
        return None

    if ext != '.csv':
        return None

    try:
        with open(filepath, 'r', encoding='utf-8', errors='replace') as f:
            head = ''.join(f.readline() for _ in range(10)).lower()
    except Exception:
        return None

    # Check the more specific P&L variant before the generic "profit and loss"
    if 'profit and loss by customer' in head or 'profit and loss by job' in head:
        return 'pl_by_customer'
    if 'profit and loss' in head:
        return 'pl_summary'
    if 'balance sheet' in head:
        return 'balance_sheet'
    if 'general ledger' in head:
        return 'general_ledger'
    if 'expenses by vendor' in head:
        return 'expenses_by_vendor'
    if 'transaction list by vendor' in head:
        return 'transactions_by_vendor'
    if 'bill payment' in head:
        return 'bill_payments'
    if 'account list' in head:
        return 'account_list'
    return None


def detect_period(filepath: Path) -> str | None:
    """Try to extract period from file header."""
    ext = filepath.suffix.lower()
    text = ''
    if ext == '.csv':
        try:
            with open(filepath, 'r', encoding='utf-8', errors='replace') as f:
                text = ''.join(f.readline() for _ in range(5))
        except Exception:
            pass
    elif ext in ('.xlsx', '.xls'):
        try:
            import openpyxl
            wb = openpyxl.load_workbook(str(filepath), data_only=False)
            ws = wb.active
            for row in ws.iter_rows(max_row=5, values_only=True):
                text += ' '.join(str(c) for c in row if c) + '\n'
        except Exception:
            pass

    # Look for patterns like "February, 2025-January, 2026" or "March 2025"
    m = re.search(r'(\w+),?\s*(\d{4})\s*[-]\s*(\w+),?\s*(\d{4})', text)
    if m:
        # For a date range, the period is the month the range ends in
        end_month, end_year = m.group(3), m.group(4)
        try:
            dt = datetime.strptime(f"{end_month} {end_year}", "%B %Y")
            return dt.strftime("%Y-%m")
        except ValueError:
            pass

    m = re.search(r'(\w+ \d{4})', text)
    if m:
        try:
            dt = datetime.strptime(m.group(1), "%B %Y")
            return dt.strftime("%Y-%m")
        except ValueError:
            pass
    return None
# ---------------------------------------------------------------------------
# P&L Summary parser
# ---------------------------------------------------------------------------
def parse_pl_summary(filepath: Path) -> dict:
"""Parse P&L Summary CSV into structured data."""
results: dict[str, Any] = {
'revenue': {},
'cogs': {},
'expenses': {},
'other_income': {},
'other_expenses': {},
'totals': {}
}
with open(filepath, 'r', encoding='utf-8', errors='replace') as f:
lines = f.readlines()
section = None
for line in lines:
line = line.strip()
if not line:
continue
# Split on every comma that's not inside quotes
parts = []
in_quote = False
current = ''
for ch in line:
if ch == '"':
in_quote = not in_quote
elif ch == ',' and not in_quote:
parts.append(current.strip().strip('"'))
current = ''
continue
current += ch
parts.append(current.strip().strip('"'))
account = parts[0] if parts else ''
value = parse_dollar(parts[-1]) if len(parts) > 1 else 0.0
# Track sections
if account == 'Income':
section = 'revenue'
continue
elif account == 'Cost of Goods Sold':
section = 'cogs'
continue
elif account == 'Expenses':
section = 'expenses'
continue
elif account == 'Other Income':
section = 'other_income'
continue
elif account == 'Other Expenses':
section = 'other_expenses'
continue
# Capture totals
if account.startswith('Total for Income'):
results['totals']['total_income'] = value
section = None
elif account.startswith('Total for Cost of Goods Sold'):
results['totals']['total_cogs'] = value
elif account.startswith('Gross Profit'):
results['totals']['gross_profit'] = value
elif account.startswith('Total for Expenses'):
results['totals']['total_expenses'] = value
elif account == 'Net Operating Income':
results['totals']['net_operating_income'] = value
elif account.startswith('Total for Other Income'):
results['totals']['total_other_income'] = value
elif account.startswith('Total for Other Expenses'):
results['totals']['total_other_expenses'] = value
elif account == 'Net Income':
results['totals']['net_income'] = value
elif account.startswith('Net Other Income'):
results['totals']['net_other_income'] = value
elif section and value != 0.0 and not account.startswith('Total for'):
# Store individual line items
clean_name = re.sub(r'^\d{5}\s+', '', account).strip()
if section in results:
results[section][clean_name] = value
return results
# ---------------------------------------------------------------------------
# P&L by Customer parser
# ---------------------------------------------------------------------------
def parse_pl_by_customer(filepath: Path) -> dict:
    """Parse P&L by Customer CSV. Returns revenue by customer."""
    import csv
    customers: dict[str, float] = {}
    with open(filepath, 'r', encoding='utf-8', errors='replace') as f:
        lines = f.readlines()
    # Find header row with customer names
    header_idx = None
    headers = []
    for i, line in enumerate(lines):
        if 'Distribution account' in line:
            header_idx = i
            headers = next(csv.reader([line]))
            break
    if not headers:
        return {'customers': customers}
    # Find the "Total for Income" (or "Total for ... Revenue") row
    for line in lines[header_idx+1:]:
        if 'Total for Income' in line or ('Total for' in line and 'Revenue' in line):
            values = next(csv.reader([line]))
            for j, val in enumerate(values):
                if j > 0 and j < len(headers) and headers[j] and headers[j] != 'Total':
                    v = parse_dollar(val)
                    if v > 0:
                        name = headers[j].replace(' (deleted)', '').strip()
                        customers[name] = v
            break
    return {'customers': customers}
# ---------------------------------------------------------------------------
# Cash Flow XLSX parser
# ---------------------------------------------------------------------------
def parse_cash_flow(filepath: Path) -> dict:
"""Parse Cash Flow Statement XLSX."""
import openpyxl
wb = openpyxl.load_workbook(str(filepath), data_only=False)
ws = wb.active
rows = list(ws.iter_rows(values_only=True))
result: dict[str, Any] = {'monthly_net_income': {}, 'monthly_net_cash': {}}
# Find header row with month columns
headers = []
header_row_idx = None
for i, row in enumerate(rows):
vals = [str(c) for c in row if c]
for v in vals:
if re.search(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{4}', str(v)):
headers = list(row)
header_row_idx = i
break
if headers:
break
if not headers:
return result
for row in rows[header_row_idx+1:]:
label = str(row[0]).strip() if row[0] else ''
if 'Net Income' in label and 'reconcile' not in label.lower():
for j, val in enumerate(row[1:], 1):
if j < len(headers) and headers[j]:
month_str = str(headers[j])
v = parse_dollar(val)
result['monthly_net_income'][month_str] = v
if 'net cash' in label.lower() or 'net change in cash' in label.lower():
for j, val in enumerate(row[1:], 1):
if j < len(headers) and headers[j]:
month_str = str(headers[j])
v = parse_dollar(val)
result['monthly_net_cash'][month_str] = v
return result
# ---------------------------------------------------------------------------
# KPI computation
# ---------------------------------------------------------------------------
def compute_kpis(pl_data: dict, customer_data: dict | None, cash_flow_data: dict | None) -> dict:
"""Compute all KPIs from parsed data."""
kpis: dict[str, Any] = {}
totals = pl_data.get('totals', {})
revenue_items = pl_data.get('revenue', {})
cogs_items = pl_data.get('cogs', {})
expense_items = pl_data.get('expenses', {})
other_exp = pl_data.get('other_expenses', {})
other_inc = pl_data.get('other_income', {})
# --- Revenue ---
total_revenue = totals.get('total_income', 0)
kpis['total_revenue'] = total_revenue
# Revenue by service line
service_revenue = {}
for k, v in revenue_items.items():
if v != 0:
service_revenue[k] = v
kpis['revenue_by_service'] = dict(sorted(service_revenue.items(), key=lambda x: -x[1]))
# --- COGS & Gross Margin ---
total_cogs = totals.get('total_cogs', 0)
gross_profit = totals.get('gross_profit', total_revenue - total_cogs)
kpis['total_cogs'] = total_cogs
kpis['gross_profit'] = gross_profit
kpis['gross_margin_pct'] = pct(gross_profit, total_revenue)
# --- Net Income ---
kpis['net_income'] = totals.get('net_income', 0)
kpis['net_operating_income'] = totals.get('net_operating_income', 0)
kpis['net_margin_pct'] = pct(kpis['net_income'], total_revenue)
# --- People Costs ---
salary_total = 0
contractor_total = 0
payroll_tax_total = 0
benefits_total = 0
all_items = {}
all_items.update(cogs_items)
all_items.update(expense_items)
for k, v in all_items.items():
kl = k.lower()
if 'salari' in kl or 'wages' in kl or 'payroll -' in kl:
salary_total += v
elif 'contractor' in kl:
contractor_total += v
elif 'payroll tax' in kl:
payroll_tax_total += v
elif 'benefit' in kl:
benefits_total += v
elif 'commission' in kl:
contractor_total += v
people_total = salary_total + contractor_total + payroll_tax_total + benefits_total
kpis['people_costs'] = {
'total': people_total,
'salaries': salary_total,
'contractors': contractor_total,
'payroll_taxes': payroll_tax_total,
'benefits': benefits_total,
'pct_of_revenue': pct(people_total, total_revenue),
'contractor_pct': pct(contractor_total, people_total) if people_total else 0
}
# --- Tool/Subscription Costs ---
tool_total = 0
tool_items = {}
for k, v in all_items.items():
kl = k.lower()
if any(w in kl for w in ['subscription', 'tools', 'hosting', 'software']):
tool_total += v
tool_items[k] = v
kpis['tool_costs'] = {
'total': tool_total,
'pct_of_revenue': pct(tool_total, total_revenue),
'items': dict(sorted(tool_items.items(), key=lambda x: -x[1]))
}
# --- Expense Categories ---
total_expenses = totals.get('total_expenses', 0)
kpis['total_opex'] = total_expenses
categories = {
'Sales & Growth': 0,
'Marketing & Branding': 0,
'G&A': 0,
'Facilities': 0,
'Other Expenses': 0,
}
for k, v in expense_items.items():
kl = k.lower()
if any(w in kl for w in ['sales', 'commission', 'growth']):
categories['Sales & Growth'] += v
elif any(w in kl for w in ['marketing', 'advertising', 'writing', 'content', 'podcast', 'branding', 'networking']):
categories['Marketing & Branding'] += v
elif any(w in kl for w in ['rent', 'facilit']):
categories['Facilities'] += v
else:
categories['G&A'] += v
kpis['expense_categories'] = {k: {'amount': v, 'pct_of_revenue': pct(v, total_revenue)} for k, v in categories.items() if v}
# --- Interest & Debt ---
interest = other_exp.get('Interest Expense', 0)
amortization = other_exp.get('Amortization Expense', 0)
kpis['interest_expense'] = interest
kpis['amortization'] = amortization
kpis['interest_pct_of_revenue'] = pct(interest, total_revenue)
# --- Other Income ---
kpis['other_income'] = {k: v for k, v in other_inc.items() if v}
kpis['total_other_income'] = totals.get('total_other_income', 0)
# --- Notable Expense Items (anomaly detection) ---
notable = {}
for k, v in all_items.items():
kl = k.lower()
if v > 5000 and not any(w in kl for w in ['salari', 'wages', 'payroll', 'rent', 'amortization']):
notable[k] = v
kpis['notable_expenses'] = dict(sorted(notable.items(), key=lambda x: -x[1])[:15])
# --- Recruiting ---
recruiting = 0
for k, v in all_items.items():
if 'recruit' in k.lower():
recruiting += v
kpis['recruiting_spend'] = recruiting
# --- Owner Expenses ---
owner_total = 0
for k, v in all_items.items():
if 'owner' in k.lower() or k.startswith('OE -'):
owner_total += v
kpis['owner_expenses'] = owner_total
# --- Customer Data ---
if customer_data and customer_data.get('customers'):
custs = customer_data['customers']
sorted_custs = sorted(custs.items(), key=lambda x: -x[1])
kpis['top_customers'] = sorted_custs[:10]
kpis['customer_count'] = len([c for c in custs.values() if c > 0])
if sorted_custs and total_revenue:
top_pct = sorted_custs[0][1] / total_revenue * 100
kpis['top_customer_concentration'] = (sorted_custs[0][0], top_pct)
# --- Cash Flow Data ---
if cash_flow_data:
monthly_ni = cash_flow_data.get('monthly_net_income', {})
if monthly_ni:
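# Exclude any "Total" column so it does not skew the trailing average or burn
vals = [v for k, v in monthly_ni.items() if 'total' not in k.lower()]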
kpis['monthly_net_income'] = monthly_ni
if len(vals) >= 3:
kpis['last_3mo_avg_ni'] = sum(vals[-3:]) / 3
kpis['monthly_burn'] = abs(min(vals)) if any(v < 0 for v in vals) else 0
return kpis
# ---------------------------------------------------------------------------
# MoM comparison
# ---------------------------------------------------------------------------
def load_prior_period(history_dir: Path, current_period: str) -> dict | None:
    """Load the most recent period before current_period from history."""
    if not history_dir.exists():
        return None
    # History files are named YYYY-MM.json, so a string sort is chronological.
    files = sorted(history_dir.glob('*.json'), reverse=True)
    for f in files:
        if f.stem < current_period:
            try:
                with open(f) as fh:
                    return json.load(fh)
            except Exception:
                continue
    return None
def compute_variance(current: float, prior: float) -> str:
    """Format MoM variance as a direction arrow plus absolute percent change."""
    if prior == 0:
        return "N/A"
    change = ((current - prior) / abs(prior)) * 100
    arrow = "↑" if change > 0 else "↓" if change < 0 else "→"
    return f"{arrow} {abs(change):.1f}%"
# ---------------------------------------------------------------------------
# Output formatting
# ---------------------------------------------------------------------------
def format_briefing(kpis: dict, prior: dict | None, period: str) -> str:
"""Format KPIs into executive briefing with status indicators."""
lines = []
lines.append(f"*📊 CFO Briefing — {period}*")
lines.append("=" * 40)
rev = kpis['total_revenue']
# --- Revenue ---
lines.append("")
lines.append("*💰 Revenue*")
lines.append(f"• Total Revenue: *{fmt_k(rev)}*" + (f" ({compute_variance(rev, prior.get('total_revenue', 0))} MoM)" if prior else ""))
if kpis.get('revenue_by_service'):
top_services = list(kpis['revenue_by_service'].items())[:5]
for name, val in top_services:
lines.append(f" · {name}: {fmt_k(val)} ({fmt_pct(pct(val, rev))})")
# --- Profitability ---
gm = kpis['gross_margin_pct']
nm = kpis['net_margin_pct']
gm_status = status_emoji(gm, (60, 999), (45, 60))
nm_status = status_emoji(nm, (10, 999), (0, 10))
lines.append("")
lines.append("*📈 Profitability*")
lines.append(f"{gm_status} Gross Margin: *{fmt_pct(gm)}* (Gross Profit: {fmt_k(kpis['gross_profit'])})")
lines.append(f"{nm_status} Net Income: *{fmt_k(kpis['net_income'])}* ({fmt_pct(nm)} margin)")
lines.append(f"• Net Operating Income: {fmt_k(kpis['net_operating_income'])}")
if prior:
lines.append(f" MoM: Revenue {compute_variance(rev, prior.get('total_revenue', 0))}, "
f"Net Income {compute_variance(kpis['net_income'], prior.get('net_income', 0))}")
# --- People Costs ---
pc = kpis['people_costs']
pc_status = status_emoji(pc['pct_of_revenue'], (0, 65), (65, 75))
contr_status = status_emoji(pc['contractor_pct'], (0, 30), (30, 50))
lines.append("")
lines.append("*👥 People Costs*")
lines.append(f"{pc_status} Total: *{fmt_k(pc['total'])}* ({fmt_pct(pc['pct_of_revenue'])} of revenue)")
lines.append(f" · Salaries: {fmt_k(pc['salaries'])}")
lines.append(f" · {contr_status} Contractors: {fmt_k(pc['contractors'])} ({fmt_pct(pc['contractor_pct'])} of people costs)")
lines.append(f" · Payroll Taxes: {fmt_k(pc['payroll_taxes'])}")
lines.append(f" · Benefits: {fmt_k(pc['benefits'])}")
# --- Tools & Subscriptions ---
tc = kpis['tool_costs']
tc_status = status_emoji(tc['pct_of_revenue'], (0, 8), (8, 12))
lines.append("")
lines.append("*🔧 Tools & Subscriptions*")
lines.append(f"{tc_status} Total: *{fmt_k(tc['total'])}* ({fmt_pct(tc['pct_of_revenue'])} of revenue)")
if tc['items']:
for name, val in list(tc['items'].items())[:5]:
lines.append(f" · {name}: {fmt_k(val)}")
# --- Expense Categories ---
if kpis.get('expense_categories'):
lines.append("")
lines.append("*📋 Operating Expenses*")
lines.append(f"• Total OpEx: *{fmt_k(kpis['total_opex'])}* ({fmt_pct(pct(kpis['total_opex'], rev))} of revenue)")
for cat, data in kpis['expense_categories'].items():
lines.append(f" · {cat}: {fmt_k(data['amount'])} ({fmt_pct(data['pct_of_revenue'])})")
# --- Interest & Debt ---
int_status = status_emoji(kpis['interest_pct_of_revenue'], (0, 3), (3, 5))
lines.append("")
lines.append("*🏦 Debt & Interest*")
lines.append(f"{int_status} Interest Expense: *{fmt_k(kpis['interest_expense'])}* ({fmt_pct(kpis['interest_pct_of_revenue'])} of revenue)")
lines.append(f"• Amortization (non-cash): {fmt_k(kpis['amortization'])}")
# --- Notable Items ---
if kpis.get('recruiting_spend'):
lines.append(f"• Recruiting: {fmt_k(kpis['recruiting_spend'])}")
if kpis.get('owner_expenses'):
lines.append(f"• Owner Expenses: {fmt_k(kpis['owner_expenses'])}")
# --- Customer Concentration ---
if kpis.get('top_customers'):
lines.append("")
lines.append("*🏢 Top Customers*")
for name, val in kpis['top_customers'][:5]:
conc = pct(val, rev)
conc_flag = " ⚠️" if conc > 15 else ""
lines.append(f" · {name}: {fmt_k(val)} ({fmt_pct(conc)}){conc_flag}")
if kpis.get('top_customer_concentration'):
cname, cpct = kpis['top_customer_concentration']
conc_status = status_emoji(100 - cpct, (75, 100), (60, 75))
lines.append(f"{conc_status} Top client concentration: {cname} at {fmt_pct(cpct)}")
if kpis.get('customer_count'):
lines.append(f"• Active customers: {kpis['customer_count']}")
# --- Other Income ---
if kpis.get('other_income'):
lines.append("")
lines.append("*📥 Other Income*")
for k, v in kpis['other_income'].items():
lines.append(f" · {k}: {fmt_k(v)}")
# --- Monthly Trends ---
if kpis.get('monthly_net_income'):
lines.append("")
lines.append("*📉 Monthly Net Income Trend*")
for month, val in kpis['monthly_net_income'].items():
ml = month.lower()
if 'total' in ml:
continue
if val == 0.0:
continue
indicator = "↑" if val > 0 else "↓"
lines.append(f" · {month}: {fmt_k(val)} {indicator}")
# --- Alerts ---
alerts = []
if kpis['gross_margin_pct'] < 45:
alerts.append("🔴 Gross margin critically low (<45%). Target: 60%+")
if pc['pct_of_revenue'] > 75:
alerts.append("🔴 People costs >75% of revenue. Target: 55-65%")
if pc['contractor_pct'] > 50:
alerts.append("🟡 Contractor spend >50% of people costs — dependency risk")
if kpis['interest_pct_of_revenue'] > 5:
alerts.append("🔴 Interest >5% of revenue — debt load is heavy")
if kpis.get('recruiting_spend', 0) > 80000:
alerts.append("🟡 Recruiting spend elevated — review if active hires justify")
if kpis.get('owner_expenses', 0) > 100000:
alerts.append("🟡 Owner expenses >$100K TTM")
if kpis['net_income'] < 0:
deficit = abs(kpis['net_income'])
alerts.append(f"🔴 Operating at a loss: {fmt_k(deficit)} deficit")
if alerts:
lines.append("")
lines.append("*⚠️ Alerts*")
for a in alerts:
lines.append(f"{a}")
lines.append("")
lines.append("_Data from QuickBooks exports. Review with your finance team before acting._")
return '\n'.join(lines)
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(description='CFO Briefing Analyzer')
parser.add_argument('--input', '-i', default='./data/uploads/',
help='Directory containing QB export files')
parser.add_argument('--period', '-p', help='Override period label (YYYY-MM)')
parser.add_argument('--history', default='./data/history/',
help='History directory for MoM comparison')
parser.add_argument('--no-history', action='store_true', help='Skip saving to history')
args = parser.parse_args()
input_dir = Path(args.input)
history_dir = Path(args.history)
if not input_dir.exists():
print(f"Error: Input directory not found: {input_dir}", file=sys.stderr)
sys.exit(1)
# Find and classify files
files = list(input_dir.glob('*.csv')) + list(input_dir.glob('*.xlsx')) + list(input_dir.glob('*.xls'))
if not files:
print(f"Error: No CSV/XLSX files found in {input_dir}", file=sys.stderr)
sys.exit(1)
classified: dict[str, Path] = {}
period = args.period
for f in files:
ftype = detect_file_type(f)
if ftype:
classified[ftype] = f
if not period:
p = detect_period(f)
if p:
period = p
print(f" Detected: {f.name} → {ftype}", file=sys.stderr)
else:
print(f" Skipped (unknown format): {f.name}", file=sys.stderr)
if not period:
period = datetime.now().strftime("%Y-%m")
print(f" Period: {period}", file=sys.stderr)
print(f" Files classified: {len(classified)}", file=sys.stderr)
# Parse files
pl_data = {}
customer_data = None
cash_flow_data = None
if 'pl_summary' in classified:
pl_data = parse_pl_summary(classified['pl_summary'])
elif 'pl_by_customer' not in classified:
print("Warning: No P&L file found. Output will be limited.", file=sys.stderr)
pl_data = {'revenue': {}, 'cogs': {}, 'expenses': {}, 'other_income': {}, 'other_expenses': {}, 'totals': {}}
if 'pl_by_customer' in classified:
customer_data = parse_pl_by_customer(classified['pl_by_customer'])
if not pl_data.get('totals'):
pl_data = parse_pl_summary(classified['pl_by_customer'])
if 'cash_flow' in classified:
cash_flow_data = parse_cash_flow(classified['cash_flow'])
# Compute KPIs
kpis = compute_kpis(pl_data, customer_data, cash_flow_data)
# Load prior period
prior = None
if history_dir.exists():
prior = load_prior_period(history_dir, period)
# Save current period
if not args.no_history:
history_dir.mkdir(parents=True, exist_ok=True)
history_file = history_dir / f"{period}.json"
save_data = {
'period': period,
'total_revenue': kpis['total_revenue'],
'gross_profit': kpis['gross_profit'],
'gross_margin_pct': kpis['gross_margin_pct'],
'net_income': kpis['net_income'],
'net_margin_pct': kpis['net_margin_pct'],
'total_cogs': kpis['total_cogs'],
'total_opex': kpis['total_opex'],
'people_costs_total': kpis['people_costs']['total'],
'people_costs_pct': kpis['people_costs']['pct_of_revenue'],
'tool_costs_total': kpis['tool_costs']['total'],
'tool_costs_pct': kpis['tool_costs']['pct_of_revenue'],
'interest_expense': kpis['interest_expense'],
'recruiting_spend': kpis.get('recruiting_spend', 0),
'owner_expenses': kpis.get('owner_expenses', 0),
'customer_count': kpis.get('customer_count', 0),
'timestamp': datetime.now().isoformat(),
}
with open(history_file, 'w') as f:
json.dump(save_data, f, indent=2)
print(f" Saved history: {history_file}", file=sys.stderr)
# Output briefing
briefing = format_briefing(kpis, prior, period)
print(briefing)
if __name__ == '__main__':
main()
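
The quote-aware comma split that `parse_pl_summary` performs by hand can also be expressed with the standard-library `csv` module. The following is a minimal sketch, assuming QuickBooks-style rows where the account name is the first field and the amount is the last. `split_row` and `to_amount` are hypothetical helpers for illustration; the script's own `parse_dollar` is defined elsewhere and may behave differently.

```python
import csv


def split_row(line: str) -> list[str]:
    """Split one CSV line, honoring commas inside double quotes."""
    # csv.reader accepts any iterable of lines, so a one-item list works.
    return [field.strip() for field in next(csv.reader([line]))]


def to_amount(text: str) -> float:
    """Hypothetical stand-in for parse_dollar; parentheses mean a negative amount."""
    cleaned = text.replace('$', '').replace(',', '').strip()
    negative = cleaned.startswith('(') and cleaned.endswith(')')
    cleaned = cleaned.strip('()')
    try:
        value = float(cleaned)
    except ValueError:
        # Non-numeric cells (blank, "n/a") are treated as zero in this sketch.
        return 0.0
    return -value if negative else value


# Illustrative row: the account name itself contains a comma.
row = split_row('"40100 Consulting, Retainers","$12,500.00"')
account, amount = row[0], to_amount(row[-1])
print(account, amount)
```

One difference worth noting: `csv.reader` also understands doubled quotes (`""`) inside a quoted field, which the character loop above does not handle.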


@ -0,0 +1,238 @@
#!/usr/bin/env python3
"""
Scenario Modeler: models base/bull/bear cases from financial analysis.
Takes a JSON file with financial summary data and projects 12-month scenarios.
Outputs to stdout and saves JSON for further analysis.
Usage:
python3 scenario-modeler.py --input ./data/financial-latest.json
python3 scenario-modeler.py --input ./data/financial-latest.json --output ./data/scenarios.json
"""
import argparse
import json
import os
import sys
from datetime import datetime
def load_financial_data(input_path: str) -> dict:
"""Load financial summary JSON."""
with open(input_path) as f:
return json.load(f)
def model_base_case(data: dict) -> dict:
"""Current trajectory continues. No growth, no cuts."""
monthly_rev = data["total_revenue"] / 12
monthly_cogs = data["total_cogs"] / 12
monthly_opex = data["total_opex"] / 12
monthly_other = data.get("other_expenses", 0) / 12 - data.get("other_income", 0) / 12
monthly_net = data["net_income"] / 12
projections = []
for m in range(1, 13):
projections.append({
"month": m,
"revenue": round(monthly_rev, 2),
"total_costs": round(monthly_cogs + monthly_opex + monthly_other, 2),
"net_income": round(monthly_net, 2),
"cumulative_pl": round(monthly_net * m, 2),
})
monthly_burn = abs(monthly_net) if monthly_net < 0 else 0
breakeven_monthly_cut = monthly_burn # How much to cut monthly to break even
return {
"name": "Base Case — Status Quo",
"description": "Current trajectory continues. No new clients, no lost clients, no cost changes.",
"assumptions": [
f"Revenue stays flat at ~${monthly_rev:,.0f}/mo",
f"COGS stays at ~${monthly_cogs:,.0f}/mo",
f"OpEx stays at ~${monthly_opex:,.0f}/mo",
"No new hires, no layoffs",
],
"monthly_burn": round(monthly_burn, 2),
"annual_projected_loss": round(data["net_income"], 2) if data["net_income"] < 0 else 0,
"annual_projected_profit": round(data["net_income"], 2) if data["net_income"] > 0 else 0,
"months_to_breakeven": "N/A (already profitable)" if monthly_net > 0 else "Never (at current trajectory)",
"key_levers": [
f"Cut ${breakeven_monthly_cut:,.0f}/mo from costs to break even" if monthly_burn > 0 else "Maintain current discipline",
f"Or grow revenue {round(monthly_burn / monthly_rev * 100, 1)}% while holding costs flat" if monthly_burn > 0 and monthly_rev > 0 else "Focus on margin expansion",
"Audit subscriptions and contractor spend for quick wins",
],
"projections": projections,
}
def model_bull_case(data: dict, new_product_arr: float = 500000, new_clients: int = 3, avg_client_mrr: float = 15000) -> dict:
"""Growth targets met: new product revenue + new agency clients."""
monthly_rev = data["total_revenue"] / 12
monthly_cogs = data["total_cogs"] / 12
monthly_opex = data["total_opex"] / 12
monthly_other = data.get("other_expenses", 0) / 12 - data.get("other_income", 0) / 12
product_monthly = new_product_arr / 12
new_clients_monthly = new_clients * avg_client_mrr
# Product has SaaS margins (~80%), services have ~50% margin
product_cogs = product_monthly * 0.20
services_cogs = new_clients_monthly * 0.50
projections = []
for m in range(1, 13):
# Ramp: product over 6 months, clients added quarterly (one per quarter, first in month 1)
product_ramp = min(m / 6, 1.0)
client_ramp = min((m - 1) // 3 + 1, new_clients) / new_clients if new_clients else 0.0
month_rev = monthly_rev + (product_monthly * product_ramp) + (new_clients_monthly * client_ramp)
month_costs = monthly_cogs + monthly_opex + monthly_other + (product_cogs * product_ramp) + (services_cogs * client_ramp)
month_net = month_rev - month_costs
projections.append({
"month": m,
"revenue": round(month_rev, 2),
"total_costs": round(month_costs, 2),
"net_income": round(month_net, 2),
"cumulative_pl": round(sum(p["net_income"] for p in projections) + month_net, 2),
})
breakeven_month = None
for p in projections:
if p["net_income"] > 0:
breakeven_month = p["month"]
break
return {
"name": "Bull Case — Product + Growth",
"description": f"New product hits ${new_product_arr/1000:.0f}K ARR, add {new_clients} clients at ${avg_client_mrr/1000:.0f}K/mo.",
"assumptions": [
f"Product ramps to ${product_monthly:,.0f}/mo over 6 months",
f"{new_clients} new clients at ${avg_client_mrr:,.0f}/mo each, added quarterly",
"Product has 80% gross margin (SaaS)",
"New services clients at 50% margin",
"No additional OpEx needed (existing team absorbs)",
],
"additional_annual_revenue": round((product_monthly + new_clients_monthly) * 12, 2),
"monthly_profit_at_full_ramp": round(projections[-1]["net_income"], 2) if projections[-1]["net_income"] > 0 else 0,
"months_to_breakeven": breakeven_month if breakeven_month else ">12 months",
"key_levers": [
"Product-market fit and sales execution",
f"Services pipeline — need 1 new ${avg_client_mrr/1000:.0f}K client per quarter",
"Keep OpEx flat during growth phase",
"SaaS margins dramatically improve blended margin",
],
"projections": projections,
}
def model_bear_case(data: dict, pct_revenue_lost: float = 0.30) -> dict:
"""Lose significant portion of revenue (e.g., top clients churn)."""
monthly_rev = data["total_revenue"] / 12
monthly_cogs = data["total_cogs"] / 12
monthly_opex = data["total_opex"] / 12
monthly_other = data.get("other_expenses", 0) / 12 - data.get("other_income", 0) / 12
lost_revenue = data["total_revenue"] * pct_revenue_lost
monthly_lost = lost_revenue / 12
# Save ~45% of lost revenue in COGS (team partially redeployed)
monthly_saved_cogs = monthly_lost * 0.45
projections = []
for m in range(1, 13):
month_rev = monthly_rev - monthly_lost
month_costs = (monthly_cogs - monthly_saved_cogs) + monthly_opex + monthly_other
month_net = month_rev - month_costs
projections.append({
"month": m,
"revenue": round(month_rev, 2),
"total_costs": round(month_costs, 2),
"net_income": round(month_net, 2),
"cumulative_pl": round(month_net * m, 2),
})
return {
"name": f"Bear Case — Lose {pct_revenue_lost*100:.0f}% Revenue",
"description": f"Top clients churn, losing {pct_revenue_lost*100:.0f}% of revenue.",
"assumptions": [
f"Lose ${lost_revenue:,.0f}/yr ({pct_revenue_lost*100:.0f}% of revenue)",
"COGS reduces ~45% of lost revenue (team partially redeployed)",
"OpEx stays fixed (can't cut fast enough)",
"No replacement clients in forecast period",
],
"lost_annual_revenue": round(lost_revenue, 2),
"new_annual_revenue": round(data["total_revenue"] - lost_revenue, 2),
"monthly_burn": round(abs(projections[0]["net_income"]), 2) if projections[0]["net_income"] < 0 else 0,
"annual_projected_loss": round(projections[0]["net_income"] * 12, 2),
"months_to_breakeven": "Requires major restructuring",
"key_levers": [
f"Immediate need: cut ${abs(projections[0]['net_income']):,.0f}/mo in costs",
"Reduce headcount in affected service lines",
"Accelerate sales pipeline to replace lost revenue",
"Consider consolidating service lines",
],
"required_monthly_cost_cuts": round(abs(projections[0]["net_income"]), 2) if projections[0]["net_income"] < 0 else 0,
"projections": projections,
}
def main():
parser = argparse.ArgumentParser(description='Financial Scenario Modeler')
parser.add_argument('--input', '-i', required=True,
help='Path to financial summary JSON (output from cfo-analyzer history)')
parser.add_argument('--output', '-o', default=None,
help='Output path for scenarios JSON (default: stdout only)')
parser.add_argument('--product-arr', type=float, default=500000,
help='Bull case: new product ARR target (default: 500000)')
parser.add_argument('--new-clients', type=int, default=3,
help='Bull case: number of new clients (default: 3)')
parser.add_argument('--client-mrr', type=float, default=15000,
help='Bull case: average new client MRR (default: 15000)')
parser.add_argument('--bear-loss-pct', type=float, default=0.30,
help='Bear case: fraction of revenue lost, 0-1 (default: 0.30)')
args = parser.parse_args()
print("🔮 Scenario Modeler — Building projections...", file=sys.stderr)
data = load_financial_data(args.input)
scenarios = {
"base_case": model_base_case(data),
"bull_case": model_bull_case(data, args.product_arr, args.new_clients, args.client_mrr),
"bear_case": model_bear_case(data, args.bear_loss_pct),
"generated_at": datetime.now().isoformat(),
"based_on_period": data.get("period", "Unknown"),
}
# Summary comparison
scenarios["summary"] = {
"base_monthly_burn": scenarios["base_case"]["monthly_burn"],
"bull_monthly_profit": scenarios["bull_case"].get("monthly_profit_at_full_ramp", 0),
"bear_monthly_burn": scenarios["bear_case"]["monthly_burn"],
"current_net_income": data["net_income"],
}
if args.output:
os.makedirs(os.path.dirname(args.output) or '.', exist_ok=True)
with open(args.output, "w") as f:
json.dump(scenarios, f, indent=2)
print(f"✅ Scenarios saved to {args.output}", file=sys.stderr)
# Print summary to stdout
print(f"\n{'='*60}")
for case_key in ["base_case", "bull_case", "bear_case"]:
case = scenarios[case_key]
print(f"\n📌 {case['name']}")
print(f" {case['description']}")
if case.get("monthly_burn"):
print(f" Monthly burn: ${case['monthly_burn']:,.0f}")
if case.get("monthly_profit_at_full_ramp"):
print(f" Monthly profit at ramp: ${case['monthly_profit_at_full_ramp']:,.0f}")
print(f" Breakeven: {case['months_to_breakeven']}")
print(f" Key levers:")
for lever in case.get("key_levers", []):
print(f" - {lever}")
if __name__ == "__main__":
main()
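
The bull-case assumptions (product revenue phased in linearly over six months, one new client per quarter) can be tabulated on their own. This is a minimal sketch under those stated assumptions, not the script's exact code: `product_ramp` and `client_ramp` are hypothetical helper names, and the first client is assumed to start in month 1.

```python
def product_ramp(month: int, ramp_months: int = 6) -> float:
    """Fraction of full product revenue recognized in a given month (1-based)."""
    return min(month / ramp_months, 1.0)


def client_ramp(month: int, new_clients: int) -> float:
    """Fraction of new-client revenue live in a given month, one client per quarter."""
    if new_clients <= 0:
        return 0.0
    clients_live = min((month - 1) // 3 + 1, new_clients)
    return clients_live / new_clients


# Illustrative inputs matching the script's defaults: $500K ARR, 3 clients at $15K/mo.
product_monthly = 500_000 / 12
clients_monthly = 3 * 15_000

for month in (1, 3, 4, 7, 12):
    added = product_monthly * product_ramp(month) + clients_monthly * client_ramp(month, 3)
    print(f"month {month:>2}: +${added:,.0f} incremental revenue")
```

Guarding `new_clients <= 0` avoids a division by zero when the bull case is run with `--new-clients 0`.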


@ -0,0 +1,73 @@
# Growth Engine Configuration
# Copy this file to .env and fill in your values
# ── Core Settings ──────────────────────────────────────────────────────────────
# Where experiment data is stored (default: ./data/experiments)
GROWTH_ENGINE_DATA_DIR=./data/experiments
# Comma-separated list of agent/channel names to track
GROWTH_ENGINE_AGENTS=content,email,linkedin,seo,blog
# Agents with high-volume data (need only 10 samples/variant for significance)
HIGH_VOLUME_AGENTS=content,email
# Agents with low-volume data (need 30 samples/variant for significance)
LOW_VOLUME_AGENTS=seo,linkedin,blog
# ── Statistical Thresholds ─────────────────────────────────────────────────────
# p-value threshold for declaring a winner (default: 0.05)
P_WINNER=0.05
# p-value threshold for "trending" early signal (default: 0.10)
P_TREND=0.10
# Minimum % lift required for "keep" decision (default: 15.0)
LIFT_WIN=15.0
# Number of bootstrap resamples for confidence intervals (default: 1000)
BOOTSTRAP_ITERATIONS=1000
# Maximum variants allowed in batch mode (default: 10)
BATCH_MODE_MAX_VARIANTS=10
# ── Pacing Alert: Pipeline API ─────────────────────────────────────────────────
# Your pipeline/CRM dashboard API endpoint
PIPELINE_API_URL=
# Bearer token for pipeline API authentication
PIPELINE_AUTH_TOKEN=
# ── Pacing Alert: Recruiting API ───────────────────────────────────────────────
# Your recruiting/candidate dashboard API endpoint
RECRUITING_API_URL=
# Bearer token for recruiting API authentication
RECRUITING_AUTH_TOKEN=
# ── Pacing Alert: Email Platform ───────────────────────────────────────────────
# Email sending platform API base URL (e.g., Instantly, Lemlist, Smartlead)
EMAIL_API_URL=
# Bearer token for email platform API
EMAIL_AUTH_TOKEN=
# Campaign IDs as JSON objects: {"Campaign Name": "campaign-uuid"}
OUTBOUND_CAMPAIGNS={}
RECRUITING_CAMPAIGNS={}
# ── Pacing Alert: Targets ─────────────────────────────────────────────────────
# Minimum leads to stage per day (alert if below this)
DAILY_LEAD_TARGET=10
# Weekly candidate sourcing target
WEEKLY_CANDIDATE_TARGET=400
# ── Timezone ───────────────────────────────────────────────────────────────────
# UTC offset for local time display (e.g., -7 for PDT, -8 for PST)
TZ_OFFSET=-7
# Label for display
TZ_LABEL=PDT

328
growth-engine/README.md Normal file

@ -0,0 +1,328 @@
# 🧬 Growth Engine
**What if your marketing experiments ran themselves?**
Autonomous growth experimentation for AI agents. Inspired by [Karpathy's autoresearch](https://x.com/karpathy/status/1886192184808149383) pattern applied to marketing: create experiments with hypotheses, collect data, run statistical analysis, auto-promote winners to a living playbook, and suggest what to test next.
Your AI agents stop guessing and start *knowing* what works.
---
## What You Get
- **🔬 Experiment Engine** — Create A/B or batch (up to 10 variants) experiments with hypotheses, track them, and let the math decide winners
- **📊 Bootstrap CI + Mann-Whitney U** — Real statistical rigor, not vibes. Non-parametric tests that work with small samples and non-normal distributions
- **📖 Auto-Playbook** — Winners automatically promote to a living playbook of empirically proven best practices
- **💡 Next-Experiment Suggestions** — The system knows what you haven't tested yet and suggests what to run next
- **📈 Weekly Scorecard** — Automated report across all channels: wins, trends, running experiments, discards
- **⚠️ Pacing Alerts** — Monitor campaign health, lead staging rates, and candidate pipelines against targets
---
## Quick Start
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
### 2. Set environment variables
```bash
cp .env.example .env
# Edit .env with your configuration
```
### 3. Run your first experiment
```bash
# Create an experiment
python3 experiment-engine.py create \
--agent content \
--hypothesis "Thread posts get 2x impressions vs single posts" \
--variable "format" \
--variants '["thread", "single"]' \
--metric "impressions" \
--cycle-hours 8
# Log data as it comes in
python3 experiment-engine.py log \
--agent content \
--experiment-id EXP-CONTENT-001 \
--variant "thread" \
--metrics '{"impressions": 4500, "clicks": 120, "replies": 8}'
# Score when you have enough data
python3 experiment-engine.py score \
--agent content \
--experiment-id EXP-CONTENT-001
# Check your playbook of proven winners
python3 experiment-engine.py playbook --agent content
# What should you test next?
python3 experiment-engine.py suggest --agent content
```
---
## Commands
### experiment-engine.py
The core engine. Manages the full experiment lifecycle.
| Command | Description |
|---------|-------------|
| `create` | Create a new A/B or batch experiment with a hypothesis |
| `log` | Log a data point (metrics) for a running experiment variant |
| `score` | Run statistical analysis. Auto-promotes winners to playbook |
| `list` | List experiments by agent, optionally filtered by status |
| `playbook` | Show the living playbook of empirically proven best practices |
| `suggest` | Suggest untested variables to experiment on next |
**Batch mode** — test up to 10 variants simultaneously:
```bash
python3 experiment-engine.py create \
--agent email \
--hypothesis "Which subject line style drives highest open rate?" \
--variable "subject_line_style" \
--variants '["question", "number", "how-to", "curiosity-gap", "personalized"]' \
--metric "open_rate" \
--batch-mode
```
### autogrowth-weekly-scorecard.py
Generates a weekly report across all agents/channels.
```bash
# Current week scorecard
python3 autogrowth-weekly-scorecard.py
# Previous week (--weeks 1 is the current week, so --weeks 2 goes one week back)
python3 autogrowth-weekly-scorecard.py --weeks 2
# Save to file
python3 autogrowth-weekly-scorecard.py --output reports/week-12.md
```
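
To make the `--weeks` semantics concrete, here is a minimal sketch of the Monday-to-Sunday window calculation. It follows the `week_range` helper in `autogrowth-weekly-scorecard.py`, with one assumption added for illustration: `today` is passed in explicitly so the result is deterministic.

```python
from datetime import date, timedelta


def week_range(today: date, weeks_back: int = 1) -> tuple[date, date]:
    """Return (monday, sunday) of the target week; weeks_back=1 is the current week."""
    this_monday = today - timedelta(days=today.weekday())
    start = this_monday - timedelta(weeks=weeks_back - 1)
    return start, start + timedelta(days=6)


# 27 March 2026 is a Friday.
current_week = week_range(date(2026, 3, 27))        # --weeks 1
previous_week = week_range(date(2026, 3, 27), 2)    # --weeks 2
print(current_week, previous_week)
```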
### pacing-alert.py
Monitors campaign health and pacing against targets.
```bash
# Formatted text output
python3 pacing-alert.py
# JSON output for integrations
python3 pacing-alert.py --json
```
---
## Example Output
### Experiment Scoring
```
🏆 EXP-CONTENT-003: KEEP — 'thread' +23.4% lift (p=0.0312, 95% CI [8.2, 41.7]%)
📖 Playbook updated: format → 'thread'
```
### Playbook
```
📖 CONTENT PLAYBOOK — Empirically Proven Best Practices
format: 'thread' (+23.4% on impressions, p=0.0312, 95% CI [8.2, 41.7])
Source: EXP-CONTENT-003 | Promoted: 2026-03-15
hook_style: 'contrarian' (+18.7% on clicks, p=0.0421, 95% CI [5.1, 34.2])
Source: EXP-CONTENT-007 | Promoted: 2026-03-22
```
### Weekly Scorecard
```
# AutoGrowth Weekly Scorecard — Week of Mar 17 – Mar 23, 2026
## Summary
- Total experiments active: 4
- New experiments launched: 2
- Experiments completed: 3 (2 kept, 1 discarded)
- Total data points collected: 847
## 🏆 Big Wins (keep status this week)
### EXP-EMAIL-012 (email)
- Tested: subject_line_style → variant: question
- Metric value: 0.3420 | Sample n: 156
- Lift: 31.2% | p-value: 0.008
## 📈 Trending (watch these)
- EXP-CONTENT-015 (content) — variant `data_hook` leading at 0.1250 | 42 samples so far
```
### Pacing Alert
```
⚠️ Pacing Alert — Thu Mar 27 2:15 PM PDT
🟢 📧 Outbound Pipeline:
• 12 leads staged today | 8 approved | 6 sent
• Campaigns: 🟢 3/3 sending | 450 emails/day
🟡 🔍 Recruiting Pipeline:
• 5 candidates added today | 187 this week | target: 400/week
• Campaigns: 🟢 5/5 sending | 200 emails/day
```
---
## How It Works
### The Autoresearch Loop
```
┌─────────────────────────────────────────────┐
│                                             │
│   1. HYPOTHESIZE                            │
│      "Thread posts get 2x impressions"      │
│      │                                      │
│      ▼                                      │
│   2. EXPERIMENT                             │
│      Run variants, collect data points      │
│      │                                      │
│      ▼                                      │
│   3. ANALYZE                                │
│      Bootstrap CI + Mann-Whitney U          │
│      p < 0.05 + lift ≥ 15% = winner         │
│      │                                      │
│      ▼                                      │
│   4. PROMOTE or DISCARD                     │
│      Winner → playbook (auto)               │
│      Loser → discard pile (learned)         │
│      │                                      │
│      ▼                                      │
│   5. SUGGEST NEXT                           │
│      System identifies untested variables   │
│      └──────────── loops back to 1 ─────┘   │
│                                             │
└─────────────────────────────────────────────┘
```
### Statistical Methods
- **Mann-Whitney U test** — Non-parametric. Works with small samples. No normality assumption needed.
- **Bootstrap confidence intervals** — 1,000 resamples to estimate the true lift range.
- **Dual threshold** — Both statistical significance (p < 0.05) AND practical significance (≥15% lift) required to declare a winner. No more "statistically significant but useless" results.
- **Trending detection** — Early signal detection at p < 0.10 with 15+ samples, so you know what's promising before it's conclusive.
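
The methods above can be sketched in a few lines. This is a simplified illustration, not the engine's exact code: the function name `evaluate`, the fixed seed, and the sample data are assumptions, and it presumes the baseline mean is non-zero. It uses a one-sided Mann-Whitney U test, a percentile bootstrap for the lift, and the dual threshold with the default values.

```python
import numpy as np
from scipy import stats

P_WINNER = 0.05   # significance threshold (default)
LIFT_WIN = 15.0   # minimum lift in percent (default)


def evaluate(baseline, variant, n_boot=1000, seed=0):
    """Return lift %, one-sided p-value, 95% bootstrap CI, and the dual-threshold verdict."""
    a = np.asarray(baseline, dtype=float)
    b = np.asarray(variant, dtype=float)
    lift = (b.mean() - a.mean()) / a.mean() * 100  # assumes a.mean() != 0

    # One-sided test: is the baseline stochastically smaller than the variant?
    _, p = stats.mannwhitneyu(a, b, alternative="less")

    # Percentile bootstrap: resample each arm with replacement, recompute the lift.
    rng = np.random.default_rng(seed)
    sa = rng.choice(a, size=(n_boot, len(a)), replace=True).mean(axis=1)
    sb = rng.choice(b, size=(n_boot, len(b)), replace=True).mean(axis=1)
    lo, hi = np.percentile((sb - sa) / sa * 100, [2.5, 97.5])

    winner = bool(p < P_WINNER and lift >= LIFT_WIN)
    return {"lift_pct": float(lift), "p_value": float(p),
            "ci_95": (float(lo), float(hi)), "winner": winner}


# Made-up impression counts for two variants.
baseline = [100, 110, 95, 105, 98, 102, 107, 99, 101, 104]
variant = [130, 125, 140, 128, 135, 132, 127, 138, 129, 133]
print(evaluate(baseline, variant))
```

Requiring both conditions means a tiny but consistent difference never reaches the playbook, and neither does a large but noisy one.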
### Configurable Thresholds
| Parameter | Default | What It Controls |
|-----------|---------|-----------------|
| `P_WINNER` | 0.05 | p-value threshold for declaring a winner |
| `P_TREND` | 0.10 | p-value threshold for "trending" status |
| `LIFT_WIN` | 15.0% | Minimum lift required for "keep" decision |
| `BOOTSTRAP_ITERATIONS` | 1000 | Number of bootstrap resamples for CI |
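
As a rough sketch of how these overrides resolve, the snippet below reads the four variables from a mapping and falls back to the defaults in the table. The helper name `load_thresholds` is hypothetical, and treating an empty value as "use the default" is an assumption of this example, not documented engine behavior.

```python
import os

DEFAULTS = {
    "P_WINNER": 0.05,
    "P_TREND": 0.10,
    "LIFT_WIN": 15.0,
    "BOOTSTRAP_ITERATIONS": 1000,
}


def load_thresholds(env=None):
    """Resolve thresholds from an environment-like mapping, falling back to defaults."""
    env = os.environ if env is None else env
    resolved = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name)
        # Cast to the default's type: float for p-values and lift, int for iterations.
        resolved[name] = type(default)(raw) if raw not in (None, "") else default
    return resolved


print(load_thresholds({"LIFT_WIN": "20", "BOOTSTRAP_ITERATIONS": "5000"}))
```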
---
## Configuration
All configuration is via environment variables. See `.env.example` for the full list.
### Core Settings
| Variable | Description | Default |
|----------|-------------|---------|
| `GROWTH_ENGINE_DATA_DIR` | Where experiment data is stored | `./data/experiments` |
| `GROWTH_ENGINE_AGENTS` | Comma-separated list of agent names | `content,email,linkedin,seo,blog` |
| `HIGH_VOLUME_AGENTS` | Agents with fast data (fewer samples needed) | `content,email` |
| `LOW_VOLUME_AGENTS` | Agents with slow data (more samples needed) | `seo,linkedin,blog` |
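
The volume split decides how many samples each variant needs before scoring: 10 for high-volume agents, 30 otherwise. A minimal sketch of that rule, modeled on `get_min_samples` in `experiment-engine.py` (the name `min_samples_for` and its keyword arguments are illustrative):

```python
def min_samples_for(agent, high_volume=("content", "email"), override=None):
    """Minimum samples per variant before an experiment can be scored."""
    # An explicit override (the --min-samples flag) wins only when it is above 3.
    if override is not None and override > 3:
        return override
    return 10 if agent in high_volume else 30


for name in ("email", "seo"):
    print(name, min_samples_for(name))
```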
### Pacing Alert Settings
| Variable | Description | Default |
|----------|-------------|---------|
| `PIPELINE_API_URL` | Your pipeline/CRM API endpoint | — |
| `PIPELINE_AUTH_TOKEN` | Bearer token for pipeline API | — |
| `EMAIL_API_URL` | Email platform API base URL | — |
| `EMAIL_AUTH_TOKEN` | Bearer token for email platform | — |
| `OUTBOUND_CAMPAIGNS` | JSON map of campaign name → ID | `{}` |
| `DAILY_LEAD_TARGET` | Minimum leads staged per day | `10` |
| `WEEKLY_CANDIDATE_TARGET` | Candidate sourcing weekly target | `400` |
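
The campaign map is the one setting here that is not a plain string or number. The sketch below shows one way to parse it alongside the targets. The helper names and the `"on_pace"`/`"behind"` labels are hypothetical, and the campaign name and ID are placeholder values.

```python
import json
import os


def load_pacing_config(env=None):
    """Parse the campaign map and numeric targets from an environment-like mapping."""
    env = os.environ if env is None else env
    campaigns = json.loads(env.get("OUTBOUND_CAMPAIGNS") or "{}")
    if not isinstance(campaigns, dict):
        raise ValueError("OUTBOUND_CAMPAIGNS must be a JSON object of name -> campaign ID")
    return {
        "campaigns": campaigns,
        "daily_lead_target": int(env.get("DAILY_LEAD_TARGET") or 10),
        "weekly_candidate_target": int(env.get("WEEKLY_CANDIDATE_TARGET") or 400),
    }


def lead_pacing_status(leads_today, daily_target):
    """Alert condition from .env.example: fewer leads staged than the daily target."""
    return "on_pace" if leads_today >= daily_target else "behind"


cfg = load_pacing_config({"OUTBOUND_CAMPAIGNS": '{"Q1 Outbound": "abc-123"}',
                          "DAILY_LEAD_TARGET": "25"})
print(cfg, lead_pacing_status(12, cfg["daily_lead_target"]))
```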
---
## Integrating with Your AI Agents
The growth engine is designed to be called by AI agents (Claude Code, GPT, etc.) as part of their workflow:
```python
# In your agent's post-publishing hook:
import json
import subprocess

# Placeholder values for illustration; in practice these come from your publishing pipeline.
current_experiment_id = "EXP-CONTENT-001"
variant_used = "thread"
post_impressions, post_clicks = 4500, 120

# After publishing a social post, log the experiment data
subprocess.run([
    "python3", "experiment-engine.py", "log",
    "--agent", "content",
    "--experiment-id", current_experiment_id,
    "--variant", variant_used,
    "--metrics", json.dumps({"impressions": post_impressions, "clicks": post_clicks}),
])

# Periodically score experiments
subprocess.run([
    "python3", "experiment-engine.py", "score",
    "--agent", "content",
    "--experiment-id", current_experiment_id,
])

# Before creating new content, check the playbook
result = subprocess.run(
    ["python3", "experiment-engine.py", "playbook", "--agent", "content"],
    capture_output=True, text=True,
)
# result.stdout holds the playbook text; parse the rules and apply them to new content
```
---
## Project Structure
```
growth-engine/
├── experiment-engine.py # Core experiment lifecycle engine
├── autogrowth-weekly-scorecard.py # Weekly report generator
├── pacing-alert.py # Campaign pacing monitor
├── requirements.txt # Python dependencies
├── .env.example # Environment variable template
├── SKILL.md # Claude Code skill definition
├── README.md # This file
└── data/ # Auto-created experiment data
└── experiments/
├── content/
│ ├── experiments.json # Experiment definitions + data
│ ├── playbook.json # Proven winners
│ └── active.json # Currently running experiments
├── email/
├── seo/
└── ...
```
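
Because `experiments.json` is plain JSON, you can inspect it without going through the CLI. The sketch below counts logged data points per variant. It assumes the file holds a list of experiment objects, each with an `id` and a `data_points` list whose entries carry a `variant` key, which is the shape `experiment-engine.py` writes on `create` and `log`. The helper names are illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def load_experiments(path):
    """Read an agent's experiments.json; a missing file means no experiments yet."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else []


def samples_per_variant(experiments):
    """Map experiment ID -> {variant: number of logged data points}."""
    return {
        exp["id"]: dict(Counter(dp["variant"] for dp in exp.get("data_points", [])))
        for exp in experiments
    }


# Example: samples_per_variant(load_experiments("data/experiments/content/experiments.json"))
demo = [{"id": "EXP-CONTENT-001",
         "data_points": [{"variant": "thread"}, {"variant": "single"}, {"variant": "thread"}]}]
print(samples_per_variant(demo))
```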
---
## License
MIT
---
<p align="center">
<b>Built by <a href="https://www.singlegrain.com">Single Grain</a></b><br>
We help companies grow with AI-powered marketing. This is how we do it internally.
</p>

130
growth-engine/SKILL.md Normal file

@ -0,0 +1,130 @@
# Growth Engine
Autonomous growth experimentation framework based on Karpathy's autoresearch pattern applied to marketing. Creates experiments with hypotheses, logs data points, runs statistical analysis (bootstrap CI + Mann-Whitney U), auto-promotes winners to a living playbook, and suggests next experiments. Supports batch mode (up to 10 variants simultaneously).
## Usage
Use this skill when:
- Creating or managing A/B or multivariate experiments for any marketing channel
- Logging experiment data points after content is published or campaigns run
- Scoring experiments to determine statistical winners
- Checking the playbook for proven best practices before creating new content
- Generating weekly scorecards across all channels
- Monitoring campaign pacing and health
Do NOT use for:
- One-off content creation (use the playbook output as input, but don't run the engine)
- Non-experiment analytics or reporting
- Campaign setup in external platforms (this tracks experiments, not campaign config)
## Commands
### Create an experiment
```bash
python3 experiment-engine.py create \
--agent <agent_name> \
--hypothesis "What you expect to happen" \
--variable "<variable_name>" \
--variants '["variant_a", "variant_b"]' \
--metric "<primary_metric>" \
--cycle-hours 24
```
Add `--batch-mode` for 3-10 variant tests. Add `--min-samples N` to override auto-detection.
### Log a data point
```bash
python3 experiment-engine.py log \
--agent <agent_name> \
--experiment-id <EXP-ID> \
--variant "<variant_name>" \
--metrics '{"metric_name": value}'
```
### Score an experiment
```bash
python3 experiment-engine.py score --agent <agent_name> --experiment-id <EXP-ID>
```
Statuses: `running` → `trending` → `keep` (winner) or `discard` (loser)
Winners auto-promote to the playbook. Requires p < 0.05 AND ≥15% lift.
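
A simplified sketch of how a scored variant maps to one of these four statuses, using the default thresholds. The function name `classify` and its signature are illustrative. `p_less` is the one-sided p-value (variant beats baseline) and `p_two` is the two-sided p-value. The engine itself may apply further distinctions not covered here.

```python
def classify(p_less, p_two, lift_pct, n,
             p_winner=0.05, p_trend=0.10, lift_win=15.0, trend_min_n=15):
    """Map test results for one variant to running / trending / keep / discard."""
    if p_less < p_winner and lift_pct >= lift_win:
        return "keep"        # statistically AND practically significant
    if p_two < p_winner and lift_pct < 0:
        return "discard"     # significantly worse than the baseline
    if p_less < p_trend and n >= trend_min_n:
        return "trending"    # early signal, not yet conclusive
    return "running"


print(classify(p_less=0.03, p_two=0.06, lift_pct=23.4, n=30))
print(classify(p_less=0.08, p_two=0.16, lift_pct=9.0, n=20))
```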
### List experiments
```bash
python3 experiment-engine.py list --agent <agent_name> [--status running|trending|keep|discard]
```
### Check the playbook
```bash
python3 experiment-engine.py playbook --agent <agent_name>
```
Always check the playbook before creating new content to apply proven best practices.
### Suggest next experiments
```bash
python3 experiment-engine.py suggest --agent <agent_name>
```
### Generate weekly scorecard
```bash
python3 autogrowth-weekly-scorecard.py [--weeks N] [--output file.md]
```
### Check campaign pacing
```bash
python3 pacing-alert.py [--json]
```
Exit code 0 = on pace, 1 = alerts present.
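
Since the result is carried by the exit code, a calling agent can branch on it without parsing any output. A minimal sketch, where `on_pace` is a hypothetical helper and the commented command assumes `pacing-alert.py` is in the working directory:

```python
import subprocess
import sys


def on_pace(command):
    """Run a pacing check and report whether it exited with code 0."""
    return subprocess.run(command, check=False).returncode == 0


# Real usage would be: on_pace([sys.executable, "pacing-alert.py"])
# Stand-in commands that simply exit with a chosen code:
print(on_pace([sys.executable, "-c", "raise SystemExit(0)"]))
print(on_pace([sys.executable, "-c", "raise SystemExit(1)"]))
```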
## Workflow
1. Before creating content: `playbook` → apply proven rules
2. When publishing: `log` → record which variant was used and its metrics
3. Periodically: `score` → check if experiments have reached statistical significance
4. Weekly: `autogrowth-weekly-scorecard.py` → review all channels
5. After completing experiments: `suggest` → pick the next variable to test
## Configuration
### Core Environment Variables (defaults apply if unset)
| Variable | Description |
|----------|-------------|
| `GROWTH_ENGINE_DATA_DIR` | Data directory (default: `./data/experiments`) |
| `GROWTH_ENGINE_AGENTS` | Comma-separated agent names (default: `content,email,linkedin,seo,blog`) |
### Optional Tuning
| Variable | Default | Description |
|----------|---------|-------------|
| `HIGH_VOLUME_AGENTS` | `content,email` | Agents needing only 10 samples/variant |
| `LOW_VOLUME_AGENTS` | `seo,linkedin,blog` | Agents needing 30 samples/variant |
| `P_WINNER` | `0.05` | p-value threshold for winner |
| `P_TREND` | `0.10` | p-value threshold for trending |
| `LIFT_WIN` | `15.0` | Minimum % lift for keep decision |
| `BOOTSTRAP_ITERATIONS` | `1000` | Bootstrap resamples for CI |
| `BATCH_MODE_MAX_VARIANTS` | `10` | Max variants in batch mode |
### Pacing Alert Variables
| Variable | Description |
|----------|-------------|
| `PIPELINE_API_URL` | Pipeline/CRM API endpoint |
| `PIPELINE_AUTH_TOKEN` | Bearer token for pipeline API |
| `RECRUITING_API_URL` | Recruiting API endpoint |
| `RECRUITING_AUTH_TOKEN` | Bearer token for recruiting API |
| `EMAIL_API_URL` | Email platform API base URL |
| `EMAIL_AUTH_TOKEN` | Bearer token for email platform |
| `OUTBOUND_CAMPAIGNS` | JSON: `{"name": "campaign-id"}` |
| `RECRUITING_CAMPAIGNS` | JSON: `{"name": "campaign-id"}` |
| `DAILY_LEAD_TARGET` | Leads/day target (default: 10) |
| `WEEKLY_CANDIDATE_TARGET` | Candidates/week target (default: 400) |
### Dependencies
```
pip install numpy scipy
```


@ -0,0 +1,306 @@
#!/usr/bin/env python3
"""
AutoGrowth Weekly Scorecard Generator
Reads experiment results and playbook data across all agents and generates
a weekly report showing wins, trends, running experiments, and discards.
Works with both JSON (from experiment-engine.py) and TSV data formats.
Usage:
    python3 autogrowth-weekly-scorecard.py                      # Current week
    python3 autogrowth-weekly-scorecard.py --weeks 2            # Previous week
    python3 autogrowth-weekly-scorecard.py --output report.md   # Write to file
"""
import argparse
import csv
import os
import sys
from datetime import datetime, timedelta
from pathlib import Path
from collections import defaultdict
# ── Configuration ──────────────────────────────────────────────────────────────
# Base directory for experiment data. Must match experiment-engine.py setting.
BASE_DIR = Path(os.environ.get("GROWTH_ENGINE_DATA_DIR", "./data/experiments"))
# Agent names to scan. Customize to match your agent taxonomy.
AGENTS = os.environ.get("GROWTH_ENGINE_AGENTS", "content,email,linkedin,seo,blog").split(",")
RESULTS_COLS = ["experiment_id", "variable", "variant", "metric_value", "sample_n", "status", "date", "description"]
PLAYBOOK_COLS = ["experiment_id", "agent", "channel", "rule", "lift_pct", "p_value", "date_added", "notes"]
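

def parse_tsv(filepath, expected_cols):
    """Parse a TSV file, return list of dicts. Gracefully handles missing/empty files."""
    # Note: expected_cols is not used for validation; the file's header row defines the keys.
    rows = []
    if not filepath.exists():
        return rows
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            content = f.read().strip()
        if not content:
            return rows
        reader = csv.DictReader(content.splitlines(), delimiter="\t")
        # A header starting with "#" marks a commented-out or template file: treat as empty.
        if reader.fieldnames and reader.fieldnames[0].startswith("#"):
            return rows
        for row in reader:
            rows.append(dict(row))
    except Exception:
        # Best effort: an unreadable or malformed file yields whatever was parsed so far.
        pass
    return rows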


def safe_float(val, default=0.0):
    try:
        return float(val)
    except (TypeError, ValueError):
        return default


def safe_int(val, default=0):
    try:
        return int(val)
    except (TypeError, ValueError):
        return default


def week_range(weeks_back=1):
    """Return (start_date, end_date) for the target week (Mon-Sun).

    weeks_back=1 is the current week, 2 is the previous week, and so on.
    """
    today = datetime.now().date()
    this_monday = today - timedelta(days=today.weekday())
    start = this_monday - timedelta(weeks=weeks_back - 1)
    end = start + timedelta(days=6)
    return start, end


def in_week(date_str, start, end):
    """Check if a date string falls within the week range."""
    if not date_str:
        return True  # include rows that have no date
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d", "%d-%m-%Y"):
        try:
            d = datetime.strptime(date_str.strip(), fmt).date()
            return start <= d <= end
        except ValueError:
            continue
    return True  # include if unparseable


def load_all_results(weeks_back=1):
    """Load all results TSV rows across agents, filtered by week."""
    start, end = week_range(weeks_back)
    all_rows = []
    for agent in AGENTS:
        filepath = BASE_DIR / agent.strip() / "results.tsv"
        rows = parse_tsv(filepath, RESULTS_COLS)
        for row in rows:
            row["_agent"] = agent.strip()
            date_val = row.get("date", "")
            if in_week(date_val, start, end):
                all_rows.append(row)
    return all_rows, start, end


def load_all_playbooks():
    """Load all playbook TSV rows across agents."""
    all_rows = []
    for agent in AGENTS:
        filepath = BASE_DIR / agent.strip() / "playbook.tsv"
        rows = parse_tsv(filepath, PLAYBOOK_COLS)
        for row in rows:
            row["_agent"] = agent.strip()
            all_rows.append(row)
    return all_rows


def generate_scorecard(weeks_back=1):
    results, start, end = load_all_results(weeks_back)
    playbook = load_all_playbooks()
    week_label = f"{start.strftime('%b %d')} – {end.strftime('%b %d, %Y')}"
    lines = []
    lines.append(f"# AutoGrowth Weekly Scorecard — Week of {week_label}")
    lines.append("")

    # ── Summary ──────────────────────────────────────────────────────────────
    if not results:
        active = new = completed = kept = discarded = data_points = 0
    else:
        statuses = [r.get("status", "").strip().lower() for r in results]
        running_statuses = {"running", "active", "in_progress", "in progress"}
        keep_statuses = {"keep", "winner", "kept", "significant"}
        discard_statuses = {"discard", "discarded", "loser", "no_effect", "no effect"}
        new_statuses = {"new", "launched"}
        active = sum(1 for s in statuses if s in running_statuses)
        new = sum(1 for s in statuses if s in new_statuses)
        kept = sum(1 for s in statuses if s in keep_statuses)
        discarded = sum(1 for s in statuses if s in discard_statuses)
        completed = kept + discarded
        data_points = sum(safe_int(r.get("sample_n", 0)) for r in results)
    lines.append("## Summary")
    lines.append(f"- Total experiments active: {active}")
    lines.append(f"- New experiments launched: {new}")
    lines.append(f"- Experiments completed: {completed} ({kept} kept, {discarded} discarded)")
    lines.append(f"- Total data points collected: {data_points:,}")
    lines.append("")

    # ── Big Wins ─────────────────────────────────────────────────────────────
    lines.append("## 🏆 Big Wins (keep status this week)")
    keep_statuses_set = {"keep", "winner", "kept", "significant"}
    winners = [r for r in results if r.get("status", "").strip().lower() in keep_statuses_set]
    winners.sort(key=lambda r: safe_float(r.get("metric_value", 0)), reverse=True)
    if not winners:
        lines.append("No data yet")
    else:
        for r in winners:
            exp_id = r.get("experiment_id", "?")
            agent = r.get("_agent", "?")
            variable = r.get("variable", "?")
            variant = r.get("variant", "?")
            metric = safe_float(r.get("metric_value", 0))
            n = safe_int(r.get("sample_n", 0))
            desc = r.get("description", "")
            lines.append(f"### {exp_id} ({agent})")
            lines.append(f"- **Tested:** {variable} → variant: {variant}")
            lines.append(f"- **Metric value:** {metric:.4f} | **Sample n:** {n:,}")
            if desc:
                lines.append(f"- **Description:** {desc}")
            # Attach lift and p-value from the matching playbook rule, if one exists
            pb_match = [p for p in playbook if p.get("experiment_id", "") == exp_id]
            if pb_match:
                rule = pb_match[0].get("rule", "")
                lift = pb_match[0].get("lift_pct", "")
                p_val = pb_match[0].get("p_value", "")
                lines.append(f"- **Playbook rule:** {rule}")
                if lift:
                    lines.append(f"- **Lift:** {lift}% | **p-value:** {p_val}")
    lines.append("")

    # ── Trending ──────────────────────────────────────────────────────────────
    lines.append("## 📈 Trending (watch these)")
    trending_statuses = {"trending", "watch", "promising"}
    trending = [r for r in results if r.get("status", "").strip().lower() in trending_statuses]
    trending.sort(key=lambda r: safe_float(r.get("metric_value", 0)), reverse=True)
    if not trending:
        lines.append("No data yet")
    else:
        for r in trending:
            exp_id = r.get("experiment_id", "?")
            agent = r.get("_agent", "?")
            variant = r.get("variant", "?")
            metric = safe_float(r.get("metric_value", 0))
            n = safe_int(r.get("sample_n", 0))
            lines.append(f"- **{exp_id}** ({agent}) — variant `{variant}` leading at {metric:.4f} | {n:,} samples so far")
    lines.append("")

    # ── Running ───────────────────────────────────────────────────────────────
    lines.append("## 🔬 Running (in progress)")
    running_statuses_set = {"running", "active", "in_progress", "in progress"}
    running = [r for r in results if r.get("status", "").strip().lower() in running_statuses_set]
    if not running:
        lines.append("No data yet")
    else:
        for r in running:
            exp_id = r.get("experiment_id", "?")
            agent = r.get("_agent", "?")
            variable = r.get("variable", "?")
            variant = r.get("variant", "?")
            n = safe_int(r.get("sample_n", 0))
            lines.append(f"- **{exp_id}** ({agent}): testing `{variable}` → `{variant}` — {n:,} samples")
    lines.append("")

    # ── Discarded ─────────────────────────────────────────────────────────────
    lines.append("## ❌ Discarded (didn't work)")
    discard_statuses_set = {"discard", "discarded", "loser", "no_effect", "no effect"}
    discarded_rows = [r for r in results if r.get("status", "").strip().lower() in discard_statuses_set]
    if not discarded_rows:
        lines.append("No data yet")
    else:
        for r in discarded_rows:
            exp_id = r.get("experiment_id", "?")
            agent = r.get("_agent", "?")
            # Fall back to the default text when the column is present but empty
            desc = r.get("description") or "No significant effect found"
            lines.append(f"- **{exp_id}** ({agent}): {desc}")
    lines.append("")

    # ── Cumulative Playbook ───────────────────────────────────────────────────
    lines.append("## 📊 Cumulative Playbook")
    total_rules = len(playbook)
    lines.append(f"- Total rules in playbook across all agents: {total_rules}")
    lines.append("")
    if playbook:
        sorted_pb = sorted(playbook, key=lambda p: safe_float(p.get("lift_pct", 0)), reverse=True)
        lines.append("**Top 3 biggest lifts ever found:**")
        for i, p in enumerate(sorted_pb[:3], 1):
            exp_id = p.get("experiment_id", "?")
            agent = p.get("_agent", "?")
            rule = p.get("rule", "?")
            lift = p.get("lift_pct", "?")
            lines.append(f"{i}. **{exp_id}** ({agent}) — {lift}% lift: {rule}")
    else:
        lines.append("No playbook rules yet — experiments still running.")
    lines.append("")

    # ── Next Week ─────────────────────────────────────────────────────────────
    lines.append("## 📅 Next Week")
    next_start = end + timedelta(days=1)
    next_end = next_start + timedelta(days=6)
    lines.append(f"Week of {next_start.strftime('%b %d')} – {next_end.strftime('%b %d, %Y')}")
    lines.append("")
    planned_statuses = {"planned", "next", "queued", "upcoming"}
    # Planned rows are read without the week filter, since they describe future work
    all_results_unfiltered = []
    for agent in AGENTS:
        filepath = BASE_DIR / agent.strip() / "results.tsv"
        rows = parse_tsv(filepath, RESULTS_COLS)
        for row in rows:
            row["_agent"] = agent.strip()
            all_results_unfiltered.append(row)
    planned = [r for r in all_results_unfiltered if r.get("status", "").strip().lower() in planned_statuses]
    if not planned:
        lines.append("No new experiments scheduled yet. Add rows with status=planned to results.tsv files.")
    else:
        for r in planned:
            exp_id = r.get("experiment_id", "?")
            agent = r.get("_agent", "?")
            variable = r.get("variable", "?")
            variant = r.get("variant", "?")
            lines.append(f"- **{exp_id}** ({agent}): launch `{variable}` test → `{variant}`")
    lines.append("")
    lines.append("---")
    lines.append(f"*Generated {datetime.now().strftime('%Y-%m-%d %H:%M')}*")
    return "\n".join(lines)


def main():
    parser = argparse.ArgumentParser(description="AutoGrowth Weekly Scorecard Generator")
    parser.add_argument("--weeks", type=int, default=1, help="How many weeks back to report (default: 1 = current week)")
    parser.add_argument("--output", type=str, default=None, help="Write output to file instead of stdout")
    args = parser.parse_args()
    scorecard = generate_scorecard(weeks_back=args.weeks)
    if args.output:
        out_path = Path(args.output)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(scorecard)
        print(f"Scorecard written to {out_path}", file=sys.stderr)
    else:
        print(scorecard)


if __name__ == "__main__":
    main()


@ -0,0 +1,493 @@
#!/usr/bin/env python3
"""
Experiment Engine: autonomous growth experimentation for AI agents.
Inspired by Karpathy's autoresearch pattern: create experiments with hypotheses,
log data points, run statistical analysis (bootstrap CI + Mann-Whitney U),
auto-promote winners to a living playbook, and suggest next experiments.
Supports batch mode (up to 10 variants simultaneously).
Usage:
# Create a new experiment
python3 experiment-engine.py create --agent content --hypothesis "Thread posts get 2x impressions vs single posts" \
--variable "format" --variants '["thread", "single"]' --metric "impressions" --cycle-hours 8
# Log a data point for a running experiment
python3 experiment-engine.py log --agent content --experiment-id EXP-001 --variant "thread" \
--metrics '{"impressions": 4500, "clicks": 120, "replies": 8}'
# Score an experiment (auto-promotes winner if criteria met)
python3 experiment-engine.py score --agent content --experiment-id EXP-001
# List active experiments for an agent
python3 experiment-engine.py list --agent content
# Get current best practices (promoted winners)
python3 experiment-engine.py playbook --agent content
# Suggest next experiment based on gaps
python3 experiment-engine.py suggest --agent content
"""
import argparse, json, os, sys
from datetime import datetime, timezone
from pathlib import Path
import numpy as np
from scipy import stats
# ── Configuration ──────────────────────────────────────────────────────────────
# Base directory for experiment data. Override with GROWTH_ENGINE_DATA_DIR env var.
BASE_DIR = Path(os.environ.get("GROWTH_ENGINE_DATA_DIR", "./data/experiments"))
# Define your agent/channel taxonomy. High-volume channels need fewer samples
# per variant because data arrives faster. Adjust to match your setup.
HIGH_VOLUME_AGENTS = set(os.environ.get("HIGH_VOLUME_AGENTS", "content,email").split(","))
LOW_VOLUME_AGENTS = set(os.environ.get("LOW_VOLUME_AGENTS", "seo,linkedin,blog").split(","))
# Batch mode: allow up to this many variants simultaneously (vs simple A/B)
BATCH_MODE_MAX_VARIANTS = int(os.environ.get("BATCH_MODE_MAX_VARIANTS", "10"))
# Map agent names to their marketing channels. Customize for your org.
AGENT_CHANNEL = {
    "content": "social",
    "email": "email",
    "linkedin": "linkedin",
    "seo": "seo",
    "blog": "blog",
}
# Statistical parameters
BOOTSTRAP_ITERATIONS = int(os.environ.get("BOOTSTRAP_ITERATIONS", "1000"))
P_WINNER = float(os.environ.get("P_WINNER", "0.05")) # p-value threshold for declaring a winner
P_TREND = float(os.environ.get("P_TREND", "0.10")) # p-value threshold for "trending" status
LIFT_WIN = float(os.environ.get("LIFT_WIN", "15.0")) # minimum % lift required for "keep" decision


def get_min_samples(agent: str, override: int | None = None) -> int:
    """Return minimum samples per variant before scoring.

    High-volume channels (email, social) need fewer samples (10).
    Low-volume channels (SEO, blog) need more (30) for reliable signal.
    Explicit override wins if > 3.
    """
    if override is not None and override > 3:
        return override
    return 10 if agent in HIGH_VOLUME_AGENTS else 30


def bootstrap_lift_ci(a_vals, b_vals, n_iter=BOOTSTRAP_ITERATIONS, ci=95):
    """Bootstrap confidence interval for lift = (mean(b) - mean(a)) / mean(a) * 100.

    Returns (lower_bound, upper_bound) as percentages, or (None, None) if baseline is zero.
    """
    a = np.array(a_vals, dtype=float)
    b = np.array(b_vals, dtype=float)
    lifts = []
    rng = np.random.default_rng(42)  # fixed seed keeps the interval reproducible
    for _ in range(n_iter):
        sa = rng.choice(a, size=len(a), replace=True)
        sb = rng.choice(b, size=len(b), replace=True)
        baseline_mean = sa.mean()
        if baseline_mean == 0:
            continue  # lift is undefined for a zero baseline; skip this resample
        lifts.append((sb.mean() - baseline_mean) / baseline_mean * 100)
    if not lifts:
        return None, None
    lo = float(np.percentile(lifts, (100 - ci) / 2))
    hi = float(np.percentile(lifts, 100 - (100 - ci) / 2))
    return round(lo, 1), round(hi, 1)


def get_agent_dir(agent):
    d = BASE_DIR / agent
    d.mkdir(parents=True, exist_ok=True)
    return d


def load_json(path, default=None):
    if path.exists():
        return json.loads(path.read_text())
    return default if default is not None else {}


def save_json(path, data):
    path.write_text(json.dumps(data, indent=2, default=str))


def next_id(agent):
    d = get_agent_dir(agent)
    experiments = load_json(d / "experiments.json", [])
    return f"EXP-{agent.upper()}-{len(experiments)+1:03d}"


def cmd_create(args):
    d = get_agent_dir(args.agent)
    experiments = load_json(d / "experiments.json", [])
    exp_id = next_id(args.agent)
    min_s = get_min_samples(args.agent, args.min_samples if args.min_samples != 3 else None)
    variants = json.loads(args.variants)
    batch_mode = getattr(args, "batch_mode", False)
    if batch_mode and len(variants) > BATCH_MODE_MAX_VARIANTS:
        print(f"⚠️ Batch mode capped at {BATCH_MODE_MAX_VARIANTS} variants (got {len(variants)})")
        variants = variants[:BATCH_MODE_MAX_VARIANTS]
    experiment = {
        "id": exp_id,
        "agent": args.agent,
        "channel": AGENT_CHANNEL.get(args.agent, "unknown"),
        "hypothesis": args.hypothesis,
        "variable": args.variable,
        "variants": variants,
        "primary_metric": args.metric,
        "cycle_hours": args.cycle_hours,
        "min_samples": min_s,
        "batch_mode": batch_mode,
        "max_variants": BATCH_MODE_MAX_VARIANTS if batch_mode else 2,
        "status": "running",
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data_points": [],
        "baseline_variant": variants[0],
        "result": None,
        "winner": None
    }
    experiments.append(experiment)
    save_json(d / "experiments.json", experiments)
    # Update active experiments index
    active = load_json(d / "active.json", [])
    active.append({"id": exp_id, "variable": args.variable, "variants": experiment["variants"],
                   "current_variant_idx": 0})
    save_json(d / "active.json", active)
    mode_str = f"BATCH ({len(variants)} variants)" if batch_mode else "A/B"
    print(f"✅ Created {exp_id}: {args.hypothesis}")
    print(f" Channel: {experiment['channel']} | Variable: {args.variable} | Mode: {mode_str}")
    print(f" Variants: {experiment['variants']}")
    print(f" Metric: {args.metric} | Cycle: {args.cycle_hours}h | Min samples/variant: {min_s}")
    return exp_id


def cmd_log(args):
    d = get_agent_dir(args.agent)
    experiments = load_json(d / "experiments.json", [])
    for exp in experiments:
        if exp["id"] == args.experiment_id:
            dp = {
                "variant": args.variant,
                "metrics": json.loads(args.metrics),
                "logged_at": datetime.now(timezone.utc).isoformat(),
                "notes": args.notes or ""
            }
            exp["data_points"].append(dp)
            save_json(d / "experiments.json", experiments)
            print(f"✅ Logged data point for {args.experiment_id} variant '{args.variant}': {dp['metrics']}")
            return
    print(f"❌ Experiment {args.experiment_id} not found")
    sys.exit(1)


def cmd_score(args):
    d = get_agent_dir(args.agent)
    experiments = load_json(d / "experiments.json", [])
    for exp in experiments:
        if exp["id"] == args.experiment_id and exp["status"] in ("running", "active", "trending"):
            # Group data points by variant
            variant_data = {}
            for dp in exp["data_points"]:
                v = dp["variant"]
                if v not in variant_data:
                    variant_data[v] = []
                variant_data[v].append(dp["metrics"].get(exp["primary_metric"], 0))
            baseline_v = exp["baseline_variant"]
            min_samples = exp.get("min_samples",
                                  get_min_samples(exp["agent"]) if "agent" in exp else 15)
            # Enforce per-variant sample floor
            insufficient = []
            for v, data in variant_data.items():
                if len(data) < min_samples:
                    insufficient.append((v, len(data)))
            if insufficient:
                for v, n in insufficient:
                    print(f"{exp['id']}: Variant '{v}' has {n}/{min_samples} samples. Need more data.")
                # Check for trending signal even with fewer samples (need at least 15)
                all_counts = {v: len(data) for v, data in variant_data.items()}
                min_count = min(all_counts.values()) if all_counts else 0
                if min_count >= 15 and baseline_v in variant_data:
                    baseline_vals = variant_data[baseline_v]
                    best_trend_v, best_trend_p = None, 1.0
                    for v, vals in variant_data.items():
                        if v == baseline_v or len(vals) < 15:
                            continue
                        _, p = stats.mannwhitneyu(baseline_vals, vals, alternative="less")
                        if p < P_TREND and p < best_trend_p:
                            best_trend_p = p
                            best_trend_v = v
                    if best_trend_v:
                        exp["status"] = "trending"
                        save_json(d / "experiments.json", experiments)
                        lift = (np.mean(variant_data[best_trend_v]) - np.mean(baseline_vals)) / np.mean(baseline_vals) * 100 if np.mean(baseline_vals) else 0
                        print(f"📈 {exp['id']}: TRENDING — '{best_trend_v}' p={best_trend_p:.3f}, lift={lift:.1f}% (needs more samples to confirm)")
                return
            if not variant_data:
                print(f"{exp['id']}: No data points yet.")
                return
            baseline_vals = np.array(variant_data.get(baseline_v, []), dtype=float)
            if len(baseline_vals) < min_samples:
                print(f"{exp['id']}: Baseline variant '{baseline_v}' has {len(baseline_vals)}/{min_samples} samples.")
                return
            # Evaluate all non-baseline variants
            results = []
            for v, vals in variant_data.items():
                if v == baseline_v:
                    continue
                arr = np.array(vals, dtype=float)
                baseline_mean = baseline_vals.mean()
                variant_mean = arr.mean()
                lift = ((variant_mean - baseline_mean) / baseline_mean * 100) if baseline_mean != 0 else 0
                # Mann-Whitney U test (non-parametric, no normality assumption)
                _, p_two = stats.mannwhitneyu(baseline_vals, arr, alternative="two-sided")
                _, p_less = stats.mannwhitneyu(baseline_vals, arr, alternative="less")
                ci_lo, ci_hi = bootstrap_lift_ci(baseline_vals.tolist(), arr.tolist())
                if p_less < P_WINNER and lift >= LIFT_WIN:
                    status = "keep"
                elif p_two < P_WINNER and lift < 0:
                    status = "crash" if lift <= -LIFT_WIN else "discard"
                elif p_less < P_TREND and len(vals) >= 15:
                    status = "trending"
                else:
                    status = "running"
                results.append({
                    "variant": v,
                    "mean": round(float(variant_mean), 2),
                    "lift_pct": round(lift, 1),
                    "p_value": round(float(p_less), 4),
                    "ci_95": [ci_lo, ci_hi],
                    "n": len(vals),
                    "status": status
})
baseline_mean = float(baseline_vals.mean())
overall_result = {
"baseline": baseline_v,
"baseline_mean": round(baseline_mean, 2),
"baseline_n": len(baseline_vals),
"variants": results,
"scored_at": datetime.now(timezone.utc).isoformat(),
"min_samples": min_samples,
"thresholds": {"p_winner": P_WINNER, "p_trend": P_TREND, "lift_pct_required": LIFT_WIN}
}
winners = [r for r in results if r["status"] == "keep"]
crashes = [r for r in results if r["status"] in ("crash", "discard")]
trending = [r for r in results if r["status"] == "trending"]
if winners:
best = max(winners, key=lambda r: r["lift_pct"])
exp["status"] = "keep"
exp["winner"] = best["variant"]
exp["result"] = overall_result
save_json(d / "experiments.json", experiments)
# Auto-promote to playbook
playbook = load_json(d / "playbook.json", {})
playbook[exp["variable"]] = {
"best": best["variant"],
"metric": exp["primary_metric"],
"avg": best["mean"],
"improvement": best["lift_pct"],
"p_value": best["p_value"],
"ci_95": best["ci_95"],
"experiment_id": exp["id"],
"promoted_at": datetime.now(timezone.utc).isoformat()
}
save_json(d / "playbook.json", playbook)
# Remove from active index
active = load_json(d / "active.json", [])
active = [a for a in active if a["id"] != exp["id"]]
save_json(d / "active.json", active)
print(f"🏆 {exp['id']}: KEEP — '{best['variant']}' +{best['lift_pct']}% lift "
f"(p={best['p_value']}, 95% CI [{best['ci_95'][0]}, {best['ci_95'][1]}]%)")
print(f" 📖 Playbook updated: {exp['variable']}'{best['variant']}'")
elif all(r["status"] in ("crash", "discard") for r in results) and results:
worst = min(results, key=lambda r: r["lift_pct"])
exp["status"] = "discard"
exp["result"] = overall_result
save_json(d / "experiments.json", experiments)
active = load_json(d / "active.json", [])
active = [a for a in active if a["id"] != exp["id"]]
save_json(d / "active.json", active)
print(f"💀 {exp['id']}: DISCARD — baseline wins. Worst variant: '{worst['variant']}' "
f"at {worst['lift_pct']}% (p={worst['p_value']})")
elif trending:
exp["status"] = "trending"
exp["result"] = overall_result
save_json(d / "experiments.json", experiments)
best_t = max(trending, key=lambda r: r["lift_pct"])
print(f"📈 {exp['id']}: TRENDING — '{best_t['variant']}' +{best_t['lift_pct']}% "
f"(p={best_t['p_value']}, n={best_t['n']}). Keep collecting data.")
else:
exp["status"] = "running"
exp["result"] = overall_result
save_json(d / "experiments.json", experiments)
for r in results:
print(f"{exp['id']}: '{r['variant']}' {r['lift_pct']:+.1f}% lift, p={r['p_value']} — running")
return
print(f"❌ Active experiment {args.experiment_id} not found")
def cmd_list(args):
d = get_agent_dir(args.agent)
experiments = load_json(d / "experiments.json", [])
status_filter = args.status or "all"
icons = {
"running": "🔬", "active": "🔬",
"trending": "📈",
"keep": "🏆", "promoted": "🏆",
"discard": "💀", "killed": "💀",
"crash": "🔴",
"inconclusive": "🤷"
}
for exp in experiments:
s = exp["status"]
if status_filter != "all" and s != status_filter:
aliases = {"active": "running", "promoted": "keep", "killed": "discard"}
if aliases.get(s) != status_filter and s != status_filter:
continue
dp_count = len(exp.get("data_points", []))
icon = icons.get(s, "")
ch = exp.get("channel", AGENT_CHANNEL.get(exp["agent"], "?"))
print(f"{icon} {exp['id']}: {exp['hypothesis']}")
print(f" Variable: {exp['variable']} | Channel: {ch} | Status: {s} | Data points: {dp_count}")
if exp.get("winner"):
result = exp.get("result", {})
lift = ""
if isinstance(result, dict):
for vr in result.get("variants", []):
if vr["variant"] == exp["winner"]:
lift = f" ({vr['lift_pct']:+.1f}% lift, p={vr['p_value']})"
break
print(f" Winner: {exp['winner']}{lift}")
print()
def cmd_playbook(args):
d = get_agent_dir(args.agent)
playbook = load_json(d / "playbook.json", {})
if not playbook:
print(f"📖 No playbook entries for {args.agent} yet. Run some experiments!")
return
print(f"📖 {args.agent.upper()} PLAYBOOK — Empirically Proven Best Practices\n")
for variable, entry in playbook.items():
p_str = f", p={entry['p_value']}" if "p_value" in entry else ""
ci_str = f", 95% CI {entry['ci_95']}" if "ci_95" in entry else ""
print(f" {variable}: '{entry['best']}' (+{entry['improvement']}% on {entry['metric']}{p_str}{ci_str})")
print(f" Source: {entry['experiment_id']} | Promoted: {entry['promoted_at'][:10]}")
print()
def cmd_suggest(args):
d = get_agent_dir(args.agent)
experiments = load_json(d / "experiments.json", [])
playbook = load_json(d / "playbook.json", {})
# Define testable categories per channel. Customize these for your business.
categories = {
"content": ["hook_style", "post_format", "cta_type", "post_time", "thread_length",
"emoji_usage", "data_vs_narrative", "question_vs_statement"],
"email": ["subject_line_style", "opener_type", "email_length", "personalization_depth",
"cta_style", "send_time", "follow_up_timing", "social_proof_type"],
"linkedin": ["inmail_opener", "role_framing", "company_pitch", "personalization_level",
"subject_line", "follow_up_cadence"],
"blog": ["headline_style", "content_format", "platform_priority", "visual_style",
"posting_time", "content_length"],
"seo": ["title_tag_format", "meta_description_style", "content_structure",
"internal_linking", "heading_format"]
}
tested = set(playbook.keys())
tested.update(e["variable"] for e in experiments if e["status"] in ("running", "active", "trending"))
agent_cats = categories.get(args.agent, [])
untested = [c for c in agent_cats if c not in tested]
min_s = get_min_samples(args.agent)
ch = AGENT_CHANNEL.get(args.agent, "?")
if untested:
print(f"💡 Suggested next experiments for {args.agent} ({ch}, min {min_s} samples/variant):")
for cat in untested[:3]:
print(f"{cat}")
else:
print(f"{args.agent} has tested all standard categories. Time for advanced experiments!")
def main():
parser = argparse.ArgumentParser(description="Experiment Engine — Autonomous growth experimentation")
sub = parser.add_subparsers(dest="command")
p_create = sub.add_parser("create", help="Create a new experiment")
p_create.add_argument("--agent", required=True, help="Agent/channel name (e.g., content, email, seo)")
p_create.add_argument("--hypothesis", required=True, help="What you're testing and expected outcome")
p_create.add_argument("--variable", required=True, help="The variable being tested (e.g., hook_style)")
p_create.add_argument("--variants", required=True, help="JSON array of variant names")
p_create.add_argument("--metric", required=True, help="Primary metric to optimize (e.g., impressions)")
p_create.add_argument("--cycle-hours", type=int, default=24, help="Hours per experiment cycle (default: 24)")
p_create.add_argument("--min-samples", type=int, default=None,
help="Override min samples/variant (default: auto based on channel volume)")
p_create.add_argument("--batch-mode", action="store_true",
help="Enable batch mode: up to 10 variants simultaneously")
p_log = sub.add_parser("log", help="Log a data point for a running experiment")
p_log.add_argument("--agent", required=True)
p_log.add_argument("--experiment-id", required=True)
p_log.add_argument("--variant", required=True)
p_log.add_argument("--metrics", required=True, help="JSON object of metric values")
p_log.add_argument("--notes", default="")
p_score = sub.add_parser("score", help="Score an experiment (auto-promotes winners)")
p_score.add_argument("--agent", required=True)
p_score.add_argument("--experiment-id", required=True)
p_list = sub.add_parser("list", help="List experiments for an agent")
p_list.add_argument("--agent", required=True)
p_list.add_argument("--status", default="all", help="Filter by status (running/trending/keep/discard/all)")
p_play = sub.add_parser("playbook", help="Show empirically proven best practices")
p_play.add_argument("--agent", required=True)
p_sug = sub.add_parser("suggest", help="Suggest next experiments based on gaps")
p_sug.add_argument("--agent", required=True)
args = parser.parse_args()
if not args.command:
parser.print_help()
return
{"create": cmd_create, "log": cmd_log, "score": cmd_score,
"list": cmd_list, "playbook": cmd_playbook, "suggest": cmd_suggest}[args.command](args)
if __name__ == "__main__":
main()


@ -0,0 +1,342 @@
#!/usr/bin/env python3
"""
pacing-alert.py: API-based pacing check for marketing campaigns.
Monitors campaign health across channels:
- Checks pipeline/lead staging rates against daily targets
- Monitors email campaign sending status and capacity
- Tracks candidate sourcing pacing against weekly targets
- Reports cron job health
Exit 0 = on pace (all green), Exit 1 = alert needed (output contains issues).
Configure via environment variables (see .env.example).
Usage:
python3 pacing-alert.py # Run full pacing check
python3 pacing-alert.py --json # Output as JSON instead of formatted text
"""
import argparse
import sys
import json
import subprocess
from datetime import datetime, timezone, timedelta
import urllib.request
import urllib.error
import os
# ── Configuration ──────────────────────────────────────────────────────────────
# API authentication tokens (set via environment variables)
PIPELINE_API_URL = os.environ.get("PIPELINE_API_URL", "https://your-dashboard.example.com/api/pipeline")
PIPELINE_AUTH = os.environ.get("PIPELINE_AUTH_TOKEN", "") # Bearer token for pipeline API
RECRUITING_API_URL = os.environ.get("RECRUITING_API_URL", "https://your-dashboard.example.com/api/recruiting/candidates")
RECRUITING_AUTH = os.environ.get("RECRUITING_AUTH_TOKEN", "") # Bearer token for recruiting API
# Email platform API (e.g., Instantly, Lemlist, Smartlead)
EMAIL_API_URL = os.environ.get("EMAIL_API_URL", "") # e.g., https://api.your-email-platform.com/v2/campaigns
EMAIL_AUTH = os.environ.get("EMAIL_AUTH_TOKEN", "") # Bearer token for email platform
# Campaign IDs for outbound email. Format: JSON object {"Campaign Name": "campaign-uuid"}
OUTBOUND_CAMPAIGNS = json.loads(os.environ.get("OUTBOUND_CAMPAIGNS", "{}"))
RECRUITING_CAMPAIGNS = json.loads(os.environ.get("RECRUITING_CAMPAIGNS", "{}"))
# Pacing targets
DAILY_LEAD_TARGET = int(os.environ.get("DAILY_LEAD_TARGET", "10")) # Min leads staged per day
WEEKLY_CANDIDATE_TARGET = int(os.environ.get("WEEKLY_CANDIDATE_TARGET", "400")) # Candidates per week
# Timezone offset from UTC (e.g., -7 for PDT, -8 for PST)
TZ_OFFSET = int(os.environ.get("TZ_OFFSET", "-7"))
LOCAL_TZ = timezone(timedelta(hours=TZ_OFFSET))
TZ_LABEL = os.environ.get("TZ_LABEL", "PDT")
# ── Helpers ────────────────────────────────────────────────────────────────────
def api_get(url, auth):
"""Make authenticated GET request. Returns parsed JSON or error dict."""
headers = {"Content-Type": "application/json"}
if auth:
headers["Authorization"] = auth if auth.startswith("Bearer ") else f"Bearer {auth}"
req = urllib.request.Request(url, headers=headers)
try:
with urllib.request.urlopen(req, timeout=15) as r:
return json.loads(r.read())
except urllib.error.HTTPError as e:
return {"_error": f"HTTP {e.code}"}
except Exception as e:
return {"_error": str(e)}
def now_local():
return datetime.now(LOCAL_TZ)
def today_date():
return now_local().date()
def week_start():
now = now_local()
monday = now - timedelta(days=now.weekday())
return monday.replace(hour=0, minute=0, second=0, microsecond=0)
def parse_ts(ts_str):
"""Parse ISO timestamp string to local timezone datetime."""
if not ts_str:
return None
try:
ts_str = ts_str.replace("Z", "+00:00")
return datetime.fromisoformat(ts_str).astimezone(LOCAL_TZ)
except Exception:
return None
def is_today(ts_str):
dt = parse_ts(ts_str)
return dt is not None and dt.date() == today_date()
def is_this_week(ts_str):
dt = parse_ts(ts_str)
return dt is not None and dt >= week_start()
# ── Pipeline API ───────────────────────────────────────────────────────────────
def get_pipeline_stats():
"""Fetch pipeline/lead staging stats. Returns (stats_dict, error_string)."""
if not PIPELINE_AUTH:
return None, "PIPELINE_AUTH_TOKEN not configured"
data = api_get(f"{PIPELINE_API_URL}?page=1&limit=200", PIPELINE_AUTH)
if "_error" in data:
return None, data["_error"]
prospects = data.get("prospects", [])
stats = data.get("stats", {})
today_total = 0
today_approved = 0
today_sent = 0
for p in prospects:
created = p.get("queued_at") or p.get("created_at") or ""
if is_today(created):
today_total += 1
status = (p.get("status") or "").lower()
if status == "approved":
today_approved += 1
elif status == "sent":
today_sent += 1
return {
"today_total": today_total,
"today_approved": today_approved,
"today_sent": today_sent,
"total": stats.get("total", len(prospects)),
}, None
def get_recruiting_stats():
"""Fetch candidate sourcing stats with pagination. Returns (stats_dict, error_string)."""
if not RECRUITING_AUTH:
return None, "RECRUITING_AUTH_TOKEN not configured"
data = api_get(f"{RECRUITING_API_URL}?page=1&limit=50", RECRUITING_AUTH)
if "_error" in data:
return None, data["_error"]
stats = data.get("stats", {})
pagination = data.get("pagination", {})
total_pages = pagination.get("total_pages", 1)
today_total = 0
week_total = 0
def count_page(candidates):
nonlocal today_total, week_total
for c in candidates:
created = c.get("created_at") or c.get("createdAt") or ""
if is_today(created):
today_total += 1
if is_this_week(created):
week_total += 1
count_page(data.get("candidates", []))
# Paginate (stop early when we hit records older than this week)
max_pages = min(total_pages, 7)
for page in range(2, max_pages + 1):
pdata = api_get(f"{RECRUITING_API_URL}?page={page}&limit=50", RECRUITING_AUTH)
if "_error" in pdata:
break
candidates = pdata.get("candidates", [])
if not candidates:
break
last = candidates[-1]
last_ts = parse_ts(last.get("created_at") or last.get("createdAt") or "")
count_page(candidates)
if last_ts and last_ts < week_start():
break
return {
"today_total": today_total,
"week_total": week_total,
"stats_total": stats.get("total", "?"),
"stats_in_pipeline": stats.get("in_pipeline", "?"),
"stats_approved": stats.get("approved", "?"),
"stats_meetings": stats.get("meetings", "?"),
}, None
# ── Email Campaign Status ─────────────────────────────────────────────────────
NOT_SENDING_LABELS = {0: "sending", 2: "daily limit hit", 4: "issue"}
def get_campaign_status(campaign_id, name):
"""Check single email campaign health."""
if not EMAIL_AUTH:
return {"name": name, "error": "EMAIL_AUTH_TOKEN not configured", "sending": False, "active": False, "daily_limit": 0}
data = api_get(f"{EMAIL_API_URL}/{campaign_id}", EMAIL_AUTH)
if "_error" in data:
return {"name": name, "error": data["_error"], "sending": False, "active": False, "daily_limit": 0}
status = data.get("status", -1)
ns_status = data.get("not_sending_status", 0)
daily_limit = data.get("daily_limit", 0)
return {
"name": name,
"active": status == 1,
"ns_status": ns_status,
"ns_label": NOT_SENDING_LABELS.get(ns_status, f"unknown({ns_status})"),
"daily_limit": daily_limit,
"sending": status == 1 and ns_status == 0,
}
def get_campaigns_summary(campaigns_dict):
"""Get aggregate health for a set of campaigns."""
if not campaigns_dict:
return {"results": [], "sending_count": 0, "total": 0, "capacity": 0, "any_issue": False, "all_paused": True}
results = [get_campaign_status(cid, name) for name, cid in campaigns_dict.items()]
sending_count = sum(1 for r in results if r.get("sending"))
total_capacity = sum(r.get("daily_limit", 0) for r in results if r.get("sending"))
any_issue = any(r.get("ns_status", 0) in (2, 4) for r in results)
all_paused = all(not r.get("active") for r in results)
return {
"results": results,
"sending_count": sending_count,
"total": len(results),
"capacity": total_capacity,
"any_issue": any_issue,
"all_paused": all_paused,
}
# ── Pacing Logic ───────────────────────────────────────────────────────────────
def pace_icon(issues):
if issues == 0: return "🟢"
elif issues == 1: return "🟡"
else: return "🔴"
def pipeline_pace(today_total, campaign_summary):
issues = 0
if today_total == 0: issues += 2
elif today_total < DAILY_LEAD_TARGET // 2: issues += 1
if campaign_summary["sending_count"] == 0: issues += 2
elif campaign_summary["any_issue"]: issues += 1
return pace_icon(issues)
def recruiting_pace(week_total, campaign_summary):
issues = 0
if week_total < WEEKLY_CANDIDATE_TARGET // 4: issues += 2
elif week_total < WEEKLY_CANDIDATE_TARGET // 2: issues += 1
if campaign_summary["sending_count"] == 0: issues += 2
elif campaign_summary["any_issue"]: issues += 1
return pace_icon(issues)
def campaign_line(summary):
if summary["all_paused"]:
return "🔴 all paused | 0 emails/day"
elif summary["sending_count"] == 0:
return "🔴 not sending | 0 emails/day"
elif summary["any_issue"]:
return f"🟡 {summary['sending_count']}/{summary['total']} sending | {summary['capacity']:,} emails/day"
else:
return f"🟢 {summary['sending_count']}/{summary['total']} sending | {summary['capacity']:,} emails/day"
# ── Main ───────────────────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(description="Campaign pacing alert")
parser.add_argument("--json", action="store_true", help="Output as JSON")
args = parser.parse_args()
now = now_local()
date_str = now.strftime("%a %b %-d")
time_str = now.strftime("%-I:%M %p") + " " + TZ_LABEL
alerts = []
# Fetch data
pipeline_stats, pipeline_err = get_pipeline_stats()
recruiting_stats, recruiting_err = get_recruiting_stats()
outbound_summary = get_campaigns_summary(OUTBOUND_CAMPAIGNS)
recruiting_campaign_summary = get_campaigns_summary(RECRUITING_CAMPAIGNS)
if args.json:
output = {
"timestamp": now.isoformat(),
"pipeline": pipeline_stats or {"error": pipeline_err},
"recruiting": recruiting_stats or {"error": recruiting_err},
"outbound_campaigns": outbound_summary,
"recruiting_campaigns": recruiting_campaign_summary,
}
print(json.dumps(output, indent=2, default=str))
has_alerts = pipeline_err or recruiting_err or outbound_summary["any_issue"]
sys.exit(1 if has_alerts else 0)
lines = [f"⚠️ *Pacing Alert — {date_str} {time_str}*", ""]
# ── Pipeline / Outbound ──
if pipeline_err:
p_icon = "🔴"
p_line = f"API error: {pipeline_err}"
alerts.append(f"Pipeline API error: {pipeline_err}")
else:
pt = pipeline_stats["today_total"]
pa = pipeline_stats["today_approved"]
ps = pipeline_stats["today_sent"]
p_icon = pipeline_pace(pt, outbound_summary)
p_line = f"{pt} leads staged today | {pa} approved | {ps} sent"
if pt == 0:
alerts.append("Pipeline: 0 leads staged today")
lines.append(f"{p_icon} 📧 *Outbound Pipeline:*")
lines.append(f"{p_line}")
lines.append(f"• Campaigns: {campaign_line(outbound_summary)}")
lines.append("")
# ── Recruiting / Sourcing ──
if recruiting_err:
r_icon = "🔴"
r_line = f"API error: {recruiting_err}"
alerts.append(f"Recruiting API error: {recruiting_err}")
else:
rt = recruiting_stats["today_total"]
rw = recruiting_stats["week_total"]
r_icon = recruiting_pace(rw, recruiting_campaign_summary)
r_line = f"{rt} candidates added today | {rw} this week | target: {WEEKLY_CANDIDATE_TARGET}/week"
if rw < WEEKLY_CANDIDATE_TARGET // 4:
alerts.append(f"Recruiting: only {rw} candidates this week (target {WEEKLY_CANDIDATE_TARGET})")
lines.append(f"{r_icon} 🔍 *Recruiting Pipeline:*")
lines.append(f"{r_line}")
lines.append(f"• Campaigns: {campaign_line(recruiting_campaign_summary)}")
print("\n".join(lines))
if alerts:
sys.exit(1)
else:
sys.exit(0)
if __name__ == "__main__":
main()


@ -0,0 +1,2 @@
numpy>=1.24.0
scipy>=1.10.0


@ -0,0 +1,28 @@
# Apollo People Search API
APOLLO_API_KEY=your_apollo_api_key_here
# LeadMagic Email Verification API
LEADMAGIC_API_KEY=your_leadmagic_api_key_here
# Instantly Cold Email Platform
INSTANTLY_API_KEY=your_instantly_api_key_here
# Email Sending (for cold-outbound-sender.py)
SENDER_EMAIL=you@yourdomain.com
SENDER_NAME=Your Name
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@yourdomain.com
SMTP_PASSWORD=your_app_password_here
# Competitive Monitor (optional)
# Path to a JSON file defining your competitors
# See scripts/competitive-monitor.py for the expected format
COMPETITORS_CONFIG=./competitors.json
# Cross-Signal Detector (optional)
DATA_DIR=./data/agent-outputs
OUTPUT_FILE=./data/cross-signals-latest.json
# Cross-Signal Detector: comma-separated words to exclude from company extraction
# SIGNAL_STOP_WORDS=YourCompany,InternalTool

outbound-engine/README.md Normal file

@ -0,0 +1,159 @@
# AI Outbound Engine
From ICP definition to emails in inbox — fully automated cold outbound.
This skill category handles the complete cold outbound pipeline: defining your ideal customer profile, writing expert-scored email sequences, sourcing and verifying leads, deduplicating against existing campaigns, uploading to your email platform, and monitoring the competitive landscape.
## What's Inside
### 🎯 Cold Outbound Optimizer (`SKILL.md`)
Full campaign design workflow:
- **ICP Definition** — structured template to define exactly who you're targeting
- **Infrastructure Audit** — pulls sending account inventory, warmup scores, and capacity math from Instantly
- **Expert Panel Scoring** — 10 simulated outbound experts score your copy (recursive until 90+/100)
- **Sequence Copywriting** — subject lines, body copy, follow-ups, breakup emails — all Instantly-ready
- **Capacity Planning** — accounts × daily limits = pipeline projections
- **Implementation Docs** — step-by-step launch plan
Supports both "start from scratch" and "optimize existing campaigns" modes.
### 📥 Lead Pipeline (`scripts/lead-pipeline.py`)
End-to-end lead sourcing:
1. **Apollo People Search** — pull leads matching your ICP criteria
2. **LeadMagic Verification** — validate every email before sending
3. **Deduplication** — check against existing Instantly leads + exclusion lists
4. **Upload to Instantly** — batch upload with rate limiting and retry logic
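The deduplication step can be pictured with a minimal sketch. This is not the code in `scripts/lead-pipeline.py`; it assumes leads are dicts with an `email` key and that existing Instantly leads and the exclusion list are already loaded as plain collections of addresses. `normalize_email` and `dedupe_leads` are hypothetical names.

```python
def normalize_email(email):
    """Lowercase and trim an address so trivially different spellings compare equal."""
    return (email or "").strip().lower()


def dedupe_leads(new_leads, existing_emails, excluded_emails=()):
    """Return leads whose email is not already known, excluded, or repeated.

    All names here are hypothetical; the real pipeline may key on other fields.
    """
    seen = {normalize_email(e) for e in existing_emails}
    seen.update(normalize_email(e) for e in excluded_emails)
    unique = []
    for lead in new_leads:
        key = normalize_email(lead.get("email"))
        if not key or key in seen:
            continue  # skip blanks, known leads, and duplicates within the batch
        seen.add(key)
        unique.append(lead)
    return unique
```

Normalizing before comparing means `A@x.com` and `a@x.com ` count as the same lead, which is usually what you want before paying for verification or upload.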
### 🔍 Competitive Monitor (`scripts/competitive-monitor.py`)
Track competitors automatically:
- Pricing page change detection (diff-based)
- Blog post monitoring for recent content
- Generates weekly competitive intelligence reports
- Configurable competitor list — add any company you want to track
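Diff-based change detection can be sketched with the standard library alone. The example below assumes you already have the previous and current page text as strings (fetching and HTML-to-text conversion are out of scope); `page_fingerprint` and `diff_snapshots` are hypothetical helpers, not functions from `competitive-monitor.py`.

```python
import difflib
import hashlib


def page_fingerprint(text):
    """Hash the page text so unchanged pages can be skipped cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshots(old_text, new_text):
    """Return added/removed lines between two snapshots, or [] if identical."""
    if page_fingerprint(old_text) == page_fingerprint(new_text):
        return []
    diff = list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="previous", tofile="current", lineterm="", n=0,
    ))
    # The first two lines are the ---/+++ file headers; after that keep only
    # real additions and removals and drop the @@ hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

Storing the fingerprint alongside each snapshot lets a weekly run skip the diff entirely when nothing changed.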
### 🔗 Cross-Signal Detector (`scripts/cross-signal-detector.py`)
Find overlapping signals across multiple data sources:
- Company overlap across SEO, sales, and outbound data
- Vertical alignment detection
- Keyword cluster correlation
- Confidence-scored recommendations for coordinated action
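As a rough illustration of company overlap, the sketch below counts how many sources mention each company. The input shape (`{source_name: [company, ...]}`) and the confidence formula (share of sources that mention the company) are assumptions made for this example, not the scoring used by `cross-signal-detector.py`.

```python
from collections import defaultdict


def find_company_overlap(sources, min_sources=2):
    """Map each company to the sources that mention it, keeping those in >= min_sources.

    `sources` is a hypothetical {source_name: iterable of company names} mapping.
    """
    seen_in = defaultdict(set)
    for source_name, companies in sources.items():
        for company in companies:
            seen_in[company.strip().lower()].add(source_name)
    overlaps = [
        {"company": name, "sources": sorted(srcs),
         "confidence": round(len(srcs) / len(sources), 2)}
        for name, srcs in seen_in.items() if len(srcs) >= min_sources
    ]
    # Strongest signals first, then alphabetical for stable output.
    return sorted(overlaps, key=lambda o: (-o["confidence"], o["company"]))
```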
### 📧 Cold Outbound Sender (`scripts/cold-outbound-sender.py`)
Sends approved outbound emails:
- Reads from an approved prospects JSON file
- Daily send limits (configurable)
- Full send history tracking
- Dry-run mode for testing
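The interaction between the daily cap and send history can be sketched as follows. The record shapes (`email`, `sent_on`) and the `select_sendable` helper are hypothetical; the actual script may store history differently.

```python
from datetime import date


def select_sendable(prospects, history, max_per_day, today=None):
    """Pick prospects to send today without exceeding the daily cap or re-sending.

    `history` is a hypothetical list of {"email": ..., "sent_on": "YYYY-MM-DD"} records.
    """
    today = (today or date.today()).isoformat()
    sent_today = sum(1 for h in history if h["sent_on"] == today)
    already_sent = {h["email"] for h in history}
    remaining = max(0, max_per_day - sent_today)
    fresh = [p for p in prospects if p["email"] not in already_sent]
    return fresh[:remaining]
```

In a dry run you would print the returned list instead of sending, which makes it easy to confirm the cap behaves as expected before any email goes out.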
### 🔧 Instantly Audit (`scripts/instantly-audit.py`)
Pull campaign health data from the Instantly v2 API:
- Sending account inventory and warmup scores
- Campaign performance (open rate, reply rate, positive reply rate)
- Capacity math (conservative vs aggressive projections)
- Flags: low warmup scores, underperforming campaigns, blockers
## Quick Start
### 1. Set Up Environment Variables
```bash
cp .env.example .env
# Fill in your API keys
```
### 2. Install Dependencies
```bash
pip install -r requirements.txt
```
### 3. Run the Lead Pipeline
```bash
python3 scripts/lead-pipeline.py \
--titles "VP Marketing,CMO,Head of Growth" \
--industries "SaaS,Marketing" \
--company-size "11,50" \
--locations "United States" \
--campaign-id "YOUR_CAMPAIGN_UUID" \
--volume 500 \
--dry-run
```
### 4. Audit Your Instantly Account
```bash
python3 scripts/instantly-audit.py --output report.md
```
### 5. Monitor Competitors
```bash
python3 scripts/competitive-monitor.py --output report.md
```
### 6. Detect Cross-Signals
```bash
python3 scripts/cross-signal-detector.py \
--data-dir ./data/agent-outputs \
--output cross-signals.json
```
## Architecture
```
ICP Definition
      │
      ▼
Expert Panel Scoring (recursive → 90+)
      │
      ▼
Apollo Search → LeadMagic Verify → Dedupe → Instantly Upload
      │                                            │
      ▼                                            ▼
Competitive Monitor ◄──────────────► Cross-Signal Detector
      │
      ▼
Weekly Intelligence Report
```
## File Structure
```
outbound-engine/
├── README.md                      # This file
├── SKILL.md                       # Claude Code skill definition
├── .env.example                   # Environment variable template
├── requirements.txt               # Python dependencies
├── scripts/
│   ├── lead-pipeline.py           # Apollo → LeadMagic → Dedupe → Instantly
│   ├── instantly-audit.py         # Instantly account health check
│   ├── competitive-monitor.py     # Competitor tracking
│   ├── cross-signal-detector.py   # Multi-source signal detection
│   └── cold-outbound-sender.py    # Send approved outbound emails
└── references/
    ├── expert-panel.md            # Default 10-expert scoring roster
    ├── copy-rules.md              # Cold email copywriting rules
    ├── icp-template.md            # ICP data collection template
    └── instantly-rules.md         # Instantly variable syntax & deliverability rules
```
## Requirements
- Python 3.9+
- API keys: Apollo, LeadMagic, Instantly (see `.env.example`)
- For the sender script: a configured email sending tool (e.g., `gog` CLI or SMTP)
- Claude Code or similar AI coding agent (for running the SKILL.md workflow)
## Customization
- **ICP**: Edit `references/icp-template.md` or provide parameters at runtime
- **Expert Panel**: Swap panelists in `references/expert-panel.md` for your industry
- **Competitors**: Configure the `COMPETITORS` dict in `competitive-monitor.py`
- **Send limits**: Adjust `MAX_PER_DAY` in `cold-outbound-sender.py`
- **Data sources**: Point `cross-signal-detector.py` at your own data directories
## License
MIT

outbound-engine/SKILL.md Normal file

@ -0,0 +1,157 @@
---
name: cold-outbound-optimizer
description: Design, analyze, and optimize cold outbound email campaigns for Instantly. Handles end-to-end ICP definition, expert panel scoring (recursive to 90+), sequence copywriting, infrastructure audit, capacity planning, and implementation docs. Use when asked to build cold outbound sequences, optimize cold email, analyze outbound campaigns, build sales sequences, build Instantly sequences, create cold outbound strategies, or design email campaigns. Supports both "start from scratch" and "optimize existing" modes.
---
# Cold Outbound Optimizer
---
## Startup: Determine Mode
Ask the user:
1. Do you have an **existing Instantly account** with campaigns to audit, or are you **starting from scratch**?
2. Do you have an **Instantly API key**? (Required for audit mode.)
If API key provided → run `scripts/instantly-audit.py` to pull campaigns, account inventory, and warmup scores before proceeding.
---
## Phase 1: Discovery & Audit
### 1A — Infrastructure Check (if API key available)
Run `python3 scripts/instantly-audit.py --api-key <KEY>` and report:
- Active campaigns (name, status, reply rate, open rate)
- Sending accounts (count, warmup score, daily limit)
- Domain inventory
- Warmup gaps: flag any account with a warmup score below 80 or fewer than 14 days of warmup as NOT ready
### 1B — Performance Data
- Pull campaign analytics from Instantly
- Ask: "Do you have a spreadsheet with historical outbound data?" If yes, request link.
### 1C — ICP Definition
If no ICP defined, collect:
- **Titles:** Who are you targeting? (e.g., VP Marketing, Head of Growth)
- **Industries:** Which verticals?
- **Company size:** Employee count or revenue range?
- **Revenue floor:** Minimum ARR/revenue to qualify?
- **Anti-ICP:** Who to explicitly exclude?
Use `references/icp-template.md` as the collection template.
### 1D — Business Context
Collect:
- What do you sell? (One sentence, no jargon)
- What's the primary offer? (Free trial, audit, demo, consultation)
- Real URLs to reference (pricing page, case studies, relevant content)
- Any proof points? (Client results, stats, social proof)
### 1E — Expert Panel Config
Default: 10 experts (see `references/expert-panel.md`).
Ask: "Any industry-specific experts to add, or panelists to swap?" Confirm roster before scoring.
---
## Phase 2: Expert Panel Recursive Scoring
**Target: 90/100. Non-negotiable. Iterate until reached.**
### Round Structure
Each round produces:
1. **Score table** — all 10 panelists, individual score (0-100), one-line rationale
2. **Aggregate score** — average of all 10
3. **Top weaknesses** — ranked list of what's holding the copy back
4. **Changes made** — specific edits addressing each weakness
5. **Updated copy** — full revised sequence after changes
### Scoring Criteria (per panelist's lens — see `references/expert-panel.md`)
- Subject line curiosity / open rate potential
- First sentence pattern interrupt
- Body clarity and brevity
- CTA softness and specificity
- Sequence flow and follow-up logic
- Deliverability risk signals (spam words, link density)
- Personalization believability
### Rules
- Scores must be brutally honest. No padding to 90 without earning it.
- If round score < 90: identify top 3 weaknesses, revise copy, run next round.
- If round score ≥ 90: finalize copy and proceed to deliverables.
- Show every round in the final doc — the iteration trail is part of the value.
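The pass/fail decision for a round can be sketched in a few lines. This assumes panel scores arrive as a `{panelist: score}` mapping; `summarize_round` is a hypothetical helper, and using the three lowest-scoring panelists as a pointer to the top weaknesses is one possible heuristic, not a rule from this skill.

```python
from statistics import mean

TARGET_SCORE = 90  # threshold stated in the rules above


def summarize_round(panel_scores, target=TARGET_SCORE):
    """Aggregate one round of panelist scores (0-100) and decide what happens next."""
    if not panel_scores:
        raise ValueError("panel_scores must not be empty")
    aggregate = mean(panel_scores.values())
    # The three lowest-scoring panelists hint at where the weaknesses are.
    lowest = sorted(panel_scores, key=panel_scores.get)[:3]
    return {
        "aggregate": round(aggregate, 1),
        "passed": aggregate >= target,
        "lowest_scoring_panelists": lowest,
    }
```

If `passed` is false, revise the copy against the weaknesses and score again; if true, finalize and move to deliverables.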
---
## Phase 3: Deliverables
### Strategy Doc
Create a document (Google Doc, Notion, or markdown) with:
1. **Pre-Analysis / Brutal Truth** — what the existing campaigns are doing wrong (or baseline if starting from scratch)
2. **ICP Summary** — confirmed targeting parameters
3. **Infrastructure Status** — account inventory, warmup readiness, capacity math
4. **Scoring Rounds** — full panel vote tables for every round
5. **Final Email Copy** — all steps for all campaigns, Instantly-ready format
6. **Implementation Plan** — step-by-step setup instructions
7. **Capacity Math** — accounts × daily send rate = pipeline projections
8. **Weekly Metrics Targets** — open rate, reply rate, positive reply rate, meetings booked
9. **STOP List** — what to kill immediately
10. **START List** — what to launch first
### Format Rules for Final Copy
Follow all rules in `references/instantly-rules.md` and `references/copy-rules.md`.
### Human Review Gate
**Do NOT push anything to Instantly automatically.** The doc is for human review. Get explicit approval before any API writes.
### Iteration
After review, collect feedback and re-run scoring on revised copy if needed.
---
## Capacity Math Formula
```
Accounts ready (score ≥80, ≥14 days warmup) × 30 emails/day = conservative daily volume
Accounts ready (score ≥90, ≥30 days warmup) × 50 emails/day = aggressive daily volume
Daily volume × 22 working days = monthly send capacity
Monthly sends × expected reply rate = expected replies
Expected replies × qualification rate = pipeline opportunities
```
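As a worked sketch of the formula, the function below chains the same multiplications. The default reply rate uses the 3% "Good" baseline from the metrics table; the 25% qualification rate is a placeholder assumption, not a benchmark from this skill, and the function name is hypothetical.

```python
def capacity_projection(accounts_ready, emails_per_day=30, working_days=22,
                        reply_rate=0.03, qualification_rate=0.25):
    """Walk the capacity formula from ready accounts to pipeline opportunities.

    qualification_rate is an assumed placeholder; replace it with real data.
    """
    daily_volume = accounts_ready * emails_per_day
    monthly_sends = daily_volume * working_days
    expected_replies = monthly_sends * reply_rate
    opportunities = expected_replies * qualification_rate
    return {
        "daily_volume": daily_volume,
        "monthly_sends": monthly_sends,
        "expected_replies": expected_replies,
        "pipeline_opportunities": opportunities,
    }


if __name__ == "__main__":
    # Example: 10 ready accounts at the conservative 30/day rate
    for key, value in capacity_projection(10).items():
        print(f"{key}: {value:.1f}")
```

Running it twice, once with `emails_per_day=30` and once with `emails_per_day=50`, gives the conservative and aggressive ends of the range.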
---
## Weekly Metrics Targets (Baselines)
| Metric | Good | Great |
|--------|------|-------|
| Open rate | 40%+ | 60%+ |
| Reply rate | 3%+ | 7%+ |
| Positive reply rate | 1%+ | 3%+ |
| Meeting rate | 0.5%+ | 1.5%+ |
Adjust targets based on niche and offer. Cold traffic to a free audit converts differently than a paid trial.
---
## Add-On Recommendations (mention but don't build)
- **LinkedIn automation:** HeyReach or similar for multi-channel sequences. Separate workflow.
- **Lead enrichment:** Clay or Apollo for personalization data before upload.
- **Lead pipeline:** Use `scripts/lead-pipeline.py` for Apollo → LeadMagic → Instantly automation.
---
## Reference Files
| File | Purpose |
|------|---------|
| `references/instantly-rules.md` | Variable syntax, sequence structure, deliverability rules |
| `references/expert-panel.md` | Default 10-expert roster with scoring lenses |
| `references/copy-rules.md` | Email copy rules (first sentence, CTA, stats framing) |
| `references/icp-template.md` | ICP data collection template |
| `scripts/instantly-audit.py` | Pulls campaigns, accounts, warmup scores via Instantly v2 API |
| `scripts/lead-pipeline.py` | End-to-end lead sourcing pipeline |
| `scripts/competitive-monitor.py` | Competitor tracking and intelligence |
| `scripts/cross-signal-detector.py` | Multi-source signal detection |
| `scripts/cold-outbound-sender.py` | Send approved outbound emails |

@ -0,0 +1,139 @@
# Cold Email Copy Rules
Rules for writing and evaluating cold email copy. Apply to every step in every sequence.
---
## First Sentence Rules
**NEVER start with:**
- "I" — e.g., "I came across your company..."
- "We" — e.g., "We help companies like yours..."
- "Our team" — e.g., "Our team specializes in..."
- "I wanted to" — e.g., "I wanted to reach out because..."
- "Hope this finds you well" or any version of it
- "My name is..." (save for follow-ups if needed, never Step 1)
**ALWAYS start with one of:**
- **Prospect's company name** — "{{companyName}}'s recent..."
- **A specific market observation** — "Most [industry] companies we talk to are..."
- **A specific finding** — "Your [blog post / LinkedIn post / job listing] on..."
- **A relevant trend** — "Since [relevant thing] happened in [industry]..."
The first sentence earns the second. If it doesn't make the prospect think "hm, relevant," the email is dead.
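A rough way to lint Step 1 openers against the NEVER list is a few anchored regular expressions. This is a sketch, not part of the skill's scripts; the pattern list and function name are assumptions, and the "Hope this finds you well" pattern only covers the most common variants.

```python
import re

# Patterns are matched against the lowercased, stripped first sentence
BANNED_OPENER_PATTERNS = [
    r"^i\b",              # "I came across...", "I'm...", "I wanted to..."
    r"^we\b",             # "We help...", "We're..."
    r"^our team\b",
    r"^my name is\b",
    r"^hope this (email |message |note )?finds you\b",
]


def banned_opener(first_sentence):
    """Return the pattern that matched, or None if the opener looks acceptable."""
    text = first_sentence.strip().lower()
    for pattern in BANNED_OPENER_PATTERNS:
        if re.search(pattern, text):
            return pattern
    return None
```

The word-boundary anchors keep openers such as "In case..." or "Website traffic..." from being flagged just because they begin with the same letters.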
---
## Body Length Rules
| Step | Max sentences | Notes |
|------|--------------|-------|
| Step 1 | 3 sentences | Open + value + CTA. That's it. |
| Steps 2-4 | 3-5 sentences | Add new angle or asset, not a repeat |
| Step 5 (bump) | 1-2 sentences | Short. "Still relevant?" style. |
| Step 6 (breakup) | 2-3 sentences | Leave value; don't announce you're "closing their file." |
If a step is longer than this, cut it. Ruthlessly.
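The limits in the table can be checked mechanically. The sketch below uses a naive sentence splitter (it will miscount abbreviations such as "e.g."), and the names are hypothetical; treat the result as a prompt to review, not a verdict.

```python
import re

# Upper bounds from the table above, keyed by step number
MAX_SENTENCES = {1: 3, 2: 5, 3: 5, 4: 5, 5: 2, 6: 3}


def count_sentences(body):
    """Naive count: split after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", body.strip())
    return len([part for part in parts if part])


def over_limit(step, body):
    """True if the body has more sentences than the table allows for this step."""
    return count_sentences(body) > MAX_SENTENCES[step]
```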
---
## Stats and Social Proof
**Correct framing (observation):**
> "Most brands we audit are leaving 30-40% of their SEO traffic unconverted."
**Incorrect framing (study-style claim):**
> "According to our data, 73% of brands have this problem."
Why: Observation sounds like earned experience. Study sounds like a marketing claim. Prospects believe the former.
**Never fabricate:**
- Specific client names unless verified and approved
- Revenue numbers or % improvements unless you have the actual data
- Podcast episodes or content references unless they exist and are linkable
- Case study specifics — if you can't verify it, generalize it
---
## CTAs
**Soft asks (preferred):**
- "Worth a look?"
- "Want the data?"
- "Does this match what you're seeing?"
- "Relevant to what you're working on?"
- "Happy to share what we found — useful?"
**Hard asks (avoid in Step 1):**
- "Book a call with me" → too much commitment too early
- "Schedule 30 minutes" → presumes interest
- "Let's hop on a call" → pushy
- "Are you free Thursday?" → too forward for a stranger
Use hard asks only in Step 4+ if you've gotten engagement signals. Even then, soften them.
---
## Links
- **Step 1:** No links (deliverability + trust)
- **Steps 2-3:** Max 1 link, only if it adds genuine value (a case study, a report, a tool)
- **Breakup email:** Include 1 real link to genuinely useful content (not a sales page)
- **Never:** Hallucinate URLs. All links must be verified real pages before use.
- **Never:** Link to a landing page with a form in Steps 1-2 — it signals spray-and-pray
---
## Breakup Email (Final Step)
**Correct:**
> Leave something genuinely useful. A real article, a real report, a real piece of content that relates to their problem.
> "In case it's useful regardless — here's the framework we use: [real URL]. No pressure on the rest."
**Incorrect:**
> "Just wanted to close the loop / closing your file / marking you as not interested"
> This is negative framing and slightly manipulative. The prospect notices.
---
## AI Engine References
When listing AI tools in copy or messaging, always include the full set:
**ChatGPT, Perplexity, Gemini, Claude**
Do not omit any major AI platform. If listing "AI tools" or "AI search engines," include all four.
---
## Personalization Rules
- `{{personalization}}` field: must be set per lead. Don't leave it generic.
- Personalization should reference something *specific* to the company: a recent hire, a published piece, a product launch, a job listing signal, a funding round.
- If you can't personalize at least 50% of the list, remove `{{personalization}}` from the template and rewrite to not depend on it.
---
## Subject Lines
- Length: 3-7 words is the sweet spot
- No exclamation points
- No all-caps
- No emoji in B2B cold email (unless targeting a persona that expects it)
- Best patterns:
  - Question: "Quick question, {{firstName}}"
  - Observation: "{{companyName}}'s content strategy"
  - Specificity: "Saw your post on [topic]"
  - Intrigue: "One thing we noticed"
- A/B test 2 variants per Step 1. Pick winner after 100+ sends each.
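The mechanical subject-line rules (length, punctuation, caps, emoji) can be linted with a short function. This is a sketch under assumptions: the all-caps check flags any fully uppercase alphabetic word, so acronyms like "SEO" will be reported, and the emoji check is a rough code-point test that misses some older symbols.

```python
def subject_line_issues(subject):
    """Return a list of rule violations for a subject line (empty list = clean)."""
    issues = []
    words = subject.split()
    if not 3 <= len(words) <= 7:
        issues.append("length outside 3-7 words")
    if "!" in subject:
        issues.append("exclamation point")
    # Flags acronyms too; review those by hand
    if any(len(w) > 1 and w.isalpha() and w.isupper() for w in words):
        issues.append("all-caps word")
    # Rough emoji test: most emoji live at or above U+1F000
    if any(ord(ch) >= 0x1F000 for ch in subject):
        issues.append("emoji")
    return issues
```

Template variables such as `{{firstName}}` count as one word here, which matches how they render for most leads.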
---
## Tone
- Peer-to-peer, not vendor-to-prospect
- Curious, not desperate
- Specific, not generic
- Short, not comprehensive
- Human, not corporate
If it sounds like a marketing email, rewrite it. Cold email that converts sounds like a text from a knowledgeable peer.

@ -0,0 +1,89 @@
# Expert Panel — Default Roster
10 outbound sales experts. Each scores copy through their specific lens.
User can swap or add panelists based on industry or offer type.
---
## Default Panel
### 1. Alex Berman
**Background:** Cold email master, B2B agency lead gen. $100M+ pipeline generated via cold outreach.
**Scoring lens:** Raw reply rate potential. Does this email get a "yes" or "tell me more"? Evaluates offer clarity, brevity, and specificity.
**Red flags he catches:** Vague value props, over-explaining, walls of text.
### 2. Oren Klaff
**Background:** Author of *Pitch Anything*. Neuromarketing and frame control specialist.
**Scoring lens:** Frame and status. Does this email position the sender as high-status? Is there genuine scarcity or social proof? Does it trigger "I need to respond to this"?
**Red flags he catches:** Begging energy, "I just wanted to...", weak positioning.
### 3. Josh Braun
**Background:** *Badass B2B Growth*. Anti-spam cold email philosophy.
**Scoring lens:** Does this email respect the prospect? Is it genuinely useful or just noise? Evaluates honest curiosity, relevant observations, and non-pushy CTAs.
**Red flags he catches:** Fake personalization, presumptuous CTAs, spray-and-pray signals.
### 4. Becc Holland
**Background:** Creator of "Flip the Script." Pattern interrupt specialist.
**Scoring lens:** Does Step 1 stop the scroll? Is the opening surprising enough to earn the next sentence? Evaluates subject line + first sentence combo.
**Red flags she catches:** Generic openers, "I hope this finds you well", predictable subject lines.
### 5. Sam McKenna
**Background:** #samsales. "Show Me You Know Me" methodology.
**Scoring lens:** Research depth. Does the email prove the sender actually knows this prospect? Evaluates specificity of personalization and relevance of observation.
**Red flags she catches:** Generic compliments, surface-level research, "I noticed your website..."
### 6. Kyle Coleman
**Background:** Copy.ai VP Marketing. B2B sequencing strategy.
**Scoring lens:** Sequence architecture. Does the follow-up ladder make sense? Does each step add new value rather than just bumping? Evaluates sequence logic and escalation.
**Red flags he catches:** Repetitive follow-ups, "just checking in", no value escalation.
### 7. Will Allred
**Background:** Lavender co-founder. Reply rate optimization via AI-assisted email analysis.
**Scoring lens:** Readability and reply rate signals. Reading grade level, sentence length, mobile rendering, emotional tone. Does this feel like a real email from a real person?
**Red flags he catches:** Long sentences, passive voice, corporate jargon, "synergies".
### 8. Jeremy Donovan
**Background:** SalesLoft SVP of Revenue Strategy. Data-driven deliverability and analytics.
**Scoring lens:** Deliverability and measurability. Are there spam triggers? Is the send structure safe? Are metrics targets realistic?
**Red flags he catches:** Spam words, link overload, unrealistic reply rate expectations.
### 9. Jeb Blount
**Background:** Author of *Fanatical Prospecting*. Multi-channel outbound systems.
**Scoring lens:** Pipeline math and multi-channel logic. Is the sequence volume sufficient? Should LinkedIn or phone be layered in? Is the outreach sustainable?
**Red flags he catches:** Under-resourced sequences, single-channel dependency, no follow-through plan.
### 10. Patrick Dang
**Background:** B2B sales coach. Email + LinkedIn combo plays.
**Scoring lens:** LinkedIn integration potential. Does the email sequence have natural LinkedIn touchpoints? Is the overall outreach strategy connected across channels?
**Red flags he catches:** Siloed email sequences with no social proof layer, no profile warmup.
---
## Scoring Table Format (per round)
| Panelist | Score | Rationale |
|----------|-------|-----------|
| Alex Berman | XX | [one-line reason] |
| Oren Klaff | XX | [one-line reason] |
| Josh Braun | XX | [one-line reason] |
| Becc Holland | XX | [one-line reason] |
| Sam McKenna | XX | [one-line reason] |
| Kyle Coleman | XX | [one-line reason] |
| Will Allred | XX | [one-line reason] |
| Jeremy Donovan | XX | [one-line reason] |
| Jeb Blount | XX | [one-line reason] |
| Patrick Dang | XX | [one-line reason] |
| **AVERAGE** | **XX** | |
---
## Swapping / Adding Panelists
User may request panelist changes. Examples:
- Selling to HR → add Lou Adler (hiring-focused B2B sales)
- Selling SaaS dev tools → add Jason Lemkin (SaaS-specific outbound)
- Selling to enterprise → add John Barrows (enterprise sales methodology)
- Selling to e-commerce → add Ezra Firestone (e-com marketing lens)
When adding panelists, define their scoring lens before running rounds.
Minimum panel size: 5. Maximum: 15 (more than 15 creates noise, not signal).

@ -0,0 +1,160 @@
# ICP Data Collection Template
Use this template when defining the Ideal Customer Profile. Collect all fields before writing copy.
---
## ICP Definition
**Client/Campaign:** _______________
**Date:** _______________
**Collected by:** _______________
---
### Target Titles
Who specifically receives these emails? List primary and secondary titles.
**Primary titles (high intent):**
- e.g., VP of Marketing
- e.g., Director of Demand Generation
- e.g., Head of Growth
**Secondary titles (acceptable, lower priority):**
- e.g., CMO (at smaller companies)
- e.g., Marketing Manager (if company size <50)
**Never target:**
- e.g., Coordinators, Interns, Assistants (unless specifically requested)
---
### Target Industries / Verticals
**Primary verticals:**
1.
2.
3.
**Secondary verticals (test, not primary):**
1.
2.
**Excluded verticals (anti-ICP):**
- e.g., Non-profits (budget constraints)
- e.g., Government (procurement timelines)
- e.g., [Specific vertical you can't serve]
---
### Company Size
**Employee count range:**
- Minimum: ___
- Maximum: ___
- Sweet spot: ___
**Revenue range (if targeting by revenue):**
- Minimum ARR/Revenue: $___
- Maximum: $___
**Funding stage (if relevant):**
- e.g., Series A+
- e.g., Bootstrapped >$5M revenue
- e.g., PE-backed
---
### Geographic Targeting
**Primary markets:**
- e.g., US only
- e.g., US + Canada
- e.g., English-speaking markets
**Excluded regions:**
- e.g., APAC (different sales motion)
---
### Buying Signals / Trigger Events
What makes a company more likely to buy right now?
- e.g., Recently hired a new VP Marketing (job posting signal)
- e.g., Raised funding in last 6 months
- e.g., Launched new product in last 90 days
- e.g., Running paid search (visible via SpyFu/Semrush)
- e.g., Job listings for [role] signal they need help
---
### Anti-ICP (Explicit Exclusions)
Who should never receive these emails?
**Company characteristics:**
- e.g., <10 employees (too small, no budget)
- e.g., Bootstrapped and not scaling
- e.g., Already a current client
**Contact characteristics:**
- e.g., No verified email (bounce risk)
- e.g., Missing firstName (won't personalize)
- e.g., Opt-out list
---
### Offer-to-ICP Fit
**What's the primary offer?**
- [ ] Free audit
- [ ] Free trial
- [ ] Demo
- [ ] Strategy call
- [ ] Content/report download
- [ ] Other: _______________
**Why this offer for this ICP?**
(One sentence — if you can't answer this, the offer needs rethinking)
---
### Known Objections
What does this ICP typically say no to?
1.
2.
3.
**How we address objections in copy:**
(Don't address all of them — pick the one that kills the most deals and neutralize it in Step 3 or 4)
---
### Personalization Data Available
What data fields are available per lead?
- [ ] firstName ✓ (required)
- [ ] companyName ✓ (required)
- [ ] personalization field — source: _______________
- [ ] Industry
- [ ] Employee count
- [ ] LinkedIn URL
- [ ] Other: _______________
**Personalization source:**
- e.g., Clay enrichment
- e.g., Apollo export
- e.g., Manual research (for small lists)
- e.g., None (template must work without it)
---
### Notes / Special Instructions
Any other context the copywriter needs:
_______________

@ -0,0 +1,79 @@
# Instantly-Specific Rules
## Valid Variables (ONLY these — no others)
| Variable | Usage |
|----------|-------|
| `{{firstName\|there}}` | Prospect first name, fallback "there" |
| `{{companyName\|your company}}` | Prospect company, fallback "your company" |
| `{{personalization}}` | Custom personalization field (set per lead) |
| `{{sendingAccountFirstName}}` | Sender's first name (from sending account) |
**Never use:**
- Square-bracket placeholders like `[Competitor A]`, `[Your Company]`, `[Industry]`
- Custom variables not listed above — they won't render in Instantly
- If a concept can't be expressed with valid variables, rewrite the copy to not need it
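Before copy goes into the strategy doc, a template can be scanned for anything outside the valid variable list. The sketch below is an assumption-labeled helper (not one of the skill's scripts): it accepts an optional `|fallback` after the variable name and reports square-bracket placeholders separately.

```python
import re

# The four variables listed in the table above
VALID_VARIABLES = {
    "firstName",
    "companyName",
    "personalization",
    "sendingAccountFirstName",
}


def template_problems(template):
    """Return a list of unknown variables and square-bracket placeholders."""
    problems = []
    # {{name}} or {{name|fallback}}
    for match in re.finditer(r"\{\{\s*([^}|]+?)\s*(?:\|[^}]*)?\}\}", template):
        name = match.group(1)
        if name not in VALID_VARIABLES:
            problems.append(f"unknown variable: {name}")
    # [Competitor A], [Your Company], ...
    for match in re.finditer(r"\[[^\]\n]+\]", template):
        problems.append(f"square-bracket placeholder: {match.group(0)}")
    return problems
```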
## firstName Rule (Critical)
- **Always require firstName** during lead upload. Filter out leads without first name.
- Do NOT rely on the `|there` fallback as a design choice — it signals a bad list.
- If the list has >5% missing firstName, flag it before launch.
## Sequence Structure
- **Steps:** 5-6 max (not 8). Diminishing returns after 6.
- **Step delays (days after previous step):**
  - Step 1: Day 0 (immediate)
  - Step 2: Day 2
  - Step 3: Day 4-7
  - Step 4: Day 7
  - Step 5: Day 7-14
  - Step 6 (breakup): Day 7-14 after Step 5
## A/B Testing
- **Step 1 only:** Test 2 subject line variants (A/B)
- Don't A/B test body copy in early campaigns — isolate subject line variable first
- Winning subject line = whichever hits higher open rate at 100+ sends per variant
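The winner rule can be expressed as a small comparison. This sketch assumes each variant is summarized as a dict with `sends` and `opens` counts (hypothetical field names) and applies no statistical significance test, only the 100-send minimum stated above.

```python
MIN_SENDS_PER_VARIANT = 100  # from the rule above


def pick_subject_winner(variant_a, variant_b):
    """Return "A", "B", or None if it is too early to call or the rates tie.

    Each variant is a dict with integer "sends" and "opens" counts.
    """
    if min(variant_a["sends"], variant_b["sends"]) < MIN_SENDS_PER_VARIANT:
        return None
    rate_a = variant_a["opens"] / variant_a["sends"]
    rate_b = variant_b["opens"] / variant_b["sends"]
    if rate_a == rate_b:
        return None
    return "A" if rate_a > rate_b else "B"
```

With only 100 sends per variant, small differences in open rate are likely noise, so a narrow win is worth re-testing.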
## Signature Format
```
{{sendingAccountFirstName}}
```
- No company name, no title, no tagline — unless explicitly requested
- Keep it human. Feels like it came from a person, not a company.
## Deliverability Rules
### Send Limits
- **Safe:** 30 emails/day per account
- **Aggressive:** 50 emails/day per account (only with score 90+, warmed 30+ days)
- Never exceed 50/day per account without explicit discussion
### Warmup Requirements
- **Minimum:** 14 days warmup before first campaign
- **Minimum score:** 80+ warmup score
- Accounts below 80 or under 14 days: DO NOT add to active campaigns
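Taken together, the send limits and warmup requirements amount to a three-tier rule per account. A minimal sketch, with a hypothetical function name and with the score and warmup age passed in as plain numbers:

```python
def daily_send_limit(warmup_score, warmup_days):
    """Return the maximum emails/day for one account under the rules above.

    0  -> not ready (score below 80 or fewer than 14 days of warmup)
    30 -> safe limit
    50 -> aggressive ceiling (score 90+ and warmed 30+ days)
    """
    if warmup_score < 80 or warmup_days < 14:
        return 0
    if warmup_score >= 90 and warmup_days >= 30:
        return 50
    return 30
```

The 50/day value is a ceiling the rules permit, not a default; the safe limit of 30 still applies unless the aggressive rate has been agreed.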
### Domain Setup (must verify before launch)
- SPF: configured and passing
- DKIM: configured and passing
- DMARC: policy set (at minimum p=none with reporting)
- MX records: pointing correctly
- Custom tracking domain: set up in Instantly (subdomain, not root domain)
### Spam Signals to Avoid
- Words: "free", "guarantee", "no risk", "limited time", "act now", "click here"
- Excessive links (max 1 per email, ideally 0 in Steps 1-2)
- Images in cold email (never)
- HTML formatting (plain text only)
- All-caps words
- Exclamation points in subject lines
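Several of these signals are easy to scan for automatically. The sketch below is illustrative only: the phrase list is copied from the bullet above, matching is on whole words (so "guaranteed" is not caught), and the function name is hypothetical.

```python
import re

SPAM_PHRASES = ["free", "guarantee", "no risk", "limited time", "act now", "click here"]


def spam_signals(subject, body):
    """Return a list of spam signals found in one email (empty list = none found)."""
    signals = []
    text = f"{subject}\n{body}".lower()
    for phrase in SPAM_PHRASES:
        if re.search(rf"\b{re.escape(phrase)}\b", text):
            signals.append(f"spam phrase: {phrase}")
    if len(re.findall(r"https?://", body)) > 1:
        signals.append("more than 1 link")
    if "!" in subject:
        signals.append("exclamation point in subject")
    if re.search(r"<[a-zA-Z][^>]*>", body):
        signals.append("HTML markup")
    return signals
```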
## Upload Requirements
Leads must have:
- `firstName` (required — filter out if missing)
- `email` (required)
- `companyName` (required for `{{companyName}}` variable)
- `personalization` (required if using `{{personalization}}` in sequence)
Validate list before upload. Bad data = bad deliverability.
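A pre-upload check along these lines could look like the sketch below. It assumes leads are plain dicts keyed by the field names above; the function name and return shape are hypothetical.

```python
def validate_leads(leads, uses_personalization=False):
    """Filter leads to those with all required fields and report firstName gaps."""
    required = ["firstName", "email", "companyName"]
    if uses_personalization:
        required.append("personalization")

    def has_value(lead, field):
        return bool((lead.get(field) or "").strip())

    valid = [lead for lead in leads if all(has_value(lead, f) for f in required)]
    missing_first = sum(1 for lead in leads if not has_value(lead, "firstName"))
    missing_pct = 100 * missing_first / len(leads) if leads else 0.0
    return {
        "valid": valid,
        "missing_first_name_pct": missing_pct,
        # Flag the list before launch when more than 5% lack firstName
        "flag_list": missing_pct > 5,
    }
```

A flagged list is a prompt to fix the source data, since filtering alone hides the underlying list-quality problem.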

@ -0,0 +1 @@
requests>=2.28.0

@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""
Cold Outbound Sender: sends approved emails via SMTP or a configured email CLI.
Reads from a JSON file of approved prospects, sends up to N/day,
logs to a history file.

Usage:
    python3 cold-outbound-sender.py [--dry-run] [--max N]
    python3 cold-outbound-sender.py --approved-file path/to/approved.json
    python3 cold-outbound-sender.py --send-method smtp

Environment variables:
    SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASSWORD: for SMTP sending
    SENDER_EMAIL: sender email address
    SENDER_NAME: sender display name
"""
import argparse
import json
import os
import re
import smtplib
import subprocess
import sys
from datetime import datetime
from email.mime.text import MIMEText
from pathlib import Path

DEFAULT_MAX_PER_DAY = 10
DEFAULT_APPROVED_FILE = "./data/cold-outbound-approved.json"
DEFAULT_HISTORY_FILE = "./data/cold-outbound-history.json"


def validate_outbound(text):
    """Basic validation for outbound content. Returns (ok, text)."""
    if not text or not isinstance(text, str):
        return False, text

    # Check for common leaked credential patterns
    suspicious_patterns = [
        r'sk-[a-zA-Z0-9]{20,}',        # API keys
        r'Bearer [a-zA-Z0-9\-_.]+',    # Auth headers
        r'/Users/[a-zA-Z]+/',          # Local paths
        r'password\s*[:=]\s*\S+',      # Password patterns
    ]
    for pattern in suspicious_patterns:
        if re.search(pattern, text, re.IGNORECASE):
            return False, text
    return True, text
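def load_history(history_path):
    """Load send history; return an empty list if the file is missing or unreadable."""
    if os.path.exists(history_path):
        try:
            with open(history_path) as f:
                return json.load(f)
        except (OSError, ValueError):
            # Corrupt or unreadable history: start from an empty list
            pass
    return []


def save_history(history, history_path):
    # dirname is "" for a bare filename, and os.makedirs("") raises
    history_dir = os.path.dirname(history_path)
    if history_dir:
        os.makedirs(history_dir, exist_ok=True)
    with open(history_path, 'w') as f:
        json.dump(history, f, indent=2)


def count_sent_today(history):
    today = datetime.now().strftime("%Y-%m-%d")
    return sum(1 for h in history if h.get("sent_date", "").startswith(today))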
def send_email_smtp(to, subject, body, sender_email, sender_name,
                    smtp_host, smtp_port, smtp_user, smtp_password, dry_run=False):
    """Send via SMTP."""
    ok_subj, subject = validate_outbound(subject)
    ok_body, body = validate_outbound(body)
    if not ok_subj or not ok_body:
        print(f" 🛡️ Email to {to} BLOCKED by validation (suspicious content detected)")
        return False

    if dry_run:
        print(f" [DRY RUN] Would send to {to}: {subject}")
        return True

    try:
        msg = MIMEText(body, 'plain')
        msg['Subject'] = subject
        msg['From'] = f"{sender_name} <{sender_email}>"
        msg['To'] = to

        with smtplib.SMTP(smtp_host, int(smtp_port)) as server:
            server.starttls()
            server.login(smtp_user, smtp_password)
            server.sendmail(sender_email, [to], msg.as_string())

        print(f" ✅ Sent to {to}: {subject}")
        return True
    except Exception as e:
        print(f" ❌ Error sending to {to}: {e}", file=sys.stderr)
        return False
def send_email_cli(to, subject, body, sender_email, sender_name, cli_command, dry_run=False):
    """Send via a CLI tool (e.g., gog, msmtp, mailx)."""
    ok_subj, subject = validate_outbound(subject)
    ok_body, body = validate_outbound(body)
    if not ok_subj or not ok_body:
        print(f" 🛡️ Email to {to} BLOCKED by validation (suspicious content detected)")
        return False

    if dry_run:
        print(f" [DRY RUN] Would send to {to}: {subject}")
        return True

    try:
        # Default CLI pattern: gog gmail send
        cmd = cli_command.split() + [
            "--to", to,
            "--subject", subject,
            "--body", body,
            "--from", f"{sender_name} <{sender_email}>",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        if result.returncode == 0:
            print(f" ✅ Sent to {to}: {subject}")
            return True
        else:
            print(f" ❌ Failed to send to {to}: {result.stderr}", file=sys.stderr)
            return False
    except Exception as e:
        print(f" ❌ Error sending to {to}: {e}", file=sys.stderr)
        return False
def main():
    parser = argparse.ArgumentParser(description="Cold Outbound Sender")
    parser.add_argument("--dry-run", action="store_true", help="Don't actually send emails")
    parser.add_argument("--max", type=int, default=DEFAULT_MAX_PER_DAY,
                        help=f"Max emails per day (default: {DEFAULT_MAX_PER_DAY})")
    parser.add_argument("--approved-file", default=DEFAULT_APPROVED_FILE,
                        help="Path to approved prospects JSON file")
    parser.add_argument("--history-file", default=DEFAULT_HISTORY_FILE,
                        help="Path to send history JSON file")
    parser.add_argument("--send-method", choices=["smtp", "cli"], default="smtp",
                        help="Send method: smtp or cli (default: smtp)")
    parser.add_argument("--cli-command", default="gog gmail send",
                        help="CLI command for sending (used with --send-method cli)")
    args = parser.parse_args()

    # Load config from env
    sender_email = os.environ.get("SENDER_EMAIL", "")
    sender_name = os.environ.get("SENDER_NAME", "")
    smtp_host = os.environ.get("SMTP_HOST", "smtp.gmail.com")
    smtp_port = os.environ.get("SMTP_PORT", "587")
    smtp_user = os.environ.get("SMTP_USER", sender_email)
    smtp_password = os.environ.get("SMTP_PASSWORD", "")

    if not os.path.exists(args.approved_file):
        print(f"No approved prospects file found at {args.approved_file}")
        sys.exit(0)

    with open(args.approved_file) as f:
        approved = json.load(f)

    history = load_history(args.history_file)
    sent_today = count_sent_today(history)
    remaining = args.max - sent_today
    if remaining <= 0:
        print(f"Already sent {sent_today} emails today (max {args.max}). Stopping.")
        sys.exit(0)

    sent_count = 0
    for prospect in approved:
        if sent_count >= remaining:
            break

        email = prospect.get("email")
        if not email or email == "Unknown":
            continue

        # Check if already sent
        if any(h.get("email") == email for h in history):
            print(f" SKIP {email}: already in history")
            continue

        angle_key = prospect.get("approved_angle", "A")
        drafts = prospect.get("angle_drafts", {})
        draft = drafts.get(angle_key, {})
        subject = draft.get("subject", f"Quick question for {prospect.get('company', 'you')}")
        body = draft.get("body", "")
        if not body:
            print(f" SKIP {email}: no draft body for angle {angle_key}")
            continue

        if args.send_method == "smtp":
            if not smtp_password and not args.dry_run:
                print("ERROR: SMTP_PASSWORD env var required for smtp sending.")
                sys.exit(1)
            success = send_email_smtp(
                email, subject, body, sender_email, sender_name,
                smtp_host, smtp_port, smtp_user, smtp_password, args.dry_run
            )
        else:
            success = send_email_cli(
                email, subject, body, sender_email, sender_name,
                args.cli_command, args.dry_run
            )

        if success:
            history.append({
                "company": prospect.get("company", ""),
                "contact_name": prospect.get("contact_name", ""),
                "email": email,
                "angle": angle_key,
                "subject": subject,
                "sent_date": datetime.now().isoformat(),
                "score": prospect.get("score", 0),
            })
            sent_count += 1
            # Persist after every send so a crash cannot cause duplicate sends
            if not args.dry_run:
                save_history(history, args.history_file)

    # Remove sent prospects from approved file
    if not args.dry_run and sent_count > 0:
        # .get() tolerates older history entries that lack an "email" key
        sent_emails = {h.get("email") for h in history}
        remaining_approved = [p for p in approved if p.get("email") not in sent_emails]
        with open(args.approved_file, 'w') as f:
            json.dump(remaining_approved, f, indent=2)

    print(f"\nSent {sent_count} emails ({'dry run' if args.dry_run else 'live'}). Total today: {sent_today + sent_count}")


if __name__ == "__main__":
    main()

@ -0,0 +1,483 @@
#!/usr/bin/env python3
"""
Competitive Monitor: tracks pricing, blog posts, and feature changes across competitors.
Generates weekly competitive intelligence diffs. Configurable competitor list.

Usage:
    python3 competitive-monitor.py
    python3 competitive-monitor.py --company acme
    python3 competitive-monitor.py --output report.md
    python3 competitive-monitor.py --config competitors.json

Competitor config can be provided via:
    1. --config flag pointing to a JSON file
    2. COMPETITORS_CONFIG env var pointing to a JSON file
    3. Built-in example competitors (for demo purposes)
"""
import argparse
import json
import os
import re
import sys
import urllib.request
import urllib.parse
from datetime import datetime, timedelta
from difflib import unified_diff
from typing import Dict, List, Optional
from html.parser import HTMLParser
from urllib.error import URLError, HTTPError


def validate_text(text, max_length=500000):
    """Basic input validation for scraped content."""
    if not text or not isinstance(text, str):
        return text
    # Truncate extremely long content
    if len(text) > max_length:
        text = text[:max_length]
    return text
class BlogExtractor(HTMLParser):
    """Extract blog post titles and dates from HTML."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self.current_title = None
        self.current_date = None
        self.in_title = False
        self.in_date = False
        self.title_tags = ['h1', 'h2', 'h3', 'h4']

    def handle_starttag(self, tag, attrs):
        if tag.lower() in self.title_tags:
            self.in_title = True
        for name, value in attrs:
            # value is None for attributes written without a value
            if name in ['class', 'id'] and value and any(
                date_word in value.lower() for date_word in ['date', 'time', 'published']
            ):
                self.in_date = True

    def handle_endtag(self, tag):
        if tag.lower() in self.title_tags:
            self.in_title = False
        self.in_date = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.current_title = data.strip()
        if self.in_date and data.strip():
            date_match = re.search(
                r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b\w+ \d{1,2},? \d{4}\b', data
            )
            if date_match:
                self.current_date = date_match.group()
        # Emit a post once both a title and a date have been seen
        if self.current_title and self.current_date:
            self.posts.append({
                'title': self.current_title,
                'date': self.current_date,
            })
            self.current_title = None
            self.current_date = None
class CompetitiveMonitor:
    """Main competitive monitoring class."""

    # Example competitors for demo. Override with --config or COMPETITORS_CONFIG.
    EXAMPLE_COMPETITORS = {
        'competitor_a': {
            'name': 'Competitor A',
            'domain': 'competitor-a.com',
            'pricing_url': 'https://www.competitor-a.com/pricing',
            'blog_url': 'https://www.competitor-a.com/blog',
            'linkedin_query': 'Competitor A site:linkedin.com',
            'jobs_query': 'Competitor A careers OR jobs',
        },
        'competitor_b': {
            'name': 'Competitor B',
            'domain': 'competitor-b.com',
            'pricing_url': 'https://www.competitor-b.com/pricing',
            'blog_url': 'https://www.competitor-b.com/blog',
            'linkedin_query': 'Competitor B site:linkedin.com',
            'jobs_query': 'Competitor B careers OR jobs',
        },
    }

    def __init__(self, data_dir: str = None, competitors: dict = None):
        self.data_dir = data_dir or os.path.join(os.getcwd(), 'data', 'competitive')
        self.pricing_dir = os.path.join(self.data_dir, 'pricing-snapshots')
        self.history_dir = os.path.join(self.data_dir, 'scan-history')
        self.competitors = competitors or self.EXAMPLE_COMPETITORS
        os.makedirs(self.pricing_dir, exist_ok=True)
        os.makedirs(self.history_dir, exist_ok=True)
    def fetch_url(self, url: str, timeout: int = 10) -> Optional[str]:
        """Fetch URL content with error handling."""
        try:
            headers = {
                'User-Agent': (
                    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                    'AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/91.0.4472.124 Safari/537.36'
                )
            }
            request = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(request, timeout=timeout) as response:
                content = response.read().decode('utf-8', errors='ignore')
                content = validate_text(content)
                return content
        except (URLError, HTTPError, UnicodeDecodeError) as e:
            print(f"❌ Error fetching {url}: {e}")
            return None

    def extract_blog_posts(self, html: str) -> List[Dict]:
        """Extract blog posts from HTML."""
        if not html:
            return []
        extractor = BlogExtractor()
        try:
            extractor.feed(html)
            return extractor.posts
        except Exception as e:
            print(f"Error extracting blog posts: {e}")
            return []

    def is_recent_post(self, date_str: str, days_back: int = 7) -> bool:
        """Check if post is from last N days."""
        if not date_str:
            return False
        formats = [
            '%m/%d/%Y', '%m-%d-%Y', '%Y-%m-%d',
            '%B %d, %Y', '%b %d, %Y', '%B %d %Y', '%b %d %Y',
        ]
        for fmt in formats:
            try:
                post_date = datetime.strptime(date_str, fmt)
                cutoff_date = datetime.now() - timedelta(days=days_back)
                return post_date >= cutoff_date
            except ValueError:
                continue
        return False
    def get_pricing_diff(self, company_key: str, current_content: str) -> Optional[str]:
        """Compare current pricing with previous snapshot."""
        today = datetime.now().strftime('%Y-%m-%d')
        pricing_file = os.path.join(self.pricing_dir, f'{company_key}-{today}.txt')
        with open(pricing_file, 'w', encoding='utf-8') as f:
            f.write(current_content)

        previous_files = [
            f for f in os.listdir(self.pricing_dir)
            if f.startswith(f'{company_key}-') and f != f'{company_key}-{today}.txt'
        ]
        if not previous_files:
            return "🆕 First pricing snapshot saved"

        # Filenames end in YYYY-MM-DD, so a reverse sort puts the newest first
        previous_files.sort(reverse=True)
        previous_file = os.path.join(self.pricing_dir, previous_files[0])
        try:
            with open(previous_file, 'r', encoding='utf-8') as f:
                previous_content = f.read()
            if current_content.strip() == previous_content.strip():
                return None
            current_lines = current_content.splitlines()
            previous_lines = previous_content.splitlines()
            diff = list(unified_diff(
                previous_lines, current_lines,
                fromfile='previous', tofile='current', n=0
            ))
            changes = len([
                line for line in diff
                if line.startswith(('+', '-')) and not line.startswith(('+++', '---'))
            ])
            return f"🔍 {changes} lines changed since last snapshot"
        except Exception as e:
            return f"❌ Error comparing snapshots: {e}"
    def scan_competitor(self, company_key: str) -> Dict:
        """Scan single competitor."""
        company = self.competitors[company_key]
        print(f"\n🔍 Scanning {company['name']}...")
        results = {
            'company': company['name'],
            'domain': company['domain'],
            'scan_time': datetime.now().isoformat(),
            'pricing': {},
            'blog': {},
            'search_queries': {
                'linkedin': company.get('linkedin_query', ''),
                'jobs': company.get('jobs_query', ''),
            },
        }

        # Fetch pricing page
        pricing_url = company.get('pricing_url')
        if pricing_url:
            print(f" 📄 Fetching pricing: {pricing_url}")
            pricing_content = self.fetch_url(pricing_url)
            if pricing_content:
                # Strip tags and collapse whitespace before snapshotting
                clean_content = re.sub(r'<[^>]+>', '', pricing_content)
                clean_content = re.sub(r'\s+', ' ', clean_content).strip()
                pricing_diff = self.get_pricing_diff(company_key, clean_content)
                results['pricing'] = {
                    'url': pricing_url,
                    'fetched': True,
                    'content_length': len(clean_content),
                    'diff': pricing_diff,
                }
            else:
                results['pricing'] = {
                    'url': pricing_url,
                    'fetched': False,
                    'error': 'Failed to fetch pricing page',
                }

        # Fetch blog page
        blog_url = company.get('blog_url')
        if blog_url:
            print(f" 📝 Fetching blog: {blog_url}")
            blog_content = self.fetch_url(blog_url)
            # Parse once and reuse for both the total and the recent list
            all_posts = self.extract_blog_posts(blog_content) if blog_content else []
            recent_posts = [post for post in all_posts if self.is_recent_post(post['date'])]
            results['blog'] = {
                'url': blog_url,
                'fetched': bool(blog_content),
                'total_posts_found': len(all_posts),
                'recent_posts': recent_posts,
            }

        return results
def generate_report(self, scan_results: List[Dict], threat_keywords: List[str] = None) -> str:
"""Generate markdown report."""
today = datetime.now().strftime('%Y-%m-%d')
# Configurable threat keywords (topics that signal competitive overlap)
if threat_keywords is None:
threat_keywords = ['funnel', 'conversion', 'landing page', 'ab test', 'optimize', 'cro']
report = f"""# 🔍 Competitive Intelligence Report - {today}
## Executive Summary
Monitored {len(scan_results)} competitors for pricing changes, recent blog activity, and market signals.
"""
threats = []
interesting = []
opportunities = []
search_queries = []
for result in scan_results:
company = result['company']
pricing = result.get('pricing', {})
if pricing.get('diff') and '🔍' in str(pricing['diff']):
interesting.append(
f"**{company}**: {pricing['diff']} → *Monitor for pricing strategy shifts*"
)
elif pricing.get('diff') and '🆕' in str(pricing['diff']):
interesting.append(
f"**{company}**: {pricing['diff']} → *Baseline established for future tracking*"
)
blog = result.get('blog', {})
recent_posts = blog.get('recent_posts', [])
if recent_posts:
post_titles = [
post['title'][:80] + '...' if len(post['title']) > 80 else post['title']
for post in recent_posts[:3]
]
content_lower = ' '.join(post_titles).lower()
# Lowercase each keyword too, so user-supplied terms like "CRO" still match
if any(keyword.lower() in content_lower for keyword in threat_keywords):
threats.append(
f"**{company}**: {len(recent_posts)} recent posts, potential feature overlap → *Review competitive positioning*"
)
else:
interesting.append(
f"**{company}**: {len(recent_posts)} recent posts → *{', '.join(post_titles[:2])}*"
)
else:
opportunities.append(
f"**{company}**: No recent blog content → *Content marketing gap you can exploit*"
)
sq = result.get('search_queries', {})
if sq.get('linkedin'):
search_queries.append(f"LinkedIn search: {sq['linkedin']}")
if sq.get('jobs'):
search_queries.append(f"Jobs search: {sq['jobs']}")
if threats:
report += "## 🔴 THREATS\n\n"
for threat in threats:
report += f"- {threat}\n"
report += "\n"
if interesting:
report += "## 🟡 INTERESTING\n\n"
for item in interesting:
report += f"- {item}\n"
report += "\n"
if opportunities:
report += "## 🟢 OPPORTUNITIES\n\n"
for opp in opportunities:
report += f"- {opp}\n"
report += "\n"
if search_queries:
report += "## 🔎 LinkedIn/Jobs Search Queries\n\n"
report += "Run these queries for social/hiring signals:\n\n"
for query in search_queries:
report += f"- `{query}`\n"
report += "\n"
report += "## 📊 Technical Summary\n\n"
for result in scan_results:
company = result['company']
pricing = result.get('pricing', {})
blog = result.get('blog', {})
report += f"**{company}:**\n"
report += f"- Pricing: {'✅' if pricing.get('fetched') else '❌'} {pricing.get('diff', 'No changes')}\n"
report += f"- Blog: {'✅' if blog.get('fetched') else '❌'} {len(blog.get('recent_posts', []))} recent posts\n\n"
return report
def save_results(self, scan_results: List[Dict]) -> str:
"""Save scan results to files."""
today = datetime.now().strftime('%Y-%m-%d')
latest_file = os.path.join(self.data_dir, 'latest-scan.json')
with open(latest_file, 'w') as f:
json.dump(scan_results, f, indent=2)
history_file = os.path.join(self.history_dir, f'{today}.json')
with open(history_file, 'w') as f:
json.dump(scan_results, f, indent=2)
return latest_file
    def run(self, company_filter: Optional[str] = None,
            threat_keywords: Optional[List[str]] = None) -> str:
        """Run competitive monitoring scan."""
        print("🚀 Starting competitive monitoring scan...")
        companies_to_scan = (
            [company_filter] if company_filter else list(self.competitors.keys())
        )
        if company_filter and company_filter not in self.competitors:
            print(f"❌ Unknown company: {company_filter}")
            print(f"Available companies: {', '.join(self.competitors.keys())}")
            return ""
        scan_results = []
        for company_key in companies_to_scan:
            try:
                result = self.scan_competitor(company_key)
                scan_results.append(result)
            except Exception as e:
                print(f"❌ Error scanning {company_key}: {e}")
        self.save_results(scan_results)
        # None falls back to the default keywords defined in generate_report
        report = self.generate_report(scan_results, threat_keywords)
        print(f"\n✅ Scan complete! Results for {len(scan_results)} companies.")
        return report


def load_competitors_config(config_path: str) -> dict:
    """Load competitors from a JSON config file.

    Expected format:
    {
        "competitor_key": {
            "name": "Competitor Name",
            "domain": "competitor.com",
            "pricing_url": "https://competitor.com/pricing",
            "blog_url": "https://competitor.com/blog",
            "linkedin_query": "Competitor Name site:linkedin.com",
            "jobs_query": "Competitor Name careers OR jobs"
        }
    }
    """
    with open(config_path, 'r') as f:
        return json.load(f)


def main():
    parser = argparse.ArgumentParser(description='Competitive Monitoring Scraper')
    parser.add_argument('--company', help='Scan specific company only (by key)')
    parser.add_argument('--output', '-o', help='Save report to file')
    parser.add_argument('--config', help='Path to competitors JSON config file')
    parser.add_argument('--data-dir', help='Directory for storing scan data')
    parser.add_argument('--threat-keywords', nargs='*',
                        help='Keywords that signal competitive overlap (space-separated)')
    args = parser.parse_args()
    # Load competitor config
    config_path = args.config or os.environ.get('COMPETITORS_CONFIG')
    competitors = None
    if config_path:
        try:
            competitors = load_competitors_config(config_path)
            print(f"📋 Loaded {len(competitors)} competitors from {config_path}")
        except Exception as e:
            print(f"❌ Error loading config: {e}")
            sys.exit(1)
    monitor = CompetitiveMonitor(
        data_dir=args.data_dir,
        competitors=competitors,
    )
    # Forward --threat-keywords to the report generator; an empty list
    # (flag given with no values) is treated the same as "not provided".
    report = monitor.run(args.company, threat_keywords=args.threat_keywords or None)
    if report:
        print("\n" + "=" * 60)
        print(report)
        print("=" * 60)
        if args.output:
            with open(args.output, 'w') as f:
                f.write(report)
            print(f"\n📁 Report saved to: {args.output}")


if __name__ == '__main__':
    main()
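
The changed-line count that the pricing diff reports can be reproduced in isolation. The sketch below is a minimal standalone illustration, assuming plain-text snapshots that still contain line breaks; `count_changed_lines` and the sample pricing text are hypothetical and not part of the monitor itself.

```python
from difflib import unified_diff


def count_changed_lines(previous: str, current: str) -> int:
    """Count added/removed lines between two text snapshots."""
    diff = unified_diff(
        previous.splitlines(), current.splitlines(),
        fromfile="previous", tofile="current", n=0, lineterm="",
    )
    # Skip the '---' / '+++' file headers; count only real additions/removals.
    return sum(
        1 for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


# Hypothetical snapshots: only the "Pro" price differs.
old_snapshot = "Starter $29\nPro $99\nEnterprise: contact us"
new_snapshot = "Starter $29\nPro $119\nEnterprise: contact us"
print(count_changed_lines(old_snapshot, new_snapshot))  # 2 (one removal, one addition)
```

With `n=0` the diff carries no context lines, so a single edited line counts twice: once as a removal and once as an addition. One caveat of this counting approach is that a content line which itself begins with `--` or `++` would be mistaken for a file header and skipped.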


@ -0,0 +1,278 @@
#!/usr/bin/env python3
"""
Cross-Signal Detector finds overlapping signals across multiple data sources.
When your SEO data and sales data both flag the same company, that's a cross-signal
worth acting on. This script scans agent outputs and data files for company names,
industry verticals, and keyword clusters, then finds overlaps.
Usage:
python3 cross-signal-detector.py
python3 cross-signal-detector.py --data-dir ./data/agent-outputs
python3 cross-signal-detector.py --hours 48
python3 cross-signal-detector.py --output cross-signals.json
Environment variables:
DATA_DIR: directory containing agent output files to scan
OUTPUT_FILE: where to write the signal detection results
"""
import argparse
import json
import os
import re
import glob
from datetime import datetime, timedelta, timezone
from collections import defaultdict
# Words to exclude from company name extraction (common English words that look like names)
STOP_WORDS = {
'The', 'This', 'That', 'What', 'How', 'Why', 'When', 'Where',
'For', 'From', 'With', 'About', 'Into', 'Over', 'After',
'Before', 'Between', 'Under', 'During', 'Through',
'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
'Saturday', 'Sunday', 'January', 'February', 'March',
'April', 'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December',
'None', 'True', 'False', 'Error', 'Warning',
}
# Configurable: add your own team names / internal terms to exclude
# Comma-separated; surrounding whitespace and empty entries are ignored
CUSTOM_STOP_WORDS = {w.strip() for w in os.environ.get('SIGNAL_STOP_WORDS', '').split(',') if w.strip()}
def get_recent_files(directory, hours=24):
"""Get files modified in the last N hours."""
cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
recent = []
if not os.path.isdir(directory):
return recent
for f in glob.glob(os.path.join(directory, "*")):
if os.path.isfile(f):
mtime = datetime.fromtimestamp(os.path.getmtime(f), tz=timezone.utc)
if mtime > cutoff:
recent.append(f)
return recent
def extract_companies(text):
"""Extract company names (capitalized words, common patterns)."""
companies = set()
all_stop = STOP_WORDS | CUSTOM_STOP_WORDS
for match in re.findall(
r'\b([A-Z][a-zA-Z]+(?:\.[a-zA-Z]+)?(?:\s+(?:AI|Inc|Corp|Labs|Tech|io))?)\b',
text
):
if len(match) > 2 and match not in all_stop:
companies.add(match)
return companies
def extract_keywords(text):
"""Extract keyword themes from marketing/business text."""
keywords = set()
patterns = [
r'(?:ai|artificial intelligence)\s+(?:marketing|agent|tool|saas|automation)',
r'(?:seo|content|digital)\s+(?:marketing|strategy|optimization|growth)',
r'(?:b2b|saas|enterprise)\s+(?:marketing|growth|sales)',
r'(?:social media|linkedin|twitter|youtube)\s+(?:marketing|growth|strategy)',
r'(?:email|outbound|cold)\s+(?:marketing|outreach|campaign)',
r'(?:paid|ppc|google)\s+(?:ads|advertising|media)',
]
text_lower = text.lower()
for p in patterns:
match = re.search(p, text_lower)
if match:
keywords.add(match.group())
return keywords
def extract_verticals(text):
"""Extract industry verticals."""
verticals = set()
vertical_keywords = {
'fintech': ['fintech', 'financial', 'banking', 'payments'],
'healthtech': ['healthtech', 'health tech', 'healthcare', 'medical'],
'edtech': ['edtech', 'education', 'learning platform'],
'ai_saas': ['ai saas', 'ai tool', 'ai agent', 'ai platform', 'artificial intelligence'],
'ecommerce': ['ecommerce', 'e-commerce', 'shopify', 'dtc', 'd2c'],
'cybersecurity': ['cybersecurity', 'security', 'infosec'],
'martech': ['martech', 'marketing tech', 'marketing tool'],
'hr_tech': ['hr tech', 'hiring', 'recruiting', 'talent'],
}
text_lower = text.lower()
for vertical, kws in vertical_keywords.items():
if any(kw in text_lower for kw in kws):
verticals.add(vertical)
return verticals
def read_file_safe(filepath):
"""Read file content safely."""
try:
with open(filepath) as f:
return f.read()
except Exception:
return ""
def categorize_file(filepath, agent_patterns=None):
"""Categorize a file by agent/source based on filename patterns.
Override with agent_patterns dict: {"pattern": "agent_name"}
"""
basename = os.path.basename(filepath).lower()
# Default patterns — customize these for your setup
default_patterns = {
'seo': 'seo',
'oracle': 'seo',
'content': 'content',
'flash': 'content',
'trend': 'content',
'deal': 'deal',
'cold': 'cold_outbound',
'outbound': 'cold_outbound',
'recruit': 'recruiting',
'hiring': 'recruiting',
}
patterns = agent_patterns or default_patterns
for pattern, agent in patterns.items():
if pattern in basename:
return agent
return 'other'
def detect_signals(data_dir, additional_data_dirs=None, hours=48, agent_patterns=None):
"""Main detection logic.
Args:
data_dir: Primary directory to scan for agent output files
additional_data_dirs: Dict of {"agent_name": "glob_pattern"} for extra data
hours: How far back to look for files
agent_patterns: Dict of {"filename_pattern": "agent_name"} for categorization
"""
recent_files = get_recent_files(data_dir, hours=hours)
if not recent_files:
# Fallback to 7 days
recent_files = get_recent_files(data_dir, hours=168)
# Categorize by agent/source
agent_data = defaultdict(lambda: {
"files": [], "companies": set(), "keywords": set(), "verticals": set(), "text": ""
})
for f in recent_files:
agent = categorize_file(f, agent_patterns)
text = read_file_safe(f)
agent_data[agent]["files"].append(f)
agent_data[agent]["companies"].update(extract_companies(text))
agent_data[agent]["keywords"].update(extract_keywords(text))
agent_data[agent]["verticals"].update(extract_verticals(text))
agent_data[agent]["text"] += text + "\n"
# Scan additional data directories
if additional_data_dirs:
for agent, pattern in additional_data_dirs.items():
files = sorted(glob.glob(pattern))[-1:] # latest only
for f in files:
text = read_file_safe(f)
agent_data[agent]["companies"].update(extract_companies(text))
agent_data[agent]["keywords"].update(extract_keywords(text))
agent_data[agent]["verticals"].update(extract_verticals(text))
# Find overlaps
signals = []
agents_list = list(agent_data.keys())
# 1. Company overlap
for i, a1 in enumerate(agents_list):
for a2 in agents_list[i + 1:]:
common_companies = agent_data[a1]["companies"] & agent_data[a2]["companies"]
if common_companies:
confidence = min(95, 60 + len(common_companies) * 10)
signals.append({
"confidence": confidence,
"type": "company_overlap",
"agents": [a1, a2],
"signal": f"Company overlap: {', '.join(list(common_companies)[:5])} appearing in both {a1} and {a2}",
"recommended_play": f"Cross-reference {a1} and {a2} data for these companies — coordinate outreach/content",
"entities": list(common_companies)[:10],
})
# 2. Vertical overlap
for i, a1 in enumerate(agents_list):
for a2 in agents_list[i + 1:]:
common_verticals = agent_data[a1]["verticals"] & agent_data[a2]["verticals"]
if common_verticals:
confidence = min(90, 50 + len(common_verticals) * 15)
signals.append({
"confidence": confidence,
"type": "vertical_alignment",
"agents": [a1, a2],
"signal": f"Vertical alignment: {', '.join(common_verticals)} trending across {a1} + {a2}",
"recommended_play": f"Coordinated push into {', '.join(common_verticals)}: content + outbound + SEO",
"entities": list(common_verticals),
})
# 3. Keyword cluster overlap
for i, a1 in enumerate(agents_list):
for a2 in agents_list[i + 1:]:
common_kw = agent_data[a1]["keywords"] & agent_data[a2]["keywords"]
if common_kw:
confidence = min(88, 55 + len(common_kw) * 12)
signals.append({
"confidence": confidence,
"type": "keyword_cluster",
"agents": [a1, a2],
"signal": f"Keyword cluster overlap: {', '.join(list(common_kw)[:3])}",
"recommended_play": "Target these keywords in content and outbound simultaneously",
"entities": list(common_kw),
})
# Sort by confidence, highest first
signals.sort(key=lambda x: x["confidence"], reverse=True)
output = {
"date": datetime.now().strftime("%Y-%m-%d"),
"generated_at": datetime.now(timezone.utc).isoformat(),
"agents_analyzed": list(agent_data.keys()),
"files_scanned": sum(len(d["files"]) for d in agent_data.values()),
"signals": signals[:20], # top 20
}
return output
def main():
parser = argparse.ArgumentParser(
description='Cross-Signal Detector — find overlapping signals across data sources'
)
parser.add_argument('--data-dir', default=os.environ.get('DATA_DIR', './data/agent-outputs'),
help='Directory containing agent output files')
parser.add_argument('--output', default=os.environ.get('OUTPUT_FILE', './data/cross-signals-latest.json'),
help='Output file path')
parser.add_argument('--hours', type=int, default=48,
help='How far back to look for files (default: 48)')
args = parser.parse_args()
output = detect_signals(data_dir=args.data_dir, hours=args.hours)
os.makedirs(os.path.dirname(args.output) or '.', exist_ok=True)
with open(args.output, "w") as f:
json.dump(output, f, indent=2)
signals = output.get("signals", [])
print(f"Cross-signal detection complete: {len(signals)} signals found")
print(f"Output: {args.output}")
if signals:
print(f"Top signal (confidence {signals[0]['confidence']}): {signals[0]['signal'][:100]}")
if __name__ == "__main__":
main()
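
The pairwise company-overlap step and its confidence formula, `min(95, 60 + 10 * n)` for `n` shared companies, can be sketched on its own. This is an illustrative reduction, not the detector's code: `company_overlaps` and the sample agent data are hypothetical, and `itertools.combinations` stands in for the nested index loops.

```python
from itertools import combinations


def company_overlaps(agent_companies: dict) -> list:
    """Return overlap signals for every pair of agents, best first."""
    signals = []
    for a1, a2 in combinations(agent_companies, 2):
        common = agent_companies[a1] & agent_companies[a2]
        if common:
            signals.append({
                "agents": [a1, a2],
                # sorted() keeps the output stable; set order is arbitrary
                "entities": sorted(common),
                # Base 60, +10 per shared company, capped at 95
                "confidence": min(95, 60 + len(common) * 10),
            })
    return sorted(signals, key=lambda s: s["confidence"], reverse=True)


# Hypothetical per-agent company sets
sample = {
    "seo": {"Acme", "Globex", "Initech"},
    "deal": {"Acme", "Globex"},
    "content": {"Initech"},
}
for signal in company_overlaps(sample):
    print(signal["agents"], signal["entities"], signal["confidence"])
```

In this sample, `seo` and `deal` share two companies (confidence 80), `seo` and `content` share one (confidence 70), and `deal` and `content` share none, so no signal is emitted for that pair. Because of the cap, four or more shared companies all score 95.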


@ -0,0 +1,376 @@
#!/usr/bin/env python3
"""
instantly-audit.py
Pulls campaign data, account inventory, and warmup scores from the Instantly v2 API.
Usage:
python3 instantly-audit.py --api-key YOUR_KEY
python3 instantly-audit.py # uses INSTANTLY_API_KEY env var
python3 instantly-audit.py --api-key YOUR_KEY --output report.md
python3 instantly-audit.py --api-key YOUR_KEY --json # raw JSON output
Instantly v2 API docs: https://developer.instantly.ai/
"""
import argparse
import json
import os
import sys
import time
from datetime import datetime
try:
import requests
except ImportError:
print("ERROR: 'requests' not installed. Run: pip install requests")
sys.exit(1)
BASE_URL = "https://api.instantly.ai/api/v2"
def get_headers(api_key: str) -> dict:
return {
"Authorization": f"Bearer {api_key}",
"Content-Type": "application/json",
}
def paginate(url: str, headers: dict, params: dict = None, limit: int = 100) -> list:
    """Handle Instantly v2 cursor-based pagination."""
    results = []
    # Copy so the caller's dict is not mutated
    params = dict(params or {})
    params["limit"] = limit
    starting_after = None
    while True:
        if starting_after:
            params["starting_after"] = starting_after
        try:
            resp = requests.get(url, headers=headers, params=params, timeout=30)
        except requests.exceptions.RequestException as e:
            print(f" ⚠️ Request failed: {e}")
            break
        if resp.status_code == 429:
            # Retry-After may be an HTTP date rather than seconds; fall back to 5s
            try:
                retry_after = int(resp.headers.get("Retry-After", 5))
            except ValueError:
                retry_after = 5
            print(f" ⏳ Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after)
            continue
        if resp.status_code == 401:
            print(" 🔴 Authentication failed. Check your API key.")
            sys.exit(1)
        if not resp.ok:
            print(f" ⚠️ API error {resp.status_code}: {resp.text[:200]}")
            break
        data = resp.json()
        # The body may be a bare list or an object wrapping "items";
        # calling .get() on a list would raise AttributeError.
        if isinstance(data, list):
            items, next_cursor = data, None
        else:
            items = data.get("items", [])
            next_cursor = data.get("next_starting_after") or data.get("next_cursor")
        results.extend(items)
        if not next_cursor or len(items) < limit:
            break
        starting_after = next_cursor
    return results
def fetch_campaigns(headers: dict) -> list:
"""Fetch all campaigns with analytics."""
print("📋 Fetching campaigns...")
campaigns = paginate(f"{BASE_URL}/campaigns", headers)
print(f" Found {len(campaigns)} campaigns")
return campaigns
def fetch_campaign_analytics(headers: dict, campaign_ids: list) -> dict:
"""Fetch analytics summary for campaigns."""
if not campaign_ids:
return {}
print("📊 Fetching campaign analytics...")
analytics = {}
for i in range(0, len(campaign_ids), 10):
batch = campaign_ids[i:i+10]
try:
resp = requests.get(
f"{BASE_URL}/campaigns/analytics/overview",
headers=headers,
params={"campaign_id": batch},
timeout=30,
)
if resp.ok:
data = resp.json()
if isinstance(data, dict):
analytics.update(data)
elif isinstance(data, list):
for item in data:
cid = item.get("campaign_id") or item.get("id")
if cid:
analytics[cid] = item
except requests.exceptions.RequestException as e:
print(f" ⚠️ Analytics fetch failed for batch: {e}")
time.sleep(0.3)
return analytics
def fetch_accounts(headers: dict) -> list:
"""Fetch all sending accounts with warmup status."""
print("📧 Fetching sending accounts...")
accounts = paginate(f"{BASE_URL}/accounts", headers)
print(f" Found {len(accounts)} accounts")
return accounts
def fetch_warmup_scores(headers: dict, account_emails: list) -> dict:
"""Fetch warmup analytics for accounts."""
if not account_emails:
return {}
print("🔥 Fetching warmup scores...")
warmup_data = {}
for email in account_emails:
try:
resp = requests.get(
f"{BASE_URL}/accounts/{email}/warmup/analytics",
headers=headers,
timeout=30,
)
if resp.ok:
warmup_data[email] = resp.json()
elif resp.status_code == 404:
warmup_data[email] = {"score": None, "status": "no_warmup_data"}
time.sleep(0.1)
except requests.exceptions.RequestException:
warmup_data[email] = {"score": None, "status": "fetch_error"}
return warmup_data
def assess_warmup_readiness(account: dict, warmup: dict) -> tuple:
"""Return (ready: bool, issues: list) for an account."""
issues = []
score = (warmup.get("warmup_score") or warmup.get("score")
or account.get("stat_warmup_score") or account.get("warmup_score"))
if score is None:
issues.append("No warmup data available")
elif score < 80:
issues.append(f"Warmup score {score} < 80 (minimum required)")
warmup_start = account.get("warmup_start_date") or account.get("created_at")
if warmup_start:
try:
start_dt = datetime.fromisoformat(warmup_start.replace("Z", "+00:00"))
days_warmed = (datetime.now(start_dt.tzinfo) - start_dt).days
if days_warmed < 14:
issues.append(f"Only {days_warmed} days warmed (need 14+)")
except (ValueError, AttributeError):
pass
status = str(account.get("status", "")).lower()
if status in ("paused", "error", "suspended", "disabled"):
issues.append(f"Account status: {status}")
ready = len(issues) == 0
return ready, issues
def format_pct(value, total, decimals=1) -> str:
if not total:
return "N/A"
return f"{(value / total * 100):.{decimals}f}%"
def generate_report(campaigns: list, analytics: dict, accounts: list, warmup_scores: dict) -> str:
lines = []
now = datetime.now().strftime("%Y-%m-%d %H:%M")
lines.append(f"# Instantly Audit Report")
lines.append(f"Generated: {now}\n")
# ── Account Inventory ──
lines.append("## Sending Account Inventory\n")
ready_accounts = []
not_ready_accounts = []
for acct in accounts:
email = acct.get("email", "unknown")
warmup = warmup_scores.get(email, {})
ready, issues = assess_warmup_readiness(acct, warmup)
score = (warmup.get("warmup_score") or warmup.get("score")
or acct.get("stat_warmup_score") or acct.get("warmup_score") or "N/A")
daily_limit = acct.get("sending_limit") or acct.get("daily_limit", 30)
row = {
"email": email,
"status": acct.get("status", "unknown"),
"warmup_score": score,
"daily_limit": daily_limit,
"ready": ready,
"issues": issues,
}
if ready:
ready_accounts.append(row)
else:
not_ready_accounts.append(row)
total_accounts = len(accounts)
total_ready = len(ready_accounts)
lines.append(f"**Total accounts:** {total_accounts}")
lines.append(f"**Ready to send:** {total_ready}")
lines.append(f"**Not ready:** {len(not_ready_accounts)} ⚠️\n")
if ready_accounts:
conservative_daily = total_ready * 30
aggressive_daily = total_ready * 50
conservative_monthly = conservative_daily * 22
aggressive_monthly = aggressive_daily * 22
lines.append("### Capacity Math (ready accounts only)")
lines.append(f"- Conservative (30/day/account): **{conservative_daily:,}/day → {conservative_monthly:,}/month**")
lines.append(f"- Aggressive (50/day/account): **{aggressive_daily:,}/day → {aggressive_monthly:,}/month**\n")
lines.append("### ✅ Ready Accounts")
if ready_accounts:
lines.append("| Account | Status | Warmup Score | Daily Limit |")
lines.append("|---------|--------|-------------|------------|")
for a in ready_accounts:
lines.append(f"| {a['email']} | {a['status']} | {a['warmup_score']} | {a['daily_limit']} |")
else:
lines.append("_None — no accounts meet warmup requirements_")
lines.append("\n### ⚠️ Not Ready Accounts")
if not_ready_accounts:
lines.append("| Account | Status | Warmup Score | Issues |")
lines.append("|---------|--------|-------------|--------|")
for a in not_ready_accounts:
issues_str = "; ".join(a["issues"]) if a["issues"] else "unknown"
lines.append(f"| {a['email']} | {a['status']} | {a['warmup_score']} | {issues_str} |")
else:
lines.append("_None — all accounts are ready_")
# ── Campaign Performance ──
lines.append("\n---\n## Campaign Performance\n")
lines.append(f"**Total campaigns:** {len(campaigns)}\n")
if not campaigns:
lines.append("_No campaigns found_")
else:
lines.append("| Campaign | Status | Sent | Open Rate | Reply Rate | Positive Reply Rate |")
lines.append("|----------|--------|------|-----------|-----------|-------------------|")
for c in campaigns:
cid = c.get("id", "")
name = c.get("name", "Unnamed")[:50]
status = c.get("status", "unknown")
a = analytics.get(cid, {})
sent = a.get("emails_sent", 0) or c.get("emails_sent", 0)
opened = a.get("emails_opened", 0)
replied = a.get("emails_replied", 0)
positive = a.get("positive_replies", 0)
open_rate = format_pct(opened, sent)
reply_rate = format_pct(replied, sent)
pos_rate = format_pct(positive, sent)
lines.append(f"| {name} | {status} | {sent:,} | {open_rate} | {reply_rate} | {pos_rate} |")
# ── Flags & Recommendations ──
lines.append("\n---\n## Flags & Recommendations\n")
flags = []
if total_ready == 0:
flags.append("🔴 **BLOCKER:** No accounts are ready to send. All fail warmup requirements. Do not launch campaigns.")
elif total_ready < 3:
flags.append(f"⚠️ Only {total_ready} account(s) ready. Low volume capacity. Consider warming more accounts.")
low_open = []
low_reply = []
for c in campaigns:
cid = c.get("id", "")
a = analytics.get(cid, {})
sent = a.get("emails_sent", 0)
if sent < 50:
continue
opened = a.get("emails_opened", 0)
replied = a.get("emails_replied", 0)
open_pct = (opened / sent * 100) if sent else 0
reply_pct = (replied / sent * 100) if sent else 0
if open_pct < 40:
low_open.append(c.get("name", cid))
if reply_pct < 3:
low_reply.append(c.get("name", cid))
if low_open:
flags.append(f"⚠️ Low open rate (<40%) campaigns (subject line issue): {', '.join(low_open[:5])}")
if low_reply:
flags.append(f"⚠️ Low reply rate (<3%) campaigns (copy/offer issue): {', '.join(low_reply[:5])}")
if not flags:
flags.append("✅ No critical flags detected.")
for f in flags:
lines.append(f"- {f}")
lines.append(f"\n---\n_Audit complete. {total_accounts} accounts, {len(campaigns)} campaigns analyzed._")
return "\n".join(lines)
def main():
parser = argparse.ArgumentParser(description="Instantly v2 API Audit Tool")
parser.add_argument("--api-key", help="Instantly API key (or set INSTANTLY_API_KEY env var)")
parser.add_argument("--output", help="Write report to this file (default: print to stdout)")
parser.add_argument("--json", action="store_true", help="Output raw JSON instead of markdown report")
args = parser.parse_args()
api_key = args.api_key or os.environ.get("INSTANTLY_API_KEY")
if not api_key:
api_key = input("Instantly API key: ").strip()
if not api_key:
print("ERROR: API key required. Set INSTANTLY_API_KEY env var or pass --api-key.")
sys.exit(1)
headers = get_headers(api_key)
print(f"\n🔍 Starting Instantly audit...\n")
campaigns = fetch_campaigns(headers)
campaign_ids = [c.get("id") for c in campaigns if c.get("id")]
analytics = fetch_campaign_analytics(headers, campaign_ids)
accounts = fetch_accounts(headers)
account_emails = [a.get("email") for a in accounts if a.get("email")]
warmup_scores = fetch_warmup_scores(headers, account_emails)
if args.json:
output = json.dumps({
"campaigns": campaigns,
"analytics": analytics,
"accounts": accounts,
"warmup_scores": warmup_scores,
}, indent=2, default=str)
else:
output = generate_report(campaigns, analytics, accounts, warmup_scores)
if args.output:
with open(args.output, "w") as f:
f.write(output)
print(f"\n✅ Report written to: {args.output}")
else:
print("\n" + output)
if __name__ == "__main__":
main()
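
The "Capacity Math" section of the audit report is plain arithmetic: ready accounts times a per-account daily send rate, times 22 sending days per month. A minimal sketch of that calculation follows; `sending_capacity` and the account count of 12 are hypothetical, while the 30/day, 50/day and 22-day figures are the ones the report uses.

```python
def sending_capacity(ready_accounts: int, per_account_daily: int,
                     sending_days: int = 22) -> tuple:
    """Return (emails per day, emails per month) for the ready accounts."""
    daily = ready_accounts * per_account_daily
    return daily, daily * sending_days


ready = 12  # hypothetical number of accounts that passed the warmup checks
for label, rate in (("Conservative", 30), ("Aggressive", 50)):
    daily, monthly = sending_capacity(ready, rate)
    print(f"{label} ({rate}/day/account): {daily:,}/day -> {monthly:,}/month")
```

For 12 ready accounts this gives 360/day and 7,920/month at the conservative rate, and 600/day and 13,200/month at the aggressive rate. Accounts that fail the warmup checks are excluded from the count, so the figures describe what can be sent now rather than total inventory.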


@ -0,0 +1,607 @@
#!/usr/bin/env python3
"""
Lead Pipeline: Apollo → LeadMagic → Dedupe → Instantly
End-to-end lead sourcing, verification, deduplication, and upload pipeline.
Usage:
python3 lead-pipeline.py \\
--titles "VP Marketing,CMO" --industries "SaaS" \\
--company-size "11,50" --locations "United States" \\
--campaign-id YOUR_CAMPAIGN_UUID --volume 500
# Dry run (no upload)
python3 lead-pipeline.py \\
--titles "CTO,VP Engineering" --company-size "51,200" \\
--campaign-id YOUR_CAMPAIGN_UUID --volume 100 --dry-run
API keys are read from environment variables:
APOLLO_API_KEY, LEADMAGIC_API_KEY, INSTANTLY_API_KEY
Or pass them via --apollo-key, --leadmagic-key, --instantly-key flags.
"""
import argparse
import json
import os
import sys
import time
from datetime import datetime
from pathlib import Path
try:
import requests
except ImportError:
print("ERROR: 'requests' package required. Run: pip3 install requests", file=sys.stderr)
sys.exit(1)
# ---------------------------------------------------------------------------
# Retry / backoff helper
# ---------------------------------------------------------------------------
def request_with_retry(method, url, max_retries=5, **kwargs):
"""HTTP request with exponential backoff on 429 / 5xx."""
backoff = 1
for attempt in range(max_retries + 1):
try:
resp = requests.request(method, url, timeout=30, **kwargs)
if resp.status_code == 429:
wait = int(resp.headers.get("Retry-After", backoff))
print(f" ⏳ Rate limited (429). Waiting {wait}s …")
time.sleep(wait)
backoff = min(backoff * 2, 60)
continue
if resp.status_code >= 500:
print(f" ⚠️ Server error {resp.status_code}. Retry in {backoff}s …")
time.sleep(backoff)
backoff = min(backoff * 2, 60)
continue
return resp
except requests.exceptions.RequestException as e:
if attempt == max_retries:
raise
print(f" ⚠️ Request error: {e}. Retry in {backoff}s …")
time.sleep(backoff)
backoff = min(backoff * 2, 60)
return resp # type: ignore
# ---------------------------------------------------------------------------
# Step 1: Apollo People Search
# ---------------------------------------------------------------------------
def source_from_apollo(api_key, titles, industries, company_size, locations, keywords, volume):
"""Pull leads from Apollo People Search API."""
print(f"\n{'='*50}")
print(f"STEP 1: Sourcing from Apollo (target: {volume})")
print(f"{'='*50}")
url = "https://api.apollo.io/api/v1/mixed_people/search"
leads = []
page = 1
# Parse company size into Apollo format
size_ranges = []
if company_size:
parts = [s.strip() for s in company_size.split(",")]
if len(parts) == 2:
size_ranges = [f"{parts[0]},{parts[1]}"]
else:
size_ranges = parts
while len(leads) < volume:
body = {
"api_key": api_key,
"per_page": 100,
"page": page,
}
if titles:
body["person_titles"] = [t.strip() for t in titles.split(",")]
if industries:
body["q_organization_keyword_tags"] = [i.strip() for i in industries.split(",")]
if size_ranges:
body["organization_num_employees_ranges"] = size_ranges
if locations:
body["person_locations"] = [l.strip() for l in locations.split(",")]
if keywords:
body["q_keywords"] = keywords
print(f" 📡 Apollo page {page}", end=" ", flush=True)
resp = request_with_retry("POST", url, json=body)
if resp.status_code != 200:
print(f"ERROR {resp.status_code}: {resp.text[:200]}")
break
data = resp.json()
people = data.get("people", [])
if not people:
print("no more results.")
break
page_leads = 0
for person in people:
email = person.get("email")
if not email:
continue
leads.append({
"email": email.lower().strip(),
"first_name": person.get("first_name", ""),
"last_name": person.get("last_name", ""),
"title": person.get("title", ""),
"company_name": (person.get("organization") or {}).get("name", ""),
"domain": (person.get("organization") or {}).get("primary_domain", ""),
})
page_leads += 1
if len(leads) >= volume:
break
print(f"{page_leads} with email ({len(leads)} total)")
total_pages = data.get("pagination", {}).get("total_pages", page)
if page >= total_pages:
print(" Reached last Apollo page.")
break
page += 1
time.sleep(0.5)
# Dedupe by email within sourced set
seen = set()
unique_leads = []
for lead in leads:
if lead["email"] not in seen:
seen.add(lead["email"])
unique_leads.append(lead)
print(f"\n ✅ Sourced {len(unique_leads)} unique leads with emails")
return unique_leads
# ---------------------------------------------------------------------------
# Step 2: LeadMagic Email Verification
# ---------------------------------------------------------------------------
def verify_with_leadmagic(api_key, leads):
"""Verify emails via LeadMagic. Returns only valid leads."""
print(f"\n{'='*50}")
print(f"STEP 2: Verifying {len(leads)} emails via LeadMagic")
print(f"{'='*50}")
url = "https://api.leadmagic.io/v1/people/email-validation"
headers = {
"X-API-Key": api_key,
"Content-Type": "application/json",
}
valid_leads = []
invalid_count = 0
unknown_count = 0
error_count = 0
rejection_reasons = {}
for i, lead in enumerate(leads):
if (i + 1) % 50 == 0 or i == 0:
print(f" 🔍 Verifying {i+1}/{len(leads)}")
try:
resp = request_with_retry("POST", url, headers=headers, json={"email": lead["email"]})
if resp.status_code != 200:
error_count += 1
continue
data = resp.json()
status = data.get("email_status", "unknown")
if status == "valid":
lead["is_free_email"] = data.get("is_free_email", False)
lead["is_role_based"] = data.get("is_role_based", False)
valid_leads.append(lead)
elif status == "invalid":
invalid_count += 1
rejection_reasons["invalid"] = rejection_reasons.get("invalid", 0) + 1
else:
unknown_count += 1
rejection_reasons["unknown"] = rejection_reasons.get("unknown", 0) + 1
except Exception as e:
error_count += 1
print(f" ⚠️ Error verifying {lead['email']}: {e}")
if (i + 1) % 20 == 0:
time.sleep(0.5)
print(f"\n ✅ Verified: {len(valid_leads)} valid")
print(f" ❌ Invalid: {invalid_count}")
print(f" ❓ Unknown: {unknown_count}")
print(f" ⚠️ Errors: {error_count}")
if rejection_reasons:
print(f" 📊 Rejection breakdown: {rejection_reasons}")
return valid_leads, {
"total": len(leads),
"valid": len(valid_leads),
"invalid": invalid_count,
"unknown": unknown_count,
"errors": error_count,
"rejection_reasons": rejection_reasons,
}
# ---------------------------------------------------------------------------
# Step 3: Deduplicate against Instantly + exclusion list
# ---------------------------------------------------------------------------
def get_instantly_existing_emails(api_key):
    """Pull ALL existing leads from Instantly workspace for dedup."""
    print(f"\n 📥 Fetching existing Instantly leads for dedup …")
    url = "https://api.instantly.ai/api/v2/leads/list"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    existing_emails = set()
    cursor = None
    page = 0

    while True:
        body = {"limit": 100}
        if cursor:
            body["starting_after"] = cursor
        resp = request_with_retry("POST", url, headers=headers, json=body)
        if resp.status_code != 200:
            # Note: stopping here returns a partial set, so dedup may miss leads
            print(f" ⚠️ Instantly list error {resp.status_code}: {resp.text[:200]}")
            break
        data = resp.json()
        items = data.get("items", [])
        if not items:
            break
        for item in items:
            # Guard against a null email field before normalizing
            email = (item.get("email") or "").lower().strip()
            if email:
                existing_emails.add(email)
        cursor = data.get("next_starting_after")
        if not cursor:
            break
        page += 1
        if page % 10 == 0:
            print(f"{len(existing_emails)} existing leads so far")
        time.sleep(1)

    print(f" 📊 Found {len(existing_emails)} existing leads in Instantly")
    return existing_emails
def load_exclusion_list(filepath):
    """Load burned emails from a CSV file (one email per line or first column)."""
    excluded = set()
    if not filepath or not os.path.exists(filepath):
        return excluded
    with open(filepath, "r") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            email = line.split(",")[0].strip().strip('"').lower()
            if "@" in email:
                excluded.add(email)
    print(f" 📋 Loaded {len(excluded)} emails from exclusion list")
    return excluded
def deduplicate(leads, api_key, exclude_file=None):
    """Remove leads already in Instantly or on exclusion list."""
    print(f"\n{'='*50}")
    print(f"STEP 3: Deduplicating {len(leads)} leads")
    print(f"{'='*50}")

    existing = get_instantly_existing_emails(api_key)
    excluded = load_exclusion_list(exclude_file)

    deduped = []
    instantly_dupes = 0
    burned_dupes = 0
    for lead in leads:
        # Both lookup sets hold lowercased, stripped emails, so normalize
        # the lead's email the same way before comparing.
        email = (lead.get("email") or "").lower().strip()
        if email in existing:
            instantly_dupes += 1
        elif email in excluded:
            burned_dupes += 1
        else:
            deduped.append(lead)

    print(f"\n ✅ Net new leads: {len(deduped)}")
    print(f" 🔄 Already in Instantly: {instantly_dupes}")
    print(f" 🚫 On exclusion list: {burned_dupes}")

    return deduped, {
        "instantly_dupes": instantly_dupes,
        "burned_dupes": burned_dupes,
        "net_new": len(deduped),
    }
# ---------------------------------------------------------------------------
# Step 4: Upload to Instantly
# ---------------------------------------------------------------------------
def generate_personalization(lead):
    """Generate a simple 1-line personalization based on available data."""
    company = lead.get("company_name", "")
    title = lead.get("title", "")
    if company and title:
        return f"Noticed you're {title} at {company} — curious how you're thinking about growth this quarter."
    elif company:
        return f"Been following {company}'s trajectory — impressive momentum."
    elif title:
        return f"As a {title}, you're probably juggling growth and efficiency right now."
    return "Your background caught my eye — wanted to reach out."
def upload_to_instantly(api_key, leads, campaign_id, dry_run=False):
    """Upload leads to an Instantly campaign.

    Leads are sent one request at a time; batches only group the progress
    output and the 1-second pause between groups.
    """
    print(f"\n{'='*50}")
    print(f"STEP 4: Uploading {len(leads)} leads to Instantly")
    print(f"{'='*50}")

    if dry_run:
        print(" 🏃 DRY RUN — skipping actual upload")
        return {"uploaded": 0, "failed": 0, "dry_run": True}

    url = "https://api.instantly.ai/api/v2/leads"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    uploaded = 0
    failed = 0
    batch_size = 25

    for i in range(0, len(leads), batch_size):
        batch = leads[i:i + batch_size]
        batch_num = (i // batch_size) + 1
        total_batches = (len(leads) + batch_size - 1) // batch_size
        print(f" 📤 Batch {batch_num}/{total_batches} ({len(batch)} leads) …", end=" ", flush=True)
        batch_success = 0
        batch_fail = 0

        for lead in batch:
            body = {
                "email": lead["email"],
                "first_name": lead.get("first_name", ""),
                "last_name": lead.get("last_name", ""),
                "company_name": lead.get("company_name", ""),
                "campaign": campaign_id,
                "custom_variables": {
                    "title": lead.get("title", ""),
                    "company_name": lead.get("company_name", ""),
                    "personalization": generate_personalization(lead),
                },
            }
            try:
                resp = request_with_retry("POST", url, headers=headers, json=body)
                if resp.status_code in (200, 201):
                    batch_success += 1
                else:
                    batch_fail += 1
                    # Only print the first few failures per batch to limit noise
                    if batch_fail <= 3:
                        print(f"\n ⚠️ Failed {lead['email']}: {resp.status_code} {resp.text[:100]}")
            except Exception as e:
                batch_fail += 1
                print(f"\n ⚠️ Error uploading {lead['email']}: {e}")

        uploaded += batch_success
        failed += batch_fail
        print(f"{batch_success} ok, {batch_fail} failed")
        if i + batch_size < len(leads):
            time.sleep(1)

    print(f"\n ✅ Uploaded: {uploaded}")
    if failed:
        print(f" ❌ Failed: {failed}")
    return {"uploaded": uploaded, "failed": failed, "dry_run": False}
# ---------------------------------------------------------------------------
# Reporting
# ---------------------------------------------------------------------------
def save_report(output_dir, sourced, verified_stats, dedup_stats, upload_stats, leads_uploaded, args):
    """Save run log as JSON."""
    timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M")
    report_path = os.path.join(output_dir, f"{timestamp}.json")
    report = {
        "timestamp": datetime.now().isoformat(),
        "parameters": {
            "titles": args.titles,
            "industries": args.industries,
            "company_size": args.company_size,
            "locations": args.locations,
            "keywords": args.keywords,
            "campaign_id": args.campaign_id,
            "volume": args.volume,
            "exclude_file": args.exclude_file,
            "dry_run": args.dry_run,
        },
        "results": {
            "sourced_from_apollo": sourced,
            "verification": verified_stats,
            "deduplication": dedup_stats,
            "upload": upload_stats,
        },
        "leads_uploaded": [
            {k: v for k, v in lead.items() if k not in ("is_free_email", "is_role_based")}
            for lead in leads_uploaded
        ],
    }
    os.makedirs(output_dir, exist_ok=True)
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2, default=str)
    print(f"\n 💾 Run log saved: {report_path}")
    return report_path
def print_summary(sourced_count, verified_stats, dedup_stats, upload_stats):
    """Print final summary."""
    print(f"\n{'='*50}")
    print(f" LEAD PIPELINE SUMMARY")
    print(f"{'='*50}")
    print(f" Sourced from Apollo: {sourced_count:>6}")
    print(f" Verified (LeadMagic): {verified_stats['valid']:>6} ({verified_stats['valid']/max(sourced_count,1)*100:.1f}%)")
    print(f" Already in Instantly: {dedup_stats['instantly_dupes']:>6}")
    print(f" Excluded (burned list): {dedup_stats['burned_dupes']:>6}")
    print(f" Net new uploaded: {upload_stats['uploaded']:>6}")
    if upload_stats.get('failed'):
        print(f" Failed uploads: {upload_stats['failed']:>6}")
    if upload_stats.get('dry_run'):
        print(f" ⚠️ DRY RUN — nothing was uploaded")
    print(f"{'='*50}\n")
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
    parser = argparse.ArgumentParser(
        description="Lead Pipeline: Apollo → LeadMagic → Dedupe → Instantly",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Full pipeline run
  python3 lead-pipeline.py \\
    --titles "VP Marketing,CMO" --industries "SaaS" \\
    --company-size "11,50" --locations "United States" \\
    --campaign-id abc-123 --volume 200

  # Dry run (no upload)
  python3 lead-pipeline.py \\
    --titles "CTO,VP Engineering" --company-size "51,200" \\
    --campaign-id abc-123 --volume 100 --dry-run
""",
    )
    parser.add_argument("--apollo-key", default=os.environ.get("APOLLO_API_KEY"),
                        help="Apollo API key (or set APOLLO_API_KEY env var)")
    parser.add_argument("--leadmagic-key", default=os.environ.get("LEADMAGIC_API_KEY"),
                        help="LeadMagic API key (or set LEADMAGIC_API_KEY env var)")
    parser.add_argument("--instantly-key", default=os.environ.get("INSTANTLY_API_KEY"),
                        help="Instantly API key (or set INSTANTLY_API_KEY env var)")
    parser.add_argument("--titles", required=True, help="Comma-separated job titles")
    parser.add_argument("--industries", default="", help="Comma-separated industries/keywords")
    parser.add_argument("--company-size", default="", help="Employee range, e.g. '11,50'")
    parser.add_argument("--locations", default="", help="Comma-separated locations")
    parser.add_argument("--keywords", default="", help="Additional search keywords")
    parser.add_argument("--campaign-id", required=True, help="Instantly campaign UUID")
    parser.add_argument("--volume", type=int, default=500, help="Target number of leads (default: 500)")
    parser.add_argument("--exclude-file", default=None, help="Path to CSV of burned/excluded emails")
    parser.add_argument("--output-dir", default="./data/lead-pipeline-runs/",
                        help="Directory for run logs (default: ./data/lead-pipeline-runs/)")
    parser.add_argument("--dry-run", action="store_true", help="Run pipeline but skip Instantly upload")
    args = parser.parse_args()

    # Validate required keys
    if not args.apollo_key:
        print("ERROR: Apollo API key required. Set APOLLO_API_KEY env var or pass --apollo-key.")
        sys.exit(1)
    if not args.leadmagic_key:
        print("ERROR: LeadMagic API key required. Set LEADMAGIC_API_KEY env var or pass --leadmagic-key.")
        sys.exit(1)
    if not args.instantly_key:
        print("ERROR: Instantly API key required. Set INSTANTLY_API_KEY env var or pass --instantly-key.")
        sys.exit(1)

    start_time = time.time()
    os.makedirs(args.output_dir, exist_ok=True)
    print(f"\n🚀 Lead Pipeline Started — {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f" Target: {args.volume} leads → Campaign {args.campaign_id}")
    if args.dry_run:
        print(f" ⚠️ DRY RUN MODE — will not upload to Instantly")

    # Step 1: Source from Apollo
    sourced_leads = source_from_apollo(
        api_key=args.apollo_key,
        titles=args.titles,
        industries=args.industries,
        company_size=args.company_size,
        locations=args.locations,
        keywords=args.keywords,
        volume=args.volume,
    )
    if not sourced_leads:
        print("\n❌ No leads sourced from Apollo. Exiting.")
        sys.exit(1)

    # Save intermediate state
    intermediate_path = os.path.join(args.output_dir, "last-sourced.json")
    with open(intermediate_path, "w") as f:
        json.dump(sourced_leads, f, indent=2)

    # Step 2: Verify via LeadMagic
    verified_leads, verified_stats = verify_with_leadmagic(args.leadmagic_key, sourced_leads)
    if not verified_leads:
        print("\n❌ No leads passed verification. Exiting.")
        sys.exit(1)
    intermediate_path = os.path.join(args.output_dir, "last-verified.json")
    with open(intermediate_path, "w") as f:
        json.dump(verified_leads, f, indent=2)

    # Step 3: Deduplicate
    deduped_leads, dedup_stats = deduplicate(verified_leads, args.instantly_key, args.exclude_file)
    if not deduped_leads:
        print("\n⚠️ All leads already exist in Instantly or are on the exclusion list. Nothing to upload.")
        upload_stats = {"uploaded": 0, "failed": 0, "dry_run": args.dry_run}
    else:
        # Step 4: Upload to Instantly
        upload_stats = upload_to_instantly(args.instantly_key, deduped_leads, args.campaign_id, args.dry_run)

    # Step 5: Report
    print_summary(len(sourced_leads), verified_stats, dedup_stats, upload_stats)
    save_report(
        args.output_dir,
        sourced=len(sourced_leads),
        verified_stats=verified_stats,
        dedup_stats=dedup_stats,
        upload_stats=upload_stats,
        leads_uploaded=deduped_leads,
        args=args,
    )
    elapsed = time.time() - start_time
    print(f"⏱️ Completed in {elapsed/60:.1f} minutes")


if __name__ == "__main__":
    main()


@ -0,0 +1,50 @@
# ─── Required API Keys ───────────────────────────────────────────────────────
# HubSpot API key (for Deal Resurrector + Suppression Pipeline CRM check)
# Get one: HubSpot → Settings → Integrations → Private Apps
HUBSPOT_API_KEY=your-hubspot-private-app-token
# Instantly API key (for RB2B Router + Suppression Pipeline)
# Get one: Instantly.ai → Settings → API
INSTANTLY_API_KEY=your-instantly-api-key
# Brave Search API key (for Trigger Prospector)
# Get one: https://api.search.brave.com/
BRAVE_API_KEY=your-brave-search-api-key
# ─── Optional: Database (for ICP Learning Analyzer) ─────────────────────────
# PostgreSQL connection string for prospect tracking database
# Only needed if you're running the ICP Learning Analyzer
DATABASE_URL=postgresql://user:password@localhost:5432/prospects_db
# ─── Your Company Info (for email templates) ────────────────────────────────
YOUR_COMPANY_NAME="Your Company"
YOUR_SENDER_NAME="Your Name"
YOUR_SENDER_TITLE=CEO
YOUR_VALUE_PROP="We've built new capabilities since we last talked that I think you'd find interesting."
# ─── Pipeline Configuration ─────────────────────────────────────────────────
# Default source site for RB2B webhook ingest
DEFAULT_SOURCE_SITE=your-site.com
# Minimum intent score to process visitors (0-100, default: 50)
MIN_INTENT_SCORE=50
# Minimum company size for ICP match (employees, default: 50)
ICP_MIN_COMPANY_SIZE=50
# Company dedup window in days (default: 7)
COMPANY_DEDUP_WINDOW_DAYS=7
# HubSpot API rate limit delay in seconds (default: 1.5)
HUBSPOT_RATE_DELAY=1.5
# Campaign routing (override defaults)
CAMPAIGN_AGENCY=Agency-Default
CAMPAIGN_GENERAL=General-Default
# Base directory override (defaults to script directory)
# BASE_DIR=/path/to/your/sales-pipeline

349
sales-pipeline/README.md Normal file

@ -0,0 +1,349 @@
# 🎯 AI Sales Pipeline
> **Turn anonymous website visitors into qualified pipeline in under 60 seconds.**
A complete AI-powered sales pipeline automation suite: from website visitor identification through intent scoring, suppression, campaign routing, dead deal resurrection, trigger-based prospecting, and self-learning ICP optimization.
These tools were built in production at [Single Grain](https://www.singlegrain.com), processing thousands of visitors and deals weekly. Now open-sourced for any B2B company to use.
---
## Architecture
```
          ┌────────────────────────────────────────────────┐
          │                YOUR WEBSITE(S)                 │
          └───────────────────────┬────────────────────────┘
                                  │ RB2B pixel fires
          ┌────────────────────────────────────────────────┐
          │  rb2b_webhook_ingest.py                        │
          │  Intent Scoring + ICP Classification           │
          │  (pricing=90, blog=30, services=65)            │
          └───────────────────────┬────────────────────────┘
                                  │ High-intent visitors
          ┌────────────────────────────────────────────────┐
          │  rb2b_suppression_pipeline.py                  │
          │  5-Layer Check:                                │
          │  CRM → Outbound → Stripe → Analytics → Block   │
          │  + Company-level dedup (1 per domain/week)     │
          └───────────────────────┬────────────────────────┘
                                  │ Clean leads only
          ┌────────────────────────────────────────────────┐
          │  rb2b_instantly_router.py                      │
          │  Agency Detection + Source Site Routing        │
          │  → Routes to correct Instantly campaign        │
          │  → Auto-activates paused campaigns             │
          └────────────────────────────────────────────────┘

┌────────────────────┐  ┌────────────────────┐  ┌────────────────────┐
│ deal_resurrector   │  │ trigger_prospector │  │ icp_learning_      │
│ .py                │  │ .py                │  │ analyzer.py        │
│                    │  │                    │  │                    │
│ 3 intelligence     │  │ Monitors:          │  │ Reads approve/     │
│ layers on dead     │  │ • New CMO hires    │  │ reject decisions   │
│ deals:             │  │ • Job postings     │  │                    │
│ 1. Time decay      │  │ • Funding rounds   │  │ Outputs:           │
│    scoring         │  │ • Agency searches  │  │ • Industry targets │
│ 2. POC expansion   │  │                    │  │ • Size sweet spots │
│ 3. Follow the      │  │ Scores, enriches,  │  │ • Title patterns   │
│    champion        │  │ drafts outreach    │  │ • Revenue ranges   │
└────────────────────┘  └────────────────────┘  └────────────────────┘
          │                       │                       │
          └───────────────────────┼───────────────────────┘
                      ┌────────────────────────┐
                      │  Your CRM / Outbound   │
                      │  (HubSpot, Instantly)  │
                      └────────────────────────┘
```
---
## Tools
### 1. 🌐 RB2B Webhook Ingest (`rb2b_webhook_ingest.py`)
Receives RB2B visitor identification webhooks, scores intent based on pages visited, and classifies ICP fit.
**What it does:**
- Scores every page visit against configurable intent patterns (pricing page = 90, blog = 30)
- Checks ICP fit by title seniority + company size
- Outputs structured signals with priority levels (high/medium/low)
- Runs as HTTP server or processes stdin/batch files
```bash
# Run as webhook server
python3 rb2b_webhook_ingest.py --serve --port 4100
# Test with sample data
echo '{"email":"cmo@acme.com","job_title":"CMO","company":"Acme Inc","company_size":500,"pages_visited":["https://yoursite.com/pricing"]}' | python3 rb2b_webhook_ingest.py --dry-run
```
### 2. 🛡️ Suppression Pipeline (`rb2b_suppression_pipeline.py`)
Five suppression layers plus a company-level dedup pass prevent embarrassing outreach to existing customers, active leads, or competitors.
**Layers:**
1. **Personal Email Filter** — Skip gmail.com, yahoo.com, etc.
2. **CRM Check** — Already in HubSpot? Don't cold email them.
3. **Outbound Platform** — Already in an Instantly campaign (last 90 days)?
4. **Payment Provider** — Paying Stripe customer? Definitely don't cold email.
5. **Blocklist** — Competitor domains + manual blocks
6. **Company Dedup** — Only 1 contact per company domain per 7-day window
```bash
# Check a single email
python3 rb2b_suppression_pipeline.py --email john@acme.com --company "Acme Inc"
# Output:
# 📋 Suppression check for: john@acme.com
# ──────────────────────────────────────────────────
# ✅ Personal Email Filter: business email
# ✅ CRM Check: not in CRM
# ✅ Outbound Platform: not in outbound platform
# ✅ Payment Provider: not a paying customer
# ✅ Blocklist: not blocklisted
# ✅ Company Dedup: no company dedup conflict
# ──────────────────────────────────────────────────
# ✅ CLEAR — eligible for enrollment
```
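
The ordered checks above could be wired together roughly as follows. This is a minimal sketch, not the code in `rb2b_suppression_pipeline.py`: the `Check` signature, the domain sets, and the stop-at-first-block behavior are all assumptions made for illustration, and only two of the listed layers are shown.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical check signature: return a reason string when the lead should
# be suppressed, or None when the layer passes.
Check = Callable[[str], Optional[str]]

# Illustrative domain list; the real filter may use a longer one.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def personal_email_check(email: str) -> Optional[str]:
    domain = email.rsplit("@", 1)[-1].lower()
    return "personal email domain" if domain in PERSONAL_DOMAINS else None


def make_blocklist_check(blocked_domains: set) -> Check:
    def check(email: str) -> Optional[str]:
        domain = email.rsplit("@", 1)[-1].lower()
        return "blocklisted domain" if domain in blocked_domains else None
    return check


def run_suppression(email: str, layers: List[Tuple[str, Check]]) -> Tuple[bool, List[str]]:
    """Run layers in order and stop at the first one that suppresses the lead."""
    log = []
    for name, check in layers:
        reason = check(email)
        if reason:
            log.append(f"BLOCKED {name}: {reason}")
            return False, log
        log.append(f"ok {name}")
    return True, log


layers = [
    ("Personal Email Filter", personal_email_check),
    ("Blocklist", make_blocklist_check({"competitor.com"})),
]
clear, log = run_suppression("john@acme.com", layers)
print("CLEAR" if clear else "SUPPRESSED", log)
```

Putting the cheap local checks first means an API-backed layer (CRM, outbound platform, payment provider) would only be called for leads that survive them.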
### 3. 🔀 Instantly Router (`rb2b_instantly_router.py`)
The orchestrator: combines intent scoring + suppression + agency classification to route leads to the right Instantly campaign automatically.
**What it does:**
- Scores visitor intent
- Runs full suppression pipeline
- Classifies agency vs. non-agency visitors (2+ signal threshold)
- Detects source site (if you have multiple properties)
- Routes to the correct campaign and auto-enrolls via Instantly API
- Auto-activates paused campaigns when leads are ready
```bash
# Run as webhook server (production mode)
python3 rb2b_instantly_router.py --serve --port 4100
# Dry run test
echo '{"email":"vp@techco.com","job_title":"VP Marketing","company":"TechCo","industry":"SaaS","company_size":"200","pages_visited":["https://yoursite.com/pricing","https://yoursite.com/case-studies"]}' | python3 rb2b_instantly_router.py --dry-run
```
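
To illustrate the "2+ signal threshold" for agency classification, here is a hedged sketch. The keyword tuples and the job-title signal are placeholders invented for this example; the real lists live in `AGENCY_KEYWORDS_COMPANY` and `AGENCY_INDUSTRIES` in `rb2b_instantly_router.py`, and the router may count signals differently.

```python
# Placeholder values for illustration only.
AGENCY_KEYWORDS_COMPANY = ("agency", "media", "creative", "digital")
AGENCY_INDUSTRIES = ("marketing and advertising", "public relations")
AGENCY_SIGNAL_THRESHOLD = 2  # a single weak signal is not enough


def count_agency_signals(visitor: dict) -> int:
    """Count independent hints that the visitor works at an agency."""
    company = (visitor.get("company") or "").lower()
    industry = (visitor.get("industry") or "").lower()
    title = (visitor.get("job_title") or "").lower()
    signals = 0
    if any(keyword in company for keyword in AGENCY_KEYWORDS_COMPANY):
        signals += 1
    if any(name in industry for name in AGENCY_INDUSTRIES):
        signals += 1
    if "agency" in title or "account director" in title:  # assumed title signal
        signals += 1
    return signals


def pick_campaign(visitor: dict) -> str:
    """Map the signal count to one of the campaign keys from campaigns.json."""
    if count_agency_signals(visitor) >= AGENCY_SIGNAL_THRESHOLD:
        return "Agency-Default"
    return "General-Default"


print(pick_campaign({"company": "TechCo", "industry": "SaaS", "job_title": "VP Marketing"}))
```

Requiring two signals keeps a company like "Acme Media" in a SaaS industry from being routed to the agency campaign on its name alone.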
### 4. 🔥 Deal Resurrector (`deal_resurrector.py`)
Three intelligence layers on your closed-lost deals. Finds the best revival opportunities using a composite scoring formula.
**Layer 1 — Time Decay Scoring (0-100):**
- Time component (35 pts): 60-90 days = sweet spot, decays over time
- Value component (30 pts): Normalized deal value
- Reason component (20 pts): "Timing" deals score higher than "bad fit"
- Trigger component (15 pts): Bonus if recent email opens or site visits
**Layer 2 — POC Expansion:**
- Verifies if your contact is still at the company
- Finds replacement decision-makers when contacts leave
**Layer 3 — Follow the Champion:**
- Tracks departed contacts to their new companies
- If they moved to an ICP-fit company, generates outreach for the new org
```bash
# Find top 10 revival opportunities (dry run)
python3 deal_resurrector.py --top 10 --dry-run
# Full run with champion tracking
python3 deal_resurrector.py --top 5 --include-champion
# Exclude a company from future runs
python3 deal_resurrector.py --add-exclusion "Already Won Corp"
```
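
The Layer 1 composite score can be sketched as below. The point budget (35/30/20/15) and the decay windows come from this repo; how each component is normalized is an assumption here: value is scaled against a caller-supplied `max_amount`, the loss-reason multiplier is scaled against the largest multiplier in the table, and deals closed fewer than 60 days ago get no time points. The `LOSS_REASON_BONUS` dict is a subset of the one in `deal_resurrector.py`.

```python
# (min_days, max_days, weight), as in deal_resurrector.py
DECAY_WINDOWS = [
    (60, 90, 1.0),
    (91, 180, 0.8),
    (181, 365, 0.6),
    (366, 540, 0.4),
    (541, 99999, 0.2),
]
LOSS_REASON_BONUS = {"timing": 1.3, "budget": 1.15, "competitor": 0.7, "bad fit": 0.3}


def decay_weight(days_since_close: int) -> float:
    for min_days, max_days, weight in DECAY_WINDOWS:
        if min_days <= days_since_close <= max_days:
            return weight
    return 0.0  # assumption: under 60 days is too fresh to revive


def revival_score(days_since_close, amount, max_amount, loss_reason, has_trigger):
    """Composite 0-100 score: time (35) + value (30) + reason (20) + trigger (15)."""
    time_pts = 35 * decay_weight(days_since_close)
    value_pts = 30 * min(amount / max_amount, 1.0) if max_amount > 0 else 0.0
    bonus = LOSS_REASON_BONUS.get(loss_reason.lower(), 1.0)
    reason_pts = 20 * bonus / max(LOSS_REASON_BONUS.values())
    trigger_pts = 15 if has_trigger else 0
    return round(time_pts + value_pts + reason_pts + trigger_pts, 1)


print(revival_score(75, 50_000, 50_000, "timing", True))
```

Under these assumptions, a top-value "timing" deal in the 60-90 day window with a recent engagement trigger reaches the maximum, while an old "bad fit" deal with no amount stays near the bottom of the list.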
### 5. 🔍 Trigger Prospector (`trigger_prospector.py`)
Scans the web for buying signals: new marketing leadership hires, job postings, funding rounds, and active agency searches.
**Signal Categories:**
| Signal | What It Means | Score Weight |
|--------|--------------|-------------|
| New CMO/VP hire | Budget reallocation window | 35 pts |
| Job posting | Growth mode, team building | 25 pts |
| Funding round | Capital to deploy | 30 pts |
| Agency search | Active evaluation | 40 pts |
Each prospect gets a composite score (0-100) plus enrichment: estimated company size, industry, suggested services, outreach channel recommendation, and a ready-to-send email draft.
```bash
# Scan last 7 days for signals
python3 trigger_prospector.py --days 7 --top 15
# Wider scan with lower threshold
python3 trigger_prospector.py --days 30 --top 25 --min-score 40
```
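
The signal weights in the table sum to 130, so a 0-100 composite needs some form of capping or normalization. The sketch below assumes the simplest option (sum the weights of distinct signals, cap at 100); the actual formula in `trigger_prospector.py` may differ, and the signal keys are names chosen for this example.

```python
# Weights from the table above; the dict keys are illustrative names.
SIGNAL_WEIGHTS = {
    "new_hire": 35,
    "job_posting": 25,
    "funding": 30,
    "agency_search": 40,
}


def composite_score(signals):
    """Sum the weights of distinct signals, capped at 100 (assumed rule)."""
    total = sum(SIGNAL_WEIGHTS.get(signal, 0) for signal in set(signals))
    return min(total, 100)


def shortlist(prospects, min_score=40, top=15):
    """Mirror the --min-score and --top flags: filter, then keep the best N."""
    scored = [(composite_score(signals), name) for name, signals in prospects.items()]
    kept = [item for item in scored if item[0] >= min_score]
    return sorted(kept, reverse=True)[:top]


prospects = {
    "A": ["agency_search"],
    "B": ["job_posting"],
    "C": ["funding", "new_hire"],
}
print(shortlist(prospects, min_score=40, top=2))
```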
### 6. 📊 ICP Learning Analyzer (`icp_learning_analyzer.py`)
Your ICP should evolve from data, not guesswork. This tool reads your prospect approve/reject history and outputs recommended filter changes.
**What it analyzes:**
- Industry patterns (which convert vs. get rejected)
- Company size sweet spots (10th-90th percentile of approvals)
- Title/seniority patterns
- Revenue ranges
- Per-source approval rates (cold vs. trigger vs. warm vs. revival)
```bash
# Run analysis
python3 icp_learning_analyzer.py
# With custom config
python3 icp_learning_analyzer.py --config data/icp-config.json
# Example output:
# 📊 ICP Learning Analyzer Results
# Total prospects analyzed: 847
# ────────────────────────────────────────
# cold : ready (n=312, approval=23%)
# → Target: SaaS, Fintech, E-commerce
# → Exclude: Crypto/Web3
# → Employees: 50-500
# trigger : ready (n=156, approval=41%)
# → Target: SaaS, Healthcare
# → Employees: 100-1000
# warm : ready (n=289, approval=67%)
# revival : insufficient_data (n=12, min_required=30)
```
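
As a rough illustration of the "10th-90th percentile of approvals" rule and the `insufficient_data` status shown above, the following sketch uses the standard library's `statistics.quantiles`. The function name and return shape are hypothetical; only the percentile range and the minimum sample size of 30 (from `icp-config.example.json`) come from this repo.

```python
import statistics

MIN_SAMPLE_SIZE = 30  # matches "min_sample_size" in the example config


def size_sweet_spot(approved_sizes):
    """Return the 10th-90th percentile employee range of approved prospects."""
    n = len(approved_sizes)
    if n < MIN_SAMPLE_SIZE:
        return {"status": "insufficient_data", "n": n, "min_required": MIN_SAMPLE_SIZE}
    # n=10 yields nine cut points: the 10th, 20th, ..., 90th percentiles
    deciles = statistics.quantiles(approved_sizes, n=10)
    return {
        "status": "ready",
        "n": n,
        "employees_min": round(deciles[0]),
        "employees_max": round(deciles[-1]),
    }


print(size_sweet_spot([40, 55, 80]))
print(size_sweet_spot(list(range(10, 1010, 10))))
```

Trimming both tails keeps one unusually large or small approved account from stretching the recommended employee filter.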
---
## Quick Start
### 1. Clone and install
```bash
git clone https://github.com/nichochar/ai-marketing-skills.git
cd ai-marketing-skills/sales-pipeline
pip install -r requirements.txt
```
### 2. Configure environment
```bash
cp .env.example .env
# Edit .env with your API keys
```
### 3. Set up campaign config (for RB2B Router)
```bash
cp data/campaigns.json.example data/campaigns.json
# Add your Instantly campaign UUIDs
```
### 4. Test with dry runs
```bash
# Test suppression pipeline
python3 rb2b_suppression_pipeline.py --email test@example.com
# Test intent scoring
echo '{"email":"test@example.com","pages_visited":["https://yoursite.com/pricing"]}' \
| python3 rb2b_webhook_ingest.py --dry-run
# Test deal resurrector
python3 deal_resurrector.py --top 5 --dry-run
# Test trigger prospector
python3 trigger_prospector.py --days 7 --top 10
```
### 5. Deploy webhook server
```bash
# Start the full pipeline as a webhook endpoint
python3 rb2b_instantly_router.py --serve --port 4100
# Point your RB2B webhook (or Zapier/Make) at:
# POST http://your-server:4100/
```
---
## Customization
### Intent Scoring
Edit `PAGE_INTENT_SCORES` in `rb2b_webhook_ingest.py` to match your site's URL structure:
```python
PAGE_INTENT_SCORES = {
    "pricing": 90,      # Your pricing page path
    "demo": 85,         # Demo request page
    "case-study": 70,   # Social proof pages
    "blog": 30,         # Low-intent content
    # Add your own patterns...
}
```
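
One plausible way these patterns turn into a visitor score is sketched below. It assumes substring matching on the URL path, takes the highest-scoring match per page and per visit, and falls back to a hypothetical `DEFAULT_PAGE_SCORE` for unmatched pages; the actual matching logic in `rb2b_webhook_ingest.py` may differ.

```python
from urllib.parse import urlparse

PAGE_INTENT_SCORES = {"pricing": 90, "demo": 85, "case-study": 70, "blog": 30}
MIN_INTENT_SCORE = 50    # same default as MIN_INTENT_SCORE in .env.example
DEFAULT_PAGE_SCORE = 10  # hypothetical fallback for unmatched pages


def score_page(url: str) -> int:
    """Score one URL by the highest-scoring pattern found in its path."""
    path = urlparse(url).path.lower()
    matches = [score for pattern, score in PAGE_INTENT_SCORES.items() if pattern in path]
    return max(matches, default=DEFAULT_PAGE_SCORE)


def score_visit(pages_visited) -> int:
    """A visit is as hot as its hottest page; no pages means no intent."""
    return max((score_page(url) for url in pages_visited), default=0)


visit = ["https://yoursite.com/blog/hello", "https://yoursite.com/pricing"]
score = score_visit(visit)
print(score, "process" if score >= MIN_INTENT_SCORE else "skip")
```

With substring matching, a path such as `/blog/pricing-strategy` would match both `blog` and `pricing`, so patterns are best kept specific to your URL structure.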
### Agency Detection
Modify `AGENCY_KEYWORDS_COMPANY` and `AGENCY_INDUSTRIES` in `rb2b_instantly_router.py` for your market.
### Loss Reason Scoring
Customize `LOSS_REASON_BONUS` in `deal_resurrector.py` based on which loss reasons actually convert when revisited.
### Trigger Queries
Edit `SEARCH_QUERIES` in `trigger_prospector.py` to target your specific market signals.
---
## Integrations
| Tool | Required | Used By |
|------|----------|---------|
| [RB2B](https://rb2b.com) | For visitor ID | Webhook Ingest, Router |
| [Instantly](https://instantly.ai) | For cold email | Router, Suppression |
| [HubSpot](https://hubspot.com) | For CRM | Deal Resurrector, Suppression |
| [Brave Search](https://api.search.brave.com) | For web signals | Trigger Prospector |
| PostgreSQL | For ICP learning | ICP Analyzer |
| Stripe | Optional | Suppression (customer check) |
---
## File Structure
```
sales-pipeline/
├── README.md # This file
├── SKILL.md # Claude Code skill definition
├── requirements.txt # Python dependencies
├── .env.example # Environment variable template
├── rb2b_webhook_ingest.py # Webhook server + intent scoring
├── rb2b_suppression_pipeline.py # 5-layer suppression checks
├── rb2b_instantly_router.py # Full pipeline orchestrator
├── deal_resurrector.py # Dead deal revival engine
├── trigger_prospector.py # Web signal prospecting
├── icp_learning_analyzer.py # Self-learning ICP optimization
└── data/
├── campaigns.json.example # Instantly campaign config template
└── icp-config.example.json # ICP analyzer config template
```
---
## How It Works Together
1. **RB2B identifies** anonymous website visitors with name, email, company, title
2. **Webhook Ingest** scores their intent based on which pages they viewed
3. **Suppression Pipeline** checks 5 layers to avoid emailing existing contacts
4. **Router** classifies agency vs. non-agency, picks the right campaign, enrolls
5. **Meanwhile**, Deal Resurrector mines your CRM for revival opportunities
6. **Trigger Prospector** scans the web for companies showing buying signals
7. **ICP Analyzer** learns from your approve/reject decisions and tightens targeting
The result: a self-improving pipeline that gets better the more you use it.
---
<p align="center">
Built by <a href="https://www.singlegrain.com">Single Grain</a> · Open-sourced as part of <a href="https://github.com/nichochar/ai-marketing-skills">AI Marketing Skills</a>
</p>

66
sales-pipeline/SKILL.md Normal file

@ -0,0 +1,66 @@
# AI Sales Pipeline
Complete AI-powered sales pipeline automation: website visitor identification → intent scoring → suppression → campaign routing → dead deal resurrection → trigger prospecting → self-learning ICP optimization.
## When to Use
Use this skill when:
- Setting up automated outbound from website visitor identification (RB2B)
- Running suppression checks before cold outreach
- Routing leads to the right cold email campaigns
- Reviving closed-lost deals from HubSpot
- Finding companies showing buying signals (new hires, funding, job postings)
- Analyzing prospect approve/reject patterns to improve ICP targeting
## Tools
### RB2B Pipeline (visitor → outbound)
| Script | Purpose | Key Command |
|--------|---------|-------------|
| `rb2b_webhook_ingest.py` | Webhook server + intent scoring | `python3 rb2b_webhook_ingest.py --serve --port 4100` |
| `rb2b_suppression_pipeline.py` | 5-layer suppression checks | `python3 rb2b_suppression_pipeline.py --email user@co.com` |
| `rb2b_instantly_router.py` | Full pipeline: score → suppress → route → enroll | `python3 rb2b_instantly_router.py --serve --port 4100` |
### Deal Intelligence
| Script | Purpose | Key Command |
|--------|---------|-------------|
| `deal_resurrector.py` | 3-layer dead deal revival (time decay + POC expansion + champion tracking) | `python3 deal_resurrector.py --top 10 --dry-run` |
| `trigger_prospector.py` | Web signal monitoring (new hires, funding, agency searches) | `python3 trigger_prospector.py --days 7 --top 15` |
| `icp_learning_analyzer.py` | Learn from approve/reject decisions, recommend ICP changes | `python3 icp_learning_analyzer.py` |
## Configuration
All scripts use environment variables for API keys and configuration. Copy `.env.example` to `.env` and fill in your values.
### Required Environment Variables
- `HUBSPOT_API_KEY` — HubSpot private app token (Deal Resurrector, Suppression)
- `INSTANTLY_API_KEY` — Instantly API key (Router, Suppression)
- `BRAVE_API_KEY` — Brave Search API key (Trigger Prospector)
- `DATABASE_URL` — PostgreSQL connection string (optional; only needed for the ICP Analyzer)
### Key Customization Points
- **Intent scoring**: Edit `PAGE_INTENT_SCORES` dict in webhook_ingest to match your URL patterns
- **Agency detection**: Edit `AGENCY_KEYWORDS_*` in router for your market
- **Loss reason scoring**: Edit `LOSS_REASON_BONUS` in deal_resurrector for your close reasons
- **Signal queries**: Edit `SEARCH_QUERIES` in trigger_prospector for your target market
- **Campaign routing**: Edit `data/campaigns.json` with your Instantly campaign UUIDs
## Data Flow
```
RB2B Webhook → Ingest (score) → Suppress (5 layers) → Route (classify) → Instantly
HubSpot CRM → Deal Resurrector (score + draft emails) → Review Queue
Brave Search → Trigger Prospector (score + enrich) → Outreach Queue
Prospect DB → ICP Analyzer (learn patterns) → Filter Recommendations
```
## Dependencies
- Python 3.9+
- `requests` (for HubSpot API)
- `psycopg2-binary` (for ICP Analyzer only)
- No other external dependencies — scripts use stdlib HTTP server and urllib


@ -0,0 +1,6 @@
{
"campaigns": {
"Agency-Default": "your-instantly-campaign-uuid-here",
"General-Default": "your-instantly-campaign-uuid-here"
}
}


@ -0,0 +1,11 @@
{
"source_type_mapping": {
"cold_outbound": "cold",
"trigger_prospector": "trigger",
"website_visitor": "warm",
"deal_revival": "revival",
"referral": "warm",
"inbound": "warm"
},
"min_sample_size": 30
}


@ -0,0 +1,668 @@
#!/usr/bin/env python3
"""
Deal Resurrector v2: three intelligence layers on dead deals.

  Layer 1: Time Decay Scoring (composite score with configurable decay windows)
  Layer 2: POC Expansion (verify contacts, find replacements)
  Layer 3: Follow the Champion (track departed POCs to new companies)

Pulls closed-lost deals from HubSpot, scores them using a composite formula
(time decay + deal value + loss reason + engagement triggers), then generates
personalized revival emails per loss reason category.

Usage:
    python3 deal_resurrector.py --top 10 --dry-run
    python3 deal_resurrector.py --top 5 --include-champion
    python3 deal_resurrector.py --add-exclusion "Acme Corp"
"""
import argparse
import json
import os
import random
import re
import subprocess
import sys
import time
from datetime import datetime, timedelta, timezone
from pathlib import Path
import requests
# ─── Configuration ───────────────────────────────────────────────────────────
BASE_DIR = Path(os.environ.get("BASE_DIR", Path(__file__).resolve().parent))
DATA_DIR = BASE_DIR / "data"
EXCLUSIONS_FILE = DATA_DIR / "resurrector-exclusions.json"
OUTPUT_FILE = DATA_DIR / "deal-resurrector-latest.json"
# HubSpot API
HUBSPOT_BASE_URL = "https://api.hubapi.com"
HUBSPOT_TOKEN = os.environ.get("HUBSPOT_API_KEY", "")
# ─── Closed-Lost Stage IDs ──────────────────────────────────────────────────
# Map your HubSpot closed-lost stage IDs to pipeline names.
# Find these in HubSpot → Settings → Objects → Deals → Pipelines
CLOSED_LOST_STAGES = {
# "stage_id_here": "Pipeline Name",
# Example:
# "1079884213": "Enterprise Pipeline",
# "960522377": "ABM Pipeline",
}
# ─── HubSpot Properties to Fetch ────────────────────────────────────────────
DEAL_PROPERTIES = [
"dealname", "amount", "closedate", "dealstage",
"closed_lost_reason", "hs_closed_amount", "pipeline",
"hubspot_owner_id", "notes_last_updated",
]
CONTACT_PROPERTIES = [
"firstname", "lastname", "email", "jobtitle", "company",
"hs_last_sales_activity_date", "notes_last_updated",
"hs_email_last_open_date", "hs_email_last_click_date",
"hs_analytics_last_visit_timestamp", "hs_analytics_num_page_views",
"num_associated_deals", "recent_conversion_event_name",
]
COMPANY_PROPERTIES = [
"name", "domain", "industry", "numberofemployees",
"annualrevenue", "hs_last_sales_activity_date",
"notes_last_updated", "num_associated_deals",
"hs_analytics_last_visit_timestamp",
]
# ─── Time Decay Windows ─────────────────────────────────────────────────────
# (min_days, max_days, weight)
# Deals in the 60-90 day window get full weight; older deals decay.
DECAY_WINDOWS = [
(60, 90, 1.0), # Sweet spot — enough time has passed, still fresh
(91, 180, 0.8), # Good window
(181, 365, 0.6), # Getting stale but still viable
(366, 540, 0.4), # Long shot unless trigger present
(541, 99999, 0.2), # Only if engagement trigger detected
]
# ─── Loss Reason → Bonus Multiplier ─────────────────────────────────────────
# Deals lost to "timing" are more likely to convert than "bad fit".
LOSS_REASON_BONUS = {
"timing": 1.3,
"not ready": 1.25,
"budget": 1.15,
"price": 1.1,
"internal": 1.05,
"no decision": 1.0,
"competitor": 0.7,
"no need": 0.5,
"bad fit": 0.3,
}
# Rate limit delay between HubSpot API calls (seconds)
SEARCH_DELAY = float(os.environ.get("HUBSPOT_RATE_DELAY", "1.5"))
# ─── Your Company Info (for email templates) ────────────────────────────────
YOUR_COMPANY_NAME = os.environ.get("YOUR_COMPANY_NAME", "Your Company")
YOUR_SENDER_NAME = os.environ.get("YOUR_SENDER_NAME", "Your Name")
YOUR_SENDER_TITLE = os.environ.get("YOUR_SENDER_TITLE", "CEO")
# A brief value prop to include in emails
YOUR_VALUE_PROP = os.environ.get("YOUR_VALUE_PROP",
"We've built new capabilities since we last talked that I think you'd find interesting.")
# ─── Exclusion List ──────────────────────────────────────────────────────────
def load_exclusions() -> set:
"""Load excluded company names (lowercased) from the exclusions file."""
if not EXCLUSIONS_FILE.exists():
return set()
try:
data = json.loads(EXCLUSIONS_FILE.read_text())
return {e["company"].lower() for e in data.get("excluded_deals", [])}
except Exception as ex:
print(f"⚠️ Could not load exclusions: {ex}", file=sys.stderr)
return set()
def add_exclusion(company: str, deal_id: str = "", reason: str = "manually_excluded") -> None:
"""Append a company to the exclusions file."""
data = {"excluded_deals": []}
if EXCLUSIONS_FILE.exists():
try:
data = json.loads(EXCLUSIONS_FILE.read_text())
except Exception:
pass
existing = {e["company"].lower() for e in data.setdefault("excluded_deals", [])}
if company.lower() in existing:
print(f" {company} is already excluded.")
return
data["excluded_deals"].append({
"deal_id": deal_id or company.lower().replace(" ", "-"),
"company": company,
"reason": reason,
"excluded_date": datetime.now().strftime("%Y-%m-%d"),
})
DATA_DIR.mkdir(parents=True, exist_ok=True)
EXCLUSIONS_FILE.write_text(json.dumps(data, indent=2))
print(f"✅ Added {company} to exclusion list")
# ─── HubSpot Client ─────────────────────────────────────────────────────────
class HubSpotClient:
def __init__(self, token: str):
self.token = token.strip()
self.session = requests.Session()
self.session.headers.update({
"Authorization": f"Bearer {self.token}",
"Content-Type": "application/json",
})
self._rate_wait = 0.12
def _request(self, method, path, **kwargs):
url = f"{HUBSPOT_BASE_URL}{path}"
for attempt in range(4):
resp = self.session.request(method, url, **kwargs)
if resp.status_code == 429:
wait = int(resp.headers.get("Retry-After", 2))
print(f" ⏳ Rate limited, waiting {wait}s…", file=sys.stderr)
time.sleep(wait)
continue
resp.raise_for_status()
time.sleep(self._rate_wait)
return resp.json()
raise RuntimeError(f"Too many retries for {path}")
def get(self, path, **kwargs):
return self._request("GET", path, **kwargs)
def post(self, path, **kwargs):
return self._request("POST", path, **kwargs)
def search_closed_lost_deals(self, since_date: str):
"""Search for all closed-lost deals across configured pipelines."""
all_deals = []
for stage_id in CLOSED_LOST_STAGES:
all_deals.extend(self._search_by_stage(stage_id, since_date))
return all_deals
def _search_by_stage(self, stage_id, since_date):
deals = []
after = None
while True:
body = {
"filterGroups": [{"filters": [
{"propertyName": "dealstage", "operator": "EQ", "value": stage_id},
{"propertyName": "closedate", "operator": "GTE", "value": since_date},
]}],
"properties": DEAL_PROPERTIES,
"sorts": [{"propertyName": "closedate", "direction": "DESCENDING"}],
"limit": 100,
}
if after:
body["after"] = after
data = self.post("/crm/v3/objects/deals/search", json=body)
deals.extend(data.get("results", []))
paging = data.get("paging", {}).get("next")
if paging:
after = paging["after"]
else:
break
return deals
def get_deal_associations(self, deal_id, to_type="contacts"):
try:
data = self.get(f"/crm/v4/objects/deals/{deal_id}/associations/{to_type}")
return data.get("results", [])
except Exception:
return []
def get_contact(self, contact_id):
try:
return self.get(
f"/crm/v3/objects/contacts/{contact_id}",
params={"properties": ",".join(CONTACT_PROPERTIES)},
)
except Exception:
return None
def get_company_for_contact(self, contact_id):
try:
assocs = self.get(f"/crm/v4/objects/contacts/{contact_id}/associations/companies")
results = assocs.get("results", [])
if not results:
return None
company_id = results[0].get("toObjectId")
return self.get(
f"/crm/v3/objects/companies/{company_id}",
params={"properties": ",".join(COMPANY_PROPERTIES)},
)
except Exception:
return None
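The search method above follows cursor-style pagination: each response may carry a `paging.next.after` value that is sent back with the next request. A self-contained sketch of that loop, using a hypothetical `fetch_page` callable and a fake in-memory backend in place of real HTTP calls:

```python
from typing import Callable, Iterator, Optional


def iter_search_results(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every result across pages.

    `fetch_page` (hypothetical) takes the current cursor, or None for the
    first page, and returns a dict shaped like the search response used above.
    """
    after = None
    while True:
        data = fetch_page(after)
        yield from data.get("results", [])
        next_page = data.get("paging", {}).get("next")
        if not next_page:
            break
        after = next_page["after"]


def fake_fetch(after: Optional[str]) -> dict:
    """Stand-in for an API call: two pages keyed by cursor."""
    pages = {
        None: {"results": [{"id": "1"}, {"id": "2"}],
               "paging": {"next": {"after": "2"}}},
        "2": {"results": [{"id": "3"}]},
    }
    return pages[after]


if __name__ == "__main__":
    print([row["id"] for row in iter_search_results(fake_fetch)])
```

Passing the fetch function in keeps the loop testable without network access.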
# ─── Helpers ─────────────────────────────────────────────────────────────────
def parse_ts(val):
"""Parse a timestamp value (epoch ms or ISO string) to datetime."""
if not val:
return None
try:
if isinstance(val, (int, float)) or (isinstance(val, str) and val.isdigit()):
return datetime.fromtimestamp(int(val) / 1000, tz=timezone.utc)
return datetime.fromisoformat(val.replace("Z", "+00:00"))
except Exception:
return None
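A hedged variant of the timestamp helper above, shown as a sketch: it accepts the same two input shapes (epoch milliseconds and ISO 8601 strings) and additionally attaches UTC to date-only strings, which would otherwise parse as naive datetimes and fail when subtracted from a timezone-aware "now". The name `parse_ts_utc` is an assumption for illustration.

```python
from datetime import datetime, timezone


def parse_ts_utc(val):
    """Parse epoch-ms or ISO 8601 input into a timezone-aware UTC datetime."""
    if not val:
        return None
    try:
        if isinstance(val, (int, float)) or (isinstance(val, str) and val.isdigit()):
            return datetime.fromtimestamp(int(val) / 1000, tz=timezone.utc)
        parsed = datetime.fromisoformat(val.replace("Z", "+00:00"))
    except (ValueError, OverflowError, OSError, AttributeError):
        return None
    if parsed.tzinfo is None:
        # Assumption: values without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


if __name__ == "__main__":
    print(parse_ts_utc("1700000000000"))
    print(parse_ts_utc("2024-01-15T10:30:00Z"))
    print(parse_ts_utc("2024-01-15"))
```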
# ─── Layer 1: Time Decay Scoring ────────────────────────────────────────────
def compute_time_decay_score(days_since_close: int, deal_value: float,
                             max_deal_value: float, loss_reason: str,
                             has_trigger: bool) -> dict:
    """Compute composite score (0-100) using additive formula:
      Time component:    up to 35 pts (decay weight × 35)
      Value component:   up to 30 pts (normalized value × 30)
      Reason component:  up to 20 pts (loss reason bonus, capped at 1.0, × 20)
      Trigger component: up to 15 pts (engagement signals)
    """
    # Time decay weight
    time_weight = 0.0
    for lo, hi, weight in DECAY_WINDOWS:
        if lo <= days_since_close <= hi:
            time_weight = weight
            break
    # Too fresh (<60 days): penalize (deal is still raw)
    if days_since_close < 60:
        time_weight = 0.2
    # Very old deals only score if trigger present
    if days_since_close > 540 and not has_trigger:
        time_weight = 0.0
    # Normalize deal value (0-1)
    value_norm = min(deal_value / max(max_deal_value, 1), 1.0)
    # Loss reason bonus. Bonuses above 1.0 are capped, so every reason with
    # a bonus of 1.0 or more earns the full 20 points.
    reason_lower = (loss_reason or "").lower()
    reason_score = 0.5  # default for unknown reasons
    for keyword, bonus in LOSS_REASON_BONUS.items():
        if keyword in reason_lower:
            reason_score = min(bonus, 1.0)
            break
    # Trigger bonus
    trigger_pts = 15.0 if has_trigger else 0.0
    # Additive composite
    time_pts = time_weight * 35
    value_pts = value_norm * 30
    reason_pts = reason_score * 20
    composite = min(100, round(time_pts + value_pts + reason_pts + trigger_pts))
    return {
        "time_decay_weight": time_weight,
        "value_normalized": round(value_norm, 3),
        "loss_reason_score": round(reason_score, 2),
        "trigger_bonus": trigger_pts,
        "composite_score": composite,
    }
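As a worked example of the additive formula, the arithmetic can be isolated in a few lines. The function name `composite_score` is hypothetical; the weights (35, 30, 20, 15) and the cap on the reason bonus are taken from the scoring function above.

```python
def composite_score(time_weight: float, value_norm: float,
                    reason_bonus: float, has_trigger: bool) -> int:
    """Combine the four components into a 0-100 score."""
    time_pts = time_weight * 35               # up to 35
    value_pts = value_norm * 30               # up to 30
    reason_pts = min(reason_bonus, 1.0) * 20  # up to 20, bonus capped at 1.0
    trigger_pts = 15.0 if has_trigger else 0.0
    return min(100, round(time_pts + value_pts + reason_pts + trigger_pts))


if __name__ == "__main__":
    # 75 days old (weight 1.0), 40% of the largest deal, lost to timing
    # (1.3, capped to 1.0), recent engagement: 35 + 12 + 20 + 15
    print(composite_score(1.0, 0.4, 1.3, True))
    # 400 days old (weight 0.4), largest deal, lost to a competitor (0.7),
    # no engagement: 14 + 30 + 14 + 0
    print(composite_score(0.4, 1.0, 0.7, False))
```

With the default `--min-score 40`, both example deals would pass the threshold.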
# ─── Email Generation ───────────────────────────────────────────────────────
def _random_cta():
return random.choice([
"Worth revisiting?",
"Open to a quick catch-up?",
"Curious if the timing is better now?",
"Worth 15 min to compare notes?",
"Any interest in reconnecting?",
"Make sense to chat again?",
])
def _random_signoff():
return random.choice([
YOUR_SENDER_NAME,
f"{YOUR_SENDER_NAME}\n{YOUR_SENDER_TITLE}, {YOUR_COMPANY_NAME}",
f"- {YOUR_SENDER_NAME}",
])
# Revival email angles — rotated based on loss reason
REVIVAL_ANGLES = {
"timing": [
{
"subject": "{first}, checking back in",
"hook": "When we last talked, you mentioned the timing wasn't right. "
"It's been {months} months. Figured I'd check in rather than assume.",
},
{
"subject": "been a while, {first}",
"hook": "It's been {months} months since we last connected on {company}. "
"A lot has probably changed on both sides.",
},
],
"competitor": [
{
"subject": "how's the current setup, {first}?",
"hook": "Last time, you went with another partner. Totally respect that. "
"Curious how it's going and whether there's room to compare notes.",
},
],
"budget": [
{
"subject": "new pricing options",
"hook": "Pricing was the sticking point last time. We've restructured since then. "
"We now offer performance-based models where you pay for results.",
},
],
"internal": [
{
"subject": "{first}, dust settled yet?",
"hook": "Last time, internal changes at {company} put things on hold. "
"Wanted to see if the original initiative is back on the table.",
},
],
"ghost": [
{
"subject": "{first}, one more try",
"hook": "We connected {months} months ago but lost touch. No hard feelings. "
"Just wanted to resurface in case the need is still there.",
},
],
"default": [
{
"subject": "quick update for {first}",
"hook": "We connected {months} months ago about growing {company}. "
"A lot has changed on our end since then.",
},
],
}
def _categorize_loss_reason(loss_reason):
"""Map a free-text loss reason to a category for email angle selection."""
lr = (loss_reason or "").lower()
if any(w in lr for w in ["timing", "not ready", "circle back", "follow up"]):
return "timing"
if any(w in lr for w in ["competitor", "chose", "existing relationship"]):
return "competitor"
if any(w in lr for w in ["budget", "price", "pricing", "cost"]):
return "budget"
if any(w in lr for w in ["internal", "restructur", "reorg", "change"]):
return "internal"
if any(w in lr for w in ["ghost", "unresponsive", "no response"]):
return "ghost"
return "default"
def draft_revival_email(contact_name, company_name, deal_value, loss_reason,
days_since_close, contact_title=""):
"""Draft a personalized revival email based on loss reason category."""
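# Treat the "Unknown" placeholder as a missing name so the greeting reads "Hey there,".
first = contact_name.split()[0] if contact_name and contact_name != "Unknown" else "there"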
months = days_since_close // 30
category = _categorize_loss_reason(loss_reason)
angle = random.choice(REVIVAL_ANGLES.get(category, REVIVAL_ANGLES["default"]))
subject = angle["subject"].format(first=first, company=company_name, months=months)
hook = angle["hook"].format(first=first, company=company_name, months=months)
cta = _random_cta()
signoff = _random_signoff()
body = f"Hey {first},\n\n{hook}\n\n{YOUR_VALUE_PROP}\n\n{cta}\n\n{signoff}"
return {"subject": subject, "body": body}
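The angle templates are plain `str.format` strings, and unused fields are simply ignored, which is why every template can be filled with the same three values. A minimal sketch, assuming a hypothetical helper `fill_angle` and one template copied from the table above:

```python
ANGLE = {
    "subject": "been a while, {first}",
    "hook": "It's been {months} months since we last connected on {company}. "
            "A lot has probably changed on both sides.",
}


def fill_angle(angle: dict, contact_name: str, company_name: str,
               days_since_close: int) -> dict:
    """Fill every template in `angle` with first name, company, and months."""
    parts = (contact_name or "").split()
    fields = {
        "first": parts[0] if parts else "there",  # fallback greeting
        "company": company_name,
        "months": days_since_close // 30,         # whole months, rounded down
    }
    return {key: text.format(**fields) for key, text in angle.items()}


if __name__ == "__main__":
    email = fill_angle(ANGLE, "Dana Lee", "Acme", 200)
    print(email["subject"])
    print(email["hook"])
```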
def draft_replacement_email(replacement_name, company_name, original_contact):
"""Draft email to a replacement POC at the same company."""
first = replacement_name.split()[0] if replacement_name else "there"
orig_first = original_contact.split()[0] if original_contact else "your predecessor"
cta = _random_cta()
signoff = _random_signoff()
return {
"subject": f"picking up where {orig_first} left off at {company_name}",
"body": (
f"Hey {first},\n\n"
f"We were in conversation with {original_contact} about growth for "
f"{company_name} before the team change.\n\n"
f"{YOUR_VALUE_PROP}\n\n"
f"{cta}\n\n{signoff}"
),
}
def draft_champion_email(champion_name, new_company, new_title, old_company):
"""Draft email to a champion who moved to a new company."""
first = champion_name.split()[0] if champion_name else "there"
cta = _random_cta()
signoff = _random_signoff()
return {
"subject": f"congrats on the move, {first}",
"body": (
f"Hey {first},\n\n"
f"Saw you moved to {new_company}. Congrats on the {new_title} role.\n\n"
f"We had a great conversation when you were at {old_company}. "
f"Now that you're settling in, I'd love to show you what we can do "
f"for {new_company}.\n\n"
f"{cta}\n\n{signoff}"
),
}
# ─── Main Pipeline ───────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(
description="Deal Resurrector v2 — Time Decay + POC Expansion + Champion Tracking"
)
parser.add_argument("--top", type=int, default=10, help="Number of top deals (default: 10)")
parser.add_argument("--min-score", type=int, default=40, help="Minimum composite score (default: 40)")
parser.add_argument("--min-deal-value", type=float, default=5000, help="Min deal value (default: 5000)")
parser.add_argument("--months", type=int, default=24, help="Look back N months (default: 24)")
parser.add_argument("--include-champion", action="store_true", help="Enable Layer 3: Follow the Champion")
parser.add_argument("--dry-run", action="store_true", help="Print results, don't save")
parser.add_argument("--skip-search", action="store_true", help="Skip web searches (faster)")
parser.add_argument("--add-exclusion", metavar="COMPANY", help="Add a company to exclusion list and exit")
args = parser.parse_args()
if args.add_exclusion:
add_exclusion(args.add_exclusion)
return
if not HUBSPOT_TOKEN:
print("❌ HUBSPOT_API_KEY environment variable not set.", file=sys.stderr)
print(" Set it: export HUBSPOT_API_KEY='your-token-here'", file=sys.stderr)
sys.exit(1)
print("🔥 Deal Resurrector v2")
print(f" Layers: Time Decay + POC Expansion"
f"{ ' + Champion Tracking' if args.include_champion else ''}")
print(f" Top {args.top} | min score {args.min_score} | min value ${args.min_deal_value:,.0f}")
print()
excluded_companies = load_exclusions()
if excluded_companies:
print(f"🚫 Exclusion list: {len(excluded_companies)} companies will be skipped")
print()
client = HubSpotClient(HUBSPOT_TOKEN)
# Step 1: Pull closed-lost deals
since = (datetime.now(timezone.utc) - timedelta(days=args.months * 30)).strftime("%Y-%m-%d")
print(f"📥 Fetching closed-lost deals since {since}")
deals = client.search_closed_lost_deals(since)
print(f" Found {len(deals)} closed-lost deals")
# Filter by value
filtered = []
for d in deals:
amt = float(d["properties"].get("amount") or 0)
if amt >= args.min_deal_value:
filtered.append(d)
print(f" {len(filtered)} deals above ${args.min_deal_value:,.0f}")
# Filter exclusions
if excluded_companies:
pre = len(filtered)
filtered = [
d for d in filtered
# A substring match also covers exact matches; `or ""` guards against null deal names.
if not any(excl in (d["properties"].get("dealname") or "").lower()
for excl in excluded_companies)
]
excluded_count = pre - len(filtered)
if excluded_count:
print(f" 🚫 {excluded_count} deal(s) excluded")
if not filtered:
print("No deals to process. Exiting.")
return
max_value = max(float(d["properties"].get("amount") or 0) for d in filtered)
now = datetime.now(timezone.utc)
# Step 2: Score and enrich
results = []
for i, deal in enumerate(filtered):
dp = deal["properties"]
deal_id = deal["id"]
deal_name = dp.get("dealname", "Unknown")
amount = float(dp.get("amount") or 0)
loss_reason = dp.get("closed_lost_reason") or "Unknown"
close_dt = parse_ts(dp.get("closedate"))
days_since = (now - close_dt).days if close_dt else 999
print(f" [{i+1}/{len(filtered)}] {deal_name} (${amount:,.0f}, {days_since}d ago)…",
end="", flush=True)
# Get primary contact
assocs = client.get_deal_associations(deal_id, "contacts")
contact_name = "Unknown"
contact_email = ""
contact_title = ""
company_name = deal_name
contact_data = None
if assocs:
cid = str(assocs[0].get("toObjectId"))
contact_data = client.get_contact(cid)
if contact_data:
cp = contact_data.get("properties", {})
fn = cp.get("firstname") or ""
ln = cp.get("lastname") or ""
contact_name = f"{fn} {ln}".strip() or "Unknown"
contact_email = cp.get("email") or ""
contact_title = cp.get("jobtitle") or ""
company_name = cp.get("company") or company_name
company_data = client.get_company_for_contact(cid)
if company_data:
company_name = company_data.get("properties", {}).get("name") or company_name
# Detect engagement triggers
triggers = []
if contact_data and contact_data.get("properties"):
cp = contact_data["properties"]
if parse_ts(cp.get("hs_email_last_open_date")):
if (now - parse_ts(cp.get("hs_email_last_open_date"))).days < 60:
triggers.append("recent_email_open")
if parse_ts(cp.get("hs_analytics_last_visit_timestamp")):
if (now - parse_ts(cp.get("hs_analytics_last_visit_timestamp"))).days < 90:
triggers.append("recent_site_visit")
has_trigger = len(triggers) > 0
# Layer 1: Time Decay Score
decay = compute_time_decay_score(days_since, amount, max_value, loss_reason, has_trigger)
composite = decay["composite_score"]
if composite < args.min_score:
print(f" → score {composite} (skip)")
continue
print(f" → score {composite}")
# Generate revival email
original_email = draft_revival_email(
contact_name, company_name, amount, loss_reason, days_since, contact_title
)
# Determine revival type
revival_type = "trigger" if has_trigger else "time_decay"
entry = {
"deal_id": deal_id,
"company": company_name,
"original_contact": {
"name": contact_name,
"email": contact_email,
"title": contact_title,
},
"deal_value": amount,
"days_since_close": days_since,
"close_date": dp.get("closedate", ""),
"loss_reason": loss_reason,
"pipeline": CLOSED_LOST_STAGES.get(dp.get("dealstage"), "Unknown"),
"time_decay_score": decay["time_decay_weight"],
"composite_score": composite,
"poc_status": "unknown",
"triggers": triggers,
"revival_emails": {
"original": original_email,
"replacement": None,
"champion": None,
},
"revival_type": revival_type,
}
results.append(entry)
# Sort by composite score
results.sort(key=lambda x: x["composite_score"], reverse=True)
top_results = results[:args.top]
# Output
output = {
"generated_at": now.isoformat(),
"version": "v2",
"total_closed_lost": len(deals),
"above_min_value": len(filtered),
"scored_above_threshold": len(results),
"returned": len(top_results),
"parameters": {
"months": args.months,
"min_score": args.min_score,
"min_deal_value": args.min_deal_value,
"top": args.top,
"include_champion": args.include_champion,
},
"deals": top_results,
}
# Print summary
print(f"\n{'='*70}")
print(f"🔥 TOP {len(top_results)} REVIVAL OPPORTUNITIES")
print(f"{'='*70}")
for i, d in enumerate(top_results, 1):
print(f"\n#{i} | Score: {d['composite_score']}/100 | {d['company']}")
print(f" Deal Value: ${d['deal_value']:,.0f} | Days Since Close: {d['days_since_close']}")
print(f" Contact: {d['original_contact']['name']} ({d['original_contact']['email']})")
print(f" Title: {d['original_contact']['title']}")
print(f" Loss Reason: {d['loss_reason']}")
print(f" Revival Type: {d['revival_type']}")
print(f" Triggers: {', '.join(d['triggers']) or 'none'}")
if not args.dry_run:
DATA_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_FILE.write_text(json.dumps(output, indent=2, default=str))
print(f"\n📁 Saved to {OUTPUT_FILE}")
else:
print(f"\n🏃 Dry run — not saving.")
print(f"\n{'='*70}")
print(f"✅ Deal Resurrector v2 complete. {len(top_results)} deals ready for review.")
if __name__ == "__main__":
main()


@ -0,0 +1,287 @@
#!/usr/bin/env python3
"""
ICP Learning Analyzer: learns from your prospect approve/reject decisions.
Reads prospect approval/rejection history from a PostgreSQL database,
analyzes patterns by source type (cold, trigger, warm, revival), and
outputs recommended ICP filter changes.
Your ICP evolves from your own data instead of guesswork.
Analyzes:
- Industry patterns (which industries convert vs. get rejected)
- Company size sweet spots (employee count ranges that win)
- Title patterns (which seniority levels get approved)
- Revenue ranges (what deal sizes work)
- Approval rates per source type
Usage:
python3 icp_learning_analyzer.py
python3 icp_learning_analyzer.py --config data/icp-config.json
Requires:
- DATABASE_URL environment variable (PostgreSQL connection string)
- psycopg2-binary package
- A prospects table with status, source, and company/contact joins
Configuration:
Create data/icp-config.json with source_type_mapping and min_sample_size.
See .env.example and data/icp-config.example.json for templates.
"""
import argparse
import json
import logging
import os
import sys
from collections import Counter, defaultdict
from datetime import datetime, timezone
from pathlib import Path
logging.basicConfig(level=logging.INFO, format="%(asctime)s [ICP-Analyzer] %(message)s")
log = logging.getLogger(__name__)
# ─── Configuration ───────────────────────────────────────────────────────────
BASE_DIR = Path(os.environ.get("BASE_DIR", Path(__file__).resolve().parent))
DATA_DIR = BASE_DIR / "data"
OUTPUT_PATH = DATA_DIR / "icp-recommendations.json"
# Database connection string
DATABASE_URL = os.environ.get("DATABASE_URL", "")
# Default ICP config (override with --config flag)
DEFAULT_CONFIG = {
# Maps your prospect source names to analysis categories
"source_type_mapping": {
"cold_outbound": "cold",
"trigger_prospector": "trigger",
"website_visitor": "warm",
"deal_revival": "revival",
"referral": "warm",
"inbound": "warm",
},
# Minimum approved samples before generating recommendations
"min_sample_size": 30,
}
def load_config(config_path=None):
"""Load ICP config from file or use defaults."""
if config_path and Path(config_path).exists():
with open(config_path) as f:
return json.load(f)
default_path = DATA_DIR / "icp-config.json"
if default_path.exists():
with open(default_path) as f:
return json.load(f)
log.info("No config file found, using defaults")
return DEFAULT_CONFIG
def fetch_prospects():
"""Fetch approved/rejected prospects from database.
Expected schema:
prospects: source, status, signal, conviction_score, company_id, contact_id
companies: id, industry, employees, revenue_range
contacts: id, title
Status values: approved, skipped, sent, opened, replied, meeting, won, lost
"""
try:
import psycopg2
except ImportError:
log.error("psycopg2 not installed. Run: pip install psycopg2-binary")
return []
if not DATABASE_URL:
log.error("DATABASE_URL not set. Set it in your environment or .env file.")
return []
try:
conn = psycopg2.connect(DATABASE_URL)
cur = conn.cursor()
cur.execute("""
SELECT p.source, p.status, p.signal, p.conviction_score,
c.industry, c.employees, c.revenue_range,
ct.title
FROM prospects p
LEFT JOIN companies c ON p.company_id = c.id
LEFT JOIN contacts ct ON p.contact_id = ct.id
WHERE p.status IN ('approved', 'skipped', 'sent', 'opened',
'replied', 'meeting', 'won', 'lost')
""")
cols = [d[0] for d in cur.description]
rows = [dict(zip(cols, row)) for row in cur.fetchall()]
conn.close()
log.info(f"Fetched {len(rows)} prospect records")
return rows
except Exception as e:
log.error(f"Database query failed: {e}")
return []
def classify_status(status):
"""Map database status to binary approved/rejected for analysis."""
approved_statuses = {"approved", "sent", "opened", "replied", "meeting", "won"}
return "approved" if status in approved_statuses else "rejected"
def parse_revenue(revenue_range):
    """Parse revenue_range string to midpoint integer.
    Handles formats like: "$10M-$50M", "10M-50M", "$5M - $10M", "$1.5M-$3M"
    Returns None if unparseable.
    """
    if not revenue_range:
        return None
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    cleaned = str(revenue_range).replace("$", "").replace(",", "").upper()
    nums = []
    try:
        for part in cleaned.split("-"):
            part = part.strip()
            if not part:
                continue
            # Apply the suffix as a multiplier so decimals such as "1.5M"
            # are not truncated by appending zeros to the digits.
            factor = multipliers.get(part[-1], 1)
            if part[-1] in multipliers:
                part = part[:-1]
            nums.append(int(float(part) * factor))
    except ValueError:
        return None
    return sum(nums) // len(nums) if nums else None
def analyze_source_group(prospects, min_sample):
"""Analyze a group of prospects and return filter recommendations.
Returns recommendations for:
- industries: which to target, which to exclude
- employees: min/max employee count range
- titles: top-performing job titles
- revenue: min/max revenue range
- confidence: overall approval rate
"""
approved = [p for p in prospects if classify_status(p["status"]) == "approved"]
rejected = [p for p in prospects if classify_status(p["status"]) == "rejected"]
if len(approved) < min_sample:
return {
"status": "insufficient_data",
"sample_size": len(approved),
"min_required": min_sample,
"filters": {},
}
total_approved = len(approved)
total_rejected = max(len(rejected), 1)
# ── Industry Analysis ────────────────────────────────────────────────
approved_industries = Counter(p["industry"] for p in approved if p.get("industry"))
rejected_industries = Counter(p["industry"] for p in rejected if p.get("industry"))
# Industries with >10% of approvals = recommend targeting
rec_industries = [ind for ind, cnt in approved_industries.most_common(10)
if cnt / total_approved >= 0.10]
# Industries with >30% of rejections and <5% of approvals = recommend excluding
exclude_industries = [ind for ind, cnt in rejected_industries.most_common()
if cnt / total_rejected >= 0.30
and approved_industries.get(ind, 0) / total_approved < 0.05]
# ── Employee Count Analysis ──────────────────────────────────────────
approved_emp = sorted([p["employees"] for p in approved if p.get("employees")])
emp_filters = {}
if approved_emp:
p10 = approved_emp[max(0, len(approved_emp) // 10)]
p90 = approved_emp[min(len(approved_emp) - 1, len(approved_emp) * 9 // 10)]
emp_filters["min_employees"] = p10
emp_filters["max_employees"] = p90
# ── Title Analysis ───────────────────────────────────────────────────
approved_titles = Counter(p["title"] for p in approved if p.get("title"))
top_titles = [t for t, _ in approved_titles.most_common(8)]
# ── Revenue Analysis ─────────────────────────────────────────────────
approved_rev = [parse_revenue(p.get("revenue_range")) for p in approved]
approved_rev = sorted([r for r in approved_rev if r is not None])
rev_filters = {}
if approved_rev:
rev_filters["revenue_min"] = approved_rev[max(0, len(approved_rev) // 10)]
rev_filters["revenue_max"] = approved_rev[min(len(approved_rev) - 1,
len(approved_rev) * 9 // 10)]
# ── Compile Filters ──────────────────────────────────────────────────
approval_rate = total_approved / (total_approved + len(rejected))
filters = {**emp_filters, **rev_filters}
if rec_industries:
filters["industries"] = rec_industries
if exclude_industries:
filters["exclude_industries"] = exclude_industries
if top_titles:
filters["titles"] = top_titles
return {
"status": "ready",
"filters": filters,
"confidence": round(approval_rate, 3),
"sample_size": total_approved,
"rejected_count": len(rejected),
"approval_rate": round(approval_rate, 3),
}
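The employee and revenue bounds above are index-based 10th and 90th percentiles of the approved values, and the industry lists come from share thresholds. Both can be sketched in isolation; the helper names `percentile_bounds` and `industry_filters` are assumptions for illustration, while the thresholds (10%, 30%, 5%) match the analysis above.

```python
from collections import Counter


def percentile_bounds(values):
    """Return (p10, p90) using the same index arithmetic as the analyzer."""
    ordered = sorted(values)
    if not ordered:
        return None
    n = len(ordered)
    low = ordered[max(0, n // 10)]
    high = ordered[min(n - 1, n * 9 // 10)]
    return low, high


def industry_filters(approved, rejected):
    """Return (target, exclude) industry lists from two lists of labels."""
    total_approved = max(len(approved), 1)
    total_rejected = max(len(rejected), 1)
    approved_counts = Counter(approved)
    rejected_counts = Counter(rejected)
    # Target: at least 10% of all approvals.
    target = [ind for ind, cnt in approved_counts.most_common(10)
              if cnt / total_approved >= 0.10]
    # Exclude: at least 30% of rejections and under 5% of approvals.
    exclude = [ind for ind, cnt in rejected_counts.most_common()
               if cnt / total_rejected >= 0.30
               and approved_counts.get(ind, 0) / total_approved < 0.05]
    return target, exclude


if __name__ == "__main__":
    print(percentile_bounds(range(1, 21)))
    approved = ["SaaS"] * 6 + ["Ecommerce"] * 3 + ["Retail"]
    rejected = ["Gambling"] * 4 + ["SaaS"] * 4 + ["Retail"] * 2
    print(industry_filters(approved, rejected))
```

Trimming the top and bottom tenth keeps a single outlier account from stretching the recommended range.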
# ─── Main ────────────────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(description="ICP Learning Analyzer")
parser.add_argument("--config", help="Path to icp-config.json")
args = parser.parse_args()
config = load_config(args.config)
source_mapping = config.get("source_type_mapping", DEFAULT_CONFIG["source_type_mapping"])
min_sample = config.get("min_sample_size", DEFAULT_CONFIG["min_sample_size"])
prospects = fetch_prospects()
# Group by mapped source type
grouped = defaultdict(list)
for p in prospects:
mapped = source_mapping.get(p.get("source", ""), "other")
grouped[mapped].append(p)
recommendations = {}
for source_type in ["cold", "trigger", "warm", "revival"]:
group = grouped.get(source_type, [])
log.info(f"[{source_type}] {len(group)} total prospects")
recommendations[source_type] = analyze_source_group(group, min_sample)
output = {
"generated_at": datetime.now(timezone.utc).isoformat(),
"status": "complete" if prospects else "no_data",
"total_prospects_analyzed": len(prospects),
"recommendations": recommendations,
}
DATA_DIR.mkdir(parents=True, exist_ok=True)
with open(OUTPUT_PATH, "w") as f:
json.dump(output, f, indent=2)
log.info(f"Wrote recommendations to {OUTPUT_PATH}")
# Summary
print(f"\n📊 ICP Learning Analyzer Results")
print(f" Total prospects analyzed: {len(prospects)}")
print(f" {'─'*40}")
for src, rec in recommendations.items():
status = rec.get("status", "unknown")
sample = rec.get("sample_size", 0)
rate = rec.get("approval_rate", 0)
print(f" {src:10s}: {status:20s} (n={sample}, approval={rate:.0%})")
if rec.get("filters"):
f = rec["filters"]
if f.get("industries"):
print(f" → Target: {', '.join(f['industries'][:5])}")
if f.get("exclude_industries"):
print(f" → Exclude: {', '.join(f['exclude_industries'][:3])}")
if f.get("min_employees"):
print(f" → Employees: {f['min_employees']}-{f.get('max_employees', '?')}")
if __name__ == "__main__":
main()


@ -0,0 +1,410 @@
#!/usr/bin/env python3
"""
RB2B Instantly Router
Full pipeline: receives RB2B webhook data, runs suppression pipeline,
classifies visitor type, routes to correct Instantly campaign via API.
Can run as:
1. HTTP server (direct webhook endpoint)
2. Stdin processor (for testing / batch processing)
Usage:
python3 rb2b_instantly_router.py --serve --port 4100
echo '{"email":"..."}' | python3 rb2b_instantly_router.py
echo '{"email":"..."}' | python3 rb2b_instantly_router.py --dry-run
"""
import argparse
import json
import logging
import os
import re
import subprocess
import sys
from datetime import datetime, timezone
from http.server import HTTPServer, BaseHTTPRequestHandler
from pathlib import Path
from urllib.parse import urlparse
LOG = logging.getLogger("rb2b-router")
# ─── Configuration ───────────────────────────────────────────────────────────
BASE_DIR = Path(os.environ.get("BASE_DIR", Path(__file__).resolve().parent))
# Import the suppression pipeline (lives in same directory)
sys.path.insert(0, str(BASE_DIR))
from rb2b_suppression_pipeline import run_suppression_pipeline, record_enrollment
# Instantly API key — set via environment variable
INSTANTLY_API_KEY = os.environ.get("INSTANTLY_API_KEY", "")
# Campaign configuration file — maps campaign names to Instantly campaign UUIDs
# Format: {"campaigns": {"Campaign-Name": "uuid-here", ...}}
CAMPAIGNS_FILE = BASE_DIR / "data" / "campaigns.json"
def _load_campaigns():
"""Load campaign name → UUID mapping from config file."""
try:
data = json.loads(CAMPAIGNS_FILE.read_text())
return data.get("campaigns", {})
except Exception:
return {}
CAMPAIGNS = _load_campaigns()
# ─── Agency Detection ────────────────────────────────────────────────────────
# Keywords that signal the visitor works at a marketing agency.
# Useful for routing agency visitors to agency-specific campaigns
# (e.g., partnership offers vs. client acquisition).
AGENCY_KEYWORDS_COMPANY = [
"agency", "digital", "media", "creative", "studio", "consultancy",
"marketing agency", "seo agency", "advertising",
]
AGENCY_KEYWORDS_TITLE = ["agency", "consultant", "freelance"]
AGENCY_INDUSTRIES = ["marketing and advertising", "advertising services"]
# ─── Seniority Tiers (for company-level dedup) ──────────────────────────────
# Lower rank = more senior. When two people from the same company visit,
# keep the more senior one.
SENIORITY_ORDER = {
"founder": 1, "ceo": 1, "co-founder": 1, "president": 1,
"cmo": 2, "cto": 2, "coo": 2, "cfo": 2, "chief": 2,
"svp": 3, "evp": 3, "senior vice president": 3,
"vp": 4, "vice president": 4,
"director": 5, "senior director": 5, "managing director": 5,
"head of": 6,
"manager": 7, "senior manager": 7,
}
# ─── Intent Scoring ─────────────────────────────────────────────────────────
# Maps URL path patterns to intent scores. Customize for your site.
PAGE_INTENT_SCORES = {
"pricing": 90, "plans": 90, "contact": 85, "demo": 85,
"get-started": 85, "free-consultation": 85, "request-demo": 85,
"case-study": 70, "case-studies": 70, "results": 70,
"services": 65, "solutions": 65, "about": 60,
"blog": 30, "podcast": 25,
}
# Visitors below this score are skipped (blog-only readers, etc.)
MIN_INTENT_SCORE = int(os.environ.get("MIN_INTENT_SCORE", "50"))
def score_intent(pages):
"""Score visitor intent from pages visited. Returns 0-100."""
if not pages:
return 30 # default low
if isinstance(pages, str):
pages = [pages]
max_score = 20
for page in pages:
path = page.lower().strip("/")
for pattern, score in PAGE_INTENT_SCORES.items():
if pattern in path:
max_score = max(max_score, score)
return max_score
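To see how the scoring interacts with the `MIN_INTENT_SCORE` cutoff, here is a condensed sketch with a shortened pattern table. The `passes_intent_gate` helper and the example URLs are hypothetical; the scoring rules mirror the function above (best matching pattern wins, 30 when no pages are known, 20 when nothing matches).

```python
# Shortened copy of the pattern table above.
PAGE_INTENT_SCORES = {"pricing": 90, "demo": 85, "case-study": 70,
                      "services": 65, "blog": 30}


def score_intent(pages):
    """Return the highest score of any pattern found in the visited pages."""
    if not pages:
        return 30
    if isinstance(pages, str):
        pages = [pages]
    best = 20
    for page in pages:
        path = page.lower().strip("/")
        for pattern, score in PAGE_INTENT_SCORES.items():
            if pattern in path:
                best = max(best, score)
    return best


def passes_intent_gate(pages, min_score=50):
    """Hypothetical gate: True when the visitor should be routed."""
    return score_intent(pages) >= min_score


if __name__ == "__main__":
    visit = ["https://example.com/blog/seo-tips", "https://example.com/pricing"]
    print(score_intent(visit), passes_intent_gate(visit))
    print(score_intent("/blog/post"), passes_intent_gate("/blog/post"))
```

Note that an unknown page list scores higher (30) than a known list with no matching pattern (20); both fall below the default cutoff of 50.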
def is_agency(visitor):
"""Classify visitor as agency or non-agency based on multiple signals."""
signals = 0
company = (visitor.get("company_name") or visitor.get("company") or "").lower()
title = (visitor.get("job_title") or visitor.get("title") or "").lower()
industry = (visitor.get("industry") or "").lower()
size = visitor.get("company_size") or visitor.get("employees") or 0
if isinstance(size, str):
# Strip thousands separators first so "1,001-5,000" parses as 5000, not 0.
nums = re.findall(r'\d+', size.replace(",", ""))
size = int(nums[-1]) if nums else 0
for kw in AGENCY_KEYWORDS_COMPANY:
if kw in company:
signals += 1
break
for kw in AGENCY_KEYWORDS_TITLE:
if kw in title:
signals += 1
break
if industry in AGENCY_INDUSTRIES:
signals += 1
if size < 200 and ("marketing" in industry or "advertising" in industry):
signals += 1
# Require at least 2 signals to classify as agency
return signals >= 2
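Company size often arrives as a range string rather than a number. The sketch below shows one way to reduce such strings to an upper bound before the size check; `parse_company_size` is a hypothetical helper, and treating unknown sizes as 0 follows the classifier above.

```python
import re


def parse_company_size(size) -> int:
    """Return the upper bound of a size value such as "51-200" or "10,000+"."""
    if isinstance(size, (int, float)):
        return int(size)
    # Remove thousands separators, then take the last number in the string.
    nums = re.findall(r"\d+", str(size or "").replace(",", ""))
    return int(nums[-1]) if nums else 0


if __name__ == "__main__":
    for raw in ("51-200", "1,001-5,000", "10,000+", 150, None, "unknown"):
        print(repr(raw), parse_company_size(raw))
```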
def detect_source_site(visitor):
"""Determine which of your sites the visitor came from.
Customize the domain checks for your own properties.
"""
pages = visitor.get("pages_visited") or visitor.get("page_views") or visitor.get("source_url") or ""
if isinstance(pages, list):
pages = " ".join(pages)
pages = pages.lower()
# Add your site domains here
# if "product-b.com" in pages:
# return "product-b.com"
# elif "product-a.com" in pages:
# return "product-a.com"
return os.environ.get("DEFAULT_SOURCE_SITE", "your-site.com")
def route_to_campaign(source_site, agency):
"""Determine the correct Instantly campaign based on source site + agency classification.
Customize campaign names to match your CAMPAIGNS_FILE config.
Returns a campaign name string that maps to a UUID in campaigns.json.
"""
# Example routing logic — customize for your campaigns:
if agency:
return os.environ.get("CAMPAIGN_AGENCY", "Agency-Default")
return os.environ.get("CAMPAIGN_GENERAL", "General-Default")
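def get_seniority_rank(title):
    """Get seniority rank (lower = more senior). Returns 99 for unknown."""
    title_lower = (title or "").lower()
    # Match whole words only and keep the best (lowest) rank, so that
    # "director" does not match "cto" and "coordinator" does not match "coo".
    ranks = [
        rank for keyword, rank in SENIORITY_ORDER.items()
        if re.search(rf"\b{re.escape(keyword)}\b", title_lower)
    ]
    return min(ranks, default=99)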
def ensure_campaign_active(campaign_name):
"""Check if campaign is active; if paused, activate it via Instantly API."""
campaign_id = CAMPAIGNS.get(campaign_name)
if not campaign_id or not INSTANTLY_API_KEY:
return
try:
check = subprocess.run(
["curl", "-s", f"https://api.instantly.ai/api/v2/campaigns/{campaign_id}",
"-H", f"Authorization: Bearer {INSTANTLY_API_KEY}"],
capture_output=True, text=True, timeout=10
)
data = json.loads(check.stdout)
status = data.get("status", 0)
if status != 1: # 1 = active
LOG.info(f" 🔄 Campaign {campaign_name} is paused, activating...")
subprocess.run(
["curl", "-s", "-X", "POST",
f"https://api.instantly.ai/api/v2/campaigns/{campaign_id}/activate",
"-H", f"Authorization: Bearer {INSTANTLY_API_KEY}",
"-H", "Content-Type: application/json",
"-d", "{}"],
capture_output=True, text=True, timeout=10
)
except Exception as e:
LOG.warning(f" ⚠️ Could not check/activate campaign {campaign_name}: {e}")
def add_to_instantly(visitor, campaign_name):
"""Add lead to Instantly campaign via API."""
campaign_id = CAMPAIGNS.get(campaign_name)
if not campaign_id:
LOG.error(f"Campaign not found in config: {campaign_name}")
return False
if not INSTANTLY_API_KEY:
LOG.error("INSTANTLY_API_KEY not set")
return False
ensure_campaign_active(campaign_name)
email = visitor.get("email") or visitor.get("business_email")
first_name = visitor.get("first_name") or (
visitor.get("name", "").split()[0] if visitor.get("name") else "there"
)
company = visitor.get("company_name") or visitor.get("company") or ""
# Format page visited for personalization
pages = visitor.get("pages_visited") or visitor.get("page_views") or []
if isinstance(pages, str):
pages = [pages]
page_display = pages[0] if pages else ""
if "://" in page_display:
page_display = urlparse(page_display).path
lead_data = {
"campaign": campaign_id,
"email": email,
"first_name": first_name,
"last_name": visitor.get("last_name", ""),
"company_name": company,
"website": visitor.get("company_website") or visitor.get("website") or "",
"custom_variables": {
"companyName": company,
"firstName": first_name,
"title": visitor.get("job_title") or visitor.get("title") or "",
"industry": visitor.get("industry") or "",
"pageVisited": page_display,
},
}
result = subprocess.run(
["curl", "-s", "-X", "POST", "https://api.instantly.ai/api/v2/leads",
"-H", f"Authorization: Bearer {INSTANTLY_API_KEY}",
"-H", "Content-Type: application/json",
"-d", json.dumps(lead_data)],
capture_output=True, text=True, timeout=15
)
try:
resp = json.loads(result.stdout)
if resp.get("email") or resp.get("id"):
LOG.info(f" ✅ Added to Instantly: {email} → {campaign_name}")
return True
else:
LOG.warning(f" ⚠️ Instantly response: {result.stdout[:200]}")
return False
except Exception:
LOG.error(f" ❌ Instantly error: {result.stdout[:200]}")
return False
def process_visitor(visitor, dry_run=False):
"""Full pipeline: score → suppress → classify → route → enroll."""
email = visitor.get("email") or visitor.get("business_email")
if not email:
return {"status": "skipped", "reason": "no email"}
company = visitor.get("company_name") or visitor.get("company") or ""
title = visitor.get("job_title") or visitor.get("title") or ""
domain = email.split("@")[1].lower() if "@" in email else ""
LOG.info(f"\n{'─'*50}")
LOG.info(f"Processing: {email} ({company}, {title})")
# 1. Intent scoring
pages = visitor.get("pages_visited") or visitor.get("page_views") or []
intent_score = score_intent(pages)
if intent_score < MIN_INTENT_SCORE:
LOG.info(f" ⏭️ Low intent: {intent_score} < {MIN_INTENT_SCORE}")
return {"status": "skipped", "reason": f"low intent ({intent_score})"}
# 2. Suppression pipeline
suppressed, layers = run_suppression_pipeline(email, company, domain)
if suppressed:
last_reason = layers[-1][2] if layers else "unknown"
LOG.info(f" 🚫 Suppressed: {last_reason}")
return {"status": "suppressed", "reason": last_reason}
# 3. Classify agency
agency = is_agency(visitor)
# 4. Detect source site
source_site = detect_source_site(visitor)
# 5. Route to campaign
campaign = route_to_campaign(source_site, agency)
LOG.info(f" 📍 Source: {source_site} | Agency: {agency} | Campaign: {campaign}")
LOG.info(f" 📊 Intent: {intent_score} | Seniority: {get_seniority_rank(title)}")
if dry_run:
return {
"status": "dry_run",
"email": email,
"campaign": campaign,
"intent_score": intent_score,
"agency": agency,
"source_site": source_site,
}
# 6. Add to Instantly
success = add_to_instantly(visitor, campaign)
if success:
record_enrollment(email, domain, campaign)
return {"status": "enrolled", "email": email, "campaign": campaign}
else:
return {"status": "failed", "email": email, "campaign": campaign}
# ─── Webhook Server ──────────────────────────────────────────────────────────
class WebhookHandler(BaseHTTPRequestHandler):
"""HTTP handler for RB2B webhook."""
dry_run = False
def do_POST(self):
length = int(self.headers.get('Content-Length', 0))
if length > 1_000_000:
self.send_response(413)
self.end_headers()
return
body = self.rfile.read(length)
try:
payload = json.loads(body)
except Exception:
self.send_response(400)
self.end_headers()
return
visitors = payload if isinstance(payload, list) else [payload]
results = [process_visitor(v, dry_run=self.dry_run) for v in visitors]
self.send_response(200)
self.send_header('Content-Type', 'application/json')
self.end_headers()
self.wfile.write(json.dumps({
"processed": len(results),
"enrolled": sum(1 for r in results if r["status"] == "enrolled"),
"suppressed": sum(1 for r in results if r["status"] == "suppressed"),
"skipped": sum(1 for r in results if r["status"] == "skipped"),
}).encode())
def do_GET(self):
self.send_response(200)
self.send_header('Content-Type', 'application/json')
self.end_headers()
self.wfile.write(json.dumps({"status": "ok", "service": "rb2b-instantly-router"}).encode())
def log_message(self, fmt, *args):
LOG.info(fmt % args)
# ─── CLI ─────────────────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(description="RB2B → Instantly Router")
parser.add_argument("--serve", action="store_true", help="Run as HTTP webhook server")
parser.add_argument("--port", type=int, default=4100, help="Server port (default: 4100)")
parser.add_argument("--dry-run", action="store_true", help="Score and classify without enrolling")
parser.add_argument("-v", "--verbose", action="store_true")
args = parser.parse_args()
logging.basicConfig(
level=logging.DEBUG if args.verbose else logging.INFO,
format="%(asctime)s %(message)s", datefmt="%H:%M:%S",
)
if args.serve:
WebhookHandler.dry_run = args.dry_run
server = HTTPServer(("0.0.0.0", args.port), WebhookHandler)
LOG.info(f"🚀 RB2B → Instantly router on port {args.port} (dry_run={args.dry_run})")
try:
server.serve_forever()
except KeyboardInterrupt:
server.shutdown()
else:
payload = json.load(sys.stdin)
visitors = payload if isinstance(payload, list) else [payload]
for v in visitors:
result = process_visitor(v, dry_run=args.dry_run)
print(json.dumps(result, indent=2))
if __name__ == "__main__":
main()
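
Both `is_agency` above and the ICP check in the ingest script read a company-size field that may arrive as an integer or as a range string. The following is a minimal sketch of that parsing as a standalone helper. The name `parse_company_size` and the sample inputs are illustrative assumptions, not part of the original scripts. It keeps the upper bound of a range and drops thousands separators first, because `re.findall(r'\d+', "1,001-5,000")` would otherwise end with `"000"`.

```python
import re


def parse_company_size(raw):
    """Return the upper bound of a company-size value as an int (0 if unknown).

    Hypothetical helper: accepts ints, numeric strings, and range strings
    such as "51-200" or "1,001-5,000".
    """
    if raw is None or isinstance(raw, bool):
        return 0
    if isinstance(raw, (int, float)):
        return int(raw)
    if isinstance(raw, str):
        # Remove thousands separators before extracting digit runs
        nums = re.findall(r"\d+", raw.replace(",", ""))
        return int(nums[-1]) if nums else 0
    return 0


if __name__ == "__main__":
    for sample in (120, "51-200", "1,001-5,000", "10,000+", "unknown", None):
        print(repr(sample), "->", parse_company_size(sample))
```

Returning 0 for unknown values mirrors the scripts' existing default, which means an unknown size still passes a `size < 200` check; callers that need to tell "unknown" from "small" may prefer to return `None` instead.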


@ -0,0 +1,329 @@
#!/usr/bin/env python3
"""
RB2B 5-Layer Suppression Pipeline
Checks a visitor against multiple suppression layers before enrolling in outbound campaigns.
Layers: CRM → Outbound Platform → Payment Provider → Product Analytics → Internal Blocklist
Prevents you from cold-emailing existing customers, active leads, competitors, or
people you already contacted recently.
Usage:
# Check a single email
python3 rb2b_suppression_pipeline.py --email john@acme.com --company "Acme Inc"
# Dry run (show what would happen)
python3 rb2b_suppression_pipeline.py --email john@acme.com --dry-run
"""
import argparse
import json
import logging
import os
import subprocess
import sys
from datetime import datetime, timezone, timedelta
from pathlib import Path
LOG = logging.getLogger("rb2b-suppression")
# ─── Configuration ───────────────────────────────────────────────────────────
# Base directory — override with BASE_DIR env var or defaults to script parent
BASE_DIR = Path(os.environ.get("BASE_DIR", Path(__file__).resolve().parent))
DATA_DIR = BASE_DIR / "data"
# API keys loaded from environment
OUTBOUND_API_KEY = os.environ.get("INSTANTLY_API_KEY", "")
CRM_API_KEY = os.environ.get("HUBSPOT_API_KEY", "")
# File paths for local data caches
BLOCKLIST_FILE = DATA_DIR / "blocklist.json"
ENROLLED_FILE = DATA_DIR / "enrolled.json"
STRIPE_CACHE_FILE = DATA_DIR / "stripe-customers.json"
ACTIVE_USERS_CACHE_FILE = DATA_DIR / "active-users.json"
# ─── Competitor domains to auto-suppress ─────────────────────────────────────
# Add your competitors' email domains here
COMPETITOR_DOMAINS = {
# Example: "competitor1.com", "competitor2.com",
}
# ─── Personal email domains (skip — no business value) ──────────────────────
PERSONAL_DOMAINS = {
"gmail.com", "yahoo.com", "hotmail.com", "outlook.com",
"icloud.com", "protonmail.com", "aol.com", "live.com",
"me.com", "mail.com", "ymail.com",
}
# ─── Company dedup window (days) ────────────────────────────────────────────
# Only enroll 1 contact per company domain within this window
COMPANY_DEDUP_WINDOW_DAYS = int(os.environ.get("COMPANY_DEDUP_WINDOW_DAYS", "7"))
def _curl_json(method, url, headers=None, body=None):
"""Make HTTP request via curl, return parsed JSON."""
cmd = ["curl", "-s", "-X", method, url]
for k, v in (headers or {}).items():
cmd.extend(["-H", f"{k}: {v}"])
if body:
cmd.extend(["-d", json.dumps(body)])
try:
result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
return json.loads(result.stdout) if result.stdout.strip() else None
except Exception as e:
LOG.warning(f"API error: {e}")
return None
# ─── Layer 0: Personal Email Filter ─────────────────────────────────────────
def check_personal_email(email):
"""Filter personal email domains (gmail, yahoo, etc.)."""
domain = email.split("@")[1].lower() if "@" in email else ""
if domain in PERSONAL_DOMAINS:
return True, f"personal email domain: {domain}"
return False, "business email"
# ─── Layer 1: CRM Check (HubSpot) ───────────────────────────────────────────
def check_crm(email, domain=None):
"""Check if contact exists in your CRM. Uses HubSpot API."""
if not CRM_API_KEY:
LOG.warning("No CRM API key available, skipping CRM layer")
return False, "crm key unavailable (skipped)"
data = _curl_json("POST", "https://api.hubapi.com/crm/v3/objects/contacts/search",
headers={
"Authorization": f"Bearer {CRM_API_KEY}",
"Content-Type": "application/json",
},
body={
"filterGroups": [{
"filters": [{
"propertyName": "email",
"operator": "EQ",
"value": email,
}]
}],
"limit": 1,
}
)
if data and data.get("total", 0) > 0:
return True, f"exists in CRM (contact ID: {data['results'][0].get('id')})"
return False, "not in CRM"
# ─── Layer 2: Outbound Platform Check (Instantly) ───────────────────────────
def check_outbound_platform(email):
    """Check if email is already in any outbound campaign (90-day window)."""
    from urllib.parse import quote  # local import keeps this function self-contained

    if not OUTBOUND_API_KEY:
        LOG.warning("No outbound API key available, skipping outbound layer")
        return False, "outbound key unavailable (skipped)"
    # Percent-encode the address so characters such as "+" survive the query string
    data = _curl_json("GET",
        f"https://api.instantly.ai/api/v2/leads?email={quote(email, safe='')}&limit=10",
        headers={"Authorization": f"Bearer {OUTBOUND_API_KEY}"}
    )
    if data and isinstance(data, dict):
        cutoff = datetime.now(timezone.utc) - timedelta(days=90)
        for lead in data.get("items") or []:
            created = lead.get("timestamp_created") or ""
            campaign = lead.get("campaign_name", "unknown")
            try:
                dt = datetime.fromisoformat(created.replace("Z", "+00:00"))
                if dt > cutoff:
                    return True, f"active in outbound campaign: {campaign}"
            except (ValueError, TypeError):
                # Missing, malformed, or timezone-naive timestamp: suppress to be safe
                return True, f"exists in outbound (campaign: {campaign})"
    return False, "not in outbound platform"
# ─── Layer 3: Payment Provider Check (Stripe) ───────────────────────────────
def check_payment_provider(email, domain=None):
    """Check if email/domain matches a paying customer. Uses cached Stripe data."""
    if not STRIPE_CACHE_FILE.exists():
        LOG.info("Payment provider cache not found, skipping layer")
        return False, "payment check skipped (no cache)"
    try:
        customers = json.loads(STRIPE_CACHE_FILE.read_text())
        # `or ""` guards against records whose email is null; without it a single
        # bad record would raise and silently disable the whole layer
        customer_emails = {(c.get("email") or "").lower() for c in customers}
        domains = {e.split("@")[1] for e in customer_emails if "@" in e}
        if email.lower() in customer_emails:
            return True, "paying customer (exact email match)"
        if domain and domain.lower() in domains:
            return True, f"paying customer (domain match: {domain})"
    except Exception as e:
        LOG.warning(f"Payment provider cache unreadable, skipping layer: {e}")
    return False, "not a paying customer"
# ─── Layer 4: Product Analytics Check (Mixpanel/Amplitude) ──────────────────
def check_product_analytics(email):
    """Check if user has been active in product recently. Uses cached data."""
    if not ACTIVE_USERS_CACHE_FILE.exists():
        LOG.info("Product analytics cache not found, skipping layer")
        return False, "product analytics check skipped (no cache)"
    try:
        users = json.loads(ACTIVE_USERS_CACHE_FILE.read_text())
        active_emails = {(u.get("email") or "").lower() for u in users}
        if email.lower() in active_emails:
            return True, "active product user (last 30 days)"
    except Exception as e:
        LOG.warning(f"Product analytics cache unreadable, skipping layer: {e}")
    return False, "not an active product user"
# ─── Layer 5: Blocklist (competitors + manual) ──────────────────────────────
def check_blocklist(email, domain=None):
"""Check against competitor domains and manual blocklist."""
email_domain = email.split("@")[1].lower() if "@" in email else ""
if email_domain in COMPETITOR_DOMAINS:
return True, f"competitor domain: {email_domain}"
if BLOCKLIST_FILE.exists():
try:
blocklist = json.loads(BLOCKLIST_FILE.read_text())
blocked_emails = {e.lower() for e in blocklist.get("emails", [])}
blocked_domains = {d.lower() for d in blocklist.get("domains", [])}
if email.lower() in blocked_emails:
return True, "manually blocklisted (email)"
if email_domain in blocked_domains:
return True, f"manually blocklisted (domain: {email_domain})"
except Exception:
pass
return False, "not blocklisted"
# ─── Company-Level Deduplication ─────────────────────────────────────────────
def check_company_dedup(email, company_domain, window_days=None):
"""Only allow 1 contact per company domain within a rolling window."""
window_days = window_days or COMPANY_DEDUP_WINDOW_DAYS
if not ENROLLED_FILE.exists():
return False, "no prior enrollments"
try:
enrolled = json.loads(ENROLLED_FILE.read_text())
cutoff = (datetime.now(timezone.utc) - timedelta(days=window_days)).isoformat()
for entry in enrolled:
if (entry.get("domain") == company_domain and
entry.get("enrolled_at", "") > cutoff and
entry.get("email") != email):
return True, (f"company already enrolled: {entry.get('email')} "
f"on {entry.get('enrolled_at', '')[:10]}")
except Exception:
pass
return False, "no company dedup conflict"
# ─── Pipeline Orchestrator ───────────────────────────────────────────────────
def run_suppression_pipeline(email, company=None, domain=None, dry_run=False):
"""Run all suppression layers in sequence.
Returns:
(should_suppress: bool, results: list of (layer_name, suppressed, reason))
"""
if not domain and "@" in email:
domain = email.split("@")[1].lower()
results = []
layers = [
("Personal Email Filter", lambda: check_personal_email(email)),
("CRM Check", lambda: check_crm(email, domain)),
("Outbound Platform", lambda: check_outbound_platform(email)),
("Payment Provider", lambda: check_payment_provider(email, domain)),
("Product Analytics", lambda: check_product_analytics(email)),
("Blocklist", lambda: check_blocklist(email, domain)),
("Company Dedup", lambda: check_company_dedup(email, domain)),
]
for layer_name, check_fn in layers:
suppressed, reason = check_fn()
results.append((layer_name, suppressed, reason))
if suppressed:
return True, results
return False, results
def record_enrollment(email, domain, campaign):
"""Record an enrollment for company-level dedup tracking."""
try:
enrolled = json.loads(ENROLLED_FILE.read_text()) if ENROLLED_FILE.exists() else []
except Exception:
enrolled = []
enrolled.append({
"email": email,
"domain": domain,
"campaign": campaign,
"enrolled_at": datetime.now(timezone.utc).isoformat(),
})
# Keep only last 90 days
cutoff = (datetime.now(timezone.utc) - timedelta(days=90)).isoformat()
enrolled = [e for e in enrolled if e.get("enrolled_at", "") > cutoff]
ENROLLED_FILE.parent.mkdir(parents=True, exist_ok=True)
ENROLLED_FILE.write_text(json.dumps(enrolled, indent=2))
# ─── CLI ─────────────────────────────────────────────────────────────────────
def main():
    parser = argparse.ArgumentParser(description="RB2B Suppression Pipeline")
    parser.add_argument("--email", required=True)
    parser.add_argument("--company", default="")
    parser.add_argument("--domain", default="")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--verbose", "-v", action="store_true")
    args = parser.parse_args()
    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(message)s",
    )
    suppressed, results = run_suppression_pipeline(
        args.email, args.company, args.domain, args.dry_run
    )
    separator = "─" * 50
    print(f"\n📋 Suppression check for: {args.email}")
    print(separator)
    for layer_name, was_suppressed, reason in results:
        icon = "🚫" if was_suppressed else "✅"
        print(f" {icon} {layer_name}: {reason}")
    print(separator)
    if suppressed:
        print(" 🚫 SUPPRESSED — do not enroll")
    else:
        print(" ✅ CLEAR — eligible for enrollment")
    # Exit code 1 signals "suppressed" to calling shell scripts
    return 0 if not suppressed else 1
if __name__ == "__main__":
sys.exit(main())
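
`check_company_dedup` above compares `enrolled_at` values as ISO-8601 strings, which works while every entry is written in the same UTC format. The following is a minimal sketch of the same rolling-window rule using parsed datetimes instead, which tolerates malformed or timezone-naive entries. The function name `has_recent_company_enrollment`, the `now` parameter, and the sample records are assumptions made for illustration; they are not part of the original script.

```python
from datetime import datetime, timedelta, timezone


def has_recent_company_enrollment(enrolled, email, domain, window_days=7, now=None):
    """Return True if a different contact at `domain` was enrolled within the window.

    Hypothetical variant of the dedup check: `enrolled` is a list of dicts with
    "email", "domain", and ISO-8601 "enrolled_at" keys.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    for entry in enrolled:
        # Only other contacts at the same company domain count as conflicts
        if entry.get("domain") != domain or entry.get("email") == email:
            continue
        try:
            enrolled_at = datetime.fromisoformat(entry.get("enrolled_at") or "")
        except ValueError:
            continue  # skip entries with a missing or malformed timestamp
        if enrolled_at.tzinfo is None:
            # Assumption: timestamps without an offset were recorded in UTC
            enrolled_at = enrolled_at.replace(tzinfo=timezone.utc)
        if enrolled_at > cutoff:
            return True
    return False


if __name__ == "__main__":
    fixed_now = datetime(2026, 3, 27, tzinfo=timezone.utc)
    log = [{"email": "a@acme.com", "domain": "acme.com",
            "enrolled_at": "2026-03-25T10:00:00+00:00"}]
    print(has_recent_company_enrollment(log, "z@acme.com", "acme.com", now=fixed_now))
```

Passing `now` explicitly keeps the check deterministic in tests; in production the default of the current UTC time applies.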


@ -0,0 +1,419 @@
#!/usr/bin/env python3
"""
RB2B Webhook Ingestion Server
Receives RB2B webhook payloads (via Zapier/Make or direct integration),
scores visitor intent based on pages visited, checks ICP fit, and outputs
structured signals for downstream processing.
Can run as:
1. HTTP webhook server (direct RB2B integration)
2. Stdin processor (for testing / batch processing)
Usage:
# Process a single webhook payload from stdin
echo '{"email":"john@acme.com",...}' | python3 rb2b_webhook_ingest.py
# Process a batch file (one JSON per line)
python3 rb2b_webhook_ingest.py --batch webhooks.jsonl
# Run as HTTP webhook server
python3 rb2b_webhook_ingest.py --serve --port 4100
# Dry run (show scoring without side effects)
python3 rb2b_webhook_ingest.py --dry-run < payload.json
"""
import argparse
import json
import logging
import os
import re
import sys
from datetime import datetime, timezone
from http.server import HTTPServer, BaseHTTPRequestHandler
from pathlib import Path
from urllib.parse import urlparse
# ─── Configuration ───────────────────────────────────────────────────────────
LOG = logging.getLogger("rb2b-ingest")
BASE_DIR = Path(os.environ.get("BASE_DIR", Path(__file__).resolve().parent))
OUTPUT_DIR = BASE_DIR / "data" / "signals"
# ─── Intent Scoring ─────────────────────────────────────────────────────────
# Maps URL path patterns to intent scores (0-100).
# Higher score = stronger purchase intent.
# Customize these for your site structure.
PAGE_INTENT_SCORES = {
# Hot pages — active buying signals
"pricing": 90,
"plans": 90,
"contact": 85,
"demo": 85,
"request-demo": 85,
"book-a-call": 85,
"get-started": 85,
"free-consultation": 85,
"proposal": 80,
"quote": 80,
# Warm pages — research/evaluation
"case-study": 70,
"case-studies": 70,
"results": 70,
"testimonials": 65,
"about": 60,
"team": 55,
"services": 65,
"solutions": 65,
# Service pages — customize for your offerings
# "your-service-1": 75,
# "your-service-2": 75,
# Cool pages — awareness/education
"blog": 30,
"podcast": 25,
"webinar": 40,
"resource": 35,
"guide": 35,
"ebook": 40,
}
# Minimum intent score to process (skip pure blog readers)
MIN_INTENT_SCORE = int(os.environ.get("MIN_INTENT_SCORE", "50"))
# ─── ICP Filters ────────────────────────────────────────────────────────────
# Title keywords that indicate decision-maker seniority
ICP_SENIORITY_KEYWORDS = [
"cmo", "vp", "vice president", "director", "head of", "chief",
"svp", "evp", "founder", "ceo", "coo", "cto", "partner",
"senior director", "managing director", "president",
]
# Minimum company size (employees) for ICP match
ICP_MIN_COMPANY_SIZE = int(os.environ.get("ICP_MIN_COMPANY_SIZE", "50"))
def score_pages(pages_visited):
"""Score visitor intent based on pages they viewed.
Args:
pages_visited: list of URL strings or page paths
Returns:
tuple: (max_score, hot_pages list, page_summary string)
"""
if not pages_visited:
return 0, [], "no pages tracked"
scores = []
hot_pages = []
for page_url in pages_visited:
try:
path = urlparse(page_url).path if "://" in page_url else page_url
except Exception:
path = page_url
path = path.lower().strip("/")
best_score = 20 # default for unknown pages
matched_pattern = None
for pattern, score in PAGE_INTENT_SCORES.items():
if pattern in path:
if score > best_score:
best_score = score
matched_pattern = pattern
scores.append(best_score)
if best_score >= 65:
hot_pages.append({
"page": path or "/",
"score": best_score,
"pattern": matched_pattern or "unknown",
})
max_score = max(scores) if scores else 0
page_count = len(pages_visited)
summary = f"{page_count} pages, max intent {max_score}"
if hot_pages:
summary += f", hot: {', '.join(p['pattern'] for p in hot_pages[:3])}"
return max_score, hot_pages, summary
def check_icp_match(visitor):
    """Check if visitor matches ICP criteria.

    Returns:
        tuple: (is_match: bool, reason: str)
    """
    title = (visitor.get("job_title") or visitor.get("title") or "").lower()
    company_size = visitor.get("company_size") or visitor.get("employees") or 0
    if isinstance(company_size, str):
        # Strip thousands separators first so "1,001-5,000" yields 5000, not 0
        nums = re.findall(r'\d+', company_size.replace(",", ""))
        company_size = int(nums[-1]) if nums else 0
    # Match whole words only, so "coo" does not fire on "coordinator"
    seniority_match = any(
        re.search(rf"\b{re.escape(kw)}\b", title) for kw in ICP_SENIORITY_KEYWORDS
    )
    size_match = company_size >= ICP_MIN_COMPANY_SIZE
    if seniority_match and size_match:
        return True, f"ICP match: {title}, {company_size}+ employees"
    elif seniority_match:
        return True, f"seniority match: {title} (company size unknown/small)"
    elif size_match:
        return False, f"size match but low seniority: {title}"
    else:
        return False, f"no ICP match: {title}, ~{company_size} employees"
def extract_domain(visitor):
"""Extract company domain from visitor data."""
domain = visitor.get("company_domain") or visitor.get("domain") or ""
if domain:
return domain.lower().replace("www.", "")
email = visitor.get("email") or visitor.get("business_email") or ""
if email and "@" in email:
domain = email.split("@")[1].lower()
generic = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com", "aol.com"}
if domain not in generic:
return domain
website = visitor.get("company_website") or visitor.get("website") or ""
if website:
try:
parsed = urlparse(website if "://" in website else f"https://{website}")
return parsed.netloc.lower().replace("www.", "")
except Exception:
pass
return None
def process_visitor(visitor, dry_run=False, source_site="your-site.com"):
"""Process a single RB2B visitor webhook payload.
Args:
visitor: dict with RB2B webhook data
dry_run: if True, don't write output files
source_site: which site the visitor came from
Returns:
dict with processing result
"""
# Basic input validation
if not isinstance(visitor, dict):
return {"status": "error", "reason": "invalid payload type"}
# Extract key fields
name = visitor.get("name") or visitor.get("full_name") or "Unknown"
first_name = visitor.get("first_name") or (name.split()[0] if name != "Unknown" else "there")
email = visitor.get("email") or visitor.get("business_email")
title = visitor.get("job_title") or visitor.get("title") or "Unknown role"
company = visitor.get("company_name") or visitor.get("company") or "Unknown company"
linkedin = visitor.get("linkedin_url") or visitor.get("linkedin_profile") or ""
pages = visitor.get("pages_visited") or visitor.get("page_views") or visitor.get("pages") or []
if isinstance(pages, str):
pages = [pages]
domain = extract_domain(visitor)
# Score intent
intent_score, hot_pages, page_summary = score_pages(pages)
# Check ICP
is_icp, icp_reason = check_icp_match(visitor)
# Determine priority
if intent_score >= 80 and is_icp:
priority = "high"
elif intent_score >= 60 or is_icp:
priority = "medium"
else:
priority = "low"
result = {
"name": name,
"email": email,
"title": title,
"company": company,
"domain": domain,
"intent_score": intent_score,
"is_icp": is_icp,
"icp_reason": icp_reason,
"priority": priority,
"page_summary": page_summary,
"source_site": source_site,
}
# Skip low-intent visitors
if intent_score < MIN_INTENT_SCORE and not is_icp:
result["status"] = "skipped"
result["reason"] = f"below threshold: intent {intent_score} < {MIN_INTENT_SCORE}, not ICP"
LOG.info(f"⏭️ Skipped {name} ({company}): {result['reason']}")
return result
# Build structured signal output
hot_page_str = ""
if hot_pages:
hot_page_str = f" (viewed: {', '.join(p['pattern'] for p in hot_pages[:2])})"
signal = {
"type": "site_visit",
"topic": f"Website visitor: {name}, {title} at {company}{hot_page_str} on {source_site}",
"priority": priority,
"domain": domain,
"data": {
"name": name,
"first_name": first_name,
"email": email,
"title": title,
"company": company,
"linkedin": linkedin,
"pages_visited": pages,
"intent_score": intent_score,
"hot_pages": hot_pages,
"is_icp": is_icp,
"icp_reason": icp_reason,
"source_site": source_site,
},
"created_at": datetime.now(timezone.utc).isoformat(),
}
if dry_run:
result["status"] = "dry_run"
result["signal"] = signal
LOG.info(f"🔍 [DRY RUN] Would create signal: {signal['topic']}")
else:
# Write signal to output directory as JSON
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
ts = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
safe_email = (email or "unknown").replace("@", "_at_")
signal_file = OUTPUT_DIR / f"signal_{ts}_{safe_email}.json"
signal_file.write_text(json.dumps(signal, indent=2))
result["status"] = "signal_created"
result["signal_file"] = str(signal_file)
LOG.info(
f"{'✅' if result['status'] == 'signal_created' else '📋'} "
f"{name} ({company}) — intent:{intent_score} icp:{is_icp} → {result['status']}"
)
return result
# ─── Webhook Server ──────────────────────────────────────────────────────────
class RB2BWebhookHandler(BaseHTTPRequestHandler):
"""HTTP handler for direct RB2B webhook integration."""
dry_run = False
source_site = "your-site.com"
def do_POST(self):
content_length = int(self.headers.get('Content-Length', 0))
if content_length > 1_000_000: # 1MB limit
self.send_response(413)
self.end_headers()
return
body = self.rfile.read(content_length)
try:
payload = json.loads(body)
except json.JSONDecodeError:
self.send_response(400)
self.end_headers()
self.wfile.write(b'{"error":"invalid json"}')
return
visitors = payload if isinstance(payload, list) else [payload]
results = [process_visitor(v, dry_run=self.dry_run, source_site=self.source_site)
for v in visitors]
self.send_response(200)
self.send_header('Content-Type', 'application/json')
self.end_headers()
self.wfile.write(json.dumps({
"processed": len(results),
"signals_created": sum(1 for r in results if r.get("status") == "signal_created"),
"skipped": sum(1 for r in results if r.get("status") == "skipped"),
}).encode())
def do_GET(self):
"""Health check endpoint."""
self.send_response(200)
self.send_header('Content-Type', 'application/json')
self.end_headers()
self.wfile.write(json.dumps({"status": "ok", "service": "rb2b-webhook-ingest"}).encode())
def log_message(self, format, *args):
LOG.info(format % args)
# ─── CLI ─────────────────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(description="RB2B Webhook → Signal Pipeline")
parser.add_argument("--batch", help="Process batch file (one JSON per line)")
parser.add_argument("--serve", action="store_true", help="Run as HTTP webhook server")
parser.add_argument("--port", type=int, default=4100, help="Server port (default: 4100)")
parser.add_argument("--dry-run", action="store_true", help="Don't write signal files")
parser.add_argument("--source-site", default="your-site.com", help="Source site name")
parser.add_argument("--verbose", "-v", action="store_true")
args = parser.parse_args()
logging.basicConfig(
level=logging.DEBUG if args.verbose else logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
if args.serve:
RB2BWebhookHandler.dry_run = args.dry_run
RB2BWebhookHandler.source_site = args.source_site
server = HTTPServer(("0.0.0.0", args.port), RB2BWebhookHandler)
LOG.info(f"🚀 RB2B webhook server listening on port {args.port}")
LOG.info(f" POST http://localhost:{args.port}/ to ingest visitors")
LOG.info(f" Dry run: {args.dry_run}")
try:
server.serve_forever()
except KeyboardInterrupt:
LOG.info("Shutting down...")
server.shutdown()
elif args.batch:
results = []
with open(args.batch) as f:
for line in f:
line = line.strip()
if not line:
continue
try:
visitor = json.loads(line)
result = process_visitor(visitor, dry_run=args.dry_run, source_site=args.source_site)
results.append(result)
except json.JSONDecodeError as e:
LOG.error(f"Invalid JSON line: {e}")
created = sum(1 for r in results if r.get("status") == "signal_created")
skipped = sum(1 for r in results if r.get("status") == "skipped")
print(f"\n📊 Batch complete: {len(results)} processed, {created} signals, {skipped} skipped")
else:
try:
payload = json.load(sys.stdin)
except json.JSONDecodeError as e:
LOG.error(f"Invalid JSON on stdin: {e}")
sys.exit(1)
visitors = payload if isinstance(payload, list) else [payload]
for visitor in visitors:
result = process_visitor(visitor, dry_run=args.dry_run, source_site=args.source_site)
print(json.dumps(result, indent=2))
if __name__ == "__main__":
main()
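
`score_pages` above tests each pattern as a substring of the whole path, so a blog post whose slug happens to contain `about` or `contact` picks up the higher score. One possible variation, sketched here under stated assumptions, compares patterns to whole path segments instead. The function name `score_path_by_segment` and the trimmed score table are illustrative only; whether segment matching suits a given site depends on its URL structure.

```python
from urllib.parse import urlparse

# Trimmed, illustrative copy of the score table (an assumption, not the full config)
SCORES = {"pricing": 90, "demo": 85, "case-study": 70, "about": 60, "blog": 30}
DEFAULT_SCORE = 20  # same default the ingest script gives unknown pages


def score_path_by_segment(page_url, scores=SCORES, default=DEFAULT_SCORE):
    """Score one URL by comparing patterns to whole path segments."""
    path = urlparse(page_url).path if "://" in page_url else page_url
    segments = [s for s in path.lower().strip("/").split("/") if s]
    # Keep the highest score among segments that exactly match a pattern
    matched = [scores[s] for s in segments if s in scores]
    return max(matched, default=default)


if __name__ == "__main__":
    for url in ("https://example.com/pricing",
                "https://example.com/blog/all-about-seo",
                "/careers"):
        print(url, "->", score_path_by_segment(url))
```

The trade-off is that exact segment matching no longer catches compound slugs such as `pricing-plans`; a site that relies on those would need them listed explicitly or would keep the substring approach.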


@ -0,0 +1,9 @@
# Core dependencies
requests>=2.28.0
# For ICP Learning Analyzer (PostgreSQL connection)
# Only needed if using icp_learning_analyzer.py
psycopg2-binary>=2.9.0
# Optional: python-dotenv for loading .env files
python-dotenv>=1.0.0


@ -0,0 +1,410 @@
#!/usr/bin/env python3
"""
Trigger-Based Prospecting Engine
Monitors job postings, new hires, and funding signals to identify
companies where new marketing leaders are evaluating agency/vendor relationships.
Searches across multiple signal categories:
- New CMO/VP Marketing hires (leadership change = budget reallocation)
- Marketing leadership job postings (team building = growth mode)
- Agency search signals (active evaluation)
- Funding rounds (capital to deploy on growth)
Each signal is scored, enriched with industry/size estimates, and paired
with a personalized outreach hook and email draft.
Usage:
python3 trigger_prospector.py --days 7 --top 15 --min-score 50
Requires: BRAVE_API_KEY environment variable
"""
import argparse
import json
import os
import random
import re
import sys
from datetime import datetime, timedelta
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen
# ─── Configuration ───────────────────────────────────────────────────────────
BASE_DIR = Path(os.environ.get("BASE_DIR", Path(__file__).resolve().parent))
DATA_DIR = BASE_DIR / "data"
OUTPUT_FILE = DATA_DIR / "trigger-prospects-latest.json"
BRAVE_SEARCH_URL = "https://api.search.brave.com/res/v1/web/search"
# Your company info (for email templates)
YOUR_COMPANY_NAME = os.environ.get("YOUR_COMPANY_NAME", "Your Company")
YOUR_SENDER_NAME = os.environ.get("YOUR_SENDER_NAME", "Your Name")
# ─── Signal Search Queries ───────────────────────────────────────────────────
# Customize these queries for your target market.
# Each category maps to a list of search queries that detect buying signals.
SEARCH_QUERIES = {
"new_hire": [
'"hired head of marketing"',
'"new CMO" announced',
'"VP marketing joined"',
'"head of growth" hired',
'"VP of marketing" appointed',
'"chief marketing officer" joins',
],
"job_posting": [
'"head of marketing" job posting site:linkedin.com',
'"VP of marketing" hiring site:linkedin.com',
'"CMO" open role site:linkedin.com',
],
"agency_search": [
'"looking for marketing agency"',
'"looking for agency" marketing',
'"seeking marketing partner"',
'"RFP" "marketing agency"',
],
"funding": [
'"series A" raised marketing',
'"series B" raised marketing',
'"raised" million marketing growth',
'"funding round" marketing scale',
],
}
# ─── Service Keyword Mapping ────────────────────────────────────────────────
# Maps your service offerings to keywords found in signal text.
# Used to suggest which services to pitch to each prospect.
SERVICE_KEYWORDS = {
"SEO": ["seo", "organic", "search engine", "content marketing", "blog", "rankings"],
"Paid Media": ["paid", "ppc", "ads", "advertising", "google ads", "facebook ads",
"media buy", "paid social", "paid search"],
"Creative": ["creative", "brand", "design", "video", "content", "storytelling"],
"CRO": ["conversion", "cro", "optimization", "landing page", "funnel", "a/b test"],
"AI Marketing": ["ai", "artificial intelligence", "machine learning", "automation",
"personalization"],
}
def get_brave_api_key():
    """Get Brave Search API key from environment."""
    key = os.environ.get("BRAVE_API_KEY")
    if not key:
        print("❌ BRAVE_API_KEY not set.", file=sys.stderr)
        print(" Get one at: https://api.search.brave.com/", file=sys.stderr)
        sys.exit(1)
    return key


def brave_search(query: str, api_key: str, freshness: str = "pw", count: int = 10) -> list:
    """Search Brave and return results list."""
    params = urlencode({"q": query, "count": count, "freshness": freshness})
    url = f"{BRAVE_SEARCH_URL}?{params}"
    req = Request(url, headers={
        "Accept": "application/json",
        "Accept-Encoding": "identity",
        "X-Subscription-Token": api_key,
    })
    try:
        with urlopen(req, timeout=15) as resp:
            data = json.loads(resp.read().decode())
        return data.get("web", {}).get("results", [])
    except Exception as e:
        print(f" Warning: Search failed for '{query[:50]}...': {e}", file=sys.stderr)
        return []


def freshness_for_days(days: int) -> str:
    """Map day count to Brave freshness parameter."""
    if days <= 1:
        return "pd"
    elif days <= 7:
        return "pw"
    elif days <= 30:
        return "pm"
    return "py"


def extract_company_name(title: str, description: str) -> str:
    """Best-effort company name extraction from search result text."""
    patterns = [
        r"(?:at|joins?|hired by|appointed at|named .* at)\s+([A-Z][A-Za-z0-9&\.\- ]{1,40}?)"
        r"(?:\s+as|\s*[,\.\-\|]|\s+to\b)",
        r"([A-Z][A-Za-z0-9&\.\- ]{1,40}?)\s+(?:hires?|appoints?|names?|announces?|welcomes?)\b",
        r"([A-Z][A-Za-z0-9&\.\- ]{1,40}?)\s+(?:raises?|secures?|closes?)\s+\$",
        r"([A-Z][A-Za-z0-9&\.\- ]{1,40}?)\s+(?:series [A-C]|funding)",
    ]
    text = f"{title} {description}"
    for pat in patterns:
        m = re.search(pat, text)
        if m:
            name = m.group(1).strip().rstrip(" -,.|")
            if name.lower() not in {"the", "a", "new", "former", "our", "this", "why", "how", "what"}:
                return name
    # Fall back to the first segment of the title (before a separator)
    parts = re.split(r"[\|\-–—:]", title)
    if parts:
        candidate = parts[0].strip()
        if len(candidate) < 50 and candidate[0:1].isupper():
            return candidate
    return title[:60]
def estimate_company_size(text: str) -> str:
    """Estimate company size from context clues in signal text."""
    text_lower = text.lower()
    if any(w in text_lower for w in ["enterprise", "fortune 500", "10,000", "global"]):
        return "1000+"
    if any(w in text_lower for w in ["series c", "series d", "ipo", "public"]):
        return "500-1000"
    if any(w in text_lower for w in ["series b", "growth stage", "scale"]):
        return "200-500"
    # Check the earliest-stage clues before "seed": "pre-seed" contains "seed",
    # so testing "seed" first would classify pre-seed companies as 50-200.
    if any(w in text_lower for w in ["pre-seed", "bootstrapped", "early stage"]):
        return "10-50"
    if any(w in text_lower for w in ["series a", "startup", "seed"]):
        return "50-200"
    return "50-500"
def estimate_industry(text: str) -> str:
    """Estimate industry from signal text."""
    text_lower = text.lower()
    industries = {
        "SaaS": ["saas", "software", "platform", "cloud", "app"],
        "E-commerce": ["ecommerce", "e-commerce", "retail", "shop", "store", "dtc", "d2c"],
        "Fintech": ["fintech", "financial", "banking", "payments", "insurance"],
        "Healthcare": ["health", "medical", "biotech", "pharma", "wellness"],
        "Education": ["edtech", "education", "learning", "course"],
        "AI/ML": ["artificial intelligence", "machine learning", "ai-powered", "ai company"],
        "Crypto/Web3": ["crypto", "blockchain", "web3", "defi", "nft"],
        "Media": ["media", "publishing", "news", "content"],
        "B2B Services": ["b2b", "consulting", "services", "agency"],
    }
    # First matching industry wins (dict order is the priority order)
    for industry, keywords in industries.items():
        if any(k in text_lower for k in keywords):
            return industry
    return "Technology"


def suggest_services(text: str) -> list:
    """Suggest which of your services to pitch based on signal text."""
    text_lower = text.lower()
    matched = []
    for service, keywords in SERVICE_KEYWORDS.items():
        if any(k in text_lower for k in keywords):
            matched.append(service)
    if not matched:
        matched = ["SEO", "Paid Media"]  # Sensible defaults
    return matched


def score_prospect(signal_type: str, size_est: str, services: list, text: str) -> int:
    """Score a prospect 0-100 based on signal type, company fit, and context."""
    score = 0
    # Signal type scoring
    signal_scores = {"new_hire": 35, "job_posting": 25, "funding": 30, "agency_search": 40}
    score += signal_scores.get(signal_type, 20)
    # Company size fit (mid-market is ideal for most agencies)
    size_scores = {"10-50": 10, "50-200": 25, "200-500": 25, "500-1000": 15, "1000+": 5}
    score += size_scores.get(size_est, 15)
    # Service alignment
    score += min(len(services) * 5, 20)
    # Bonus signals in text
    text_lower = text.lower()
    if "cmo" in text_lower or "chief marketing" in text_lower:
        score += 10
    if "agency" in text_lower:
        score += 5
    if any(w in text_lower for w in ["review", "evaluate", "looking for", "rfp"]):
        score += 5
    return min(score, 100)
def generate_outreach_hook(company: str, signal_type: str) -> str:
    """Generate a casual outreach hook based on the signal type."""
    hooks = {
        "new_hire": [
            f"New marketing leadership at {company}. The first 90 days is when the best "
            f"leaders figure out what's actually working.",
            f"Congrats on the new hire at {company}. Leadership changes are the best time "
            f"to audit what's driving results and what's noise.",
        ],
        "job_posting": [
            f"Noticed {company} is hiring marketing roles. Usually means growth is the "
            f"priority. We help companies hit targets while the team ramps up.",
            f"{company} is building out the marketing team. We've been the bridge for "
            f"companies in that exact phase.",
        ],
        "funding": [
            f"Congrats on the raise. Post-funding is when the pressure to scale "
            f"acquisition hits. We help turn capital into pipeline efficiently.",
            f"Saw the funding news for {company}. The companies that win post-raise "
            f"scale acquisition without burning through runway.",
        ],
        "agency_search": [
            f"Saw {company} is evaluating marketing partners. Happy to throw our hat in.",
            f"Noticed you're looking for a marketing partner at {company}.",
        ],
    }
    options = hooks.get(signal_type, [f"Noticed some movement at {company}."])
    return random.choice(options)


def generate_email_draft(company, signal_type, services):
    """Generate a trigger-based cold email draft."""
    services_str = ", ".join(services[:3]) if services else "growth marketing"
    cta = random.choice([
        "Worth exploring?", "Curious if relevant?", "Worth a conversation?",
        "Make sense to chat?", "Worth 15 min?",
    ])
    signoff = YOUR_SENDER_NAME
    templates = {
        "new_hire": {
            "subject": f"{company}, new leadership = fresh eyes",
            "body": (f"Hey,\n\nSaw the leadership change at {company}. The first 90 days "
                     f"are when the best marketing leaders audit what's working and cut what's not.\n\n"
                     f"We specialize in {services_str} and figured the timing might be right.\n\n"
                     f"{cta}\n\n{signoff}"),
        },
        "job_posting": {
            "subject": f"{company} is hiring, we can help now",
            "body": (f"Hey,\n\nNoticed {company} is hiring marketing roles. Hiring takes time, "
                     f"but growth targets don't wait.\n\n"
                     f"We've been the bridge for companies in that exact gap, handling "
                     f"{services_str} while the team ramps up.\n\n{cta}\n\n{signoff}"),
        },
        "funding": {
            "subject": f"congrats on the raise, {company}",
            "body": (f"Hey,\n\nSaw the funding news. Congrats. Post-raise is when the pressure "
                     f"to scale acquisition really hits.\n\n"
                     f"We help companies turn funding into efficient pipeline growth, "
                     f"specifically through {services_str}.\n\n{cta}\n\n{signoff}"),
        },
        "agency_search": {
            "subject": f"{company} + {YOUR_COMPANY_NAME}",
            "body": (f"Hey,\n\nSaw you're evaluating marketing partners at {company}.\n\n"
                     f"We specialize in {services_str}. Happy to share a few quick wins "
                     f"we'd go after in the first 30 days. No commitment.\n\n{cta}\n\n{signoff}"),
        },
    }
    # Unknown signal types fall back to the agency-search template
    t = templates.get(signal_type, templates["agency_search"])
    return f"Subject: {t['subject']}\n\n{t['body']}"


def suggest_channel(signal_type: str) -> str:
    """Suggest the best outreach channel for this signal type."""
    channels = {
        "new_hire": "LinkedIn (congratulate + connect)",
        "agency_search": "Email (direct response)",
        "funding": "LinkedIn + Email (warm congrats)",
        "job_posting": "Email",
    }
    return channels.get(signal_type, "Email")
# ─── Main Pipeline ───────────────────────────────────────────────────────────
def run(days: int = 7, top: int = 15, min_score: int = 50):
    """Search every signal category, score results, and save the top prospects."""
    api_key = get_brave_api_key()
    freshness = freshness_for_days(days)
    print("🔍 Trigger-Based Prospecting Engine")
    print(f" Scanning last {days} days | Top {top} | Min score: {min_score}")
    print(f" {'-'*50}")
    all_prospects = []
    seen_urls = set()
    for signal_type, queries in SEARCH_QUERIES.items():
        print(f"\n📡 Scanning: {signal_type.replace('_', ' ').title()}")
        for query in queries:
            print(f"{query[:60]}...")
            results = brave_search(query, api_key, freshness=freshness, count=8)
            for r in results:
                url = r.get("url", "")
                if url in seen_urls:
                    continue
                seen_urls.add(url)
                title = r.get("title", "")
                desc = r.get("description", "")
                full_text = f"{title} {desc}"
                company = extract_company_name(title, desc)
                size_est = estimate_company_size(full_text)
                industry = estimate_industry(full_text)
                services = suggest_services(full_text)
                score = score_prospect(signal_type, size_est, services, full_text)
                if score < min_score:
                    continue
                prospect = {
                    "company": company,
                    "signal_type": signal_type,
                    "signal_detail": title,
                    "signal_url": url,
                    # Date of this scan, not the publication date of the signal
                    "signal_date": datetime.now().strftime("%Y-%m-%d"),
                    "prospect_score": score,
                    "industry": industry,
                    "est_company_size": size_est,
                    "suggested_services": services,
                    "suggested_channel": suggest_channel(signal_type),
                    "outreach_hook": generate_outreach_hook(company, signal_type),
                    "email_draft": generate_email_draft(company, signal_type, services),
                }
                all_prospects.append(prospect)
    # Deduplicate by company (keep highest score)
    company_best = {}
    for p in all_prospects:
        key = p["company"].lower().strip()
        if key not in company_best or p["prospect_score"] > company_best[key]["prospect_score"]:
            company_best[key] = p
    prospects = sorted(company_best.values(),
                       key=lambda x: x["prospect_score"], reverse=True)[:top]
    # Save output
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    output = {
        "generated_at": datetime.now().isoformat(),
        "params": {"days": days, "top": top, "min_score": min_score},
        "total_signals_found": len(all_prospects),
        "prospects": prospects,
    }
    OUTPUT_FILE.write_text(json.dumps(output, indent=2))
    # Print summary
    print(f"\n{'='*60}")
    print(f"🎯 TOP {len(prospects)} PROSPECTS (of {len(all_prospects)} signals found)")
    print(f"{'='*60}\n")
    for i, p in enumerate(prospects, 1):
        print(f" {i:2d}. [{p['prospect_score']:3d}] {p['company']}")
        print(f" Signal: {p['signal_type']} | {p['signal_detail'][:70]}")
        print(f" Size: {p['est_company_size']} | Industry: {p['industry']}")
        print(f" Services: {', '.join(p['suggested_services'])}")
        print(f" Channel: {p['suggested_channel']}")
        print()
    print(f"📁 Saved to: {OUTPUT_FILE}")
    return prospects


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Trigger-Based Prospecting Engine")
    parser.add_argument("--days", type=int, default=7, help="Lookback window in days (default: 7)")
    parser.add_argument("--top", type=int, default=15, help="Number of top prospects (default: 15)")
    parser.add_argument("--min-score", type=int, default=50, help="Minimum prospect score (default: 50)")
    args = parser.parse_args()
    run(days=args.days, top=args.top, min_score=args.min_score)

62
seo-ops/.env.example Normal file

@ -0,0 +1,62 @@
# ─────────────────────────────────────────────
# AI SEO Ops — Environment Configuration
# ─────────────────────────────────────────────
# Copy this file to .env and fill in your values.
# All scripts read from these environment variables.
# ─── Required ───────────────────────────────
# Your website domain (used for Ahrefs organic keyword lookup)
YOUR_DOMAIN=example.com
# Google Search Console site URL (run gsc_auth.py to see your verified sites)
# Format: "https://www.example.com/" or "sc-domain:example.com"
GSC_SITE_URL=https://www.example.com/
# Google OAuth credentials (for GSC API access)
# Create at: https://console.cloud.google.com/apis/credentials
# Choose "OAuth 2.0 Client ID" → "Desktop application"
GOOGLE_CLIENT_ID=your-client-id.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your-client-secret
# ─── Optional ───────────────────────────────
# Ahrefs API token (enables keyword data + competitor analysis)
# Get from: https://ahrefs.com/api
AHREFS_TOKEN=
# Competitor domains to analyze (comma-separated)
COMPETITORS=competitor1.com,competitor2.com,competitor3.com
# Brave Search API key (enables X/Twitter trend scanning in trend_scout.py)
# Get from: https://brave.com/search/api/
BRAVE_API_KEY=
# ─── Content Configuration ──────────────────
# Directory containing your content files (markdown, JSON atoms)
# Used by content_attack_brief.py for topic fingerprinting
CONTENT_DIR=./content
# Output directory for generated reports and JSON
OUTPUT_DIR=./output
# Content verticals for trend relevance scoring (comma-separated)
CONTENT_VERTICALS=AI marketing automation,SEO trends,content marketing AI,startup growth strategy
# Reddit subreddits to monitor for trends (comma-separated)
TREND_SUBREDDITS=marketing,SEO,startups,entrepreneur,digitalmarketing
# ─── Advanced ───────────────────────────────
# Path to GSC OAuth token file (auto-created by gsc_auth.py)
# GSC_TOKEN_FILE=.gsc-token.json
# Path to Google credentials JSON file (alternative to GOOGLE_CLIENT_ID/SECRET)
# GOOGLE_CREDENTIALS_FILE=/path/to/credentials.json
# Custom topic keywords for content fingerprinting (JSON format)
# TOPIC_KEYWORDS_JSON='{"AI agents": ["ai agent", "autonomous agent"], "SEO": ["seo", "keyword", "ranking"]}'
# Custom seed keywords per topic for Ahrefs research (JSON format)
# TOPIC_TO_SEEDS_JSON='{"AI agents": ["ai agents for marketing", "ai agent platform"]}'

203
seo-ops/README.md Normal file

@ -0,0 +1,203 @@
# AI SEO Ops
**Find the keywords your competitors missed. Automatically.**
An AI-powered SEO operations suite that replaces the manual grind of keyword research, competitor analysis, content briefing, and trend detection. These tools pull data from Google Search Console, Ahrefs, and the open web to surface the exact opportunities your team should act on — ranked by impact, scored by confidence, and ready to execute.
## What's Inside
### 🎯 Content Attack Brief Generator
Synthesizes your content footprint, Ahrefs keyword data, GSC performance, and competitor gaps into a weekly prioritized keyword brief. Scores every keyword on Impact × Confidence and assigns an execution path (fully automated → team-assisted → expert-only).
**What it finds:**
- BOFU money keywords your competitors rank for but you don't
- Trending keywords surging before the competition notices
- Decaying pages losing traffic that need a refresh
- Outside-the-box content angles where you have unique authority
### 📊 GSC Keyword Optimizer
Pulls Google Search Console data and identifies "striking distance" keywords: queries where you rank in positions 4–20 with decent impressions. These are your quick wins: small optimizations that can push you onto page one.
**What it does:**
- Finds keywords on the cusp of page one
- Calculates potential traffic gains from position improvements
- Identifies CTR underperformers (you rank well but nobody clicks)
- Groups related keywords for efficient optimization
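
The striking-distance filter can be sketched roughly as follows. This is an illustration, not the code in `gsc_client.py`: the row fields (`keys`, `position`, `impressions`) follow the shape used in the library example in `SKILL.md`, while the function name `striking_distance_rows`, the `min_impressions` threshold, and the sample rows are assumptions made up for this example.

```python
def striking_distance_rows(rows, min_position=4.0, max_position=20.0, min_impressions=100):
    """Return rows ranking in the striking-distance band, busiest queries first."""
    hits = [
        r for r in rows
        if min_position <= r["position"] <= max_position
        and r["impressions"] >= min_impressions  # hypothetical "decent impressions" cutoff
    ]
    return sorted(hits, key=lambda r: r["impressions"], reverse=True)


# Made-up sample rows, for illustration only
sample = [
    {"keys": ["seo agency"], "position": 2.1, "impressions": 900},
    {"keys": ["ai seo tools"], "position": 8.4, "impressions": 1200},
    {"keys": ["content brief template"], "position": 14.0, "impressions": 40},
    {"keys": ["b2b lead generation"], "position": 17.3, "impressions": 300},
]

for row in striking_distance_rows(sample):
    print(f"{row['keys'][0]}: pos {row['position']:.1f}, {row['impressions']} impressions")
```

Only the second and fourth sample rows pass: the first already ranks above position 4, and the third has too few impressions to be worth optimizing.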
### 🔑 GSC Client & Auth
A reusable Google Search Console API client with OAuth flow. Use it standalone or import it as a library in your own scripts.
**Features:**
- Top queries, pages, device splits, country splits
- Striking distance finder (positions 4–20)
- Query + page matrix for cannibalization analysis
- Daily trend tracking
- CLI and library modes
### 🔥 Trend Scout
Scans Google Trends, Hacker News, Reddit, X/Twitter, and YouTube to find trending topics in your niche before they peak. Scores each trend for relevance to your content verticals and suggests content angles.
**Sources monitored:**
- Google Trends RSS (US)
- Hacker News (filtered for your niche)
- Reddit (configurable subreddits)
- X/Twitter (via Brave Search)
- YouTube competitor outlier detection
## Quick Start
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
### 2. Configure environment
```bash
cp .env.example .env
# Edit .env with your API keys and site configuration
```
### 3. Authenticate with Google Search Console
```bash
python gsc_auth.py
```
This opens a browser for OAuth consent. Your token is saved locally for subsequent use.
### 4. Run the tools
```bash
# Content Attack Brief — full keyword intelligence report
python content_attack_brief.py
# GSC Keyword Optimizer — find striking distance keywords
python gsc_client.py --striking --days 28
# GSC top queries
python gsc_client.py --queries 50 --days 28
# Trend Scout — find what's trending in your niche
python trend_scout.py
```
## Architecture
```
┌──────────────────────────────────────────────────┐
│               Content Attack Brief               │
│  ┌───────────┐ ┌──────────┐ ┌──────────────────┐ │
│  │  Content  │ │  Ahrefs  │ │  Competitor Gap  │ │
│  │Fingerprint│ │ Keywords │ │     Analysis     │ │
│  └─────┬─────┘ └────┬─────┘ └────────┬─────────┘ │
│        └────────────┼────────────────┘           │
│                     ▼                            │
│ ┌──────────────────────────────────────────────┐ │
│ │         Impact × Confidence Scoring          │ │
│ │   Volume · KD · CPC · Trend · Funnel Stage   │ │
│ └──────────────────────────────────────────────┘ │
│                     │                            │
│  ┌──────────┐ ┌─────┴─────┐ ┌──────────────────┐ │
│  │   GSC    │ │ Execution │ │   Trend Scout    │ │
│  │  Client  │ │ Pipeline  │ │ (Google Trends   │ │
│  │          │ │           │ │  HN · Reddit)    │ │
│  └──────────┘ └───────────┘ └──────────────────┘ │
└──────────────────────────────────────────────────┘
```
## Scoring Algorithm
Every keyword gets an **Impact** score (0–10) and a **Confidence** score (0–10). Priority = Impact × Confidence.
### Impact factors:
| Signal | Score |
|--------|-------|
| Volume ≥ 10K | +3 |
| Volume ≥ 2K | +2 |
| Volume ≥ 500 | +1 |
| CPC ≥ $15 | +3 |
| CPC ≥ $5 | +2 |
| CPC ≥ $1 | +1 |
| BOFU intent | +2 |
| MOFU intent | +1 |
| Trend surging (>50%) | +2 |
| Trend rising (>20%) | +1 |
### Confidence factors:
| Signal | Score |
|--------|-------|
| KD ≤ 10 | +4 |
| KD ≤ 20 | +3 |
| KD ≤ 35 | +2 |
| KD ≤ 50 | +1 |
| Already ranking top 10 | +3 |
| Ranking 11–30 | +2 |
| Ranking 31–50 | +1 |
| Topic authority match | +2 |
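
The two tables can be read as the following scoring sketch. It is an interpretation, not the implementation in `content_attack_brief.py`: it assumes the tiers within each signal are mutually exclusive (only the highest matching tier counts), that trend is expressed as a percentage change, and that both scores are capped at 10. The function and parameter names are hypothetical.

```python
def impact_score(volume, cpc, funnel, trend_pct):
    """Impact (0-10): volume tier + CPC tier + funnel stage + trend direction."""
    score = 0
    # Search volume: only the highest matching tier counts (assumption)
    if volume >= 10_000:
        score += 3
    elif volume >= 2_000:
        score += 2
    elif volume >= 500:
        score += 1
    # Cost per click in USD
    if cpc >= 15:
        score += 3
    elif cpc >= 5:
        score += 2
    elif cpc >= 1:
        score += 1
    # Funnel stage: BOFU +2, MOFU +1, TOFU +0
    score += {"BOFU": 2, "MOFU": 1}.get(funnel, 0)
    # Trend: surging (>50%) or rising (>20%)
    if trend_pct > 50:
        score += 2
    elif trend_pct > 20:
        score += 1
    return min(score, 10)


def confidence_score(kd, position=None, topic_match=False):
    """Confidence (0-10): keyword difficulty + current ranking + topic authority."""
    score = 0
    if kd <= 10:
        score += 4
    elif kd <= 20:
        score += 3
    elif kd <= 35:
        score += 2
    elif kd <= 50:
        score += 1
    # position is None when the site does not rank for the keyword
    if position is not None:
        if position <= 10:
            score += 3
        elif position <= 30:
            score += 2
        elif position <= 50:
            score += 1
    if topic_match:
        score += 2
    return min(score, 10)


# Worked example: volume 2,400, CPC $6.50, MOFU, trend +25%; KD 18, ranking #14, topic match
impact = impact_score(2400, 6.50, "MOFU", 25)            # 2 + 2 + 1 + 1 = 6
confidence = confidence_score(18, position=14, topic_match=True)  # 3 + 2 + 2 = 7
print(impact, confidence, impact * confidence)           # 6 7 42
```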
## Funnel Classification
Keywords are auto-classified into funnel stages:
- **BOFU** (Bottom of Funnel): "agency", "services", "pricing", "best", "vs", "alternative", "hire"
- **MOFU** (Middle of Funnel): "how to", "guide", "strategy", "case study", "roi", "tutorial"
- **TOFU** (Top of Funnel): Everything else
Commercial/transactional intent from Ahrefs automatically promotes to BOFU.
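
A minimal sketch of this classification, using the term lists above. The whole-word matching, the BOFU-before-MOFU precedence, and the `intent` parameter with its `"commercial"` / `"transactional"` labels are assumptions for illustration; the actual script may match differently.

```python
import re

BOFU_TERMS = ["agency", "services", "pricing", "best", "vs", "alternative", "hire"]
MOFU_TERMS = ["how to", "guide", "strategy", "case study", "roi", "tutorial"]


def _contains_term(keyword, terms):
    """True if any term appears as a whole word or phrase in the keyword."""
    return any(re.search(rf"\b{re.escape(t)}\b", keyword) for t in terms)


def classify_funnel(keyword, intent=None):
    """Classify a keyword as BOFU, MOFU, or TOFU.

    `intent` is a hypothetical label carried over from keyword data;
    commercial or transactional intent promotes straight to BOFU.
    """
    kw = keyword.lower()
    if intent in {"commercial", "transactional"} or _contains_term(kw, BOFU_TERMS):
        return "BOFU"
    if _contains_term(kw, MOFU_TERMS):
        return "MOFU"
    return "TOFU"


for kw in ["seo agency pricing", "how to build ai agents", "what is answer engine optimization"]:
    print(f"{kw}: {classify_funnel(kw)}")
```

Whole-word matching is used here so that short terms such as "vs" or "roi" do not fire inside unrelated words (for example, "canvas").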
## Trend Detection
The Trend Scout scores each trending topic against your configured content verticals:
- **High relevance (25pts each):** Exact matches to your core topics
- **Medium relevance (10pts each):** Related industry terms
- **Low relevance (5pts each):** Tangential business terms
Trends scoring ≥20 get surfaced with content angle suggestions and recommended platforms.
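
The tiered relevance score can be sketched like this. The term lists are hypothetical placeholders (in practice they would presumably be derived from `CONTENT_VERTICALS`), and the sketch assumes points are summed once per matching term; only the 25/10/5 weights and the threshold of 20 come from the description above.

```python
import re

# Hypothetical term tiers, for illustration only
HIGH_TERMS = ["ai marketing", "seo"]
MEDIUM_TERMS = ["marketing", "growth"]
LOW_TERMS = ["business", "startup"]
SURFACE_THRESHOLD = 20


def count_matches(text, terms):
    """Count how many terms appear in text as whole words or phrases."""
    return sum(1 for t in terms if re.search(rf"\b{re.escape(t)}\b", text))


def relevance_score(title):
    """Sum tier points for every matching term in a trend title."""
    text = title.lower()
    return (
        25 * count_matches(text, HIGH_TERMS)
        + 10 * count_matches(text, MEDIUM_TERMS)
        + 5 * count_matches(text, LOW_TERMS)
    )


for title in ["New AI marketing benchmarks for SEO teams", "Growth tactics for startup founders"]:
    score = relevance_score(title)
    status = "surfaced" if score >= SURFACE_THRESHOLD else "skipped"
    print(f"{score:3d}  {status}  {title}")
```

The first title matches two high-relevance terms and one medium term (60 points) and is surfaced; the second matches only one medium and one low term (15 points) and is skipped.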
## Configuration
All configuration is via environment variables (see `.env.example`):
| Variable | Required | Description |
|----------|----------|-------------|
| `GSC_SITE_URL` | Yes | Your GSC property (e.g., `https://www.example.com/`) |
| `GSC_TOKEN_FILE` | No | Path to GSC OAuth token (default: `.gsc-token.json`) |
| `GOOGLE_CLIENT_ID` | Yes | Google OAuth client ID |
| `GOOGLE_CLIENT_SECRET` | Yes | Google OAuth client secret |
| `AHREFS_TOKEN` | No | Ahrefs API token (enables keyword data + competitor analysis) |
| `YOUR_DOMAIN` | Yes | Your root domain for organic keyword tracking |
| `COMPETITORS` | No | Comma-separated competitor domains |
| `CONTENT_VERTICALS` | No | Comma-separated topic verticals for trend scoring |
| `TREND_SUBREDDITS` | No | Comma-separated subreddit names to monitor |
| `BRAVE_API_KEY` | No | Brave Search API key (enables X/Twitter trend scanning) |
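| `OUTPUT_DIR` | No | Where to save output files (default: `./output`) |
| `CONTENT_DIR` | No | Directory of content files used for topic fingerprinting (default: `./content`) |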
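
As a rough sketch of how a script might consume these variables: the comma-splitting mirrors how `content_attack_brief.py` parses `COMPETITORS`, while `load_config`, `split_csv`, the required-variable check, and the returned dictionary keys are hypothetical helpers written for this example, not part of the repository.

```python
import os

REQUIRED_VARS = ["GSC_SITE_URL", "GOOGLE_CLIENT_ID", "GOOGLE_CLIENT_SECRET", "YOUR_DOMAIN"]


def split_csv(value):
    """Split a comma-separated value, dropping blanks and surrounding whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_config(env=os.environ):
    """Validate required variables and return a plain settings dict."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return {
        "site_url": env["GSC_SITE_URL"],
        "domain": env["YOUR_DOMAIN"],
        "competitors": split_csv(env.get("COMPETITORS", "")),
        "subreddits": split_csv(env.get("TREND_SUBREDDITS", "")),
        "output_dir": env.get("OUTPUT_DIR", "./output"),
    }


# Example with an explicit mapping instead of the real environment
example_env = {
    "GSC_SITE_URL": "https://www.example.com/",
    "GOOGLE_CLIENT_ID": "your-client-id.apps.googleusercontent.com",
    "GOOGLE_CLIENT_SECRET": "your-client-secret",
    "YOUR_DOMAIN": "example.com",
    "COMPETITORS": "competitor1.com, competitor2.com,,",
}
print(load_config(example_env)["competitors"])
```

Passing the environment in as a mapping keeps the helper easy to test without touching real credentials.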
## Using as a Claude Code Skill
Add this to your `.claude/agents/` directory and use the `SKILL.md` for Claude Code integration. The skill enables Claude to:
1. Run keyword analysis on demand
2. Generate weekly content attack briefs
3. Find and prioritize quick-win keywords from GSC
4. Monitor trending topics and suggest content
## File Structure
```
seo-ops/
├── README.md # This file
├── SKILL.md # Claude Code agent skill definition
├── content_attack_brief.py # Full keyword intelligence pipeline
├── gsc_client.py # GSC API client (library + CLI)
├── gsc_auth.py # GSC OAuth setup flow
├── trend_scout.py # Multi-source trend detection
├── requirements.txt # Python dependencies
└── .env.example # Environment variable template
```
## License
MIT

123
seo-ops/SKILL.md Normal file

@ -0,0 +1,123 @@
# AI SEO Ops
AI-powered SEO operations: keyword intelligence, competitor gap analysis, GSC optimization, and trend detection.
## When to Use
- User asks for keyword research, content brief, or SEO analysis
- User wants to find quick-win keywords from Google Search Console
- User needs a competitor gap analysis
- User wants to identify trending topics for content creation
- User asks about decaying content or traffic drops
- User wants a prioritized list of keywords to target
## Tools
### Content Attack Brief (`content_attack_brief.py`)
Full keyword intelligence pipeline. Requires `AHREFS_TOKEN` and GSC auth.
```bash
# Run the full brief
python content_attack_brief.py
```
**What it produces:**
- Topic fingerprint from your content library
- BOFU money keywords ranked by Impact × Confidence
- Trending keywords with sparkline visualizations
- Competitor gap analysis (keywords they rank for, you don't)
- Decaying page alerts (traffic drops >30%)
- Execution pipeline (auto-create → semi-auto → team)
**Output:** Prints formatted report to stdout + saves JSON to `OUTPUT_DIR/content-attack-brief-latest.json`
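
The decaying-page alert can be illustrated with a short sketch. Only the 30% drop threshold comes from the list above; the function name, the choice of comparing clicks across two periods, and the sample numbers are assumptions, and the real script may measure decay differently.

```python
def decaying_pages(previous_clicks, current_clicks, drop_threshold=0.30):
    """Return (page, drop_fraction) pairs whose clicks fell by more than the threshold."""
    alerts = []
    for page, before in previous_clicks.items():
        if before <= 0:
            continue  # no baseline traffic to compare against
        after = current_clicks.get(page, 0)
        drop = (before - after) / before
        if drop > drop_threshold:
            alerts.append((page, drop))
    # Largest drops first
    return sorted(alerts, key=lambda item: item[1], reverse=True)


# Made-up click counts for two comparison periods
previous = {"/blog/seo-guide": 100, "/blog/ai-agents": 200, "/blog/new-post": 0}
current = {"/blog/seo-guide": 60, "/blog/ai-agents": 190}

for page, drop in decaying_pages(previous, current):
    print(f"{page}: down {drop:.0%}")
```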
### GSC Client (`gsc_client.py`)
Google Search Console API client. Works as CLI or importable library.
```bash
# CLI usage
python gsc_client.py --queries 50 --days 28
python gsc_client.py --striking # Striking distance keywords (pos 4-20)
python gsc_client.py --pages 100 --days 7
python gsc_client.py --trend # Daily click/impression trend
python gsc_client.py --devices # Mobile vs desktop split
python gsc_client.py --sites # List verified properties
python gsc_client.py --json --queries 25 # JSON output
```
```python
# Library usage
from gsc_client import GSCClient
gsc = GSCClient()
rows = gsc.striking_distance(days=28, min_position=4, max_position=20)
for row in rows:
    print(f"{row['keys'][0]}: pos {row['position']:.1f}, {row['impressions']} impressions")
```
### GSC Auth (`gsc_auth.py`)
One-time OAuth setup for Google Search Console access.
```bash
python gsc_auth.py
# Opens browser → Google Sign-In → saves token locally
```
### Trend Scout (`trend_scout.py`)
Multi-source trend detection. No API keys required for basic functionality.
```bash
python trend_scout.py
```
**Sources:** Google Trends RSS, Hacker News, Reddit, X/Twitter (needs `BRAVE_API_KEY`), YouTube outlier detection
**Output:** Prints summary + saves JSON to `OUTPUT_DIR/flash-trends-latest.json` and markdown report.
## Configuration
All scripts read from environment variables. Copy `.env.example` to `.env` and fill in your values.
Required:
- `GSC_SITE_URL` — your Google Search Console property URL
- `GOOGLE_CLIENT_ID` / `GOOGLE_CLIENT_SECRET` — for GSC OAuth
- `YOUR_DOMAIN` — your root domain
Optional:
- `AHREFS_TOKEN` — enables Ahrefs keyword data and competitor analysis
- `COMPETITORS` — comma-separated competitor domains
- `BRAVE_API_KEY` — enables X/Twitter trend scanning
- `CONTENT_VERTICALS` — comma-separated topics for trend relevance scoring
- `TREND_SUBREDDITS` — comma-separated subreddits to monitor
## Scoring Model
Keywords are scored on two axes:
**Impact (0-10):** Volume + CPC + Funnel Stage + Trend direction
**Confidence (0-10):** Keyword Difficulty + Current ranking position + Topic authority
**Priority = Impact × Confidence** (max 100)
## Funnel Classification
- **BOFU:** Commercial/transactional intent, or keywords containing "agency", "services", "pricing", "best", "vs", "hire"
- **MOFU:** Informational with buying signals — "how to", "guide", "roi", "case study"
- **TOFU:** Pure informational
## Recommended Workflow
1. **Weekly:** Run `content_attack_brief.py` for the full intelligence report
2. **Daily:** Run `gsc_client.py --striking` to monitor striking distance keywords
3. **2x/week:** Run `trend_scout.py` to catch trending topics early
4. **Monthly:** Review competitor gaps and adjust `COMPETITORS` list
## Dependencies
```bash
pip install -r requirements.txt
```


@ -0,0 +1,951 @@
#!/usr/bin/env python3
"""
Content Attack Brief Generator
Synthesizes your content library, Ahrefs keyword data, GSC performance,
and competitor gaps into a weekly prioritized keyword brief.
Usage:
# Set environment variables (see .env.example)
python content_attack_brief.py
# Or export inline
AHREFS_TOKEN="..." YOUR_DOMAIN="example.com" python content_attack_brief.py
"""
import json
import os
import sys
import re
import glob
import importlib.util
import math
import requests
from datetime import datetime, timedelta, date
from collections import Counter, defaultdict
from pathlib import Path
# ─────────────────────────────────────────────
# Config (all from environment variables)
# ─────────────────────────────────────────────
OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "./output"))
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
CONTENT_DIR = Path(os.environ.get("CONTENT_DIR", "./content"))
# Directory containing your content files (markdown, JSON atoms, etc.)
AHREFS_TOKEN = os.environ.get("AHREFS_TOKEN", "")
AHREFS_BASE = "https://api.ahrefs.com/v3"
AHREFS_HEADERS = lambda: {"Authorization": f"Bearer {AHREFS_TOKEN}"}
YOUR_DOMAIN = os.environ.get("YOUR_DOMAIN", "example.com")
COMPETITORS = [c.strip() for c in os.environ.get("COMPETITORS", "").split(",") if c.strip()]
# ─────────────────────────────────────────────
# 1. CONTENT FINGERPRINT
# ─────────────────────────────────────────────
STOPWORDS = {
"the","a","an","is","are","was","were","be","been","being","have","has","had",
"do","does","did","will","would","could","should","may","might","shall",
"and","but","or","nor","for","yet","so","at","by","for","in","of","on","to",
"with","as","that","this","these","those","it","its","i","we","you","they",
"he","she","him","her","our","their","your","my","what","which","who","when",
"where","how","not","all","also","more","very","just","from","about","into",
"than","then","there","so","up","out","if","no","can","one","time","like",
"get","got","use","used","make","made","work","well","way","new","good",
"go","going","know","think","want","need","see","look","come","give",
"take","say","even","most","much","such","here","now","over","any","some",
"them","us","first","two","other","his","her","its",
}
# Customize these topic keywords to match your content verticals
TOPIC_KEYWORDS = {
"AI agents": ["ai agent", "ai agents", "agent fleet", "autonomous agent", "llm agent", "multi-agent"],
"Claude/OpenAI": ["claude", "openai", "gpt", "anthropic", "gemini", "chatgpt"],
"SEO/AEO": ["seo", "aeo", "search engine", "organic", "keyword", "serp", "ranking", "backlink", "gsc"],
"Content marketing": ["content", "blog", "article", "post", "write", "writing", "publishing"],
"Marketing agency": ["agency", "client", "service", "campaign", "marketing"],
"Lead generation": ["lead gen", "leads", "pipeline", "outbound", "inbound", "funnel", "prospect"],
"AI automation": ["automation", "automate", "automated", "workflow", "script", "cron", "pipeline"],
"Revenue/ROI": ["revenue", "roi", "growth", "profit", "income", "mrr", "arr", "monetize"],
"Sales": ["sales", "deal", "close", "outreach", "cold email", "crm"],
"Social media": ["instagram", "tiktok", "youtube", "twitter", "linkedin", "social media", "viral"],
"B2B SaaS": ["saas", "b2b", "software", "product", "platform"],
"Strategy": ["strategy", "strategic", "plan", "roadmap", "framework", "playbook"],
"Analytics/Data": ["analytics", "data", "metrics", "kpi", "ga4", "mixpanel", "dashboard"],
}
# Override with environment variable if set (JSON format)
_custom_topics = os.environ.get("TOPIC_KEYWORDS_JSON")
if _custom_topics:
    try:
        TOPIC_KEYWORDS = json.loads(_custom_topics)
    except json.JSONDecodeError:
        print(" [WARN] Invalid TOPIC_KEYWORDS_JSON, using defaults", file=sys.stderr)


def extract_fingerprint():
    """Read content files from CONTENT_DIR, count topic frequencies."""
    topic_counts = Counter()
    phrase_counts = Counter()
    # Load JSON content atoms (if available)
    atom_files = sorted(glob.glob(str(CONTENT_DIR / "content-atoms-*.json")))
    if atom_files:
        latest = atom_files[-1]
        try:
            with open(latest) as f:
                d = json.load(f)
            atoms = d.get("atoms", [])
            for atom in atoms:
                text = (atom.get("content", "") + " " + " ".join(atom.get("tags", []))).lower()
                _score_text(text, topic_counts, phrase_counts)
        except Exception as e:
            print(f" [WARN] Content atoms load error: {e}", file=sys.stderr)
    # Load markdown files (last 30 days by filename prefix)
    cutoff = date.today() - timedelta(days=30)
    if CONTENT_DIR.exists():
        for f in sorted(CONTENT_DIR.glob("**/*.md")):
            m = re.match(r"(\d{4}-\d{2}-\d{2})", f.name)
            if m:
                try:
                    file_date = date.fromisoformat(m.group(1))
                    if file_date < cutoff:
                        continue
                except ValueError:
                    pass
            try:
                text = f.read_text(errors="ignore").lower()
                _score_text(text, topic_counts, phrase_counts)
            except Exception:
                pass
    return topic_counts, phrase_counts


def _score_text(text, topic_counts, phrase_counts):
    """Score text against topic keywords and count phrase frequencies."""
    for topic, keywords in TOPIC_KEYWORDS.items():
        for kw in keywords:
            count = text.count(kw)
            if count > 0:
                topic_counts[topic] += count
    # Count meaningful 2-3 word phrases
    words = re.findall(r'\b[a-z][a-z\-]{2,}\b', text)
    words = [w for w in words if w not in STOPWORDS and len(w) > 3]
    for i in range(len(words)-1):
        bigram = f"{words[i]} {words[i+1]}"
        phrase_counts[bigram] += 1
        if i < len(words)-2:
            trigram = f"{words[i]} {words[i+1]} {words[i+2]}"
            phrase_counts[trigram] += 1
# ─────────────────────────────────────────────
# 2. KEYWORD SEEDS from fingerprint
# ─────────────────────────────────────────────
# Map topics to seed keywords for Ahrefs research
# Customize these for your industry/niche
TOPIC_TO_SEEDS = {
"AI agents": [
"ai agents for marketing", "ai agent platform", "marketing ai agents",
"ai agents b2b", "autonomous ai agents", "ai agent tools",
"build ai agents", "ai agents for business",
],
"Claude/OpenAI": [
"claude ai for business", "openai for marketing", "chatgpt marketing",
"gpt for content marketing", "ai writing tools",
],
"SEO/AEO": [
"seo agency", "ai seo tools", "seo for ai", "aeo optimization",
"answer engine optimization", "seo content strategy", "technical seo services",
"seo reporting tools", "enterprise seo agency",
],
"Content marketing": [
"content marketing agency", "content marketing strategy", "b2b content marketing",
"content marketing roi", "content marketing tools", "ai content marketing",
"content marketing services", "content strategy agency",
],
"Marketing agency": [
"digital marketing agency", "performance marketing agency", "b2b marketing agency",
"marketing agency pricing", "hire marketing agency", "marketing agency services",
"best marketing agencies", "saas marketing agency",
],
"Lead generation": [
"b2b lead generation", "lead generation agency", "lead generation strategy",
"b2b lead gen tools", "outbound lead generation", "lead generation services",
"demand generation agency", "lead generation for saas",
],
"AI automation": [
"marketing automation ai", "ai workflow automation", "automate marketing tasks",
"ai marketing automation tools", "marketing automation platform",
],
"Revenue/ROI": [
"marketing roi", "content marketing roi", "seo roi", "digital marketing roi",
"revenue driven marketing", "roi tracking marketing",
],
"Sales": [
"ai sales tools", "sales automation software", "cold email software",
"outbound sales automation", "ai cold email", "sales engagement platform",
],
"B2B SaaS": [
"saas marketing agency", "b2b saas marketing", "saas seo strategy",
"saas content marketing", "saas growth marketing",
],
"Analytics/Data": [
"marketing analytics tools", "seo analytics platform", "content analytics",
"marketing data analytics",
],
"Strategy": [
"digital marketing strategy", "content strategy consulting",
"marketing strategy agency", "growth strategy consulting",
],
}
# Override with environment variable if set (JSON format)
_custom_seeds = os.environ.get("TOPIC_TO_SEEDS_JSON")
if _custom_seeds:
try:
TOPIC_TO_SEEDS = json.loads(_custom_seeds)
except json.JSONDecodeError:
print(" [WARN] Invalid TOPIC_TO_SEEDS_JSON, using defaults", file=sys.stderr)
def derive_seeds(topic_counts):
"""Return ranked list of keyword seeds based on topic frequency."""
seeds = []
seen = set()
for topic, _ in topic_counts.most_common():
for seed in TOPIC_TO_SEEDS.get(topic, []):
if seed not in seen:
seeds.append(seed)
seen.add(seed)
# Add fallback seeds
fallbacks = [
"ai marketing", "seo services", "content marketing",
"digital marketing agency", "marketing automation",
"b2b lead generation", "marketing strategy",
]
for s in fallbacks:
if s not in seen:
seeds.append(s)
seen.add(s)
return seeds[:150]
# ─────────────────────────────────────────────
# 3. AHREFS KEYWORDS EXPLORER
# ─────────────────────────────────────────────
def fetch_ahrefs_keywords(seeds):
"""Pull Ahrefs Keywords Explorer data in batches of 50."""
if not AHREFS_TOKEN:
print(" [WARN] No AHREFS_TOKEN — skipping keyword data", file=sys.stderr)
return {}
results = {}
today = date.today()
date_to = today.replace(day=1) - timedelta(days=1)
date_from = (date_to.replace(day=1) - timedelta(days=335)).replace(day=1)
batch_size = 50
for i in range(0, len(seeds), batch_size):
batch = seeds[i:i+batch_size]
try:
import urllib.parse
qs = urllib.parse.urlencode({
"country": "us",
"keywords": ",".join(batch),
"select": "keyword,volume,difficulty,cpc,traffic_potential,intents,volume_monthly_history",
"volume_monthly_date_from": date_from.strftime("%Y-%m-%d"),
"volume_monthly_date_to": date_to.strftime("%Y-%m-%d"),
})
resp = requests.get(
f"{AHREFS_BASE}/keywords-explorer/overview?{qs}",
headers=AHREFS_HEADERS(),
timeout=30,
)
if resp.status_code == 200:
data = resp.json()
for kw_data in data.get("keywords", []):
kw = kw_data.get("keyword", "").lower()
if kw:
if "difficulty" in kw_data and "keyword_difficulty" not in kw_data:
kw_data["keyword_difficulty"] = kw_data["difficulty"]
intents = kw_data.get("intents", {})
if isinstance(intents, dict):
kw_data["is_commercial"] = intents.get("commercial", False)
kw_data["is_transactional"] = intents.get("transactional", False)
results[kw] = kw_data
else:
print(f" [WARN] Ahrefs keywords batch {i//batch_size+1}: HTTP {resp.status_code}", file=sys.stderr)
except Exception as e:
print(f" [WARN] Ahrefs keywords batch error: {e}", file=sys.stderr)
return results
# ─────────────────────────────────────────────
# 4. AHREFS ORGANIC KEYWORDS
# ─────────────────────────────────────────────
def fetch_organic_keywords(domain, limit=1000):
"""Pull Ahrefs organic keywords for a domain."""
if not AHREFS_TOKEN:
return []
today = date.today()
first_of_month = today.replace(day=1).strftime("%Y-%m-%d")
try:
resp = requests.get(
f"{AHREFS_BASE}/site-explorer/organic-keywords",
headers=AHREFS_HEADERS(),
params={
"target": domain,
"country": "us",
"date": first_of_month,
"select": "keyword,volume,best_position,keyword_difficulty,sum_traffic,is_commercial,is_transactional,best_position_url",
"order_by": "volume:desc",
"limit": limit,
"mode": "subdomains",
},
timeout=30,
)
if resp.status_code == 200:
return resp.json().get("keywords", [])
else:
print(f" [WARN] Ahrefs organic {domain}: HTTP {resp.status_code}", file=sys.stderr)
except Exception as e:
print(f" [WARN] Ahrefs organic {domain} error: {e}", file=sys.stderr)
return []
# ─────────────────────────────────────────────
# 5. COMPETITOR GAP ANALYSIS
# ─────────────────────────────────────────────
# Terms that indicate keywords are relevant to your business
# Customize this set for your niche
RELEVANT_TERMS = {
"marketing","seo","content","agency","digital","growth","lead","analytics",
"advertising","social","email","conversion","b2b","saas","strategy","ai",
"search","traffic","keyword","backlink","campaign","funnel","inbound",
"outbound","automation","crm","ppc","sem","cro","optimization",
"brand","performance","demand","revenue","roi","reporting","tools",
"software","platform","services","hire","consultant","pricing","best",
"vs","alternative","guide","how to","enterprise","startup","ecommerce",
"agent","aeo","answer engine","generative engine",
}
# Keywords to block from competitor gap results (noise)
GAP_BLOCKLIST = {
"photo search","reverse video","image search","reverse image",
"paragraph generator","paragraph writer","paragraph rewriter",
"sentence rewriter","text rewriter","text humanizer","ai rewrite",
"rewording tool","reword ai","paraphrasing tool","essay writer",
"grammar checker","spell checker","word counter","character counter",
"reviews","review","coupon","promo code","login","sign up","free trial",
"what is","definition of","wikipedia",
}
def is_relevant_keyword(kw):
"""Check if a keyword is relevant to your business."""
kw_lower = kw.lower()
if not any(term in kw_lower for term in RELEVANT_TERMS):
return False
if any(blocked in kw_lower for blocked in GAP_BLOCKLIST):
return False
return True
def find_competitor_gaps(my_keywords, competitor_data):
"""Find keywords where competitors rank top 20 but you don't rank or rank >50."""
my_positions = {}
for item in my_keywords:
kw = item.get("keyword", "").lower()
pos = item.get("best_position", 999)
my_positions[kw] = pos
gaps = []
seen_kws = set()
for comp_domain, comp_keywords in competitor_data.items():
for item in comp_keywords:
kw = item.get("keyword", "").lower()
if not kw or kw in seen_kws:
continue
if not is_relevant_keyword(kw):
continue
comp_pos = item.get("best_position", 999)
my_pos = my_positions.get(kw, 999)
if comp_pos <= 20 and my_pos > 50:
seen_kws.add(kw)
gaps.append({
"keyword": kw,
"volume": item.get("volume", 0),
"kd": item.get("keyword_difficulty", 0),
"competitor": comp_domain,
"comp_pos": comp_pos,
"your_pos": my_pos,
"is_commercial": item.get("is_commercial", False),
"is_transactional": item.get("is_transactional", False),
})
gaps.sort(key=lambda x: x.get("volume", 0), reverse=True)
return gaps
# ─────────────────────────────────────────────
# 6. GSC DATA
# ─────────────────────────────────────────────
def fetch_gsc_data():
"""Import gsc_client and pull 28d + 90d query data."""
try:
# Try importing from same directory
script_dir = Path(__file__).resolve().parent
spec = importlib.util.spec_from_file_location("gsc_client", str(script_dir / "gsc_client.py"))
gsc_mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(gsc_mod)
gsc = gsc_mod.GSCClient()
rows_28 = gsc.query(dimensions=["query"], row_limit=1000, days=28)
rows_90 = gsc.query(dimensions=["query"], row_limit=1000, days=90)
return rows_28, rows_90
except Exception as e:
print(f" [WARN] GSC error: {e}", file=sys.stderr)
return [], []
def find_decaying_pages(rows_28, rows_90):
    """Find queries whose 28-day clicks fell >30% below their 90-day total scaled to a 28-day window."""
clicks_28 = {}
for row in rows_28:
keys = row.get("keys", [])
if keys:
clicks_28[keys[0].lower()] = row.get("clicks", 0)
clicks_90_norm = {}
for row in rows_90:
keys = row.get("keys", [])
if keys:
clicks_90_norm[keys[0].lower()] = row.get("clicks", 0) * (28 / 90)
decaying = []
for kw, c28 in clicks_28.items():
c90 = clicks_90_norm.get(kw, 0)
if c90 > 5:
if c28 < c90 * 0.7:
pct_loss = (c90 - c28) / c90 * 100
decaying.append({
"keyword": kw,
"clicks_28d": round(c28),
"clicks_90d_avg": round(c90),
"pct_loss": round(pct_loss, 1),
})
decaying.sort(key=lambda x: x.get("pct_loss", 0), reverse=True)
return decaying
# ─────────────────────────────────────────────
# 7 & 8. SCORING + TREND
# ─────────────────────────────────────────────
def compute_trend(history):
"""Compare first 3 months avg to last 3 months avg from volume_monthly_history."""
if not history or len(history) < 3:
return 0.0, "→ Stable"
volumes = []
try:
if isinstance(history[0], dict):
sorted_h = sorted(history, key=lambda x: x.get("date", x.get("month", "")))
volumes = [h.get("volume", h.get("search_volume", 0)) for h in sorted_h]
else:
volumes = [int(v) for v in history]
except Exception:
return 0.0, "→ Stable"
if len(volumes) < 6:
return 0.0, "→ Stable"
early_avg = sum(volumes[:3]) / 3
late_avg = sum(volumes[-3:]) / 3
if early_avg == 0:
pct = 100.0 if late_avg > 0 else 0.0
else:
pct = (late_avg - early_avg) / early_avg * 100
if pct > 50:
label = "🔥 Surging"
elif pct > 20:
label = "📈 Rising"
elif pct > 5:
label = "↗️ Growing"
elif pct >= -5:
label = "→ Stable"
elif pct >= -20:
label = "↘️ Declining"
else:
label = "📉 Falling"
return round(pct, 1), label
def make_sparkline(history):
    """Unicode block-character sparkline from volume history (last 12 points)."""
    SPARKS = "▁▂▃▄▅▆▇█"
    if not history:
        return ""
    try:
        if isinstance(history[0], dict):
            sorted_h = sorted(history, key=lambda x: x.get("date", x.get("month", "")))
            volumes = [h.get("volume", h.get("search_volume", 0)) for h in sorted_h]
        else:
            volumes = [int(v) for v in history]
    except Exception:
        return ""
    if not volumes or max(volumes) == 0:
        # All-zero history: draw a flat baseline rather than an empty string
        return SPARKS[0] * min(len(volumes), 12)
    mn, mx = min(volumes), max(volumes)
    rng = mx - mn or 1  # avoid division by zero when every month has the same volume
    return "".join(SPARKS[min(7, int((v - mn) / rng * 7))] for v in volumes[-12:])
def funnel_stage(kw, is_commercial=False, is_transactional=False):
"""Classify keyword into funnel stage."""
kw_lower = kw.lower()
bofu_terms = ["agency", "services", "hire", "pricing", "tools", "software",
"best", " vs ", "alternative", "platform", "cost", "price",
"company", "firms", "consultant", "consultancy", "outsource"]
mofu_terms = ["how to", "guide", "strategy", "examples", "case study",
"roi", "tutorial", "template", "checklist", "tips", "framework",
"what is", "explained", "overview", "comparison"]
if is_commercial or is_transactional:
return "BOFU"
if any(t in kw_lower for t in bofu_terms):
return "BOFU"
if any(t in kw_lower for t in mofu_terms):
return "MOFU"
return "TOFU"
def execution_path(kd, current_pos, volume=0):
"""Determine execution path based on difficulty and current ranking."""
has_page = current_pos < 999
if kd <= 20 and not has_page:
return "🤖 AUTO — create new content"
if has_page and kd <= 50:
return "🤖 AUTO — refresh existing content"
if kd <= 40:
return "🤖+👤 SEMI — AI drafts, team reviews"
if kd <= 60:
return "👤+🤖 TEAM — writes content, AI optimizes"
return "👤 TEAM — expert content + link building"
def score_keyword(kw_data, current_pos=999, topic_counts=None):
"""Score a keyword dict with Impact × Confidence."""
volume = kw_data.get("volume", 0) or 0
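    # Default to 50 only when no difficulty is reported; a legitimate KD of 0 must stay 0
    kd = next((kw_data[k] for k in ("keyword_difficulty", "kd") if kw_data.get(k) is not None), 50)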
cpc = float(kw_data.get("cpc", 0) or 0)
history = kw_data.get("volume_monthly_history", [])
is_commercial = kw_data.get("is_commercial", False)
is_transactional = kw_data.get("is_transactional", False)
kw = kw_data.get("keyword", "").lower()
trend_pct, trend_label = compute_trend(history)
sparkline = make_sparkline(history)
stage = funnel_stage(kw, is_commercial, is_transactional)
# ── Impact (0-10) ──
impact = 0
if volume >= 10000:
impact += 3
elif volume >= 2000:
impact += 2
elif volume >= 500:
impact += 1
if cpc >= 15:
impact += 3
elif cpc >= 5:
impact += 2
elif cpc >= 1:
impact += 1
if stage == "BOFU":
impact += 2
elif stage == "MOFU":
impact += 1
if trend_pct > 50:
impact += 2
elif trend_pct > 20:
impact += 1
impact = min(10, impact)
# ── Confidence (0-10) ──
confidence = 0
if kd <= 10:
confidence += 4
elif kd <= 20:
confidence += 3
elif kd <= 35:
confidence += 2
elif kd <= 50:
confidence += 1
if current_pos <= 10:
confidence += 3
elif current_pos <= 30:
confidence += 2
elif current_pos <= 50:
confidence += 1
# Topic authority: check if keyword topic appears in content fingerprint
if topic_counts:
for topic, cnt in topic_counts.items():
topic_seeds = TOPIC_TO_SEEDS.get(topic, [])
if any(seed.lower() in kw or kw in seed.lower() for seed in topic_seeds):
if cnt > 5:
confidence += 2
break
confidence = min(10, confidence)
priority = impact * confidence
exec_path = execution_path(kd, current_pos, volume)
return {
"keyword": kw,
"volume": volume,
"kd": kd,
"cpc": round(cpc, 2),
"traffic_potential": kw_data.get("traffic_potential", 0),
"current_pos": current_pos if current_pos < 999 else None,
"stage": stage,
"trend_pct": trend_pct,
"trend_label": trend_label,
"sparkline": sparkline,
"impact": impact,
"confidence": confidence,
"priority": priority,
"exec_path": exec_path,
"is_commercial": is_commercial,
"is_transactional": is_transactional,
}
# ─────────────────────────────────────────────
# OUTPUT FORMATTING
# ─────────────────────────────────────────────
def fmt_vol(v):
if not v:
return ""
if v >= 1000000:
return f"{v/1000000:.1f}M"
if v >= 1000:
return f"{v/1000:.1f}K"
return str(v)
def fmt_pos(p):
if p is None:
return ""
return f"#{p}"
def fmt_kd(k):
if k is None:
return ""
if k <= 20:
return f"KD{k}🟢"
if k <= 40:
return f"KD{k}🟡"
if k <= 60:
return f"KD{k}🟠"
return f"KD{k}🔴"
def fmt_cpc(c):
if not c:
return ""
return f"${c:.2f}"
def print_kw_row(scored, idx=None):
prefix = f" {idx:>2}. " if idx else " "
pos_str = fmt_pos(scored.get("current_pos"))
trend = scored.get("trend_label", "→ Stable")
spark = scored.get("sparkline", "")
kw = scored.get("keyword", "")
vol = fmt_vol(scored.get("volume"))
kd = fmt_kd(scored.get("kd"))
cpc = fmt_cpc(scored.get("cpc"))
imp = scored.get("impact", 0)
conf = scored.get("confidence", 0)
pri = scored.get("priority", 0)
stage = scored.get("stage", "TOFU")
ep = scored.get("exec_path", "")
print(f"{prefix}{kw}")
print(f" Vol:{vol} {kd} CPC:{cpc} Pos:{pos_str} [{stage}]")
print(f" Trend: {trend} {spark} ({scored.get('trend_pct', 0):+.0f}%)")
print(f" Impact:{imp} Conf:{conf} Priority:{pri}")
print(f" {ep}")
print()
# ─────────────────────────────────────────────
# MAIN
# ─────────────────────────────────────────────
def main():
today = date.today()
week_str = today.strftime("%B %d, %Y")
print("=" * 68)
print(f"🎯 CONTENT ATTACK BRIEF — {YOUR_DOMAIN}")
print(" Content Fingerprint × Ahrefs × GSC × Competitor Gaps")
print(f" Week of {week_str}")
print("=" * 68)
print()
# ── Step 1: Topic fingerprint ──
print("📡 Ingesting content library...", file=sys.stderr)
topic_counts, phrase_counts = extract_fingerprint()
print()
print("🧬 TOPIC FINGERPRINT (30-day content)")
print()
if topic_counts:
max_count = max(topic_counts.values())
for topic, count in topic_counts.most_common(15):
bar_len = max(1, int(count / max_count * 30))
            bar = "█" * bar_len
print(f" {topic:<25} {bar} {count}")
else:
print(" [No content found in CONTENT_DIR]")
print()
# ── Step 2: Seeds ──
print("🔍 Deriving keyword seeds...", file=sys.stderr)
seeds = derive_seeds(topic_counts)
print(f" Derived {len(seeds)} keyword seeds from topic fingerprint")
print()
# ── Step 3: Ahrefs keywords ──
print("📊 Pulling Ahrefs keyword data...", file=sys.stderr)
seeds_data = fetch_ahrefs_keywords(seeds)
print(f" Got data for {len(seeds_data)} keywords", file=sys.stderr)
# ── Step 4: Your organic keywords ──
print(f"🌐 Pulling {YOUR_DOMAIN} organic keywords...", file=sys.stderr)
my_keywords = fetch_organic_keywords(YOUR_DOMAIN, limit=1000)
print(f" Got {len(my_keywords)} organic keywords", file=sys.stderr)
my_positions = {}
my_kw_data = {}
for item in my_keywords:
kw = item.get("keyword", "").lower()
my_positions[kw] = item.get("best_position", 999)
my_kw_data[kw] = item
# ── Step 5: Competitor gaps ──
competitor_gaps = []
if COMPETITORS:
print("🕵️ Pulling competitor keywords...", file=sys.stderr)
competitor_data = {}
for comp in COMPETITORS:
print(f" {comp}...", file=sys.stderr)
competitor_data[comp] = fetch_organic_keywords(comp, limit=200)
competitor_gaps = find_competitor_gaps(my_keywords, competitor_data)
print(f" Found {len(competitor_gaps)} competitor gap keywords", file=sys.stderr)
# ── Step 6: GSC data ──
print("📈 Pulling GSC data...", file=sys.stderr)
rows_28, rows_90 = fetch_gsc_data()
print(f" GSC 28d: {len(rows_28)} queries, 90d: {len(rows_90)} queries", file=sys.stderr)
decaying = find_decaying_pages(rows_28, rows_90)
# ── Step 7: Score all keywords ──
print("⚡ Scoring keywords...", file=sys.stderr)
all_scored = []
for kw, data in seeds_data.items():
pos = my_positions.get(kw, 999)
scored = score_keyword(data, current_pos=pos, topic_counts=topic_counts)
all_scored.append(scored)
for item in my_keywords:
kw = item.get("keyword", "").lower()
if kw not in seeds_data:
pos = item.get("best_position", 999)
scored = score_keyword(
{
"keyword": kw,
"volume": item.get("volume", 0),
"keyword_difficulty": item.get("keyword_difficulty", 50),
"cpc": 0,
"is_commercial": item.get("is_commercial", False),
"is_transactional": item.get("is_transactional", False),
},
current_pos=pos,
topic_counts=topic_counts,
)
all_scored.append(scored)
# Deduplicate
seen_kws = set()
deduped = []
for s in sorted(all_scored, key=lambda x: x["priority"], reverse=True):
if s["keyword"] not in seen_kws:
seen_kws.add(s["keyword"])
deduped.append(s)
all_scored = deduped
# ── BOFU: Money Keywords ──
bofu = [s for s in all_scored if s["stage"] == "BOFU"]
bofu.sort(key=lambda x: x["priority"], reverse=True)
print("💰 BOFU: MONEY KEYWORDS (top 12)")
print()
for i, s in enumerate(bofu[:12], 1):
print_kw_row(s, i)
# ── Trending ──
trending = sorted(all_scored, key=lambda x: x.get("trend_pct", 0), reverse=True)
trending = [t for t in trending if t.get("trend_pct", 0) > 5]
print("🔥 TRENDING: Fastest-growing (top 10)")
print()
for i, s in enumerate(trending[:10], 1):
print_kw_row(s, i)
# ── Competitor Gaps ──
if competitor_gaps:
print("🕳️ COMPETITOR GAP (top 15 relevant)")
print()
for i, gap in enumerate(competitor_gaps[:15], 1):
vol = fmt_vol(gap.get("volume", 0))
kd = fmt_kd(gap.get("kd", 0))
comp = gap.get("competitor", "")
comp_pos = gap.get("comp_pos", "?")
your_pos = gap.get("your_pos", 999)
your_pos_str = "not ranking" if your_pos >= 999 else f"#{your_pos}"
stage = funnel_stage(gap["keyword"], gap.get("is_commercial", False), gap.get("is_transactional", False))
ep = execution_path(gap.get("kd", 50), your_pos)
print(f" {i:>2}. {gap['keyword']}")
print(f" Vol:{vol} {kd} [{stage}] {comp} #{comp_pos} You:{your_pos_str}")
print(f" {ep}")
print()
# ── Decay Alert ──
print("📉 DECAY ALERT: Pages losing traffic (top 10)")
print()
if decaying:
for i, d in enumerate(decaying[:10], 1):
kw = d["keyword"]
c28 = d["clicks_28d"]
c90 = d["clicks_90d_avg"]
loss = d["pct_loss"]
pos = my_positions.get(kw)
pos_str = fmt_pos(pos) if pos and pos < 999 else "?"
print(f" {i:>2}. {kw}")
print(f" 28d clicks: {c28} 90d avg: {c90} Loss: {loss:.0f}% Pos: {pos_str}")
print()
else:
print(" No significant decay detected (GSC may be unavailable)")
print()
# ── Execution Pipeline ──
print("⚡ EXECUTION PIPELINE (crawl → walk → run)")
print()
pipeline = defaultdict(list)
for s in all_scored:
pipeline[s["exec_path"]].append(s)
order = [
"🤖 AUTO — create new content",
"🤖 AUTO — refresh existing content",
"🤖+👤 SEMI — AI drafts, team reviews",
"👤+🤖 TEAM — writes content, AI optimizes",
"👤 TEAM — expert content + link building",
]
for path in order:
items = pipeline.get(path, [])
if not items:
continue
print(f" {path} ({len(items)} keywords)")
top = sorted(items, key=lambda x: x["priority"], reverse=True)[:5]
for kw_item in top:
vol = fmt_vol(kw_item["volume"])
kd = kw_item["kd"]
pri = kw_item["priority"]
print(f"{kw_item['keyword']} Vol:{vol} KD:{kd} Pri:{pri}")
print()
# ── Summary ──
print("📊 SUMMARY")
print()
bofu_count = len([s for s in all_scored if s["stage"] == "BOFU"])
mofu_count = len([s for s in all_scored if s["stage"] == "MOFU"])
tofu_count = len([s for s in all_scored if s["stage"] == "TOFU"])
auto_count = len(pipeline.get("🤖 AUTO — create new content", []))
refresh_count = len(pipeline.get("🤖 AUTO — refresh existing content", []))
surging = [s for s in all_scored if "Surging" in s.get("trend_label", "")]
print(f" Keywords analyzed: {len(all_scored)}")
print(f" Competitor gaps: {len(competitor_gaps)}")
print(f" Decaying pages: {len(decaying)}")
print(f" BOFU / MOFU / TOFU: {bofu_count} / {mofu_count} / {tofu_count}")
print(f" Auto-create ready: {auto_count}")
print(f" Auto-refresh ready: {refresh_count}")
print(f" Surging keywords: {len(surging)}")
print(f" Top topics covered: {', '.join(t for t, _ in topic_counts.most_common(5))}")
print()
print("=" * 68)
# ── Save JSON ──
json_output = {
        "generated_at": datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"),  # UTC, to match the "Z" suffix
"week_of": week_str,
"domain": YOUR_DOMAIN,
"topic_fingerprint": dict(topic_counts.most_common(20)),
"all_keywords": all_scored,
"competitor_gaps": competitor_gaps[:30],
"decaying_pages": decaying[:20],
"summary": {
"total_keywords": len(all_scored),
"competitor_gaps": len(competitor_gaps),
"decaying_pages": len(decaying),
"bofu": bofu_count,
"mofu": mofu_count,
"tofu": tofu_count,
"auto_create": auto_count,
"auto_refresh": refresh_count,
"surging": len(surging),
},
}
output_path = OUTPUT_DIR / "content-attack-brief-latest.json"
output_path.write_text(json.dumps(json_output, indent=2))
print(f"\n✅ JSON saved to {output_path}", file=sys.stderr)
if __name__ == "__main__":
main()
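
The decay check in `find_decaying_pages` scales each query's 90-day click total to a 28-day equivalent (`clicks * 28 / 90`) and flags the query when the latest 28 days fall more than 30% short of that baseline. The sketch below isolates that arithmetic. The helper name `decay_pct`, its parameter names, and the sample numbers are illustrative assumptions and are not part of the script.

```python
def decay_pct(clicks_28d, clicks_90d, min_baseline=5, keep_ratio=0.7):
    """Return percent loss vs. the 28-day-equivalent baseline, or None if not flagged.

    Hypothetical helper that mirrors the thresholds used in find_decaying_pages.
    """
    baseline = clicks_90d * (28 / 90)  # 90-day total scaled to a 28-day window
    if baseline <= min_baseline:
        return None  # too little traffic to judge
    if clicks_28d >= baseline * keep_ratio:
        return None  # lost 30% or less of the baseline: not flagged
    return round((baseline - clicks_28d) / baseline * 100, 1)


if __name__ == "__main__":
    # 270 clicks over 90 days is a baseline of 84 clicks per 28 days
    print(decay_pct(clicks_28d=40, clicks_90d=270))  # flagged
    print(decay_pct(clicks_28d=80, clicks_90d=270))  # within tolerance
    print(decay_pct(clicks_28d=1, clicks_90d=12))    # baseline too small
```

Because the 90-day window contains the most recent 28 days, a sudden drop also pulls the baseline down, so the reported loss is a conservative estimate.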

131
seo-ops/gsc_auth.py Normal file

@ -0,0 +1,131 @@
#!/usr/bin/env python3
"""
Google Search Console OAuth Setup
One-time authentication flow for GSC API access.
Opens a browser for Google Sign-In, exchanges the auth code for a token,
and saves the token locally for use by gsc_client.py.
Prerequisites:
1. Create a Google Cloud project with Search Console API enabled
2. Create OAuth 2.0 credentials (Desktop application type)
3. Set GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET env vars
OR set GOOGLE_CREDENTIALS_FILE to a JSON file with client_id/client_secret
Usage:
python gsc_auth.py
"""
import json
import os
import webbrowser
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs
import requests
# Configuration
CLIENT_ID = os.environ.get("GOOGLE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("GOOGLE_CLIENT_SECRET", "")
REDIRECT_URI = os.environ.get("GSC_REDIRECT_URI", "http://localhost:8765")
SCOPES = "https://www.googleapis.com/auth/webmasters.readonly"
TOKEN_FILE = os.environ.get("GSC_TOKEN_FILE", os.path.join(os.path.dirname(__file__), ".gsc-token.json"))
# Try loading from credentials file if env vars not set
CREDS_FILE = os.environ.get("GOOGLE_CREDENTIALS_FILE", "")
if CREDS_FILE and os.path.exists(CREDS_FILE) and (not CLIENT_ID or not CLIENT_SECRET):
with open(CREDS_FILE) as f:
creds = json.load(f)
CLIENT_ID = CLIENT_ID or creds.get("client_id", "")
CLIENT_SECRET = CLIENT_SECRET or creds.get("client_secret", "")
if not CLIENT_ID or not CLIENT_SECRET:
print("ERROR: Google OAuth credentials required.")
print("Set GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET environment variables,")
print("or set GOOGLE_CREDENTIALS_FILE to a JSON file with client_id/client_secret.")
print("\nTo create credentials:")
print(" 1. Go to https://console.cloud.google.com/apis/credentials")
print(" 2. Create OAuth 2.0 Client ID (Desktop application)")
print(" 3. Download the JSON and set GOOGLE_CREDENTIALS_FILE, or copy the values")
exit(1)
auth_code = None
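class CallbackHandler(BaseHTTPRequestHandler):
    """Handles the OAuth redirect and captures the `code` query parameter."""

    def do_GET(self):
        global auth_code
        query = parse_qs(urlparse(self.path).query)
        auth_code = query.get("code", [None])[0]
        # Report failure in the browser when the redirect carries no code
        self.send_response(200 if auth_code else 400)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        if auth_code:
            self.wfile.write(b"<h1>GSC Authorized! You can close this tab.</h1>")
        else:
            self.wfile.write(b"<h1>GSC authorization failed. Return to the terminal and retry.</h1>")

    def log_message(self, *args):
        pass  # silence per-request logging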
# Build auth URL (urlencode escapes the redirect URI and scope values)
from urllib.parse import urlencode

auth_url = "https://accounts.google.com/o/oauth2/v2/auth?" + urlencode({
    "client_id": CLIENT_ID,
    "redirect_uri": REDIRECT_URI,
    "response_type": "code",
    "scope": SCOPES,
    "access_type": "offline",
    "prompt": "consent",
})
print("Opening browser for Google Sign-In...")
print(f"If the browser doesn't open, visit:\n{auth_url}\n")
webbrowser.open(auth_url)
# Wait for callback
# Parse the port properly so a redirect URI with a path or trailing slash still works
port = urlparse(REDIRECT_URI).port or 80
server = HTTPServer(("localhost", port), CallbackHandler)
server.handle_request()
if not auth_code:
print("ERROR: No auth code received")
exit(1)
# Exchange code for token
print("Exchanging code for token...")
resp = requests.post("https://oauth2.googleapis.com/token", data={
"code": auth_code,
"client_id": CLIENT_ID,
"client_secret": CLIENT_SECRET,
"redirect_uri": REDIRECT_URI,
"grant_type": "authorization_code"
})
if resp.status_code != 200:
print(f"ERROR: Token exchange failed — {resp.text}")
exit(1)
token_data = resp.json()
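# Create the file with owner-only permissions from the start (it holds a refresh token)
fd = os.open(TOKEN_FILE, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
with os.fdopen(fd, "w") as f:
    json.dump(token_data, f, indent=2)
os.chmod(TOKEN_FILE, 0o600)  # also tighten a pre-existing file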
print(f"✅ GSC token saved to {TOKEN_FILE}")
# Quick verification
try:
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
cred = Credentials(
token=token_data["access_token"],
refresh_token=token_data.get("refresh_token"),
token_uri="https://oauth2.googleapis.com/token",
client_id=CLIENT_ID,
client_secret=CLIENT_SECRET,
)
service = build("searchconsole", "v1", credentials=cred)
sites = service.sites().list().execute()
site_urls = [s["siteUrl"] for s in sites.get("siteEntry", [])]
print(f"✅ GSC connected! Verified sites: {site_urls}")
    if site_urls:
        print("\nSet GSC_SITE_URL to one of the above, e.g.:")
        print(f'  export GSC_SITE_URL="{site_urls[0]}"')
except Exception as e:
print(f"⚠️ Token saved but verification failed: {e}")
print("The token should still work — try running gsc_client.py")
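
The callback step in `gsc_auth.py` reads the authorization code from the query string of the redirect request. The sketch below isolates that parsing and also reads the standard OAuth 2.0 `error` parameter that is sent when consent is denied. The helper name `extract_auth_code` and the sample paths are illustrative assumptions, not part of the script.

```python
from urllib.parse import urlparse, parse_qs


def extract_auth_code(request_path):
    """Return (code, error) parsed from an OAuth redirect path.

    Hypothetical helper; either element is None when the parameter is absent.
    """
    query = parse_qs(urlparse(request_path).query)
    code = query.get("code", [None])[0]
    error = query.get("error", [None])[0]
    return code, error


if __name__ == "__main__":
    # parse_qs percent-decodes values, so "%2F" comes back as "/"
    print(extract_auth_code("/?code=4%2Fabc&scope=example"))
    print(extract_auth_code("/?error=access_denied"))
    print(extract_auth_code("/favicon.ico"))
```

Checking for a missing code matters because the local server handles exactly one request, and a stray browser request such as `/favicon.ico` would otherwise be mistaken for the redirect.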

250
seo-ops/gsc_client.py Normal file

@ -0,0 +1,250 @@
#!/usr/bin/env python3
"""
Google Search Console API Client
Direct API access via google-api-python-client. The access token is refreshed once per client, when the API service is first built.
Usage as library:
from gsc_client import GSCClient
gsc = GSCClient()
gsc = GSCClient(site_url="https://www.example.com/")
# Top queries
rows = gsc.query(dimensions=["query"], row_limit=25, days=28)
# Page performance
rows = gsc.query(dimensions=["page"], row_limit=100, days=7)
# Striking distance keywords (positions 4-20)
rows = gsc.striking_distance(days=28)
# List all verified sites
sites = gsc.list_sites()
Usage as CLI:
python gsc_client.py --queries 25 --days 28
python gsc_client.py --pages 100 --days 7
python gsc_client.py --striking
python gsc_client.py --sites
python gsc_client.py --site "https://www.example.com/" --queries 10
    python gsc_client.py --raw '{"dimensions":["query","page"],"row_limit":5}'
"""
import json, os, sys, argparse
from datetime import datetime, timedelta
# Configuration via environment variables
GSC_SITE_URL = os.environ.get("GSC_SITE_URL", "")
GSC_TOKEN_FILE = os.environ.get("GSC_TOKEN_FILE", os.path.join(os.path.dirname(__file__), ".gsc-token.json"))
GOOGLE_CREDENTIALS_FILE = os.environ.get("GOOGLE_CREDENTIALS_FILE", "")
class GSCClient:
def __init__(self, site_url=None, token_file=None, creds_file=None):
self.site_url = site_url or GSC_SITE_URL
if not self.site_url:
raise ValueError(
"GSC site URL required. Set GSC_SITE_URL env var or pass site_url parameter.\n"
"Example: GSC_SITE_URL='https://www.example.com/'"
)
self.token_file = token_file or GSC_TOKEN_FILE
self.creds_file = creds_file or GOOGLE_CREDENTIALS_FILE
self._service = None
def _get_service(self):
if self._service:
return self._service
from google.oauth2.credentials import Credentials
from google.auth.transport.requests import Request
from googleapiclient.discovery import build
if not os.path.exists(self.token_file):
raise FileNotFoundError(
f"GSC token file not found: {self.token_file}\n"
"Run gsc_auth.py first to authenticate with Google Search Console."
)
with open(self.token_file) as f:
token_data = json.load(f)
        # Build credentials: client ID/secret come from env vars, falling back to the credentials file
client_id = os.environ.get("GOOGLE_CLIENT_ID", "")
client_secret = os.environ.get("GOOGLE_CLIENT_SECRET", "")
if self.creds_file and os.path.exists(self.creds_file):
with open(self.creds_file) as f:
creds_data = json.load(f)
client_id = client_id or creds_data.get("client_id", "")
client_secret = client_secret or creds_data.get("client_secret", "")
if not client_id or not client_secret:
raise ValueError(
"Google OAuth credentials required. Set GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET "
"env vars, or set GOOGLE_CREDENTIALS_FILE to a JSON file with client_id/client_secret."
)
cred = Credentials(
token=token_data.get("access_token"),
refresh_token=token_data.get("refresh_token"),
token_uri="https://oauth2.googleapis.com/token",
client_id=client_id,
client_secret=client_secret,
scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
# Always refresh to ensure valid token
cred.refresh(Request())
token_data["access_token"] = cred.token
with open(self.token_file, "w") as f:
json.dump(token_data, f, indent=2)
self._service = build("searchconsole", "v1", credentials=cred)
return self._service
def list_sites(self):
"""List all verified Search Console sites."""
service = self._get_service()
result = service.sites().list().execute()
return result.get("siteEntry", [])
def query(self, dimensions=None, row_limit=25, days=28, start_date=None,
end_date=None, filters=None, search_type="web", data_state="final"):
"""
Query Search Console analytics.
Args:
dimensions: list of "query", "page", "device", "country", "date", "searchAppearance"
row_limit: max rows (API max 25000)
days: lookback window (ignored if start_date/end_date provided)
start_date: "YYYY-MM-DD" (inclusive)
end_date: "YYYY-MM-DD" (inclusive)
filters: list of {"dimension": str, "operator": str, "expression": str}
search_type: "web", "image", "video", "news", "discover", "googleNews"
data_state: "final" or "all" (all includes fresh/unfinalized data)
Returns:
list of row dicts with keys, clicks, impressions, ctr, position
"""
service = self._get_service()
if not end_date:
end_date = (datetime.now() - timedelta(days=3)).strftime("%Y-%m-%d")
if not start_date:
start_date = (datetime.now() - timedelta(days=days + 2)).strftime("%Y-%m-%d")
body = {
"startDate": start_date,
"endDate": end_date,
"dimensions": dimensions or ["query"],
"rowLimit": min(row_limit, 25000),
"type": search_type,
"dataState": data_state,
}
if filters:
body["dimensionFilterGroups"] = [{"filters": filters}]
result = service.searchanalytics().query(
siteUrl=self.site_url, body=body
).execute()
return result.get("rows", [])

    def top_queries(self, n=25, days=28, **kwargs):
        """Convenience: top N queries by clicks."""
        return self.query(dimensions=["query"], row_limit=n, days=days, **kwargs)

    def top_pages(self, n=100, days=28, **kwargs):
        """Convenience: top N pages by clicks."""
        return self.query(dimensions=["page"], row_limit=n, days=days, **kwargs)

    def query_page_matrix(self, n=1000, days=28, **kwargs):
        """Get query+page combos for cannibalization analysis."""
        return self.query(dimensions=["query", "page"], row_limit=n, days=days, **kwargs)

    def daily_trend(self, days=28, **kwargs):
        """Daily clicks/impressions trend."""
        return self.query(dimensions=["date"], row_limit=days, days=days, **kwargs)

    def device_split(self, days=28, **kwargs):
        """Traffic by device type."""
        return self.query(dimensions=["device"], row_limit=10, days=days, **kwargs)

    def country_split(self, n=25, days=28, **kwargs):
        """Traffic by country."""
        return self.query(dimensions=["country"], row_limit=n, days=days, **kwargs)

    def striking_distance(self, days=28, min_position=4, max_position=20, min_impressions=50):
        """Find queries in striking distance (positions 4-20 with decent impressions)."""
        rows = self.query(dimensions=["query"], row_limit=5000, days=days)
        return [
            r for r in rows
            if min_position <= r["position"] <= max_position
            and r["impressions"] >= min_impressions
        ]


def main():
    parser = argparse.ArgumentParser(description="Google Search Console CLI")
    parser.add_argument("--site", default=GSC_SITE_URL, help="Site URL (or set GSC_SITE_URL env var)")
    parser.add_argument("--queries", type=int, help="Top N queries")
    parser.add_argument("--pages", type=int, help="Top N pages")
    parser.add_argument("--days", type=int, default=28, help="Lookback days")
    parser.add_argument("--striking", action="store_true", help="Striking distance queries (pos 4-20)")
    parser.add_argument("--trend", action="store_true", help="Daily trend")
    parser.add_argument("--devices", action="store_true", help="Device split")
    parser.add_argument("--countries", type=int, help="Top N countries")
    parser.add_argument("--sites", action="store_true", help="List all verified sites")
    parser.add_argument("--raw", help="Raw query body as JSON")
    parser.add_argument("--json", action="store_true", help="Output as JSON")
    args = parser.parse_args()

    gsc = GSCClient(site_url=args.site)

    if args.sites:
        sites = gsc.list_sites()
        if args.json:
            print(json.dumps(sites, indent=2))
        else:
            print(f"{'Site URL':<60} {'Permission':<20}")
            print("-" * 80)
            for s in sorted(sites, key=lambda x: x["siteUrl"]):
                print(f"{s['siteUrl']:<60} {s['permissionLevel']:<20}")
        return

    if args.raw:
        body = json.loads(args.raw)
        rows = gsc.query(**body)
    elif args.queries:
        rows = gsc.top_queries(n=args.queries, days=args.days)
    elif args.pages:
        rows = gsc.top_pages(n=args.pages, days=args.days)
    elif args.striking:
        rows = gsc.striking_distance(days=args.days)
    elif args.trend:
        rows = gsc.daily_trend(days=args.days)
    elif args.devices:
        rows = gsc.device_split(days=args.days)
    elif args.countries:
        rows = gsc.country_split(n=args.countries, days=args.days)
    else:
        rows = gsc.top_queries(n=25, days=args.days)

    if args.json:
        print(json.dumps(rows, indent=2))
    else:
        if not rows:
            print("No data returned.")
            return
        dims = rows[0]["keys"]
        dim_count = len(dims)
        print(f"{'|'.join(f'Dim{i+1}' for i in range(dim_count)):<60} {'Clicks':>8} {'Impr':>10} {'CTR':>8} {'Pos':>6}")
        print("-" * 95)
        for r in rows:
            key_str = " | ".join(str(k)[:40] for k in r["keys"])
            print(f"{key_str:<60} {r['clicks']:>8} {r['impressions']:>10} {r['ctr']:>7.1%} {r['position']:>6.1f}")


if __name__ == "__main__":
    main()
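
The default date window and the striking-distance filter above can be tried without API access. The following is a minimal standalone sketch: `default_window` and `filter_striking_distance` are hypothetical names introduced here for illustration (they are not part of `GSCClient`), and the sample rows are invented, shaped like the row dicts the docstring describes (`keys`, `clicks`, `impressions`, `ctr`, `position`).

```python
from datetime import date, timedelta


def default_window(days=28, today=None):
    """Mirror the client's defaults: end 3 days ago, cover `days` days inclusive."""
    today = today or date.today()
    end = today - timedelta(days=3)
    start = today - timedelta(days=days + 2)
    return start.isoformat(), end.isoformat()


def filter_striking_distance(rows, min_position=4, max_position=20, min_impressions=50):
    """Keep rows ranked between min_position and max_position with enough impressions."""
    return [
        r for r in rows
        if min_position <= r["position"] <= max_position
        and r["impressions"] >= min_impressions
    ]


# Invented sample data in the shape of Search Analytics rows.
sample_rows = [
    {"keys": ["brand name"], "clicks": 120, "impressions": 900, "ctr": 0.133, "position": 1.8},
    {"keys": ["gsc api python"], "clicks": 14, "impressions": 420, "ctr": 0.033, "position": 7.4},
    {"keys": ["rare query"], "clicks": 0, "impressions": 12, "ctr": 0.0, "position": 11.0},
    {"keys": ["deep result"], "clicks": 1, "impressions": 300, "ctr": 0.003, "position": 34.2},
]

if __name__ == "__main__":
    print(default_window(28, date(2026, 3, 27)))
    for row in filter_striking_distance(sample_rows):
        print(row["keys"][0], row["position"], row["impressions"])
```

Only the second sample row passes the filter: the first already ranks above position 4, the third has too few impressions, and the fourth ranks below position 20.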

7
seo-ops/requirements.txt Normal file

@ -0,0 +1,7 @@
# Core dependencies
requests>=2.28.0
# Google Search Console API
google-api-python-client>=2.100.0
google-auth>=2.23.0
google-auth-httplib2>=0.1.1

440
seo-ops/trend_scout.py Normal file

@ -0,0 +1,440 @@
#!/usr/bin/env python3
"""
Trend Scout: multi-source trend detection for content marketing.

Scans Google Trends, Hacker News, Reddit, and X/Twitter to find trending
topics in your niche before they peak. Scores each trend for relevance
to your configured content verticals and suggests content angles.

Usage:
    python trend_scout.py

Environment variables:
    CONTENT_VERTICALS             Comma-separated topic verticals (default: marketing-focused set)
    TREND_SUBREDDITS              Comma-separated subreddits to monitor
    BRAVE_API_KEY                 Brave Search API key (enables X/Twitter scanning)
    OUTPUT_DIR                    Where to save output files (default: ./output)
    HIGH_RELEVANCE_KEYWORDS_JSON  JSON array that replaces the high-relevance keyword list
"""

import json
import os
import sys
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta
from pathlib import Path

# ─────────────────────────────────────────────
# Config
# ─────────────────────────────────────────────
OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "./output"))
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Content verticals: which topics are relevant to you?
# Override with CONTENT_VERTICALS env var (comma-separated)
DEFAULT_VERTICALS = [
    "AI marketing automation",
    "AI agents for business",
    "SEO trends",
    "content marketing AI",
    "marketing agency transformation",
    "programmatic SEO",
    "AI search optimization AEO",
    "startup growth strategy",
]
_verticals_env = os.environ.get("CONTENT_VERTICALS", "")
VERTICALS = [v.strip() for v in _verticals_env.split(",") if v.strip()] if _verticals_env else DEFAULT_VERTICALS

# Subreddits to monitor
DEFAULT_SUBREDDITS = ["marketing", "SEO", "startups", "entrepreneur", "artificial", "digitalmarketing"]
_subs_env = os.environ.get("TREND_SUBREDDITS", "")
SUBREDDITS = [s.strip() for s in _subs_env.split(",") if s.strip()] if _subs_env else DEFAULT_SUBREDDITS

BRAVE_API_KEY = os.environ.get("BRAVE_API_KEY", "")

# ─────────────────────────────────────────────
# Relevance scoring keywords
# Customize these for your niche
# ─────────────────────────────────────────────
HIGH_RELEVANCE_KEYWORDS = [
    "ai marketing", "seo", "ai agent", "marketing agency", "content marketing",
    "programmatic seo", "founder", "startup growth", "saas", "ai search",
    "ai automation", "marketing automation", "creator economy",
    "digital marketing agency", "ai seo",
]
MEDIUM_RELEVANCE_KEYWORDS = [
    "ai", "marketing", "google", "search", "business", "revenue",
    "growth", "startup", "entrepreneur", "automation", "llm", "gpt",
    "chatgpt", "social media", "advertising", "content",
    "digital marketing",
]
LOW_RELEVANCE_KEYWORDS = [
    "tech", "digital", "platform", "data", "analytics", "strategy",
]

# Override the high-relevance list with HIGH_RELEVANCE_KEYWORDS_JSON (a JSON array).
# Invalid JSON is ignored and the defaults above are kept.
_high_env = os.environ.get("HIGH_RELEVANCE_KEYWORDS_JSON")
if _high_env:
    try:
        HIGH_RELEVANCE_KEYWORDS = json.loads(_high_env)
    except json.JSONDecodeError:
        pass

# ─────────────────────────────────────────────
# Data Sources
# ─────────────────────────────────────────────
def get_google_trends():
    """Pull trending searches from Google Trends RSS."""
    url = "https://trends.google.com/trending/rss?geo=US"
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req, timeout=15) as response:
            data = response.read().decode("utf-8")
        root = ET.fromstring(data)
        ns = {"ht": "https://trends.google.com/trending/rss"}
        trends = []
        for item in root.findall(".//item")[:20]:
            title = item.find("title")
            traffic = item.find("ht:approx_traffic", ns)
            news_items = item.findall("ht:news_item", ns)
            news_titles = []
            news_urls = []
            for ni in news_items[:2]:
                nt = ni.find("ht:news_item_title", ns)
                nu = ni.find("ht:news_item_url", ns)
                if nt is not None:
                    news_titles.append(nt.text)
                if nu is not None:
                    news_urls.append(nu.text)
            trends.append({
                "topic": title.text if title is not None else "Unknown",
                "traffic": traffic.text if traffic is not None else "N/A",
                "news_titles": news_titles,
                "news_urls": news_urls,
            })
        return trends
    except Exception as e:
        print(f"⚠️ Google Trends fetch failed: {e}")
        return []


def get_hackernews_top():
    """Pull top HN stories filtered for relevance."""
    try:
        url = "https://hacker-news.firebaseio.com/v0/topstories.json"
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req, timeout=10) as response:
            ids = json.loads(response.read().decode("utf-8"))[:30]
        stories = []
        # Use all relevance keywords for filtering
        keywords = set()
        for kw in HIGH_RELEVANCE_KEYWORDS + MEDIUM_RELEVANCE_KEYWORDS:
            keywords.update(kw.lower().split())
        for story_id in ids:
            try:
                surl = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
                sreq = urllib.request.Request(surl, headers={"User-Agent": "Mozilla/5.0"})
                with urllib.request.urlopen(sreq, timeout=5) as sr:
                    story = json.loads(sr.read().decode("utf-8"))
                title = story.get("title", "").lower()
                if any(kw in title for kw in keywords):
                    stories.append({
                        "title": story.get("title"),
                        "url": story.get("url", f"https://news.ycombinator.com/item?id={story_id}"),
                        "score": story.get("score", 0),
                        "comments": story.get("descendants", 0),
                    })
            except Exception:
                # Skip stories that fail to load or parse; a bare `except:`
                # would also swallow KeyboardInterrupt and SystemExit.
                continue
            if len(stories) >= 10:
                break
        return stories
    except Exception as e:
        print(f"⚠️ HN fetch failed: {e}")
        return []


def get_reddit_trending():
    """Pull trending posts from configured subreddits."""
    posts = []
    for sub in SUBREDDITS:
        try:
            url = f"https://www.reddit.com/r/{sub}/hot.json?limit=5"
            req = urllib.request.Request(url, headers={"User-Agent": "TrendScout/1.0"})
            with urllib.request.urlopen(req, timeout=10) as response:
                data = json.loads(response.read().decode("utf-8"))
            for child in data.get("data", {}).get("children", []):
                post = child.get("data", {})
                if post.get("score", 0) > 50:
                    posts.append({
                        "title": post.get("title"),
                        "subreddit": sub,
                        "score": post.get("score"),
                        "comments": post.get("num_comments"),
                        "url": f"https://reddit.com{post.get('permalink', '')}",
                    })
        except Exception as e:
            print(f"⚠️ Reddit r/{sub} failed: {e}")
            continue
    posts.sort(key=lambda x: x.get("score", 0), reverse=True)
    return posts[:10]


def get_x_twitter_trending():
    """Pull trending X/Twitter discussions via Brave Search."""
    if not BRAVE_API_KEY:
        print(" ⚠️ No BRAVE_API_KEY - skipping X/Twitter scan")
        return []
    # Build search queries from your verticals
    queries = []
    for vertical in VERTICALS[:4]:
        queries.append(f'site:twitter.com OR site:x.com "{vertical}"')
    posts = []
    for query in queries:
        try:
            encoded_q = urllib.parse.quote(query)
            url = f"https://api.search.brave.com/res/v1/web/search?q={encoded_q}&count=5&freshness=pd"
            # No "Accept-Encoding: gzip" header: urllib does not decompress
            # responses, so a gzipped body would fail to decode as UTF-8.
            req = urllib.request.Request(url, headers={
                "Accept": "application/json",
                "X-Subscription-Token": BRAVE_API_KEY,
            })
            with urllib.request.urlopen(req, timeout=10) as response:
                data = json.loads(response.read().decode("utf-8"))
            for result in data.get("web", {}).get("results", []):
                if "twitter.com" in result.get("url", "") or "x.com" in result.get("url", ""):
                    posts.append({
                        "title": result.get("title", ""),
                        "url": result.get("url", ""),
                        "description": result.get("description", "")[:200],
                        "source": "X/Twitter",
                        "query": query.replace("site:twitter.com OR site:x.com ", ""),
                    })
        except Exception as e:
            print(f" ⚠️ X search failed: {e}")
            continue
    return posts[:10]

# ─────────────────────────────────────────────
# Scoring & Analysis
# ─────────────────────────────────────────────
def score_trend(trend_title):
    """Score how relevant a trend is to your content verticals (0-100)."""
    title_lower = trend_title.lower()
    score = 0
    for kw in HIGH_RELEVANCE_KEYWORDS:
        if kw in title_lower:
            score += 25
    for kw in MEDIUM_RELEVANCE_KEYWORDS:
        if kw in title_lower:
            score += 10
    for kw in LOW_RELEVANCE_KEYWORDS:
        if kw in title_lower:
            score += 5
    return min(score, 100)


def generate_content_angles(trends_data):
    """Generate content angle suggestions based on trends."""
    angles = []
    for trend in trends_data.get("google_trends", [])[:5]:
        relevance = score_trend(trend["topic"])
        if relevance >= 20:
            angles.append({
                "source": "Google Trends",
                "topic": trend["topic"],
                "traffic": trend["traffic"],
                "relevance_score": relevance,
                "angle_suggestion": f"Your take on '{trend['topic']}' - tie to your niche angle",
                "platforms": ["X", "LinkedIn", "Short-form video"],
            })
    for story in trends_data.get("hackernews", [])[:5]:
        relevance = score_trend(story["title"])
        if relevance >= 15:
            angles.append({
                "source": "Hacker News",
                "topic": story["title"],
                "score": story["score"],
                "relevance_score": relevance,
                "url": story["url"],
                "angle_suggestion": f"Expert perspective on '{story['title']}'",
                "platforms": ["X", "YouTube", "LinkedIn"],
            })
    for post in trends_data.get("reddit", [])[:5]:
        relevance = score_trend(post["title"])
        if relevance >= 15:
            angles.append({
                "source": f"Reddit r/{post['subreddit']}",
                "topic": post["title"],
                "engagement": f"{post['score']} upvotes, {post['comments']} comments",
                "relevance_score": relevance,
                "url": post["url"],
                "angle_suggestion": "Address this conversation from your expertise",
                "platforms": ["X", "LinkedIn", "Short-form video"],
            })
    for post in trends_data.get("x_twitter", [])[:5]:
        relevance = score_trend(post["title"])
        if relevance >= 15:
            angles.append({
                "source": "X/Twitter",
                "topic": post["title"][:100],
                "relevance_score": relevance,
                "url": post.get("url", ""),
                "angle_suggestion": "Jump into this conversation with your take",
                "platforms": ["X", "LinkedIn"],
            })
    angles.sort(key=lambda x: x.get("relevance_score", 0), reverse=True)
    return angles[:10]


def format_output(trends_data, angles):
    """Format for human-readable markdown output."""
    today = datetime.now().strftime("%Y-%m-%d")
    lines = [f"# 🔥 Trend Scout - {today}\n"]
    if angles:
        lines.append("## Top Content Opportunities\n")
        for i, angle in enumerate(angles, 1):
            lines.append(f"### {i}. {angle['topic']}")
            lines.append(f"**Source:** {angle['source']} | **Relevance:** {angle['relevance_score']}/100")
            if angle.get("traffic"):
                lines.append(f"**Search volume:** {angle['traffic']}")
            if angle.get("engagement"):
                lines.append(f"**Engagement:** {angle['engagement']}")
            lines.append(f"**Angle:** {angle['angle_suggestion']}")
            lines.append(f"**Best for:** {', '.join(angle['platforms'])}")
            if angle.get("url"):
                lines.append(f"**Ref:** {angle['url']}")
            lines.append("")
    lines.append("## 📊 Raw Signals\n")
    gt = trends_data.get("google_trends", [])
    if gt:
        lines.append("**Google Trends (US):**")
        for t in gt[:8]:
            lines.append(f"- {t['topic']} ({t['traffic']})")
        lines.append("")
    hn = trends_data.get("hackernews", [])
    if hn:
        lines.append("**Hacker News (filtered):**")
        for s in hn[:5]:
            lines.append(f"- [{s['title']}]({s['url']}) - {s['score']}pts, {s['comments']} comments")
        lines.append("")
    rd = trends_data.get("reddit", [])
    if rd:
        lines.append("**Reddit Hot Posts:**")
        for p in rd[:5]:
            lines.append(f"- r/{p['subreddit']}: {p['title']} ({p['score']}↑)")
        lines.append("")
    xt = trends_data.get("x_twitter", [])
    if xt:
        lines.append("**X/Twitter Trending:**")
        for p in xt[:5]:
            lines.append(f"- [{p.get('query','')}] {p['title'][:80]}")
        lines.append("")
    return "\n".join(lines)

# ─────────────────────────────────────────────
# Main
# ─────────────────────────────────────────────
def main():
    print("🔥 Trend Scout starting...")
    print(f" Verticals: {', '.join(VERTICALS[:5])}{'...' if len(VERTICALS) > 5 else ''}")
    print(f" Subreddits: {', '.join(SUBREDDITS)}")
    print()

    # Gather signals
    print(" 📡 Fetching Google Trends...")
    google_trends = get_google_trends()
    print(" 📡 Fetching Hacker News...")
    hackernews = get_hackernews_top()
    print(" 📡 Fetching Reddit...")
    reddit = get_reddit_trending()
    print(" 📡 Fetching X/Twitter...")
    x_twitter = get_x_twitter_trending()

    trends_data = {
        "timestamp": datetime.now().isoformat(),
        "verticals": VERTICALS,
        "google_trends": google_trends,
        "hackernews": hackernews,
        "reddit": reddit,
        "x_twitter": x_twitter,
    }

    # Generate content angles
    print(" 🧠 Generating content angles...")
    angles = generate_content_angles(trends_data)

    # Save raw data (JSON)
    json_path = OUTPUT_DIR / "flash-trends-latest.json"
    with open(json_path, "w") as f:
        json.dump({"trends": trends_data, "angles": angles}, f, indent=2)
    print(f" 💾 Saved to {json_path}")

    # Save formatted output (Markdown)
    today = datetime.now().strftime("%Y-%m-%d")
    md_path = OUTPUT_DIR / f"flash-trends-{today}.md"
    formatted = format_output(trends_data, angles)
    with open(md_path, "w") as f:
        f.write(formatted)
    print(f" 📝 Saved to {md_path}")

    # Print summary
    print("\n✅ Trend Scout complete:")
    print(f" - Google Trends: {len(google_trends)} trends")
    print(f" - Hacker News: {len(hackernews)} relevant stories")
    print(f" - Reddit: {len(reddit)} hot posts")
    print(f" - X/Twitter: {len(x_twitter)} discussions")
    print(f" - Content angles: {len(angles)} opportunities")
    if angles:
        print("\n🎯 Top 3 angles:")
        for i, a in enumerate(angles[:3], 1):
            print(f" {i}. [{a['relevance_score']}/100] {a['topic']} ({a['source']})")
    return 0


if __name__ == "__main__":
    sys.exit(main())