The Intelligence Feed That Builds Itself

One Command. Done.

"Add this article to our feed."

npm run url-news "https://techcrunch.com/article-url"

That's it. Article extracted, analyzed, tagged, and added.

The Architecture: Hybrid Local + LLM

We initially tried two extremes:

  1. Pure LLM: Sending raw HTML to GPT-4. Expensive, slow, and prone to hallucinating metadata.
  2. Pure Scraper: Regex and Cheerio. Fast, but brittle and terrible at summarizing nuance.

The solution was a hybrid architecture.

Phase 1: The Local Extractor (Fast & Cheap)

Before we waste a single token, we process the content locally.

We built a robust extractor that runs right on the machine; a short sketch follows the list. It handles:

  1. Fetching & Rendering: Handles redirects and basic cleanup.
  2. Metadata Extraction: Pulls og:title, authors, dates, and site names using standard meta tags.
  3. Content Cleaning: Uses a cascade of selectors to find the actual article body, stripping away navigation, ads, and footers without needing an AI to "read" the page.
  4. Heuristic Relevance Scoring: A weighted keyword algorithm immediately discards irrelevant noise (spam, ads, off-topic posts) before they reach the expensive steps.
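
Here is a minimal sketch of that first phase in TypeScript. The helper name extractLocal, the interface shape, and the selector lists are illustrative assumptions rather than our actual module; the only library it leans on is Cheerio, mentioned above.

  // Hypothetical Phase 1 extractor; names and selectors are illustrative.
  import * as cheerio from "cheerio";

  interface ArticleDraft {
    url: string;
    title: string | null;
    author: string | null;
    publishedAt: string | null;
    siteName: string | null;
    body: string;
  }

  async function extractLocal(url: string): Promise<ArticleDraft> {
    // 1. Fetch & render: follow redirects, grab the raw HTML.
    const res = await fetch(url, { redirect: "follow" });
    const $ = cheerio.load(await res.text());

    // 2. Metadata: standard og:/meta tags first.
    const meta = (name: string) =>
      $(`meta[property="${name}"], meta[name="${name}"]`).attr("content") ?? null;

    // 3. Content cleaning: strip navigation, ads, and footers, then keep the body.
    $("nav, header, footer, aside, script, style, .ad").remove();
    const body =
      $("article").text().trim() || $("main").text().trim() || $("body").text().trim();

    return {
      url,
      title: meta("og:title") ?? ($("title").text().trim() || null),
      author: meta("article:author") ?? meta("author"),
      publishedAt: meta("article:published_time"),
      siteName: meta("og:site_name"),
      body,
    };
  }

The heuristic relevance scoring from step 4 runs on the returned body; that filter is sketched in the "Relevance Scoring" section below.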

Phase 2: The LLM Enrichment (Smart)

Once we have clean, high-signal text, we bring in the heavy guns. We pass the cleaned JSON to an LLM (Claude or OpenAI) for the tasks that actually require intelligence (see the sketch after this list):

  • Synthesis: "Summarize this for a CTO worried about grid reliability." (Something regex can't do)
  • Sentiment Nuance: Distinguishing between "investment announced" (good) and "project delayed" (bad) in complex sentences.
  • Structured Extraction: Converting "two billion dollars" and "$2.5B" into standardized numbers for our database.
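
A hedged sketch of that enrichment call, using the OpenAI SDK's JSON mode. The prompt, model name, and output fields below are assumptions for illustration, not our production schema:

  // Illustrative enrichment step; swap in an Anthropic client for Claude.
  import OpenAI from "openai";

  const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

  interface Enrichment {
    summary: string;                // CTO-oriented synthesis
    sentiment: "positive" | "negative" | "neutral";
    dealValueUsd: number | null;    // "two billion dollars" / "$2.5B" -> 2000000000 / 2500000000
    tags: string[];
  }

  async function enrich(draft: { title: string | null; body: string }): Promise<Enrichment> {
    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      response_format: { type: "json_object" },
      messages: [
        {
          role: "system",
          content:
            "Summarize this energy/AI infrastructure article for a CTO worried about grid reliability. " +
            "Return JSON with keys: summary, sentiment, dealValueUsd, tags.",
        },
        { role: "user", content: JSON.stringify(draft) },
      ],
    });

    return JSON.parse(completion.choices[0].message.content ?? "{}") as Enrichment;
  }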

Why This Pattern Matters

1. Cost Control
By cleaning the HTML locally, we reduce the token count by 60-80% before the API call. We pay to process information, not <div> tags.

2. Speed
Local relevance scoring means we can discard low-value articles in milliseconds, without waiting on an LLM API round trip.

3. Reliability
If the LLM is down or hallucinates, we still have the locally extracted title, date, and raw content. The system degrades gracefully.
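
In code, the degradation path is just a guarded merge. This sketch reuses the hypothetical extractLocal and enrich helpers from above: the local record always survives, and enrichment is layered on top only when the call succeeds.

  // Minimal sketch of graceful degradation around the LLM call.
  async function processUrl(url: string) {
    const draft = await extractLocal(url); // local title, date, raw content always survive

    try {
      const enrichment = await enrich(draft);
      return { ...draft, ...enrichment, enriched: true };
    } catch (err) {
      // LLM down, rate-limited, or returning garbage: store the raw extraction anyway.
      console.warn(`Enrichment failed for ${url}, keeping local fields only`, err);
      return { ...draft, enriched: false };
    }
  }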

The Extraction Logic

Title & Content

We use a "waterfall" strategy. Try the most reliable method (Open Graph tags); if missing, fall back to semantic HTML (<article>); if missing, use heuristics (largest text block). This ensures we get something usable from almost any site.

Relevance Scoring

We maintain a weighted dictionary of domain-specific terms ("PPA", "interconnection", "H100"). An article must cross a point threshold to be considered "intelligence." This simple filter saves us from filling our database with generic tech news.
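
The scoring itself is only a few lines. A minimal sketch, with made-up weights and a made-up threshold; the real dictionary is larger and domain-tuned:

  // Hypothetical weights; terms come from our energy/AI domain.
  const RELEVANCE_WEIGHTS: Record<string, number> = {
    "ppa": 3,
    "interconnection": 3,
    "h100": 2,
    "megawatt": 2,
    "data center": 2,
    "grid": 1,
  };

  const RELEVANCE_THRESHOLD = 5;

  function isRelevant(text: string): boolean {
    const haystack = text.toLowerCase();
    let score = 0;
    for (const [term, weight] of Object.entries(RELEVANCE_WEIGHTS)) {
      // Count occurrences, capped so one repeated keyword can't carry the article alone.
      const hits = haystack.split(term).length - 1;
      score += Math.min(hits, 3) * weight;
    }
    return score >= RELEVANCE_THRESHOLD;
  }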

The Bottom Line

You don't need AI for everything.

The most effective AI systems are often 20% AI and 80% solid engineering. By letting code do what code does best (scraping, filtering, formatting), we free up the AI to do what it does best (reasoning and synthesis).


See our Intelligence Feed in action at /intelligence-feed. The source code for our extractor is available in our repository.
