The 91% Token Tax: How We Slashed Agent Costs by Rethinking MCP

The Problem We Discovered

We were spending $0.45 per request just on tool definitions.

Not on actual AI work. Just telling the model what tools exist.

With 30 tools in context, you're burning 150,000 tokens before the conversation even starts. At $3 per million input tokens for Claude, that's 150,000 × $3 / 1,000,000 = $0.45 per request, and it adds up fast.

So we rewrote everything.

The Solution: Code-First Skills Pattern

Instead of loading tool schemas into the context window, we:

  1. Compile skills to a tiny catalog (~700 bytes total)
  2. Load catalog only at runtime
  3. Fetch full skill instructions on-demand (only when actually needed)
  4. Execute via single unified function

The Results

Metric                 | Before | After  | Reduction
-----------------------|--------|--------|----------
Token count (30 tools) | 84,000 | 600    | 99.3%
Cost per request       | $0.45  | $0.003 | 99.3%

The Key Insight

Treat skills like files on disk, not system prompts.

Agents discover what they need. They don't carry everything everywhere.

Think about how you work: you don't read the entire manual before starting a task. You look up the specific section when you need it.

How It Works

Step 1: Compile Skills to Catalog

Each skill has a skill.md file with YAML frontmatter. At build time, we extract just the metadata:

{
  "lcoe-calculator": {
    "name": "LCOE Calculator",
    "trigger": "lcoe|levelized cost|energy economics",
    "agents": ["INVESTMENT_INTELLIGENCE", "COST_PREDICTION"]
  }
}

That's it. ~50 bytes per skill instead of 5,000.
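Here's a minimal sketch of that build step. The name, trigger, and agents fields come from the catalog example above; everything else (file names, the PyYAML dependency, helper names) is illustrative rather than our exact implementation:

# build_catalog.py - compile skill.md frontmatter into a compact runtime catalog (sketch)
import json
import pathlib

import yaml  # PyYAML, assumed available at build time only


def compile_catalog(skills_dir: str = "skills") -> dict:
    """Extract just the metadata needed for trigger matching."""
    catalog = {}
    for skill_md in pathlib.Path(skills_dir).glob("*/skill.md"):
        # Frontmatter sits between the first two '---' markers.
        _, frontmatter, _body = skill_md.read_text().split("---", 2)
        meta = yaml.safe_load(frontmatter)
        catalog[skill_md.parent.name] = {
            "name": meta["name"],
            "trigger": meta["trigger"],
            "agents": meta["agents"],
        }
    return catalog


if __name__ == "__main__":
    catalog_json = json.dumps(compile_catalog(), indent=2)
    pathlib.Path("skill_catalog.json").write_text(catalog_json)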

Step 2: Match at Runtime

When a query comes in, we do a simple regex match against triggers. If we hit, we load the full skill instructions.
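A rough sketch of that lookup, with a hypothetical catalog file name (skill_catalog.json) and loader:

# match_skills.py - runtime trigger matching against the compiled catalog (sketch)
import json
import pathlib
import re


def match_skills(query: str, catalog: dict) -> list[str]:
    """Return ids of skills whose trigger regex matches the query."""
    return [
        skill_id
        for skill_id, meta in catalog.items()
        if re.search(meta["trigger"], query, re.IGNORECASE)
    ]


def load_instructions(skill_id: str, skills_dir: str = "skills") -> str:
    """Fetch the full skill.md only when a trigger fires."""
    return (pathlib.Path(skills_dir) / skill_id / "skill.md").read_text()


catalog = json.loads(pathlib.Path("skill_catalog.json").read_text())
for skill_id in match_skills("What's the LCOE on this solar project?", catalog):
    print(load_instructions(skill_id))  # only now do the full instructions enter context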

Step 3: Single Executor

Instead of 30 different tool schemas, we have one:

execute_skill(skill_id: string, script: string, input: object)

The agent knows how to call this. The skill instructions tell it what to put in the input.
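A simplified sketch of what that executor can look like, assuming every script speaks JSON over stdin/stdout; the paths and example arguments are illustrative:

# executor.py - one tool schema instead of thirty (sketch)
import json
import pathlib
import subprocess
import sys


def execute_skill(skill_id: str, script: str, input: dict) -> dict:
    """Run skills/<skill_id>/scripts/<script> with JSON on stdin, JSON on stdout."""
    script_path = pathlib.Path("skills") / skill_id / "scripts" / script
    result = subprocess.run(
        [sys.executable, str(script_path)],
        input=json.dumps(input),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


# Hypothetical call - the input keys depend on the skill's input.schema.json
print(execute_skill("lcoe-calculator", "calculate_lcoe.py", {
    "capex": 1_200_000, "annual_opex": 30_000, "annual_mwh": 9_000,
    "lifetime_years": 25, "discount_rate": 0.07,
}))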

The Architecture

skills/
└── lcoe-calculator/
    ├── skill.md              # Entry point with metadata
    ├── scripts/              # Executable Python scripts
    │   ├── calculate_lcoe.py
    │   └── sensitivity.py
    ├── schemas/              # JSON schemas for validation
    │   ├── input.schema.json
    │   └── output.schema.json
    └── examples/             # Sample inputs/outputs

Why This Works Better

  • Lazy loading: Only fetch what you need, when you need it
  • Separation of concerns: Skill definitions are independent of the agent
  • Easy testing: Skills are just Python scripts - test them directly
  • Version control: Each skill is a directory you can git diff
  • Multi-language: Skills can be Python, Node, or any executable

The Anthropic Inspiration

We based this on Anthropic's "Code Execution with MCP" pattern but pushed it further. Their insight: agents should write and execute code, not just call predefined tools.

Our extension: predefine the code patterns as skills, so agents get the best of both worlds - reliable, tested scripts with the flexibility of dynamic invocation.

Implementation Tips

  1. Start with zero dependencies. Skills should run with just Python stdlib where possible.
  2. JSON in, JSON out. Standardize the interface so the agent always knows how to call skills (see the sketch after this list).
  3. Include examples. The agent learns from seeing sample inputs/outputs.
  4. Test at build time. Run skill tests as part of CI to catch regressions.
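To make tips 1 and 2 concrete, here's what a stdlib-only, JSON-in/JSON-out skill script can look like. The formula is a textbook-style simplified LCOE for illustration; the real script's inputs and math may differ:

# skills/lcoe-calculator/scripts/calculate_lcoe.py - stdlib-only skill script (sketch)
import json
import sys


def lcoe(capex: float, annual_opex: float, annual_mwh: float,
         lifetime_years: int, discount_rate: float) -> float:
    """Levelized cost of energy: discounted lifetime costs / discounted lifetime output."""
    costs = capex + sum(annual_opex / (1 + discount_rate) ** y
                        for y in range(1, lifetime_years + 1))
    output = sum(annual_mwh / (1 + discount_rate) ** y
                 for y in range(1, lifetime_years + 1))
    return costs / output


if __name__ == "__main__":
    params = json.load(sys.stdin)                               # JSON in
    json.dump({"lcoe_per_mwh": lcoe(**params)}, sys.stdout)     # JSON out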

The Bottom Line

If you're building multi-agent systems and hitting token limits or cost walls, look at your tool definitions first. That's probably where the waste is.

The solution isn't fewer tools. It's smarter tool loading.


Full implementation details in our GitHub. Happy to share if anyone's hitting the same wall.
