The 91% Token Tax: How We Slashed Agent Costs by Rethinking MCP

The Problem We Discovered

We were spending $0.45 per request just on tool definitions.

Not on actual AI work. Just telling the model what tools exist.

With 30 tools in context, you're burning 150,000 tokens before the conversation even starts. At $3/million input tokens for Claude, that adds up fast.

So we rewrote everything.

The Solution: Code-First Skills Pattern

Instead of loading tool schemas into the context window, we:

  1. Compile skills to a tiny catalog (~700 bytes total)
  2. Load catalog only at runtime
  3. Fetch full skill instructions on-demand (only when actually needed)
  4. Execute via single unified function

The Results

Metric                        Before      After      Reduction
Token count (30 tool defs)    84,000      600        99.3%
Cost per request              $0.45       $0.003     99.3%

The Key Insight

Treat skills like files on disk, not system prompts.

Agents discover what they need. They don't carry everything everywhere.

Think about how you work: you don't read the entire manual before starting a task. You look up the specific section when you need it.

How It Works

Step 1: Compile Skills to Catalog

Each skill has a skill.md file with YAML frontmatter. At build time, we extract just the metadata:

{
  "lcoe-calculator": {
    "name": "LCOE Calculator",
    "trigger": "lcoe|levelized cost|energy economics",
    "agents": ["INVESTMENT_INTELLIGENCE", "COST_PREDICTION"]
  }
}

That's it. ~50 bytes per skill instead of 5,000.
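
Here's roughly what that build step can look like. This is a sketch, not our exact compiler: it assumes skills live under skills/, that each skill.md opens with a YAML frontmatter block delimited by "---" lines, and that the frontmatter carries name, trigger, and agents fields.

# build_catalog.py - sketch of the build-time catalog compiler.
# Assumes each skills/<id>/skill.md starts with YAML frontmatter
# (name, trigger, agents) delimited by "---" lines.
import json
from pathlib import Path

import yaml  # PyYAML


def extract_frontmatter(skill_md: Path) -> dict:
    """Return the YAML frontmatter of a skill.md as a dict."""
    text = skill_md.read_text(encoding="utf-8")
    _, frontmatter, _ = text.split("---", 2)  # frontmatter sits between the first two "---"
    return yaml.safe_load(frontmatter)


def build_catalog(skills_dir: str = "skills") -> dict:
    """Collapse every skill.md into one compact catalog entry per skill."""
    catalog = {}
    for skill_md in sorted(Path(skills_dir).glob("*/skill.md")):
        meta = extract_frontmatter(skill_md)
        catalog[skill_md.parent.name] = {  # directory name doubles as the skill id
            "name": meta["name"],
            "trigger": meta["trigger"],
            "agents": meta.get("agents", []),
        }
    return catalog


if __name__ == "__main__":
    # Emit the compact catalog that gets loaded at runtime.
    print(json.dumps(build_catalog(), indent=2))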

Step 2: Match at Runtime

When a query comes in, we run a simple regex match against the catalog triggers. If a trigger matches, we load the full skill instructions.
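
In code, that match can stay tiny. A sketch, assuming the catalog from Step 1 is written to a skill_catalog.json file (the file name is our placeholder here) and that each skill keeps its full instructions in skills/<id>/skill.md:

# match_skill.py - sketch of runtime matching against the compiled catalog.
import json
import re
from pathlib import Path


def match_skills(query: str, catalog: dict) -> list[str]:
    """Return the ids of every skill whose trigger regex matches the query."""
    return [
        skill_id
        for skill_id, meta in catalog.items()
        if re.search(meta["trigger"], query, re.IGNORECASE)
    ]


def load_instructions(skill_id: str, skills_dir: str = "skills") -> str:
    """Fetch the full skill.md only after a trigger hit."""
    return Path(skills_dir, skill_id, "skill.md").read_text(encoding="utf-8")


catalog = json.loads(Path("skill_catalog.json").read_text(encoding="utf-8"))
for skill_id in match_skills("What's the levelized cost for this solar project?", catalog):
    instructions = load_instructions(skill_id)  # injected into the agent's context on demand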

Step 3: Single Executor

Instead of 30 different tool schemas, we have one:

execute_skill(skill_id: string, script: string, input: object)

The agent knows how to call this. The skill instructions tell it what to put in the input.
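
On the backend, that one schema can be served by something like the sketch below. It assumes scripts follow the JSON-in, JSON-out convention from the tips further down; the paths and the timeout are illustrative.

# execute_skill.py - sketch of the single unified executor behind the one tool schema.
# Assumes each script reads one JSON object on stdin and writes one JSON object to stdout.
import json
import subprocess
import sys
from pathlib import Path


def execute_skill(skill_id: str, script: str, input: dict) -> dict:
    script_path = Path("skills", skill_id, "scripts", script)
    if not script_path.exists():
        raise FileNotFoundError(f"Unknown script for skill '{skill_id}': {script}")

    result = subprocess.run(
        [sys.executable, str(script_path)],
        input=json.dumps(input),
        capture_output=True,
        text=True,
        timeout=60,  # illustrative bound; skills should stay short-lived
        check=True,  # surface non-zero exits instead of returning garbage
    )
    return json.loads(result.stdout)


# How the agent-facing call maps onto it (input fields are illustrative):
# execute_skill("lcoe-calculator", "calculate_lcoe.py", {"capex": 1_200_000, "years": 25})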

The Architecture

skills/
└── lcoe-calculator/
    ├── skill.md              # Entry point with metadata
    ├── scripts/              # Executable Python scripts
    │   ├── calculate_lcoe.py
    │   └── sensitivity.py
    ├── schemas/              # JSON schemas for validation
    │   ├── input.schema.json
    │   └── output.schema.json
    └── examples/             # Sample inputs/outputs
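
The schemas directory is what lets us reject bad payloads before a script ever runs. A minimal sketch of that check, using the jsonschema package; wiring it up this way is our illustration, not a requirement of the layout:

# validate_io.py - sketch of validating a payload against a skill's input schema.
import json
from pathlib import Path

from jsonschema import validate  # pip install jsonschema


def validate_input(skill_id: str, payload: dict) -> None:
    """Raise jsonschema.ValidationError if the payload violates input.schema.json."""
    schema_path = Path("skills", skill_id, "schemas", "input.schema.json")
    schema = json.loads(schema_path.read_text(encoding="utf-8"))
    validate(instance=payload, schema=schema)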

Why This Works Better

  • Lazy loading: Only fetch what you need, when you need it
  • Separation of concerns: Skill definitions are independent of the agent
  • Easy testing: Skills are just Python scripts - test them directly (see the sketch after this list)
  • Version control: Each skill is a directory you can git diff
  • Multi-language: Skills can be Python, Node, or any executable
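
Because every script speaks plain JSON over stdin/stdout, a test can drive it exactly the way the executor does. A sketch with pytest; the script path and field names are illustrative:

# test_lcoe_skill.py - sketch of testing a skill script directly, no agent in the loop.
import json
import subprocess
import sys


def test_calculate_lcoe_returns_positive_cost():
    payload = {"capex": 1_200_000, "annual_opex": 25_000, "annual_mwh": 10_000, "years": 25}
    result = subprocess.run(
        [sys.executable, "skills/lcoe-calculator/scripts/calculate_lcoe.py"],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,
    )
    output = json.loads(result.stdout)
    assert output["lcoe_per_mwh"] > 0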

The Anthropic Inspiration

We based this on Anthropic's "Code Execution with MCP" pattern but pushed it further. Their insight: agents should write and execute code, not just call predefined tools.

Our extension: predefine the code patterns as skills, so agents get the best of both worlds - reliable, tested scripts with the flexibility of dynamic invocation.

Implementation Tips

  1. Start with zero dependencies. Skills should run with just Python stdlib where possible.
  2. JSON in, JSON out. Standardize the interface so the agent always knows how to call skills (see the sketch after this list).
  3. Include examples. The agent learns from seeing sample inputs/outputs.
  4. Test at build time. Run skill tests as part of CI to catch regressions.
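
For tip 2, a skill script under this convention might look like the following sketch; the LCOE formula is deliberately simplified and the field names are illustrative, not our production script:

# calculate_lcoe.py - sketch of the JSON-in, JSON-out convention from tip 2.
# Reads one JSON object from stdin, writes one JSON object to stdout, stdlib only.
import json
import sys


def simple_lcoe(capex: float, annual_opex: float, annual_mwh: float, years: int) -> float:
    """Undiscounted cost per MWh; a real script would discount the cash flows."""
    total_cost = capex + annual_opex * years
    total_energy_mwh = annual_mwh * years
    return total_cost / total_energy_mwh


if __name__ == "__main__":
    params = json.load(sys.stdin)
    lcoe = simple_lcoe(
        capex=params["capex"],
        annual_opex=params["annual_opex"],
        annual_mwh=params["annual_mwh"],
        years=params["years"],
    )
    json.dump({"lcoe_per_mwh": round(lcoe, 2)}, sys.stdout)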

The Bottom Line

If you're building multi-agent systems and hitting token limits or cost walls, look at your tool definitions first. That's probably where the waste is.

The solution isn't fewer tools. It's smarter tool loading.


Full implementation details in our GitHub. Happy to share if anyone's hitting the same wall.
