A Model Context Protocol (MCP) server enabling intelligent delegation from high-capability AI agents to cost-effective LLMs
Getting Started • Key Features • Usage Examples • Architecture
LLM Gateway is an MCP-native server that enables intelligent task delegation from advanced AI agents like Claude 3.7 Sonnet to more cost-effective models like Gemini 2.0 Flash Lite. It provides a unified interface to multiple Large Language Model (LLM) providers while optimizing for cost, performance, and quality.
The server is built on the Model Context Protocol (MCP), making it specifically designed to work with AI agents like Claude. All functionality is exposed through MCP tools that can be directly called by these agents, creating a seamless workflow for AI-to-AI delegation.
The primary design goal of LLM Gateway is to allow sophisticated AI agents like Claude 3.7 Sonnet to intelligently delegate tasks to less expensive models:
                   delegates to
┌─────────────┐ ─────────────────► ┌───────────────────┐          ┌──────────────┐
│  Claude 3.7 │                    │    LLM Gateway    │ ───────► │ Gemini Flash │
│   (Agent)   │ ◄───────────────── │    MCP Server     │ ◄─────── │ DeepSeek     │
└─────────────┘   returns results  └───────────────────┘          │ GPT-4o-mini  │
                                                                   └──────────────┘
Example workflow:
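A minimal sketch of such a workflow, reusing the hypothetical client API from the fuller example later in this README (tool names, parameters, and response fields are taken from that example, so treat this as illustrative rather than a definitive API reference):

import asyncio
from mcp.client import Client

async def delegate_summary(text: str) -> str:
    # Claude (or another agent) connects to the LLM Gateway MCP server
    client = Client("http://localhost:8000")

    # The expensive agent hands the mechanical work to a cheap model
    result = await client.tools.summarize_document(
        document=text,
        provider="gemini",
        model="gemini-2.0-flash-lite",
        format="paragraph"
    )
    await client.close()

    print(f"Delegated summary cost: ${result['cost']:.6f}")
    return result["summary"]   # the agent then reasons over this cheap output

if __name__ == "__main__":
    print(asyncio.run(delegate_summary("... large document content ...")))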
This delegation pattern can save 70-90% on API costs while maintaining output quality.
The most powerful use case is enabling advanced AI agents to delegate routine tasks to cheaper models:
API costs for advanced models can be substantial. LLM Gateway helps reduce costs by:
Avoid provider lock-in with a unified interface:
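For example (a sketch; the generate_completion tool name is an assumption made for illustration, while the provider and model parameters follow the same calling convention as the tools shown in the usage examples below), switching vendors means changing two arguments rather than rewriting integration code:

import asyncio
from mcp.client import Client

async def compare_providers():
    client = Client("http://localhost:8000")
    prompt = "Summarize the key risks in this contract."

    # Same call shape for every vendor; only provider/model change.
    for provider, model in [("openai", "gpt-4o-mini"),
                            ("gemini", "gemini-2.0-flash-lite"),
                            ("deepseek", "deepseek-chat")]:
        result = await client.tools.generate_completion(
            prompt=prompt,
            provider=provider,
            model=model
        )
        # Response fields are assumed to mirror the other tools (e.g. a cost field).
        print(f"{provider}: ${result['cost']:.6f}")

    await client.close()

if __name__ == "__main__":
    asyncio.run(compare_providers())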
Process large documents efficiently:
This example shows how Claude can use the LLM Gateway to process a document by delegating tasks to cheaper models:
import asyncio
from mcp.client import Client

async def main():
    # Claude would use this client to connect to the LLM Gateway
    client = Client("http://localhost:8000")

    # Claude can identify a document that needs processing
    document = "... large document content ..."

    # Step 1: Claude delegates document chunking
    chunks_response = await client.tools.chunk_document(
        document=document,
        chunk_size=1000,
        method="semantic"
    )
    print(f"Document divided into {chunks_response['chunk_count']} chunks")

    # Step 2: Claude delegates summarization to a cheaper model
    summaries = []
    total_cost = 0
    for i, chunk in enumerate(chunks_response["chunks"]):
        # Use Gemini Flash (much cheaper than Claude)
        summary = await client.tools.summarize_document(
            document=chunk,
            provider="gemini",
            model="gemini-2.0-flash-lite",
            format="paragraph"
        )
        summaries.append(summary["summary"])
        total_cost += summary["cost"]
        print(f"Processed chunk {i+1} with cost ${summary['cost']:.6f}")

    # Step 3: Claude delegates entity extraction to another cheap model
    entities = await client.tools.extract_entities(
        document=document,
        entity_types=["person", "organization", "location", "date"],
        provider="openai",
        model="gpt-4o-mini"
    )
    total_cost += entities["cost"]
    print(f"Total delegation cost: ${total_cost:.6f}")

    # Claude would now process these summaries and entities using its advanced capabilities

    # Close the client when done
    await client.close()

if __name__ == "__main__":
    asyncio.run(main())
# Claude can compare outputs from different providers for critical tasks
responses = await client.tools.multi_completion(
    prompt="Explain the implications of quantum computing for cryptography.",
    providers=[
        {"provider": "openai", "model": "gpt-4o-mini", "temperature": 0.3},
        {"provider": "anthropic", "model": "claude-3-haiku-20240307", "temperature": 0.3},
        {"provider": "gemini", "model": "gemini-2.0-pro", "temperature": 0.3}
    ]
)

# Claude could analyze these responses and decide which is most accurate
for provider_key, result in responses["results"].items():
    if result["success"]:
        print(f"{provider_key} Cost: ${result['cost']}")
# Claude can define and execute complex multi-stage workflows
workflow = [
    {
        "name": "Initial Analysis",
        "operation": "summarize",
        "provider": "gemini",
        "model": "gemini-2.0-flash-lite",
        "input_from": "original",
        "output_as": "summary"
    },
    {
        "name": "Entity Extraction",
        "operation": "extract_entities",
        "provider": "openai",
        "model": "gpt-4o-mini",
        "input_from": "original",
        "output_as": "entities"
    },
    {
        "name": "Question Generation",
        "operation": "generate_qa",
        "provider": "deepseek",
        "model": "deepseek-chat",
        "input_from": "summary",
        "output_as": "questions"
    }
]

# Execute the workflow
results = await client.tools.execute_optimized_workflow(
    documents=[document],
    workflow=workflow
)

print(f"Workflow completed in {results['processing_time']:.2f}s")
print(f"Total cost: ${results['total_cost']:.6f}")
# Clone the repository
git clone https://github.com/yourusername/llm_gateway_mcp_server.git
cd llm_gateway_mcp_server
# Install with pip
pip install -e .
# Or install with optional dependencies
pip install -e .[all]
Create a .env file with your API keys:
# API Keys (at least one provider required)
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
GEMINI_API_KEY=your_gemini_key
DEEPSEEK_API_KEY=your_deepseek_key
# Server Configuration
SERVER_PORT=8000
SERVER_HOST=127.0.0.1
# Logging Configuration
LOG_LEVEL=INFO
USE_RICH_LOGGING=true
# Cache Configuration
CACHE_ENABLED=true
CACHE_TTL=86400
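With CACHE_ENABLED=true, identical requests can be answered from the cache for up to CACHE_TTL seconds (86400 seconds is 24 hours). A rough client-side check (a sketch reusing the hypothetical client API from the examples above; exactly how a cache hit shows up in latency and reported cost is an assumption about the server's behavior):

import asyncio
import time
from mcp.client import Client

async def check_cache():
    client = Client("http://localhost:8000")
    args = dict(
        document="... large document content ...",
        provider="gemini",
        model="gemini-2.0-flash-lite"
    )

    # Issue the same request twice; with caching enabled the second call is
    # expected to return noticeably faster.
    for attempt in (1, 2):
        start = time.perf_counter()
        result = await client.tools.summarize_document(**args)
        print(f"call {attempt}: {time.perf_counter() - start:.2f}s, "
              f"reported cost ${result['cost']:.6f}")

    await client.close()

if __name__ == "__main__":
    asyncio.run(check_cache())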
# Start the MCP server
python -m llm_gateway.cli.main run
# Or with Docker
docker compose up
Once running, the server will be available at http://localhost:8000.
Using LLM Gateway for delegation can yield significant cost savings:
| Task | Claude 3.7 Direct | Delegated to Cheaper LLM | Savings |
|------|-------------------|--------------------------|---------|
| Summarizing 100-page document | $4.50 | $0.45 (Gemini Flash) | 90% |
| Extracting data from 50 records | $2.25 | $0.35 (GPT-4o-mini) | 84% |
| Generating 20 content ideas | $0.90 | $0.12 (DeepSeek) | 87% |
| Processing 1,000 customer queries | $45.00 | $7.50 (Mixed delegation) | 83% |
These savings are achieved while maintaining high-quality outputs by letting Claude focus on high-level reasoning and orchestration while delegating mechanical tasks to cost-effective models.
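The Savings column is simply (direct cost - delegated cost) / direct cost, computed from the figures in the table:

# Recompute the table's savings percentages from its own numbers.
rows = [
    ("Summarizing 100-page document",      4.50,  0.45),
    ("Extracting data from 50 records",    2.25,  0.35),
    ("Generating 20 content ideas",        0.90,  0.12),
    ("Processing 1,000 customer queries", 45.00,  7.50),
]
for task, direct, delegated in rows:
    savings = (direct - delegated) / direct
    print(f"{task}: {savings:.0%} saved")   # 90%, 84%, 87%, 83%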
The LLM Gateway is built natively on the Model Context Protocol:
This ensures seamless integration with Claude and other MCP-compatible agents.
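For example, any generic MCP client can connect and enumerate the gateway's tools. The sketch below uses the official mcp Python SDK; the SSE endpoint path is an assumption about how this server is mounted:

import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

async def list_gateway_tools():
    # Open an SSE transport to the gateway and start a standard MCP session.
    async with sse_client("http://localhost:8000/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            for tool in tools.tools:
                print(f"{tool.name}: {tool.description}")

if __name__ == "__main__":
    asyncio.run(list_gateway_tools())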
┌─────────────┐          ┌───────────────────┐          ┌──────────────┐
│ Claude 3.7  │ ────────►│  LLM Gateway MCP  │ ────────►│ LLM Providers│
│  (Agent)    │ ◄────────│  Server & Tools   │ ◄────────│  (Multiple)  │
└─────────────┘          └───────┬───────────┘          └──────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   ┌───────────────┐    ┌───────────────┐    ┌───────────────┐   │
│   │  Completion   │    │   Document    │    │  Extraction   │   │
│   │     Tools     │    │     Tools     │    │     Tools     │   │
│   └───────────────┘    └───────────────┘    └───────────────┘   │
│                                                                 │
│   ┌───────────────┐    ┌───────────────┐    ┌───────────────┐   │
│   │ Optimization  │    │   Core MCP    │    │   Analytics   │   │
│   │     Tools     │    │    Server     │    │     Tools     │   │
│   └───────────────┘    └───────────────┘    └───────────────┘   │
│                                                                 │
│   ┌───────────────┐    ┌───────────────┐    ┌───────────────┐   │
│   │     Cache     │    │    Vector     │    │    Prompt     │   │
│   │    Service    │    │    Service    │    │    Service    │   │
│   └───────────────┘    └───────────────┘    └───────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
When Claude delegates a task to LLM Gateway:
Claude or other advanced AI agents can use LLM Gateway to:
Process large document collections efficiently:
Research teams can use LLM Gateway to:
This project is licensed under the MIT License - see the LICENSE file for details.