Cruxial sits between your LLM and your tool executors. It catches hallucinated arguments before they hit production, auto-repairs them in a single round-trip, and never adds more than a millisecond of latency. Drop-in for OpenAI, Anthropic, Azure, and any MCP server.
pip install cruxialcopied! · OpenAI · Anthropic · LiteLLM
51 public MCP servers. 603 tools. 352 prompts across 2 runs, 342 tool calls total. Every number below is reproducible from the repo. Intercept rate scales with schema complexity, not with model tier.
| Benchmark | Model | Intercept rate | One-shot repair | Sample |
|---|---|---|---|---|
| Real public MCP servers, pooled across 2 runs github · kubernetes · notion · salesforce · airtable · slack · ms-teams · atlassian · playwright · firecrawl · hubspot · zendesk · supabase · +38 more |
Azure gpt-4o | 5.85%95% CI 3.8 – 8.9% | 90.0%18 / 20 intercepts | 342 calls 51 servers 603 tools 352 prompts |
| Constraint-heavy production schemas enums · formats · regex · nested objects · datetime ranges |
Azure gpt-4o | 17.1% | 66.7% | 15 tools 70 prompts |
| Same schemas, smaller model what changing model tier alone does on the same prompts |
Azure gpt-5-mini-2 | 1.4%92% fewer than gpt-4o | 100% | 74 calls same 15 tools |
| Simple-schema control group filesystem, memory, time, fetch (the easy half of MCP) |
Azure gpt-4o | 0.0% | — | 7 servers 25 prompts |
| Robustness audit (no LLM) synthetic violation classification across the full real-world schema corpus |
classifier only | 100%rejection · 0 silent passes | — | 877 real schemas 1,947 violations |
pip install cruxial && python examples/azure_mcp_suite.pycopied!
Three weeks later a customer tells you the email was never sent. Your logs show HTTP 200.
It passes "sample_id" instead of the real ID. Or an integer where an email is required. The tool fails silently. The model writes "done." Your user never gets what they asked for.
You patch the retry logic. The integration works. Two days later a different tool fails. You spend another afternoon on a problem that should be solved once — at the layer level, not per-feature.
The agent wrote "done" without calling the tool. No error. No log entry. HTTP 200. You find out three days later from a customer — after the damage is already done.
One import. Wrap your tool list. Every call validated from that moment. No configuration, no schema changes, nothing to maintain.
# your existing code response = client.chat( model="gpt-4o", tools=my_tools, messages=messages ) # wrong args pass through. # tool never called. # user finds out last.
# one import, nothing else changes from cruxial import guard response = client.chat( model="gpt-4o", tools=guard(my_tools), messages=messages ) # validated before execution. # bad args corrected + retried. # every failure logged.
Valid calls pass through in under 50ms. Invalid calls get caught, corrected, and logged before the tool ever fires.
Three numbers from independent research, and one line from every developer who's shipped an agent.
The validation engine is open-source and self-hostable. The central workspace that visualizes schema drift, audits token consumption, and tracks agent health is what we manage.
We'll only email you about Cruxial Cloud. No newsletters.
Free forever. No credit card. Works on your existing code without changes.