Catch hallucinated tool-call arguments before they hit production. Auto-repair in a single round-trip, with under a millisecond of overhead.
pip install cruxialcopied! · Python 3.10+ · MIT
Real benchmarks on 51 public MCP servers. Every number below is reproducible from the repo.
| Benchmark | Model | Intercept rate | One-shot repair | Sample |
|---|---|---|---|---|
| Real public MCP servers, pooled across 2 runs github · kubernetes · notion · salesforce · airtable · slack · ms-teams · atlassian · playwright · firecrawl · hubspot · zendesk · supabase · +38 more |
Azure gpt-4o | 5.85%95% CI 3.8 – 8.9% | 90.0%18 / 20 intercepts | 342 calls 51 servers 603 tools 352 prompts |
| Constraint-heavy production schemas enums · formats · regex · nested objects · datetime ranges |
Azure gpt-4o | 17.1% | 66.7% | 15 tools 70 prompts |
| Same schemas, smaller model what changing model tier alone does on the same prompts |
Azure gpt-5-mini-2 | 1.4%92% fewer than gpt-4o | 100% | 74 calls same 15 tools |
| Simple-schema control group filesystem, memory, time, fetch (the easy half of MCP) |
Azure gpt-4o | 0.0% | — | 7 servers 25 prompts |
| Robustness audit (no LLM) synthetic violation classification across the full real-world schema corpus |
classifier only | 100%rejection · 0 silent passes | — | 877 real schemas 1,947 violations |
pip install cruxial && python examples/azure_mcp_suite.pycopied!
Three weeks later a customer tells you the email was never sent. Your logs show HTTP 200.
It passes "sample_id" instead of the real ID. Or an integer where an email is required. The tool fails silently. The model writes "done." Your user never gets what they asked for.
You patch the retry logic. The integration works. Two days later a different tool fails. You spend another afternoon on a problem that should be solved once — at the layer level, not per-feature.
The agent wrote "done" without calling the tool. No error. No log entry. HTTP 200. You find out three days later from a customer — after the damage is already done.
One import. Every tool call validated from that moment on. Nothing to configure.
# your existing code response = client.chat( model="gpt-4o", tools=my_tools, messages=messages ) # wrong args pass through. # tool never called. # user finds out last.
# one import, nothing else changes from cruxial import guard response = client.chat( model="gpt-4o", tools=guard(my_tools), messages=messages ) # validated before execution. # bad args corrected + retried. # every failure logged.
Valid calls pass through in under 50ms. Invalid calls get caught, corrected, and logged before the tool ever fires.
Three numbers from independent research. One line from every developer who's shipped an agent.
The SDK is free and open-source. The hosted dashboard is what we manage.
cruxial stats CLIWe'll only email you about Cruxial Cloud. No newsletters.
Free forever. No credit card. Works on your existing code without changes.