open source · MIT

Your agent said
it sent the email.
It didn't.

Catch hallucinated tool-call arguments before they hit production. Auto-repair in a single round-trip, with under a millisecond of overhead.

Star on GitHub → See how it works

pip install cruxialcopied! · Python 3.10+ · MIT

cruxial · intercepting live ▋

numbers, not vibes

Real runs. Public servers.
Clone it and get your own number.

Real benchmarks on 51 public MCP servers. Every number below is reproducible from the repo.

Benchmark	Model	Intercept rate	One-shot repair	Sample
Real public MCP servers, pooled across 2 runs github · kubernetes · notion · salesforce · airtable · slack · ms-teams · atlassian · playwright · firecrawl · hubspot · zendesk · supabase · +38 more	Azure gpt-4o	5.85%95% CI 3.8 – 8.9%	90.0%18 / 20 intercepts	342 calls 51 servers 603 tools 352 prompts
Constraint-heavy production schemas enums · formats · regex · nested objects · datetime ranges	Azure gpt-4o	17.1%	93.8%	15 tools 70 prompts
Same schemas, smaller model what changing model tier alone does on the same prompts	Azure gpt-5-mini-2	1.4%92% fewer than gpt-4o	100%	74 calls same 15 tools
Simple-schema control group filesystem, memory, time, fetch (the easy half of MCP)	Azure gpt-4o	0.0%	—	7 servers 25 prompts
Robustness audit (no LLM) synthetic violation classification across the full real-world schema corpus	classifier only	100%rejection · 0 silent passes	—	877 real schemas 1,947 violations

Zero silent passes across 342 live LLM tool calls. Twice. Same corpus, same result on the one metric that matters for a validation layer.

pooled across 2 independent runs · gpt-4o · 51 real public MCP servers · Wilson 95% upper bound: 1.1%

Your agent said it sent the email — and didn't. We catch that too: 100% recall, and we only ever act on a model-confirmed re-emission, so we never fabricate an action.

tool_bypass · 132-scenario adversarial set · acted-on precision 100% (Claude sonnet-4-6) / 98.4% (gpt-4o) · the catch validators structurally can't make

reproduce yourself git clone https://github.com/cruxial-ai/cruxial && cd cruxial && python examples/audit_mcp_schemas.pycopied!

the problem

You've shipped the integration.
It works in staging. Then production.

Three weeks later a customer tells you the email was never sent. Your logs show nothing wrong.

01 · wrong arguments

hallucination

The model invents its own arguments

It passes "sample_id" instead of the real ID. Or an integer where an email is required. The tool fails silently. The model writes "done." Your user never gets what they asked for.

02 · regressions

the fix loop

Every fix breaks something else

You patch the retry logic. The integration works. Two days later a different tool fails. You spend another afternoon on a problem that should be solved once — at the layer level, not per-feature.

03 · silent errors

no signal

Failures your logs will never show

The agent wrote "done" without calling the tool. No call ever went out — no error, no log entry, nothing to grep for. You find out three days later from a customer — after the damage is already done.

integration

Add it in 30 seconds.
Remove it just as fast.

One call wraps the whole tool step — validate, auto-repair, execute, log. Works with your OpenAI, Anthropic, Azure, or LiteLLM client. Nothing to configure.

before — failing silently

# your existing agent loop
for call in response.tool_calls:
  run_tool(call.name, call.arguments)

# wrong args pass straight through.
# the tool fails silently.
# your user finds out last.

after — caught before it fires

# one call runs the whole tool step
import cruxial

result = cruxial.run(
  client,            # openai / anthropic / azure
  model="gpt-4o",
  messages=messages,
  tools=tools,        # your existing defs
  executors=executors,
)
# called · validated · auto-repaired
# executed · logged — you keep your loop

mechanics

Five things happen
on every tool call.

Valid calls pass through with near-zero (<1ms) overhead. Invalid calls get caught, corrected, and logged before the tool ever fires.

Intercept

always · zero overhead

Validate

before execution

Correct

on failure

Retry

on failure

Log

always

01 / 05

Intercept

Wraps your tool list in one function call. Sits between the model response and your execution layer. Zero changes to your tool definitions, prompts, or app logic.

guard(schemas, executors) → validation layer → execution

interception log · 14ms ago

{

"tool": "send_email",

"status": "intercepted",

"original_args": { "recipient": 12345, "subject": "Follow up" },

"failure": "type_mismatch: recipient expected string, got int",

"retry_args": { "recipient": "user@company.com", "subject": "Follow up" },

"retry_outcome": "success",

"cruxial_overhead_ms": 0.4,

"repair_roundtrip_ms": 340

}

what the community is reporting

It's not just us saying this.

Three numbers from independent research. One line from every developer who's shipped an agent.

~0^%

GPT-4's success rate on real-world multi-step agent tasks. Most failures trace back to tool calls, not reasoning.

τ-bench · Anthropic + Princeton · 2024 · arxiv 2406.12045

0–19^%

Accuracy drop for top frontier models when the user simply rephrases the same request.

Berkeley Function Calling Leaderboard · gorilla.cs.berkeley.edu

0^%

Schema reliability OpenAI added in Structured Outputs. They shipped it because the unenforced baseline wasn't good enough.

openai.com · Structured Outputs launch · Aug 2024

The hardest part of running an LLM agent in production isn't getting it to call the right tool. It's everything after: the call succeeds, returns 200, and silently does the wrong thing.

recurring sentiment across developer discussions · Hacker News · 2024–2025

deployment models

Open Core for dev.
Cloud for scale.

The SDK is free and open-source. The hosted dashboard is what we manage.

cruxial core

Free

MIT licensed · Self-Hosted

JSON Schema validation on every tool call
Auto-repair via one corrective round-trip to the model
Fail-open — never crashes your app
Local SQLite telemetry + cruxial stats CLI

View GitHub Repository →

cruxial cloud

Waitlist

Fully Managed Middleware Service

Everything in Cruxial Core
Hosted dashboard for cross-app interception analytics
Schema drift detection across your tool registry
Slack, webhook, and PagerDuty alerts

We'll only email you about Cruxial Cloud. No newsletters.

Your agent saidit sent the email.It didn't.

Real runs. Public servers.Clone it and get your own number.

You've shipped the integration.It works in staging. Then production.

The model invents its own arguments

Every fix breaks something else

Failures your logs will never show

Add it in 30 seconds.Remove it just as fast.

Five things happenon every tool call.