tracingviolet

// the audit

Your tools have an agent problem. Let's fix that.

An Agent Readiness Audit is a live, multi-model diagnostic of your MCP server or API — traced from user intent to task outcome, delivered as fixes specific enough to ship.

// the report

What lands in your inbox

Four excerpts from a sample diagnostic audit, recreated below.

illustrative — fictional service

03 · executive summary

The Bottom Line

AGENT READINESS SCORE

68 / 100 · READINESS GAP
Ranges 66–71 across 4 models

SCORE BREAKDOWN · 5 DIMENSIONS
RELIABILITY 91
UTILITY 58
INVOCATION 94
EFFICIENCY 82
RECOVERABILITY 71
illustrative — fictional service

05 · tracing the agent's journey

Agent Readiness Map

≥ 95  AGENT-READY
90–94  MINOR FRICTION
80–89  NEEDS ATTENTION
< 80  READINESS GAP
01 · Invocation
Did the agent try to use the tool at all?
96% (412 / 429 observations)

02 · Presentability
Could the model load your tools without hitting platform limits?
100% (no risk: small tool surface)

03 · Input Friction
Could the agent reach a useful call without prep work?
88% (363 / 412 invoked observations)

04 · Reliability
Did the API call return successfully?
93% (752 / 808 API calls)

05 · Utility
When the call succeeded, was the response useful?
64% (481 / 752 successful calls)

Task Completion — did the user get what they came for?
71%
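
For readers who want to check the arithmetic, here is a minimal sketch of how the stage percentages and bands above could be reproduced from the raw counts. The function names are ours, and two details are assumptions the sample does not state: that rates are rounded to a whole percent, and that the band is chosen from the rounded value. Presentability is left out because the sample gives no counts for it.

```python
# Illustrative only: counts come from the fictional sample above.
# Band floors follow the legend; highest band is checked first.
BANDS = [
    (95, "AGENT-READY"),
    (90, "MINOR FRICTION"),
    (80, "NEEDS ATTENTION"),
    (0, "READINESS GAP"),
]

# stage name -> (observations that passed, observations considered)
STAGES = {
    "Invocation": (412, 429),
    "Input Friction": (363, 412),
    "Reliability": (752, 808),
    "Utility": (481, 752),
}


def stage_rate(passed: int, total: int) -> int:
    """Whole-number percentage of observations that passed a stage (assumed rounding)."""
    return round(100 * passed / total)


def band(score: float) -> str:
    """Return the first band whose lower bound the score meets."""
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "READINESS GAP"


def summarize(stages: dict) -> dict:
    """Map each stage name to its (percentage, band) pair."""
    summary = {}
    for name, (passed, total) in stages.items():
        rate = stage_rate(passed, total)
        summary[name] = (rate, band(rate))
    return summary


if __name__ == "__main__":
    for name, (rate, label) in summarize(STAGES).items():
        print(f"{name}: {rate}% ({label})")
```

Note that each stage uses the previous stage's survivors as its denominator (412 invoked observations feed Input Friction, 752 successful calls feed Utility), which is why a drop late in the journey can hide behind healthy early numbers.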
illustrative — fictional service

07 · model breakdown

How Different Models Perform

Agent readiness isn't one number — it varies by which model is driving. Each dimension is scored per model; cells use the same four-band scale as the rest of this report.

DIMENSION       claude-opus-4.7  claude-sonnet-4.6  gpt-5  gemini-3.1  Overall
Invocation      95               97                 91     93          94
Reliability     93               92                 89     90          91
Utility         62               59                 54     57          58
Efficiency      83               85                 79     81          82
Recoverability  72               66                 76     70          71
OVERALL         71               69                 66     67          68
≥ 95  AGENT-READY
90–94  MINOR FRICTION
80–89  NEEDS ATTENTION
< 80  READINESS GAP
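
One way to read the table: in this sample, every Overall cell matches the rounded simple mean of the four model scores in its row. The sketch below checks that and pulls out the per-row spread and the weakest model. Treat the averaging rule as an assumption drawn from the illustrative numbers, not as the audit's stated scoring method; the helper names are ours.

```python
from statistics import mean

# Column order matches the sample table above.
MODELS = ["claude-opus-4.7", "claude-sonnet-4.6", "gpt-5", "gemini-3.1"]

# Per-model scores copied from the fictional sample.
SCORES = {
    "Invocation": [95, 97, 91, 93],
    "Reliability": [93, 92, 89, 90],
    "Utility": [62, 59, 54, 57],
    "Efficiency": [83, 85, 79, 81],
    "Recoverability": [72, 66, 76, 70],
    "OVERALL": [71, 69, 66, 67],
}


def overall(row: list) -> int:
    """Rounded simple mean across models (assumed aggregation rule)."""
    return round(mean(row))


def spread(row: list) -> tuple:
    """Lowest and highest score in a row, e.g. the 'Ranges 66-71' figure."""
    return min(row), max(row)


def weakest_model(row: list) -> str:
    """Name of the model with the lowest score in a row."""
    return MODELS[row.index(min(row))]


if __name__ == "__main__":
    for dimension, row in SCORES.items():
        low, high = spread(row)
        print(f"{dimension}: overall {overall(row)}, range {low}-{high}, "
              f"weakest {weakest_model(row)}")
```

The spread is often the more useful figure: Recoverability ranges from 66 to 76 here, so a fix that helps the weakest model may matter more than the average suggests.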
illustrative — fictional service

22 · engineering tickets

Fix Specifications · Utility

FIX F-01

Each fix is written as an engineering ticket — the problem, the evidence behind it, the change, and how to confirm it shipped. "Done when" confirms the change is live.

F-01
Return a structured result from docs.create
API change

Addresses 31 of 87 observations

PROBLEM

docs.create confirms creation in prose. The agent can't parse an artifact id from a sentence, so it can't verify the document exists or cite it — task completion rests on trusting the prose. Agents either re-query for the document or hand the user an unverifiable answer.

EVIDENCE

31 of 87 docs.create observations across 4 models.

EXAMPLE OBSERVATION session-04d5
CURRENT
Created your document! View it at arvossi.dev/d/8f3k2
REVISED
{
  "artifact": {
    "id": "doc_8f3k2",
    "url": "https://arvossi.dev/d/8f3k2"
  },
  "status": "created"
}
DONE WHEN
docs.create responses carry artifact.id and status — and the agent's final answer can cite them.
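
The "done when" condition can be expressed as a small check. This is a sketch under assumptions: the function name is hypothetical, the response is assumed to arrive as a JSON string shaped like the REVISED example, and a real integration might receive it through an MCP client or HTTP library instead.

```python
import json


def check_docs_create(raw: str) -> dict:
    """Return the citable fields from a docs.create response body.

    Raises ValueError when the body is prose or lacks artifact.id or status,
    which is the situation F-01 describes.
    """
    try:
        body = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("response is not JSON; nothing to cite") from exc

    if not isinstance(body, dict):
        raise ValueError("response is not a JSON object")

    artifact = body.get("artifact")
    if not isinstance(artifact, dict) or not artifact.get("id"):
        raise ValueError("response is missing artifact.id")
    if not body.get("status"):
        raise ValueError("response is missing status")

    # url is optional in this sketch; id and status are the fields F-01 requires.
    return {
        "id": artifact["id"],
        "status": body["status"],
        "url": artifact.get("url"),
    }


# The two bodies from the example observation above.
CURRENT = "Created your document! View it at arvossi.dev/d/8f3k2"
REVISED = json.dumps({
    "artifact": {"id": "doc_8f3k2", "url": "https://arvossi.dev/d/8f3k2"},
    "status": "created",
})

if __name__ == "__main__":
    print(check_docs_create(REVISED))
    try:
        check_docs_create(CURRENT)
    except ValueError as err:
        print(f"CURRENT fails the check: {err}")
```

Run against the CURRENT body, the check fails at the parsing step; run against the REVISED body, it returns an id and status the agent can quote in its final answer.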

// how we test

Live, end to end

Real models from multiple providers receive your tool surface, mirrored faithfully from your real schemas. We then send them realistic prompts built around what your users actually ask. The API calls they make hit your real endpoints, so the failures we find are the same ones real agents run into. And because we trace each task from intent → tool choice → call sequence → outcome, every finding comes with the exact step where things went wrong and a fix that targets it.
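
To make "the exact step where things went wrong" concrete, here is a minimal sketch of what a per-task trace record could look like. Everything in it is hypothetical: the `Trace` class, the session ids, and the sample intents are invented for illustration and do not describe the audit's internal format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Steps checked after the user's intent, in the order the agent moves through them.
STEPS = ("tool choice", "call sequence", "outcome")


@dataclass
class Trace:
    session: str   # hypothetical session id
    intent: str    # what the user asked for
    passed: dict   # step name -> True if that step went as intended

    def first_failure(self) -> Optional[str]:
        """Earliest step that failed, or None if the task completed.

        A step with no recorded result counts as failed, since the agent
        never got that far.
        """
        for step in STEPS:
            if not self.passed.get(step, False):
                return step
        return None


def failure_counts(traces: list) -> Counter:
    """Count traces by the first step where things went wrong."""
    counts = Counter()
    for trace in traces:
        step = trace.first_failure()
        if step is not None:
            counts[step] += 1
    return counts


# Invented sample data, for illustration only.
TRACES = [
    Trace("s-001", "Draft a release note",
          {"tool choice": True, "call sequence": True, "outcome": True}),
    Trace("s-002", "Draft a release note",
          {"tool choice": True, "call sequence": True, "outcome": False}),
    Trace("s-003", "Share the doc link",
          {"tool choice": False}),
]

if __name__ == "__main__":
    for step, count in failure_counts(TRACES).items():
        print(f"{step}: {count} trace(s)")
```

Grouping by first failure, rather than by every failure, keeps each finding tied to one step and one fix.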

// what we need from you

Four questions to start

// pricing

$10,000

One engagement, one price, no tiers, no surprises.

// questions

You may be wondering...

Is live testing safe for our production service?
We strongly prefer a sandbox or test account with read-only credentials, and we honor your rate limits. If a test task would write data, we agree on that scope with you first. And we only ever test your service from the outside: we hand you fixes; we never touch your code or configuration.
What access do you need?
Typically an API key or a test-account login. We tell you exactly what we need for your setup before anything starts.
What happens to our credentials and data?
They are used only for the engagement, and we'll sign a mutual NDA before you share anything sensitive.
Which AI models do you test with?
Current Claude, GPT, and Gemini models, spanning flagship and fast tiers. We refresh the lineup as new versions ship, and your report names the exact models and versions used.
How much improvement should we expect?
We don't project improvement. After you ship fixes, lift is established by retesting and measuring, not by forecasting. What we observe is labeled as observed, recommendations describe expected direction rather than promised numbers, and category-level numbers are directional patterns. We'd rather under-claim than over-promise, and the report reads the same way.
How is this different from monitoring or observability tools?
Your logs show the API calls. They can't show what the user asked, why the agent called what it called, or whether the user got their answer. We trace that, from the outside, the way an agent actually experiences your service.

// book an audit

Start the conversation

$10,000 — no tiers, no surprises