// the audit

Your tools have an agent problem. Let's fix that.

An Agent Readiness Audit is a live, multi-model diagnostic of your MCP server or API — traced from user intent to task outcome, delivered as fixes specific enough to ship.

Book an audit, or email hello@tracingviolet.dev

// the report

What lands in your inbox

Four excerpts from a sample diagnostic audit, recreated below.

illustrative — fictional service

03 · executive summary

The Bottom Line

AGENT READINESS SCORE

68 / 100 · READINESS GAP

Ranges 66–71 across 4 models

SCORE BREAKDOWN · 5 DIMENSIONS

DIMENSION	SCORE
RELIABILITY	91
UTILITY	58
INVOCATION	94
EFFICIENCY	82
RECOVERABILITY	71

illustrative — fictional service

05 · tracing the agent's journey

Agent Readiness Map

≥ 95	AGENT-READY
90–94	MINOR FRICTION
80–89	NEEDS ATTENTION
< 80	READINESS GAP

01

Invocation

Did the agent try to use the tool at all?

96%

412 / 429 observations

02

Presentability

Could the model load your tools without hitting platform limits?

100%

no risk — small tool surface

03

Input Friction

Could the agent reach a useful call without prep work?

88%

363 / 412 invoked observations

04

Reliability

Did the API call return successfully?

93%

752 / 808 API calls

05

Utility

When the call succeeded, was the response useful?

64%

481 / 752 successful calls

→ Task Completion: did the user get what they came for?

71%
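The stage percentages in the map follow from the counts printed beside them. Here is a minimal Python sketch, assuming each stage rate is simply passed divided by total, rounded to a whole percent. The `stage_rate` and `weakest_stage` helpers and the dictionary layout are illustrative names, not the audit's actual tooling.

```python
# Counts copied from the sample map above; the data structure is hypothetical.
STAGES = {
    "Invocation": (412, 429),      # invoked / observations
    "Input Friction": (363, 412),  # reached a useful call / invoked observations
    "Reliability": (752, 808),     # successful / API calls
    "Utility": (481, 752),         # useful / successful calls
}


def stage_rate(passed: int, total: int) -> int:
    """Return the pass rate as a whole-number percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total)


def weakest_stage(stages: dict[str, tuple[int, int]]) -> str:
    """Name the stage with the lowest pass rate, i.e. the largest drop-off."""
    return min(stages, key=lambda name: stages[name][0] / stages[name][1])


for name, (passed, total) in STAGES.items():
    print(f"{name}: {stage_rate(passed, total)}% ({passed} / {total})")
print("Largest drop-off:", weakest_stage(STAGES))
```

On the sample numbers this reproduces the 96%, 88%, 93%, and 64% shown in the map and points at Utility as the stage to fix first.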

illustrative — fictional service

07 · model breakdown

How Different Models Perform

Agent readiness isn't one number — it varies by which model is driving. Each dimension is scored per model; cells use the same four-band scale as the rest of this report.

DIMENSION	claude-opus-4.7	claude-sonnet-4.6	gpt-5	gemini-3.1	Overall
Invocation	95	97	91	93	94
Reliability	93	92	89	90	91
Utility	62	59	54	57	58
Efficiency	83	85	79	81	82
Recoverability	72	66	76	70	71
OVERALL	71	69	66	67	68

≥ 95	AGENT-READY
90–94	MINOR FRICTION
80–89	NEEDS ATTENTION
< 80	READINESS GAP
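To make the four-band scale concrete, the sketch below maps a score to its band and recomputes the Overall column of the sample table. Two assumptions are ours, not the report's: that Overall is the unweighted mean of the four model scores (the sample numbers are consistent with this, but the report excerpt doesn't say so), and that band boundaries are inclusive at the lower end. The `band` function and `SCORES` dictionary are illustrative names.

```python
from statistics import mean


def band(score: float) -> str:
    """Map a 0-100 score to the report's four-band scale."""
    if score >= 95:
        return "AGENT-READY"
    if score >= 90:
        return "MINOR FRICTION"
    if score >= 80:
        return "NEEDS ATTENTION"
    return "READINESS GAP"


# Per-model scores copied from the sample table, in column order.
SCORES = {
    "Invocation": [95, 97, 91, 93],
    "Reliability": [93, 92, 89, 90],
    "Utility": [62, 59, 54, 57],
    "Efficiency": [83, 85, 79, 81],
    "Recoverability": [72, 66, 76, 70],
}

for dimension, per_model in SCORES.items():
    overall = round(mean(per_model))  # assumed: unweighted mean across models
    print(f"{dimension}: {overall} ({band(overall)})")
```

Under those assumptions the recomputed Overall values match the table (94, 91, 58, 82, 71), and the headline score of 68 falls in the READINESS GAP band.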

illustrative — fictional service

22 · engineering tickets

Fix Specifications · Utility

FIX F-01

Each fix is written as an engineering ticket — the problem, the evidence behind it, the change, and how to confirm it shipped. "Done when" confirms the change is live.

F-01

Return a structured result from docs.create

API change

Addresses 31 of 87 observations

PROBLEM

docs.create confirms creation in prose. The agent can't parse an artifact id from a sentence, so it can't verify the document exists or cite it — task completion rests on trusting the prose. Agents either re-query for the document or hand the user an unverifiable answer.

EVIDENCE

31 of 87 docs.create observations across 4 models.

EXAMPLE OBSERVATION session-04d5

CURRENT

Created your document! View it at arvossi.dev/d/8f3k2

REVISED

{
  "artifact": {
    "id": "doc_8f3k2",
    "url": "https://arvossi.dev/d/8f3k2"
  },
  "status": "created"
}

DONE WHEN

docs.create responses carry artifact.id and status — and the agent's final answer can cite them.
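One way to check the "Done when" condition mechanically is a small validator like the sketch below. It assumes the response body arrives as a string and that "carry artifact.id and status" means both are present as non-empty strings. The `meets_done_when` name is hypothetical and not part of any real API.

```python
import json


def meets_done_when(raw_response: str) -> bool:
    """True if the response is JSON carrying artifact.id and status."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return False  # prose such as "Created your document!" lands here
    if not isinstance(payload, dict):
        return False
    artifact = payload.get("artifact")
    if not isinstance(artifact, dict):
        return False
    artifact_id = artifact.get("id")
    status = payload.get("status")
    has_id = isinstance(artifact_id, str) and bool(artifact_id)
    has_status = isinstance(status, str) and bool(status)
    return has_id and has_status


# The CURRENT and REVISED responses from the ticket above.
current = "Created your document! View it at arvossi.dev/d/8f3k2"
revised = (
    '{"artifact": {"id": "doc_8f3k2", "url": "https://arvossi.dev/d/8f3k2"},'
    ' "status": "created"}'
)

print(meets_done_when(current))  # False
print(meets_done_when(revised))  # True
```

The same check shows why the prose response fails an agent: there is no field to read the id from, so there is nothing to verify or cite.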

// what you get

an Agent Readiness Score (0–100) with five sub-scores
an Agent Readiness Map — where agents drop off, stage by stage
where agents get stuck — the named failure patterns behind the numbers, each backed by traced sessions
fixes at engineering-ticket specificity — exact schemas, exact description text, exact error-message formats
a per-model breakdown — how different AI models behave on your service
run-to-run stability — every engagement runs multiple independent passes; we report how stable the numbers are
"How to verify in your own logs" — for every fix, what to check before and after; the one thing your logs can't show you is whether the user's question was actually answered — that's what we measure
the full engineering appendix — every tool exercised, every prompt we asked. Nothing about the test is a black box
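The page doesn't say which statistic is used for run-to-run stability. As one plausible reading, the sketch below summarizes a score across independent passes by its mean, its min-to-max spread, and its population standard deviation. The `stability` function and the pass scores are made up purely to show the shape of such a summary.

```python
from statistics import mean, pstdev


def stability(pass_scores: list[float]) -> dict[str, float]:
    """Summarize how much a score moves across independent passes."""
    if len(pass_scores) < 2:
        raise ValueError("need at least two passes to judge stability")
    return {
        "mean": mean(pass_scores),
        "spread": max(pass_scores) - min(pass_scores),
        "stdev": pstdev(pass_scores),
    }


# Hypothetical scores from three passes; not taken from any real engagement.
summary = stability([67, 68, 69])
print(summary)
```

A small spread suggests the score reflects the service rather than model randomness. A large one is a signal to run more passes before trusting the number.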

// how we test

Live, end to end

Real models from multiple providers receive your tool surface, mirrored faithfully from your real schemas. We then send them realistic prompts built around what your users actually ask. The API calls they make hit your real endpoints, so the failures we find are the same ones real agents run into. And because we trace each task from intent → tool choice → call sequence → outcome, every finding comes with the exact step where things went wrong and a fix that targets it.
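The intent → tool choice → call sequence → outcome chain can be pictured as a small record per task, with a helper that reports the first step that went wrong. This is a sketch under our own assumptions: the `Trace` fields, the step names, and the rule that a missing step counts as a failure are all hypothetical, not the audit's real data model.

```python
from dataclasses import dataclass, field
from typing import Optional

# The four steps named in the text, in order.
STEPS = ("intent", "tool_choice", "call_sequence", "outcome")


@dataclass
class Trace:
    """One traced task. Field names are hypothetical."""
    session_id: str
    model: str
    prompt: str
    # Maps each step to True (fine) or False (went wrong).
    step_ok: dict[str, bool] = field(default_factory=dict)


def first_failure(trace: Trace) -> Optional[str]:
    """Return the earliest step that went wrong, or None if all steps passed."""
    for step in STEPS:
        # A step with no recorded result is treated as a failure.
        if not trace.step_ok.get(step, False):
            return step
    return None


# Illustrative trace; the values are invented for the example.
trace = Trace(
    session_id="session-example",
    model="example-model",
    prompt="Create a doc summarizing this week's notes",
    step_ok={
        "intent": True,
        "tool_choice": True,
        "call_sequence": True,
        "outcome": False,
    },
)
print(first_failure(trace))  # outcome
```

Recording the first failing step, rather than only pass or fail per task, is what lets a finding name where in the journey to aim the fix.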

// what we need from you

Four questions to start

What does your service do? And who uses it?
Where can an AI assistant connect to it? An MCP endpoint, API docs, or an OpenAPI spec.
What do people ask agents to do with it? Real examples or pasted user prompts.
How do we get access to test? Read-only or sandbox credentials preferred; we test carefully and honor your rate limits.

// pricing

$10,000

One engagement, one price, no tiers, no surprises.

// questions

You may be wondering...

Is live testing safe for our production service?

We strongly prefer a sandbox or test account, honor your rate limits, and prefer read-only credentials. If a test task would write data, we agree on that scope with you first. And we only ever test your service from the outside — we hand you fixes; we never touch your code or configuration.

What access do you need?

Typically an API key or a test-account login. We tell you exactly what we need for your setup before anything starts.

What happens to our credentials and data?

Your credentials and data are used only for the engagement. We'll sign a mutual NDA before you share anything sensitive.

Which AI models do you test with?

Current Claude, GPT, and Gemini models, spanning flagship and fast tiers. We refresh the lineup as new versions ship, and your report names the exact models and versions used.

How much improvement should we expect?

We don't project improvement — after you ship fixes, lift is established by retesting and measuring, not by forecasting. What we observe is labeled as observed; recommendations describe expected direction, not promised numbers, and category-level numbers are directional patterns. We'd rather under-claim than over-promise — the report reads the same way.

How is this different from monitoring or observability tools?

Your logs show the API calls. They can't show what the user asked, why the agent called what it called, or whether the user got their answer. We trace that, from the outside, the way an agent actually experiences your service.

// book an audit

Start the conversation