tracingviolet

Agent Readiness Isn't One Thing: Codebase vs. Tool Surface

Agent readiness is becoming a category. It's also two completely different things — and confusing them is a fast way to optimize the wrong half of your stack.

In January 2026, FactoryAI launched a feature called Agent Readiness — a framework that scores your codebase across nine pillars (style, build, testing, docs, dev environment, observability, security, task discovery, product) and assigns a maturity level from Functional through Autonomous. They just raised a $150M Series C. The framework is good. The category they're building is real.

It's also half the picture.

// key findings
  • Agent readiness has two halves: codebase readiness — whether AI agents can work IN your code — and tool-surface readiness — whether AI agents can USE your APIs.
  • FactoryAI defined the codebase half with a structured 9-pillar / 5-maturity-level framework and automated PR-based remediation. Right answer for internal engineering teams making source code agent-friendly.
  • Tool-surface readiness lives at the API / MCP boundary, requires behavioral testing (not static analysis), and is where most externally-facing publishers leak the most agent value.
  • An agent readiness audit should cover both — they fail in different ways, on different timescales, and require completely different fixes.

// finding 01

What is an agent readiness audit?

An agent readiness audit is a structured assessment of whether autonomous AI agents can successfully work with your software — find the right entry points, supply correct inputs, get useful results back, and complete the user's task without humans intervening to clean up. It's the agent-era cousin of a security audit or an accessibility audit: a focused diagnostic on a specific dimension of how the world interacts with your code.

The catch: "your software" has at least two surfaces an agent has to deal with, and they're not the same problem.

There's the inside — your application source code, where coding agents like Claude Code, Devin, or Factory's own Droids are doing the work. They need to understand your repo structure, follow your conventions, run your tests, and not break your build. If your codebase is undocumented, untested, and inconsistently styled, the agent will produce garbage. Factory's framework measures this.

And there's the outside — your APIs, MCP servers, SDKs, anything you expose to other people's agents. These agents don't write code in your repo. They send a request, parse what comes back, and decide what to do next. If your tool descriptions are vague, your error messages are opaque, or your endpoints require ID lookups before any real work can happen, the agent will fail in ways your application logs can't explain. This is what we measure at tracingviolet.

Same word, two different problems, two completely different audits.

// finding 02

Why does agent readiness have two halves?

Because agents have two relationships with your software, and they break down differently.

When an agent works inside your code — refactoring, fixing a bug, adding a feature — it needs to understand the architecture, follow the patterns, and stay within the rails your engineering culture has built. Failures here look like: agents that introduce inconsistent style, agents that skip tests, agents that miss obvious context because there's no AGENTS.md or CONTRIBUTING.md file. These failures are static. You can audit them with file-existence checks, config parsing, and lint rules. Factory's framework is exactly this.

When an agent works with your tool surface — calling an MCP server, hitting your API on behalf of a user — it has a totally different problem. It has to pick the right tool from a list of options, send semantically appropriate parameters, get a useful response back, and synthesize an answer. Failures here look like: agents that never call your tool because the description doesn't signal what it does, agents that send the wrong asset slug because there's no discovery endpoint, agents that retry blindly because your 404 message says nothing actionable. These failures are dynamic — you can only see them by running real models against your real surface and watching what happens.

Static analysis works for the first kind. It cannot reach the second kind. To put it more directly: Factory's framework cannot tell you whether agents are actually using your tools. Our framework cannot tell you whether your CODEOWNERS file is up to date. They're orthogonal.

Agent Readiness Has Two Halves
Agent Readiness Has Two Halves

// finding 03

What does codebase readiness measure?

Codebase readiness measures whether your application source code is structured so that AI coding agents — the kind that touch files, run commands, open PRs — can work effectively in it. Factory's nine-pillar framework is the most complete public answer:

Pillar What it checks
Style & Validation Linters, formatters, type checkers configured and enforced
Build System Reproducible build, documented dependencies
Testing Unit + integration test coverage, CI runs them
Documentation README, AGENTS.md, runbooks, decision records
Dev Environment One-command setup, devcontainer, scripts/bootstrap
Debugging & Observability Logging, structured errors, traces
Security Secret scanning, dependency audits, CODEOWNERS
Task Discovery Issues structured, scope clear, well-named
Product & Experimentation Feature flags, evaluation infrastructure

You unlock the next maturity level — Functional, Documented, Standardized, Optimized, Autonomous — when 80% of the criteria at the previous level pass. Factory's tooling generates a report, and (this is the part competitors rarely match) opens PRs to fix the failing criteria automatically. It's smart. It works. It's the right thing to do if your agents are working in your code.

The ceiling: it's a linter. A repository that scores Level 5 on every pillar can still expose a tool surface that no agent can successfully use. The codebase audit doesn't reach the boundary where most agent traffic actually lives.

// finding 04

What does tool-surface readiness measure?

Tool-surface readiness measures whether your tool / API surface — the part agents call from outside your codebase — is structured so agents can discover it, choose it, use it correctly, and get useful results back. It's a behavioral question, not a structural one. You can't answer it by parsing files; you have to send real requests with real models and observe what happens.

We've measured this across thousands of agent-tool interactions, multiple model families, and several verticals. The Agent Readiness Map breaks the agent's journey into stages:

  1. Invocation — does the agent attempt a tool call at all, or answer from training data?
  2. Presentability — can the model even load your tool surface? (OpenAI models hard-cap at 128 tools.)
  3. Input friction — how many preparatory calls does the agent need before a value-producing call?
  4. Reliability — does the API actually return something?
  5. Utility — is the response useful, or empty/wrong/malformed?
  6. Task completion — did all of the above add up to the user getting their answer?

Each stage has its own failure mode and its own fix. A tool that fails at invocation needs better descriptions. One that fails at presentability needs to be bundled. One that fails at input friction needs natural-language inputs or a discovery endpoint. One that fails at reliability needs better infrastructure. One that fails at utility needs API design changes. One that fails at task completion needs all the above to compound.

You can't see any of this from your codebase. Your application logs show the API calls; they don't show why the agent picked your tool, or didn't, or whether the user actually got their question answered.

// finding 05

Are your APIs ready for AI agents?

The honest answer for most teams: probably not, and you can't tell from the inside.

Consider the data. Across our testing of agent interactions in weather, search, crypto, and developer tools verticals, we've seen the failure point shift dramatically by service type. Weather APIs fail at reliability (one provider's API timed out 100% of the time we tested it). Search APIs all execute successfully but vary wildly in utility — some return relevant content for one query type and zero useful results for another. Dev tool APIs hit a wall at presentability before any tool call happens, because their surface is too large for the model to load.

These are publisher-side problems. None of them show up in a codebase audit. None of them are visible in your API logs in any way that points to a fix. The 4xx rate looks normal. The latency looks normal. The agents are just quietly losing.

Hasan and colleagues found, in a study of 856 MCP tools, that 97% have at least one description-quality issue that affects how agents perceive them (arXiv 2602.14878). The pattern isn't that some publishers got it wrong — it's that almost everyone has, because the failure modes are invisible from the publisher side. You can ship a perfectly functional API and have agents systematically choose your competitor, and your dashboard will tell you nothing useful.

// finding 06

How does agent readiness differ from agent evaluation?

This trips people up. Worth being precise about.

Agent evaluation asks: how good is the agent? Given a set of tasks, does the model pick the right tool, call it correctly, recover from errors, and finish the job? The answer informs which model to deploy, how to prompt it, and where the model's limits are. Anthropic, Arize, LangSmith, DeepEval, Galileo, and a dozen others build great tooling for this. Worth using if you're operating an agent.

Agent readiness asks the inverse: how good is what the agent is working with? Given a competent agent, can it succeed against your codebase / your API / your MCP server? The answer informs which fixes to ship in your code or your tool surface. It's the publisher-side question, not the operator-side question.

The two are complementary. An evaluation says "this model succeeds on most of these tasks, and here's where it slips." A readiness audit says "your tool surface is the reason agents using any model are failing on the tasks where they fail." Both true. Different audiences. Different fixes.

If you're operating an agent: invest in evaluation tooling. If you're publishing a tool surface that agents call: invest in readiness — both halves.

// finding 07

How do I know which is my bottleneck?

A simple decision rubric:

You ship code that other people's coding agents work in? (Open-source projects, internal monorepos with Devin / Cursor agents running, anyone whose codebase is being modified by autonomous AI.) Codebase readiness is your primary lever. Factory's framework is the right starting point.

You ship APIs / MCP servers / SDKs that other people's agents call? (Public APIs, MCP server publishers, anyone whose tool surface is being consumed by agents you don't control.) Tool-surface readiness is your primary lever. Most teams here are leaking agent value they can't even see.

You ship both? Both apply. They don't conflict — codebase fixes and tool-surface fixes touch different files and don't trade off against each other. The order to start in: whichever surface has more agent traffic right now. If you don't know which has more agent traffic, that itself is a finding — your observability isn't yet calibrated for agent-era usage patterns.

The most common mistake we see: teams who run a codebase audit, score well, and assume their agent story is done. The codebase audit was real, the score was real, the assumption is wrong. Their tool surface is still systematically losing agent traffic to competitors who happen to have better-described tools, and nothing in the codebase audit will tell them.

// finding 08

What's coming next?

Two predictions worth tracking.

First, the term "agent readiness" itself is going to become a search category. It isn't yet — we ran the data this week and it has effectively zero search volume despite Factory's launch. But $150M of marketing dollars are about to be spent making the term familiar, and when search demand materializes the question buyers will ask is "what KIND of agent readiness?" The answer is going to be both halves, every time. Plan accordingly.

Second, the next layer of the stack — multi-agent systems where agents call each other, agent marketplaces where multiple competing tools are available for the same task, runtime selection at the model layer — will make tool-surface readiness more important, not less. As the surface area of agent-to-agent interaction expands, the publishers who've made their tools easy for agents to find, choose, and use correctly will compound advantages. The publishers who haven't will quietly lose share to nobody in particular.

The good news: both halves are fixable, and the fixes are cheaper than most engineering teams expect. Codebase readiness is mostly config and documentation work. Tool-surface readiness is mostly description and error-message work. The hard part isn't the fix — it's knowing which fix to make, in which order, against what evidence.

That's what an audit is for.


tracingviolet runs free agent-readiness scans on your tool surface — paste an OpenAPI spec or MCP server URL, see what real agents do with your tools in under 30 seconds. Try it.

Want this run against your own tools?

We test your endpoints live across multiple models and deliver a report with specific, ship-ready fixes.

Book an audit