Your AI agents act in production. The audit trail does not.
Agents now ship code, file tickets, send email, move money. The systems they touch keep their usual logs — but the agent's reasoning, the tools it considered, and the version of the prompt it was running rarely make it anywhere durable. That gap is where the next class of incidents will be diagnosed, after the fact, with no data.
When an engineer pushed a deploy in 2018, the company ended up with five durable artifacts: the git commit, the CI run, the deploy log, the access log, and a Slack message. Reconstructing what happened was tedious, never impossible.
When an AI agent does the same work in 2026, most organizations end up with one artifact — the change in the target system — and almost nothing else. The agent's reasoning, the tools it considered, the inputs it read, the version of its system prompt: typically not stored, often not even logged.
That gap is where the next class of AI incidents will be investigated. After the fact. With no data.
What "agent in production" actually means now
Two years ago, "AI in production" mostly meant a chat surface generating text. Today it means agents doing a real share of routine internal work:
- Triaging tickets and assigning owners
- Writing PR descriptions and pushing low-risk patches
- Drafting customer responses
- Pulling data, summarizing it, posting to internal channels
- Configuring infrastructure, rotating credentials, opening firewall rules
Each of these is an action in a system of record. Tickets, PRs, customer accounts, infra, identity. The systems themselves keep their usual audit trails: Jira's audit log, GitHub's PR history, CloudTrail for your AWS IAM activity. None of those record why the agent did what it did, what context it had, or what version of itself was running at the time.
The actual surface
If an agent caused an incident last Tuesday, here are the questions you would want to answer. For each, ask whether your organization can answer it today.
Which agent took this action? Most teams don't tag agent-originated changes distinctly. The PR was opened by bot-deploy; the ticket was closed by an integration user; the email came from a shared mailbox. The agent identity collapses into a generic service principal, indistinguishable from a human or from a different agent.
What model and version was running? Provider model strings drift. claude-sonnet-4-5 and claude-sonnet-4-6 are different models with different behaviors, and most agent frameworks pin loosely, often to a family name or alias rather than an exact version. By the time anyone investigates, the version that produced the bad output may already be retired.
What system prompt was active? System prompts in production are configuration. They change. A team adjusts a guardrail on Friday; the bad action happens on Monday. There is rarely a versioned record of the exact prompt that was loaded for that specific run.
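One inexpensive way to get a versioned record is to fingerprint the prompt text at load time and store the digest with each run. The sketch below is illustrative only: the function name `prompt_fingerprint`, the normalization step, and the sample prompts are assumptions, not part of any particular framework.

```python
import hashlib
import unicodedata


def prompt_fingerprint(prompt_text: str) -> str:
    """Return a stable SHA-256 hex digest for a system prompt.

    The text is normalized (Unicode NFC, Unix newlines) so the same prompt
    hashes identically across platforms. Normalizing is an assumption;
    hash the raw bytes instead if byte-exactness matters more.
    """
    normalized = unicodedata.normalize("NFC", prompt_text).replace("\r\n", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical prompts: a guardrail is softened between Friday and Monday.
friday = prompt_fingerprint("You are a deploy agent. Never touch prod databases.")
monday = prompt_fingerprint("You are a deploy agent. Avoid touching prod databases.")

print(friday == monday)  # False: the edit shows up as a different digest
```

Storing the digest per run, and the full text once per digest, keeps each trace small while still letting an investigator recover the exact prompt.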
What inputs did it read? Resources, RAG queries, the ten-page document a user attached, the contents of a webhook. Models reason over inputs that may not be retained anywhere after the call returns.
What tools did it consider but not call? This is the question that catches near-misses. A model that almost called delete_database and then chose archive_record is one prompt-injection away from the bad outcome. Without traces of considered-but-rejected tool calls, you cannot even count near-misses.
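Counting near-misses becomes a simple query once proposed tool calls are recorded alongside executed ones. The sketch below assumes your framework exposes a tool call before it runs (for example, at a policy or approval layer). The `ToolDecision` schema and the `DESTRUCTIVE_TOOLS` set are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDecision:
    """One tool the model proposed during a run (hypothetical schema)."""
    tool: str
    executed: bool  # False means proposed, then dropped or blocked


# Hypothetical set of tools the team treats as destructive.
DESTRUCTIVE_TOOLS = {"delete_database", "drop_table"}


def near_misses(decisions: list[ToolDecision]) -> list[ToolDecision]:
    """Return destructive tools that were proposed but never executed."""
    return [d for d in decisions if d.tool in DESTRUCTIVE_TOOLS and not d.executed]


# The scenario from the text: delete_database considered, archive_record chosen.
run = [
    ToolDecision("delete_database", executed=False),
    ToolDecision("archive_record", executed=True),
]

print(len(near_misses(run)))  # 1
```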
Who reviewed it, if anyone? In a "human in the loop" setup, who approved, when, and based on what context? Most organizations capture an approval flag and nothing else.
A useful benchmark for any AI ops team this quarter: pick one agent action from last week and try to assemble those six items into a single page. Time it.
What durable agent observability looks like
The shape of the answer is not new. SRE has solved this class of problem for distributed systems for a decade. Agent ops is roughly the same problem with different primitives.
A workable target state has four properties.
Per-action provenance. Every agent-originated action in a system of record carries a stable agent run ID. The system being acted on accepts and stores this ID alongside its own audit log. If your ticketing tool, your VCS, and your IAM cannot ingest a custom run ID today, that is the first plumbing job.
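For a version control system, one low-friction way to carry the run ID is a trailer line at the end of the commit message. The sketch below is a minimal illustration: the `Agent-Run-Id` trailer key and the `run-` prefix are made-up conventions, not a git or vendor standard.

```python
import uuid


def new_run_id() -> str:
    """Generate a unique ID for one agent run."""
    return f"run-{uuid.uuid4()}"


def commit_message_with_run_id(subject: str, body: str, run_id: str) -> str:
    """Append the run ID as a trailer so it travels with the commit."""
    # "Agent-Run-Id" is a hypothetical trailer key chosen for this example.
    return f"{subject}\n\n{body}\n\nAgent-Run-Id: {run_id}\n"


run_id = new_run_id()
print(commit_message_with_run_id(
    "Bump retry limit",
    "Raises max retries from 3 to 5.",
    run_id,
))
```

The same ID would then go into the ticket comment, the IAM session tag, or whatever field the target system offers, so one search key joins every system's audit log to the agent's own trace.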
A run-level trace. For each agent run, one durable record stores: agent identity, model and version, system prompt hash, list of tools available, list of tools called, list of tools considered but rejected, inputs by reference, outputs, approver if any, total token count.
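As a rough sketch, the record described above could be a single flat structure serialized to JSON. The field names, the `AgentRunTrace` class, and the sample values are assumptions chosen to mirror the list in the text, not an established schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AgentRunTrace:
    """One durable record per agent run (hypothetical schema)."""
    run_id: str
    agent_identity: str
    model: str                  # exact model string reported for the run
    system_prompt_sha256: str   # digest of the prompt that was loaded
    tools_available: list[str]
    tools_called: list[str]
    tools_rejected: list[str]   # considered but not executed
    input_refs: list[str]       # pointers to stored inputs, not the inputs themselves
    outputs: list[str]
    approver: Optional[str] = None
    total_tokens: int = 0

    def to_json(self) -> str:
        """Serialize with sorted keys so records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
trace = AgentRunTrace(
    run_id="run-0001",
    agent_identity="triage-agent",
    model="claude-sonnet-4-5",
    system_prompt_sha256="<sha256 hex digest>",
    tools_available=["archive_record", "delete_database"],
    tools_called=["archive_record"],
    tools_rejected=["delete_database"],
    input_refs=["s3://example-bucket/inputs/run-0001.json"],
    outputs=["Archived record 42"],
    total_tokens=1830,
)

print(trace.to_json())
```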
A retention policy. Agent traces are kept long enough to satisfy whoever asks the post-incident question. Six months is a starting point; regulated environments will land higher. Note that "long enough" is often longer than the model provider's own retention.
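A retention policy reduces to two checks: when may a trace be purged, and does the provider delete its copy before then. The sketch below uses the six-month starting point from the text. The 30-day provider figure is a made-up example, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Roughly six months, the starting point suggested above. Tune per policy.
TRACE_RETENTION = timedelta(days=183)


def purge_after(recorded_at: datetime, retention: timedelta = TRACE_RETENTION) -> datetime:
    """Earliest moment a trace recorded at `recorded_at` may be deleted."""
    return recorded_at + retention


def must_keep_own_copy(provider_retention: timedelta,
                       retention: timedelta = TRACE_RETENTION) -> bool:
    """True when the provider deletes data before our own policy allows."""
    return provider_retention < retention


recorded = datetime(2026, 1, 15, tzinfo=timezone.utc)
print(purge_after(recorded).date())            # 2026-07-17
print(must_keep_own_copy(timedelta(days=30)))  # True (30 days is an assumed figure)
```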
A diff loop. When the system prompt changes, when the available tool list changes, when the model version changes — those changes are events with their own retention. The agent's configuration has an audit trail too, not just its actions.
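One way to sketch the diff loop: snapshot the agent's configuration on every deploy, compare it with the previous snapshot, and emit one event per changed field. The snapshot keys and the event shape below are illustrative assumptions.

```python
from datetime import datetime, timezone


def config_change_events(old: dict, new: dict, changed_at: datetime) -> list[dict]:
    """Compare two config snapshots and return one event per changed field."""
    events = []
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            events.append({
                "field": key,
                "before": before,
                "after": after,
                "changed_at": changed_at.isoformat(),
            })
    return events


# Hypothetical snapshots: the model and the tool list change, the prompt does not.
old_config = {
    "model": "claude-sonnet-4-5",
    "system_prompt_sha256": "aaa",
    "tools": ["archive_record"],
}
new_config = {
    "model": "claude-sonnet-4-6",
    "system_prompt_sha256": "aaa",
    "tools": ["archive_record", "delete_database"],
}

events = config_change_events(
    old_config, new_config, datetime(2026, 3, 2, tzinfo=timezone.utc)
)
print([e["field"] for e in events])  # ['model', 'tools']
```

These events would be stored under the same retention rules as the run traces, so an investigator can line up "what changed" against "what went wrong."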
None of this is exotic engineering. The reason most organizations don't have it yet is that it has to be a deliberate decision, the way logging request IDs through every microservice was a deliberate decision a decade ago.
Why this is urgent now
The volume of agent-originated production work is past the point where ad-hoc spot checks suffice. A team running a single internal coding agent can easily produce hundreds of PRs per week. An agent triaging customer tickets resolves thousands of cases per day. The base rate of bad outcomes does not have to be high for the absolute number to matter.
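To make the base-rate point concrete, here is a back-of-the-envelope calculation. The volume and the 0.1% bad-outcome rate are assumed figures for illustration, not measurements.

```python
def expected_bad_outcomes(actions_per_day: int, bad_rate: float, days: int) -> float:
    """Expected number of bad outcomes over a period, assuming a constant rate."""
    return actions_per_day * bad_rate * days


# Assumed figures: 2,000 ticket resolutions per day, 0.1% go wrong, over 30 days.
monthly = expected_bad_outcomes(actions_per_day=2000, bad_rate=0.001, days=30)
print(f"{monthly:.0f} expected bad outcomes per month")  # 60 expected bad outcomes per month
```

Even at one bad outcome per thousand actions, that is roughly two cases a day that someone may need to explain.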
The other reason is regulatory. The first AI-specific incident postmortems landing in regulated industries — finance, healthcare, public sector — are starting to follow the same script as data-breach postmortems. "Show me the trace" is the second question after "what did the system do." Organizations that cannot produce a trace will spend the next twelve months building one under deadline pressure rather than ahead of it.
A first artifact
Before instrumenting anything, write down a one-page agent inventory. Each agent. Its identity in the systems it acts on. Its model and prompt source. Its tool list. Its owner.
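The inventory can start as a handful of structured records rather than a document. The sketch below is one possible shape: the field names, agent names, and prompt paths are hypothetical. It also flags agents that share a principal in the same system, which is the identity-collapse problem described earlier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentInventoryEntry:
    """One row of the agent inventory (hypothetical schema)."""
    name: str
    identities: dict[str, str]  # system of record -> principal the agent acts as
    model: str
    prompt_source: str          # where the system prompt lives
    tools: list[str]
    owner: str


def shared_principals(inventory: list[AgentInventoryEntry]) -> dict[tuple[str, str], list[str]]:
    """Return (system, principal) pairs that more than one agent acts as."""
    users: dict[tuple[str, str], set[str]] = defaultdict(set)
    for entry in inventory:
        for system, principal in entry.identities.items():
            users[(system, principal)].add(entry.name)
    return {key: sorted(names) for key, names in users.items() if len(names) > 1}


# Illustrative entries only.
inventory = [
    AgentInventoryEntry("triage-agent", {"jira": "integration-user", "github": "bot-deploy"},
                        "claude-sonnet-4-5", "prompts/triage.md", ["assign_owner"], "support-eng"),
    AgentInventoryEntry("patch-agent", {"github": "bot-deploy"},
                        "claude-sonnet-4-6", "prompts/patch.md", ["open_pr"], "platform-eng"),
]

print(shared_principals(inventory))
# {('github', 'bot-deploy'): ['patch-agent', 'triage-agent']}
```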
Then pick the single agent that touches the most consequential system of record — usually the one with write access to customer data or to production infrastructure — and build the run-level trace for that one agent first. Resist the urge to do it across every agent at once. The shape of a good trace is easier to find on one example than on twelve.
This is unglamorous work. It is also exactly the work that turns a post-incident "we don't know" into a post-incident "here is what happened, here is the fix, here is the regression test." That difference is the entire value of an audit trail.
The tools your agents connect to are one supply chain. The actions they take are another. Both need an inventory before they need a control plane.
From the operator
Basenull AI Ops ships purpose-built tools for the IT executive whose org is already running AI in production. Governance, supply-chain security, agent ops, observability — the operational layer that usually arrives after the first incident.