The AI Assurance Gap: Why your agentic AI may be flying blind

27 May

Written By Geoff Willshire, Chief Product Officer

Most agentic AI deployments either stall in testing purgatory or ship without the evidence to prove they're safe. In both cases, the cost is real customers, real revenue, and a reputation you can't easily repair.

The whole industry is talking about agentic AI. The dollars are staggering, the vendor demos are impressive, and the boardroom pressure to deploy is at an all-time high. According to IDC’s Worldwide Artificial Intelligence IT Spending Market Forecast, global spending on AI (driven specifically by agentic applications) is forecast to reach $1.3 trillion by 2029. A share of that is flowing directly into customer experience: building AI agents to handle customer interactions autonomously, personalise service in real time, and resolve issues without human intervention.

But here is the question few are asking loudly enough: how many of them are actually working?

Most pilots never leave the building

McKinsey's State of AI 2025, drawing on nearly 2,000 organisations worldwide, found that while 62% of organisations are experimenting with AI agents, fewer than 25% are scaling them - and only 10% have fully deployed AI agents in a single business function.

Two thirds of organisations are stuck in what McKinsey calls "pilot purgatory" - running proofs of concept that demonstrate results in controlled conditions but never graduate to production.

That is not just a capability problem. The models often work. The problem is what happens between a successful demo and a system you would actually trust with real customers.

To understand why, it helps to think about what agentic AI is, and why it is fundamentally different from what came before it.

A traditional IVR works like a fixed thermostat. It turns on, routes calls along predictable paths and turns off. Testing it is straightforward: script every path; verify every output. The system is deterministic. Pass or fail.

Generative AI changed the nature of the problem. An LLM does not follow a script. It produces variable responses based on context, phrasing, conversational history and “temperature” (randomness). Testing it means validating behaviour across thousands of possible inputs. It is more like calibrating a smart thermostat that learns your preferences than inspecting a fixed appliance.
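To make that shift concrete, here is a minimal sketch of validating behaviour rather than matching a script. Everything in it is a hypothetical stand-in: `fake_agent`, the refund scenario and the three checks are invented for illustration and are not any vendor's API. The same input is sampled many times, and each reply is judged against properties that must always hold instead of against one expected string.

```python
import random

# Hypothetical stand-in for a generative agent: same input, variable output.
def fake_agent(prompt: str, rng: random.Random) -> str:
    openers = ["Sure, I can help.", "Happy to help with that.", "Of course."]
    bodies = [
        "Your refund of $40 will arrive in 3-5 business days.",
        "I've issued the $40 refund; allow 3-5 business days.",
    ]
    return f"{rng.choice(openers)} {rng.choice(bodies)}"

# Properties every acceptable reply must satisfy (assumed for illustration).
def mentions_amount(text: str) -> bool:
    return "$40" in text

def gives_timeframe(text: str) -> bool:
    return "3-5 business days" in text

def no_forbidden_promise(text: str) -> bool:
    return "guarantee" not in text.lower()

CHECKS = [mentions_amount, gives_timeframe, no_forbidden_promise]

def validate(prompt: str, samples: int = 200, seed: int = 0) -> dict:
    """Sample the agent repeatedly and collect any property violations."""
    rng = random.Random(seed)
    distinct, failures = set(), []
    for _ in range(samples):
        reply = fake_agent(prompt, rng)
        distinct.add(reply)
        for check in CHECKS:
            if not check(reply):
                failures.append((check.__name__, reply))
    return {"distinct_replies": len(distinct), "failures": failures}

result = validate("Where is my refund?")
print(result["distinct_replies"], "distinct replies,",
      len(result["failures"]), "property failures")
```

The wording varies from run to run, yet the test still has a clear pass or fail: no reply may violate a property. In a real deployment the properties would be far richer, and the sampling would cover many inputs rather than one.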

Agentic AI is a different category entirely. An agentic system does not just respond to inputs - it pursues goals, takes autonomous actions, coordinates across tools and systems and adapts its behaviour based on what it discovers along the way. Testing it is like assuring an autonomous smart-building climate system: you need to prove it can intelligently adapt when a thousand people flood the lobby, the outside temperature spikes by ten degrees, and a backend cloud server lags, even if all three happen simultaneously. Not in a controlled test, but in the real world under conditions that no script anticipated.

Even then, passing that test is only half the challenge. A smart building does not get inspected once and then left to run unattended. It is monitored continuously, because the environment never stops changing. The same is true of agentic AI in production. A system that behaved correctly at launch can drift as customer behaviour evolves, backend systems change, and the AI itself adapts from new interactions. Without continuous monitoring, you are not assuring a live system; you are trusting a memory of how it performed on the day it was deployed.

Testing proves it works. Monitoring means you find out if it stops, preferably before your customers do and before your brand pays the price.

Many organisations do not have the tools to do that. So, the pilot sits in staging. Or it gets quietly shelved. Or it goes live, and nobody really knows what it is doing.

The AI assurance gap

The model is not why most agentic AI pilots fail to reach production. The thinking part works. What doesn't work is everything around it: the integration layer, data quality, the governance framework and, critically, the ability to validate that the system behaves correctly, safely, and reliably when real customers use it. Research consistently identifies missing monitoring infrastructure and weak validation frameworks as leading causes of the gap between pilot and production.

This is the AI Assurance Gap: the disconnect between the velocity of AI deployment and the capability of existing tools to validate it.

Legacy testing platforms were built for a deterministic world: scripted inputs, expected outputs, pass/fail results. They work for what they were designed for. But agentic AI is not deterministic. It does not follow a script. Every conversation branches differently. Every autonomous decision creates a new path that no pre-written test case anticipated.

According to MIT’s GenAI Divide, the gap shows up in three ways:

Non-deterministic paths - An agentic AI can generate thousands of unique responses to the same customer query, depending on tone, history, intent and context. There is no finite list of paths to script against. Traditional testing tools, built to verify that input A produces output B, have no framework for validating a system where the same input reliably produces different outputs, and where "different" is not a bug but a feature.

Multi-vendor handoffs - Many enterprise CX environments do not run on a single platform. They span NICE, Genesys, Amazon Connect, custom LLM layers, and legacy IVR infrastructure, with customer interactions passing between them in real time. Testing each platform in isolation leaves the handoffs unmonitored. The handoffs are where cascade failures occur: a digital channel fails; a customer escalates to voice; the IVR cannot retrieve the context and the customer is routed in a loop. Each component passed its individual test, but the system failed the customer.

Production drift - AI systems do not behave the same way in production as they did in staging. Models drift. Customer behaviour is unpredictable. Backend systems introduce latency that staging environments never replicated. A system that was validated at deployment can develop failure modes weeks later that no pre-launch test would have caught. And if the monitoring platform itself introduces delay - surfacing alerts minutes or hours after the fact rather than in real time - the window between failure and customer impact widens further.
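The production drift case above can be pictured with a small rolling-window monitor. This is a sketch under stated assumptions, not a description of any product: `DriftMonitor`, the baseline failure rate, the window size and the tolerance are all hypothetical, and the outcome stream is synthetic.

```python
from collections import deque

class DriftMonitor:
    """Flags when the recent failure rate exceeds the baseline by a margin."""

    def __init__(self, baseline_rate: float, window: int = 100,
                 tolerance: float = 0.05):
        self.baseline_rate = baseline_rate   # rate measured at validation time
        self.tolerance = tolerance           # allowed margin above baseline
        self.recent = deque(maxlen=window)   # most recent outcomes only

    def record(self, failed: bool) -> bool:
        """Record one interaction outcome; return True if drift is detected."""
        self.recent.append(failed)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data for a stable estimate yet
        rate = sum(self.recent) / len(self.recent)
        return rate > self.baseline_rate + self.tolerance

# Synthetic stream: 300 interactions at ~2% failures, then 300 at ~20%.
outcomes = [i % 50 == 0 for i in range(300)] + [i % 5 == 0 for i in range(300)]

monitor = DriftMonitor(baseline_rate=0.02)
first_alert = next(
    (i for i, failed in enumerate(outcomes) if monitor.record(failed)), None
)
print("first drift alert at interaction index:", first_alert)
```

The monitor stays quiet while behaviour matches what was validated and raises an alert only after the failure rate shifts. Because it evaluates every interaction as it arrives, the delay between the shift and the alert is bounded by the window size rather than by a reporting schedule.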

Traditional testing partially catches the first problem and misses the others altogether. That is why pilots stall. The validation gap is not a technical obstacle that more rigorous scripting will overcome. It is a structural mismatch between the tools available and the systems being deployed.

The economics of legacy testing make this worse, not better. Most traditional CX testing platforms were priced for a deterministic world, charging per port, per test session, or per API call. In an agentic AI environment, where the number of possible interaction paths is effectively unlimited, that pricing model creates a direct financial incentive to test less. Organisations cap their testing coverage not because they want to, but because comprehensive coverage becomes prohibitively expensive at scale. The result is that the most complex, highest-risk deployments, the ones that need the most rigorous assurance, are precisely the ones that get the least.

Which brings us to the uncomfortable truth: enterprises are spending millions designing, building, and deploying agentic AI systems to improve the customer experience. The talent investment is real. The platform investment is real. The opportunity cost of the engineering time is real. And every dollar of that investment is exposed the moment the system goes live without adequate testing and monitoring in place. A poorly validated agentic AI does not just underperform; it actively damages the customer relationship it was built to improve.

The commercial stakes of getting this wrong are higher than many organisations realise. Consumers do not extend the same grace to AI that they extend to human agents. According to Glance's 2026 CX Trends Report, 75% of consumers were left frustrated by AI customer service in 2025, reporting loops, dead ends, and declining trust. Nearly 90% report reduced loyalty when human support is removed. Forrester, in their 2026 B2C predictions, forecast that a third of companies will actively harm brand trust through premature AI deployment.

The investment in building the system is wasted if the system cannot be trusted. Testing and monitoring are not an overhead on top of the AI investment. They are what makes the AI investment safe.

The systems that do go live may be flying blind

For the minority of organisations that are currently pushing agentic AI into production, a new problem emerges — one documented in the just-released TDWI Agentic AI Readiness Benchmark (May 2026), co-sponsored by Snowflake.

The benchmark found that only 21% of organisations have real-time monitoring in place across all their AI systems. A further 29% are in the process of planning it. The remaining 50% have not started planning it at all. This is consistent with Gartner's finding in their Market Guide for AI Evaluation and Observability Platforms, that only 18% of organisations currently use AI evaluation tools, a figure they expect to reach 60% by 2028.

Which means that (conservatively) half of all enterprises deploying live agentic AI today have no visibility into whether it is working correctly once it goes live.

This is not a minor operational gap. It is a significant brand issue, and it is the same underlying problem as the pilot failure rate, just manifesting at a later stage.

Think of it this way. Traditional monitoring tools measure aggregate metrics: average handle time, call volume, abandonment rate, CSAT scores. When a failure affects a specific routing path, a particular language group, or a specific combination of customer intent and AI state, aggregate metrics often do not move. The failure is real and total for every customer it affects, but invisible at the system level until the damage is done.
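A toy calculation shows how this can happen. The interaction log below is entirely made up for illustration (the segment names and counts are hypothetical); it simply compares one aggregate failure rate with the same data broken out by segment.

```python
from collections import defaultdict

# Hypothetical interaction log: (segment, resolved) pairs.
interactions = (
    [("english/billing", True)] * 940
    + [("english/billing", False)] * 30
    + [("spanish/refund", False)] * 30
)

def failure_rates(rows):
    """Failure rate per segment, computed from individual interactions."""
    totals, failures = defaultdict(int), defaultdict(int)
    for segment, resolved in rows:
        totals[segment] += 1
        if not resolved:
            failures[segment] += 1
    return {seg: failures[seg] / totals[seg] for seg in totals}

overall = sum(1 for _, ok in interactions if not ok) / len(interactions)
by_segment = failure_rates(interactions)

print(f"overall failure rate: {overall:.1%}")
for segment, rate in by_segment.items():
    print(f"{segment}: {rate:.1%}")
```

In this made-up log the overall failure rate is 6%, a figure that could pass for normal variation, while every single customer in the smaller segment fails. Only the per-segment view exposes that.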

Agentic AI makes this problem dramatically worse, for the same reason it makes scripted testing inadequate. A system that makes autonomous, non-deterministic decisions can develop failure modes that look like normal variation until they have affected tens of thousands of customers. Without continuous monitoring at the interaction level, not the aggregate level, you are flying blind.

Gartner predicts that 75% of regulated organisations will be exposed to fines exceeding 5% of global revenue through 2027 - specifically because of compliance processes that rely on periodic snapshots rather than continuous, real-time monitoring of AI behaviour. Whether that means manual review or scheduled automated tests, the risk is the same.

The regulations creating that exposure are already arriving. The EU AI Act's general provisions apply from August 2026, with high-risk AI obligations following in December 2027 — covering financial services, healthcare, and any AI system making consequential decisions about individuals, including multinationals serving European customers. California's CCPA ADMT rules are already in force, with full compliance required by January 2027 for businesses using AI in significant consumer decisions. And in the UK, the FCA has made clear that Consumer Duty, already in force, requires demonstrable evidence of good AI-driven outcomes across every customer journey it touches; no AI-specific exemption exists. None of these frameworks are satisfied by a periodic test cycle or a manual audit conducted quarterly.

But for CX leaders, the more immediate risk may be the one that arrives before any regulator. Forrester predicts a third of companies will actively harm brand trust through premature AI deployment. Deploying before it's ready is one version of that risk. Deploying without monitoring what it does once it's live is another. Regulation punishes you later. Customers punish you immediately.

The same problem, twice

The AI Assurance Gap, the inability to prove that an AI system behaves correctly, safely, and reliably, is what can kill pilots before launch. And it is what can leave the survivors unprotected once they go live.

Closing it requires three things that legacy testing platforms were not designed to provide:

Continuous, not point-in-time, validation - AI systems change in production. Assurance must be an ongoing process, monitoring every interaction and detecting drift in real time, not a gate that is passed at deployment and revisited quarterly.

Non-deterministic path testing - Instead of scripting expected paths, effective assurance dynamically maps and validates behaviour as it emerges, without requiring engineering teams to maintain brittle test libraries that break every time the system learns something new.

Vendor-independent coverage - A single assurance layer that covers the entire CX environment, across all platforms, all handoffs, all channels, not one that leaves blind spots wherever two vendors meet.
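As an illustration of what checking across vendor boundaries might involve, the sketch below walks one customer journey hop by hop and reports any handoff that drops context the previous platform held. The `Hop` structure, the platform labels and the context keys are assumptions made for this example, not a real integration.

```python
from dataclasses import dataclass, field

@dataclass
class Hop:
    platform: str                      # hypothetical label, e.g. "chat" or "ivr"
    context: dict = field(default_factory=dict)

def lost_context(journey: list[Hop], required: set[str]):
    """Return (from_platform, to_platform, missing_keys) for every handoff
    where required context held by one hop is absent from the next."""
    problems = []
    for prev, nxt in zip(journey, journey[1:]):
        carried = required.intersection(prev.context)
        missing = carried - set(nxt.context)
        if missing:
            problems.append((prev.platform, nxt.platform, missing))
    return problems

# One hypothetical journey: chat fails, the customer escalates to voice.
journey = [
    Hop("chat", {"customer_id": "C1", "intent": "refund"}),
    Hop("ivr", {"customer_id": "C1"}),           # intent lost at this handoff
    Hop("live_agent", {"customer_id": "C1"}),
]

problems = lost_context(journey, required={"customer_id", "intent"})
for src, dst, missing in problems:
    print(f"{src} -> {dst} dropped: {sorted(missing)}")
```

Each platform in this journey could pass its own isolated test, since each one handles the context it receives. The failure only becomes visible when the journey is examined end to end.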

The organisations that build this infrastructure are the ones whose agentic AI pilots reach production. And they are the ones whose production systems stay safe.

A Final Thought

The TDWI benchmark finding that only 21% of organisations have real-time monitoring across all their AI systems is striking. But what strikes me more is the other number: that 50% have not started planning for it at all.

These are not organisations that are unaware of the problem. They are organisations that are deploying agentic AI right now, in live customer environments, without a safety net.

Think of it like the canary in a coal mine. The canary's job was not to fix the gas leak. It was to provide continuous, real-time warning before conditions became dangerous for everyone else. PumpCX's solution is built on the same principle: not to replace the AI system, but to monitor it continuously, detecting the failures that aggregate metrics miss, in the moment they occur, before real customers are affected.

The gap between "AI in the demo" and "AI you can trust" is the AI Assurance Gap. The research is telling us it is wider than most organisations realise.

At PumpCX, closing the AI Assurance Gap in what Gartner calls the emerging technology category of AI Evaluation and Observability Platforms is what we are focussed on. We're always keen to learn more about how organisations are experiencing these challenges in practice, and what it would take to genuinely solve them.

I'll be at Snowflake Summit

I'll be moving around the US to talk about the current issues in CX, including agentic AI, as well as attending the upcoming Snowflake Summit user conference in San Francisco, NiCE World in Orlando and CCW in Las Vegas.

I want to talk to the CX leaders and CX executives who are looking to not just deploy CX faster, but deploy with confidence.

If you're navigating the realities of validating and monitoring CX under real-world conditions, come and find me at Moscone or reach out to me in advance on LinkedIn.

Geoff Willshire, Chief Product Officer