
March 3, 2026
TL;DR
Piloting agentic AI doesn't require an organization-wide initiative, a dedicated AI team, or a willingness to accept production risk.
It requires a disciplined 90-day approach: select a single high-friction process, define the guardrails before you build anything, deploy in a sandbox where failures don't touch production, and expand only after the data proves the system works.
This article, part of the larger series "How Director-Level Leaders Should Approach Agentic AI in Operations," walks through each phase in detail, including what to measure, who to involve, and what to avoid. The goal isn't to adopt AI.
The goal is to generate real performance data from your specific operation so you can make an informed decision about what comes next.
You've read the thought leadership. You've evaluated the landscape. You understand what agentic AI is, where it fits, where it breaks, and what governance it requires.
Now you're facing the question that actually matters: how do you move from understanding to action without putting your operation at risk?
This is where most mid-market companies stall. The technology is compelling, the potential is clear, but the path from "interesting concept" to "running in our environment" feels undefined.
Enterprise companies solve this by hiring a team of 15 and running a six-month initiative.
Startups solve it by deploying fast and fixing what breaks.
Neither of those approaches works when you're responsible for a 50 to 250-person operation that needs to keep running while you innovate.
What follows is a 90-day pilot framework designed for that reality. It's structured around four phases, each with clear objectives, defined outputs, and decision gates that let you stop, adjust, or expand based on real data rather than assumptions.
The approach is deliberately conservative. Not because the technology requires it, but because your operation deserves it.
The first two weeks aren't about technology at all. They're about choosing the right target and understanding it deeply enough to define what success looks like.
The ideal pilot candidate isn't your most important process. It's also not your simplest. It's a process that meets four criteria simultaneously.
First, it's exception-heavy. The standard case is straightforward, but a meaningful percentage (20% or more) of instances require judgment, research, or manual intervention that doesn't follow a fixed script.
These exceptions are where agentic AI creates value, because deterministic automation can't handle them, and humans are spending disproportionate time on them.
Second, it's contained. The process has clear boundaries: a defined start, a defined end, and limited dependencies on other processes that might be affected if something goes wrong.
You want a pilot where a failure is visible, recoverable, and isolated. Not a process where a mistake cascades through three departments before anyone notices.
Third, it has measurable outputs. You can quantify the current performance in terms that matter: processing time, error rate, throughput, cost per unit of work, or cycle time. If you can't measure the current state, you can't prove the pilot worked.
Fourth, you have a willing team. The people currently executing this process need to be involved in the pilot, both as subject matter experts during setup and as supervisors during deployment.
If the team sees this as a threat rather than a tool, the pilot will fail regardless of how well the technology performs. Choose a process owned by a team that is frustrated with the current approach and open to a better one.
Before you build anything, document the current state with precision.
This serves two purposes: it establishes the baseline you'll measure improvement against, and it reveals the specific pain points the agent needs to address.
Map the process end-to-end. For every step, capture: who does it, how long it takes on average, what data or systems they reference, what decisions they make, and what happens when they encounter an exception.
Pay particular attention to the exceptions. Categorize them. How often does each type occur? How long does each take to resolve? What information does the person need to resolve it? Where does that information come from?
This mapping exercise almost always reveals inefficiencies that have nothing to do with AI. You'll find steps that exist only because of a system limitation that was resolved years ago, but that nobody ever removed from the process.
You'll find handoffs that add latency without adding value. You'll find data that gets manually transferred between systems when an integration could handle it. Document all of it. Some of these can be fixed immediately, simplifying the process before you automate it.
Remember: agentic AI multiplies whatever you point it at, including unnecessary complexity.
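To make the exception mapping concrete, here is a minimal sketch of the tally described above. The record fields (`type`, `resolution_minutes`) are assumptions for illustration, not a required schema:

```python
from collections import defaultdict

def exception_profile(records):
    """Summarize mapped exceptions: how often each type occurs and how long
    it takes to resolve. Record shape is illustrative, not prescribed."""
    counts = defaultdict(int)
    minutes = defaultdict(float)
    for r in records:
        counts[r["type"]] += 1
        minutes[r["type"]] += r["resolution_minutes"]
    return {t: {"count": counts[t], "avg_minutes": minutes[t] / counts[t]}
            for t in counts}
```

A profile like this tells you which exception categories consume the most human time, which is exactly where an agent would need to prove its value.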
Write down, specifically, what the pilot needs to achieve to be considered successful. These aren't aspirations. They are pass/fail criteria that you'll evaluate at the end of 90 days.
Good success criteria follow a pattern: "Reduce [metric] from [current baseline] to [target], while maintaining [quality threshold]." Examples that we have seen work well for mid-market pilots:
"Reduce average invoice processing time from 22 minutes to under 5 minutes, with a discrepancy detection rate equal to or better than the current manual process."
"Process 80% of standard vendor inquiries without human intervention, with a customer satisfaction score within 5% of the current baseline."
"Deliver the weekly operational summary by 7:00 AM Monday with zero manual data gathering, with an anomaly detection accuracy rate above 90% as validated by the operations team."
Notice what these criteria include: a specific metric, a quantified target, and a quality constraint. The quality constraint is essential. A system that is fast but inaccurate, or that handles volume but misses exceptions, hasn't succeeded.
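One way to keep the quality constraint from being forgotten is to encode each criterion as data and evaluate both halves together. The metric names, baseline, target, and quality floor below are hypothetical, assuming the invoice example above:

```python
# Hypothetical criterion following the pattern:
# "Reduce [metric] from [baseline] to [target], while maintaining [quality threshold]."
CRITERION = {
    "metric": "avg_invoice_minutes", "baseline": 22.0, "target": 5.0,
    "quality_metric": "discrepancy_detection_rate", "quality_floor": 0.97,
}

def meets(criterion, observed):
    """Pass only if the target AND the quality constraint both hold.
    Fast but inaccurate is a fail."""
    return (observed[criterion["metric"]] < criterion["target"]
            and observed[criterion["quality_metric"]] >= criterion["quality_floor"])
```

A system that hits 3 minutes per invoice but misses discrepancies fails this check, which is the point.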
Also, define your failure criteria.
What would cause you to pause or terminate the pilot before 90 days? An error rate above a defined threshold. A system outage affecting the pilot process. A data quality issue that can't be resolved within the pilot timeline.
Defining these upfront removes the ambiguity that leads to pilots running indefinitely in a zombie state: not successful enough to expand but not failed enough to kill.
With the process selected and the success criteria defined, the second phase is about building the control framework before you build the agent.
This order is intentional.
The guardrails aren't constraints on the agent. They're the architecture the agent operates within.
Identify every data source the agent will need to access to perform its function. For each source, define exactly which tables, fields, or endpoints the agent can read. Then define what it cannot access, even within the same system.
This isn't a general data access review. It's specific to this pilot. If the agent is processing invoices, it needs access to the AP module, PO records, and receiving logs in your ERP. It does not need access to payroll, HR records, or general ledger entries.
Even if those tables are in the same database, the agent's access should be scoped to what it needs for its defined function.
Document the data scope in a format that both technical and business stakeholders can review.
A simple table works: data source, specific tables or fields, access type (read-only or read/write), and justification. This document becomes part of your governance record and will be reviewed if you expand the pilot.
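That table can also live as a machine-readable record, which makes review and later expansion audits easier. The system and object names below are illustrative for the invoice example, not a prescribed schema:

```python
# Hypothetical data-scope record for an invoice-processing pilot.
DATA_SCOPE = [
    {"source": "ERP", "object": "ap_invoices", "access": "read/write",
     "justification": "Core records the agent reconciles and updates"},
    {"source": "ERP", "object": "purchase_orders", "access": "read-only",
     "justification": "Needed to match invoices against POs"},
    {"source": "ERP", "object": "receiving_logs", "access": "read-only",
     "justification": "Confirms goods were received before approval"},
]

def writable_objects(scope):
    """The objects the agent may modify -- the short list reviewers
    should scrutinize hardest."""
    return [row["object"] for row in scope if row["access"] == "read/write"]
```

Note how short the writable list is relative to the readable one; that asymmetry is deliberate during a pilot.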
For data the agent will write to (updating records, creating entries, sending communications), define the boundaries even more tightly.
What fields can it modify? What values are within its authority to set? What states can it transition a record to?
If the agent can update an invoice status from "pending review" to "approved," can it also update it to "rejected"? Can it modify the dollar amount? Define every boundary explicitly.
Beyond data, define what systems and actions the agent can interact with. Can it send emails? To whom? Using what templates or within what parameters? Can it create records in your ERP? Modify existing ones? Delete anything?
The answer to that last question should almost certainly be no during a pilot.
Create a permissions matrix that maps every action the agent might take to an authorization level: autonomous (the agent can do this without approval), supervised (the agent can recommend this, but a human must approve), or prohibited (the agent cannot do this under any circumstance).
During a pilot, the "autonomous" column should be short. Most actions should be in the "supervised" category, with a clear path to graduating specific actions to "autonomous" in Phase 4 based on demonstrated performance.
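A minimal sketch of such a matrix, with illustrative action names, might look like this. The key design choice is that unknown actions default to prohibited, not supervised:

```python
from enum import Enum

class Authorization(Enum):
    AUTONOMOUS = "autonomous"   # agent acts without approval
    SUPERVISED = "supervised"   # agent recommends, a human approves
    PROHIBITED = "prohibited"   # never allowed

# Illustrative pilot-phase matrix: the autonomous column stays short.
PERMISSIONS = {
    "read_invoice":      Authorization.AUTONOMOUS,
    "flag_discrepancy":  Authorization.AUTONOMOUS,
    "update_status":     Authorization.SUPERVISED,
    "send_vendor_email": Authorization.SUPERVISED,
    "delete_record":     Authorization.PROHIBITED,
}

def authorize(action):
    # Anything not explicitly listed is prohibited by default.
    return PERMISSIONS.get(action, Authorization.PROHIBITED)
```

Graduating an action to autonomous in Phase 4 then becomes a one-line, reviewable change to the matrix rather than an undocumented behavior shift.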
Define the conditions under which the agent stops working autonomously and routes to a human. These should be specific, testable conditions, not vague guidelines.
Confidence thresholds: if the agent's confidence in its decision falls below a defined level, it escalates.
This requires the agent to be built with a confidence scoring mechanism, which should be a non-negotiable requirement for any agentic deployment.
Impact thresholds: if the financial value, customer impact, or operational consequence of the agent's action exceeds a defined level, it escalates regardless of confidence.
Novelty thresholds: if the agent encounters an input pattern, request type, or exception it hasn't seen before (or has seen fewer than a defined number of times), it escalates rather than attempting to reason from first principles.
For each escalation condition, define: who receives the escalation (a specific person or role, not "the team"), what information accompanies the escalation (the agent should provide full context, not just "exception encountered"), and what the expected response time is.
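The three threshold types and the routing rule can be sketched as a single check. The threshold values, field names, and role names here are assumptions to be tuned per process, not recommendations:

```python
# Routing is illustrative: each reason maps to a specific role, not "the team".
ROUTES = {"low_confidence": "process_owner",
          "high_impact": "operations_director",
          "novel_pattern": "process_owner"}

def escalation_reason(decision, *, confidence_floor=0.85,
                      impact_ceiling=10_000, novelty_floor=5):
    """Return the reason a decision must route to a human, or None
    if the agent may proceed. Thresholds are hypothetical defaults."""
    if decision["impact_usd"] > impact_ceiling:
        return "high_impact"          # escalates regardless of confidence
    if decision["confidence"] < confidence_floor:
        return "low_confidence"
    if decision["times_seen"] < novelty_floor:
        return "novel_pattern"
    return None
```

Checking impact first enforces the rule that high-impact actions escalate even when the agent is confident.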
Define what gets logged before the agent processes its first input. At minimum:
Every input the agent receives: the raw data, the source, and the timestamp.
Every decision the agent makes: what it decided, what alternatives it considered, what data it referenced, and its confidence score.
Every action the agent takes: what was done, to which system, affecting which records, and the result.
Every escalation: the triggering condition, the data that caused it, who it was routed to, and the resolution.
Logs should be structured (not free text), timestamped, and stored in a queryable system.
You'll use these logs for performance evaluation, incident investigation, and the expansion decision at the end of the pilot.
If the logs are incomplete or unstructured, your ability to make an informed decision at day 90 is compromised.
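As a minimal sketch of what "structured, timestamped, queryable" means in practice, each event can be a typed record serialized to JSON rather than free text. The field names are illustrative:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AgentLogEntry:
    """One structured record per event: input, decision, action, or escalation."""
    event: str                 # e.g. "input" | "decision" | "action" | "escalation"
    payload: dict = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self):
        # Structured JSON, not free text, so the log stays queryable.
        return json.dumps(asdict(self))
```

Every entry carrying the same shape is what makes day-90 questions like "show me every escalation by triggering condition" a query instead of a forensic exercise.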
The agent is built. The guardrails are in place. Now it enters the real world, carefully.
Before the agent touches any production system, run it in a sandbox environment using real data (or a sanitized copy of real data) but with no connection to live systems.
The agent processes inputs and produces outputs, but those outputs go nowhere.
They are captured, reviewed, and compared against what your team would have done with the same inputs.
This shadow period serves two critical functions.
First, it validates accuracy. You are comparing the agent's outputs against known-good human decisions across a representative sample of inputs, including edge cases and exceptions.
If the agent's accuracy is below your success threshold during shadow testing, it's not ready for production. Go back, diagnose the failures, adjust, and re-test.
There is no shortcut through this step.
Second, it builds team confidence. The people who will supervise the agent in production need to see it work before they trust it.
Shadow testing lets them observe the agent's reasoning, identify its strengths and blind spots, and develop an intuition for when the agent is operating well versus when it is struggling.
This intuition is essential for effective supervision in the next phase.
Track shadow period performance rigorously.
Accuracy rate by input type. Error rate by exception category. Processing time. Escalation frequency. Compare each metric against your defined success criteria.
If the agent meets the thresholds, move to supervised production. If it does not, iterate.
A pilot that spends an extra two weeks in shadow testing and launches correctly is more valuable than one that launches on schedule and fails in production.
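Shadow-period scoring can be as simple as comparing agent outputs to the known-good human decisions, broken out by input type so exception categories aren't hidden inside an overall average. The tuple shape below is an assumption for illustration:

```python
from collections import defaultdict

def shadow_accuracy(cases):
    """Per-input-type agreement between agent output and known-good human
    decision. Each case is (input_type, agent_output, human_output)."""
    totals, matches = defaultdict(int), defaultdict(int)
    for input_type, agent_out, human_out in cases:
        totals[input_type] += 1
        matches[input_type] += int(agent_out == human_out)
    return {t: matches[t] / totals[t] for t in totals}
```

An agent that scores well on standard inputs but poorly on exceptions has not passed shadow testing, because the exceptions are where it was supposed to create value.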
The agent goes live against real production data, but with full human supervision. Every action the agent recommends is reviewed and approved by a team member before execution.
The agent is essentially doing the work and presenting it for approval, not acting independently.
Limit deployment to one department or one team. If you're piloting an invoice reconciliation agent, deploy it for one business unit's invoices, not the entire company's. If you're piloting a vendor communication agent, start with one category of vendors.
The scope should be large enough to generate statistically meaningful data but small enough that any issue is immediately visible and containable.
During this phase, the supervising team should log not just the agent's errors but also the decisions they would have handled differently. Not every divergence is an error.
Sometimes the agent's approach is valid but different from how a specific team member would have done it.
Other times, the agent is technically correct but missing context that a human would factor in. Understanding the difference is essential for tuning the system and for deciding where autonomy is appropriate.
Hold weekly reviews during this phase. Bring the supervising team together. Review the logs. Discuss the exceptions. Identify patterns in the agent's mistakes and successes. Adjust guardrails if needed.
This is the phase where the team's operational expertise refines the system's behavior.
At every point during Phase 3 and beyond, any authorized team member should be able to override the agent's recommendation or pause its operation.
The override should be one click, not a process. It should be logged (what was overridden, by whom, and what the human chose instead), but it should never be discouraged.
If your team feels hesitant to override the agent, something is wrong with the culture around the deployment. The agent is a tool.
Overriding it when human judgment says otherwise is exactly how the system is supposed to work. If the agent is being overridden frequently, that is valuable data that informs whether the guardrails need adjustment.
If it is never being overridden, that is either a sign the agent is performing well or a sign the team isn't actually reviewing its decisions. Investigate which one it is.
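A sketch of what one-click, fully logged overrides might look like, with illustrative field names:

```python
def log_override(log, *, agent_action, human_action, user):
    """Record an override in a single call -- deliberately frictionless."""
    log.append({"event": "override", "agent_action": agent_action,
                "human_action": human_action, "user": user})

def override_rate(log, total_decisions):
    """Fraction of decisions overridden. A high rate suggests guardrails
    need adjustment; a rate of exactly zero deserves investigation too."""
    overrides = sum(1 for e in log if e["event"] == "override")
    return overrides / total_decisions if total_decisions else 0.0
```

The rate itself is the diagnostic: it feeds the weekly reviews rather than triggering any automatic behavior.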
The agent has been running under supervision for a month. You have real performance data. Now you make evidence-based decisions about what happens next.
Pull the performance data from Phase 3 and compare it against the success criteria you defined in Phase 1. This is a binary evaluation for each criterion: met or not met.
If all criteria are met, you have a basis for expansion. If some are met and others are not, you have a basis for targeted improvement.
If most are not met, you have a basis for a hard conversation about whether this use case is viable with the current approach, or whether foundational work (data quality, process redesign, system integration) needs to happen first.
Be honest with this evaluation. The sunk cost of the pilot shouldn't influence the assessment. If the data says the agent isn't performing, the right answer is to pause and address the gaps, not to expand and hope it improves.
You defined the success criteria before the pilot started for exactly this reason: to have an objective standard that isn't influenced by enthusiasm or momentum.
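The three outcomes described above can be made mechanical once each criterion has a binary result. The "at least half" cutoff between targeted improvement and reassessment is a hypothetical choice, not a rule from this framework:

```python
def pilot_verdict(results):
    """Map per-criterion pass/fail results to the three Phase 4 outcomes.
    `results` is {criterion_name: bool}; the 50% cutoff is illustrative."""
    passed = sum(results.values())
    if passed == len(results):
        return "expand"
    if passed >= len(results) / 2:
        return "targeted_improvement"
    return "reassess_viability"
```

Computing the verdict from pre-registered criteria is what keeps sunk cost and enthusiasm out of the decision.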
Review every incident, error, and escalation from the pilot.
Categorize them: which were caused by agent reasoning errors, which were caused by data quality issues, which were caused by integration problems, and which were caused by scope ambiguity (the agent encountered a situation its guardrails didn't address).
This categorization tells you what to fix before expanding.
Agent reasoning errors may require prompt refinement, additional training data, or tighter guardrails. Data quality issues require upstream fixes. Integration problems require technical remediation.
Scope ambiguity requires updated guardrail definitions.
The most important incidents to analyze are the ones the team caught during supervised execution. These are the errors that would have reached production if the agent had been operating autonomously.
They define the boundary of safe autonomy: the agent can be trusted to act independently on action types where no supervised-execution errors occurred, and it should remain supervised for action types where errors were caught.
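A minimal sketch of that analysis, assuming each incident record carries a category and the action type involved (both field names are illustrative):

```python
from collections import Counter

# Categories from the incident review above; labels are illustrative.
CATEGORIES = ("reasoning", "data_quality", "integration", "scope_ambiguity")

def categorize(incidents):
    """Tally incidents by cause to see what must be fixed before expanding."""
    return Counter(i["category"] for i in incidents)

def safe_autonomy_candidates(incidents, action_types):
    """Action types with zero supervised-execution errors are the only
    candidates for graduated autonomy; everything else stays supervised."""
    failed = {i["action_type"] for i in incidents}
    return sorted(set(action_types) - failed)
```

The second function encodes the boundary rule directly: one caught error on an action type is enough to keep it in the supervised column.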
Based on the KPI validation and incident review, update the guardrail architecture.
This might mean tightening some boundaries (the agent encountered a failure mode you didn't anticipate), loosening others (the agent performed well in areas where you were initially conservative), or adding new escalation conditions (a pattern emerged that your original rules didn't cover).
Document every change to the guardrails and the reasoning behind it. This creates an evolution record that is invaluable if you expand the pilot to additional processes.
Future deployments benefit from the lessons learned in the first one, but only if those lessons are captured in a retrievable format.
If the data supports expansion, do it incrementally. Not "turn it on for the whole company" but "add the next department" or "include the next category of inputs."
Each expansion step follows the same pattern: shadow testing with the new scope, supervised execution, performance validation, and then graduated autonomy.
The cadence of expansion should be driven by performance data, not by a project timeline.
If the agent performs well with Department A's invoices but struggles with Department B's (because Department B's vendors use different formats, or their exception patterns are different), that isn't a failure.
That's information.
Address the gap before expanding further.
A well-run 90-day pilot doesn't end with a binary "deploy or don't deploy" decision. It ends with a detailed understanding of where the agent performs well, where it struggles, what your operation needs to fix before further expansion, and a sequenced plan for what comes next.
Three anti-patterns are common enough to call out explicitly. Each one can turn a promising pilot into a cautionary tale.
The impulse to deploy agentic AI across multiple processes simultaneously is understandable.
You have invested in the evaluation, you see the potential in several areas, and you want to maximize the return on your time and attention.
Resist that impulse. Every process you add to the pilot multiplies the variables.
If something goes wrong, you can't isolate whether the issue is the agent's reasoning, the data quality in System A, the integration with System B, or an interaction between two processes that share a dependency.
A single-process pilot gives you a clean signal. A multi-process pilot gives you noise.
Start with one. Get it right. Document the lessons. Apply them to the next one.
The second pilot will move faster because of what you learned in the first. The third will move faster still.
Sequential pilots that build on each other will get you to broad deployment faster than a parallel approach that collapses under its own complexity.
If you can't describe the current process clearly enough for a new employee to follow it, the process isn't ready for an agent. Agentic AI isn't a discovery tool for undocumented processes.
It's an execution tool for well-understood ones.
This is the most common prerequisite failure we encounter. A team says, "We want to automate our order processing."
We ask them to walk us through the process. Three team members describe three different versions. The exception handling varies by who is working that day. Certain steps exist because of a workaround for a system limitation that was fixed two years ago, but the workaround persists out of habit.
This isn't an AI problem. It's a process problem. And it has to be solved before any technology is layered on top. The good news is that the process mapping exercise in Phase 1 will surface these issues.
The important thing is to address them rather than work around them.
This is the anti-pattern that causes the most damage, because its consequences are invisible until they are severe.
An agentic system without adequate logging is a black box.
When it works, you can't explain why.
When it fails, you can't diagnose the cause. When an auditor asks what the system did last quarter, you can't answer.
When your team loses confidence in the system, you have no data to either validate or refute their concerns.
Logging and observability are not Phase 2 concerns that you will "add later." They are Phase 0 requirements.
If your system isn't logging every input, decision, action, and escalation from day one, you are building on a foundation that can't support production deployment, regardless of how well the agent performs.
The cost of implementing logging during initial development is trivial compared to the cost of retrofitting it after deployment, or the cost of an incident you can't investigate because the data does not exist.
Key Takeaways
90 days is enough to generate real answers, not just real enthusiasm. A structured pilot produces performance data from your specific operation, with your specific data, against your specific success criteria. That's infinitely more valuable than vendor benchmarks or industry averages.
Phase 1 is the most important phase. Choosing the right process, documenting the current state, and defining measurable success criteria determines whether the rest of the pilot generates useful signal or just noise. Invest the time here.
Guardrails are architecture, not afterthoughts. Data scope, tool permissions, escalation rules, and logging requirements should be defined and documented before the agent processes its first input. Building them in parallel with the agent or "tightening them up later" is how pilots produce uninterpretable results.
Shadow testing is non-negotiable. The agent proves itself against real data without touching production before it earns the right to operate in a live environment. Skipping this step to save time is the most reliable way to ensure the pilot damages team confidence rather than building it.
Expansion decisions should be driven by data, not momentum. At day 90, compare actual performance against the criteria you defined at day 1. Met the bar? Expand incrementally. Missed the bar? Diagnose, adjust, and re-validate. Sunk costs are not a reason to proceed.
The pilot's value extends beyond the agent itself. The process mapping, data quality assessment, and governance framework you build during the pilot improve your operation regardless of whether you expand the agentic deployment. None of this work is wasted.