Most AI agent pilots produce a working demo. Far fewer reach production with reliable, measurable, accountable behavior. The gap isn't model quality—it's the rollout discipline between 'works on my laptop' and 'handles 5,000 customer interactions per day without anyone losing sleep.' This playbook lays out the staged rollout pattern used by teams that ship agents successfully into customer-facing workflows.
Written by Max Zeshut
Founder at Agentmelt
Before any rollout, write down: (1) the specific task the agent must do, (2) the success metric (resolution rate, conversion, cost-per-task), (3) the failure modes you cannot tolerate (refunds without approval, leaked PII, false medical advice), and (4) the rollback trigger (what observable metric, at what threshold, means you halt the rollout). Without these four artifacts, you don't have a production target—you have a hope. Most failed agent launches skip this step and improvise during incidents.
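One way to keep the four artifacts honest is to record them as data, so a rollout cannot start with one left blank. The sketch below is a minimal illustration, not a prescribed format: the class names, field names, and the example values (including the 0.08 threshold) are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    """Artifact 4: an observable metric and the threshold that halts the rollout."""
    metric: str
    threshold: float
    higher_is_worse: bool = True

    def should_halt(self, observed: float) -> bool:
        if self.higher_is_worse:
            return observed >= self.threshold
        return observed <= self.threshold


@dataclass(frozen=True)
class ProductionTarget:
    task: str                              # artifact 1: the specific task
    success_metric: str                    # artifact 2: how success is measured
    intolerable_failures: tuple[str, ...]  # artifact 3: failure modes you cannot tolerate
    rollback_trigger: RollbackTrigger      # artifact 4: when to halt

    def missing_artifacts(self) -> list[str]:
        """Return the names of artifacts that were left blank."""
        missing = []
        if not self.task.strip():
            missing.append("task")
        if not self.success_metric.strip():
            missing.append("success_metric")
        if not self.intolerable_failures:
            missing.append("intolerable_failures")
        if not self.rollback_trigger.metric.strip():
            missing.append("rollback_trigger")
        return missing


# Hypothetical example values, not taken from a real deployment.
target = ProductionTarget(
    task="Resolve order-status questions from chat",
    success_metric="resolution_rate",
    intolerable_failures=("refund without approval", "leaked PII"),
    rollback_trigger=RollbackTrigger(metric="escalation_rate", threshold=0.08),
)
```

A launch checklist could refuse to proceed while `target.missing_artifacts()` is non-empty, and the monitoring job could call `target.rollback_trigger.should_halt(observed)` on each reading.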
Curate 100-300 representative tasks from real historical data: typical cases, known edge cases, prior failures, and adversarial inputs. Each task has an input and a known-good outcome (or a rubric for scoring). The eval set is your ground truth—every prompt change, model change, and integration change runs against it before deployment. See the [[agent-eval-driven-development]] entry for more. Build the eval set before you finish the agent; if you can't articulate what 'correct' looks like for 200 examples, you don't yet understand the problem well enough to ship.
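The "every change runs against the eval set before deployment" rule can be sketched as a small gate. This is an illustrative outline under stated assumptions: `agent` is any callable from input text to output text, `exact_match` stands in for whatever rubric you actually score with, and the 0.90 floor is a placeholder rather than a recommended value.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected: str  # known-good outcome for this input


def exact_match(got: str, expected: str) -> bool:
    """Simplest possible scorer; a rubric-based scorer would replace this."""
    return got.strip().lower() == expected.strip().lower()


def run_eval(
    agent: Callable[[str], str],
    cases: Sequence[EvalCase],
    scorer: Callable[[str, str], bool] = exact_match,
) -> tuple[float, list[str]]:
    """Run every case and return (pass_rate, ids of failed cases)."""
    if not cases:
        raise ValueError("eval set is empty")
    failed = [c.case_id for c in cases if not scorer(agent(c.input_text), c.expected)]
    pass_rate = (len(cases) - len(failed)) / len(cases)
    return pass_rate, failed


def may_deploy(pass_rate: float, baseline_pass_rate: float, min_pass_rate: float = 0.90) -> bool:
    """Block a change that falls below the floor or regresses against the current version."""
    return pass_rate >= min_pass_rate and pass_rate >= baseline_pass_rate
```

Comparing against the baseline as well as a fixed floor means a prompt or model change that scores acceptably in isolation still cannot ship if it is worse than what is already running.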
Deploy the agent in [[shadow-mode]]: it processes real production traffic but its outputs go to logs, not customers. Run for 2-4 weeks. Sample 50-100 shadow outputs per week and compare against what your humans actually did. Look for: cases where the agent was better than the human (these are your wins to communicate to stakeholders), cases where it was clearly worse (these go into your eval set as regressions to prevent), and cases where humans disagree about the right answer (these are your real-world ambiguity, which you must handle explicitly).
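The weekly shadow review can be sketched as a sample-then-classify step. The record layout and the verdict labels (`"agent"`, `"human"`, `"same"`) are assumptions for illustration; the three outcome buckets mirror the wins, regressions, and ambiguous cases described above.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ShadowRecord:
    task_id: str
    agent_output: str           # logged only, never sent to the customer
    human_output: str           # what the human actually did
    verdicts: tuple[str, ...]   # one per reviewer: "agent", "human", or "same"


def weekly_sample(records: Sequence[ShadowRecord], k: int = 50, seed: int = 0) -> list[ShadowRecord]:
    """Draw up to k records for review; a fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(list(records), min(k, len(records)))


def classify(record: ShadowRecord) -> str:
    """Bucket one reviewed record; reviewer disagreement counts as ambiguity."""
    if not record.verdicts:
        return "unreviewed"
    distinct = set(record.verdicts)
    if len(distinct) != 1:
        return "ambiguous"
    labels = {"agent": "agent_better", "human": "agent_worse", "same": "equivalent"}
    return labels[distinct.pop()]


def summarize(records: Sequence[ShadowRecord]) -> Counter:
    """Count how many sampled records fall into each bucket."""
    return Counter(classify(r) for r in records)
```

Records classified as `agent_worse` are the ones to copy into the eval set as regression cases, and the `ambiguous` bucket is the list of situations that need an explicit policy.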
Once shadow mode shows the agent is meeting the bar, start a [[canary-rollout]]: 1-5% of real traffic, with explicit guardrails (max actions per session, dollar limits, action allowlists, mandatory human review for irreversible operations). Monitor your success metric and your failure-mode rate daily. Don't expand traffic until you have 7-14 days of stable production data. Expand in stages: 5% → 15% → 40% → 80% → 100%, holding at each stage long enough to detect issues that don't surface immediately.
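A common way to implement the traffic split is deterministic hash bucketing, so a given session stays on the same path as the percentage grows. The sketch below assumes that approach plus a simple guardrail check; the function names, the `Guardrails` fields, and the 7-day default are illustrative choices, while the stage list comes from the text above.

```python
import hashlib
from dataclasses import dataclass

STAGES = [5, 15, 40, 80, 100]  # percent of traffic at each expansion stage


def bucket(session_id: str) -> int:
    """Map a session to a stable bucket in 0..9999."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def routes_to_agent(session_id: str, percent: float) -> bool:
    """True if this session falls inside the current rollout percentage."""
    return bucket(session_id) < percent * 100


def next_stage(current_percent: float, stable_days: int, min_stable_days: int = 7) -> float:
    """Advance one stage only after enough stable days; otherwise hold."""
    if stable_days < min_stable_days:
        return current_percent
    for stage in STAGES:
        if stage > current_percent:
            return stage
    return current_percent


@dataclass(frozen=True)
class Guardrails:
    max_actions_per_session: int
    max_dollars_per_action: float
    allowed_actions: frozenset[str]
    irreversible_actions: frozenset[str]

    def check(self, action: str, dollars: float, actions_so_far: int) -> str:
        """Return "allow", "block", or "human_review" for a proposed action."""
        if action not in self.allowed_actions:
            return "block"
        if actions_so_far >= self.max_actions_per_session:
            return "block"
        if dollars > self.max_dollars_per_action:
            return "block"
        if action in self.irreversible_actions:
            return "human_review"
        return "allow"
```

Because the bucket depends only on the session ID, every session routed to the agent at 5% is still routed to it at 15%, which keeps the customer experience consistent while the rollout expands.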
At 100% traffic, the agent is in [[agent-supervision]] mode: 5-10% of outputs sampled for human review, automated quality checks running continuously, alerts wired to the on-call rotation. The first 90 days are the highest-risk period, so plan for at least 0.5 FTE of monitoring time. Maintain an incident log that the whole team can read: every regression, every false positive, every customer escalation. The log becomes the agenda for the weekly improvement cycle and the artifact you point to when leadership asks 'how do we know this is working?'
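The sampling and alerting pieces of supervision might look like the sketch below. It assumes each output has a stable ID and that an automated check produces a pass/fail result per output; the window size, alert rate, and minimum sample count are placeholder values, and paging the on-call rotation is left to whatever alerting system you already use.

```python
import hashlib
from collections import deque


def sampled_for_review(output_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of outputs for human review."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") < int(rate * 2**64)


class QualityMonitor:
    """Tracks automated check results over a rolling window."""

    def __init__(self, window: int = 200, alert_rate: float = 0.05, min_samples: int = 50) -> None:
        self._results: deque[bool] = deque(maxlen=window)
        self.alert_rate = alert_rate
        self.min_samples = min_samples

    def failure_rate(self) -> float:
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def should_alert(self) -> bool:
        # Require a minimum sample count so one early failure does not page anyone.
        return len(self._results) >= self.min_samples and self.failure_rate() >= self.alert_rate

    def record(self, passed: bool) -> bool:
        """Store one check result and report whether the alert condition now holds."""
        self._results.append(passed)
        return self.should_alert()
```

Each alert raised this way is a natural entry for the incident log, alongside the regressions and escalations found through human review.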
Define rollback before you need it. Common patterns: feature flag the agent so traffic can be cut to 0% from a dashboard in under 60 seconds; keep the human-only path warm (don't decommission it immediately at 100% AI rollout); document who has authority to roll back (and that it's the team's call, not a multi-meeting committee decision). The teams that recover from agent incidents fastest are the ones that practiced rolling back during the canary phase, when stakes were low.
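The feature-flag pattern with a warm human path can be sketched as follows. This is an in-memory illustration only: a real deployment would back the flag with a feature-flag service or config store, and the names `AgentTrafficFlag`, `handle`, and `in_rollout` are hypothetical.

```python
from typing import Callable


class AgentTrafficFlag:
    """Holds the share of traffic sent to the agent, with an audit trail."""

    def __init__(self, percent: float = 0.0) -> None:
        self._percent = percent
        self.audit_log: list[tuple[str, float]] = []

    @property
    def percent(self) -> float:
        return self._percent

    def set_percent(self, percent: float, changed_by: str) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._percent = percent
        self.audit_log.append((changed_by, percent))  # who changed it, and to what

    def kill(self, changed_by: str) -> None:
        """Rollback: cut agent traffic to 0% in a single call."""
        self.set_percent(0.0, changed_by)


def handle(
    request: str,
    flag: AgentTrafficFlag,
    agent_path: Callable[[str], str],
    human_path: Callable[[str], str],
    in_rollout: Callable[[str, float], bool],
) -> str:
    """Send the request to the agent only while the flag allows it; otherwise use the human path."""
    if flag.percent > 0 and in_rollout(request, flag.percent):
        return agent_path(request)
    return human_path(request)
```

Because `handle` always has a human path to fall back to, calling `flag.kill(...)` changes routing on the next request, and the audit log records who exercised the rollback authority.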
For a moderately complex agent (handles a defined workflow, integrates with 2-4 systems, has 100-300 eval cases), expect 6-12 weeks from working prototype to 100% production traffic. Shorter is possible for low-risk, low-volume agents (internal tools, draft generators) where shadow mode and canary can compress to days. Longer is normal for high-stakes, regulated, or customer-trust-critical workflows (healthcare, financial advice, legal). The shape of the curve matters more than the absolute timeline: catch issues at 1% traffic, not at 100%.
The most common rollout mistake is skipping shadow mode. Teams under pressure jump from prototype to live customer traffic and discover real-world failure modes on real customers. Shadow mode is the cheapest insurance policy in the playbook: the agent does the work, you compare its output to what your team did, and you find the gaps before any customer is affected. Two to four weeks of shadow mode catches more issues than two months of additional eval-set tuning.
You don't need a dedicated operations owner for one or two production agents at low volume; the team that built the agent owns operations. Above ~3 production agents or ~50,000 monthly tasks, dedicated ownership becomes worth it: someone whose job is monitoring traces, maintaining eval sets, and coordinating incident response. Many teams formalize this as 'AgentOps' or fold it into existing SRE/platform roles. Don't wait until [[agent-sprawl]] forces it; the right time is when one team can no longer keep up with the responsibility part-time.