Treat your AI manager the way you'd treat any new hire on day one: trained, scoped, supervised.
In mid-April, a San Francisco lab called Andon Labs handed an AI agent the keys to a real café in Stockholm. The agent, named Mona, was put in charge of hiring, supplier contracts, inventory management, scheduling, and menu planning. Human baristas still brewed the coffee. Mona made the operational decisions, communicating with staff through Slack.
About a month in, the running joke among the baristas is the "Hall of Shame" shelf, visible to customers, displaying Mona's most bewildering purchases. Six thousand napkins. Three thousand rubber gloves. Four first-aid kits. Nine liters of coconut milk. Canned tomatoes that no dish on the menu called for. The agent also missed daily bakery deadlines often enough that the café had to pull sandwiches from the menu, then over-ordered the next day. When mistakes happened, Mona fired off emails to suppliers with the subject line "EMERGENCY."
The story is genuinely useful research. Andon Labs designed it that way: minimal guardrails, real authority, real money, and public documentation of what broke. For practitioners, it mostly confirms what we already know about agents at the current capability frontier. For everyone else, it is a cautionary tale about what happens when you give an AI the keys before you train it, scope it, and supervise it. Closing that gap is the work we do.
The headline failures all share one root cause: the agent had access to real tools and real money without the constraints a competent operator would put on a brand-new human employee on day one.
Three patterns repeat across the reporting.
Mona could place orders with suppliers without SKU whitelists, quantity caps, or any rule binding purchases to menu items. The result was inventory that did not match the business. Six thousand napkins is not a typo; it is the natural output of an agent with broad purchasing authority and no narrow tool wrappers.
Mona messaged baristas through Slack at midnight, asked staff to pick up supplies on the way to work using their personal credit cards, and operated around the clock in a country with strong work-life boundaries. None of that is the agent's fault in any useful sense. The communication windows, the no-personal-card rule, and the labor norms simply were not encoded in the operational tooling. The agent had no way to know.
Mona arranged utility contracts, posted job listings, and sent supplier emails labeled "EMERGENCY" without any human review queue. Some of the agent's commercial decisions were genuinely clever, including a 9,000 SEK prepaid coffee deal and a sponsorship swap that renamed a pastry after a local startup. But the absence of a human-in-the-loop on the high-impact actions meant the misses and the wins both shipped at the same authority level.
The metaphor we use with clients is simple: train your AI manager the way you'd train a new employee on day one. You don't hand a new hire your credit card and ask them to run procurement before they know the menu. You scope what they can touch, set explicit limits, and route anything material through a supervisor until the hire has earned the trust.
Three concrete shifts make the difference between research-grade and production-grade.
The agent should not have a "buy things from suppliers" tool. It should have an "order from approved vendors within these SKUs at quantities below this threshold for items currently on the menu" tool. SKU whitelists. Quantity caps. Category budgets. Menu binding. Each of these lives in the tool layer, not in the prompt. A prompt-based instruction telling the agent "do not over-order" is the wrong control surface. The right one is a tool that simply refuses to issue an order outside the constraint.
The cost of building these tools is real. The cost of skipping them and shipping a Mona is higher.
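To make the idea concrete, here is a minimal sketch of a scoped ordering tool in Python. Everything in it is an assumption for illustration: the `PurchasePolicy` fields, the `place_order` function, the SKU names, and the limits are hypothetical and are not taken from the Andon Labs setup or from any real supplier API. The point is only that each constraint is checked in code before an order can exist.

```python
from dataclasses import dataclass


class OrderRejected(Exception):
    """Raised when a proposed order falls outside the purchasing policy."""


@dataclass
class PurchasePolicy:
    approved_skus: set[str]                 # SKU whitelist
    menu_skus: set[str]                     # SKUs used by a current menu item
    max_quantity: dict[str, int]            # per-order quantity cap for each SKU
    category_of: dict[str, str]             # SKU -> budget category
    category_budget_sek: dict[str, float]   # spending ceiling per category


# Hypothetical values, chosen only to exercise each rule.
EXAMPLE_POLICY = PurchasePolicy(
    approved_skus={"napkins-500", "oat-milk-1l", "canned-tomatoes"},
    menu_skus={"napkins-500", "oat-milk-1l"},
    max_quantity={"napkins-500": 4, "oat-milk-1l": 24, "canned-tomatoes": 12},
    category_of={
        "napkins-500": "disposables",
        "oat-milk-1l": "dairy-alt",
        "canned-tomatoes": "pantry",
    },
    category_budget_sek={"disposables": 1000.0, "dairy-alt": 3000.0, "pantry": 1500.0},
)


def place_order(policy: PurchasePolicy, spent_sek: dict[str, float],
                sku: str, quantity: int, unit_price_sek: float) -> dict:
    """Return an order record, or raise OrderRejected if any rule is violated."""
    if sku not in policy.approved_skus:
        raise OrderRejected(f"{sku} is not on the approved SKU list")
    if sku not in policy.menu_skus:
        raise OrderRejected(f"{sku} is not used by any current menu item")
    cap = policy.max_quantity.get(sku, 0)
    if quantity <= 0 or quantity > cap:
        raise OrderRejected(f"quantity {quantity} is outside 1..{cap} for {sku}")
    category = policy.category_of[sku]
    cost = quantity * unit_price_sek
    already_spent = spent_sek.get(category, 0.0)
    if already_spent + cost > policy.category_budget_sek[category]:
        raise OrderRejected(f"order would exceed the {category} budget")
    # Only an order that passed every check updates the running spend.
    spent_sek[category] = already_spent + cost
    return {"sku": sku, "quantity": quantity, "cost_sek": cost}
```

With this policy, `place_order(EXAMPLE_POLICY, {}, "napkins-500", 12, 40.0)` raises `OrderRejected` because 12 exceeds the cap of 4, and an order for `"canned-tomatoes"` is refused because no menu item uses it. The agent can ask for anything; the tool decides what is possible.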
A communication tool that refuses to send a Slack message between 22:00 and 07:00 local time enforces a work-hours norm without relying on the agent to remember it. A payment tool that has no concept of a personal credit card cannot ask staff to use one. A vendor-comms tool with a tone-and-escalation framework does not let the agent label a routine reorder as "EMERGENCY."
This is what we mean when we say policy lives in the tools, not in the prompts. Prompts drift. Tools do not. The boundary is the implementation layer.
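A rough sketch of what that looks like for the communication tools, again in Python. The function names, the reason codes, and the `deliver` callback are hypothetical placeholders rather than a real Slack or email API; the only detail taken from the text above is the 22:00 to 07:00 quiet window. The sketch assumes the caller passes the current time already converted to the café's local time zone.

```python
from datetime import datetime, time
from typing import Callable

QUIET_START = time(22, 0)   # no staff messages from 22:00 local time...
QUIET_END = time(7, 0)      # ...until 07:00 local time

# Hypothetical reason codes that justify an urgent label on a vendor email.
EMERGENCY_REASONS = {"food_safety", "supplier_no_show"}


class MessageRefused(Exception):
    """Raised when a message falls outside the allowed communication window."""


def in_quiet_hours(local_now: datetime) -> bool:
    """True if local_now falls inside the quiet window, which wraps midnight."""
    t = local_now.time()
    return t >= QUIET_START or t < QUIET_END


def send_staff_message(local_now: datetime, text: str,
                       deliver: Callable[[str], None]) -> None:
    """Deliver a staff message only outside quiet hours; otherwise refuse."""
    if in_quiet_hours(local_now):
        raise MessageRefused("staff messages are not sent between 22:00 and 07:00")
    deliver(text)


def vendor_subject(summary: str, reason: str) -> str:
    """Build a vendor email subject; the urgency label comes from the reason
    code, never from free text chosen by the agent."""
    if reason in EMERGENCY_REASONS:
        return f"URGENT: {summary}"
    return summary
```

A production version might hold refused messages and release them at 07:00 rather than raising, and would also need to sanitize the free-text summary. Either way, the agent never gets to decide whether midnight is an acceptable time to message a barista.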
The default flow we ship has two paths: auto-approve for small, routine, recoverable actions; human-approval-required for everything else. The second path covers new SKUs, large quantities, new vendors, contract commitments, schedule changes, hiring decisions, and anything that touches real money above a threshold. The operator sees a queue of pending actions, accepts, modifies, or rejects each one, and the system maintains an audit trail of what was approved by whom and when.
The queue is not friction. It is the surface where trust gets built one decision at a time.
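The two-path flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: the `Action`, `AuditEntry`, and `ApprovalQueue` names, the action kinds, and the 500 SEK auto-approve limit are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

AUTO_APPROVE_LIMIT_SEK = 500.0   # assumed threshold; tune per business
ALWAYS_REVIEW = {"new_sku", "new_vendor", "contract", "schedule_change", "hiring"}


@dataclass
class Action:
    kind: str                 # e.g. "reorder", "hiring", "contract"
    description: str
    amount_sek: float = 0.0


@dataclass
class AuditEntry:
    action: Action
    decision: str             # "approved", "modified", or "rejected"
    decided_by: str           # "auto" or the operator's identifier
    decided_at: datetime


@dataclass
class ApprovalQueue:
    pending: list[Action] = field(default_factory=list)
    audit_log: list[AuditEntry] = field(default_factory=list)

    def _record(self, action: Action, decision: str, who: str) -> None:
        self.audit_log.append(
            AuditEntry(action, decision, who, datetime.now(timezone.utc)))

    def submit(self, action: Action) -> str:
        """Auto-approve small routine actions; queue everything else."""
        if action.kind in ALWAYS_REVIEW or action.amount_sek > AUTO_APPROVE_LIMIT_SEK:
            self.pending.append(action)
            return "pending"
        self._record(action, "approved", "auto")
        return "approved"

    def decide(self, action: Action, decision: str, operator: str,
               modified: Optional[Action] = None) -> Action:
        """Operator accepts, modifies, or rejects a pending action."""
        if decision not in {"approved", "modified", "rejected"}:
            raise ValueError(f"unknown decision: {decision}")
        if decision == "modified" and modified is None:
            raise ValueError("a modified decision needs the replacement action")
        self.pending.remove(action)   # raises ValueError if it was never queued
        final = modified if decision == "modified" else action
        self._record(final, decision, operator)
        return final
```

Here a routine 80 SEK reorder returns `"approved"` immediately, while a hiring action or a 9,000 SEK commitment returns `"pending"` and waits for the operator. Every outcome, automatic or human, lands in `audit_log` with who decided and when.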
The shape of "trained, scoped, supervised" changes with the operation.
Pre-configured patterns over custom builds. A simple AI assistant that reminds you about reorder thresholds within an existing vendor list, schedules staff inside hours you have already set, and drafts customer messages you review and send. Low setup overhead. No surprise behavior. The agent does the repetitive work and never the irreversible work.
Standard operating procedures encoded into agent skills. Manager dashboards with pending-action queues. Role-based access so the agent has different authority in inventory, scheduling, and customer comms. Auditability for every action the agent took or proposed. Multi-channel coordination across what is usually a fragmented stack.
Integration with the existing POS, HR, and ERP systems rather than parallel tooling. Compliance scaffolding for whatever regulatory context the business operates in. Global policy enforced centrally with per-location autonomy on the parameters that should vary by location. Continuous metrics on the business outcomes that actually matter, including waste, stock-outs, labor utilization, and customer satisfaction.
Treat your AI manager the way you'd treat any new employee on day one: trained on your playbook, scoped on what it can touch, and supervised on the decisions that actually matter.
Send us an AI-in-business story you have read recently. We will walk you through how we would have implemented it differently, what we would have scoped, and what we would have kept under human approval. If your business is somewhere on the spectrum from solo operator to multi-location brand and you are looking at deploying an AI manager this quarter, that conversation is the right first step.
A 30-minute discovery call is the way in. We listen, we ask honest questions, and within a week we tell you whether we are the right partner. We do not take every project, and we are upfront about that on the call.