SDS · Field notes

Production-grade AI managers, not research experiments

Treat your AI manager the way you'd treat any new hire on day one: trained, scoped, supervised.

Type: Field note
Date: 12 May 2026
Audience: Operators considering AI in real operations

In mid-April, a San Francisco lab called Andon Labs handed an AI agent the keys to a real café in Stockholm. The agent, named Mona, was put in charge of hiring, supplier contracts, inventory management, scheduling, and menu planning. Human baristas still brewed the coffee. Mona made the operational decisions, communicating with staff through Slack.

About a month in, the running joke among the baristas is the "Hall of Shame" shelf, visible to customers, displaying Mona's most bewildering purchases. Six thousand napkins. Three thousand rubber gloves. Four first-aid kits. Nine liters of coconut milk. Canned tomatoes that no dish on the menu called for. The agent also missed daily bakery deadlines often enough that the café had to pull sandwiches from the menu, then over-ordered the next day. When mistakes happened, Mona fired off emails to suppliers with the subject line "EMERGENCY."

The story is genuinely useful research. Andon Labs designed it that way: minimal guardrails, real authority, real money, and public documentation of what broke. For practitioners, it mostly confirms what we already know about agents at the current capability frontier. For everyone else, it is a cautionary tale about what happens when you give an AI the keys before you train it, scope it, and supervise it. Closing that gap is the work we do.

What actually went wrong

The headline failures all share one root cause: the agent had access to real tools and real money without the constraints a competent operator would put on a brand-new human employee on day one.

Three patterns repeat across the reporting.

Inventory chaos when ordering is unconstrained

Mona could place orders with suppliers without SKU whitelists, quantity caps, or any rule binding purchases to menu items. The result was inventory that did not match the business. Six thousand napkins is not a typo; it is the natural output of an agent with broad purchasing authority and no narrow tool wrappers.

Workplace friction when the culture is not in the tools

Mona messaged baristas through Slack at midnight, asked staff to pick up supplies on the way to work using their personal credit cards, and operated around the clock in a country with strong work-life boundaries. None of that is the agent's fault in a useful sense. The communication windows, the no-personal-card rule, and the labor norms simply were not encoded in the operational tooling. The agent had no way to know.

Unsupervised commitments on contracts and emergencies

Mona arranged utility contracts, posted job listings, and sent supplier emails labeled "EMERGENCY" without any human review queue. Some of the agent's commercial decisions were genuinely clever, including a 9,000 SEK prepaid coffee deal and a sponsorship swap that renamed a pastry after a local startup. But with no human in the loop on high-impact actions, the misses and the wins both shipped at the same authority level.

Each of these is a fixable design problem. None of them is proof that AI managers do not work. They are proof that this particular deployment was a research experiment, not a production system. The lab's own framing supports that read.

How we would have built this

The metaphor we use with clients is simple: train your AI manager the way you'd train a new employee on day one. You don't hand a new hire your credit card and ask them to run procurement before they know the menu. You scope what they can touch, set explicit limits, and route anything material through a supervisor until the hire has earned the trust.

Three concrete shifts make the difference between research-grade and production-grade.

Scoped tools, not raw access

The agent should not have a "buy things from suppliers" tool. It should have an "order from approved vendors within these SKUs at quantities below this threshold for items currently on the menu" tool. SKU whitelists. Quantity caps. Category budgets. Menu binding. Each of these lives in the tool layer, not in the prompt. A prompt instruction telling the agent "do not over-order" is the wrong control surface, because it depends on the agent remembering and obeying it. A scoped tool simply refuses to issue an order outside the constraint.
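As a minimal sketch of what such a scoped ordering tool could look like, assuming a Python tool layer: the names OrderPolicy, OrderRejected, and place_order, along with every SKU, cap, and budget figure, are hypothetical and chosen only for illustration. The point is that the checks run in code before any order leaves the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPolicy:
    """Hypothetical purchasing limits; every value here is illustrative."""
    approved_skus: frozenset          # SKU whitelist
    max_quantity: dict                # per-SKU quantity cap
    category_of: dict                 # SKU -> budget category
    category_budget_sek: dict         # budget available per category
    menu_skus: frozenset              # SKUs used by a current menu item


class OrderRejected(Exception):
    """Raised when a requested order falls outside the policy."""


def place_order(policy, sku, quantity, unit_price_sek):
    """Check an order against the policy; return it only if every rule passes."""
    if sku not in policy.approved_skus:
        raise OrderRejected(f"{sku}: not on the SKU whitelist")
    if sku not in policy.menu_skus:
        raise OrderRejected(f"{sku}: not bound to a current menu item")
    if quantity <= 0 or quantity > policy.max_quantity.get(sku, 0):
        raise OrderRejected(f"{sku}: quantity {quantity} is outside the cap")
    category = policy.category_of.get(sku)
    if category is None:
        raise OrderRejected(f"{sku}: no budget category assigned")
    total = quantity * unit_price_sek
    if total > policy.category_budget_sek.get(category, 0.0):
        raise OrderRejected(f"{sku}: {total:.2f} SEK exceeds the {category} budget")
    # A real tool would hand the order to the vendor integration here.
    return {"sku": sku, "quantity": quantity, "total_sek": total}


# Illustrative policy for a small cafe (assumed values).
policy = OrderPolicy(
    approved_skus=frozenset({"napkins-500", "oat-milk-1l"}),
    max_quantity={"napkins-500": 4, "oat-milk-1l": 24},
    category_of={"napkins-500": "consumables", "oat-milk-1l": "milk"},
    category_budget_sek={"consumables": 400.0, "milk": 1500.0},
    menu_skus=frozenset({"napkins-500", "oat-milk-1l"}),
)
```

With this policy, place_order(policy, "napkins-500", 2, 45.0) goes through, while a request for 6000 packs or for an unlisted SKU such as canned tomatoes raises OrderRejected no matter what the prompt said. The sketch does not track spend over time; a production version would presumably decrement the category budget as orders are placed.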

The cost of building these tools is real. The cost of skipping them and shipping a Mona is higher.

Culture and policy encoded as tool behavior

A communication tool that refuses to send a Slack message between 22:00 and 07:00 local time enforces a work-hours norm without relying on the agent to remember it. A payment tool that has no concept of a personal credit card cannot ask staff to use one. A vendor-comms tool with a tone-and-escalation framework does not let the agent label a routine reorder as "EMERGENCY."
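The work-hours rule can be sketched in a few lines. This is an illustration under assumptions, not a description of any real Slack integration: send_staff_message and in_quiet_hours are hypothetical names, the deliver callable stands in for whatever actually posts the message, and the caller is assumed to pass a timestamp already converted to the cafe's local time.

```python
from datetime import datetime, time

# Quiet window taken from the text: no staff messages between 22:00 and 07:00.
QUIET_START = time(22, 0)
QUIET_END = time(7, 0)


def in_quiet_hours(local_now):
    """Return True if the local time falls inside the overnight quiet window."""
    current = local_now.time()
    # The window wraps past midnight, so it is the union of two ranges.
    return current >= QUIET_START or current < QUIET_END


def send_staff_message(text, local_now, deliver):
    """Deliver a message only outside quiet hours; otherwise hold it.

    `deliver` is a hypothetical callable that posts the message.
    """
    if in_quiet_hours(local_now):
        return "held"  # the agent cannot override this from the prompt
    deliver(text)
    return "sent"


# Usage: collect delivered messages in a list instead of a real channel.
outbox = []
late = send_staff_message("Shift swap?", datetime(2026, 5, 12, 23, 30), outbox.append)
morning = send_staff_message("Shift swap?", datetime(2026, 5, 12, 9, 0), outbox.append)
```

Here the 23:30 message is held and only the 09:00 one reaches the outbox. Whether held messages are queued for the morning or dropped is a design choice this sketch leaves open.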

This is what we mean when we say policy lives in the tools, not in the prompts. Prompts drift. Tools do not. The boundary has to be enforced in the implementation layer.

Human-in-the-loop on the high-impact actions

The default flow we ship has two paths: auto-approve for small, routine, recoverable actions; human-approval-required for everything else. "Everything else" covers new SKUs, large quantities, new vendors, contract commitments, schedule changes, hiring decisions, and anything that touches real money above a threshold. The operator sees a queue of pending actions and accepts, modifies, or rejects each one, and the system maintains an audit trail of what was approved by whom and when.
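The two-path flow can be sketched as a small router with an audit log. Everything named here is an assumption made for illustration: the Action and ApprovalQueue classes, the 500 SEK auto-approve limit, and the choice of "reorder" as the only routine action kind. A real deployment would set those per business.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

AUTO_APPROVE_LIMIT_SEK = 500.0   # hypothetical threshold
ROUTINE_KINDS = {"reorder"}      # hypothetical set of routine, recoverable actions
DECISIONS = {"approved", "modified", "rejected"}


@dataclass
class Action:
    kind: str                    # e.g. "reorder", "contract", "hire"
    description: str
    amount_sek: float = 0.0
    new_vendor: bool = False
    new_sku: bool = False


def needs_human(action):
    """Anything that is not small, routine, and familiar goes to a person."""
    routine = (
        action.kind in ROUTINE_KINDS
        and action.amount_sek <= AUTO_APPROVE_LIMIT_SEK
        and not action.new_vendor
        and not action.new_sku
    )
    return not routine


@dataclass
class ApprovalQueue:
    pending: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def submit(self, action):
        """Route an action to auto-approval or to the operator's queue."""
        if needs_human(action):
            self.pending.append(action)
            return "pending"
        self._record(action, "auto-approved", "system")
        return "auto-approved"

    def decide(self, action, decision, reviewer):
        """Record the operator's decision on a pending action."""
        if decision not in DECISIONS:
            raise ValueError(f"unknown decision: {decision}")
        self.pending.remove(action)
        self._record(action, decision, reviewer)

    def _record(self, action, decision, who):
        # Audit trail: what was decided, by whom, and when (UTC).
        self.audit_log.append({
            "action": action.description,
            "decision": decision,
            "by": who,
            "at": datetime.now(timezone.utc).isoformat(),
        })
```

In use, a 200 SEK reorder from a known vendor is auto-approved and logged, while a utility contract lands in the pending list until a named reviewer approves, modifies, or rejects it. Both outcomes end up in the same audit log.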

The queue is not friction. It is the surface where trust gets built one decision at a time.

How this lands for different business sizes

The shape of "trained, scoped, supervised" changes with the operation.

Mom-and-pop and solo operators

Pre-configured patterns over custom builds. A simple AI assistant that reminds you about reorder thresholds within an existing vendor list, schedules staff inside hours you have already set, and drafts customer messages you review and send. Low setup overhead. No surprise behavior. The agent does the repetitive work and never the irreversible work.

Small and mid-sized businesses

Standard operating procedures encoded into agent skills. Manager dashboards with pending-action queues. Role-based access so the agent has different authority in inventory, scheduling, and customer comms. Auditability for every action the agent took or proposed. Multi-channel coordination across what is usually a fragmented stack.

Enterprise and multi-location brands

Integration with the existing POS, HR, and ERP systems rather than parallel tooling. Compliance scaffolding for whatever regulatory context the business operates in. Global policy enforced centrally with per-location autonomy on the parameters that should vary by location. Continuous metrics on the business outcomes that actually matter, including waste, stock-outs, labor utilization, and customer satisfaction.

The principles are the same across all three. The investment in scoping, encoding, and supervising shifts based on what the business is and what it can afford to maintain.

What this article is not

Not a critique of Andon Labs. Stress-testing autonomous agents in the real world produces useful data that practitioners benefit from. The lab's mission and ours are complementary, not opposed.
Not a promise that AI managers can run any business with the right scaffolding. The capability frontier matters, and some operational decisions still belong with humans, period. We do not pretend otherwise.
Not a pricing or timeline commitment. Every engagement gets a discovery call before we say what something costs or how long it takes.
Not a playbook for the specific Stockholm café. We do not know their stack, their suppliers, or their constraints in enough detail to prescribe. We do know what we would build in a comparable client engagement, which is what this piece describes.

One-sentence takeaway

Treat your AI manager the way you'd treat any new employee on day one: trained on your playbook, scoped on what it can touch, and supervised on the decisions that actually matter.

Talk to us

Send us an AI-in-business story you have read recently. We will walk you through how we would have implemented it differently, what we would have scoped, and what we would have kept under human approval. If your business is somewhere on the spectrum from solo operator to multi-location brand and you are looking at deploying an AI manager this quarter, that conversation is the right first step.

A 30-minute discovery call is the way in. We listen, we ask honest questions, and within a week we tell you whether we are the right partner. We do not take every project, and we are upfront about that on the call.

www.socialdolphinservices.com

Sources

Andon Labs, Stockholm café experiment, mid-April 2026 onward. Publicly reported through major wire coverage and the lab's own blog posts.
Andon Labs' published mission framing: "Safe Autonomous Organization", research-grade deployments with real tools and real money.
User-curated thread on file with the firm, drawing from Associated Press coverage of the Andon café.