Production Agent Architecture at Scale

Name: Production Agent Architecture at Scale
Author: Klara (iFood), Alex (Arcade)

44 highlights

agentic-tool-design agentic-design-patterns agentic-concepts agentic-product-philosophy rag review agentic-context-engineering agentic-prompt-ui-design

Highlights & Annotations

The insight is simple to state but profound in its implications: a tool is the inverse of an API. Where an API is a service contract describing how a downstream service works, a tool is a service contract describing what an agent intends to accomplish.

Ref. 9678-A

THE COST OF GETTING THIS WRONG: Every tool call is a decision point. Every decision point is a potential failure. When you expose service complexity to an agent, error rates multiply across the reasoning chain. A 95% success rate per step becomes roughly 77% over five steps and 60% over ten. The maze you built becomes a probability trap.

Ref. 7E05-B
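
The compounding in the highlight above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each step succeeds or fails independently of the others; the function name `chain_success` is made up for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent, so the per-step rates multiply.
    """
    return per_step ** steps


if __name__ == "__main__":
    for steps in (1, 5, 10):
        print(steps, round(chain_success(0.95, steps), 3))
```

With a 95% per-step rate this gives about 0.774 for five steps and about 0.599 for ten, so each extra decision point costs more than it first appears.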

This is a fundamentally different cognitive operation. When you design an API, you’re describing the capabilities and constraints of a service. When you design a tool, you’re describing a potential action that might satisfy an agent’s goal. The API says ‘here’s what I can do.’ The tool says ‘here’s what you might want to do.’

Ref. 86B7-C

Consider the implications. An API designer optimizes for completeness and composability—expose all the primitives, let the consumer assemble them. A tool designer optimizes for intention matching—make it obvious when this tool is the right choice, and make the wrong choice costly through absence rather than complexity.

Ref. 8B99-D
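
One rough way to see the difference between the two design goals is to count what the agent has to decide. The sketch below is illustrative only: the API-shaped tool names and parameters are invented, and `decision_surface` is a crude proxy (one decision per tool plus one per parameter), not a metric from the article.

```python
# Hypothetical API-shaped tools: complete and composable, but the agent
# has to assemble the primitives itself.
API_SHAPED = [
    {"name": "search_catalog", "params": ["query", "filters", "page", "sort"]},
    {"name": "get_item_details", "params": ["item_id", "fields"]},
    {"name": "check_availability", "params": ["item_id", "store_id"]},
]

# Hypothetical intention-shaped tool: one goal, one obvious input.
INTENTION_SHAPED = [
    {"name": "find_similar_products", "params": ["product_name"]},
]


def decision_surface(tools: list) -> int:
    """Crude proxy: one choice per tool plus one choice per parameter."""
    return len(tools) + sum(len(tool["params"]) for tool in tools)


if __name__ == "__main__":
    print("API-shaped:", decision_surface(API_SHAPED))
    print("Intention-shaped:", decision_surface(INTENTION_SHAPED))
```

The count says nothing about quality on its own, but it makes the trade-off concrete: the primitives live inside the tool implementation instead of in the agent's reasoning chain.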

The iFood team learned this through painful iteration. Their initial tool definitions ‘made a lot of sense to us,’ Kiara explains, ‘but it didn’t make sense to someone external that would see them for the first time.’ The tools were designed as APIs—describing what the services could do. Edge cases accumulated in the definitions: ‘whenever the user wants to order something and you don’t have enough information, make sure to call the get information tool.’ The definitions became procedural code masquerading as natural language.

Ref. 6922-E

The fix required a complete reframe. Instead of asking ‘what can this service do?’, the team asked ‘if I shared this tool with another team for their agent, would they immediately understand when and how to use it?’ This is the external reader test—a forcing function for intention-matched design.

Ref. E69E-F

“Let somebody else look at your tool definitions. Let them see if they can understand it. And if they can, then you can try it with the agent. But if they can’t, you already know where the problem is.”

Ref. 5C63-G

The result was transformative. After redesigning tools around intentions rather than capabilities, iFood saw ‘a massive improvement in latency because we could cut a lot of tokens and our system became much more stable.’ The tools became self-documenting. The agent stopped reasoning about infrastructure and started reasoning about user needs.

Ref. 27D2-H

At the top are tools designed precisely for particular agents’ intentions. These are highly specialized, with minimal parameters, domain-specific language, and embedded knowledge about the agent’s goals and constraints. ‘Get customer brochure’ wraps document retrieval with customer context, product matching, and version selection.

Ref. AF02-I

The benefits of this architecture compound. The foundation layer provides reliable infrastructure that multiple teams share without duplication. The workflow layer captures patterns that would otherwise be reimplemented inconsistently. The agent-specific layer enables accuracy and latency optimization without sacrificing reusability.

Ref. B9AE-J

“If you look at APIs, this pattern already exists. You’ve got your low-level system APIs, your workflow APIs in the backend, and then the APIs that the mobile application talks to. Three different sets. Tools should work the same way.”

Ref. 2597-K
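
The three layers described above can be sketched as three functions that build on each other. Everything here is an assumption for illustration: the in-memory `DOCUMENTS` and `CUSTOMER_PRODUCTS` stores, the product names, and the function signatures. Only the idea of a "get customer brochure" tool wrapping retrieval, product matching, and version selection comes from the highlights.

```python
# Hypothetical data standing in for real document and customer services.
DOCUMENTS = {
    ("brochure", "router-x", 1): "Router X brochure v1",
    ("brochure", "router-x", 2): "Router X brochure v2",
    ("brochure", "modem-y", 1): "Modem Y brochure v1",
}
CUSTOMER_PRODUCTS = {"cust-1": "router-x", "cust-2": "modem-y"}


def fetch_document(kind: str, product: str, version: int) -> str:
    """Foundation layer: shared low-level retrieval."""
    return DOCUMENTS[(kind, product, version)]


def latest_document(kind: str, product: str) -> str:
    """Workflow layer: a reusable pattern (pick the newest version)."""
    versions = [v for (k, p, v) in DOCUMENTS if k == kind and p == product]
    return fetch_document(kind, product, max(versions))


def get_customer_brochure(customer_id: str) -> str:
    """Agent-specific layer: one parameter, named after the intention."""
    product = CUSTOMER_PRODUCTS[customer_id]  # product matching
    return latest_document("brochure", product)  # version selection


if __name__ == "__main__":
    print(get_customer_brochure("cust-1"))
```

The agent only ever sees `get_customer_brochure`; the lower layers stay available for other teams to reuse.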

The External Reader Test: The most reliable method for validating tool design emerged organically from iFood’s iteration cycles: show the tool definition to someone unfamiliar with the system and observe their comprehension. If they cannot immediately understand what the tool does, when to use it, and what inputs are required, the agent will struggle with the same ambiguities.

Ref. 731A-L

PRACTITIONER INSIGHT: The external reader test forces you to make implicit knowledge explicit. When your colleague asks ‘but how does the agent know which folder to look in?’, you’ve identified a hidden assumption. Either embed that knowledge in the tool implementation or accept that your agent will face the same confusion.

Ref. 92A5-M

Kiara describes the discipline: ‘The idea is to make them as clear as possible. Would the name of the tool even make sense?’ This question alone eliminates many common failures. Tools named after internal systems (‘process_via_legacy_bridge’), implementation details (‘invoke_rpc_handler’), or technical abstractions (‘execute_data_transform’) give agents no semantic signal about intention. Tools named after goals (‘find_similar_products’, ‘schedule_delivery’, ‘suggest_alternatives’) are self-documenting.

Ref. 3A92-N
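
The naming question above can be partly automated as a first pass before a human reader looks at the definitions. This is a minimal sketch: the word list in `IMPLEMENTATION_TERMS` is an assumption chosen to match the examples in the highlight, and a real check would need a list tuned to the team's own systems.

```python
import re

# Assumed vocabulary that signals infrastructure rather than intention.
IMPLEMENTATION_TERMS = {
    "rpc", "handler", "legacy", "bridge",
    "invoke", "execute", "process", "transform",
}


def implementation_terms_in(name: str) -> list[str]:
    """Return the words in a tool name that look like implementation detail."""
    words = re.split(r"[_\W]+", name.lower())
    return [word for word in words if word in IMPLEMENTATION_TERMS]


if __name__ == "__main__":
    for tool_name in ("invoke_rpc_handler", "find_similar_products"):
        print(tool_name, implementation_terms_in(tool_name))
```

A clean result does not prove a name is good; it only filters out the obvious cases so the external reader can focus on whether the name matches a goal.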

Tool Output as Instruction Vector

Ref. F657-O

The traditional model treats tool output as pure information: the agent calls a tool, receives data, and decides what to do next based on its system prompt and general capabilities. But this forces all guidance into the system prompt, where it bloats context regardless of relevance.

Ref. 6B61-P

Kiara frames this as ‘need to know basis’ context management. Instructions appear when they’re actionable, not when they might someday be relevant. This reduces prompt size, improves focus, and enables dynamic behavior that would be impossible with static system prompts.

Ref. 5B45-Q
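
A tool that carries its own guidance might look like the sketch below. The tool name `check_order_status`, the order data, and the instruction wording are all hypothetical; the point is only that the instruction travels with the result and appears when it is actionable, instead of sitting in the system prompt.

```python
def check_order_status(order_id: str, orders: dict) -> dict:
    """Hypothetical tool: returns data plus an instruction relevant right now."""
    order = orders.get(order_id)
    if order is None:
        return {
            "data": None,
            "instruction": "Order not found. Ask the user to confirm the order number.",
        }
    if order["status"] == "delayed":
        return {
            "data": order,
            "instruction": "Apologize for the delay and share the new estimate.",
        }
    # Nothing special to do: no instruction is added to the context.
    return {"data": order, "instruction": None}


def render_for_agent(result: dict) -> str:
    """Format the tool result as the text the agent reads next."""
    lines = [f"data: {result['data']}"]
    if result["instruction"]:
        lines.append(f"next step: {result['instruction']}")
    return "\n".join(lines)


ORDERS = {
    "A1": {"status": "delayed", "eta_minutes": 25},
    "B2": {"status": "on_time", "eta_minutes": 10},
}

if __name__ == "__main__":
    print(render_for_agent(check_order_status("A1", ORDERS)))
```

In the on-time case the agent sees only data, so the delay-handling guidance costs no tokens unless a delay actually occurs.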

Limiting Agent Choices

Ref. 7A7C-R

‘In a perfect world, you give it one tool. In a perfect world, the agent doesn’t have to think at all and you don’t even need an LLM. Because it’s faster, it’s cheaper, it’s deterministic.’ This is the starting intuition: determinism is desirable for everything that can be made deterministic. The value of the LLM lies precisely in handling what cannot be predetermined.

Ref. AC69-S
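
The intuition above, make deterministic whatever can be deterministic, can be sketched as a router that only falls back to a model when no fixed rule applies. The route table, the tool names, and the `llm_choose` callable are assumptions for illustration; a real system would match intents far less literally.

```python
from typing import Callable

# Hypothetical requests that never need model reasoning.
DETERMINISTIC_ROUTES = {
    "reorder last order": "reorder_last_order",
    "track my order": "track_order",
}


def choose_tool(message: str, llm_choose: Callable[[str], str]) -> tuple[str, bool]:
    """Return (tool_name, used_llm). The model is called only as a fallback."""
    key = message.strip().lower()
    if key in DETERMINISTIC_ROUTES:
        return DETERMINISTIC_ROUTES[key], False
    return llm_choose(message), True


if __name__ == "__main__":
    # Stand-in for a real model call.
    def stub_llm(message: str) -> str:
        return "find_similar_products"

    print(choose_tool("Track my order", stub_llm))
    print(choose_tool("something like yesterday's pizza", stub_llm))
```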

‘We ourselves have achieved incredible things in the lab that we haven’t yet announced,’ Alex notes, ‘where we’re making it possible to turn that dial up on the tool level.’ The implication is significant: tool selection, today a major source of agent errors, may soon become reliable enough to support much larger tool inventories without proportional accuracy loss.

Ref. E0BE-T

This asymmetry drives architectural choices. The agent uses a deliberately simple ReAct-style architecture, with no fancy multi-agent orchestration, because complexity adds latency. Each handoff between agents introduces delay. When users are hungry and expecting instant response, those delays are unacceptable.

Ref. 1A26-U
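
A single-agent loop in the ReAct style (decide, act, observe, repeat) can be sketched in a few lines. The `policy` callable stands in for the model and the `search_dishes` tool is invented; this is a sketch of the general pattern, not iFood's implementation.

```python
from typing import Callable


def react_loop(
    goal: str,
    policy: Callable[[str, list], tuple],
    tools: dict,
    max_steps: int = 5,
) -> str:
    """One agent, one loop: no handoffs between agents."""
    observations = []
    for _ in range(max_steps):
        action, argument = policy(goal, observations)  # reason
        if action == "finish":
            return argument
        observations.append(tools[action](argument))   # act, then observe
    return "step limit reached"


def scripted_policy(goal: str, observations: list) -> tuple:
    """Stand-in for a model: search once, then answer."""
    if not observations:
        return ("search_dishes", goal)
    return ("finish", f"Suggested: {observations[-1]}")


TOOLS = {"search_dishes": lambda query: "grilled chicken"}

if __name__ == "__main__":
    print(react_loop("something healthy", scripted_policy, TOOLS))
```

Every iteration is one model call plus at most one tool call, which keeps the latency budget easy to reason about.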

Voice and UI Elements in Tools

Ref. 6A92-V

Brazil’s 160 million WhatsApp users present a unique design surface. Users already order food via WhatsApp—sending voice notes directly to restaurants. The conversational pattern is established. The mental model of ‘send message, receive food’ is deeply ingrained.

Ref. F94F-W

TRUST THROUGH UNDERSTANDING: Food is personal. Dietary restrictions are health-critical. Taste preferences are emotionally loaded. The agent must demonstrate felt understanding, not just technical accuracy. When a user says ‘something healthy,’ the agent that knows they’re a meat lover suggests grilled chicken, not vegetable stir-fry. That’s the difference between generic recommendation and genuine assistance.

Ref. 933F-X

Error-First Methodology

Ref. 7D07-Y

The error taxonomy becomes the foundation for everything that follows. LLM-as-judge prompts are written to detect specific error types, not general quality levels. Test sets are constructed to exercise each error category. Evaluation scores become actionable because they point to specific failure modes with specific remediation paths.

Ref. FA19-Z
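
An error taxonomy that drives judge prompts and scores could be organized as in the sketch below. The four category names follow the labels that appear elsewhere in these highlights; the one-line descriptions, the prompt wording, and the shape of the verdict data are assumptions.

```python
from enum import Enum


class ErrorType(Enum):
    # Descriptions are illustrative guesses, not the team's definitions.
    MISSING_USER_CONTEXT = "the reply ignores preferences or restrictions the user already stated"
    IRRELEVANT_RECOMMENDATION = "the suggested items do not match what the user asked for"
    UI_ELEMENT_FAILURE = "a button, list, or other UI element is malformed or missing"
    PERSONA_DRIFT = "the tone or persona departs from the intended assistant voice"


def judge_prompt(error: ErrorType, transcript: str) -> str:
    """Build a judge prompt that looks for one error type, not overall quality."""
    return (
        f"Check the conversation for one specific error: {error.value}.\n"
        "Answer YES if the error is present, otherwise NO.\n\n"
        f"Conversation:\n{transcript}"
    )


def failure_rates(verdicts: list) -> dict:
    """Share of test cases in which each error type was flagged.

    Each verdict maps ErrorType to True or False; expects a non-empty list.
    """
    return {
        error: sum(bool(v.get(error, False)) for v in verdicts) / len(verdicts)
        for error in ErrorType
    }


if __name__ == "__main__":
    print(judge_prompt(ErrorType.PERSONA_DRIFT, "user: hi\nagent: YO WHAT'S UP"))
```

Because each rate is tied to one category, a bad score points at a specific failure mode and its remediation rather than at a general quality number.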

[Figure: error taxonomy diagram. Labels recovered from the figure, in extraction order: Missing user context, Judge evaluation, Irrelevant recommendations, UI element failures, Persona drift, Outcome analysis, Synthetic agents, Style judges, Context injection, Filtering logic, Element validation, Prompt reinforcement.]

Ref. 9AEA-A

Synthetic User Agents

Ref. F58A-B

Smart Tool Evaluation

Ref. 77B8-C

Dynamic System Prompts

Ref. C2B4-D

Preemptive Context Loading

Ref. 1F71-E

Tool Ownership and Versioning

Ref. FD1F-F

Cross-Functional Tool Teams

Ref. 8CA6-G

Part VIII: Latency as Architecture

Ref. 56A3-H

The Multi-Agent Convergence

Ref. DECD-I

iFood’s parent organization targets 30,000 agents. This number transforms every aspect of agent development. Individual expertise cannot scale to review every tool definition, every prompt, every evaluation result. Systematic approaches become mandatory.

Ref. 6FEF-J

The question shifts from ‘how do we build good agents?’ to ‘how do we make it easy for any team to build good agents?’ This is a platform question, not an implementation question. The answer involves templates that encode best practices, standards that prevent common mistakes, automated checks that catch problems before deployment, and platform capabilities that guide teams toward success.

Ref. 1BDB-K
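
One of the "automated checks that catch problems before deployment" could be a small validator that runs on every tool definition. The required fields and both thresholds below are assumptions picked for illustration; a platform team would set its own.

```python
MAX_PARAMS = 4              # assumed limit on parameters per tool
MIN_DESCRIPTION_WORDS = 8   # assumed minimum for a usable description


def check_tool_definition(tool: dict) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for field in ("name", "description", "parameters"):
        if field not in tool:
            problems.append(f"missing field: {field}")
    description = tool.get("description", "")
    if len(description.split()) < MIN_DESCRIPTION_WORDS:
        problems.append("description too short to say when to use the tool")
    if len(tool.get("parameters", {})) > MAX_PARAMS:
        problems.append(f"more than {MAX_PARAMS} parameters")
    return problems


if __name__ == "__main__":
    candidate = {"name": "invoke_rpc_handler", "description": "Calls handler."}
    for problem in check_tool_definition(candidate):
        print(problem)
```

Checks like this do not replace the external reader test, but they let a platform enforce a baseline across many teams without manual review of every definition.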

The Intention-Contract Pattern

Ref. 9F1A-L

The External Reader Test

Ref. 43A0-M

The Tool Layering Architecture

Ref. 8FF6-N

Dynamic Prompt Strategy

Ref. 3B15-O

Tool-Output-as-Instruction

Ref. A959-P