Generative AI excels when acting on behalf of a user, but it breaks down when an agent must act on behalf of a brand or organisation. That’s where controllable AI comes in. To prove the point, we ran a real-world evaluation pitting Apollo-1 against today’s flagship LLM agents: 120 multi-turn shopping conversations against Amazon’s live product catalog, covering everything from product discovery and cart operations to policy guardrails, with no hand-curated prompts and no retries.
Apollo-1 is built to operate on behalf of an entity, not just as an assistant to a user. Its hybrid core fuses the fluency of generative AI with the reliability of deterministic rule engines, giving companies and organisations the control and auditability that purely stochastic LLM co-pilots cannot guarantee.
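Apollo-1’s internals are not public, so the sketch below is only a generic illustration of that hybrid pattern: a generative layer proposes an action, and a deterministic rule layer either approves it or substitutes a compliant fallback. Every name in it (`ProposedAction`, `violates_policy`, `guarded_step`) is hypothetical and does not describe Apollo-1’s actual components.

```python
from dataclasses import dataclass

# Hypothetical illustration only: a generative layer proposes an action,
# a deterministic rule layer vets it before anything reaches the user.
# None of these names describe Apollo-1's real components.

@dataclass
class ProposedAction:
    kind: str          # e.g. "recommend", "add_to_cart", "answer"
    item_price: float  # price of the product the action refers to
    message: str       # natural-language text the agent wants to send

def violates_policy(action: ProposedAction, budget: float) -> bool:
    """Deterministic, auditable check: plain code, no sampling."""
    return action.kind == "recommend" and action.item_price > budget

def guarded_step(proposal: ProposedAction, budget: float) -> ProposedAction:
    # The generative model supplies fluency; the rule engine holds veto power.
    if violates_policy(proposal, budget):
        return ProposedAction(
            kind="answer",
            item_price=0.0,
            message=f"I couldn't find an in-budget option under ${budget:.0f}.",
        )
    return proposal
```

The useful property of this pattern is that the vetting step is ordinary code: it never samples, so its decisions are reproducible and auditable in a way a purely generated response is not.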
When pitted against Amazon’s own agent, Rufus, on the exact same live catalog and inventory, that design difference translated into a decisive win: Apollo-1 passed 109 of the 120 scenarios end to end (90.8%), while Rufus passed 20 (16.7%).
This result was achieved using a strict, one-shot scoring methodology. Unlike most leaderboards that cherry-pick the best response from batches of generations, we score the first and only response. It’s a simple, reproducible, and brutally honest snapshot of how an agent performs when a real customer is waiting.
Our evaluation is guided by the same principles as the tough τ-Bench benchmark, and is designed to be unforgiving. Each agent gets only one attempt to complete a scenario, and success is binary—every step must be perfect, or the entire scenario is a failure. The final score is therefore a reflection of an agent’s real-world performance.
All scores are one-shot and reflect the agent’s first responses only, no retries.
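For clarity, here is a minimal sketch of that scoring rule. It is our paraphrase of the methodology described above, not the actual harness; the function and parameter names are illustrative.

```python
from typing import Callable, List

# Minimal sketch of the one-shot scoring rule described above: each check
# runs once against the agent's first (and only) response for that step,
# and the scenario reward is 1 only if every step passes.

def score_scenario(first_responses: List[str],
                   step_checks: List[Callable[[str], bool]]) -> tuple[int, int]:
    assert len(first_responses) == len(step_checks), "one response per step, no retries"
    step_scores = [1 if check(response) else 0
                   for response, check in zip(first_responses, step_checks)]
    scenario_score = sum(step_scores)      # e.g. R-1 + R-2 + R-3, max 3
    reward = 1 if all(step_scores) else 0  # all-or-nothing: one miss fails the run
    return scenario_score, reward
```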
Built to work on behalf of entities, not users. From day one we optimised Apollo-1 for grounded action on behalf of a brand, business, or organisation, not for idle conversation. The architecture layers deterministic control systems over the generative core, and that design shows up in every result above.
The net effect: an agent that can confidently operate on behalf of a retailer, an insurance company, or a government office, executing the task, honouring every policy, and maintaining conversational fluency throughout.
Recent work from Salesforce AI Research echoes the need for this shift: on the CRMArena-Pro benchmark, leading LLM agents top out at ~58% single-turn success and collapse to ~35% in multi-turn chats, prompting the authors to highlight a “significant gap between current LLM capabilities and real-world enterprise demands”.[1]
Scoring criteria — Scenario 1 “55-inch 4K TV Under $400”
| Step | Tester prompt | Check | Pass criterion |
|---|---|---|---|
| R-1 | “Find a 55-inch 4K TV in stock for under $400.” | Returns at least one concrete model (name + price). | Model is 55-inch, 4K, in stock, price ≤ $400, price shown. |
| R-2 | “What’s the refresh rate of this TV?” | States refresh rate for the same model. | Refresh-rate value is correct and explicitly mentioned. |
| R-3 | “How much more is the 65-inch version of the same model?” | Gives price difference or says “not available.” | Either the correct USD delta or “65-inch not available” for that model. |
Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3).
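To make the strictness concrete, the snippet below encodes R-1’s pass criterion as a predicate. The structured fields (`size_inches`, `price_shown`, and so on) are assumptions for illustration; the real grader judges the agent’s free-text response.

```python
from dataclasses import dataclass

# Encodes the R-1 pass criterion from the table above. The structured fields
# are an assumption for illustration; the real grader judges the agent's
# free-text response.

@dataclass
class TVRecommendation:
    size_inches: int
    resolution: str    # e.g. "4K"
    in_stock: bool
    price_usd: float
    price_shown: bool  # the reply must state the price explicitly

def r1_passes(rec: TVRecommendation, budget: float = 400.0) -> bool:
    # Every clause must hold; a single miss fails the step, and with it the scenario.
    return (rec.size_inches == 55
            and rec.resolution.upper() == "4K"
            and rec.in_stock
            and rec.price_usd <= budget
            and rec.price_shown)
```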
| Model | TV found? | Refresh rate right? | 65-inch price/diff right? | Score | Reward |
|---|---|---|---|---|---|
| Apollo-1 | ✅ TCL 55S455, $349 | ✅ 60 Hz | ✅ “65-inch not available” | 3 / 3 | 1 |
| Rufus | ✅ Hisense 55H6G, $299 | ❌ (no refresh rate) | ❌ (no price diff) | 1 / 3 | 0 |
Apollo-1 delivers exact product details, specs, and availability in a single pass; Rufus surfaces a model but misses the follow-up spec and variant check.
The 120 live shopping runs fall into five buckets. Each bucket probes a single competency; a flawless agent would post 100% in every row.
| Group | Technical capability under test |
|---|---|
| Specs & Comparisons | Quick product details, direct feature/spec comparisons, discount verification, factual accuracy. |
| Variants & Stock | Real-time availability for size, colour, and product variants, and clear “not available” statements. |
| Cart & Inventory | Precise cart interactions: add/remove items, multi-product management, live inventory checks. |
| Advanced Filtering | Complex multi-condition filtering (price, specs, warranty), arithmetic reasoning, real-time recomputation of totals. |
| Gifts & Compliance | Occasion-based product recommendations, suitability for pets/children, strict policy adherence. |
| Group | Scenario IDs | What it measures | Apollo-1 | Rufus |
|---|---|---|---|---|
| Specs & Comparisons | 1-3 · 7-9 · 22-24 · 28-30 · 31-33 · 37-39 · 61-63 · 64-66 · 112-114 · 118-120 | Quick product details, direct feature/spec comparisons, discount verification, factual accuracy | 24/30 → 80% | 4/30 → 13% |
| Variants & Stock | 4-6 · 10-12 · 34-36 · 52-54 · 58-60 · 70-72 · 73-75 · 76-78 · 79-81 · 82-84 | Real-time availability for size, colour, and product variants; alternatives when out of stock | 29/30 → 97% | 1/30 → 3% |
| Cart & Inventory | 13-15 · 19-21 · 40-42 · 49-51 · 67-69 · 106-108 | Precise cart interactions: add/remove items, multi-product management, live inventory checks | 18/18 → 100% | 0/18 → 0% |
| Advanced Filtering | 25-27 · 43-45 · 46-48 · 88-90 · 91-93 · 94-96 · 97-99 · 100-102 · 115-117 | Complex multi-condition filtering (price, specs, warranty), arithmetic reasoning, scenario-driven logic | 20/24 → 83% | 3/24 → 13% |
| Gifts & Compliance | 16-18 · 55-57 · 85-87 · 103-105 · 109-111 | Occasion-based product recommendations, suitability for pets/children, strict policy adherence | 18/18 → 100% | 12/18 → 67% |
Across 120 one-shot retail scenarios, Apollo-1 delivers a full-conversation pass on 109 runs (90.8%); Amazon Rufus manages 20 runs (16.7%). The margin widens as tasks demand live arithmetic, variant precision, or multi-constraint reasoning, the hallmarks of production-grade conversational commerce, where deterministic, auditable behaviour is non-negotiable and where controllable, neuro-symbolic agents prevail.
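The group and overall figures are straight ratios of those binary scenario rewards. A minimal aggregation sketch, using the Apollo-1 counts from the table above:

```python
# Group and overall pass rates are straight ratios of binary scenario rewards.
# The Apollo-1 counts below are copied from the results table above.
group_results = {                      # group name: (passes, scenarios)
    "Specs & Comparisons": (24, 30),
    "Variants & Stock":    (29, 30),
    "Cart & Inventory":    (18, 18),
    "Advanced Filtering":  (20, 24),
    "Gifts & Compliance":  (18, 18),
}

for name, (passes, total) in group_results.items():
    print(f"{name}: {passes}/{total} = {passes / total:.0%}")

overall_passes = sum(p for p, _ in group_results.values())  # 109
overall_total = sum(t for _, t in group_results.values())   # 120
print(f"Overall: {overall_passes}/{overall_total} = {overall_passes / overall_total:.1%}")  # 90.8%
```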
Find a 55-inch 4K TV in stock for under $400.
What’s the refresh rate of this TV?
How much more is the 65-inch version of the same model?
Result: Fail (1/3)
Find a 55-inch 4K TV in stock for under $500.
What’s the refresh rate of this TV?
How much more is the 65-inch version of the same model?
Result: Fail (1/3)
Find a 55-inch 4K TV in stock for under $700.
What’s the refresh rate of this TV?
How much more is the 65-inch version of the same model?
Result: Fail (0/3)
I need a laptop. Budget is $500.
Does it come in different colors?
Result: Fail (1/2)
I need a laptop. Budget is $700.
Does it come in different colors?
Result: Fail (1/2)
I need a laptop. Budget is $1000.
Does it come in different colors?
Result: Fail (1/2)
I need wireless headphones for classical music, budget up to $400.
What’s the battery life of these headphones?
What do real customers say about this model?
Result: Fail (2/3)
I need wireless headphones for classical music, budget up to $300.
What’s the battery life of these headphones?
What do real customers say about this model?
Result: Fail (1/3)
Result: Fail (1/3)
I’m looking for an ergonomic office chair under $200.
Do you have this chair in black?
Result: Fail (1/2)
I’m looking for an ergonomic office chair under $300.
Do you have this chair in black?
Result: Fail (1/2)
I’m looking for an ergonomic office chair under $500.
Do you have this chair in black?
Result: Fail (1/2)
Result: Pass (3/3)
Result: Pass (3/3)
Result: Pass (3/3)
Result: Pass (2/2)
Result: Pass (2/2)
Result: Pass (2/2)
Result: Pass (3/3)
Result: Pass (3/3)
Result: Pass (3/3)
Result: Pass (2/2)
Result: Fail (1/2)
Result: Pass (2/2)
[1] Huang, K.-H.; Prabhakar, A.; Thorat, O.; Agarwal, D.; Choubey, P. K.; Mao, Y.; Savarese, S.; Xiong, C.; Wu, C.-S. CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions. Salesforce AI Research (2025). arXiv:2505.18878.