News

Controllable AI in Action: Apollo-1 Aces Amazon Scenarios; Rufus Falls Short

91% one-shot success vs 17% in a 120-run head-to-head on Amazon’s own catalog, showing that only controllable, neuro-symbolic agents can reliably operate on behalf of companies.
06/16/25

Introduction

Generative AI excels when acting on behalf of a user, but it breaks down when an agent must act on behalf of a brand or organization. That’s where controllable AI comes in. To prove the point, we ran a set of real-world evaluations that pit Apollo-1 against today’s flagship LLM agents. This evaluation involved 120 multi-turn shopping conversations against Amazon’s live product catalog, covering everything from product discovery and cart operations to policy guardrails, with no hand-curated prompts and no retries.

Apollo-1 is built to operate on behalf of an entity, not just as an assistant to a user. Its hybrid core fuses the fluency of generative AI with the reliability of deterministic rule engines, giving companies and organizations the control and auditability that purely stochastic LLM co-pilots cannot guarantee.

When pitted against Amazon’s own agent, Rufus, on the exact same live catalog and inventory, that design difference translated into a decisive win:

  • Apollo-1: 109 / 120 full-conversation passes (90.8%)
  • Rufus: 20 / 120 full-conversation passes (16.7%)

This result was achieved using a strict, one-shot scoring methodology. Unlike most leaderboards that cherry-pick the best response from batches of generations, we score the first and only response. It’s a simple, reproducible, and brutally honest snapshot of how an agent performs when a real customer is waiting.

Evaluation Method – Real-World, One-Take Scoring

01 One Shot Per Scenario: The agent gets zero retries; whatever it says in each turn is final.
02 Multi-Turn Pass Rule: A scenario passes only if every step is correct; one miss fails the entire scenario.
03 120 Customer Scenarios: Find a 4K TV, compare laptops, check for color variants, add items to a cart, find a gift, and handle policy-driven queries.
04 Live Amazon.com Backend: Both agents hit the same real-time Amazon product catalog, so any performance gap is purely a matter of reasoning and control.
05 Sampling Window: Runs executed the week of June 2nd, 2025; results reflect that snapshot.

Our evaluation is guided by the same principles as the tough τ-Bench benchmark, and is designed to be unforgiving. Each agent gets only one attempt to complete a scenario, and success is binary—every step must be perfect, or the entire scenario is a failure. The final score is therefore a reflection of an agent’s real-world performance.
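
A minimal sketch of this scoring rule, with all names (Step, run_scenario, the respond and check callables) as illustrative assumptions rather than the actual harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    prompt: str                       # tester utterance for this turn
    check: Callable[[str], bool]      # binary per-step correctness judge

def run_scenario(respond: Callable[[str], str], steps: list[Step]) -> int:
    """One shot per turn; reward is 1 only if every step passes."""
    for step in steps:
        reply = respond(step.prompt)  # first and only response, no retries
        if not step.check(reply):
            return 0                  # a single miss fails the scenario
    return 1

# Aggregate pass rate: e.g. 109 passing runs out of 120.
rewards = [1] * 109 + [0] * 11
print(f"pass rate: {sum(rewards) / len(rewards):.1%}")   # -> 90.8%
```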

Results at a Glance

All scores are one-shot and reflect the agent’s first responses only; no retries.


Why Apollo-1 Pulls Ahead

Built to work on behalf of entities, not users. From day one we optimized Apollo-1 for grounded action on behalf of a brand, business, or organization, not for idle conversation. The architecture layers in control systems that show up in every result above.

  • Neuro-symbolic foundation model: a Symbolic Reasoner replaces the transformer as the model’s decision-making core, enabling controllable, context-aware interactions.
  • Deterministic control on demand: operators can inject rule-based logic whenever needed; symbolic policy guardrails guarantee the agent never departs from instructions while still leveraging generative fluency for natural conversation (a pattern sketched below).
  • Real-time Control Panel: operators can inspect reasoning, adjust context schemas, inject rules, and replay live trajectories. The Control Panel also supports continuous fine-tuning at the sub-interaction level instead of retraining the whole stack.
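
As a rough illustration of the deterministic-control bullet above, here is a minimal sketch of how a symbolic guardrail layer can veto or rewrite a generative draft before it reaches the customer. Apollo-1’s real rule interface is not shown in this post; the Rule type, the discount policy, and every name below are assumptions.

```python
from typing import Callable

# A rule inspects the context and the generative draft; it returns an
# override string to enforce policy, or None to let the draft through.
Rule = Callable[[dict, str], str | None]

def no_out_of_policy_discounts(ctx: dict, draft: str) -> str | None:
    # Hypothetical policy: never promise discounts the catalog doesn't list.
    if "discount" in draft.lower() and not ctx.get("discount_listed", False):
        return ("I can't confirm a discount on this item, "
                "but here is the current listed price.")
    return None

def guarded_reply(ctx: dict, draft: str, rules: list[Rule]) -> str:
    """Apply symbolic rules in order; the first matching rule wins."""
    for rule in rules:
        override = rule(ctx, draft)
        if override is not None:
            return override           # deterministic path
    return draft                      # generative path unchanged

print(guarded_reply({"discount_listed": False},
                    "Great news, I can give you a 20% discount!",
                    [no_out_of_policy_discounts]))
```

The key design property is that the rule check is deterministic: given the same context and draft, the guardrail fires the same way every time, which is what makes the behavior auditable.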

The net effect: an agent that can confidently operate on behalf of a retailer, an insurance company, or a government office, executing the task, honoring every policy, and maintaining conversational fluency.

Recent research from Salesforce Research echoes the need for this shift: on the CRMArena-Pro benchmark, leading LLM agents top out at ~58% single-turn success and collapse to ~35% in multi-turn chats, leading the authors to highlight a “significant gap between current LLM capabilities and real-world enterprise demands”.1

Example Scenario

Scoring Criteria – Scenario 1: “55-inch 4K TV Under $400”

Step R-1
  Tester prompt: “Find a 55-inch 4K TV in stock for under $400.”
  Check: Returns at least one concrete model (name + price).
  Pass criterion: Model is 55-inch, 4K, in stock, price ≤ $400, with the price shown.
Step R-2
  Tester prompt: “What’s the refresh rate of this TV?”
  Check: States the refresh rate for the same model.
  Pass criterion: Refresh-rate value is correct and explicitly mentioned.
Step R-3
  Tester prompt: “How much more is the 65-inch version of the same model?”
  Check: Gives the price difference or says “not available.”
  Pass criterion: Either the correct USD delta or “65-inch not available” for that model.

Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3).
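
For concreteness, a minimal sketch of this rubric (the function and variable names are ours, not the harness’s): per-step scores are binary, the scenario score is their sum, and the reward is 1 only on a clean sweep.

```python
def scenario_reward(step_scores: list[int]) -> tuple[int, int]:
    """Binary step scores in, (scenario score, reward) out."""
    score = sum(step_scores)                 # e.g. R-1 + R-2 + R-3, max 3
    reward = int(score == len(step_scores))  # every step must pass
    return score, reward

print(scenario_reward([1, 1, 1]))   # Apollo-1 on Scenario 1 -> (3, 1)
print(scenario_reward([1, 0, 0]))   # Rufus on Scenario 1    -> (1, 0)
```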

Scenario 1 Reward Logs

Apollo-1 · TV found: ✅ TCL 55S455, $349 · Refresh rate right: ✅ 60 Hz · 65-inch price/diff right: ✅ “65-inch not available” · Score: 3/3 · Reward: 1
Rufus · TV found: ✅ Hisense 55H6G, $299 · Refresh rate right: ❌ (no refresh rate given) · 65-inch price/diff right: ❌ (no price diff) · Score: 1/3 · Reward: 0

Apollo-1 delivers exact product details, specs, and availability in a single pass; Rufus surfaces a model but misses the follow-up spec and variant check.

Benchmark Overview & Group Analysis

The 120 live shopping runs fall into five buckets. Each bucket probes a single competency; a flawless agent would post 100% in every row.

[Chart: Rufus vs Apollo-1 pass rates by group]

Group | Scenario IDs | What it measures | Apollo-1 | Rufus
Specs & Comparisons | 1-3 · 7-9 · 22-24 · 28-30 · 31-33 · 37-39 · 61-63 · 64-66 · 112-114 · 118-120 | Quick product details, direct feature/spec comparisons, discount verification, factual accuracy | 24/30 → 80% | 4/30 → 13%
Variants & Stock | 4-6 · 10-12 · 34-36 · 52-54 · 58-60 · 70-72 · 73-75 · 76-78 · 79-81 · 82-84 | Real-time availability for size, color, and product variants; alternatives when out of stock | 29/30 → 97% | 1/30 → 3%
Cart & Inventory | 13-15 · 19-21 · 40-42 · 49-51 · 67-69 · 106-108 | Precise cart interactions: add/remove items, multi-product management, live inventory checks | 18/18 → 100% | 0/18 → 0%
Advanced Filtering | 25-27 · 43-45 · 46-48 · 88-90 · 91-93 · 94-96 · 97-99 · 100-102 · 115-117 | Complex multi-condition filtering (price, specs, warranty), arithmetic reasoning, real-time recomputation of totals | 20/24 → 83% | 3/24 → 13%
Gifts & Compliance | 16-18 · 55-57 · 85-87 · 103-105 · 109-111 | Occasion-based product recommendations, suitability for pets/children, strict policy adherence | 18/18 → 100% | 12/18 → 67%

Apollo-1 Strengths Across the Retail Buckets

  • Context-aware dialogue (Specs & Comparisons)
    In Specs & Comparisons scenarios, Apollo-1 provides quick, accurate product details and asks clarifying follow-up questions (budget, exact specs, discounts) only when necessary, consistently maintaining clarity and context throughout.
  • Real-time variant availability (Variants & Stock)
    Apollo-1 consistently returns precise, real-time information on product availability across sizes, colors, and variants, explicitly stating “not available” when appropriate, achieving 97% accuracy.
  • Deterministic cart control (Cart & Inventory)
    In Cart & Inventory scenarios, Apollo-1 performs actual cart modifications, accurately adding and removing products and instantly reflecting those actions in the live cart contents, avoiding the “pretend add-to-cart” failures seen in Rufus.
  • Advanced multi-condition filtering and arithmetic logic (Advanced Filtering)
    Apollo-1 handles complex filtering scenarios (price, warranty, specifications), recalculates totals dynamically after each user adjustment, and maintains precise arithmetic accuracy, achieving an 83% success rate (see the sketch after this list).
  • Compliance and targeted recommendations (Gifts & Compliance)
    Apollo-1 effectively recommends products based on specific use-cases or suitability criteria (e.g., occasion, pets/children) while strictly adhering to brand policies and gracefully handling constraints, reaching perfect compliance.
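
The Advanced Filtering behavior above, i.e. hard multi-condition filters plus live recomputation of totals, is easy to picture in code. The sketch below is illustrative only: the product fields, values, and function names are assumptions, not Apollo-1 internals.

```python
# Toy catalog; field names are assumptions for illustration.
products = [
    {"name": "TV A", "price": 349.0, "warranty_yrs": 2, "size_in": 55},
    {"name": "TV B", "price": 499.0, "warranty_yrs": 1, "size_in": 55},
    {"name": "TV C", "price": 389.0, "warranty_yrs": 2, "size_in": 65},
]

def filter_products(items, max_price, min_warranty, size_in):
    """Every hard constraint must hold; no near-misses slip through."""
    return [p for p in items
            if p["price"] <= max_price
            and p["warranty_yrs"] >= min_warranty
            and p["size_in"] == size_in]

cart = filter_products(products, max_price=400, min_warranty=2, size_in=55)
print([p["name"] for p in cart])                        # -> ['TV A']
print(f"total: ${sum(p['price'] for p in cart):.2f}")   # recomputed after each change
```

Treating each filter as a hard predicate rather than a soft preference is what prevents the “close enough” answers that sink purely generative agents on multi-constraint tasks.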

Areas We’re Still Tuning

  • Dynamic discount detection – sale-price vs. list-price math (Scenarios 115-120) remains brittle across both models.
  • Long-tail attributes – Edge specs such as exotic ink-tank yields or obscure warranty tiers occasionally slip past the retrieval layer.

Rufus’ Persistent Gaps

  • Skips variant checks (colour, size) or answers generically instead of “not available.”
  • Describes—but rarely executes—cart operations.
  • Miscomputes or omits arithmetic steps (price deltas, CADR values).
  • Zero full passes in Multi-Constraint tasks; struggles whenever more than one hard filter is active.

Bottom Line

Across 120 one-shot retail scenarios, Apollo-1 delivers a full-conversation pass on 109 runs (90.8%); Amazon’s Rufus manages 20 (16.7%). The margin widens as tasks demand live arithmetic, variant precision, or multi-constraint reasoning: the hallmarks of production-grade conversational agents, where deterministic, auditable behavior is non-negotiable and controllable, neuro-symbolic agents prevail.

Appendix A: Reward logs

View Full Scenario List here 

Per-step scores (1 = pass, 0 = fail) for the twelve runs sampled below; a scenario passes only on a clean sweep of its steps.

Scenario 1
  R1: “Find a 55-inch 4K TV in stock for under $400.”
  R2: “What’s the refresh rate of this TV?”
  R3: “How much more is the 65-inch version of the same model?”
  Rufus: R1 1 · R2 0 · R3 0 → Fail (1/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 2
  R1: “Find a 55-inch 4K TV in stock for under $500.”
  R2: “What’s the refresh rate of this TV?”
  R3: “How much more is the 65-inch version of the same model?”
  Rufus: R1 1 · R2 0 · R3 0 → Fail (1/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 3
  R1: “Find a 55-inch 4K TV in stock for under $700.”
  R2: “What’s the refresh rate of this TV?”
  R3: “How much more is the 65-inch version of the same model?”
  Rufus: R1 0 · R2 0 · R3 0 → Fail (0/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 4
  R1: “I need a laptop. Budget is $500.”
  R2: “Does it come in different colors?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)

Scenario 5
  R1: “I need a laptop. Budget is $700.”
  R2: “Does it come in different colors?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)

Scenario 6
  R1: “I need a laptop. Budget is $1000.”
  R2: “Does it come in different colors?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)

Scenario 7
  R1: “I need wireless headphones for classical music, budget up to $400.”
  R2: “What’s the battery life of these headphones?”
  R3: “What do real customers say about this model?”
  Rufus: R1 1 · R2 1 · R3 0 → Fail (2/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 8
  R1: “I need wireless headphones for classical music, budget up to $300.”
  R2: “What’s the battery life of these headphones?”
  R3: “What do real customers say about this model?”
  Rufus: R1 1 · R2 0 · R3 0 → Fail (1/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 9 (prompts omitted in the published log)
  Rufus: R1 1 · R2 0 · R3 0 → Fail (1/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 10
  R1: “I’m looking for an ergonomic office chair under $200.”
  R2: “Do you have this chair in black?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)

Scenario 11
  R1: “I’m looking for an ergonomic office chair under $300.”
  R2: “Do you have this chair in black?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 0 → Fail (1/2)

Scenario 12
  R1: “I’m looking for an ergonomic office chair under $500.”
  R2: “Do you have this chair in black?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)
Appendix B: Trajectories

View the full trajectories here.

References

  1 Huang, K-H.; Prabhakar, A.; Thorat, O.; Agarwal, D.; Choubey, P.K.; Mao, Y.; Savarese, S.; Xiong, C.; Wu, C-S. CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions. Salesforce AI Research (2025). arXiv:2505.18878.
