News

Neuro-Symbolic AI in Action: Apollo-1 Aces Amazon Scenarios; Rufus Falls Short

91% one-shot success vs 17% in a 120-run head-to-head on Amazon’s own catalog, showing that only controllable, neuro-symbolic agents can reliably operate on behalf of companies.
07/12/2025

1) Summary

Generative AI excels when acting on behalf of a user, but it breaks down when an agent must act on behalf of a brand or organization. That’s where Neuro-Symbolic AI comes in. To prove the point, we ran a real-world evaluation pitting Apollo-1 against today’s flagship LLM agents: 120 multi-turn shopping conversations against Amazon’s live product catalog, covering everything from product discovery and cart operations to policy guardrails, with no hand-curated prompts and no retries.

When pitted against Amazon’s own agent, Rufus, on the exact same live catalog and inventory, Apollo-1 scored a decisive win:

Apollo-1: 109 / 120 full-conversation passes (90.8%)

Rufus: 20 / 120 full-conversation passes (16.7%)

This result was achieved using a strict, one-shot scoring methodology. Unlike most leaderboards that cherry-pick the best response from batches of generations, we score the first and only response. It’s a simple, reproducible, and brutally honest snapshot of how an agent performs when a real customer is waiting.

2) Introduction

ChatGPT showed the world that machines can talk. But when a machine must also reliably act—book flights, move money, enforce policy—fluent conversation isn’t enough. Banks can’t audit decisions, airlines can’t guarantee bookings, retailers can’t trust inventory calls. Generative AI remains an unreliable black box in scenarios where conversational agents must act on behalf of entities. Yet these scenarios represent a substantial and economically critical portion of all potential AI applications.

For AI to handle these critical interactions, we must transcend purely generative models and embrace a new architecture: Neuro-Symbolic AI. Apollo-1, our neuro-symbolic foundation model for conversational agents, marks the beginning of this transformation. In recent evaluation tests, such as the one detailed below, Apollo-1 consistently outperformed state-of-the-art generative models by wide margins, precisely on tasks that require conversational fluency combined with dependable, transparent action.

 

3) The Limitations of Generative-Only AI

The fundamental shortcomings of generative AI have become increasingly visible (and costly):

  • Opaque Reasoning: Generative models act as black boxes, leaving no clear explanation of their decisions. This is unacceptable when auditability and accountability matter.
  • Volatile Outputs: Even minor changes in input can drastically alter responses—a hazard in banking, healthcare, customer service, and beyond.
  • Policy Drift: Generative AI regularly ignores or misinterprets critical instructions, making it unsuitable for regulated or high-stakes scenarios.
  • Fragile Tool Calls: API calls and complex interactions often fail, especially in multi-step tasks such as bookings and transactions.
  • Costly Retraining: Corrections require costly, time-consuming retraining on massive datasets.

Generative AI alone simply cannot fulfill the promises the world expects from advanced artificial intelligence.

4) Neuro-Symbolic AI: Bridging the Gap

For decades, AI researchers debated two distinct paths: symbolic AI, emphasizing rules, logic, and explicit reasoning; and neural networks, skilled at pattern recognition and statistical learning. Both have strengths, both have critical limitations. Purely symbolic AI struggles with natural, conversational interaction. Purely neural, generative AI falters when trust, reliability, and consistency are non-negotiable. The long-standing goal was to combine the two—leveraging each to offset the other’s weaknesses. With Apollo-1, that vision is now reality: conversational agents that converse fluently and act reliably.

Neuro-Symbolic AI bridges the gap between Generative AI’s linguistic capabilities and Symbolic AI’s structured reasoning, unlocking actionable, reliable, transparent, and steerable AI interactions. This second wave moves beyond conversation to dependable execution, enabling conversational agents that work for entities, not just end-users. 

5) Results at a Glance

All scores are one-shot and reflect each agent’s first response only; no retries.

[Figure: one-shot pass rates, Apollo-1 vs Rufus]

6) Evaluation Method – Real-World, One-Take Scoring

Rule | What it enforces
01 One Shot Per Scenario | The agent gets zero retries; whatever it says in each turn is final.
02 Multi-Turn Pass Rule | A scenario passes only if every step is correct; one miss fails the whole scenario.
03 120 Customer Scenarios | Find a 4K TV, compare laptops, check for color variants, add items to a cart, find a gift, and handle policy-driven queries.
04 Live Amazon.com Backend | Both agents hit the same real-time Amazon product catalog, so any performance gap is purely a matter of reasoning and control.
05 Sampling Window | Runs executed the week of June 2nd, 2025; results reflect that snapshot.

Our evaluation is guided by the same principles as the tough τ-Bench benchmark, and is designed to be unforgiving. Each agent gets only one attempt to complete a scenario, and success is binary—every step must be perfect, or the entire scenario is a failure. The final score is therefore a reflection of an agent’s real-world performance.
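For concreteness, here is a minimal sketch of how this all-or-nothing rule turns per-step judgments into a scenario reward. The function and variable names are ours, not part of a published harness, and the step verdicts would come from the graders applying the criteria described in the next section.

```python
def score_scenario(step_verdicts: list[bool]) -> dict:
    """One-shot, multi-turn pass rule: a scenario earns reward 1 only if every step passes."""
    step_score = sum(step_verdicts)        # e.g. R-1 + R-2 + R-3
    reward = int(all(step_verdicts))       # binary: no partial credit at scenario level
    return {"steps": f"{step_score}/{len(step_verdicts)}", "reward": reward}

# One missed step fails the whole scenario; there are no retries.
print(score_scenario([True, True, True]))    # {'steps': '3/3', 'reward': 1}
print(score_scenario([True, False, True]))   # {'steps': '2/3', 'reward': 0}
```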

7) Example scenario

Scoring criteria — Scenario 1 “55-inch 4K TV Under $400”

Step | Tester prompt | Check | Pass criterion
R-1 | “Find a 55-inch 4K TV in stock for under $400.” | Returns at least one concrete model (name + price). | Model is 55-inch, 4K, in stock, price ≤ $400, price shown.
R-2 | “What’s the refresh rate of this TV?” | States the refresh rate for the same model. | Refresh-rate value is correct and explicitly mentioned.
R-3 | “How much more is the 65-inch version of the same model?” | Gives the price difference or says “not available.” | Either the correct USD delta or “65-inch not available” for that model.

Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3).
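To make the R-1 criterion concrete, here is a minimal sketch of the check a grader could apply to the returned product. The field names are illustrative assumptions, not the evaluation’s actual schema.

```python
def r1_passes(product: dict, budget: float = 400.0) -> bool:
    """R-1 pass criterion: 55-inch, 4K, in stock, price shown and at or under budget."""
    return (
        product.get("screen_size_in") == 55
        and product.get("resolution") == "4K"
        and product.get("in_stock") is True
        and product.get("price") is not None
        and product["price"] <= budget
    )

# A product shaped like the Apollo-1 answer in the reward log below:
print(r1_passes({"screen_size_in": 55, "resolution": "4K",
                 "in_stock": True, "price": 349.00}))  # True
```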

Scenario 1 Reward Logs

Model | TV found? | Refresh rate right? | 65-inch price/diff right? | Score | Reward
Apollo-1 | ✅ TCL 55S455, $349 | ✅ 60 Hz | ✅ “65-inch not available” | 3/3 | 1
Rufus | ✅ Hisense 55H6G, $299 | ❌ (no refresh rate) | ❌ (no price diff) | 1/3 | 0

Apollo-1 delivers exact product details, specs, and availability in a single pass; Rufus surfaces a model but misses the follow-up spec and variant check.

8) Evaluation Overview & Group Analysis

The 120 live shopping runs fall into five buckets. Each bucket probes a single competency; a flawless agent would post 100% in every row.

Group | Scenario IDs | What it measures | Apollo-1 | Rufus
Specs & Comparisons | 1-3 · 7-9 · 22-24 · 28-30 · 31-33 · 37-39 · 61-63 · 64-66 · 112-114 · 118-120 | Quick product details, direct feature/spec comparisons, discount verification, factual accuracy | 24/30 → 80% | 4/30 → 13%
Variants & Stock | 4-6 · 10-12 · 34-36 · 52-54 · 58-60 · 70-72 · 73-75 · 76-78 · 79-81 · 82-84 | Real-time availability for size, colour, and product variants; clear “not available” statements; alternatives when out of stock | 29/30 → 97% | 1/30 → 3%
Cart & Inventory | 13-15 · 19-21 · 40-42 · 49-51 · 67-69 · 106-108 | Precise cart interactions: add/remove items, multi-product management, live inventory checks | 18/18 → 100% | 0/18 → 0%
Advanced Filtering | 25-27 · 43-45 · 46-48 · 88-90 · 91-93 · 94-96 · 97-99 · 100-102 · 115-117 | Complex multi-condition filtering (price, specs, warranty), arithmetic reasoning, real-time recomputation of totals | 20/24 → 83% | 3/24 → 13%
Gifts & Compliance | 16-18 · 55-57 · 85-87 · 103-105 · 109-111 | Occasion-based product recommendations, suitability for pets/children, strict policy adherence | 18/18 → 100% | 12/18 → 67%
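The group rates reduce to simple arithmetic. The short check below recomputes them from the pass counts in the table, shown to one decimal place (the table rounds to whole percentages), and confirms the 109/120 and 20/120 totals.

```python
# Pass counts per group, copied from the table above: (Apollo-1, Rufus, scenarios in group)
groups = {
    "Specs & Comparisons": (24, 4, 30),
    "Variants & Stock":    (29, 1, 30),
    "Cart & Inventory":    (18, 0, 18),
    "Advanced Filtering":  (20, 3, 24),
    "Gifts & Compliance":  (18, 12, 18),
}

for name, (apollo, rufus, n) in groups.items():
    print(f"{name:20s}  Apollo-1 {apollo}/{n} = {apollo / n:.1%}   Rufus {rufus}/{n} = {rufus / n:.1%}")

n_total = sum(n for _, _, n in groups.values())    # 120
a_total = sum(a for a, _, _ in groups.values())    # 109
r_total = sum(r for _, r, _ in groups.values())    # 20
print(f"Overall: Apollo-1 {a_total}/{n_total} = {a_total / n_total:.1%}, "
      f"Rufus {r_total}/{n_total} = {r_total / n_total:.1%}")
```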

9) Evaluation Analysis

Apollo-1 Strengths Across the Retail Buckets:

  • Context-aware dialogue (Specs & Comparisons)
    In Specs & Comparisons scenarios, Apollo-1 provides quick, accurate product details and asks clarifying follow-up questions (budget, exact specs, discounts) only when necessary, consistently maintaining clarity and context throughout.
  • Real-time variant availability (Variants & Stock)
    Apollo-1 consistently returns precise, real-time information on product availability across sizes, colors, and variants, explicitly stating “not available” when appropriate—achieving 97% accuracy.
  • Deterministic cart control (Cart & Inventory)
    In Cart & Inventory scenarios, Apollo-1 performs actual cart modifications, accurately adding and removing products and instantly reflecting these actions in live cart contents, avoiding the “pretend add-to-cart” failures seen in Rufus.
  • Advanced multi-condition filtering and arithmetic logic (Advanced Filtering)
    Apollo-1 handles complex filtering scenarios (price, warranty, specifications), recalculates totals dynamically after each user adjustment, and maintains precise arithmetic accuracy, achieving an 83% success rate; a minimal illustration of this kind of filtering and total recomputation follows this list.
  • Compliance and targeted recommendations (Gifts & Compliance)
    Apollo-1 effectively recommends products based on specific use-cases or suitability criteria (e.g., occasion, pets/children) while strictly adhering to brand policies and gracefully handling constraints, reaching perfect compliance.
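As a minimal sketch of the behaviour the Cart & Inventory and Advanced Filtering buckets test (this illustrates the task, not Apollo-1’s internal implementation; the product records and field names are made up), consider filtering a catalog by several hard constraints and recomputing the cart total after each change:

```python
# Illustrative only: hypothetical product records, not Amazon catalog data.
catalog = [
    {"name": "Laptop A", "price": 449.99, "ram_gb": 8,  "warranty_yrs": 1, "in_stock": True},
    {"name": "Laptop B", "price": 689.00, "ram_gb": 16, "warranty_yrs": 2, "in_stock": True},
    {"name": "Laptop C", "price": 479.00, "ram_gb": 16, "warranty_yrs": 1, "in_stock": False},
]

def matches(p, max_price=None, min_ram_gb=None, min_warranty_yrs=None):
    """Apply every hard constraint; a product must satisfy all of them."""
    return (
        p["in_stock"]
        and (max_price is None or p["price"] <= max_price)
        and (min_ram_gb is None or p["ram_gb"] >= min_ram_gb)
        and (min_warranty_yrs is None or p["warranty_yrs"] >= min_warranty_yrs)
    )

cart = []

def cart_total():
    """Recompute the total from the live cart contents after every change."""
    return round(sum(p["price"] for p in cart), 2)

# Multi-condition filter, then deterministic cart operations.
hits = [p for p in catalog if matches(p, max_price=700, min_ram_gb=16)]
cart.append(hits[0])     # add Laptop B
print(cart_total())      # 689.0
cart.remove(hits[0])     # remove it again
print(cart_total())      # 0
```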

Areas We’re Still Tuning:

  • Dynamic discount detection – Sale-price vs list-price maths (Scenarios 115-120) remains brittle across both models.
  • Long-tail attributes – Edge specs such as exotic ink-tank yields or obscure warranty tiers occasionally slip past the retrieval layer.

Rufus’ Persistent Gaps:

  • Skips variant checks (colour, size) or answers generically instead of “not available.”
  • Describes—but rarely executes—cart operations.
  • Miscomputes or omits arithmetic steps (price deltas, CADR values).
  • Struggles whenever more than one hard filter is active: zero full passes in Cart & Inventory and only 3 of 24 in Advanced Filtering.

10) Why Apollo-1 Pulls Ahead

Apollo-1 is built specifically for conversational agents that act reliably on behalf of organizations, such as retailers, airlines, banks, or government agencies. Rather than relying on generative transformers, Apollo-1 introduces a Neuro-Symbolic Reasoner as its decision-making core, making conversational agents that converse fluently and act reliably a reality. To dive into Apollo-1’s architecture, click here.

[Diagram: Apollo-1’s neuro-symbolic architecture]

Apollo-1’s Key Neuro-Symbolic Advantages:

  • Traceability: Full transparency—each reasoning step is logged, inspectable, and editable.
  • Steerability and Controllability: Operators can steer agent behavior by injecting granular context, instructions, and structured guidelines instantly.
  • Native Tool Use: Reliable interactions with complex APIs and external systems, driven by clearly defined symbolic entities (e.g., intent, constraints, context); a hypothetical sketch of this idea follows the list.
  • Continuous Learning from Human Feedback: Feedback loops enable live fine-tuning of the NLP modules.
  • Conversational Fluency: Maintains the high linguistic performance of top generative models.
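
As a purely hypothetical illustration of the tool-use point above (none of these names, fields, or structures come from Apollo-1’s actual interfaces), a symbolic representation of a user request might look like the sketch below, with the structured fields, rather than free-form text, driving the API call:

```python
from dataclasses import dataclass, field

@dataclass
class SymbolicRequest:
    """Hypothetical structured form of a user turn; not Apollo-1's real schema."""
    intent: str                                       # e.g. "find_product"
    constraints: dict = field(default_factory=dict)   # hard filters the agent must honor
    context: dict = field(default_factory=dict)       # carried across turns

def to_catalog_query(req: SymbolicRequest) -> dict:
    """Map the symbolic entities onto an (assumed) catalog-search API payload."""
    return {
        "query": req.constraints.get("category", ""),
        "filters": {k: v for k, v in req.constraints.items() if k != "category"},
        "session": req.context.get("session_id"),
    }

req = SymbolicRequest(
    intent="find_product",
    constraints={"category": "tv", "screen_size_in": 55, "resolution": "4K",
                 "max_price": 400, "in_stock": True},
    context={"session_id": "demo-123"},
)
print(to_catalog_query(req))
```

Because every constraint is an explicit, inspectable field rather than latent text, the same request always produces the same query, which is the property the traceability and steerability bullets above describe.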

Bottom Line

Across 120 one-shot retail scenarios, Apollo-1 completes the entire conversation in 109 runs (90.8%), whereas Amazon Rufus succeeds in just 20 (16.7%). The gap widens when tasks demand live arithmetic, variant precision, or multi-constraint reasoning: the hallmarks of production-grade conversational agents, where transparent, actionable, and reliable behaviour is non-negotiable. Here, neuro-symbolic agents prevail.



Appendix A: Reward logs

View Full Scenario List here 

Scenario prompts (R1 → R2 → R3) | Rufus (R1 · R2 · R3 → result) | Apollo-1 (R1 · R2 · R3 → result)
“Find a 55-inch 4K TV in stock for under $400.” → “What’s the refresh rate of this TV?” → “How much more is the 65-inch version of the same model?” | 1 · 0 · 0 → Fail (1/3) | 1 · 1 · 1 → Pass (3/3)
“Find a 55-inch 4K TV in stock for under $500.” → same follow-ups | 1 · 0 · 0 → Fail (1/3) | 1 · 1 · 1 → Pass (3/3)
“Find a 55-inch 4K TV in stock for under $700.” → same follow-ups | 0 · 0 · 0 → Fail (0/3) | 1 · 1 · 1 → Pass (3/3)
“I need a laptop. Budget is $500.” → “Does it come in different colors?” | 1 · 0 → Fail (1/2) | 1 · 1 → Pass (2/2)
“I need a laptop. Budget is $700.” → same follow-up | 1 · 0 → Fail (1/2) | 1 · 1 → Pass (2/2)
“I need a laptop. Budget is $1000.” → same follow-up | 1 · 0 → Fail (1/2) | 1 · 1 → Pass (2/2)
“I need wireless headphones for classical music, budget up to $400.” → “What’s the battery life of these headphones?” → “What do real customers say about this model?” | 1 · 1 · 0 → Fail (2/3) | 1 · 1 · 1 → Pass (3/3)
“I need wireless headphones for classical music, budget up to $300.” → same follow-ups | 1 · 0 · 0 → Fail (1/3) | 1 · 1 · 1 → Pass (3/3)
“I need wireless headphones for classical music, budget up to $200.” → same follow-ups | 1 · 0 · 0 → Fail (1/3) | 1 · 1 · 1 → Pass (3/3)
“I’m looking for an ergonomic office chair under $200.” → “Do you have this chair in black?” | 1 · 0 → Fail (1/2) | 1 · 1 → Pass (2/2)
“I’m looking for an ergonomic office chair under $300.” → same follow-up | 1 · 0 → Fail (1/2) | 1 · 0 → Fail (1/2)
“I’m looking for an ergonomic office chair under $500.” → same follow-up | 1 · 0 → Fail (1/2) | 1 · 1 → Pass (2/2)


Click here to explore Apollo-1’s Eval Playground, where you can experience its capabilities firsthand. Interact with the model and its Reasoning Panel, navigate real-time conversations across multiple evaluation domains, and review live benchmark trajectories (passkey required; request access).


