News

Controllable AI in Action: Apollo-1 Aces Amazon Scenarios; Rufus Falls Short

91% one-shot success vs 17% in a 120-run head-to-head on Amazon’s own catalog, showing that only controllable, neuro-symbolic agents can reliably operate on behalf of companies.
06/16/25

Introduction

Generative AI excels when acting on behalf of a user, but it breaks down when an agent must act on behalf of a brand or organization. That’s where controllable AI comes in. To prove the point, we ran a set of real-world evaluations that pit Apollo-1 against today’s flagship LLM agents. This evaluation involved 120 multi-turn shopping conversations against Amazon’s live product catalog, covering everything from product discovery and cart operations to policy guardrails, with no hand-curated prompts and no retries.

Apollo-1 is built to operate on behalf of an entity, not just as an assistant to a user. Its hybrid core fuses the fluency of generative AI with the reliability of deterministic rule engines, giving companies and organizations the control and auditability that purely stochastic LLM co-pilots cannot guarantee.

When pitted against Amazon’s own agent, Rufus, on the exact same live catalog and inventory, that design difference translated into a decisive win:

  • Apollo-1: 109 / 120 full-conversation passes (90.8%)
  • Rufus: 20 / 120 full-conversation passes (16.7%)

This result was achieved using a strict, one-shot scoring methodology. Unlike most leaderboards that cherry-pick the best response from batches of generations, we score the first and only response. It’s a simple, reproducible, and brutally honest snapshot of how an agent performs when a real customer is waiting.

Evaluation Method – Real-World, One-Take Scoring

01 One Shot Per Scenario: The agent gets zero retries; whatever it says in each turn is final.
02 Multi-Turn Pass Rule: A scenario passes only if every step is correct; one miss fails the entire scenario.
03 120 Customer Scenarios: Find a 4K TV, compare laptops, check for color variants, add items to a cart, find a gift, and handle policy-driven queries.
04 Live Amazon.com Backend: Both agents hit the same real-time Amazon product catalog, so any performance gap is purely a matter of reasoning and control.
05 Sampling Window: Runs executed the week of June 2nd, 2025; results reflect that snapshot.

Our evaluation is guided by the same principles as the tough τ-Bench benchmark, and is designed to be unforgiving. Each agent gets only one attempt to complete a scenario, and success is binary—every step must be perfect, or the entire scenario is a failure. The final score is therefore a reflection of an agent’s real-world performance.
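
A minimal sketch of this scoring rule, with all names (Step, run_scenario, the respond and check callables) as illustrative assumptions rather than the actual harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    prompt: str                       # tester utterance for this turn
    check: Callable[[str], bool]      # binary per-step correctness judge

def run_scenario(respond: Callable[[str], str], steps: list[Step]) -> int:
    """One shot per turn; reward is 1 only if every step passes."""
    for step in steps:
        reply = respond(step.prompt)  # first and only response, no retries
        if not step.check(reply):
            return 0                  # a single miss fails the scenario
    return 1

# Aggregate pass rate: e.g. 109 passing runs out of 120.
rewards = [1] * 109 + [0] * 11
print(f"pass rate: {sum(rewards) / len(rewards):.1%}")   # -> 90.8%
```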

Results at a Glance

All scores are one-shot and reflect the agent’s first responses only; no retries.


Why Apollo-1 Pulls Ahead

Built to work on behalf of entities, not users. From day one we optimized Apollo-1 for grounded action on behalf of a brand, business, or organization, not for idle conversation. The architecture layers in control systems that show up in every result above.

  • Neuro-symbolic foundation model: a Symbolic Reasoner replaces the transformer as the model’s decision-making core, enabling controllable, context-aware interactions.
  • Deterministic control on demand: operators can inject rule-based logic whenever needed; symbolic policy guardrails guarantee the agent never departs from instructions while still leveraging generative fluency for natural conversation (a pattern sketched below).
  • Real-time Control Panel: operators can inspect reasoning, adjust context schemas, inject rules, and replay live trajectories. The Control Panel also supports continuous fine-tuning at the sub-interaction level instead of retraining the whole stack.
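
As a rough illustration of the deterministic-control bullet above, here is a minimal sketch of how a symbolic guardrail layer can veto or rewrite a generative draft before it reaches the customer. Apollo-1’s real rule interface is not shown in this post; the Rule type, the discount policy, and every name below are assumptions.

```python
from typing import Callable

# A rule inspects the context and the generative draft; it returns an
# override string to enforce policy, or None to let the draft through.
Rule = Callable[[dict, str], str | None]

def no_out_of_policy_discounts(ctx: dict, draft: str) -> str | None:
    # Hypothetical policy: never promise discounts the catalog doesn't list.
    if "discount" in draft.lower() and not ctx.get("discount_listed", False):
        return ("I can't confirm a discount on this item, "
                "but here is the current listed price.")
    return None

def guarded_reply(ctx: dict, draft: str, rules: list[Rule]) -> str:
    """Apply symbolic rules in order; the first matching rule wins."""
    for rule in rules:
        override = rule(ctx, draft)
        if override is not None:
            return override           # deterministic path
    return draft                      # generative path unchanged

print(guarded_reply({"discount_listed": False},
                    "Great news, I can give you a 20% discount!",
                    [no_out_of_policy_discounts]))
```

The key design property is that the rule check is deterministic: given the same context and draft, the guardrail fires the same way every time, which is what makes the behavior auditable.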

The net effect: an agent that can confidently operate on behalf of a retailer, an insurance company, or a government office, executing the task, honoring every policy, and maintaining conversational fluency.

Recent research from Salesforce Research echoes the need for this shift: on the CRMArena-Pro benchmark, leading LLM agents top out at ~58% single-turn success and collapse to ~35% in multi-turn chats, leading the authors to highlight a “significant gap between current LLM capabilities and real-world enterprise demands”.1

Example Scenario

Scoring Criteria – Scenario 1: “55-inch 4K TV Under $400”

Step R-1
  Tester prompt: “Find a 55-inch 4K TV in stock for under $400.”
  Check: Returns at least one concrete model (name + price).
  Pass criterion: Model is 55-inch, 4K, in stock, price ≤ $400, with the price shown.
Step R-2
  Tester prompt: “What’s the refresh rate of this TV?”
  Check: States the refresh rate for the same model.
  Pass criterion: Refresh-rate value is correct and explicitly mentioned.
Step R-3
  Tester prompt: “How much more is the 65-inch version of the same model?”
  Check: Gives the price difference or says “not available.”
  Pass criterion: Either the correct USD delta or “65-inch not available” for that model.

Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3).
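
For concreteness, a minimal sketch of this rubric (the function and variable names are ours, not the harness’s): per-step scores are binary, the scenario score is their sum, and the reward is 1 only on a clean sweep.

```python
def scenario_reward(step_scores: list[int]) -> tuple[int, int]:
    """Binary step scores in, (scenario score, reward) out."""
    score = sum(step_scores)                 # e.g. R-1 + R-2 + R-3, max 3
    reward = int(score == len(step_scores))  # every step must pass
    return score, reward

print(scenario_reward([1, 1, 1]))   # Apollo-1 on Scenario 1 -> (3, 1)
print(scenario_reward([1, 0, 0]))   # Rufus on Scenario 1    -> (1, 0)
```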

Scenario 1 Reward Logs

Apollo-1 · TV found: ✅ TCL 55S455, $349 · Refresh rate right: ✅ 60 Hz · 65-inch price/diff right: ✅ “65-inch not available” · Score: 3/3 · Reward: 1
Rufus · TV found: ✅ Hisense 55H6G, $299 · Refresh rate right: ❌ (no refresh rate given) · 65-inch price/diff right: ❌ (no price diff) · Score: 1/3 · Reward: 0

Apollo-1 delivers exact product details, specs, and availability in a single pass; Rufus surfaces a model but misses the follow-up spec and variant check.

Benchmark Overview & Group Analysis

The 120 live shopping runs fall into five buckets. Each bucket probes a single competency; a flawless agent would post 100% in every row.

[Chart: Rufus vs Apollo-1 pass rates by group]

Group | Scenario IDs | What it measures | Apollo-1 | Rufus
Specs & Comparisons | 1-3 · 7-9 · 22-24 · 28-30 · 31-33 · 37-39 · 61-63 · 64-66 · 112-114 · 118-120 | Quick product details, direct feature/spec comparisons, discount verification, factual accuracy | 24/30 → 80% | 4/30 → 13%
Variants & Stock | 4-6 · 10-12 · 34-36 · 52-54 · 58-60 · 70-72 · 73-75 · 76-78 · 79-81 · 82-84 | Real-time availability for size, color, and product variants; alternatives when out of stock | 29/30 → 97% | 1/30 → 3%
Cart & Inventory | 13-15 · 19-21 · 40-42 · 49-51 · 67-69 · 106-108 | Precise cart interactions: add/remove items, multi-product management, live inventory checks | 18/18 → 100% | 0/18 → 0%
Advanced Filtering | 25-27 · 43-45 · 46-48 · 88-90 · 91-93 · 94-96 · 97-99 · 100-102 · 115-117 | Complex multi-condition filtering (price, specs, warranty), arithmetic reasoning, real-time recomputation of totals | 20/24 → 83% | 3/24 → 13%
Gifts & Compliance | 16-18 · 55-57 · 85-87 · 103-105 · 109-111 | Occasion-based product recommendations, suitability for pets/children, strict policy adherence | 18/18 → 100% | 12/18 → 67%

Apollo-1 Strengths Across the Retail Buckets

  • Context-aware dialogue (Specs & Comparisons)
    In Specs & Comparisons scenarios, Apollo-1 provides quick, accurate product details and asks clarifying follow-up questions (budget, exact specs, discounts) only when necessary, consistently maintaining clarity and context throughout.
  • Real-time variant availability (Variants & Stock)
    Apollo-1 consistently returns precise, real-time information on product availability across sizes, colors, and variants, explicitly stating “not available” when appropriate, achieving 97% accuracy.
  • Deterministic cart control (Cart & Inventory)
    In Cart & Inventory scenarios, Apollo-1 performs actual cart modifications, accurately adding and removing products and instantly reflecting those actions in the live cart contents, avoiding the “pretend add-to-cart” failures seen in Rufus.
  • Advanced multi-condition filtering and arithmetic logic (Advanced Filtering)
    Apollo-1 handles complex filtering scenarios (price, warranty, specifications), recalculates totals dynamically after each user adjustment, and maintains precise arithmetic accuracy, achieving an 83% success rate (see the sketch after this list).
  • Compliance and targeted recommendations (Gifts & Compliance)
    Apollo-1 effectively recommends products based on specific use-cases or suitability criteria (e.g., occasion, pets/children) while strictly adhering to brand policies and gracefully handling constraints, reaching perfect compliance.
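
The Advanced Filtering behavior above, i.e. hard multi-condition filters plus live recomputation of totals, is easy to picture in code. The sketch below is illustrative only: the product fields, values, and function names are assumptions, not Apollo-1 internals.

```python
# Toy catalog; field names are assumptions for illustration.
products = [
    {"name": "TV A", "price": 349.0, "warranty_yrs": 2, "size_in": 55},
    {"name": "TV B", "price": 499.0, "warranty_yrs": 1, "size_in": 55},
    {"name": "TV C", "price": 389.0, "warranty_yrs": 2, "size_in": 65},
]

def filter_products(items, max_price, min_warranty, size_in):
    """Every hard constraint must hold; no near-misses slip through."""
    return [p for p in items
            if p["price"] <= max_price
            and p["warranty_yrs"] >= min_warranty
            and p["size_in"] == size_in]

cart = filter_products(products, max_price=400, min_warranty=2, size_in=55)
print([p["name"] for p in cart])                        # -> ['TV A']
print(f"total: ${sum(p['price'] for p in cart):.2f}")   # recomputed after each change
```

Treating each filter as a hard predicate rather than a soft preference is what prevents the “close enough” answers that sink purely generative agents on multi-constraint tasks.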

Areas We’re Still Tuning

  • Dynamic discount detection – sale-price vs. list-price math (Scenarios 115-120) remains brittle across both models.
  • Long-tail attributes – Edge specs such as exotic ink-tank yields or obscure warranty tiers occasionally slip past the retrieval layer.

Rufus’ Persistent Gaps

  • Skips variant checks (colour, size) or answers generically instead of “not available.”
  • Describes—but rarely executes—cart operations.
  • Miscomputes or omits arithmetic steps (price deltas, CADR values).
  • Zero full passes in Multi-Constraint tasks; struggles whenever more than one hard filter is active.

Bottom Line

Across 120 one-shot retail scenarios, Apollo-1 delivers a full-conversation pass on 109 runs (90.8%); Amazon’s Rufus manages 20 (16.7%). The margin widens as tasks demand live arithmetic, variant precision, or multi-constraint reasoning: the hallmarks of production-grade conversational agents, where deterministic, auditable behavior is non-negotiable and controllable, neuro-symbolic agents prevail.

Appendix A: Reward logs

View Full Scenario List here 

Per-step scores (1 = pass, 0 = fail) for the twelve runs sampled below; a scenario passes only on a clean sweep of its steps.

Scenario 1
  R1: “Find a 55-inch 4K TV in stock for under $400.”
  R2: “What’s the refresh rate of this TV?”
  R3: “How much more is the 65-inch version of the same model?”
  Rufus: R1 1 · R2 0 · R3 0 → Fail (1/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 2
  R1: “Find a 55-inch 4K TV in stock for under $500.”
  R2: “What’s the refresh rate of this TV?”
  R3: “How much more is the 65-inch version of the same model?”
  Rufus: R1 1 · R2 0 · R3 0 → Fail (1/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 3
  R1: “Find a 55-inch 4K TV in stock for under $700.”
  R2: “What’s the refresh rate of this TV?”
  R3: “How much more is the 65-inch version of the same model?”
  Rufus: R1 0 · R2 0 · R3 0 → Fail (0/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 4
  R1: “I need a laptop. Budget is $500.”
  R2: “Does it come in different colors?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)

Scenario 5
  R1: “I need a laptop. Budget is $700.”
  R2: “Does it come in different colors?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)

Scenario 6
  R1: “I need a laptop. Budget is $1000.”
  R2: “Does it come in different colors?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)

Scenario 7
  R1: “I need wireless headphones for classical music, budget up to $400.”
  R2: “What’s the battery life of these headphones?”
  R3: “What do real customers say about this model?”
  Rufus: R1 1 · R2 1 · R3 0 → Fail (2/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 8
  R1: “I need wireless headphones for classical music, budget up to $300.”
  R2: “What’s the battery life of these headphones?”
  R3: “What do real customers say about this model?”
  Rufus: R1 1 · R2 0 · R3 0 → Fail (1/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 9 (prompts omitted in the published log)
  Rufus: R1 1 · R2 0 · R3 0 → Fail (1/3)
  Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

Scenario 10
  R1: “I’m looking for an ergonomic office chair under $200.”
  R2: “Do you have this chair in black?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)

Scenario 11
  R1: “I’m looking for an ergonomic office chair under $300.”
  R2: “Do you have this chair in black?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 0 → Fail (1/2)

Scenario 12
  R1: “I’m looking for an ergonomic office chair under $500.”
  R2: “Do you have this chair in black?”
  Rufus: R1 1 · R2 0 → Fail (1/2)
  Apollo-1: R1 1 · R2 1 → Pass (2/2)
Appendix B: Trajectories

View the full trajectories here.

References

  1 Huang, K-H.; Prabhakar, A.; Thorat, O.; Agarwal, D.; Choubey, P.K.; Mao, Y.; Savarese, S.; Xiong, C.; Wu, C-S. CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions. Salesforce AI Research (2025). arXiv:2505.18878.
