Generative AI excels when acting on behalf of a user, but it breaks down when an agent must act on behalf of a brand or organisation. That’s where Neuro-Symbolic AI comes in. To prove the point, we ran a set of real-world evaluations that pit Apollo-1 against today’s flagship LLM agents. This evaluation involved 120 multi-turn shopping conversations against Amazon’s live product catalog, covering everything from product discovery and cart operations to policy guardrails, with no hand-curated prompts and no retries.
When pitted against Amazon’s own agent, Rufus, on the exact same live catalog and inventory, Apollo-1 scored a decisive win:
- Apollo-1: 109/120 full-conversation passes (90.8%)
- Rufus: 20/120 full-conversation passes (16.7%)
This result was achieved using a strict, one-shot scoring methodology. Unlike most leaderboards that cherry-pick the best response from batches of generations, we score the first and only response. It’s a simple, reproducible, and brutally honest snapshot of how an agent performs when a real customer is waiting.
ChatGPT showed the world that machines can talk. But when a machine must also reliably act—book flights, move money, enforce policy—fluent conversation isn’t enough. Banks can’t audit decisions, airlines can’t guarantee bookings, retailers can’t trust inventory calls. Generative AI remains an unreliable black box in scenarios where conversational agents must act on behalf of entities. Yet these scenarios represent a substantial and economically critical portion of all potential AI applications.
For AI to handle these critical interactions, we must transcend purely generative models and embrace a new architecture: Neuro-Symbolic AI. Apollo-1, our neuro-symbolic foundation model for conversational agents, marks the beginning of this transformation. In recent evaluation tests, such as the one detailed below, Apollo-1 consistently outperformed state-of-the-art generative models by wide margins, precisely on tasks that require conversational fluency combined with dependable, transparent action.
The fundamental shortcomings of generative AI have become increasingly visible (and costly).
Generative AI alone simply cannot fulfill the promises the world expects from advanced artificial intelligence.
For decades, AI researchers debated two distinct paths: symbolic AI, emphasizing rules, logic, and explicit reasoning; and neural networks, skilled at pattern recognition and statistical learning. Both have strengths, both have critical limitations. Purely symbolic AI struggles with natural, conversational interaction. Purely neural, generative AI falters when trust, reliability, and consistency are non-negotiable. The long-standing goal was to combine the two—leveraging each to offset the other’s weaknesses. With Apollo-1, that vision is now reality: conversational agents that converse fluently and act reliably.
Neuro-Symbolic AI bridges the gap between Generative AI’s linguistic capabilities and Symbolic AI’s structured reasoning, unlocking actionable, reliable, transparent, and steerable AI interactions. This second wave moves beyond conversation to dependable execution, enabling conversational agents that work for entities, not just end-users.
All scores are one-shot and reflect the agent’s first responses only; no retries.
Our evaluation is guided by the same principles as the tough τ-Bench benchmark, and is designed to be unforgiving. Each agent gets only one attempt to complete a scenario, and success is binary—every step must be perfect, or the entire scenario is a failure. The final score is therefore a reflection of an agent’s real-world performance.
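To make the rule concrete, here is a minimal sketch of that scoring logic in Python. The names (`Step`, `run_scenario`, and the `agent.respond` interface) are our illustration, not the actual evaluation harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    prompt: str                   # tester utterance sent to the agent
    check: Callable[[str], bool]  # pass criterion applied to the response

def run_scenario(agent, steps: list[Step]) -> tuple[int, int]:
    """One attempt per step, no retries; the reward is all-or-nothing."""
    score = 0
    for step in steps:
        response = agent.respond(step.prompt)  # first and only response is scored
        if step.check(response):
            score += 1
    reward = 1 if score == len(steps) else 0   # binary: every step must pass
    return score, reward
```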
Scoring criteria — Scenario 1 “55-inch 4K TV Under $400”
| Step | Tester prompt | Check | Pass criterion |
|---|---|---|---|
| R-1 | “Find a 55-inch 4K TV in stock for under $400.” | Returns at least one concrete model (name + price). | Model is 55-inch, 4K, in stock, priced ≤ $400, with the price shown. |
| R-2 | “What’s the refresh rate of this TV?” | States refresh rate for the same model. | Refresh-rate value is correct and explicitly mentioned. |
| R-3 | “How much more is the 65-inch version of the same model?” | Gives price difference or says “not available.” | Either the correct USD delta, or “65-inch not available” for that model. |
Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3).
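Under the same sketch, Scenario 1 can be expressed as data. The lambda checks below are simplified stand-ins for the human-verified pass criteria in the table, not the real grader:

```python
import re

# Scenario 1, using the Step type sketched above.
scenario_1 = [
    Step(prompt="Find a 55-inch 4K TV in stock for under $400.",
         check=lambda r: "55" in r and bool(re.search(r"\$\d+", r))),  # R-1
    Step(prompt="What’s the refresh rate of this TV?",
         check=lambda r: "Hz" in r),                                   # R-2
    Step(prompt="How much more is the 65-inch version of the same model?",
         check=lambda r: "$" in r or "not available" in r.lower()),    # R-3
]

# score, reward = run_scenario(agent, scenario_1)
# Published results: Apollo-1 -> (3, 1); Rufus -> (1, 0)
```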
| Model | TV found? | Refresh rate right? | 65-inch price/diff right? | Score | Reward |
|---|---|---|---|---|---|
| Apollo-1 | ✅ TCL 55S455, $349 | ✅ 60 Hz | ✅ “65-inch not available” | 3/3 | 1 |
| Rufus | ✅ Hisense 55H6G, $299 | ❌ (no refresh rate) | ❌ (no price diff) | 1/3 | 0 |
Apollo-1 delivers exact product details, specs, and availability in a single pass; Rufus surfaces a model but misses the follow-up spec and variant check.
The 120 live shopping runs fall into five buckets. Each bucket probes a single competency; a flawless agent would post 100% in every row.
| Group | Technical capability under test |
|---|---|
| Specs & Comparisons | Quick product details, direct feature/spec comparisons, discount verification, factual accuracy. |
| Variants & Stock | Real-time availability for size, colour, and product variants, and clear “not available” statements. |
| Cart & Inventory | Precise cart interactions: add/remove items, multi-product management, live inventory checks. |
| Advanced Filtering | Complex multi-condition filtering (price, specs, warranty), arithmetic reasoning, real-time recomputation of totals. |
| Gifts & Compliance | Occasion-based product recommendations, suitability for pets/children, strict policy adherence. |
| Group | Scenario IDs | What it measures | Apollo-1 | Rufus |
|---|---|---|---|---|
| Specs & Comparisons | 1-3 · 7-9 · 22-24 · 28-30 · 31-33 · 37-39 · 61-63 · 64-66 · 112-114 · 118-120 | Quick product details, direct feature/spec comparisons, discount verification, factual accuracy | 24/30 → 80% | 4/30 → 13% |
| Variants & Stock | 4-6 · 10-12 · 34-36 · 52-54 · 58-60 · 70-72 · 73-75 · 76-78 · 79-81 · 82-84 | Real-time availability for size, colour, and product variants; alternatives when out of stock | 29/30 → 97% | 1/30 → 3% |
| Cart & Inventory | 13-15 · 19-21 · 40-42 · 49-51 · 67-69 · 106-108 | Precise cart interactions: add/remove items, multi-product management, live inventory checks | 18/18 → 100% | 0/18 → 0% |
| Advanced Filtering | 25-27 · 43-45 · 46-48 · 88-90 · 91-93 · 94-96 · 97-99 · 100-102 · 115-117 | Complex multi-condition filtering (price, specs, warranty), arithmetic reasoning, scenario-driven logic | 20/24 → 83% | 3/24 → 13% |
| Gifts & Compliance | 16-18 · 55-57 · 85-87 · 103-105 · 109-111 | Occasion-based product recommendations, suitability for pets/children, strict policy adherence | 18/18 → 100% | 12/18 → 67% |
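For readers who want to check the arithmetic, the published group scores reduce to the headline number directly:

```python
# Reproducing the group and overall percentages from the table above.
apollo = {
    "Specs & Comparisons": (24, 30),
    "Variants & Stock":    (29, 30),
    "Cart & Inventory":    (18, 18),
    "Advanced Filtering":  (20, 24),
    "Gifts & Compliance":  (18, 18),
}

def rate(passed: int, total: int) -> str:
    return f"{passed}/{total} = {passed / total:.1%}"

for group, (p, t) in apollo.items():
    print(f"{group}: {rate(p, t)}")

total_passed = sum(p for p, _ in apollo.values())
total_runs = sum(t for _, t in apollo.values())
print(f"Overall: {rate(total_passed, total_runs)}")  # 109/120 = 90.8%
```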
Apollo-1 Strengths Across the Retail Buckets: perfect scores in Cart & Inventory (18/18) and Gifts & Compliance (18/18), and near-perfect Variants & Stock (29/30).
Areas We’re Still Tuning: Specs & Comparisons (24/30) and Advanced Filtering (20/24), where occasional spec look-ups and multi-constraint computations still slip.
Rufus’ Persistent Gaps: 0/18 in Cart & Inventory, 1/30 in Variants & Stock, and pass rates of 13% or below in every bucket except Gifts & Compliance (12/18, 67%).
Apollo-1 is built specifically for conversational agents that act reliably on behalf of organisations such as retailers, airlines, banks, and government agencies. Rather than relying on generative transformers, Apollo-1 introduces a Neuro-Symbolic Reasoner as its decision-making core, making conversational agents that converse fluently and act reliably a reality. To dive into Apollo-1’s architecture, click here.
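This post does not disclose Apollo-1’s internals, so the following is only a generic sketch of what a neuro-symbolic split looks like: a neural layer maps free-form language to a structured intent, and a symbolic layer makes the decision deterministically over live data. Every name here (`Intent`, `neural_parse`, `symbolic_decide`, the catalog schema) is a hypothetical illustration, not Apollo-1’s API:

```python
from dataclasses import dataclass

@dataclass
class Intent:
    action: str        # e.g. "find_product"
    constraints: dict  # e.g. {"size_in": 55, "resolution": "4K", "max_price": 400}

def neural_parse(utterance: str) -> Intent:
    """Neural layer (stubbed here): learned mapping from language to intent."""
    # Hard-coded for the demo utterance; a real parser would be a trained model.
    return Intent("find_product",
                  {"size_in": 55, "resolution": "4K", "max_price": 400})

def symbolic_decide(intent: Intent, catalog: list[dict]) -> str:
    """Symbolic layer: a deterministic, auditable rule over live catalog data."""
    if intent.action == "find_product":
        c = intent.constraints
        matches = [p for p in catalog
                   if p["size_in"] == c["size_in"]
                   and p["resolution"] == c["resolution"]
                   and p["in_stock"]
                   and p["price"] <= c["max_price"]]
        # The agent either cites a concrete match or says so explicitly;
        # it never guesses, and every branch can be inspected after the fact.
        return (f"{matches[0]['name']}, ${matches[0]['price']}"
                if matches else "not available")
    return "unsupported action"

catalog = [{"name": "TCL 55S455", "price": 349, "size_in": 55,
            "resolution": "4K", "in_stock": True}]
print(symbolic_decide(neural_parse("Find a 55-inch 4K TV under $400."), catalog))
# -> TCL 55S455, $349
```

The value of the split is exactly the auditability argument above: the decision traces to explicit rules over explicit data, which is what an organisation can certify.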
Across 120 one-shot retail scenarios, Apollo-1 completes the entire conversation in 109 runs (90.8%), whereas Amazon Rufus succeeds in just 20 (16.7%). The gap widens when tasks demand live arithmetic, variant precision, or multi-constraint reasoning—hallmarks of production-grade conversational agents where transparent, actionable, and reliable behaviour is non-negotiable; here, neuro-symbolic agents prevail.
Sample one-shot transcripts. Tester prompts are shown per scenario; each agent’s step score follows, with results listed in the same scenario order for both agents.

| Tester prompts | Rufus | Apollo-1 |
|---|---|---|
| “Find a 55-inch 4K TV in stock for under $400.” · “What’s the refresh rate of this TV?” · “How much more is the 65-inch version of the same model?” | Fail (1/3) | Pass (3/3) |
| Same prompts, under $500 | Fail (1/3) | Pass (3/3) |
| Same prompts, under $700 | Fail (0/3) | Pass (3/3) |
| “I need a laptop. Budget is $500.” · “Does it come in different colors?” | Fail (1/2) | Pass (2/2) |
| Same prompts, $700 budget | Fail (1/2) | Pass (2/2) |
| Same prompts, $1000 budget | Fail (1/2) | Pass (2/2) |
| “I need wireless headphones for classical music, budget up to $400.” · “What’s the battery life of these headphones?” · “What do real customers say about this model?” | Fail (2/3) | Pass (3/3) |
| Same prompts, up to $300 | Fail (1/3) | Pass (3/3) |
| Same prompts, up to $200 | Fail (1/3) | Pass (3/3) |
| “I’m looking for an ergonomic office chair under $200.” · “Do you have this chair in black?” | Fail (1/2) | Pass (2/2) |
| Same prompts, under $300 | Fail (1/2) | Fail (1/2) |
| Same prompts, under $500 | Fail (1/2) | Pass (2/2) |
Click here to explore Apollo-1’s Eval Playground, where you can experience its capabilities firsthand. Interact with the model and its Reasoning Panel, navigate real-time conversations across multiple evaluation domains, and review live benchmark trajectories (passkey required; request access).