Generative AI excels when acting on behalf of a user, but it falters when an agent must act for an entity, whether it’s an insurance company, a brand, or a travel platform. That’s where Neuro-Symbolic AI becomes essential. Apollo-1, our neuro-symbolic foundation model for conversational agents, is engineered to operate on behalf of an entity, not an individual. Its neuro-symbolic core merges generative fluency with reliable actionability, giving organizations the transparency, traceability, and steerability that conventional LLM co-pilots lack.
To demonstrate Neuro-Symbolic AI’s superiority over purely generative models, we wired Apollo-1 into the same real-time Google Flights feed powering Gemini 2.5-Flash and ran 111 multi-turn conversations across both models. The head-to-head below shows how that design translates into real-world wins over Gemini when every query hits identical inventory, prices, and schedules:
Apollo-1: 92 / 111 full-conversation passes (82.9%)
Gemini 2.5-Flash: 24 / 111 full-conversation passes (21.6%)
Most leaderboards cherry-pick the best answer from a batch of generations. We do the opposite. Every prompt gets one shot—exactly what a live user would see—and that single response is what we score.
Traditional LLM co-pilots (ChatGPT, Gemini, Claude) are superb personal assistants. But when the agent is the business, stochastic phrasing, hallucinated facts, and opaque reasoning become existential risks. ChatGPT showed the world that machines can talk. But when a machine must also reliably act—book flights, move money, enforce policy—fluent conversation isn’t enough. Generative AI remains an unreliable black box in scenarios where conversational agents must act on behalf of entities. Yet these scenarios represent a substantial and economically critical portion of all potential AI applications.
For AI to handle these critical interactions, we must transcend purely generative models and embrace a new architecture: Neuro-Symbolic AI. Apollo-1 marks the beginning of this transformation. In recent benchmarks and evaluation tests, such as the one detailed below, Apollo-1 consistently outperformed state-of-the-art generative models by wide margins, precisely on tasks that require conversational fluency combined with dependable, transparent action.
The fundamental shortcomings of generative AI have become increasingly visible, and increasingly costly.
Generative AI alone simply cannot fulfill the promises the world expects from advanced artificial intelligence. Recent academic work underscores how far today’s LLM agents are from that bar. A May 2025 paper, “CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions”, reports that state-of-the-art agents score only 58% on single-turn tasks and collapse to ≈35% once the dialogue spans multiple turns, leading the authors to highlight a “significant gap between current LLM capabilities and real-world enterprise demands.”1
For decades, AI researchers debated two distinct paths: symbolic AI, emphasizing rules, logic, and explicit reasoning; and neural networks, skilled at pattern recognition and statistical learning. Both have strengths, both have critical limitations. Purely symbolic AI struggles with natural, conversational interaction. Purely neural, generative AI falters when trust, reliability, and consistency are non-negotiable. The long-standing goal was to combine the two—leveraging each to offset the other’s weaknesses. With Apollo-1, that vision is now reality: conversational agents that converse fluently and act reliably.
Neuro-Symbolic AI bridges the gap between Generative AI’s linguistic capabilities and Symbolic AI’s structured reasoning, unlocking actionable, reliable, transparent, and steerable AI interactions. This second wave moves beyond conversation to dependable execution, enabling conversational agents that work for entities, not just end-users.
We reuse τ-Bench’s philosophy in miniature: no retries, and binary checks that mirror real customer journeys. And because there is no second chance, the outcome is a single, unforgiving number: a simple, reproducible, brutally honest snapshot of how an agent really behaves when someone is on the other side of the chat box.
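As a rough sketch of that one-shot protocol (the `agent.send` call and the check signatures are illustrative assumptions, not the actual harness), each tester prompt triggers exactly one generation and each reply is graded by a single binary check:

```python
from typing import Callable, List, Tuple

def run_one_shot(agent, turns: List[Tuple[str, Callable[[str], bool]]]) -> List[int]:
    """Play a scenario with no retries: one generation per prompt, one binary check per reply."""
    results = []
    for prompt, check in turns:
        reply = agent.send(prompt)            # a single sample; no best-of-N, no regeneration
        results.append(1 if check(reply) else 0)
    return results                            # e.g. [1, 0, 1] for a three-turn scenario
```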
Scoring criteria for Scenario 28, “Same-day One-Way, Arrive Before Lunch”
| Step | Tester prompt | Check | Pass criterion |
|---|---|---|---|
| R-1 | “One-way LAX → JFK overnight on 9 Aug 2025 (arrive morning 10 Aug), 1 adult, economy.” | Return at least one qualifying red-eye itinerary | Departs 9 Aug, arrives next day before 12:00 PM, shows total price |
| R-2 | “What’s the cost to upgrade the cheapest red-eye?” | Quote upgrade price (or say none) for that same itinerary | States whether an upgrade exists and the correct price delta |
| R-3 | “If I skip the upgrade and instead add one checked bag, what’s the total cost?” | Add the airline’s checked-bag fee and compute the new grand total | Gives per-bag fee, adds it to the base fare, and returns the exact all-in total |
Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3).
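A minimal sketch of how such checks and the per-run reward could be computed; the itinerary fields (`depart`, `arrive`, `total_price`) are hypothetical, and the full-pass reward rule mirrors the reward logs below rather than the exact grading code:

```python
from datetime import datetime

def r1_pass(itinerary: dict) -> bool:
    """R-1: departs 9 Aug 2025, lands the next day before 12:00 PM, and a total price is shown."""
    depart = datetime.fromisoformat(itinerary["depart"])   # e.g. "2025-08-09T22:00"
    arrive = datetime.fromisoformat(itinerary["arrive"])   # e.g. "2025-08-10T06:30"
    return (depart.date().isoformat() == "2025-08-09"
            and arrive.date().isoformat() == "2025-08-10"
            and arrive.hour < 12
            and itinerary.get("total_price") is not None)

def scenario_score(step_results: list) -> tuple:
    """Sum the binary step checks; the reward is 1 only on a full-conversation pass."""
    score = sum(1 for ok in step_results if ok)
    reward = 1 if score == len(step_results) else 0
    return score, reward

# Example: a run that passes R-1 and R-2 but gets the bag arithmetic wrong in R-3
print(scenario_score([True, True, False]))                  # (2, 0)
```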
Scenario Reward Logs
| Model | Red-eye found? | Upgrade price? | Bag math right? | Score | Reward |
|---|---|---|---|---|---|
| Apollo-1 | ✅ 22:00 → 06:30, $204 | ✅ +$80 Blue Extra | ✅ Fare + $40 bag = $244 | 3/3 | 1 |
| Gemini | ✅ 20:47 → 05:30, $139 | ❌ Pulled upgrade from wrong airline | ❌ Bag fee range, no total | 1/3 | 0 |
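The bag math behind R-3 is plain fee arithmetic, as in this tiny illustrative helper (values taken from Apollo-1’s row above):

```python
def all_in_total(base_fare: float, bag_fee: float, bags: int = 1) -> float:
    """Grand total after skipping the upgrade: base fare plus the per-bag checked-bag fee."""
    return base_fare + bag_fee * bags

# Apollo-1's row above: $204 red-eye fare + one $40 checked bag = $244 all-in
assert all_in_total(204.0, 40.0) == 244.0
```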
The 111 scenarios are grouped around five distinct technical competencies that an enterprise-grade flight agent must master. Each group isolates one capability, so a perfect model would score 100% inside every row:
| Group | Technical capability under test |
|---|---|
| Baseline Retrieval | Same-/next-day queries with minimal dialogue: slot-filling from a single prompt, immediate cheapest-non-stop extraction, earliest-arrival lookup, basic aircraft or alliance attribution, and quick FAA-safety or risk answers. |
| Core Search | Two-turn clarification flows and flexible-window optimisation: resolving missing dates or trip type, iterating on “next flight / land-by” constraints, and selecting the absolute lowest fare from a live inventory once parameters are locked. |
| Ancillary Costs | End-to-end fee arithmetic: parsing carrier-specific price tables (checked bags, pets, seat upgrades, Wi-Fi, refundability), aggregating per-leg vs per-trip charges, recomputing totals after user changes, and preserving fare-rule validity. |
| Cabin Precision | Fare-family graph reasoning and cabin-hierarchy mapping (Economy → Premium Eco → Business/First): real-time availability checks, upgrade-delta calculation, group-size propagation through re-quotes, and class-of-service validation for non-stop constraints. |
| Constrained Planning | Multi-constraint itinerary optimisation: enforcing strict arrival/departure windows, airline/airport and layover filters, short-connection thresholds, hub or region routing, CO₂ or environmental factors, historical delay-risk ranking, and fallback-option generation when no direct match exists. |
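To make the Constrained Planning row concrete, a hard-constraint filter of this kind might look like the sketch below; the `Itinerary` fields and default thresholds are illustrative assumptions, not Apollo-1 internals:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Set

@dataclass
class Itinerary:
    airline: str
    depart: datetime
    arrive: datetime
    layovers: int
    shortest_connection: timedelta
    price: float

def plan_constrained(options: List[Itinerary],
                     arrive_by: datetime,
                     max_layovers: int = 1,
                     min_connection: timedelta = timedelta(minutes=45),
                     allowed_airlines: Optional[Set[str]] = None) -> List[Itinerary]:
    """Keep only itineraries that satisfy every hard constraint, then rank the survivors by price."""
    def satisfies(it: Itinerary) -> bool:
        return (it.arrive <= arrive_by
                and it.layovers <= max_layovers
                and (it.layovers == 0 or it.shortest_connection >= min_connection)
                and (allowed_airlines is None or it.airline in allowed_airlines))
    return sorted((it for it in options if satisfies(it)), key=lambda it: it.price)
```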
| Group | Scenario IDs | What it measures | Apollo-1 | Gemini |
|---|---|---|---|---|
| Baseline Retrieval | 37–39 · 70–72 · 91–93 · 100–111 | Same-/next-day look-ups, cheapest non-stop pick, earliest-arrival check, basic aircraft/alliance flags, FAA-safety acknowledgement. e.g. “Cheapest BOS→MIA next Saturday”, “Land before midnight despite FAA alert” | 85.7% | 71.4% |
| Core Search | 1–6 · 25–27 · 40–42 · 64–66 · 73–78 | Multi-turn date/trip-type clarification, flexible-window lowest fare, next-flight/arrival-window filters. e.g. round-trip MIA⇄NYC date probe; 1–7 Jul cheapest RT | 95.2% | 28.6% |
| Ancillary Costs | 7–9 · 28–30 · 34–36 · 52–54 · 61–63 · 82–84 · 94–96 | Fee arithmetic: checked bags, pet-in-cabin, upgrade-vs-bag deltas, Wi-Fi/refundability tiers, OW-vs-RT price optimisation. e.g. basic-econ + checked bag; red-eye upgrade vs bag | 57.1% | 0% |
| Cabin Precision | 13–18 · 46–51 · 85–90 · 97–99 | Fare-family graph reasoning, Premium/Biz quotes, group upgrades, non-stop cabin comparisons, multi-city upgrade pricing. e.g. Economy→Premium delta on non-stop JFK→LAX | 90.5% | 0% |
| Constrained Planning | 10–12 · 19–24 · 31–33 · 43–45 · 55–60 · 67–69 · 79–81 | Multi-constraint routing: arrival/departure cut-offs, red-eyes + Wi-Fi, CO₂ & offset, hub/layover filters, delay-risk backups, national-park & festival trips. | 85.2% | 11.1% |
| Totals | n = 111 scenarios | — | 82.9% | 21.6% |
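The headline totals follow from simple weighted aggregation of the group rows; this sketch recomputes them from the per-group pass counts implied by the percentages above:

```python
# (scenarios, Apollo-1 passes, Gemini passes) per group, implied by the percentages above
groups = {
    "Baseline Retrieval":   (21, 18, 15),
    "Core Search":          (21, 20, 6),
    "Ancillary Costs":      (21, 12, 0),
    "Cabin Precision":      (21, 19, 0),
    "Constrained Planning": (27, 23, 3),
}

total = sum(size for size, _, _ in groups.values())          # 111 scenarios
apollo = sum(a for _, a, _ in groups.values())               # 92 full-conversation passes
gemini = sum(g for _, _, g in groups.values())               # 24 full-conversation passes
print(f"Apollo-1: {apollo}/{total} = {apollo / total:.1%}")  # 82.9%
print(f"Gemini:   {gemini}/{total} = {gemini / total:.1%}")  # 21.6%
```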
Apollo-1 Strengths Across the Five Competency Buckets
Areas that Still Need Work
Gemini’s Persistent Gaps
Apollo-1 is designed to power conversational agents acting on behalf of entities across industries and use-cases. Apollo-1 enables advanced native tool use through reliable, structured symbolic interactions with complex external systems and APIs. It provides comprehensive traceability, with each decision logged in a fully inspectable, editable format. Finally, it offers steerability and controllability, allowing organizations to consistently steer agents toward desired behaviors by providing granular context, instructions, and guidelines. To dive into Apollo-1’s architecture, click here.
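As an illustration of what structured, traceable tool interactions can look like at the integration layer (the registry, schema, and field names here are hypothetical, not Apollo-1’s actual API):

```python
import json
from datetime import datetime, timezone

# Hypothetical tool registry: structured, typed entry points instead of free-form text
TOOLS = {
    "search_flights": lambda origin, destination, date: [{"flight": "XX123", "price": 204}],
}

def call_tool(tool_name: str, arguments: dict, trace: list) -> object:
    """Issue a structured tool call and append a fully inspectable trace entry for it."""
    result = TOOLS[tool_name](**arguments)
    trace.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tool": tool_name,
        "arguments": arguments,       # exact inputs the agent committed to
        "result": result,             # exact outputs it acted on
    })
    return result

trace = []
call_tool("search_flights", {"origin": "LAX", "destination": "JFK", "date": "2025-08-09"}, trace)
print(json.dumps(trace, indent=2))    # every decision is logged in an editable, auditable form
```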
Generative AI is fine for user chat; Neuro-Symbolic AI is mandatory when an agent operates on behalf of a business or other entity. Across 111 one-shot scenarios, Apollo-1 delivers a full-conversation pass on 92 (82.9%); Gemini 2.5-Flash manages 24 (21.6%), underscoring the need for a neuro-symbolic core. The advantage grows as queries add pricing arithmetic or routing constraints, signalling Apollo-1’s readiness for production settings where predictable, auditable behaviour is non-negotiable.
Apollo-1’s neuro-symbolic architecture delivers the natural-language quality users expect and the steerability, policy adherence, and compliance companies require.
Sample conversation logs

Hi! I need round-trip flights from MIA to NYC.
Where does the return leg depart from?
Result: Fail (0/2)
Hi! I need round-trip flights from BOS to WAS.
Where does the return leg depart from?
Result: Fail (0/2)
Hi! I need round-trip flights from LON to PAR.
Where does the return leg depart from?
Result: Fail (0/2)
Hi! I need to find round-trip flights from LON to PAR in August.
What is the duration of each leg?
What is the baggage allowance for each leg?
Result: Fail (0/3)
Hi! I need to find round-trip flights from MIA to NYC in August.
What is the duration of each leg?
What is the baggage allowance for each leg?
Result: Fail (0/3)
Hi! I need to find round-trip flights from BOS to WAS in August.
What is the duration of each leg?
What is the baggage allowance for each leg?
Result: Fail (0/3)
BOS → DUB 1 Nov – 8 Nov, basic-econ
Add one checked bag each way—what’s the new total?
Result: Fail (0/2)
MCO → DFW 10 Jul – 16 Jul, basic-econ.
Add one checked bag each way—what’s the new total?
Result: Fail (0/2)
LAS → LAX 11 Aug – 14 Aug, basic-econ.
Add one checked bag each way—what’s the new total?
Result: Fail (0/2)
Looking to fly tomorrow from BOS to NYC.
Need to get to NYC on time for lunch.
Is there a seat with more legroom on this flight?
Result: Fail (1/3)
Looking to fly tomorrow from LAX to SFO.
Need to get to SFO on time for lunch.
Is there a seat with more legroom on this flight?
Result: Fail (0/3)
Looking to fly tomorrow from MAD to BER.
Need to get to BER on time for lunch.
Is there a seat with more legroom on this flight?
Result: Fail (0/3)
Result: Pass (2/2) ×6 · Pass (3/3) ×6
Click here to explore Apollo-1’s Eval Playground, where you can experience its capabilities firsthand. Interact with the model and its Reasoning Panel, navigate real-time conversations across multiple evaluation domains, and review live benchmark trajectories (passkey required; request access).
1 Huang, K-H.; Prabhakar, A.; Thorat, O.; Agarwal, D.; Choubey, P.K.; Mao, Y.; Savarese, S.; Xiong, C.; Wu, C-S. CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions. Salesforce AI Research (2025). arXiv:2505.18878.