News

Apollo-1 One-Shots 83% of 111 Live Google-Flights Scenarios — A Neuro-Symbolic AI Showcase

Generative AI unlocked fluent chat. Apollo-1 adds something generative models can’t: verifiable action.
07/12/2025

1) Summary

Generative AI excels when acting on behalf of a user, but it falters when an agent must act for an entity, whether it’s an insurance company, a brand, or a travel platform. That’s where Neuro-Symbolic AI becomes essential. Apollo-1, our neuro-symbolic foundation model for conversational agents, is engineered to operate on behalf of an entity, not an individual. Its neuro-symbolic core merges generative fluency with reliable actionability, giving organizations the transparency, traceability, and steerability that conventional LLM co-pilots lack.

To demonstrate Neuro-Symbolic AI’s superiority over purely generative models, we wired Apollo-1 into the same real-time Google Flights feed powering Gemini 2.5-Flash and ran 111 multi-turn conversations across both models. The head-to-head below shows how that design translates into real-world wins over Gemini when every query hits identical inventory, prices, and schedules:

Apollo-1: 92 / 111 full-conversation passes (82.9%)

Gemini 2.5-Flash: 24 / 111 full-conversation passes (21.6%)

Most leaderboards cherry-pick the best answer from a batch of generations. We do the opposite. Every prompt gets one shot—exactly what a live user would see—and that single response is what we score.

2) Introduction

Traditional LLM co-pilots (ChatGPT, Gemini, Claude) are superb personal assistants. But when the agent is the business, stochastic phrasing, hallucinated facts, and opaque reasoning become existential risks. ChatGPT showed the world that machines can talk. But when a machine must also reliably act—book flights, move money, enforce policy—fluent conversation isn’t enough. Generative AI remains an unreliable black box in scenarios where conversational agents must act on behalf of entities. Yet these scenarios represent a substantial and economically critical portion of all potential AI applications.

For AI to handle these critical interactions, we must transcend purely generative models and embrace a new architecture: Neuro-Symbolic AI. Apollo-1 marks the beginning of this transformation. In recent benchmarks and evaluation tests, such as the one detailed below, Apollo-1 consistently outperformed state-of-the-art generative models by wide margins, precisely on tasks that require conversational fluency combined with dependable, transparent action.

3) The Limitations of Generative-Only AI

The fundamental shortcomings of generative AI have become increasingly visible (and costly):

  • Opaque Reasoning: Generative models act as black boxes, leaving no clear explanation of their decisions. This is unacceptable when auditability and accountability matter.
  • Volatile Outputs: Even minor changes in input can drastically alter responses—a hazard in banking, healthcare, customer service, and beyond.
  • Policy Drift: Generative AI regularly ignores or misinterprets critical instructions, making it unsuitable for regulated or high-stakes scenarios.
  • Fragile Tool Calls: API calls and complex interactions often fail, especially in multi-step tasks such as bookings and transactions.
  • Costly Retraining: Corrections require costly, time-consuming retraining on massive datasets.

Generative AI alone simply cannot fulfill the promises the world expects from advanced artificial intelligence. Recent academic work underscores how far today’s LLM agents are from that bar. A May 2025 paper, “CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions”, reports that state-of-the-art agents score only 58% on single-turn tasks and collapse to ≈35% once the dialogue spans multiple turns, leading the authors to highlight a “significant gap between current LLM capabilities and real-world enterprise demands.”1

4) Neuro-Symbolic AI: Bridging the Gap

For decades, AI researchers debated two distinct paths: symbolic AI, emphasizing rules, logic, and explicit reasoning; and neural networks, skilled at pattern recognition and statistical learning. Both have strengths, both have critical limitations. Purely symbolic AI struggles with natural, conversational interaction. Purely neural, generative AI falters when trust, reliability, and consistency are non-negotiable. The long-standing goal was to combine the two—leveraging each to offset the other’s weaknesses. With Apollo-1, that vision is now reality: conversational agents that converse fluently and act reliably.

Neuro-Symbolic AI bridges the gap between Generative AI’s linguistic capabilities and Symbolic AI’s structured reasoning, unlocking actionable, reliable, transparent, and steerable AI interactions. This second wave moves beyond conversation to dependable execution, enabling conversational agents that work for entities, not just end-users. 

5) Results in a Snapshot

[Figure: results snapshot – Apollo-1 vs Gemini 2.5-Flash full-conversation pass rates]

6) Evaluation Method – Real-World, One-Take Scoring

Rule 01 – One shot per scenario: The agent gets zero retries; whatever it says in each turn is final.
Rule 02 – Multi-turn pass rule: A scenario passes only if every step (R-1…R-3) is correct. One miss → scenario fail.
Rule 03 – 111 customer scenarios: Book a seat, add a bag, pick the greenest flight, reroute through a hub, quote an upgrade, etc.
Rule 04 – Live Google Flights backend: Both agents hit the same real-time inventory, so any gap is pure reasoning and control.
Rule 05 – Sampling window: Runs executed between May 30 and June 2, 2025; results reflect that snapshot.

We reuse τ-Bench’s philosophy in miniature: no retries, binary checks that mirror real customer journeys. And because there’s no second chance, the outcome is a single, unforgiving line in the sand—a simple, reproducible, brutally honest snapshot of how an agent really behaves when someone is on the other side of the chat box.
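For readers who want the scoring rule in executable form, here is a minimal sketch of the one-take loop described above. It is illustrative only: the `agent` callable and the per-step `check` functions are hypothetical stand-ins, not Apollo-1’s actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    prompt: str                       # tester utterance for this turn
    check: Callable[[str], bool]      # binary pass criterion applied to the reply

@dataclass
class Scenario:
    steps: List[Step]                 # R-1 ... R-3 in order

def run_scenario(agent: Callable[[str], str], scenario: Scenario) -> bool:
    """One shot per turn: the first reply is final, and a single miss fails the scenario."""
    for step in scenario.steps:
        reply = agent(step.prompt)    # no retries, no best-of-n sampling
        if not step.check(reply):
            return False              # multi-turn pass rule: every step must be correct
    return True

def pass_rate(agent: Callable[[str], str], scenarios: List[Scenario]) -> float:
    passes = sum(run_scenario(agent, s) for s in scenarios)
    return passes / len(scenarios)    # e.g. 92 / 111 ≈ 0.829
```

Because a single failed check fails the whole scenario, multi-turn pass rates under this rule are strictly harsher than per-turn accuracy.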

7) Example Scenario

Scoring criteria — Scenario 28: “Same-day One-Way, Arrive Before Lunch”

R-1
  Tester prompt: “One-way LAX → JFK overnight on 9 Aug 2025 (arrive morning 10 Aug), 1 adult, economy.”
  Check: Return at least one qualifying red-eye itinerary.
  Pass criterion: Departs 9 Aug, arrives next day before 12:00 PM, shows total price.

R-2
  Tester prompt: “What’s the cost to upgrade the cheapest red-eye?”
  Check: Quote upgrade price (or say none) for that same itinerary.
  Pass criterion: States whether an upgrade exists and the correct price delta.

R-3
  Tester prompt: “If I skip the upgrade and instead add one checked bag, what’s the total cost?”
  Check: Add the airline’s checked-bag fee and compute new grand total.
  Pass criterion: Gives per-bag fee, adds it to base fare, and returns the exact all-in total.

Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3).

Scenario Reward Logs

Apollo-1
  Red-eye found? ✅ 22:00 → 06:30, $204
  Upgrade price? ✅ +$80 (Blue Extra)
  Bag math right? ✅ Fare + $40 bag = $244
  Score: 3/3 · Reward: 1

Gemini
  Red-eye found? ✅ 20:47 → 05:30, $139
  Upgrade price? ❌ Pulled upgrade from wrong airline
  Bag math right? ❌ Bag fee range, no total
  Score: 1/3 · Reward: 0
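Written out as code, the three binary checks above are mechanical. The sketch below is a hypothetical rendering of the pass criteria (the field names, argument shapes, and the $0.01 tolerance are assumptions); the asserts replay the Apollo-1 row of the reward log, including the bag arithmetic $204 + $40 = $244.

```python
from typing import Optional

def check_r1(depart_date: str, arrive_date: str, arrive_hour: int,
             total_price: Optional[float]) -> bool:
    """R-1: departs 9 Aug 2025, arrives the next day before 12:00, and a total price is shown."""
    return (depart_date == "2025-08-09" and arrive_date == "2025-08-10"
            and arrive_hour < 12 and total_price is not None)

def check_r2(quoted_for_same_itinerary: bool, says_upgrade_exists: bool,
             quoted_delta: Optional[float], true_delta: Optional[float]) -> bool:
    """R-2: upgrade quoted for the same itinerary, with the correct price delta (or a correct 'none')."""
    if not quoted_for_same_itinerary:
        return False
    if true_delta is None:                      # no upgrade actually available
        return not says_upgrade_exists
    return says_upgrade_exists and quoted_delta == true_delta

def check_r3(quoted_total: float, base_fare: float, bag_fee: float) -> bool:
    """R-3: the exact all-in total equals base fare plus the per-bag fee."""
    return abs(quoted_total - (base_fare + bag_fee)) < 0.01

# Worked example using the Apollo-1 row of the reward log:
assert check_r1("2025-08-09", "2025-08-10", arrive_hour=6, total_price=204.0)   # 22:00 → 06:30, $204
assert check_r2(True, True, quoted_delta=80.0, true_delta=80.0)                  # +$80 Blue Extra
assert check_r3(quoted_total=244.0, base_fare=204.0, bag_fee=40.0)               # $204 + $40 = $244
# Gemini fails R-2 (upgrade pulled from the wrong airline) and R-3 (fee range, no exact total).
```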

8) Benchmark Overview & Group Analysis 

The 111 scenarios are grouped around five distinct technical competencies that an enterprise-grade flight agent must master. Each group isolates one capability, so a perfect model would score 100% within every group:

  • Baseline Retrieval – Same-/next-day queries with minimal dialogue: slot-filling from a single prompt, immediate cheapest-non-stop extraction, earliest-arrival lookup, basic aircraft or alliance attribution, and quick FAA-safety or risk answers.
  • Core Search – Two-turn clarification flows and flexible-window optimization: resolving missing dates or trip type, iterating on “next flight / land-by” constraints, and selecting the absolute lowest fare from a live inventory once parameters are locked.
  • Ancillary Costs – End-to-end fee arithmetic: parsing carrier-specific price tables (checked bags, pets, seat upgrades, Wi-Fi, refundability), aggregating per-leg vs per-trip charges, recomputing totals after user changes, and preserving fare-rule validity.
  • Cabin Precision – Fare-family graph reasoning and cabin hierarchy mapping (Economy → Premium Eco → Business/First): real-time availability checks, upgrade-delta calculation, group-size propagation through re-quotes, and class-of-service validation for non-stop constraints.
  • Constrained Planning – Multi-constraint itinerary optimization: enforcing strict arrival/departure windows, airline/airport and lay-over filters, short-connection thresholds, hub or region routing, CO₂ or environmental factors, historical delay-risk ranking, and fallback-option generation when no direct match exists.
[Figure: per-group pass rates – Apollo-1 vs Gemini 2.5-Flash]
Baseline Retrieval (Scenarios 37–39 · 70–72 · 91–93 · 100–111)
  What it measures: Same-/next-day look-ups, cheapest non-stop pick, earliest-arrival check, basic aircraft/alliance flags, FAA-safety acknowledgement. e.g. “Cheapest BOS→MIA next Saturday”, “Land before midnight despite FAA alert”
  Apollo-1: 85.7% · Gemini: 71.4%

Core Search (Scenarios 1–6 · 25–27 · 40–42 · 64–66 · 73–78)
  What it measures: Multi-turn date/trip-type clarification, flexible-window lowest fare, next-flight/arrival-window filters. e.g. round-trip MIA⇄NYC date probe; 1–7 Jul cheapest RT
  Apollo-1: 95.2% · Gemini: 28.6%

Ancillary Costs (Scenarios 7–9 · 28–30 · 34–36 · 52–54 · 61–63 · 82–84 · 94–96)
  What it measures: Fee arithmetic: checked bags, pet-in-cabin, upgrade-vs-bag deltas, Wi-Fi/refundability tiers, OW-vs-RT price optimization. e.g. basic-econ + checked bag; red-eye upgrade vs bag
  Apollo-1: 57.1% · Gemini: 0%

Cabin Precision (Scenarios 13–18 · 46–51 · 85–90 · 97–99)
  What it measures: Fare-family graph reasoning, Premium/Biz quotes, group upgrades, non-stop cabin comparisons, multi-city upgrade pricing. e.g. Economy→Premium delta on non-stop JFK→LAX
  Apollo-1: 90.5% · Gemini: 0%

Constrained Planning (Scenarios 10–12 · 19–24 · 31–33 · 43–45 · 55–60 · 67–69 · 79–81)
  What it measures: Multi-constraint routing: arrival/departure cut-offs, red-eyes + Wi-Fi, CO₂ & offset, hub/layover filters, delay-risk backups, national-park & festival trips.
  Apollo-1: 85.2% · Gemini: 11.1%

Totals (n = 111 scenarios): Apollo-1 82.9% · Gemini 21.6%
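The group percentages are simple pass/scenario ratios. The sketch below reproduces them, assuming per-group pass counts obtained by inverting the published percentages against the group sizes implied by the scenario-ID lists (four groups of 21 plus one of 27 = 111 scenarios); the pass counts themselves are therefore inferred, not published.

```python
# Each tuple: (group name, Apollo-1 passes, Gemini passes, scenarios in group).
# Pass counts are inferred from the published percentages and group sizes.
GROUPS = [
    ("Baseline Retrieval",   18, 15, 21),
    ("Core Search",          20,  6, 21),
    ("Ancillary Costs",      12,  0, 21),
    ("Cabin Precision",      19,  0, 21),
    ("Constrained Planning", 23,  3, 27),
]

def report(groups):
    total_a = total_g = total_n = 0
    for name, a, g, n in groups:
        print(f"{name:22s} Apollo-1 {a/n:6.1%}   Gemini {g/n:6.1%}")
        total_a, total_g, total_n = total_a + a, total_g + g, total_n + n
    totals_label = f"Totals (n={total_n})"
    print(f"{totals_label:22s} Apollo-1 {total_a/total_n:6.1%}   Gemini {total_g/total_n:6.1%}")

report(GROUPS)   # reproduces the 82.9% vs 21.6% headline from 92 and 24 full-conversation passes
```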

9) Evaluation Analysis

Apollo-1 Strengths Across the Five Competency Buckets

  • Context-aware dialogue control – in Core Search scenarios Apollo-1 clarifies missing slots (dates, O/W vs R/T, party size) before searching, then keeps that context intact through follow-up constraints.
  • High-fidelity data retrieval – consistently surfaces the exact fare family, bag fee, upgrade delta or aircraft type demanded in Ancillary Costs and Cabin Precision scenarios.
  • Multi-constraint filtering – satisfies arrival-before / depart-after windows, hub-routing, Wi-Fi-only, airline-specific and environmental (CO₂) filters in a single pass, dominating the Constrained Planning bucket (a filtering sketch follows this list).
  • Complete itineraries – returns full outbound + return legs where required, a failure point that still plagues Gemini in every bucket except Baseline Retrieval.
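To make “in a single pass” concrete, here is a minimal, hypothetical sketch of the kind of constraint filter such an agent must apply to live inventory. The flight fields and constraint names are illustrative assumptions, not Apollo-1’s internal representation.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Flight:
    airline: str
    depart_hour: int         # local 24h departure hour
    arrive_hour: int         # local 24h arrival hour
    via_hub: Optional[str]   # connecting hub IATA code, None for non-stop
    has_wifi: bool
    co2_kg: float
    price: float

@dataclass
class Constraints:
    arrive_before: Optional[int] = None
    depart_after: Optional[int] = None
    allowed_airlines: Optional[Set[str]] = None
    require_wifi: bool = False
    max_co2_kg: Optional[float] = None
    required_hub: Optional[str] = None

def satisfies(f: Flight, c: Constraints) -> bool:
    """Apply every active constraint in one pass; any violation rejects the flight."""
    return (
        (c.arrive_before is None or f.arrive_hour < c.arrive_before)
        and (c.depart_after is None or f.depart_hour > c.depart_after)
        and (c.allowed_airlines is None or f.airline in c.allowed_airlines)
        and (not c.require_wifi or f.has_wifi)
        and (c.max_co2_kg is None or f.co2_kg <= c.max_co2_kg)
        and (c.required_hub is None or f.via_hub == c.required_hub)
    )

def best_match(inventory: List[Flight], c: Constraints) -> Optional[Flight]:
    """Cheapest flight that meets all constraints, or None to trigger fallback options."""
    candidates = [f for f in inventory if satisfies(f, c)]
    return min(candidates, key=lambda f: f.price) if candidates else None
```

The hard part the benchmark actually probes is accumulating those constraints correctly across turns of free-form dialogue; the filtering itself is the easy half.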

Areas that Still Need Work

  • True multi-city pricing flows – both models stumble on leg-by-leg bag or upgrade math in scenarios that chain three or more segments.
  • Long-tail ancillaries – niche items such as carrier-published CO₂ offset add-ons or exotic pet-in-cabin fees remain patchy (Gemini far more so).

Gemini’s Persistent Gaps

  • Fails slot clarification: often assumes round-trip, ignores supplied date windows, or answers with only the outbound leg.
  • Zero full-conversation passes in Ancillary Costs and Cabin Precision buckets.

10) Why Apollo-1 Pulls Ahead 

Apollo-1 is designed to power conversational agents acting on behalf of entities across industries and use-cases. Apollo-1 enables advanced native tool use through reliable, structured symbolic interactions with complex external systems and APIs. It provides comprehensive traceability, with each decision logged in a fully inspectable, editable format. Finally, it offers steerability and controllability, allowing organizations to consistently steer agents toward desired behaviors by providing granular context, instructions, and guidelines. To dive into Apollo-1’s architecture, click here.

[Diagram: Apollo-1 neuro-symbolic architecture]

Apollo-1’s Key Neuro-Symbolic Advantages:

  • Traceability: Full transparency—each reasoning step is logged, inspectable, and editable.
  • Steerability and Controllability: Operators can steer agent behavior by injecting granular context, instructions, and structured guidelines instantly.
  • Native Tool Use: Reliable interactions with complex APIs and external systems, driven by clearly defined symbolic entities (e.g., intent, constraints, context); an illustrative sketch follows this list.
  • Continuous Learning from Human Feedback: Continuous feedback loops for live fine-tuning of the NLP modules.
  • Conversational Fluency: Maintains the high linguistic performance of top generative models.
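As a rough illustration of what symbolic entities plus per-step traceability can look like, consider the sketch below. The state fields, trace format, and stubbed tool call are assumptions for exposition only and do not describe Apollo-1’s real interfaces.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List

@dataclass
class SymbolicState:
    """Explicit, inspectable state instead of free-form text: intent, constraints, context."""
    intent: str                                                   # e.g. "book_flight"
    constraints: Dict[str, str] = field(default_factory=dict)     # e.g. {"arrive_before": "12:00"}
    context: Dict[str, str] = field(default_factory=dict)         # e.g. {"origin": "LAX", "dest": "JFK"}

@dataclass
class TraceEvent:
    step: str        # e.g. "tool_call:search_flights"
    payload: Dict    # exact structured arguments sent or result received

trace: List[TraceEvent] = []

def call_tool(name: str, state: SymbolicState) -> Dict:
    """Tool arguments are built from the symbolic state, and every call is logged before it runs."""
    args = {"intent": state.intent, **state.constraints, **state.context}
    trace.append(TraceEvent(step=f"tool_call:{name}", payload=args))
    result = {"status": "ok", "tool": name}   # stub result keeps the sketch self-contained
    trace.append(TraceEvent(step=f"tool_result:{name}", payload=result))
    return result

state = SymbolicState(intent="book_flight",
                      constraints={"arrive_before": "12:00", "cabin": "economy"},
                      context={"origin": "LAX", "dest": "JFK", "date": "2025-08-09"})
call_tool("search_flights", state)
print(json.dumps([asdict(e) for e in trace], indent=2))   # the decision log, ready for audit or edit
```

Because the state is structured data rather than latent text, an operator can inspect or edit constraints before a call fires and replay the trace afterwards, which is the practical meaning of the steerability and traceability claims above.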

11) Bottom Line 

Generative AI is fine for user chat; Neuro-Symbolic AI is mandatory when an agent operates on behalf of a business or an entity. Across 111 one-shot scenarios, Apollo-1 delivers a full-conversation pass on 92 (82.9%); Gemini 2.5-Flash manages 24 (21.6%), underscoring the need for a neuro-symbolic core. The advantage grows as queries add pricing arithmetic or routing constraints, signaling Apollo-1’s readiness for production settings where predictable, auditable behavior is mandatory.

Apollo-1’s neuro-symbolic architecture delivers the natural language quality users expect and the steerability, policy adherence and compliance companies require.



Appendix A: Reward Logs

View Full Scenario List

Scenario prompts and per-step scores (Gemini 2.5-Flash vs Apollo-1):

R1: “Hi! I need round-trip flights from MIA to NYC.”
R2: “Where does the return leg depart from?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “Hi! I need round-trip flights from BOS to WAS”
R2: “Where does the return leg depart from?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “Hi! I need round-trip flights from LON to PAR.”
R2: “Where does the return leg depart from?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “Hi! I need to find round-trip flights from LON to PAR in August.”
R2: “What is the duration of each leg?”
R3: “What is the baggage allowance for each leg?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “Hi! I need to find round-trip flights from MIA to NYC in August”
R2: “What is the duration of each leg?”
R3: “What is the baggage allowance for each leg?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “Hi! I need to find round-trip flights from BOS to WAS in August.”
R2: “What is the duration of each leg?”
R3: “What is the baggage allowance for each leg?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “BOS → DUB 1 Nov – 8 Nov, basic-econ”
R2: “Add one checked bag each way—what’s the new total?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “MCO → DFW 10 Jul – 16 Jul, basic-econ.”
R2: “Add one checked bag each way—what’s the new total?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “LAS → LAX 11 Aug – 14 Aug, basic-econ.”
R2: “Add one checked bag each way—what’s the new total?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “Looking to fly tomorrow from BOS to NYC.”
R2: “Need to get to NYC on time for lunch.”
R3: “Is there a seat with more legroom on this flight?”
Gemini 2.5-Flash: R1 0 · R2 1 · R3 0 → Fail (1/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “Looking to fly tomorrow from LAX to SFO.”
R2: “Need to get to SFO on time for lunch.”
R3: “Is there a seat with more legroom on this flight?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “Looking to fly tomorrow from MAD to BER.”
R2: “Need to get to BER on time for lunch.”
R3: “Is there a seat with more legroom on this flight?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)


Appendix B : Trajectories

View Full Trajectories 


Click here to explore Apollo-1’s Eval Playground, where you can experience its capabilities firsthand. Interact with the model and its Reasoning Panel, navigate real-time conversations across multiple evaluation domains, and review live benchmark trajectories (passkey required; request access).


References

1 Huang, K.-H.; Prabhakar, A.; Thorat, O.; Agarwal, D.; Choubey, P. K.; Mao, Y.; Savarese, S.; Xiong, C.; Wu, C.-S. CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions. Salesforce AI Research (2025). arXiv:2505.18878.
