News

Apollo-1 One-Shots 83% of 111 Live Google-Flights Scenarios — A Neuro-Symbolic AI Showcase

Generative AI unlocked fluent chat. Apollo-1 adds something generative models can’t: verifiable action.
07/12/2025

1) Summary

Generative AI excels when acting on behalf of a user, but it falters when an agent must act for an entity, whether it’s an insurance company, a brand, or a travel platform. That’s where Neuro-Symbolic AI becomes essential. Apollo-1, our neuro-symbolic foundation model for conversational agents, is engineered to operate on behalf of an entity, not an individual. Its neuro-symbolic core merges generative fluency with reliable actionability, giving organizations the transparency, traceability, and steerability that conventional LLM co-pilots lack.

To demonstrate Neuro-Symbolic AI’s superiority over purely generative models, we wired Apollo-1 into the same real-time Google Flights feed powering Gemini 2.5-Flash and ran 111 multi-turn conversations across both models. The head-to-head below shows how that design translates into real-world wins over Gemini when every query hits identical inventory, prices, and schedules:

Apollo-1: 92 / 111 full-conversation passes (82.9%)

Gemini 2.5-Flash: 24 / 111 full-conversation passes (21.6%)

Most leaderboards cherry-pick the best answer from a batch of generations. We do the opposite. Every prompt gets one shot—exactly what a live user would see—and that single response is what we score.

2) Introduction

Traditional LLM co-pilots (ChatGPT, Gemini, Claude) are superb personal assistants. But when the agent is the business, stochastic phrasing, hallucinated facts, and opaque reasoning become existential risks. ChatGPT showed the world that machines can talk. But when a machine must also reliably act—book flights, move money, enforce policy—fluent conversation isn’t enough. Generative AI remains an unreliable black box in scenarios where conversational agents must act on behalf of entities. Yet these scenarios represent a substantial and economically critical portion of all potential AI applications.

For AI to handle these critical interactions, we must transcend purely generative models and embrace a new architecture: Neuro-Symbolic AI. Apollo-1 marks the beginning of this transformation. In recent benchmarks and evaluation tests, such as the one detailed below, Apollo-1 consistently outperformed state-of-the-art generative models by wide margins, precisely on tasks that require conversational fluency combined with dependable, transparent action.

3) The Limitations of Generative-Only AI

The fundamental shortcomings of generative AI have become increasingly visible (and costly):

  • Opaque Reasoning: Generative models act as black boxes, leaving no clear explanation of their decisions. This is unacceptable when auditability and accountability matter.
  • Volatile Outputs: Even minor changes in input can drastically alter responses—a hazard in banking, healthcare, customer service, and beyond.
  • Policy Drift: Generative AI regularly ignores or misinterprets critical instructions, making it unsuitable for regulated or high-stakes scenarios.
  • Fragile Tool Calls: API calls and complex interactions often fail, especially in multi-step tasks such as bookings and transactions.
  • Costly Retraining: Corrections require costly, time-consuming retraining on massive datasets.

Generative AI alone simply cannot fulfill the promises the world expects from advanced artificial intelligence. Recent academic work underscores how far today’s LLM agents are from that bar. A May 2025 paper, “CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions”, reports that state-of-the-art agents score only 58% on single-turn tasks and collapse to ≈35% once the dialogue spans multiple turns, leading the authors to highlight a “significant gap between current LLM capabilities and real-world enterprise demands.”1

4) Neuro-Symbolic AI: Bridging the Gap

For decades, AI researchers debated two distinct paths: symbolic AI, emphasizing rules, logic, and explicit reasoning; and neural networks, skilled at pattern recognition and statistical learning. Both have strengths, both have critical limitations. Purely symbolic AI struggles with natural, conversational interaction. Purely neural, generative AI falters when trust, reliability, and consistency are non-negotiable. The long-standing goal was to combine the two—leveraging each to offset the other’s weaknesses. With Apollo-1, that vision is now reality: conversational agents that converse fluently and act reliably.

Neuro-Symbolic AI bridges the gap between Generative AI’s linguistic capabilities and Symbolic AI’s structured reasoning, unlocking actionable, reliable, transparent, and steerable AI interactions. This second wave moves beyond conversation to dependable execution, enabling conversational agents that work for entities, not just end-users. 

5) Results in a Snapshot

[Figure: results snapshot – Apollo-1 vs Gemini 2.5-Flash full-conversation pass rates]

6) Evaluation Method – Real-World, One-Take Scoring

Rule 01 – One shot per scenario: The agent gets zero retries; whatever it says in each turn is final.
Rule 02 – Multi-turn pass rule: A scenario passes only if every step (R-1…R-3) is correct. One miss → scenario fail.
Rule 03 – 111 customer scenarios: Book a seat, add a bag, pick the greenest flight, reroute through a hub, quote an upgrade, etc.
Rule 04 – Live Google Flights backend: Both agents hit the same real-time inventory, so any gap is pure reasoning and control.
Rule 05 – Sampling window: Runs executed between May 30 and June 2, 2025; results reflect that snapshot.

We reuse τ-Bench’s philosophy in miniature: no retries, binary checks that mirror real customer journeys. And because there’s no second chance, the outcome is a single, unforgiving line in the sand—a simple, reproducible, brutally honest snapshot of how an agent really behaves when someone is on the other side of the chat box.
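For readers who want the scoring rule in executable form, here is a minimal sketch of the one-take loop described above. It is illustrative only: the `agent` callable and the per-step `check` functions are hypothetical stand-ins, not Apollo-1’s actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    prompt: str                       # tester utterance for this turn
    check: Callable[[str], bool]      # binary pass criterion applied to the reply

@dataclass
class Scenario:
    steps: List[Step]                 # R-1 ... R-3 in order

def run_scenario(agent: Callable[[str], str], scenario: Scenario) -> bool:
    """One shot per turn: the first reply is final, and a single miss fails the scenario."""
    for step in scenario.steps:
        reply = agent(step.prompt)    # no retries, no best-of-n sampling
        if not step.check(reply):
            return False              # multi-turn pass rule: every step must be correct
    return True

def pass_rate(agent: Callable[[str], str], scenarios: List[Scenario]) -> float:
    passes = sum(run_scenario(agent, s) for s in scenarios)
    return passes / len(scenarios)    # e.g. 92 / 111 ≈ 0.829
```

Because a single failed check fails the whole scenario, multi-turn pass rates under this rule are strictly harsher than per-turn accuracy.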

7) Example Scenario

Scoring criteria — Scenario 28: “Same-day One-Way, Arrive Before Lunch”

R-1
  Tester prompt: “One-way LAX → JFK overnight on 9 Aug 2025 (arrive morning 10 Aug), 1 adult, economy.”
  Check: Return at least one qualifying red-eye itinerary.
  Pass criterion: Departs 9 Aug, arrives next day before 12:00 PM, shows total price.

R-2
  Tester prompt: “What’s the cost to upgrade the cheapest red-eye?”
  Check: Quote upgrade price (or say none) for that same itinerary.
  Pass criterion: States whether an upgrade exists and the correct price delta.

R-3
  Tester prompt: “If I skip the upgrade and instead add one checked bag, what’s the total cost?”
  Check: Add the airline’s checked-bag fee and compute new grand total.
  Pass criterion: Gives per-bag fee, adds it to base fare, and returns the exact all-in total.

Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3).

Scenario Reward Logs

Apollo-1
  Red-eye found? ✅ 22:00 → 06:30, $204
  Upgrade price? ✅ +$80 (Blue Extra)
  Bag math right? ✅ Fare + $40 bag = $244
  Score: 3/3 · Reward: 1

Gemini
  Red-eye found? ✅ 20:47 → 05:30, $139
  Upgrade price? ❌ Pulled upgrade from wrong airline
  Bag math right? ❌ Bag fee range, no total
  Score: 1/3 · Reward: 0
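Written out as code, the three binary checks above are mechanical. The sketch below is a hypothetical rendering of the pass criteria (the field names, argument shapes, and the $0.01 tolerance are assumptions); the asserts replay the Apollo-1 row of the reward log, including the bag arithmetic $204 + $40 = $244.

```python
from typing import Optional

def check_r1(depart_date: str, arrive_date: str, arrive_hour: int,
             total_price: Optional[float]) -> bool:
    """R-1: departs 9 Aug 2025, arrives the next day before 12:00, and a total price is shown."""
    return (depart_date == "2025-08-09" and arrive_date == "2025-08-10"
            and arrive_hour < 12 and total_price is not None)

def check_r2(quoted_for_same_itinerary: bool, says_upgrade_exists: bool,
             quoted_delta: Optional[float], true_delta: Optional[float]) -> bool:
    """R-2: upgrade quoted for the same itinerary, with the correct price delta (or a correct 'none')."""
    if not quoted_for_same_itinerary:
        return False
    if true_delta is None:                      # no upgrade actually available
        return not says_upgrade_exists
    return says_upgrade_exists and quoted_delta == true_delta

def check_r3(quoted_total: float, base_fare: float, bag_fee: float) -> bool:
    """R-3: the exact all-in total equals base fare plus the per-bag fee."""
    return abs(quoted_total - (base_fare + bag_fee)) < 0.01

# Worked example using the Apollo-1 row of the reward log:
assert check_r1("2025-08-09", "2025-08-10", arrive_hour=6, total_price=204.0)   # 22:00 → 06:30, $204
assert check_r2(True, True, quoted_delta=80.0, true_delta=80.0)                  # +$80 Blue Extra
assert check_r3(quoted_total=244.0, base_fare=204.0, bag_fee=40.0)               # $204 + $40 = $244
# Gemini fails R-2 (upgrade pulled from the wrong airline) and R-3 (fee range, no exact total).
```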

8) Benchmark Overview & Group Analysis 

The 111 scenarios are grouped around five distinct technical competencies that an enterprise-grade flight agent must master. Each group isolates one capability, so a perfect model would score 100% within every group:

  • Baseline Retrieval – Same-/next-day queries with minimal dialogue: slot-filling from a single prompt, immediate cheapest-non-stop extraction, earliest-arrival lookup, basic aircraft or alliance attribution, and quick FAA-safety or risk answers.
  • Core Search – Two-turn clarification flows and flexible-window optimization: resolving missing dates or trip type, iterating on “next flight / land-by” constraints, and selecting the absolute lowest fare from a live inventory once parameters are locked.
  • Ancillary Costs – End-to-end fee arithmetic: parsing carrier-specific price tables (checked bags, pets, seat upgrades, Wi-Fi, refundability), aggregating per-leg vs per-trip charges, recomputing totals after user changes, and preserving fare-rule validity.
  • Cabin Precision – Fare-family graph reasoning and cabin hierarchy mapping (Economy → Premium Eco → Business/First): real-time availability checks, upgrade-delta calculation, group-size propagation through re-quotes, and class-of-service validation for non-stop constraints.
  • Constrained Planning – Multi-constraint itinerary optimization: enforcing strict arrival/departure windows, airline/airport and lay-over filters, short-connection thresholds, hub or region routing, CO₂ or environmental factors, historical delay-risk ranking, and fallback-option generation when no direct match exists.
[Figure: per-group pass rates – Apollo-1 vs Gemini 2.5-Flash]
Baseline Retrieval (Scenarios 37–39 · 70–72 · 91–93 · 100–111)
  What it measures: Same-/next-day look-ups, cheapest non-stop pick, earliest-arrival check, basic aircraft/alliance flags, FAA-safety acknowledgement. e.g. “Cheapest BOS→MIA next Saturday”, “Land before midnight despite FAA alert”
  Apollo-1: 85.7% · Gemini: 71.4%

Core Search (Scenarios 1–6 · 25–27 · 40–42 · 64–66 · 73–78)
  What it measures: Multi-turn date/trip-type clarification, flexible-window lowest fare, next-flight/arrival-window filters. e.g. round-trip MIA⇄NYC date probe; 1–7 Jul cheapest RT
  Apollo-1: 95.2% · Gemini: 28.6%

Ancillary Costs (Scenarios 7–9 · 28–30 · 34–36 · 52–54 · 61–63 · 82–84 · 94–96)
  What it measures: Fee arithmetic: checked bags, pet-in-cabin, upgrade-vs-bag deltas, Wi-Fi/refundability tiers, OW-vs-RT price optimization. e.g. basic-econ + checked bag; red-eye upgrade vs bag
  Apollo-1: 57.1% · Gemini: 0%

Cabin Precision (Scenarios 13–18 · 46–51 · 85–90 · 97–99)
  What it measures: Fare-family graph reasoning, Premium/Biz quotes, group upgrades, non-stop cabin comparisons, multi-city upgrade pricing. e.g. Economy→Premium delta on non-stop JFK→LAX
  Apollo-1: 90.5% · Gemini: 0%

Constrained Planning (Scenarios 10–12 · 19–24 · 31–33 · 43–45 · 55–60 · 67–69 · 79–81)
  What it measures: Multi-constraint routing: arrival/departure cut-offs, red-eyes + Wi-Fi, CO₂ & offset, hub/layover filters, delay-risk backups, national-park & festival trips.
  Apollo-1: 85.2% · Gemini: 11.1%

Totals (n = 111 scenarios): Apollo-1 82.9% · Gemini 21.6%
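The group percentages are simple pass/scenario ratios. The sketch below reproduces them, assuming per-group pass counts obtained by inverting the published percentages against the group sizes implied by the scenario-ID lists (four groups of 21 plus one of 27 = 111 scenarios); the pass counts themselves are therefore inferred, not published.

```python
# Each tuple: (group name, Apollo-1 passes, Gemini passes, scenarios in group).
# Pass counts are inferred from the published percentages and group sizes.
GROUPS = [
    ("Baseline Retrieval",   18, 15, 21),
    ("Core Search",          20,  6, 21),
    ("Ancillary Costs",      12,  0, 21),
    ("Cabin Precision",      19,  0, 21),
    ("Constrained Planning", 23,  3, 27),
]

def report(groups):
    total_a = total_g = total_n = 0
    for name, a, g, n in groups:
        print(f"{name:22s} Apollo-1 {a/n:6.1%}   Gemini {g/n:6.1%}")
        total_a, total_g, total_n = total_a + a, total_g + g, total_n + n
    totals_label = f"Totals (n={total_n})"
    print(f"{totals_label:22s} Apollo-1 {total_a/total_n:6.1%}   Gemini {total_g/total_n:6.1%}")

report(GROUPS)   # reproduces the 82.9% vs 21.6% headline from 92 and 24 full-conversation passes
```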

9) Evaluation Analysis

Apollo-1 Strengths Across the Five Competency Buckets

  • Context-aware dialogue control – in Core Search scenarios Apollo-1 clarifies missing slots (dates, O/W vs R/T, party size) before searching, then keeps that context intact through follow-up constraints.
  • High-fidelity data retrieval – consistently surfaces the exact fare family, bag fee, upgrade delta or aircraft type demanded in Ancillary Costs and Cabin Precision scenarios.
  • Multi-constraint filtering – satisfies arrival-before / depart-after windows, hub-routing, Wi-Fi-only, airline-specific and environmental (CO₂) filters in a single pass, dominating the Constrained Planning bucket (a filtering sketch follows this list).
  • Complete itineraries – returns full outbound + return legs where required, a failure point that still plagues Gemini in every bucket except Baseline Retrieval.
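To make “in a single pass” concrete, here is a minimal, hypothetical sketch of the kind of constraint filter such an agent must apply to live inventory. The flight fields and constraint names are illustrative assumptions, not Apollo-1’s internal representation.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Flight:
    airline: str
    depart_hour: int         # local 24h departure hour
    arrive_hour: int         # local 24h arrival hour
    via_hub: Optional[str]   # connecting hub IATA code, None for non-stop
    has_wifi: bool
    co2_kg: float
    price: float

@dataclass
class Constraints:
    arrive_before: Optional[int] = None
    depart_after: Optional[int] = None
    allowed_airlines: Optional[Set[str]] = None
    require_wifi: bool = False
    max_co2_kg: Optional[float] = None
    required_hub: Optional[str] = None

def satisfies(f: Flight, c: Constraints) -> bool:
    """Apply every active constraint in one pass; any violation rejects the flight."""
    return (
        (c.arrive_before is None or f.arrive_hour < c.arrive_before)
        and (c.depart_after is None or f.depart_hour > c.depart_after)
        and (c.allowed_airlines is None or f.airline in c.allowed_airlines)
        and (not c.require_wifi or f.has_wifi)
        and (c.max_co2_kg is None or f.co2_kg <= c.max_co2_kg)
        and (c.required_hub is None or f.via_hub == c.required_hub)
    )

def best_match(inventory: List[Flight], c: Constraints) -> Optional[Flight]:
    """Cheapest flight that meets all constraints, or None to trigger fallback options."""
    candidates = [f for f in inventory if satisfies(f, c)]
    return min(candidates, key=lambda f: f.price) if candidates else None
```

The hard part the benchmark actually probes is accumulating those constraints correctly across turns of free-form dialogue; the filtering itself is the easy half.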

Areas that Still Need Work

  • True multi-city pricing flows – both models stumble on leg-by-leg bag or upgrade math in scenarios that chain three or more segments.
  • Long-tail ancillaries – niche items such as carrier-published CO₂ offset add-ons or exotic pet-in-cabin fees remain patchy (Gemini far more so).

Gemini’s Persistent Gaps

  • Fails slot clarification: often assumes round-trip, ignores supplied date windows, or answers with only the outbound leg.
  • Zero full-conversation passes in Ancillary Costs and Cabin Precision buckets.

10) Why Apollo-1 Pulls Ahead 

Apollo-1 is designed to power conversational agents acting on behalf of entities across industries and use-cases. Apollo-1 enables advanced native tool use through reliable, structured symbolic interactions with complex external systems and APIs. It provides comprehensive traceability, with each decision logged in a fully inspectable, editable format. Finally, it offers steerability and controllability, allowing organizations to consistently steer agents toward desired behaviors by providing granular context, instructions, and guidelines. To dive into Apollo-1’s architecture, click here.

[Diagram: Apollo-1 neuro-symbolic architecture]

Apollo-1’s Key Neuro-Symbolic Advantages:

  • Traceability: Full transparency—each reasoning step is logged, inspectable, and editable.
  • Steerability and Controllability: Operators can steer agent behavior by injecting granular context, instructions, and structured guidelines instantly.
  • Native Tool Use: Reliable interactions with complex APIs and external systems, driven by clearly defined symbolic entities (e.g., intent, constraints, context); an illustrative sketch follows this list.
  • Continuous Learning from Human Feedback: Continuous feedback loops for live fine-tuning of the NLP modules.
  • Conversational Fluency: Maintains the high linguistic performance of top generative models.
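As a rough illustration of what symbolic entities plus per-step traceability can look like, consider the sketch below. The state fields, trace format, and stubbed tool call are assumptions for exposition only and do not describe Apollo-1’s real interfaces.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List

@dataclass
class SymbolicState:
    """Explicit, inspectable state instead of free-form text: intent, constraints, context."""
    intent: str                                                   # e.g. "book_flight"
    constraints: Dict[str, str] = field(default_factory=dict)     # e.g. {"arrive_before": "12:00"}
    context: Dict[str, str] = field(default_factory=dict)         # e.g. {"origin": "LAX", "dest": "JFK"}

@dataclass
class TraceEvent:
    step: str        # e.g. "tool_call:search_flights"
    payload: Dict    # exact structured arguments sent or result received

trace: List[TraceEvent] = []

def call_tool(name: str, state: SymbolicState) -> Dict:
    """Tool arguments are built from the symbolic state, and every call is logged before it runs."""
    args = {"intent": state.intent, **state.constraints, **state.context}
    trace.append(TraceEvent(step=f"tool_call:{name}", payload=args))
    result = {"status": "ok", "tool": name}   # stub result keeps the sketch self-contained
    trace.append(TraceEvent(step=f"tool_result:{name}", payload=result))
    return result

state = SymbolicState(intent="book_flight",
                      constraints={"arrive_before": "12:00", "cabin": "economy"},
                      context={"origin": "LAX", "dest": "JFK", "date": "2025-08-09"})
call_tool("search_flights", state)
print(json.dumps([asdict(e) for e in trace], indent=2))   # the decision log, ready for audit or edit
```

Because the state is structured data rather than latent text, an operator can inspect or edit constraints before a call fires and replay the trace afterwards, which is the practical meaning of the steerability and traceability claims above.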

11) Bottom Line 

Generative AI is fine for user chat; Neuro-Symbolic AI is mandatory when an agent operates on behalf of a business or an entity. Across 111 one-shot scenarios, Apollo-1 delivers a full-conversation pass on 92 (82.9%); Gemini 2.5-Flash manages 24 (21.6%), underscoring the need for a neuro-symbolic core. The advantage grows as queries add pricing arithmetic or routing constraints, signaling Apollo-1’s readiness for production settings where predictable, auditable behavior is mandatory.

Apollo-1’s neuro-symbolic architecture delivers the natural language quality users expect and the steerability, policy adherence and compliance companies require.



Appendix A: Reward Logs

View Full Scenario List

Scenario prompts and per-step scores (Gemini 2.5-Flash vs Apollo-1):

R1: “Hi! I need round-trip flights from MIA to NYC.”
R2: “Where does the return leg depart from?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “Hi! I need round-trip flights from BOS to WAS”
R2: “Where does the return leg depart from?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “Hi! I need round-trip flights from LON to PAR.”
R2: “Where does the return leg depart from?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “Hi! I need to find round-trip flights from LON to PAR in August.”
R2: “What is the duration of each leg?”
R3: “What is the baggage allowance for each leg?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “Hi! I need to find round-trip flights from MIA to NYC in August”
R2: “What is the duration of each leg?”
R3: “What is the baggage allowance for each leg?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “Hi! I need to find round-trip flights from BOS to WAS in August.”
R2: “What is the duration of each leg?”
R3: “What is the baggage allowance for each leg?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “BOS → DUB 1 Nov – 8 Nov, basic-econ”
R2: “Add one checked bag each way—what’s the new total?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “MCO → DFW 10 Jul – 16 Jul, basic-econ.”
R2: “Add one checked bag each way—what’s the new total?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “LAS → LAX 11 Aug – 14 Aug, basic-econ.”
R2: “Add one checked bag each way—what’s the new total?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 -- → Fail (0/2)
Apollo-1: R1 1 · R2 1 · R3 -- → Pass (2/2)

R1: “Looking to fly tomorrow from BOS to NYC.”
R2: “Need to get to NYC on time for lunch.”
R3: “Is there a seat with more legroom on this flight?”
Gemini 2.5-Flash: R1 0 · R2 1 · R3 0 → Fail (1/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “Looking to fly tomorrow from LAX to SFO.”
R2: “Need to get to SFO on time for lunch.”
R3: “Is there a seat with more legroom on this flight?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)

R1: “Looking to fly tomorrow from MAD to BER.”
R2: “Need to get to BER on time for lunch.”
R3: “Is there a seat with more legroom on this flight?”
Gemini 2.5-Flash: R1 0 · R2 0 · R3 0 → Fail (0/3)
Apollo-1: R1 1 · R2 1 · R3 1 → Pass (3/3)


Appendix B : Trajectories

View Full Trajectories 


Click here to explore Apollo-1’s Eval Playground, where you can experience its capabilities firsthand. Interact with the model and its Reasoning Panel, navigate real-time conversations across multiple evaluation domains, and review live benchmark trajectories (passkey required; request access).


References

1 Huang, K.-H.; Prabhakar, A.; Thorat, O.; Agarwal, D.; Choubey, P. K.; Mao, Y.; Savarese, S.; Xiong, C.; Wu, C.-S. CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions. Salesforce AI Research (2025). arXiv:2505.18878.
