Apollo-1’s 83% vs Gemini 2.5-Flash’s 22%: Where generative AI stalls, neuro-symbolic controllable AI powers through
Generative AI excels when acting on behalf of a user, but it falters when an agent must act for an entity—whether it’s an insurance company, a brand, or a travel platform. That’s where controllable AI becomes essential, giving companies the levers to keep an agent aligned with their policies, guardrails, and desired behaviour. To demonstrate, we wired Apollo-1 into the same real-time Google Flights feed powering Gemini 2.5-Flash and ran 111 multi-turn conversations across both models.
Apollo-1 is engineered to operate on behalf of an entity, not an individual. Its neuro-symbolic core merges generative fluency with deterministic policy enforcement, giving organisations the control, traceability, and reliability that conventional LLM co-pilots lack. The head-to-head below shows how that design translates into real-world wins over Gemini when every query hits identical inventory, prices, and schedules.
Most leaderboards cherry-pick the best answer from a batch of generations. We do the opposite. Every prompt gets one shot—exactly what a live user would see—and that single response is what we score.
We reuse τ-Bench’s philosophy in miniature: no retries, and binary checks that mirror real customer journeys. Because there is no second chance, the outcome is a single, unforgiving line in the sand: a simple, reproducible, brutally honest snapshot of how an agent really behaves when someone is on the other side of the chat box.
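The one-shot protocol can be sketched as a small harness. This is a minimal sketch, not the benchmark’s actual code: the scenario shape (a list of prompt/check pairs) and the function names are our own illustrative assumptions.

```python
def evaluate(agent, scenarios):
    """Run each multi-turn scenario exactly once: one response per prompt,
    no retries, and a binary check per step."""
    all_results = []
    for scenario in scenarios:
        steps = []
        for prompt, check in scenario:      # each turn: (prompt, pass/fail predicate)
            reply = agent(prompt)           # single generation, never resampled
            steps.append(1 if check(reply) else 0)
        all_results.append(steps)
    return all_results

# Toy usage: a stub "agent" that returns one canned answer
toy_scenario = [("Cheapest BOS->MIA next Saturday?", lambda r: "$" in r)]
results = evaluate(lambda prompt: "Non-stop, $129 total.", [toy_scenario])
```

Because every step is scored on the single response it gets, the harness has no notion of best-of-n sampling.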
For almost eight years we built toward one goal: agents that act on behalf of an entity, not a user.
Real-time Control Panel: teams inspect reasoning traces, tweak context schemas, inject policies, and replay trajectories, with continuous fine-tuning at the sub-interaction level instead of retraining the whole stack.
This is how an agent can operate on behalf of an airline, a bank, or a government office – not just with a user – and still sound natural.
Scoring criteria: Scenario 28, “Same-day One-Way, Arrive Before Lunch”
| Step | Tester prompt | Check | Pass criterion |
|---|---|---|---|
| R-1 | “One-way LAX → JFK overnight on 9 Aug 2025 (arrive morning 10 Aug), 1 adult, economy.” | Return at least one qualifying red-eye itinerary | Departs 9 Aug, arrives next day before 12:00 PM, shows total price |
| R-2 | “What’s the cost to upgrade the cheapest red-eye?” | Quote upgrade price (or say none) for that same itinerary | States whether an upgrade exists and the correct price delta |
| R-3 | “If I skip the upgrade and instead add one checked bag, what’s the total cost?” | Add the airline’s checked-bag fee and compute new grand total | Gives per-bag fee, adds it to base fare, and returns the exact all-in total |
Scoring: binary (1 = pass, 0 = fail) per step. Scenario score per run = R-1 + R-2 + R-3 (max 3).
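Under this rubric, the per-run score and the binary reward (granted only when every step passes, i.e. a full-conversation pass) can be computed in a few lines. A sketch; the function name is our own.

```python
def score_run(step_results):
    """step_results: binary outcomes for R-1..R-n, e.g. [1, 0, 0]."""
    score = sum(step_results)                        # scenario score per run
    reward = 1 if score == len(step_results) else 0  # reward requires a full pass
    return score, reward

assert score_run([1, 1, 1]) == (3, 1)   # 3/3 run earns reward 1
assert score_run([1, 0, 0]) == (1, 0)   # 1/3 run earns reward 0
```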
| Model | Red-eye found? | Upgrade price? | Bag math right? | Score | Reward |
|---|---|---|---|---|---|
| Apollo-1 | ✅ 22:00 → 06:30, $204 | ✅ +$80 Blue Extra | ✅ Fare + $40 bag = $244 | 3/3 | 1 |
| Gemini | ✅ 20:47 → 05:30, $139 | ❌ pulled upgrade from wrong airline | ❌ bag-fee range, no total | 1/3 | 0 |
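The R-3 bag-math check is plain arithmetic: take the quoted base fare, add the carrier’s per-bag fee, and return the exact all-in total. A minimal sketch using the figures from the passing run above (the function name is ours):

```python
def all_in_total(base_fare, bag_fee, bags=1):
    """Grand total after adding checked bags to the base fare."""
    return base_fare + bag_fee * bags

# The passing run: $204 red-eye fare + one $40 checked bag
assert all_in_total(204, 40) == 244
```

The failing response quoted a fee range without ever committing to a total, which is exactly what the binary check penalises.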
The 111 scenarios are grouped into five distinct technical competencies that an enterprise-grade flight agent must master. Each group isolates one capability, so a perfect model would score 100% in every row:
| Group | Technical capability under test |
|---|---|
| Baseline Retrieval | Same-/next-day queries with minimal dialogue: slot-filling from a single prompt, immediate cheapest-non-stop extraction, earliest-arrival lookup, basic aircraft or alliance attribution, and quick FAA-safety or risk answers. |
| Core Search | Two-turn clarification flows and flexible-window optimisation: resolving missing dates or trip type, iterating on “next flight / land-by” constraints, and selecting the absolute lowest fare from a live inventory once parameters are locked. |
| Ancillary Costs | End-to-end fee arithmetic: parsing carrier-specific price tables (checked bags, pets, seat upgrades, Wi-Fi, refundability), aggregating per-leg vs per-trip charges, recomputing totals after user changes, and preserving fare-rule validity. |
| Cabin Precision | Fare-family graph reasoning and cabin hierarchy mapping (Economy → Premium Eco → Business/First): real-time availability checks, upgrade-delta calculation, group-size propagation through re-quotes, and class-of-service validation for non-stop constraints. |
| Constrained Planning | Multi-constraint itinerary optimisation: enforcing strict arrival/departure windows, airline/airport and layover filters, short-connection thresholds, hub or region routing, CO₂ or environmental factors, historical delay-risk ranking, and fallback-option generation when no direct match exists. |
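A constrained-planning query ultimately reduces to filtering live inventory against hard cut-offs before ranking the survivors. A minimal sketch, assuming a hypothetical itinerary schema (the dict fields, airline codes, and prices below are illustrative, not benchmark data):

```python
def filter_itineraries(itineraries, arrive_before=None,
                       max_layover_min=None, allowed_airlines=None):
    """Keep only itineraries that satisfy every hard constraint; the caller
    then ranks survivors (e.g. by price or delay risk) or falls back."""
    matches = []
    for it in itineraries:
        if arrive_before is not None and it["arrival"] > arrive_before:
            continue  # misses the arrival cut-off
        if max_layover_min is not None and it["layover_min"] > max_layover_min:
            continue  # layover too long
        if allowed_airlines is not None and it["airline"] not in allowed_airlines:
            continue  # carrier filtered out
        matches.append(it)
    return matches

flights = [
    {"airline": "B6", "arrival": "11:30", "layover_min": 0,  "price": 204},
    {"airline": "AA", "arrival": "13:05", "layover_min": 95, "price": 139},
]
# "Land before lunch": HH:MM strings compare correctly lexicographically
before_lunch = filter_itineraries(flights, arrive_before="12:00")
```

When `matches` comes back empty, a production agent would relax the softest constraint and regenerate fallback options rather than return nothing.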
| Group | Scenario IDs | What it measures | Apollo-1 | Gemini |
|---|---|---|---|---|
| Baseline Retrieval | 37–39 · 70–72 · 91–93 · 100–111 | Same-/next-day look-ups, cheapest non-stop pick, earliest-arrival check, basic aircraft/alliance flags, FAA-safety acknowledgement. e.g. “Cheapest BOS→MIA next Saturday”, “Land before midnight despite FAA alert” | 85.7% | 71.4% |
| Core Search | 1–6 · 25–27 · 40–42 · 64–66 · 73–78 | Multi-turn date/trip-type clarification, flexible-window lowest fare, next-flight/arrival-window filters. e.g. round-trip MIA⇄NYC date probe; 1–7 Jul cheapest RT | 95.2% | 28.6% |
| Ancillary Costs | 7–9 · 28–30 · 34–36 · 52–54 · 61–63 · 82–84 · 94–96 | Fee arithmetic: checked bags, pet-in-cabin, upgrade-vs-bag deltas, Wi-Fi/refundability tiers, OW-vs-RT price optimisation. e.g. basic-econ + checked bag; red-eye upgrade vs bag | 57.1% | 0% |
| Cabin Precision | 13–18 · 46–51 · 85–90 · 97–99 | Fare-family graph reasoning, Premium/Biz quotes, group upgrades, non-stop cabin comparisons, multi-city upgrade pricing. e.g. Economy→Premium delta on nonstop JFK→LAX | 90.5% | 0% |
| Constrained Planning | 10–12 · 19–24 · 31–33 · 43–45 · 55–60 · 67–69 · 79–81 | Multi-constraint routing: arrival/departure cut-offs, red-eyes + Wi-Fi, CO₂ & offset, hub/layover filters, delay-risk backups, national-park & festival trips. | 85.2% | 11.1% |
| Totals (n = 111 scenarios) | — | — | 82.9% | 21.6% |
Generative AI is fine for user chat; controllable AI is mandatory when an agent operates on behalf of a business or an entity. Across 111 one-shot scenarios, Apollo-1 delivers a full-conversation pass on 92 (82.9%); Gemini 2.5-Flash manages 24 (21.6%), underscoring the need for neuro-symbolic control. The advantage grows as queries add pricing arithmetic or routing constraints, signalling Apollo-1’s readiness for production settings where predictable, auditable behaviour is mandatory.
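The headline numbers follow directly from the per-group table: back-deriving each group’s pass count from its reported percentage and scenario count reproduces the 92/111 total. The counts below are our derivation from the published percentages, not separately reported figures.

```python
# (passes, scenarios) per group, derived from the reported Apollo-1 percentages
apollo = {
    "Baseline Retrieval":   (18, 21),   # 85.7%
    "Core Search":          (20, 21),   # 95.2%
    "Ancillary Costs":      (12, 21),   # 57.1%
    "Cabin Precision":      (19, 21),   # 90.5%
    "Constrained Planning": (23, 27),   # 85.2%
}
passes = sum(p for p, _ in apollo.values())
total = sum(n for _, n in apollo.values())
print(f"{passes}/{total} = {100 * passes / total:.1f}%")  # prints "92/111 = 82.9%"
```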
R-1: “Hi! I need round-trip flights from MIA to NYC.”
R-2: “Where does the return leg depart from?”
Result: Fail (0/2)

R-1: “Hi! I need round-trip flights from BOS to WAS.”
R-2: “Where does the return leg depart from?”
Result: Fail (0/2)

R-1: “Hi! I need round-trip flights from LON to PAR.”
R-2: “Where does the return leg depart from?”
Result: Fail (0/2)

R-1: “Hi! I need to find round-trip flights from LON to PAR in August.”
R-2: “What is the duration of each leg?”
R-3: “What is the baggage allowance for each leg?”
Result: Fail (0/3)

R-1: “Hi! I need to find round-trip flights from MIA to NYC in August.”
R-2: “What is the duration of each leg?”
R-3: “What is the baggage allowance for each leg?”
Result: Fail (0/3)

R-1: “Hi! I need to find round-trip flights from BOS to WAS in August.”
R-2: “What is the duration of each leg?”
R-3: “What is the baggage allowance for each leg?”
Result: Fail (0/3)

R-1: “BOS → DUB, 1 Nov – 8 Nov, basic-econ.”
R-2: “Add one checked bag each way—what’s the new total?”
Result: Fail (0/2)

R-1: “MCO → DFW, 10 Jul – 16 Jul, basic-econ.”
R-2: “Add one checked bag each way—what’s the new total?”
Result: Fail (0/2)

R-1: “LAS → LAX, 11 Aug – 14 Aug, basic-econ.”
R-2: “Add one checked bag each way—what’s the new total?”
Result: Fail (0/2)

R-1: “Looking to fly tomorrow from BOS to NYC.”
R-2: “Need to get to NYC on time for lunch.”
R-3: “Is there a seat with more legroom on this flight?”
Result: Fail (1/3)

R-1: “Looking to fly tomorrow from LAX to SFO.”
R-2: “Need to get to SFO on time for lunch.”
R-3: “Is there a seat with more legroom on this flight?”
Result: Fail (0/3)

R-1: “Looking to fly tomorrow from MAD to BER.”
R-2: “Need to get to BER on time for lunch.”
R-3: “Is there a seat with more legroom on this flight?”
Result: Fail (0/3)
Result: Pass for the twelve corresponding runs, in order: 2/2, 2/2, 2/2, 3/3, 3/3, 3/3, 2/2, 2/2, 2/2, 3/3, 3/3, 3/3.