Building a Conversational Flight Booking Assistant from Scratch with LangGraph, OpenAI API and…

wpnews.pro

Not all AI applications are solving the same kind of problem.

Some tasks are fundamentally transformational. A summarization system converts a document into a summary. A translation system converts text from one language to another. An extraction pipeline converts unstructured content into structured data. Given the required input, the system can usually produce an output immediately.

Others are information-seeking systems. Search, RAG, and knowledge assistants fall into this category. The user asks a question, the system retrieves or synthesizes information, and returns an answer. Even when conversation history is maintained, the interaction is largely user-driven: the user asks, the system responds.

Flight booking is different.

A booking assistant cannot complete its task with the information available in the first message. The user might say:

“Book me a flight to Mumbai.”

But a successful booking requires much more: departure city, travel dates, passenger count, trip type, flight selection, passenger details, and contact information. The assistant must actively collect missing information, validate inputs, recover from mistakes, maintain state across multiple turns, and guide the conversation toward a successful outcome.

This class of problems is commonly known as Task-Oriented Dialogue (TOD) or Goal-Oriented AI Systems. Unlike information-retrieval agents, these systems are responsible for driving a business process to completion. The challenge is no longer generating a correct response; it is managing a workflow.

In this article, we will build, a production-grade IndiGo Airlines booking assistant using LangGraph and OpenAI API. The assistant supports flight booking, web check-in, and flight status enquiries across Streamlit and Telegram while maintaining conversational state.

More importantly, the patterns discussed here extend far beyond airline reservations. The same architectural principles apply to insurance claims, customer onboarding, appointment scheduling, loan applications, technical troubleshooting, and any workflow where an AI agent must progressively gather information and drive a task to completion.

By the end of this article, you’ll understand how to design stateful, goal-oriented AI systems that move beyond answering questions and start completing real-world business processes.

Airlines handle millions of customer interactions related to flight bookings, web check-ins, and flight status enquiries. Traditionally, these interactions are completed through websites, mobile apps, or customer support agents, requiring users to navigate multiple screens and forms before completing a task.

The goal is to build an AI-powered assistant that allows customers to complete these workflows through a natural conversation. Instead of filling forms manually, users should be able to express their intent in plain language, while the assistant guides them through the process and completes the required actions.

This project is inspired by IndiGo’s 6Eskai virtual assistant and focuses on implementing a simplified version of its flight booking, web check-in, and flight status workflows.

As it is a conversational experience, there are various challenges like user might not tell all the information in one go, or user can provide city information for a question which was intended for travel date. Listing down the major challenges here:

User: Book me a flight to MumbaiBot:  Sure. What is your departure city?User: JaipurBot:  What date would you like to travel?-----------------The assistant must identify missing information and dynamically collect it through follow-up questions.
User: Book a flight from Bombay to Bangalore-----------------The assistant must resolve:Bombay → Mumbai (BOM)Bangalore → Bengaluru (BLR)
User: Book a flight from Bombay to Bangalore-----------------The assistant must resolve:Bombay → Mumbai (BOM)Bangalore → Bengaluru (BLR)before performing the flight search.
Bot: What date would you like to travel? User: to Jaipur on 12th July-------------------Here assistant should be able to understand that User has providedboth departure city and date
User: Book a flight from Jaipur to Mumbai yesterday-----------------The assistant should detect that the travel date is invalid and request a valid future date.User: Book a flight to Springfield------------------If no supported airport can be resolved, the assistant should gracefully ask for clarification rather than failing the conversation.

The assistant supports three workflows:

It runs on two channels simultaneously: a Streamlit web UI and a Telegram bot. Both channels share a single compiled LangGraph state machine.

Below is the tech-stack of the project:

LLM : OpenAI GPT-4o-miniAgent framework : LangGraphWeb UI : StreamlitBot channel :  python-telegram-bot Database : SQLite3

Complete Code is Present here: [flight-booking-assistant]

agentic-ai-usecases/advanced/flight-booking-assistant at main · alphaiterations/agentic-ai-usecases

Python 3.11+
An OpenAI API key
Basic familiarity with Python async and dictionary-based state

git clone https://github.com/vijendrajain/agentic-ai-usecasescd advanced/flight-booking-assistantpip install -r requirements.txt

Create a .env file at the project root:

OPENAI_API_KEY=sk-…TELEGRAM_BOT_TOKEN=8*************** TELEGRAM_BOT_USERNAME=FlightBookingAIBot

Note: Please refer below link to know how to create TELEGRAM_BOT_TOKEN and TELEGRAM_BOT_USERNAME on telegram:

From BotFather to 'Hello World'

To keep the system maintainable and extensible, the codebase is divided into independent layers for conversational agents, workflow nodes, business services, shared utilities, and user-facing channels. Each layer has a well-defined responsibility within the booking workflow.

Below is the repo structure:

flight-booking-assistant/├── app.py                      # Streamlit web UI entry point├── telegram_bot.py             # Telegram bot adapter├── graph/│   ├── __init__.py             # Top-level StateGraph + dispatch_route│   ├── booking_subgraph.py     # Compiled booking flow subgraph│   └── pnr_subgraph.py         # Compiled PNR / check-in / status subgraph├── state.py                    # BookingState TypedDict (5 sub-types + Passenger)├── constants.py                # Step, Intent, Process enums; CITY_TO_CODE map├── config.py                   # Settings loaded from .env├── nodes/                      # All node functions (LLM-calling and pure-Python)│   ├── router.py               # Intent classification (booking / check-in / status)│   ├── information_extractor.py  # Slot, PNR, and passenger extraction│   ├── slot_validator.py       # Per-field validation + retry counter updates│   ├── city_lookup.py          # City name → IATA code resolution│   ├── conversation_driver.py  # Slot sequencing, Phase 1 & 2 flow orchestration│   ├── flight_selection.py     # Parses user's flight choice from numbered list│   ├── booking_guardrail.py    # Guards against mid-flow process switching│   ├── confirmation.py         # Pre-search booking confirmation prompt│   ├── payment.py              # Payment step (stub, ready for Stripe/Razorpay)│   └── done.py                 # Session teardown and final response├── services/│   ├── flight_search.py        # SQLite query + dynamic pricing calculation│   ├── pnr_lookup.py           # PNR / check-in / flight status database lookup│   ├── booking_save.py         # Persists confirmed bookings to the database│   └── session_store.py        # Session persistence across page refreshes├── utils/│   ├── llm.py                  # call_llm_json wrapper + observability logging│   ├── db.py                   # SQLite connection and query helpers│   ├── formatting.py           # Flight list and message formatters│   ├── user_messages.py        # User-facing string constants│   └── prompts/                # LLM prompt templates (split by concern)│       ├── extraction.py       # Slot, PNR, passenger extraction prompts│       ├── conversation.py     # Routing, retry, persona prompts│       └── classification.py   # Intent classification prompts├── indigo_airline.db           # Pre-loaded SQLite database (17 tables)├── sessions.db                 # Session state persistence across page refreshes├── create_airline_db.py        # Script used to seed the database└── requirements.txt

Each directory has a single responsibility:

agents/ holds LLM-calling nodes,

nodes/ holds pure-Python nodes,

services/ handles database queries, and

utils/ provides shared infrastructure.

The two channel adapters (app.py [for streamlit] and telegram_bot.py [for telegram])contain no business logic. They call booking_graph.invoke() and display the result.

To support realistic airline workflows, We first create a synthetic airline database instead of relying on mocked responses. The dataset is generated from publicly available IndiGo route information and populated with synthetic customers, bookings, passengers, payments, baggage records, flight instances, and delay data.

Here we use python faker library.

The dataset is fully configurable, allowing the size and coverage of the airline network to be adjusted through a few parameters:

#create_airline_db.pyclass DBConfig:    # Reproducibility    random_seed                  = 42    # Date window — anchored to today so data is always futuristic    schedule_start               = datetime.now()    schedule_years               = 2    flight_instance_years        = 2    flight_instance_sample_weeks = 1    # Airport filter    # "all"      → every IndiGo airport (airport_list is ignored)    # "selected" → only airports in airport_list    airport_list_type            = "selected"    airport_list                 = [        "DEL", "BOM", "BLR", "MAA", "HYD", "CCU",   # major metros        "AMD", "PNQ", "COK", "GOI", "JAI", "LKO",   # tier-2        "NAG", "IXC", "PAT", "BBI", "SXR",    ]    # Volume    num_customers                = 100    num_bookings                 = 500    # How many flights (from the schedule) to generate instances for.    # Lower this to reduce FlightInstances rows and DB size.    # None = use all flights in the schedule.    max_flights_for_instances    = 20    # Paths    db_path                      = os.path.join(os.path.dirname(__file__), "indigo_airline.db")    routes_url                   = (        "https://raw.githubusercontent.com/alphaiterations/data-for-agents"        "/main/airlines-data/airline_routes.json"    )

This makes it easy to generate anything from a lightweight demo dataset to a much larger airline reservation system by simply changing a few configuration values.

Instead of manually maintaining flight routes, the generator extracts only IndiGo-operated routes from publicly available airline route data before creating the flight schedule.

#create_airline_db.pyfor carrier in route["carriers"]:    if carrier.get("iata") == "6E":        indigo_routes.append({            "origin": airport_code,            "destination": dest,            "distance_km": route.get("km"),            "duration_mins": route.get("min"),        })

Multiple daily departures are generated automatically for every supported route, creating a realistic flight schedule.

#create_airline_db.pydeparture_times = [    "06:00", "09:30", "12:00",    "15:30", "18:00", "21:00"]for route in unique_routes:    for departure_time in departure_times:        create_flight(...)

Flight schedules describe recurring flights, while flight instances represent a specific flight on a particular date. A small percentage of instances are randomly assigned delays, enabling realistic flight status demonstrations.

for days_ahead in range(0, instance_days, step_days):    current_date = start + timedelta(days=days_ahead)    create_flight_instance(...)    if random.random() < 0.05:        create_delay(...)

In the terminal navigate to flight-booking-assistant folder and run below command to create the synthetic db.

cd flight-booking-assistantpython create_airline_db.py

Output

Config: seed=42, start=2026-06-20, schedule_years=2, customers=100, bookings=500Removed existing database: /Users/current_user/agentic-ai-usecases/advanced/flight-booking-assistant/indigo_airline.dbFetching airline_routes.json from:  https://raw.githubusercontent.com/alphaiterations/data-for-agents/main/airlines-data/airline_routes.jsonRoutes data fetched successfully.IndiGo routes extracted (17 selected airports): 210Database schema created successfully.Flight schedule created: 1260 flights.Days of operation inserted for all flights.Generating 100 synthetic customers...100 customers inserted.Generating 500 bookings...500 bookings inserted.Generating flight instances and delays (2 years, sampled every 7d)...  Generating instances for 20 flights...Flight instances and 110 delays inserted.============================================================DATABASE SUMMARY - INDIGO AIRLINE BOOKING SYSTEM============================================================  Customers                             100  FlightSchedule                      1,260  DaysOfOperation                     8,820  PNRs                                  500  Bookings                              500  Passengers                          1,220  Itineraries                           500  ItineraryLegs                         500  PassengerBaggage                    1,220  FlightInstances                     2,100  FlightDelays                          110  Payments                              500------------------------------------------------------------  TOTAL                              17,330Database file : /Users/current_user/agentic-ai-usecases/advanced/flight-booking-assistant/indigo_airline.dbDatabase size : 1.59 MB============================================================

This creates indigo_airline.db in the root folder.

The three tables the agent queries most are FlightSchedule, FlightInstances, and PNRs.

FlightSchedule is the source of truth for what routes exist and when flights depart.

FlightInstances links a schedule entry to a specific date and carries the live status field (On Time, Delayed, Cancelled).

PNRs is what the web check-in and flight status flows query: the user gives a PNR code and their last name, and the agent joins PNRs to Bookings to ItineraryLegs to return their itinerary.

Here is the summary of all the tables:

The flight search query is straightforward:

One deliberate simplification: pricing is not stored in the database. It is computed dynamically from the flight index and duration at query time. This keeps the schema clean and avoids needing a fare table that would require constant updates for a demo.

Pricing is dynamic and computed in Python (The pricing formula lives in services/flight_search.py ), not stored as a fixed fare:

This creates natural price variation across flights without needing a fare table, which keeps the demo realistic without requiring a live pricing API.

In LangGraph, state acts as the shared memory of the workflow. Every node reads from the current state, performs its logic, and returns updates that are merged back into the state.

For a flight booking assistant, state needs to capture much more than conversation history. The system must track user inputs, booking details, workflow progress, validation errors, selected flights, and session metadata across multiple turns.

To keep the state manageable, I grouped related fields into logical categories:

This approach keeps ownership clear while allowing every node in the workflow to access a unified view of the conversation.

The step field is the single source of truth for where you are in the conversation. It takes values like GREETING, COLLECT_SLOTS, CONFIRM_BOOKING, SHOW_FLIGHTS, PAYMENT, DONE, and sub-steps like collect_names and collect_email. Every routing decision in the graph reads step first.

The slot_attempts field is a dict that tracks how many times the user has failed on each specific field. This is more surgical than a global counter: a user might nail the city name in one try but keep giving dates in the past, and you want to terminate only the date field after three failures, not the whole session.

Round-trip bookings require storing both legs. The booking_leg field tracks whether you are currently booking the outbound or return flight, and selected_outbound_flight holds the confirmed first leg while the user picks the second:

Gotcha:Never store derived values in state if you can compute them from other fields. Early in the project I stored total_passengers alongside adults and children. They drifted. Now only adults and children live in state; totals are computed on the fly.

Every LLM call in this project is driven by a prompt from utils/prompts/, which is split into three modules by concern:

USER MESSAGE                              │              ┌───────────────┼───────────────┐              │               │               │              ▼               ▼               ▼       EXTRACTION        CLASSIFICATION   CONVERSATION       PROMPTS           PROMPTS          PROMPTS              │               │               │    "What did the user   "Which option    "What should    actually say?"       did the user     the bot say                         pick?"           next?"              │               │               │       Strict JSON       Closed vocab     Free text       Fixed schema      One of N         Persona-bounded       null = not said   No elaboration   Empathetic              │               │               │    EXTRACTION_CONTEXT   CONFIRM_INTENT_  SYSTEM_PERSONA    EXTRACTION_PROMPT    PROMPT           ROUTING_PROMPT    PASSENGER_           FLIGHT_          OUT_OF_SCOPE_    EXTRACTION_PROMPT    SELECTION_PROMPT PROMPT    PNR_EXTRACTION_      MID_FLOW_INTENT_ RETRY_MESSAGE_    PROMPT               PROMPT           PROMPT                         CITY_LOOKUP_                         PROMPT

Every LLM call fits into one of these three boxes. If you ever find yourself writing a prompt that wants to both extract a slot and decide what to say next — split it into two.

Extraction prompts enforce strict “do not infer” rules and always return a fixed schema.

Classification prompts enforce a closed vocabulary (affirm / deny / modify, or a 0-based index).

Conversation prompts are the only place the model is allowed to produce free-form text. Keeping these three categories separate makes it easier to audit what each prompt is allowed to do.

Before writing a single prompt, let’s understand what we are dealing with.

An LLM does not “think” the way you do. When you ask it to extract flight information from a user message, it tries to produce the most plausible-looking response based on its training. That sounds helpful and it is but it creates three specific problems in a booking system:

Problem 1: Output format is unpredictable. Ask the LLM to return JSON and it might wrap the response in a markdown code fence (json ...). Ask for a number and it might say "The answer is 2 adults." For a booking system that parses LLM output programmatically, this breaks things silently.

Problem 2: The model infers what users “probably meant.” A user says “Jaipur to Mumbai on the 15th, 2 adults.” The model might helpfully set trip_type = "one-way" because most single-date bookings are one-way. But you never want the model to make that call — you want the user to confirm it explicitly. If the model guesses right, the confirmation step gets silently skipped.

Problem 3: Null and zero mean different things, but the model may treat them the same. A user says “2 adults, no kids.” The model might return children: null because it treats "no kids" as absence of information. But null means "we haven't asked yet" and 0 means "user confirmed no children." If children stays null, the bot will ask the passenger count question all over again.

These three problems drive every single prompt in this project.

The fundamental insight is this: not all LLM outputs are equal. Some need to be strict and parseable. Some need to choose from a fixed set. Only some need to be expressive. If you mix these requirements in one prompt, you get a confused model.

So we organise all eleven prompts into three separate families based on what kind of output they produce:

Every user message the bot receives leads to one of three questions: ┌──────────────────────────────────────────────────────────────────┐ │  "What did the user actually say?"  →  Extraction Prompts        │ │  Strict JSON, fixed schema, null for anything not stated.        │ │  The model is a parser, not a reasoner.                          │ ├──────────────────────────────────────────────────────────────────┤ │  "Which option did the user pick?"  →  Classification Prompts    │ │  Closed vocabulary. Pick exactly one bucket. No elaboration.     │ ├──────────────────────────────────────────────────────────────────┤ │  "What should the bot say next?"   →  Conversation Prompts       │ │  Free text allowed, but always bounded by a shared persona.      │ └──────────────────────────────────────────────────────────────────┘

This separation is the most important decision in the entire prompt layer. It determines how you write each prompt, what you constrain, and how you call the model. Let us go through each family.

Before any prompt runs, two settings eliminate the first problem (unpredictable format) at the API call level:

response_format={"type": "json_object"} tells the model at the API level: your entire output must be valid JSON. No markdown fences, no "Sure! Here's the JSON:" preamble. The API enforces this — the model physically cannot violate it.

temperature=0 removes creativity. You do not want the model to "interpret" what "July 15th" means. You want it to return "2025-07-15" the same way every single time.

The mental model: Think of these prompts like a strict forms clerk. They extract exactly what is on the form, leave everything else blank, and never fill in fields based on assumption.

All three extraction prompts share one foundational rule as a preamble:

EXTRACTION_CONTEXT — The "No Inference" Contract

This is the most important declaration in the project. Without it, GPT-4o-mini will helpfully fill in what it “probably” knows. When the model guesses a field, the slot validator sees a value and considers it “collected.” The retry logic never triggers. The user never confirms. The booking moves forward with wrong data — silently.

Every extraction prompt starts with EXTRACTION_CONTEXT.

EXTRACTION_PROMPT — Phase 1: Collecting Flight Slots

This prompt handles the first phase of booking: origin, destination, dates, trip type, and passenger count. Here it is in full:

Three rules in this prompt deserve extra attention:

**The **children null vs. 0 rule. null means "we haven't asked yet." 0 means "user confirmed no children." If the user says "2 adults, no kids" and the model returns children: null, the slot validator thinks the passenger question is still open and asks again — even though the user already answered it. The rule forces the model to distinguish between "not mentioned" and "explicitly zero."

**The **trip_type non-inference rule. Without it, the model sees two dates and decides it must be a round-trip — because that is the most common reason people give two dates. That guess silently skips the explicit trip type confirmation step. The rule is blunt: if the user did not say the words, return null.

The date inference rules. Users almost never say the year. “July 15th” could be this July or next July. The rules resolve this deterministically: use the current year if the date is in the future, bump to next year if it has already passed, and always use next year for January or February mentions in November or December.

PASSENGER_EXTRACTION_PROMPT — Phase 2: Step-Aware Extraction

After the user selects a flight, the conversation enters Phase 2: a four-step sequence collecting flight confirmation, WhatsApp consent, passenger names, and email. Each step needs different extraction logic.

The naive approach: write four separate prompts. The problem: you copy the EXTRACTION_CONTEXT preamble four times and they drift out of sync.

The better approach: one prompt that handles all four steps by embedding step-specific rules, with {current_step} injected at call time:

The current step gets injected at call time:

prompt = PASSENGER_EXTRACTION_PROMPT.format(    current_step=state["step"],    assistant_message=state["assistant_message"],    user_input=user_input,)

One prompt, one call_llm_json() call, four different behaviours — selected entirely by the step value in state. No branching in Python code, no four separate prompt strings to maintain.

PNR_EXTRACTION_PROMPT — PNR Code Extraction

The web check-in and flight status flows need just one thing from the user: their PNR code. This prompt is deliberately minimal:

Short prompt, one field, one rule. The model does not need conversation history, step logic, or persona here — just the raw user message and a pattern description.

The mental model: Think of these prompts like a multiple-choice answer sheet. The model gets a list of options and must pick exactly one. No elaboration, no hedging, no “it depends.”

User message arrives mid-conversation           │           ▼  ┌─────────────────────────────────────────────┐  │  Classification Prompt                      │  │  "Here are your only valid options:         │  │   affirm / deny / modify                    │  │                                             │  │  Pick exactly one. Return JSON."            │  └─────────────────────────────────────────────┘           │           ▼  {"intent": "affirm"}   ← no prose, no explanation

Classification prompts differ from extraction prompts in one key way: the model is not parsing what the user said — it is deciding which pre-defined bucket the user’s message falls into.

CONFIRM_INTENT_PROMPT — Affirm, Deny, or Modify?

Used after the bot shows the user a travel summary and asks them to confirm. Three outcomes are possible:

The examples inside each bucket do a lot of the work. Without them, “sounds good” might confuse the model. “Don’t stop” — which means proceed — could easily be mistaken for “deny.” Explicit examples anchor the model to the intended interpretation.

FLIGHT_SELECTION_PROMPT — Which Flight Did the User Choose?

After the bot shows a numbered list of flights, the user might say “flight 1”, “the cheapest one”, “the 9 AM flight”, or even the flight number directly (“6E0863”). This prompt maps all of those to a 0-based index:

The N-1 rule is explicit in the prompt. Without it, the model may return selected_index: 1 when the user says "flight 1" — off by one from the 0-based Python list. A subtle bug, caught by one sentence in a prompt.

MID_FLOW_INTENT_PROMPT — Is the User Continuing or Changing Their Mind?

When a user has already seen available flights, they might suddenly say “actually, I want to go to Goa instead.” This prompt distinguishes that from a normal flight selection:

Two buckets, explicit examples. Without the examples, “I want to go to Delhi instead” is ambiguous — it contains a city name, exactly like a normal slot answer. The examples train the model on the intent behind the words, not just the words themselves.

CITY_LOOKUP_PROMPT — Resolving Messy City Names

Users say “Bombay” instead of “Mumbai”, “Bengaluru” or “Bangalore”, “New Delhi” or “Delhi.” A pure string match fails all of these. This prompt resolves the user’s input against a pre-computed list of fuzzy candidates:

The candidates list is generated in Python using substring and fuzzy matching before the prompt runs. The model’s job is only to pick from a short, already-relevant list — not to search 50 cities from scratch. This two-step approach (Python narrows candidates → LLM picks the best match) is faster and more reliable than asking the LLM to search the full city list directly.

The mental model: This is the only family where the model produces natural language. But “free text” does not mean “no constraints.” Every conversation prompt is bounded by a shared persona constant.

Extraction prompts    →  strict JSON outputClassification prompts →  pick from closed options                                        │                                        ▼Conversation prompts  →  free text, always shaped by SYSTEM_PERSONA

SYSTEM_PERSONA — The Shared Personality

This constant is imported by every conversation prompt. It defines the bot’s personality once, so every message the bot produces sounds like the same voice:

If you want to change the bot’s tone — say, make it more formal, or add “always address users by their first name” — you edit this one constant and every conversation prompt inherits the change automatically. No hunting through eight prompt strings.

ROUTING_PROMPT — Classifying Intent for New Sessions

When a user sends their very first message, the bot does not know what they want. This prompt classifies the message into one of five intents:

The IMPORTANT note is the key clause. Without conversation history in the prompt, "23rd May" looks like an out-of-scope message — it is not a clear booking request on its own. With history included, the model can see that the bot previously asked "What is your travel date?" — so "23rd May" is unambiguously a booking reply.

OUT_OF_SCOPE_PROMPT — Politely Redirecting Off-Topic Messages

Short and deliberate. The {system} injection ensures the persona is consistent even in refusals. The "2–3 lines" constraint prevents a lengthy apology where a brief redirect is all that is needed.

RETRY_MESSAGE_PROMPT — The Only Free-Text Non-Schema Prompt

When slot extraction fails — the user gives a date in the past, a city with no airport, or a passenger count that does not add up — the bot needs to re-ask. It cannot do this robotically (“Invalid input. Please try again.”). It needs empathy.

This is the only prompt in the project that asks for pure natural language output with no JSON schema:

The {slot_label} and {error} values come from the slot validator, not from the model. This separation is important: the validator decides what went wrong and why; the model decides how to say it kindly. Mixing these responsibilities into one prompt would make either the error detection or the empathy unreliable.

Here is what happens in a naive multi-turn agent without careful graph design:

Every time the user sends a message — whether it is their first message or their twentieth — the system starts at the beginning and runs an LLM call to classify intent. That LLM call costs tokens, takes time, and can go wrong. More importantly, mid-session the intent is already known. The user is mid-booking. They just said “From Jaipur.” Running an intent classifier at this point is not just wasteful — it is a point of failure.

There is a second problem: with ten nodes in a booking flow, how do you resume exactly at the right node mid-conversation? If the user is at “collect passenger names” and sends a new message, you need to land at collect_names directly — not replay the entire booking flow from scratch.

These two problems drive the entire graph architecture.

The central idea is: **the **step field in state is the single source of truth for where you are. Before any LLM runs, the graph reads step and routes accordingly. This makes routing:

The graph is organised into two levels:

Level 1: Top-Level Graph (3 nodes)  Reads state["step"] and routes to the right subgraph.  Fires the LLM-based router ONLY for brand-new sessions.Level 2: Subgraphs (booking + pnr)  Each is a complete, self-contained state machine.  Has its own internal dispatch table.  All routing functions inside are pure Python.

Let us trace how a user message flows through this structure.

Every user message enters the graph at dispatch_route(). This function reads the current step and decides where to go — before any LLM runs:

Three outcomes:

The LLM only fires on the very first turn of a new session. Every turn after that bypasses it entirely.

After the router runs, route_after_router() reads the classified intent and sends the session to the right subgraph:

Here is the complete top-level graph:

Three nodes. Four edges. The top-level graph contains zero business logic — it is a traffic cop:

Every user message        │        ▼  dispatch_route()    ← pure Python, reads state["step"], no LLM        │   ┌────┼─────────────────────────────────────────┐   │    │                                         │   ▼    ▼                                         ▼ step   step in _BOOKING_STEPS            step in _PNR_STEPS is     (mid-booking)                     (mid-PNR lookup) GREET  │                                 │   │    └──────────────────┐              └──────────────┐   ▼                       ▼                             ▼"router"               "booking"                      "pnr"(LLM runs,             (skip router entirely)         (skip router entirely) classifies intent)   │   ▼route_after_router()   ← pure Python, reads state["intent"]   │   ├─ book_flight    → "booking"   ├─ web_checkin    → "pnr"   ├─ flight_status  → "pnr"   └─ greeting/out   → END

The booking subgraph handles the complete flight booking flow across ten nodes: slot collection, validation, city resolution, flight search, flight selection, passenger details, and payment.

Every turn into this subgraph passes through one entry point: booking_guardrail.

Why a guardrail at the entry point?

Without it, mid-flow changes break things. A user who has just seen available flights might say “actually, change destination to Goa.” Without the guardrail, that message arrives at select_flight, which tries to parse it as a flight selection and fails. The guardrail intercepts, detects the modification intent using a quick LLM call, resets the relevant state fields, and sends the user back to slot collection — cleanly, without a full restart.

After the guardrail, a lookup table dispatches to the correct node based on step:

This is better than a long if/elif chain for two reasons. First, it is a data structure — adding a new step means adding one line to the dict, not editing a chain of conditions. Second, it has a clean default: any slot-collection step that is not listed falls through to "info_extractor", which is the correct starting node for new slot turns.

The routing functions between nodes are all pure Python — no LLM, no I/O, no side effects:

Each function answers exactly one question: given what just happened (what is now in state), where do we go? No branching logic, no LLM, no database queries. If the bot ends up in the wrong node, you read the routing function and trace the state backward. Debugging is just reading Python.

Here is the complete booking subgraph:

The booking flow visualised end-to-end:

Incoming turn      │      ▼booking_guardrail  ← checks step, intercepts mid-flow changes      │  _dispatch()      ← reads _STEP_TO_NODE lookup table      │      ├─ COLLECT_SLOTS / new → info_extractor      ├─ SHOW_FLIGHTS         → select  → END      ├─ CONFIRM_BOOKING      → confirm      ├─ PAYMENT              → payment → END      ├─ DONE                 → done   → END      └─ FLIGHT_CONFIRM /         WHATSAPP_CONSENT /         COLLECT_NAMES /         COLLECT_EMAIL        → info_extractor                                    │                          _after_info_extractor()                                    │                          ┌─────────┴─────────────────┐                          │                           │                     step == EXTRACTED           everything else                          │                           │                          ▼                           ▼                   validate_slots             conversation_driver                          │                           │               _after_validate_slots()      _after_conversation_driver()                          │                           │                  ┌───────┴──────────┐      ┌─────────┼──────────┐                  │                  │      │         │          │               city_error?    cities      term-    PAYMENT?  SEARCH?                  │           changed?    inated?    │          │                  ▼              │           │       ▼          ▼         conversation_driver  city_lookup  END   payment    search                  ↑              │                  │          │                  └──────────────┘                 END        END

The PNR subgraph handles web check-in and flight status queries. It is simpler — three nodes, two routing decisions — but follows the exact same philosophy: state-driven dispatch, pure Python routing:

Incoming turn      │      ▼  _dispatch()      │   ┌──┴─────────────────────────────────┐   │                                    │   ▼                                    ▼conversation_driver               info_extractor(asks for PNR → END)                   │                              _after_info_extractor()                                        │                            ┌───────────┴───────────────┐                            │                           │                            ▼                           ▼                   conversation_driver → END      pnr_lookup → END                   (re-ask if PNR not found)

Put it all together and every user turn follows this path:

User sends a message  (Streamlit or Telegram — same code path)        │        ▼  booking_graph.invoke(state)   ← single compiled entry point        │        ▼  dispatch_route(state)         ← pure Python, reads state["step"]        │   ┌────┼──────────────────────────────────────────────┐   │    │                                              │   ▼    ▼                                              ▼ GREET  step in _BOOKING_STEPS               step in _PNR_STEPS   │    │                                              │   ▼    ▼                                              ▼router  booking_subgraph                         pnr_subgraph(LLM)   └─ booking_guardrail                    └─ _dispatch        └─ _dispatch → [10 nodes]               └─ [3 nodes]   │   ▼route_after_router()   ├─ book_flight    → booking_subgraph   ├─ web_checkin    → pnr_subgraph   └─ flight_status  → pnr_subgraph

The LLM for intent classification fires exactly once per session — on the very first message. Everything after that is a dictionary lookup or a boolean check on state fields. Add a new booking step? Add one line to _BOOKING_STEPS and one entry to _STEP_TO_NODE. The rest of the routing stays untouched.

The easiest mistake in agentic AI is making one agent do too many things. A node that extracts slots, validates them, and decides what question to ask next is hard to test, hard to debug, and impossible to trust. When it fails, you don’t know which of the three jobs failed.

The design principle here is simple: every node has exactly one job. Some nodes call the LLM. Most do not. The LLM is only involved when the task genuinely requires language understanding. Everything else — validation, routing, city code lookup, price calculation — runs in plain Python.

Nodes in this project:LLM nodes (language understanding required) ├── router          → classifies intent ├── info_extractor  → extracts slots / names / PNR from user message ├── city_lookup     → resolves messy city names to canonical names ├── conversation_driver → generates retry messages and slot questions └── flight_selection → maps natural language to a flight index Pure Python nodes (no LLM) ├── booking_guardrail  → reads step, dispatches or intercepts ├── slot_validator     → checks dates, passenger counts ├── confirm            → routes affirm/deny/modify ├── flight_search      → runs SQL, formats results ├── payment            → assembles booking summary └── done               → saves booking, generates PNR

Here is a subtle failure mode. The user is mid-booking. Destination is already “Mumbai.” They say “From Jaipur.” The LLM extraction prompt sees one city and might return it as destination_city (because it appeared without a "from" keyword in isolation). Now you have destination_city = "Jaipur" — overwriting the user's earlier answer.

The fix is pure Python, no extra LLM call:

The logic: if exactly one city was extracted, and one city is already confirmed in state, the new city must be the other field. No additional LLM reasoning needed — just state and a conditional.

Validation is a rule-based job, not a language job. The slot validator runs checks in Python and writes errors back to state. No LLM call:

When slot_error is set, the routing function after this node sends execution to conversation_driver, which calls RETRY_MESSAGE_PROMPT to generate a natural re-ask. The validator produces the error message; the LLM produces the empathy.

City names from users are messy: “Bombay”, “New Delhi”, “Bengaluru”, “Banglore” (typo). A pure string match fails all of these. But asking the LLM to search 50 cities from scratch is overkill and error-prone.

The solution is a two-step approach:

User types: "Bombay"       │       ▼  get_candidate_cities("Bombay")    ← Python substring + fuzzy match  → ["Mumbai"]                      ← short candidate list       │       ▼  CITY_LOOKUP_PROMPT with candidates  → {"resolved_city": "Mumbai"}     ← LLM picks from short list       │       ▼  city_to_code("Mumbai") → "BOM"   ← dict lookup, no LLM
php

If resolution fails, city_error is set. The conversation driver reads it on the next turn, calls RETRY_MESSAGE_PROMPT, and asks for the city again — incrementing slot_attempts["departure_city"]. After three failures, terminated = True.

At any point in the booking flow, the bot needs to decide: what do I ask next? This sounds like a job for an LLM. It is not.

If you let the LLM decide what question to ask, you get inconsistent ordering. Sometimes it asks for the date before the destination. Sometimes it skips passengers. Sometimes it asks two things at once. Users get confused. Your slot completion rate drops.

The design principle is: question ordering is data, not model output. A fixed list of required fields, checked in a fixed order, produces a predictable question sequence every time. The LLM is only called to generate the wording of the question — not to decide which question to ask.

The _get_missing_flight_slots function returns the list of unfilled slots in a deterministic order:

The conversation driver takes the first item from this list and asks for it. If the user provides three slots in one message, the extractor fills them all, and the next call to _get_missing_flight_slots returns only what is still missing. The driver never asks for something it already has.

After the user selects a flight, the flow enters Phase 2. The confirmation_step field drives a fixed four-stage sequence — no LLM needed to decide the order:

flight_confirm  →  whatsapp_consent  →  collect_names  →  collect_email

Each stage calls PASSENGER_EXTRACTION_PROMPT with the current step, extracts the relevant piece of information, stores it in state, and advances confirmation_step to the next stage.

A real usability problem: a user booking for 3 passengers might send names in separate messages (“Mr John Smith” then “Mrs Jane Smith” then “Miss Amy Smith”). The naive approach — reject anything that doesn’t give all names at once — frustrates users.

The extractor accumulates names across turns:

The conversation driver reads the partial list and tells the user what it already has before asking for the rest:

The name_attempts counter only increments when the total count is still wrong after extraction. Sending names incrementally is not a failure — it is normal input. Three failures means three genuinely bad attempts, not three partial messages.

Every field that can fail has its own retry counter in slot_attempts:

This is more surgical than a single session-level counter. A user might get the departure city right on the first try but keep giving dates in the past. With per-field tracking, only the date field terminates after three failures — the session stays alive. With a global counter, the third bad date would kill a session that otherwise had good data.

The PNR lookup serves two different flows (web check-in and flight status) from the same database query. The process field in state determines how the result is formatted:

One fetch, two formatters, zero LLM calls. The database join handles the complexity; Python handles the formatting. The LLM is not involved in lookup or display — only in the PNR extraction step upstream.

The payment node assembles the booking summary from state — no database writes yet, no LLM calls:

The done node then writes the booking to the database and generates the PNR:

Separating payment summary (payment node) from booking persistence (done node) means you can show the user a full price breakdown before writing anything to the database — which is how real booking systems work.

You build an agent. It works in your tests. Then a real user says “the bot got stuck after I said yes.” You have no idea what happened.

The failure mode is familiar: agents fail quietly. An LLM returns a slightly different JSON shape, the parser silently returns None, the routing function sends execution to the wrong node, and the user sees a confusing message. Without a trace, you are guessing.

The design principle: instrument at the wrapper layer, not at the node layer. Every LLM call already passes through call_llm_json. Every non-LLM node already calls log_node. This means zero manual instrumentation is needed in any node — you get a complete per-turn trace automatically.

The hardest part of distributed tracing is knowing who called what. Here, the LLM wrapper walks the Python call stack to find the calling node’s filename — no node_name parameter needed anywhere:

When information_extractor.py calls call_llm_json(), the stack walk returns "information_extractor._extract_flight_slots". The log entry is automatically attributed to the correct node and function — even if the same prompt is called from multiple places.

Every call_llm_json call appends a structured entry to _run_logs:

Pure Python nodes use log_node() for the same structured format:

At the end of each turn, llm_module.get_logs() returns the full ordered list of everything that ran — LLM calls and Python nodes alike.

The Streamlit UI renders every assistant message with a second “Logs” tab. When a user’s turn is stored in st.session_state.chat, it carries the log list alongside the reply text:

When the chat history is rendered, messages with logs get two tabs instead of one:

The logs tab shows turn latency, LLM call count, node call count, and expandable entries for every individual call — prompt in, output out, token counts, latency. When a user says “the bot got confused,” you open the Logs tab for that message and the answer is right there.

Most chatbot tutorials end with a Streamlit app. Then you want to add Telegram. Now you have two codebases to maintain — or you copy-paste the business logic and everything drifts.

The design principle: channels are thin adapters. The graph knows nothing about channels.

All business logic — routing, extraction, validation, search, payment — lives in the LangGraph state machine. The channels only do three things: get the user’s message, call booking_graph.invoke(), and display the result.

Streamlit (app.py)                  Telegram (telegram_bot.py)      │                                       │      └──────────────┬────────────────────────┘                     ▼         booking_graph.invoke(state)                     │                     ▼              result["assistant_message"]                     │         ┌───────────┴───────────┐         ▼                       ▼    st.write(reply)         await message.reply_text(reply)

Adding a WhatsApp channel means writing one new adapter file. The booking logic, prompts, validation, and database queries stay exactly as they are.

The two channels store state differently, but the state shape is identical.

Streamlit stores state in st.session_state keyed by UUID, with overflow persistence to sessions.db. On page refresh, the session is restored from the database:

The session ID is stored in the URL as ?sid=.... After a page refresh, the URL carries the session ID, the app restores state from sessions.db, and the user continues exactly where they left off.

Telegram stores state in an in-memory dict keyed by chat_id. No persistence across bot restarts — appropriate for a demo.

Both adapters enforce a 30-minute inactivity timeout. If last_active_at is more than 30 minutes old, the state resets and the user starts over:

When a session expires, the adapter resets to INITIAL_STATE but preserves user_id and channel — so the user gets a fresh booking context without losing their identity. The expired session is deleted from sessions.db.

Run the app with streamlit run app.py and open http://localhost:8501.

Query 1 (typical booking): “I want to travel from Jaipur to Mumbai”

The assistant extracts the slots from the first message, then asks for travel dates.

Query 2 (edge case: round-trip): “Book a round trip from Jaipur to Hyderabad, 15th July to 20th July, 2 adults 1 child”

The two-phase flow handles outbound selection, stores it, then immediately searches and presents return flights.

Query 3 (failure case: unknown city): “Book a flight to Springfield”

The city lookup finds no candidates, sets a city error, re-asks for the city, and tracks the attempt. After three failures it terminates with a customer care number.

Query 4 (Check flight Status): “I want to know the status of my flight”

The assistant asks for PNR number and once the PNR is provided, assistant shares the status of the flight.

Query 5 (Web Checkin): “I want to do Web-checkin”

The assistant asks for PNR number and once the PNR is provided, assistant shares the steps for web-checkin.

Observability

On streamlit, each answer comes with a logs tab where users can see the path of the entire query as shown in below screenshot.

Run the app with python run telegram_bot.py and open telegram and search for your bot. Below GIF shows the complete end to end session of flight booking (which is mostly similar to Indigo 6ESKAI on WhatsApp).

In this article, we built a production-style flight booking assistant using LangGraph and OpenAI APIs that supports flight booking, web check-in, and flight status enquiries through natural conversations.

More importantly, we explored a class of AI applications where the primary challenge is not answering a question, but managing a business workflow. Unlike transformation or information-seeking systems, task-oriented dialogue systems must progressively gather information, maintain conversational state, validate user inputs, recover from errors, and guide users toward a successful outcome.

While the implementation focused on an airline booking assistant, the architectural patterns remain the same across many real-world applications. Customer onboarding, insurance claims, appointment scheduling, loan applications, technical troubleshooting, and countless enterprise workflows all require the same combination of workflow orchestration, structured state management, and LLM-powered language understanding.

I hope this article provides a practical blueprint for building production-ready Agentic AI systems that go beyond simple chatbots and are capable of completing real business processes.

The complete source code for this project is available on GitHub. Feel free to explore it, experiment with new workflows, and extend it to solve your own domain-specific problems.

Thank you for reading the article.

AgenticAI is complex and chaotic but getting started doesn’t have to be. I focus on making that first step simpler for you. Follow along for regular updates and more such articles.

Feel free to connect on Linkedin if you’re on a similar path.

And if you’re still curious, there’s more to explore.

Building a Conversational Flight Booking Assistant from Scratch with LangGraph, OpenAI API and… was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

source & further reading

pub.towardsai.net — original article OpenAI's GPT-5.6 Sol Hit 91.9% on Terminal-Bench — Then Cheated More Than Any Model METR Has Tested No, Your Chatbot Doesn’t Have Amnesia — It’s Drifting I Cracked Open Karpathy's $100 ChatGPT — the 2019 Original Cost $43,000 and 168 Hours

Building a Conversational Flight Booking Assistant from Scratch with LangGraph, OpenAI API and…

Run your AI side-project on zahid.host