Building an HVAC After-Hours Emergency Escalation System: A Complete Engineering Guide
How we built a fault-tolerant HVAC emergency triage and tech-dispatch platform on Kubernetes — three-tier CQRS, 11 micro-agents on the OpenAI Agents SDK + LangGraph, NATS JetStream, DTMF/SMS/WebSocket acceptance, circuit breakers, and an evaluation pipeline that catches regressions before they wake a tech at 3 AM.
The HVAC Owner's 2 AM Problem
Every HVAC owner I have talked to has the same after-hours horror story. The phone rings at 2:14 AM. The voicemail says something about heat, or a smell, or a leak. By the time the owner listens, decides whether it is real, calls a tech, and gets a truck rolling, the customer has already called the next company in the search results. Worse: half the after-hours calls are not emergencies at all — somebody who got home late and wants to book a maintenance visit for next Tuesday — but the owner cannot tell which is which without listening to every one.
The bad outcomes are expensive. A missed no-heat call in January costs $10,000–$50,000 when pipes burst overnight and the homeowner files an insurance claim against you. A missed gas-smell call is a liability event you do not want to talk to your attorney about. A missed commercial walk-in cooler call costs you the account. The good outcomes — same-night dispatch on a real emergency — are how HVAC companies earn 5-star reviews and triple their after-hours revenue.
This post is the complete engineering guide to the CallSphere After-Hours Escalation system, purpose-built for HVAC. Three-tier CQRS architecture on Kubernetes, eleven micro-agents that triage HVAC emergencies in under a second, and a fault-tolerant dispatch loop that pages the on-call tech via voice + SMS + DTMF acknowledgment until somebody accepts the job.
What Counts As An HVAC Emergency
The triage layer is the heart of the product, and it is HVAC-specific. The model is fine-tuned on real HVAC after-hours messages and scores each one on a 0.0–1.0 urgency axis. Roughly:
- 0.9–1.0 — Dispatch immediately: gas smell, carbon-monoxide alarm sounding, boiler leak with active water, no heat at outdoor temp ≤ 20°F with infants/elderly in the home, commercial walk-in cooler down with product inside.
- 0.6–0.9 — Dispatch tonight: no heat at moderate outdoor temp, no AC during a heat advisory, furnace tripping breaker repeatedly, refrigerant smell.
- 0.3–0.6 — Confirm with on-call but probably tomorrow: AC making unusual noise, thermostat malfunction, intermittent issues, scheduling questions framed as urgent.
- 0.0–0.3 — Auto-acknowledge, no page: appointment requests, billing questions, "just leaving a message," vendor sales calls, spam.
The threshold to wake a tech is configurable per company — most start at 0.6 and tighten to 0.7 once they trust the system. Anything below threshold is logged for the morning and the customer gets an SMS auto-reply confirming receipt and offering a same-day callback window.
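The band edges and the configurable page threshold reduce to a small routing function. This is an illustrative sketch, not CallSphere's production code; the names `route_urgency` and `TriageDecision` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class TriageDecision:
    action: str       # "dispatch_now", "dispatch_tonight", "confirm", "auto_ack"
    page_tech: bool   # does this wake the on-call tech?


def route_urgency(score: float, page_threshold: float = 0.6) -> TriageDecision:
    """Map a 0.0-1.0 HVAC urgency score to a dispatch action."""
    if score >= 0.9:
        action = "dispatch_now"
    elif score >= 0.6:
        action = "dispatch_tonight"
    elif score >= 0.3:
        action = "confirm"
    else:
        action = "auto_ack"
    # Only scores at or above the company's threshold page a tech; everything
    # else is logged for the morning and answered with an SMS auto-reply.
    return TriageDecision(action=action, page_tech=score >= page_threshold)
```

A company that tightens its threshold to 0.7 stops paging on the 0.6–0.7 band without changing how the call is classified for the morning log.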
Architecture Overview
The system is a three-tier CQRS split: a thin edge that ingests calls and emails, a stateful Go gateway that owns routing and WebSocket connections, and a fleet of stateless Python agent workers that own AI inference and dispatch. Each tier scales independently because each has a different bottleneck — bandwidth at the edge, connection count at the gateway, and inference time at the workers.
flowchart TB
```mermaid
    subgraph Edge["Edge / Customer Touchpoints"]
        Email["Service Email Inbox<br/>IMAP Polling"]
        Twilio["Twilio Inbound Call<br/>+ Voicemail Transcription"]
        Dialpad["Dialpad / RingCentral<br/>Webhooks"]
        WS["Owner Dashboard<br/>WebSocket"]
    end
    subgraph Gateway["Tier 1: Go API Gateway"]
        Gin["Gin/Fiber Server<br/>5-20 HPA Pods<br/>25K req/sec"]
        WSHub["WebSocket Hub<br/>200K Concurrent<br/>2KB / Goroutine"]
    end
    subgraph Bus["Durable Event Bus"]
        NATS["NATS JetStream<br/>8K msg/sec"]
    end
    subgraph Workers["Tier 2: Python Agent Workers"]
        AW["10-100 K8s Pods<br/>HPA on Queue Depth"]
        LG["LangGraph Runtime<br/>Durable Checkpoints"]
        Agents["11 HVAC Agents<br/>OpenAI Agents SDK"]
    end
    subgraph Data["Tier 3: Datastores"]
        PG[("PostgreSQL Primary<br/>+ 2 Read Replicas<br/>CP / 10K writes/sec")]
        Redis[("Redis Cluster<br/>6 nodes, 24Gi<br/>AP / 1M ops/sec")]
        ES[("Elasticsearch<br/>3 nodes / Debezium CDC<br/>200K searches/sec")]
    end
    Email --> Gin
    Twilio --> Gin
    Dialpad --> Gin
    WS --> WSHub
    Gin --> NATS
    WSHub --> NATS
    NATS --> AW
    AW --> LG
    LG --> Agents
    Agents --> PG
    Agents --> Redis
    Agents --> ES
    PG --> ES
```
Tier 1: The Go API Gateway
The gateway is written in Go (Gin) for one reason: WebSocket fan-out. We hold ~200K concurrent WebSocket connections from owner dashboards and customer status pages, and each goroutine consumes ~2KB of memory. The Python asyncio equivalent we benchmarked first used ~100KB per coroutine — fifty times the footprint and a non-starter at our scale.
The gateway's job is intentionally narrow:
- Authenticate — JWT for the owner dashboard, signed Twilio webhooks for inbound calls/SMS, signed Dialpad/RingCentral payloads.
- Validate — reject malformed payloads before they touch the bus.
- Rate-limit — Redis token bucket per HVAC company tenant.
- Publish — push the event onto NATS and return 200 to the caller in ~5 ms.
Crucially, the gateway never calls the LLM. The webhook returns 200 immediately, the work is enqueued onto NATS JetStream, and a worker picks it up. This decouples ingestion latency from inference latency — vital because Twilio retries any webhook that takes longer than 15 seconds, and LLM tail latency under load can blow past that. Five to twenty pods behind an HPA sustain 25K req/sec; CPU stays under 30% even when a polar-vortex event triples normal call volume.
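The per-tenant rate limit is a token bucket. Here is a minimal in-memory sketch of the refill logic; the production gateway is Go and keeps these counters in Redis so all pods share state, but the algorithm is the same idea.

```python
import time


class TokenBucket:
    """In-memory token bucket, one per HVAC company tenant.

    Illustrative only: production stores tokens in Redis so every
    gateway pod enforces the same per-tenant limit.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```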
Tier 2: The 11 HVAC Agents
The agent fleet is the brain. Eleven small, single-purpose agents are orchestrated via the OpenAI Agents SDK and LangGraph, with handoffs governed by a head agent. Each agent has a tightly scoped tool surface — the triage agent cannot dispatch a tech, the voice agent cannot mutate the on-call rotation. This is bulkhead isolation at the agent boundary, not just the service boundary.
```mermaid
flowchart TB
    Ingress["Customer Call or<br/>Service Email"]
    Head["Head Agent<br/>Routes & Handoffs"]
    subgraph Triage["Triage Layer"]
        Email_T["Email Triage Agent<br/>HVAC Urgency 0.0-1.0"]
        Call_T["Call Triage Agent<br/>Parse Webhook + Caller ID"]
        VM_T["Voicemail Analyzer<br/>Detect No-Heat / Gas / Leak"]
    end
    subgraph Decision["Decision Layer"]
        Dispatch_O["Dispatch Orchestrator<br/>Build On-Call Ladder"]
        HITL["Human-in-the-Loop<br/>Owner Approval"]
    end
    subgraph Action["Action Layer"]
        Voice_A["Voice Agent<br/>Job Brief TTS to Tech"]
        SMS_A["SMS Agent<br/>Address + Issue + ETA Ask"]
        Ack_M["Ack Monitor Agent<br/>Tech Accepted Job?"]
    end
    subgraph Fallback["Fallback Layer"]
        Keyword["Keyword Triage<br/>Circuit Breaker Open"]
        SMS_FB["SMS-Only Path<br/>Twilio Voice Down"]
    end
    Ingress --> Head
    Head --> Email_T
    Head --> Call_T
    Head --> VM_T
    Email_T --> Dispatch_O
    Call_T --> Dispatch_O
    VM_T --> Dispatch_O
    Dispatch_O --> HITL
    HITL --> Voice_A
    HITL --> SMS_A
    Voice_A --> Ack_M
    SMS_A --> Ack_M
    Dispatch_O -.degrade.-> Keyword
    Voice_A -.outage.-> SMS_FB
```
The eleven agents:
- Head Agent — Dispatch and handoff coordination across the graph.
- Email Triage Agent — Scores incoming service email 0.0–1.0 with HVAC-specific rubric.
- Call Triage Agent — Parses signed Twilio/Dialpad webhooks; pulls caller ID against customer DB.
- Voicemail Analyzer Agent — Reads transcribed voicemails; flags no-heat, gas-smell, leak, CO-alarm markers.
- Dispatch Orchestrator — Builds the on-call ladder from rotation + owner fallback.
- Voice Agent — Generates a 35–50 word job brief for the on-call tech with address, issue, and DTMF prompt.
- SMS Agent — Composes ≤160-char SMS with address, issue summary, and "Reply YES to accept."
- Acknowledgment Monitor Agent — Detects acceptance across DTMF, SMS, and dashboard click.
- HITL Approval Agent — Pauses the graph for ambiguous cases the owner wants to eyeball.
- Keyword Fallback Agent — Deterministic regex triage when the LLM circuit is open.
- Audit Agent — Writes the event-sourced trail to PostgreSQL after every state transition.
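As one concrete example of an agent's constraint, here is an illustrative sketch of the SMS Agent's length rule: the accept prompt always survives, and the issue text is truncated to keep the page within a single 160-character segment. The helper name is hypothetical.

```python
def compose_dispatch_sms(address: str, issue: str, limit: int = 160) -> str:
    """Address + issue + accept prompt, never over the single-segment limit."""
    suffix = " Reply YES to accept."
    room = limit - len(suffix)
    body = f"{address}: {issue}"
    if len(body) > room:
        body = body[: room - 3] + "..."  # truncate the issue, keep the ask
    return body + suffix
```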
Why NATS JetStream
NATS JetStream sits between the gateway and the workers. We did not pick it for raw performance — Kafka would also work — we picked it for three operational properties that matter for an after-hours product:
- Durable consumers with at-least-once delivery. If a worker crashes mid-dispatch, the message is redelivered and LangGraph's checkpoint resumes from the last committed step. No duplicate page-outs because the agent is idempotent on call SID.
- Queue depth as the autoscaling signal. CPU is the wrong metric for an LLM-bound workload — a worker can be 100% blocked on an OpenAI call while CPU sits at 5%. We export NATS pending message count to Prometheus and HPA scales workers from 10 to 100 pods on backlog depth — useful when a cold front drops temps and call volume spikes 5x in twenty minutes.
- Operational simplicity. A single 3-node NATS cluster versus Kafka with Zookeeper, schema registry, and connect workers. For 8K msg/sec we did not need Kafka's throughput ceiling.
The flow is a saga: each agent step publishes its result to a downstream subject, and the orchestrator owns compensating actions (cancel pending tech pages if the customer calls back to cancel). Saga semantics over distributed transactions because we span Twilio, OpenAI, PostgreSQL, and Redis — no two-phase commit is going to coordinate that.
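Idempotency on the call SID is what makes at-least-once delivery safe. Here is a minimal sketch of the dedup idea, with an in-memory set standing in for the durable store a production system would use:

```python
class IdempotentStep:
    """At-least-once delivery means a step may run twice; keying each side
    effect on (call_sid, step) turns redelivery into a no-op. An in-memory
    set stands in for a durable table here."""

    def __init__(self):
        self._seen = set()
        self.effects = []

    def run(self, call_sid: str, step: str, effect) -> bool:
        key = (call_sid, step)
        if key in self._seen:
            return False  # redelivered message: skip the side effect
        self._seen.add(key)
        self.effects.append(effect())
        return True
```

When a worker crashes after paging a tech but before acking the NATS message, the redelivered message hits the seen-key check and no second page goes out.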
The Tech-Acceptance Loop
Acceptance is the trickiest piece of the system. An on-call HVAC tech needs to be able to accept the job across whichever channel they are reachable on at 2 AM — DTMF on the call we made to them, an SMS reply, or a click in the company dashboard. All three must converge on a single canonical "accepted by tech X" state with strong consistency: dispatching two trucks to one job is bad, but dispatching the wrong tech is worse.
```mermaid
flowchart TB
    Start["HVAC Score ≥ 0.6"]
    Build["Build On-Call Ladder<br/>Primary Tech → Secondary → Owner"]
    Loop["For Each Tech"]
    subgraph Channels["Parallel Channels"]
        Call["Twilio Call to Tech<br/>Job Brief + Press 1 to Accept"]
        SMS["Twilio SMS to Tech<br/>Reply YES + ETA"]
        Dash["Dashboard WebSocket<br/>Click Accept"]
    end
    Wait["Wait 120s OR Accept"]
    DTMF{"DTMF<br/>Pressed?"}
    SMSReply{"SMS<br/>Replied YES?"}
    DashClick{"Dashboard<br/>Clicked?"}
    Idem["Idempotent Write<br/>PostgreSQL Tx<br/>SELECT FOR UPDATE"]
    Cancel["Cancel All<br/>Pending Channels"]
    Done["Job Accepted<br/>SMS Customer ETA<br/>Stop Escalation"]
    Next["Advance to<br/>Next Tech"]
    Start --> Build
    Build --> Loop
    Loop --> Call
    Loop --> SMS
    Loop --> Dash
    Call --> Wait
    SMS --> Wait
    Dash --> Wait
    Wait --> DTMF
    Wait --> SMSReply
    Wait --> DashClick
    DTMF -->|Yes| Idem
    SMSReply -->|Yes| Idem
    DashClick -->|Yes| Idem
    Idem --> Cancel
    Cancel --> Done
    DTMF -->|No, timeout| Next
    SMSReply -->|No, timeout| Next
    DashClick -->|No, timeout| Next
    Next --> Loop
```
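The escalation loop itself reduces to a short function once the channel fan-out and the acknowledgment wait are injected as callables. A hedged sketch with hypothetical names, not the production worker code:

```python
from typing import Callable, Optional, Sequence


def run_ladder(
    ladder: Sequence[str],
    page: Callable[[str], None],
    wait_for_ack: Callable[[str, float], bool],
    timeout_s: float = 120.0,
) -> Optional[str]:
    """Page each tech in ladder order; return whoever accepted, else None."""
    for tech in ladder:
        page(tech)  # fire voice + SMS + dashboard in parallel
        if wait_for_ack(tech, timeout_s):
            return tech  # caller cancels pending channels, SMSes the customer
    return None  # ladder exhausted: escalate to owner / alerting
```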
The CP guarantee is enforced in PostgreSQL with a SELECT ... FOR UPDATE on the dispatch row, then a single UPDATE that flips status from paging to accepted. Whichever channel accepts first wins; the others see the row already locked, no-op, and emit a "duplicate accept" event for audit. The whole transaction is idempotent on (dispatch_id, channel, tech_id) so Twilio's webhook retries never produce two accepted records.
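The first-accept-wins semantics can be simulated in a few lines, with a mutex standing in for the `SELECT ... FOR UPDATE` row lock and an in-memory set standing in for the `(dispatch_id, channel, tech_id)` idempotency key. This is a model of the behavior, not the actual PostgreSQL code:

```python
import threading


class DispatchRecord:
    """Simulates the dispatch row: the mutex plays the role of the row lock,
    and the status flip paging -> accepted happens inside it."""

    def __init__(self, dispatch_id: str):
        self.dispatch_id = dispatch_id
        self.status = "paging"
        self.accepted_by = None
        self._lock = threading.Lock()
        self._applied = set()  # idempotency keys already processed

    def try_accept(self, channel: str, tech_id: str) -> bool:
        key = (self.dispatch_id, channel, tech_id)
        with self._lock:  # stand-in for SELECT ... FOR UPDATE
            if key in self._applied:
                return self.accepted_by == tech_id  # webhook retry: no-op
            self._applied.add(key)
            if self.status != "paging":
                return False  # another channel already won the race
            self.status = "accepted"
            self.accepted_by = tech_id
            return True
```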
As soon as a tech accepts, the system fires an outbound SMS to the original customer with the tech's name and ETA — closing the loop in under a minute from the time the customer first called. At 5K concurrent dispatches, P95 round-trip is under 500 ms; most of that is Twilio call setup, not our database.
Resilience: Circuit Breakers and the Twilio Outage
Every external dependency is wrapped in a circuit breaker. Each agent has its own breaker so a degraded OpenAI region cannot take down SMS dispatch. The breaker exposes three states — closed, half-open, open — and on open, the agent falls back to a deterministic path.
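A minimal breaker with those three states looks roughly like this; the thresholds and probe policy are illustrative, not the production values:

```python
import time


class CircuitBreaker:
    """Minimal closed / open / half-open breaker with a fallback path."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means closed

    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_after_s:
            return "half-open"  # let one probe call through
        return "open"

    def call(self, fn, fallback):
        if self.state() == "open":
            return fallback()  # deterministic path, no call attempted
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0      # success (or half-open probe succeeded)
        self.opened_at = None  # close the circuit
        return result
```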
```mermaid
flowchart LR
    Req["HVAC Call/Email Arrives"]
    OAI{"OpenAI<br/>Circuit?"}
    KW["Keyword Triage<br/>Regex Fallback"]
    LLM["LangGraph + GPT<br/>Normal HVAC Triage"]
    Tw{"Twilio Voice<br/>Circuit?"}
    Voice["Voice Call to Tech<br/>+ SMS Backup"]
    SMSOnly["SMS-Only Page<br/>Maintain Delivery"]
    Ack["Tech Accepted<br/>Truck Rolling"]
    Req --> OAI
    OAI -->|Closed| LLM
    OAI -->|Open| KW
    LLM --> Tw
    KW --> Tw
    Tw -->|Closed| Voice
    Tw -->|Open| SMSOnly
    Voice --> Ack
    SMSOnly --> Ack
```
This was not theoretical. During a Twilio us-east-1 regional outage, the voice circuit opened within 90 seconds of the first error spike. Every dispatch that followed routed straight to SMS-only. We maintained 99.7% acknowledgment delivery for 150K+ active dispatch users with zero data loss while Twilio's voice product was down. When Twilio recovered, half-open probe traffic detected it, the circuit closed, and normal voice paging resumed within two minutes.
The keyword fallback is a 200-line file of regex patterns: no heat, gas smell, furnace not, pilot light, boiler leak, frozen pipes, CO alarm, walk-in down. It is dumber than the LLM, it produces more false positives, and that is fine — when the LLM path is down, the right answer for HVAC is to over-page, not under-page. Conservative degradation, not graceful failure.
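A representative slice of that fallback, assuming the shortened pattern list shown here rather than the actual 200-line file:

```python
import re

# If the LLM circuit is open, any of these patterns forces a page.
# Over-paging is the intended failure mode for this path.
EMERGENCY_PATTERNS = [
    r"\bno heat\b",
    r"\bgas (smell|leak)\b",
    r"\bsmell(s|ing)? gas\b",
    r"\bco alarm\b|\bcarbon monoxide\b",
    r"\bboiler leak\b",
    r"\bfrozen pipes?\b",
    r"\bpilot light\b",
    r"\bwalk-?in (cooler )?(is )?down\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in EMERGENCY_PATTERNS]


def keyword_triage(transcript: str) -> bool:
    """True = page the on-call tech. Dumber than the LLM, on purpose."""
    return any(p.search(transcript) for p in _COMPILED)
```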
Human-in-the-Loop: Owner Approval Mode
Some HVAC owners want to eyeball ambiguous calls before a tech is woken up at 2 AM — especially small shops where the owner pays the on-call premium and a false page costs real money. For those companies, the orchestrator triggers a LangGraph interrupt() on any call scoring 0.5–0.75, which suspends the graph and surfaces the agent's reasoning + proposed tech to the owner's phone via push notification.
The owner can:
- Approve the proposed dispatch as-is — graph resumes from the checkpoint, no replay.
- Modify the on-call ladder (skip Tech A who is on vacation, page Tech B instead) — graph state is patched and resumes.
- Reject — graph is killed, customer gets a polite "we'll call you at 7 AM" SMS, audit event recorded.
Resume latency is sub-second because LangGraph checkpoints are kept hot in Redis. Adding owner-approval mode cut false-positive dispatches by 31% while preserving fully-automatic dispatch for high-confidence emergencies above 0.75.
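The score band that triggers the interrupt is a one-line check; the function and label names here are illustrative, not the orchestrator's API:

```python
def hitl_route(score: float, lo: float = 0.5, hi: float = 0.75) -> str:
    """Route by urgency score: auto-dispatch above the band, suspend the
    graph for owner approval inside it, log for the morning below it."""
    if score > hi:
        return "auto_dispatch"   # high-confidence emergency
    if score >= lo:
        return "interrupt"       # suspend graph, push to owner's phone
    return "log_for_morning"
```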
Datastore Choices: CP and AP, On Purpose
The CAP trade-off is made deliberately per data class:
- PostgreSQL (CP) — primary plus two read replicas behind PgBouncer with 10K max connections. Sustains ~10K writes/sec and ~100K reads/sec. Owns canonical state: customer records, on-call rotations, dispatches, acceptances, audit trail. Strong consistency is non-negotiable — a stale read of dispatch status sends two trucks.
- Redis Cluster (AP) — six nodes, 24Gi total, ~1M ops/sec, TTL-based eviction. Owns sessions, rate-limit counters, hot LangGraph checkpoints, and the WebSocket pub/sub. Eventual consistency is acceptable because none of this data is canonical.
- Elasticsearch — three sharded nodes fed by Debezium CDC from PostgreSQL. ~200K searches/sec. Owns the searchable view: full-text on customer notes, voicemail transcripts, dispatch history. Always behind PostgreSQL by a few hundred milliseconds; we never read from ES for transactional decisions.
Observability and Evaluation
The observability stack is the difference between "agent system in production" and "incident waiting to happen at 3 AM." We instrument three layers in parallel and they feed each other.
```mermaid
flowchart TB
    subgraph Production["Production Traces"]
        OTel["OpenTelemetry<br/>Distributed Spans"]
        LangSmith["LangSmith<br/>Agent Trace Capture"]
        Prom["Prometheus<br/>+ Grafana"]
    end
    subgraph Evaluation["Evaluation Loop"]
        Dataset["HVAC Eval Dataset<br/>200+ Edge Cases"]
        Replay["Replay-Based<br/>Regression Evals"]
        Judge["LLM-as-Judge<br/>Quality Scoring"]
    end
    subgraph Mining["Continuous Mining"]
        Mine["Trace Mining<br/>Promote Failures"]
        Rotate["Monthly Rotation<br/>New Edge Cases"]
    end
    subgraph Alerting["Alerting"]
        P99["P99 Latency Spike"]
        Err["Error Rate Threshold"]
        Page["PagerDuty"]
        MTTR["MTTR 45min → 8min"]
    end
    OTel --> Prom
    LangSmith --> Replay
    Dataset --> Replay
    Replay --> Judge
    LangSmith --> Mine
    Mine --> Rotate
    Rotate --> Dataset
    Prom --> P99
    Prom --> Err
    P99 --> Page
    Err --> Page
    Page --> MTTR
```
LangSmith captures every agent trace with inputs, outputs, tool calls, latency, and token cost. We replay traces against new prompt versions to catch regressions before they ship. The HVAC eval dataset has 200+ edge cases — multilingual emergencies, false alarms ("the AC is loud"), ambiguous urgency ("there's a smell but I think it's fine"), commercial vs residential, with outdoor temperature context. Every PR that touches a prompt or model runs the full set.
LLM-as-Judge scores agent decisions on rubrics: was urgency correctly classified, was the right tech paged given skill matrix and geography, was the SMS clear and within Twilio length limits. We track judge agreement against owner-labeled samples and only trust scores when agreement exceeds 0.8.
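Judge agreement against owner-labeled samples is plain label accuracy; a sketch (names are illustrative):

```python
def judge_agreement(judge_labels, owner_labels) -> float:
    """Fraction of samples where LLM-as-judge matches the owner's label."""
    assert judge_labels and len(judge_labels) == len(owner_labels)
    matches = sum(j == o for j, o in zip(judge_labels, owner_labels))
    return matches / len(judge_labels)


def trust_judge(judge_labels, owner_labels, floor: float = 0.8) -> bool:
    """Only act on judge scores once agreement clears the floor."""
    return judge_agreement(judge_labels, owner_labels) >= floor
```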
Production metrics — P95/P99 latency per agent, false-positive rate via owner overrides, tool call counts per turn, dollar cost per dispatch. Alerting on P99 latency spikes and error-rate thresholds dropped MTTR from 45 minutes to 8 minutes.
Deployment and Scaling on Kubernetes
The whole platform runs on a single Kubernetes cluster with namespace isolation per tier. Deployment is GitHub Actions → container registry → Argo Rollouts canary. Prompts and model configs are versioned separately from code via a config service so we can roll back a bad prompt in seconds without rebuilding an image.
Specifically for the AI workload:
- Workers are sized small — 1 CPU, 1Gi RAM per pod. They are I/O-bound on the LLM call; oversized pods waste money.
- HPA on queue depth via the NATS Prometheus exporter. Workers scale 10 → 100 pods on a 60-second window — useful when a winter storm triples after-hours volume.
- Provider rate limits are the real ceiling, not pod count. We shard across multiple OpenAI organization keys with a token-bucket limiter; the breaker degrades to fallback when we approach quota.
- Graceful shutdown with a 120-second termination grace period. SIGTERM stops NATS pulls, in-flight LangGraph runs checkpoint, then the pod exits. No mid-dispatch drops on rolling deploys.
- Pre-warm the prompt cache by keeping system-prompt prefixes stable. A worker that just started serves its first request with a warm cache because the prefix is shared across the fleet.
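To make the queue-depth autoscaling concrete, here is a hypothetical HPA manifest scaling on NATS consumer backlog. The external metric name, selector, and targets depend entirely on how the NATS Prometheus exporter is wired into your metrics adapter; treat every value below as a placeholder, not CallSphere's actual config.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-workers          # placeholder deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-workers
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: External
      external:
        metric:
          name: nats_consumer_num_pending   # placeholder metric name
        target:
          type: AverageValue
          averageValue: "50"                # ~50 queued messages per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60        # the 60-second window above
```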
The full deployment serves 500K+ active users, 200K+ concurrent WebSocket connections, 25K+ requests/sec at the gateway, and 8K+ messages/sec through the worker tier.
Lessons From HVAC Production
Three patterns have held up across six months of HVAC dispatches:
- The triage taxonomy is more valuable than the model. Whether you use GPT-4o or Llama 3.1 matters less than whether your scoring rubric correctly distinguishes "no heat at -10°F with a baby in the house" from "the heat seems weak in one room." Spend the first month building the rubric with the owner, not tuning the model.
- Make the LLM call optional. The keyword fallback feels like over-engineering until the day OpenAI has a regional outage during a polar vortex. Then it is the only thing keeping pipes from freezing in your customers' homes. Conservative degradation beats clean failure.
- Evaluation is not a gate, it is a loop. A static eval set rots. Mine production dispatches every week, promote the surprising ones into the dataset, retire the trivial ones. The eval set should always feel slightly harder than yesterday.
Try CallSphere After-Hours For HVAC
The system described in this post powers CallSphere's After-Hours product for HVAC companies — production-ready emergency triage and tech dispatch. Eleven AI agents, configurable on-call ladders, Twilio voice + SMS + DTMF acceptance, full event-sourced audit trail, owner approval mode for ambiguous calls.
Book a 15-minute demo or see the live dashboard.
FAQ
Q: How do you tell a real no-heat call from someone who just wants to schedule maintenance?
The triage agent uses HVAC-specific signals: explicit phrases ("no heat," "freezing"), outdoor temperature pulled from a weather API, household composition if known (infants/elderly), customer call history, and tone markers in the voicemail audio. Anything below 0.6 score is logged for the morning, not paged.
Q: What happens if both Twilio and OpenAI are down at the same time during a polar vortex?
Keyword fallback for triage, SMS-only for delivery. If both Twilio voice and SMS are down, the dashboard WebSocket path still surfaces every emergency to the owner in real time. We have not yet seen all three down simultaneously, but the system is designed to over-page in that scenario rather than miss real emergencies.
Q: How do you prevent two trucks from being dispatched to one job when Twilio retries a webhook?
Every agent step is idempotent on the call SID + step ID, and PostgreSQL row-level locking on the dispatch record means a duplicate webhook becomes a no-op. The agent re-enters the LangGraph from its last checkpoint instead of starting over.
Q: Can the owner override the AI before a tech is paged?
Yes — that is the human-in-the-loop interrupt path. For ambiguous calls scoring 0.5–0.75, the graph suspends and the owner gets a push notification. They can approve, modify the on-call ladder, or reject. Resume latency is sub-second because checkpoints are hot in Redis.
Q: Does this replace my existing answering service?
Most HVAC companies on this product fully replaced their answering service within 60 days. The cost difference is significant ($1,500–$5,000/month vs the CallSphere subscription) and the triage accuracy is higher because the model is HVAC-specific instead of a generic call-center script.