Most teams trying to build their own AI receptionist think the hard part is the AI.

It's not. The AI is the easy part now.

The hard part is everything around the AI. The part that doesn't show up in demos or tutorials. The part that takes six to eight months to figure out and breaks every time it goes near production.

I've watched a few teams try to build this themselves. They all hit the same wall.

1000 ms: total latency budget per response
6-8 months: orchestration buildout before production
8 layers: hidden under the 30-second demo

What they think they're building

You watch a Vapi or Retell demo. Agent answers a call, takes a booking, sends a confirmation. Looks simple.

So they think the build is:

A weekend project.

What they're actually building

Here's what's underneath that 30-second demo.

Telephony layer. SIP trunking. Carrier integration. STIR/SHAKEN attestation so calls don't get marked as spam. Inbound number provisioning. Outbound caller ID verification. DTMF detection. Call recording compliance per state (one-party vs. two-party consent).

Audio infrastructure. Voice activity detection that doesn't false-trigger on background noise. Barge-in handling so the agent stops talking when the caller interrupts. Echo cancellation. Silence detection. Dropped audio recovery.

Latency budget. The whole call has a 1000ms response window before it sounds robotic. That 1000ms gets split across speech-to-text, LLM inference, tool calls, text-to-speech, telephony round trip. Each one has to be optimized. Miss the budget and customers hang up.
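To make the budget concrete, here is a minimal sketch of how that 1000 ms window might be split and checked. The stage names come from the article; the per-stage allocations are my illustrative assumptions, not any provider's real numbers.

```python
# Hypothetical split of the 1000 ms response window across pipeline stages.
# Allocations are illustrative; real systems tune these per deployment.
BUDGET_MS = {
    "speech_to_text": 200,
    "llm_inference": 350,
    "tool_call": 150,
    "text_to_speech": 200,
    "telephony_round_trip": 100,
}

def stages_over_budget(measured_ms: dict) -> list:
    """Return the stages that blew their slice of the budget."""
    return [stage for stage, limit in BUDGET_MS.items()
            if measured_ms.get(stage, 0) > limit]

print(stages_over_budget({"speech_to_text": 180, "llm_inference": 520}))
```

The point of writing it down as a table is that every millisecond you give one stage is a millisecond taken from another; there is no slack to absorb an unoptimized step.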

Tool reliability. The agent calls your CRM to book an appointment. The API times out at 8 seconds. Agent already said "perfect, you're booked for Thursday." Customer gets no confirmation. Shows up. No record. Trust gone.
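The fix for that failure mode is an ordering rule: the agent speaks the confirmation only after the write lands. A minimal sketch, where `write_booking` stands in for any CRM call with a short timeout (all names here are hypothetical):

```python
def book_appointment(write_booking, slot: str) -> str:
    """write_booking is any callable that persists the booking (e.g. a CRM
    API call with a short timeout) and raises on failure. Illustrative only.
    The rule: confirm ONLY after the write succeeds, never before."""
    try:
        write_booking(slot)
    except Exception:
        # The failure mode from the text: never say "you're booked" on a timeout.
        return ("I'm having trouble reaching the calendar. "
                "Can I take your number and text you once it's confirmed?")
    return f"Perfect, you're booked for {slot}."
```

The design choice is that the error path degrades honestly instead of guessing; a caller who gets a callback is annoyed, a caller who shows up to a nonexistent appointment is gone.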

State management. Call drops mid-conversation. Customer calls back. How does the agent know they were already 80% through booking? Handoff between inbound and outbound. Retry logic. Idempotency so the same booking doesn't get created twice.
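Idempotency here usually means deriving a stable key from the request itself, so the same booking attempt maps to the same record no matter how many times it is retried. A sketch, assuming caller phone plus time slot identifies a booking (a simplification; real systems key on more fields and store the seen-set in the database, not process memory):

```python
import hashlib

_processed = set()  # in production: a unique index in the database

def booking_key(phone: str, slot: str) -> str:
    """Same caller + same slot always hashes to the same key."""
    return hashlib.sha256(f"{phone}|{slot}".encode()).hexdigest()

def create_booking_once(phone: str, slot: str) -> bool:
    """True if a new booking was created; False if this exact booking
    already exists, so a retry or a second call-in can't double-book."""
    key = booking_key(phone, slot)
    if key in _processed:
        return False
    _processed.add(key)
    return True
```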

Escalation logic. When does the agent transfer to a human? When does it just take a message? How does it handle threats, lawsuits, contract disputes, refund demands? These aren't AI problems. They're product problems with hard rules.
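"Hard rules" means exactly that: deterministic checks that run before the model gets a say. A toy sketch; the trigger phrases and action names are my illustrative assumptions, not a production ruleset:

```python
# Trigger phrases and actions are illustrative, not a shipped ruleset.
ESCALATION_RULES = [
    ({"sue", "lawsuit", "lawyer"}, "transfer_to_owner"),
    ({"refund", "chargeback"}, "transfer_to_human"),
    ({"contract", "dispute"}, "take_message_flag_urgent"),
]

def route(transcript: str) -> str:
    """Deterministic escalation check, run before any LLM decision."""
    words = set(transcript.lower().split())
    for triggers, action in ESCALATION_RULES:
        if words & triggers:
            return action
    return "continue_with_agent"
```

Real implementations are fuzzier than word matching, but the principle holds: the categories that can cost you a customer or a lawsuit are decided by rules you wrote, not by model temperature.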

Monitoring. How do you know the agent is failing? You can't watch every call. You need three layers: system health (uptime, error rates), leading indicators (transfer rate, low-confidence responses), business outcomes (bookings, conversion, revenue).
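Those three layers can be as simple as one boolean each. A sketch with made-up thresholds (every cutoff below is an illustrative assumption; tune against your own baselines):

```python
def health_report(m: dict) -> dict:
    """One pass/fail per monitoring layer. All thresholds are illustrative."""
    return {
        "system_health": m["error_rate"] < 0.02 and m["uptime"] >= 0.999,
        "leading_indicators": m["transfer_rate"] < 0.15
                              and m["low_confidence_rate"] < 0.10,
        "business_outcomes": m["booking_conversion"] >= 0.30,
    }
```

The layering matters because the failures arrive in that order: error rates spike first, transfer rates creep up next, and by the time bookings drop you have already lost revenue.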

Model and data drift. The LLM provider updates their model. Agent behavior shifts subtly. Nobody notices for two weeks. You find out when bookings drop 15%.
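The two-week blind spot closes with a drift check comparing a recent window against a rolling baseline. A minimal sketch using the article's 15% figure as the trigger (the relative-drop formulation and the assumption that baseline > 0 are mine):

```python
def drift_alert(baseline: float, recent: float, max_drop: float = 0.15) -> bool:
    """Fire when the recent booking rate falls more than max_drop (relative)
    below the rolling baseline. Assumes baseline > 0."""
    return (baseline - recent) / baseline > max_drop
```

Run it daily on a seven-day window and the 15% drop surfaces in days, not weeks.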

The build-vs-buy moment

This is the conversation I have with operators who think they want to build it themselves.

They're not wrong about the AI. Anyone can prompt an LLM to sound friendly on the phone.

They're wrong about the rest.

I talked to a guy who'd been building his own setup for 8 months. He had the agent working great in test calls. The moment he tried to ship it into production, everything broke.

His telephony provider's webhook signing wasn't matching. His CRM API was throwing 500s on bookings during peak hours. His agent was confirming bookings before the API actually wrote them, so customers got told they had appointments that didn't exist. His latency was 2.4 seconds because he was running STT → LLM → TTS sequentially instead of streaming.
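The webhook signing mismatch is worth a sketch, because it is the first thing that breaks on contact with production. Most telephony providers sign webhook payloads with some variant of HMAC-SHA256; the exact header name, encoding, and signed payload differ per provider, so treat this as a generic outline, not any specific provider's scheme:

```python
import hashlib
import hmac

def verify_webhook(secret: bytes, raw_body: bytes, received_sig: str) -> bool:
    """Generic HMAC-SHA256 webhook check. The header name, hex vs. base64
    encoding, and what exactly gets signed vary by provider -- check their
    docs before copying this."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time compare avoids leaking the signature via timing.
    return hmac.compare_digest(expected, received_sig)
```

The classic mismatch: verifying against the parsed-and-reserialized JSON instead of the raw request bytes. Sign-and-compare only ever works on the exact bytes the provider sent.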

He asked me how long it took us to solve those problems.

About a year of running it in production with real shops.

He stopped trying to build his own.

Why this matters if you're shopping

If you're an operator looking at AI receptionist providers, the question isn't "do you have an AI that sounds good." Every provider sounds good in the demo.

The question is "what happens when something goes wrong."

Ask them:

What happens when the CRM API times out mid-call? What's the end-to-end latency budget, and how do they hit it? What happens when a call drops halfway through a booking? How do they catch it when the model provider ships an update and behavior drifts? What do they monitor beyond uptime?

Most cheap providers can't answer these. They shipped the demo. They didn't ship the production system.

The difference between a $300/month AI receptionist and one that actually works is everything underneath the conversation.

The takeaway

Building the AI is no longer the hard part. The infrastructure around it is.

If you're an operator, ask the harder questions before you buy. The conversation quality is table stakes. The orchestration is what determines whether the agent actually books the job.

If you're a builder thinking about competing in this space, plan for six to eight months on the orchestration before you ship. Or pick a different problem. This one is solved by people who have already taken the lumps.

If you want to see what running the orchestration looks like from the operator side, my last long-form was on how I replaced hours of manual work with a self-hosted AI agent — same NeverMiss, different stack, full build log including the security layer most tutorials skip.