Designing an AI Pipeline That Works Everywhere, All the Time

Building an AI system that works on Friday afternoon is easy. Building one that reliably responds to events across a globally distributed team - in real time, around the clock - is a different problem entirely.

I wrote about the matching logic behind this system in an earlier post. That post covers the five-layer architecture: intake, vector search, LLM scoring, threshold routing, and feedback. What it doesn't cover is the part that took just as much thought: making the whole thing reliable enough that the team can count on it regardless of when or where they're working.

That reliability work is what this post is about.

The constraint that changes everything

The engineering and product teams using this system aren't in the same building, or even the same timezone. A customer submits a feature idea at 10am their time. Someone on the product team, halfway around the world, is asleep. The system needs to process that idea, run the matching logic, and write the result back before anyone on either end thinks to check.

There's no "overnight batch window" when everyone is asleep and the system can catch up. There's no single morning standup where results need to be ready. The system runs constantly, responding to events as they happen - customer ideas coming in from wherever customers are, PMs triaging from wherever they're working.

When it doesn't work, the failure is invisible until someone notices a gap. They're not looking at a broken dashboard. They're just not seeing matches they should be seeing. That's the worst failure mode, and it's the one that erodes trust fastest.

The design constraint I kept coming back to was: "This has to respond to events reliably, in any timezone, without anyone babysitting it." That framing forced decisions I wouldn't have made if reliability had been an afterthought.

Volume added pressure on top of that. With thousands of roadmap features and tens of thousands of ideas in the system, the pipeline needed to be smart about what it processes and resilient when parts of it fail. A single slow dependency can stall everything if the architecture doesn't account for it.

The architecture for reliability

The first thing I changed from the prototype was making the pipeline fully async.

In the prototype, idea intake and LLM scoring happened in sequence. A new idea came in, the system immediately started scoring it, and nothing else happened until that finished. Fine for testing. Terrible for production. If the LLM call took 30 seconds (or timed out), everything downstream waited.

The production version separates intake from processing entirely. A new idea arrives, gets validated, and goes into a queue. The intake step returns immediately. Scoring happens in the background, independently. Ideas pile up in the queue; the scoring workers drain it at their own pace.
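A minimal sketch of that separation, using only Python's standard library. The in-memory queue stands in for the real one, and intake, score, worker, and run_demo are hypothetical names for illustration, not the production code.

```python
import queue
import threading

# In-memory stand-in for the real queue (the post uses Azure Storage Queue).
idea_queue: queue.Queue = queue.Queue()
scored: list[dict] = []


def intake(idea: dict) -> bool:
    """Validate and enqueue an idea, then return immediately (no scoring here)."""
    text = idea.get("text", "").strip()
    if not text:
        return False  # rejected at validation, never enqueued
    idea_queue.put(idea)
    return True


def score(idea: dict) -> float:
    """Placeholder for the slow LLM scoring call."""
    return float(len(idea["text"]))


def worker() -> None:
    """Drain the queue at the worker's own pace until a None sentinel arrives."""
    while True:
        idea = idea_queue.get()
        if idea is None:
            idea_queue.task_done()
            break
        scored.append({"id": idea["id"], "score": score(idea)})
        idea_queue.task_done()


def run_demo() -> list[dict]:
    """Start one background worker, submit three ideas, and wait for the drain."""
    t = threading.Thread(target=worker)
    t.start()
    intake({"id": 1, "text": "Add dark mode"})
    intake({"id": 2, "text": ""})  # fails validation
    intake({"id": 3, "text": "Export to CSV"})
    idea_queue.put(None)  # sentinel: tell the worker to stop
    t.join()
    return scored
```

The point of the shape is that intake never calls score. A slow or failing scoring call can only delay the worker, not the caller submitting the idea.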

For orchestration, I used Azure Durable Functions. Durable Functions handle the coordination work that's easy to get wrong manually: retry logic, tracking workflow state across multiple steps, and picking up where you left off after a failure. If the LLM scoring step fails partway through, the orchestrator knows where the workflow stopped and can restart from that point rather than from scratch.
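To make "picking up where you left off" concrete, here is a small standard-library sketch of checkpointed steps. It is not the Durable Functions API; run_workflow, the checkpoint dict, and the two example steps are assumptions chosen to illustrate the idea.

```python
from typing import Callable

Step = tuple[str, Callable[[dict], dict]]


def run_workflow(steps: list[Step], state: dict, checkpoint: dict) -> dict:
    """Run steps in order, skipping any step already recorded in `checkpoint`.

    `checkpoint` maps step name -> state after that step. It stands in for
    the durable history a real orchestrator would persist.
    """
    for name, step in steps:
        if name in checkpoint:
            state = checkpoint[name]  # completed on an earlier run: reuse it
            continue
        state = step(state)  # may raise; earlier checkpoints are kept
        checkpoint[name] = state
    return state


# Hypothetical steps: embedding always succeeds, scoring fails on its first call.
calls = {"embed": 0, "score": 0}


def embed(state: dict) -> dict:
    calls["embed"] += 1
    return {**state, "embedding": [0.1, 0.2]}


def flaky_score(state: dict) -> dict:
    calls["score"] += 1
    if calls["score"] == 1:
        raise TimeoutError("LLM call timed out")
    return {**state, "score": 0.9}
```

Running the workflow once raises TimeoutError at the scoring step. Running it again with the same checkpoint dict skips the embedding step and retries only the scoring.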

The queue itself is Azure Storage Queue. I considered Service Bus - it has more features - but Storage Queue was simpler and the extra features weren't needed for this workload. One of the choices I'm glad I kept simple.

For vector search, I used Azure Cognitive Search (now Azure AI Search). It handles the similarity search against the roadmap feature embeddings and returns the top candidates before the LLM ever sees anything. Keeping the vector search and the orchestration inside Azure made the integration straightforward.
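The search service does this at scale, but the underlying idea is a top-k similarity ranking. A plain-Python sketch, assuming cosine similarity and tiny hand-made vectors (cosine, top_candidates, and the feature ids are hypothetical and not part of any Azure API):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_candidates(
    idea_vec: list[float],
    features: dict[str, list[float]],
    k: int = 3,
    min_sim: float = 0.0,
) -> list[tuple[str, float]]:
    """Return up to k (feature_id, similarity) pairs, best match first."""
    ranked = [(fid, cosine(idea_vec, vec)) for fid, vec in features.items()]
    ranked = [pair for pair in ranked if pair[1] >= min_sim]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

Only the shortlist returned here would go on to LLM scoring, which is what keeps the expensive step bounded as the number of roadmap features grows.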

Failure modes and how we handle them

Every external dependency in this system is a potential failure point. I mapped them out early and designed recovery behavior for each one.

The LLM times out or hits rate limits. Azure OpenAI has quotas, and under load you'll hit them. When a scoring call fails, the item goes back into the queue with an increasing wait period before the next retry (exponential backoff). If the retries run out, the item stays in the queue and is picked up again on the next webhook trigger. The system self-heals: when a change happens to an idea or roadmap feature, a webhook fires, the pipeline picks it up, and scoring runs again. Nothing is lost; it just takes longer.
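A sketch of exponential backoff in plain Python. The base delay, cap, retry count, and jitter are assumptions (the post does not state the real values), and backoff_delays and call_with_backoff are hypothetical helper names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delay before each retry: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: bool = True,
) -> T:
    """Call fn(); on failure wait and retry, re-raising once retries run out."""
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:  # a real pipeline would catch only timeout/rate-limit errors
            if attempt == max_retries:
                raise  # caller decides what happens next (e.g. leave item queued)
            delay = delays[attempt]
            if jitter:
                delay = random.uniform(0, delay)  # spread retries to avoid bursts
            sleep(delay)
    raise RuntimeError("unreachable")  # loop always returns or raises
```

With base=1.0 and cap=10.0, backoff_delays(5, 1.0, 10.0) gives [1.0, 2.0, 4.0, 8.0, 10.0]. When the final attempt still fails, the exception propagates, which is the point where the item would be left for the next webhook trigger.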

The roadmap tool API is down. When scoring succeeds but writing the result back to the roadmap fails, we store the result locally and retry the write later. The match isn't lost just because the downstream API was unavailable for a few minutes.
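One common shape for this is an outbox: try the write, park the result if it fails, and flush the parked results later. A minimal in-memory sketch, assuming the write function raises ConnectionError when the API is unavailable. WriteBackOutbox is a hypothetical name, and a production version would keep pending results in durable storage, not in a deque.

```python
from collections import deque
from typing import Callable


class WriteBackOutbox:
    """Holds scoring results whose write-back failed so they can be retried."""

    def __init__(self, write: Callable[[dict], None]) -> None:
        self._write = write
        self._pending: deque = deque()

    def submit(self, result: dict) -> bool:
        """Try to write now; on failure keep the result for a later flush."""
        try:
            self._write(result)
            return True
        except ConnectionError:
            self._pending.append(result)
            return False

    def flush(self) -> int:
        """Retry each pending result once; return how many were delivered."""
        delivered = 0
        for _ in range(len(self._pending)):
            result = self._pending.popleft()
            try:
                self._write(result)
                delivered += 1
            except ConnectionError:
                self._pending.append(result)  # still failing: keep for next flush
        return delivered

    @property
    def pending(self) -> int:
        """Number of results still waiting to be written."""
        return len(self._pending)
```

Calling flush on a timer, or whenever a later write succeeds, is enough to drain the backlog once the downstream API recovers.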

An idea is poor quality. Some submissions come in with vague, malformed, or very short text. I score each submission against a quality rubric, so the pipeline can skip ideas that aren't detailed enough to evaluate meaningfully.
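The post does not describe the rubric itself, so the following is purely an illustrative stand-in: three cheap text checks with made-up thresholds, gating whether an idea goes on to scoring. Every name and number here is an assumption.

```python
import re

MIN_WORDS = 5          # hypothetical: fewer words than this is "very short"
MIN_TEXT_RATIO = 0.7   # hypothetical: share of letters/spaces in the text
PASS_MARK = 2          # hypothetical: points needed (out of 3) to be scored


def quality_score(text: str) -> int:
    """Return 0-3 points: enough words, enough distinct words, mostly readable."""
    if not text:
        return 0
    words = re.findall(r"[A-Za-z0-9']+", text)
    points = 0
    if len(words) >= MIN_WORDS:
        points += 1
    if len({w.lower() for w in words}) >= MIN_WORDS:
        points += 1  # guards against one word repeated many times
    readable = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    if readable >= MIN_TEXT_RATIO:
        points += 1  # guards against malformed or symbol-heavy input
    return points


def should_score(text: str) -> bool:
    """True if the idea is detailed enough to send to LLM scoring."""
    return quality_score(text) >= PASS_MARK
```

A gate like this runs before any LLM call, so low-quality submissions cost almost nothing to reject.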

Vector search returns no candidates. That's a valid outcome, not a failure. The system treats it as a result, not an error.
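One way to keep "no candidates" out of the error path is to give it its own outcome value, so retry logic and alerts never see it as a failure. A small sketch with hypothetical names (Outcome, MatchResult, build_result):

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    MATCHED = "matched"
    NO_CANDIDATES = "no_candidates"  # a valid result, not an error


@dataclass
class MatchResult:
    idea_id: str
    outcome: Outcome
    candidates: list[str] = field(default_factory=list)


def build_result(idea_id: str, candidates: list[str]) -> MatchResult:
    """Wrap search output; an empty list becomes an explicit NO_CANDIDATES result."""
    if not candidates:
        return MatchResult(idea_id, Outcome.NO_CANDIDATES)
    return MatchResult(idea_id, Outcome.MATCHED, list(candidates))
```

Because nothing is raised, an idea with no candidates is recorded and finished, and it does not consume retries.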

The key principle across all of these: fail gracefully and retry smartly. The pipeline should degrade to a slower or partial result, not stop. And when something goes wrong, log it and move on. The self-healing behavior handles the rest - when the item gets updated, the webhook fires again and the pipeline picks it back up.

Observability: knowing when things break

The failure handling above only works if you can see what's happening. I instrumented the pipeline from the start with a few specific things in mind.

Metrics I track:

  • Ideas ingested per day (is the intake working?)
  • LLM scoring latency at p50 and p99 (is scoring getting slower?)
  • Retry rate (is something degrading?)
  • Write-back success rate to the roadmap tool (is the downstream API healthy?)
  • Queue depth for idea processing (is the queue growing or shrinking?)
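For reference, the latency percentiles and the two rate metrics in that list can be derived from raw samples and counters along these lines. This is a standard-library sketch; latency_percentiles and rate are hypothetical helpers, and a monitoring service would normally compute these for you.

```python
import statistics


def latency_percentiles(samples: list[float]) -> dict[str, float]:
    """Return p50 and p99 of the latency samples (needs at least two samples)."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 98 the 99th.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


def rate(part: int, total: int) -> float:
    """Fraction part/total, or 0.0 when nothing has been counted yet."""
    return part / total if total else 0.0
```

Here rate(retries, scoring_attempts) gives the retry rate, and rate(successful_writes, attempted_writes) gives the write-back success rate. Tracking p99 alongside p50 matters because a handful of very slow LLM calls can stall workers while the median still looks healthy.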

Alerts that matter:

  • Retry rate above 10% (something upstream is struggling)
  • Write-back failures spiking in a short window (the roadmap API is having issues)
  • Queue idle for 24 hours with unprocessed items (something is stuck)
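Those three rules are simple enough to express as a pure function over a metrics snapshot. In the sketch below, the 10% and 24-hour thresholds come from the list above; the write-back failure limit, the Snapshot fields, and the alert names are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RETRY_RATE_LIMIT = 0.10                 # from the alert list above
QUEUE_IDLE_LIMIT = timedelta(hours=24)  # from the alert list above
WRITEBACK_FAILURE_LIMIT = 5             # hypothetical: failures in the recent window


@dataclass
class Snapshot:
    scoring_attempts: int
    retries: int
    queue_depth: int
    last_dequeue_at: datetime
    recent_writeback_failures: int


def evaluate_alerts(s: Snapshot, now: datetime) -> list[str]:
    """Return the names of the alerts that should fire for this snapshot."""
    alerts = []
    if s.scoring_attempts and s.retries / s.scoring_attempts > RETRY_RATE_LIMIT:
        alerts.append("retry_rate_high")
    if s.recent_writeback_failures >= WRITEBACK_FAILURE_LIMIT:
        alerts.append("writeback_failures_spiking")
    # Idle alone is fine (nothing to do); idle with items waiting means stuck.
    if s.queue_depth > 0 and now - s.last_dequeue_at >= QUEUE_IDLE_LIMIT:
        alerts.append("queue_stuck")
    return alerts
```

The queue rule checks depth as well as idle time on purpose: an empty queue that has been quiet for a day is normal, while a non-empty one is the invisible failure described earlier.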

What I'd do differently, and what I'd do the same

I haven't had to roll back any major architectural decisions yet. The async queue model has held up, the Durable Functions orchestration has handled failures cleanly, and the self-healing retry behavior has meant that most failures resolve without anyone noticing.

A few things I'd emphasize to anyone building something similar:

Start with observability. I built the logging and metrics in from the beginning, not as an afterthought. That paid off quickly. The first time LLM latency crept up, I could see it in the metrics before any PM noticed. If you wait to add observability until something breaks, you'll be debugging blind.

Dependency isolation is non-negotiable. You can't control Azure OpenAI's uptime or your downstream tool's API reliability. What you can control is how your system behaves when they fail. Design for graceful degradation from the start. A partial result is almost always better than no result.

Simplicity earns its keep. I chose Storage Queue over Service Bus because it was simpler and met the requirements. I chose Durable Functions over a custom orchestration solution for the same reason. The temptation to use the more powerful tool is real. Resist it unless you actually need the features.

Event-driven beats batch. Because the teams span timezones, there's no single window when everyone is offline and the system can catch up. Designing for real-time event response - ideas trigger processing immediately, results are written back as soon as they're ready - meant the system works the same at 2am in one timezone as it does at 10am in another. Batch processing would have created winners and losers by timezone. Events don't care where you are.

The reliability work didn't show up in the demo. It never does. But it's the reason the system is still running.

Comments welcome!