
Why I Stopped Treating LLMs as Magic and Started Treating Them as Components

LLMs are powerful, but treating one as 'the solution' is a trap. Here's the architecture I built around one to make it actually work.

May 16, 2026 | 8 min read | AI, LLM, Python, Automation, Lessons Learned

I recently got the opportunity to build out an AI-powered feature-matching system. Some of my coworkers had already started mocking up the idea: give an LLM a spreadsheet of ideas and features, and let it hand back the list of matches. Feed it the customer's idea, get back a match. Easy. I initially made the same assumption a lot of teams make: I treated the LLM as the solution.

It wasn't easy.

The LLM hallucinated matches, missed real ones, and couldn't tell me the difference between "similar" and "identical." After a few rounds of that, I had to step back and rethink the whole thing. The LLM wasn't wrong, exactly - I was just asking it to do something it isn't designed to do. Once I stopped treating it like a magic oracle and started treating it like a well-scoped component in a pipeline, everything got better.

This post walks through what I built, why each layer mattered, and where I got it wrong the first time.

The actual problem

The business problem sounds simple: when a customer submits a feature idea, we need to know if we're already building it. If we are, link the idea to the roadmap item. If we're not, route it for review. No one wants to ship something a customer asked for six months ago without connecting the dots.

The subtlety is in the matching. "Customer A wants dark mode for the reporting UI" and "Customer B wants dark mode support across the product" might be the same idea, or might not - it depends on scope and intent. You can't match on keywords alone.

And you can't do it manually. When you have dozens of product managers and a growing roadmap, reviewing every new idea against hundreds of existing features by hand isn't realistic. We had to automate.

What I tried first

My first instinct was to just send everything to the LLM. "Here's a customer idea. Here are our roadmap features. Tell me if any of them match." Simple prompt, reasonable-sounding approach.

The results were bad. The LLM would confidently identify matches that weren't there, miss obvious ones, and give me no way to understand why it made the call it did. When I tried to tune the prompt to fix one failure mode, it introduced another. I kept tweaking the prompt and the model kept surprising me.

The problem wasn't the LLM. The problem was the setup. I was asking one component to do five different jobs at once - understanding intent, narrowing candidates, scoring similarity, explaining its reasoning, and making a binary decision. That's too much. On top of that, the volume of data was blowing out the context window - the AI version of "out of sight, out of mind."

My original architecture: five layers, LLM in the middle

Once I broke the problem down into stages, things got a lot cleaner.

Layer 1 - Intake and cleaning. Customer ideas come in through an API. Before anything else happens, the text gets cleaned and normalized. Typos, vague scope, missing context - garbage in, garbage out. This layer is boring but it's a prerequisite for everything downstream.

Layer 2 - Vector search. Each description (customer idea and roadmap item alike) gets converted to an embedding and stored. When a new idea comes in, a similarity search runs against the roadmap and returns the top K candidates. This step is fast and cheap - it narrows the field dramatically before any LLM work happens.

I used off-the-shelf embeddings rather than fine-tuning a domain-specific model. That introduced some ambiguity (in a product context, "performance" might mean speed, reliability, or user experience), but the LLM scoring step downstream handles that ambiguity well enough that fine-tuning wasn't worth the effort.
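To make the narrowing step concrete, here is a minimal sketch of a top-K cosine-similarity search. It is not the production code: the three-dimensional toy vectors, the feature ids, and the function names are all illustrative assumptions, and a real setup would get its vectors from an embedding model and likely query a vector store instead of a dict.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_candidates(idea_vector, roadmap_vectors, k=5):
    """Return the k (feature_id, similarity) pairs closest to the idea, best first."""
    scored = [
        (feature_id, cosine_similarity(idea_vector, vector))
        for feature_id, vector in roadmap_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (hypothetical data).
roadmap = {
    "dark-mode-reports": [0.9, 0.1, 0.0],
    "csv-export": [0.0, 0.2, 0.9],
    "dark-mode-global": [0.8, 0.3, 0.1],
}
candidates = top_k_candidates([1.0, 0.1, 0.0], roadmap, k=2)
```

Only the ids in `candidates` move on to the LLM step, so the unrelated "csv-export" item never costs a token.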

Layer 3 - LLM scoring with a rubric. This is where the LLM actually earns its place. It doesn't get a wide-open question. It gets the customer idea, one candidate roadmap feature, and a specific rubric:

Intent alignment: Do they want the same thing? (1-10)
Scope alignment: Are they asking for the same scope? (1-10)
Confidence: How sure are you? (1-10)
For each score, quote specific language from the descriptions that supports your rating.

That last part - requiring quoted evidence - turned out to be the most important design decision. The LLM has to show its work. That means PMs reading the output can see exactly why the system scored a match the way it did. It also means the system is auditable. When something looks off, you can trace it back to the specific text the LLM was responding to.
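One way to lean on the evidence requirement is to check that every quote the model returns actually appears in one of the two descriptions. The sketch below assumes the model is asked to reply in JSON; that reply shape, the prompt wording, and the function names are my assumptions for illustration, not the exact prompt from the project.

```python
import json

# Hypothetical prompt template; the JSON reply shape is an assumption.
RUBRIC_PROMPT = (
    "Compare the customer idea with the roadmap feature.\n"
    "Customer idea: {idea}\n"
    "Roadmap feature: {feature}\n"
    "Score each criterion from 1 to 10:\n"
    "- intent: do they want the same thing?\n"
    "- scope: are they asking for the same scope?\n"
    "- confidence: how sure are you?\n"
    "For each score, quote specific language from the descriptions.\n"
    'Reply with a JSON object keyed by "intent", "scope" and "confidence"; '
    'each value has a "score" and a list of "quotes".'
)


def build_prompt(idea, feature):
    """Fill the rubric template for one idea and one candidate feature."""
    return RUBRIC_PROMPT.format(idea=idea, feature=feature)


def unsupported_quotes(response_json, idea, feature):
    """Return (criterion, quote) pairs whose quote is not found in either text."""
    source = f"{idea}\n{feature}".lower()
    result = json.loads(response_json)
    missing = []
    for criterion, entry in result.items():
        for quote in entry.get("quotes", []):
            if quote.lower() not in source:
                missing.append((criterion, quote))
    return missing


idea = "Customer wants dark mode for the reporting UI"
feature = "Dark mode support across the product"
# A made-up model reply containing one quote that appears in neither text.
response = json.dumps({
    "intent": {"score": 8, "quotes": ["dark mode"]},
    "scope": {"score": 4, "quotes": ["reporting UI", "across all dashboards"]},
    "confidence": {"score": 6, "quotes": []},
})
problems = unsupported_quotes(response, idea, feature)
```

A non-empty `problems` list is a cheap signal that a score rests on invented text and deserves a second look.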

Layer 4 - Threshold and routing. The LLM score feeds into a threshold check. Above the threshold, the idea is flagged as a match candidate and the PM team gets notified. Below the threshold, it goes into a manual review queue. The threshold itself took several rounds of iteration to get right - more on that below.
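A threshold check like this can be very small. In the sketch below, how the rubric scores are combined (a plain average of intent and scope) and the threshold value of 7.0 are assumptions chosen for illustration; the real values came out of iteration on live data.

```python
from dataclasses import dataclass


@dataclass
class RubricScores:
    intent: int      # 1-10
    scope: int       # 1-10
    confidence: int  # 1-10, kept for auditing, not used in the routing below


def route(scores, threshold=7.0):
    """Route one scored candidate to a queue.

    Assumption: the match score is the mean of intent and scope,
    and a score at or above the threshold counts as a match candidate.
    """
    match_score = (scores.intent + scores.scope) / 2
    if match_score >= threshold:
        return "match_candidate"
    return "manual_review"


strong = route(RubricScores(intent=9, scope=8, confidence=7))  # mean 8.5
weak = route(RubricScores(intent=8, scope=4, confidence=6))    # mean 6.0
```

Keeping the decision in plain code, outside the prompt, means the threshold can be changed without touching the LLM step.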

Layer 5 - Feedback loop. PMs review the match candidates, confirm or reject them, and those corrections feed back into the system. Over time this helps calibrate the rubric and threshold. The system doesn't "learn" in a machine learning sense, but the humans reviewing the output surface patterns that you can use to tighten the prompt or adjust the routing logic.

Where I got it wrong the first time

The architecture I described above was the design. What shipped was smaller. I was building this mostly solo with AI-assisted coding, against a real timeline. Some layers got cut or simplified:

Layer 1 - I didn't have time to build a real cleaning pipeline. Instead, I leaned harder on pulling structured context from existing data fields rather than trying to normalize free text. It compensated, mostly.
Layer 3 - The single scoring rubric turned into two. I added a pre-check rubric that evaluates the quality of the input before the match rubric even runs. Poorly structured ideas score low on quality, which explains a lot of the ambiguous match results that come out the other side.
Layer 4 - Dropped the notification step. The downstream tool the team was already using handles that on its own when the match attribute gets written. Also dropped the manual review queue - everything routes by threshold, and you adjust the threshold if the results skew too conservative or too aggressive.
Layer 5 - The feedback loop didn't become a training mechanism. It became an audit log. PMs can reverse a match decision, and that reversal is tracked. It's not machine learning, but it's accountability.
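For the audit-log version of Layer 5, an append-only log is usually enough: a reversal is a new entry, and nothing is overwritten. This is a rough in-memory sketch; the class, field, and action names are hypothetical, and a real system would persist entries somewhere durable.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    idea_id: str
    feature_id: str
    action: str  # "matched" or "reversed" (hypothetical action names)
    actor: str
    timestamp: datetime


class AuditLog:
    """Append-only record of match decisions and PM reversals."""

    def __init__(self):
        self._entries = []

    def record(self, idea_id, feature_id, action, actor):
        entry = AuditEntry(idea_id, feature_id, action, actor,
                           datetime.now(timezone.utc))
        self._entries.append(entry)
        return entry

    def current_state(self, idea_id, feature_id):
        """Most recent action for this pair, or None if never recorded."""
        for entry in reversed(self._entries):
            if entry.idea_id == idea_id and entry.feature_id == feature_id:
                return entry.action
        return None

    def history(self, idea_id):
        """Every entry for an idea, oldest first."""
        return [e for e in self._entries if e.idea_id == idea_id]


log = AuditLog()
log.record("I-42", "F-1", "matched", actor="system")
log.record("I-42", "F-1", "reversed", actor="pm@example.com")
```

Because the original "matched" entry is still in the history, you can later count how often the system's calls get reversed.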

The threshold took several rounds to get right. My initial instinct was to set the bar high - only flag it as a match if the LLM was very confident. The result was that near-misses a PM would have recognized as matches slipped through. I loosened it, then watched it overcorrect in the other direction. It took a few cycles before it stabilized. There's no shortcut here - you have to run it on real data with real users to understand where the threshold should live. The saving grace was that I had built in an "Audit" mode from the start, so no actions were written back to any records until we had the thresholds dialed in.
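If audit-mode results get reviewed by PMs, each threshold cycle can be replayed offline before anything changes. The sketch below counts wrong flags and missed matches at a few candidate thresholds. The scores and PM verdicts are made-up sample data, and the function name is hypothetical.

```python
def sweep_thresholds(reviewed, thresholds):
    """Count errors at each threshold.

    reviewed: list of (match_score, pm_says_match) pairs.
    Returns {threshold: (false_positives, false_negatives)}.
    """
    results = {}
    for threshold in thresholds:
        false_positives = sum(
            1 for score, is_match in reviewed
            if score >= threshold and not is_match
        )
        false_negatives = sum(
            1 for score, is_match in reviewed
            if score < threshold and is_match
        )
        results[threshold] = (false_positives, false_negatives)
    return results


# Hypothetical audit-mode output paired with PM verdicts.
reviewed = [
    (9.0, True), (8.0, True), (7.5, True), (7.0, False),
    (6.5, True), (6.0, False), (4.0, False),
]
results = sweep_thresholds(reviewed, [6.0, 7.0, 8.0])
```

On this toy data the strict threshold of 8.0 produces no wrong flags but misses two real matches, while 6.0 catches everything at the cost of two wrong flags, which is the same trade-off described above.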

The LLM was too verbose. Early versions of the prompt produced long explanations for each score. Useful, but expensive - token costs added up faster than expected. Adding a word limit to each explanation (something like "keep each explanation under 50 words") dropped costs meaningfully without losing the auditability I cared about. If you're not thinking about token efficiency early, you'll feel it later.
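A prompt instruction is a request, not a guarantee, so it can help to measure whether the model respects the limit. A small sketch, assuming explanations arrive as a dict keyed by criterion (the names and the whitespace-based word count are my simplifications):

```python
MAX_WORDS = 50  # the limit quoted in the prompt instruction


def word_limit_instruction(limit=MAX_WORDS):
    """Sentence appended to the scoring prompt."""
    return f"Keep each explanation under {limit} words."


def explanations_over_limit(explanations, limit=MAX_WORDS):
    """Return {criterion: word_count} for explanations that are not under the limit."""
    counts = {name: len(text.split()) for name, text in explanations.items()}
    return {name: count for name, count in counts.items() if count >= limit}


explanations = {
    "intent": "Both descriptions ask for dark mode.",
    "scope": " ".join(["word"] * 60),  # stand-in for a rambling explanation
}
too_long = explanations_over_limit(explanations)
```

Logging `too_long` over time shows whether the limit is holding or whether the prompt needs tightening.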

I underestimated how much rubric structure would matter. The first version used vague terms like "intent alignment" without defining them. The LLM interpreted those differently depending on context. Once I added explicit definitions and required quoted evidence, the scores got much more consistent. After that initial tightening, the rubric barely needed updating - once it was well-defined, it stayed well-defined.

What shipped

The system now handles the full (modified) flow: an idea comes in and gets embedded, match candidates are isolated by vector similarity, each candidate is scored by the LLM, and the result is written back to the idea record. The PM team gets notified through the tool they were already using, with no new interface to check.
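Sketched as code, that flow is a short function that wires the stages together. Everything here is a stand-in: the stage functions are passed in as parameters, the stubs below replace the real embedding model, vector search, LLM call, and record update, and the threshold value is an assumption. The `audit_mode` flag mirrors the idea of running the whole flow without writing anything back.

```python
def process_idea(idea_id, idea_text, embed, search, score, write_back,
                 threshold=7.0, audit_mode=True):
    """Run one idea through embed -> search -> score -> threshold -> write-back."""
    candidates = search(embed(idea_text))  # list of (feature_id, feature_text)
    scored = [(fid, score(idea_text, text)) for fid, text in candidates]
    best_id, best_score = max(scored, key=lambda pair: pair[1],
                              default=(None, 0.0))
    is_match = best_id is not None and best_score >= threshold
    result = {
        "idea_id": idea_id,
        "feature_id": best_id if is_match else None,
        "score": best_score,
        "written": False,
    }
    if is_match and not audit_mode:
        write_back(idea_id, best_id)  # the only step with side effects
        result["written"] = True
    return result


# Stubs standing in for the real services (all hypothetical).
writes = []


def stub_embed(text):
    return [float(len(text))]


def stub_search(vector):
    return [("F-1", "Dark mode across the product"), ("F-2", "CSV export")]


def stub_score(idea, feature):
    return 8.5 if "dark mode" in feature.lower() else 2.0


def stub_write_back(idea_id, feature_id):
    writes.append((idea_id, feature_id))


dry_run = process_idea("I-42", "Dark mode for reports",
                       stub_embed, stub_search, stub_score, stub_write_back)
live = process_idea("I-42", "Dark mode for reports",
                    stub_embed, stub_search, stub_score, stub_write_back,
                    audit_mode=False)
```

Passing the stages in as functions keeps the LLM call swappable and makes the flow testable without any network access.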

Time savings are hard to measure precisely. What I can say is that manually scanning for duplicates across a growing backlog was a real cost before - and now it isn't.

What I'd tell someone starting this today

The real intelligence in this system isn't in the LLM. It's in the architecture around it - the vector search that narrows the candidates, the rubric that focuses the task, the evidence requirement that makes the output auditable, the threshold logic that decides what goes to humans, and the audit mode that lets you test the system on real data BEFORE it starts taking action against real systems.

The LLM's job is narrow and well-defined. That's why it works.

If you're building something with LLMs and it keeps surprising you in bad ways, the answer is almost never a better model. It's usually a better problem definition. Break the task into layers, give the LLM the one piece it's actually good at, and build the rest of the system to support it.

Comments welcome!