LABS
Anatomy · June 19, 2026

Anatomy of a Production AI Feature

Not every AI integration is a chatbot. We break down the six layers every production AI feature actually needs, and where most implementations go wrong.

When a founder says "we want to add AI," the conversation usually starts in the wrong place: which model to use. GPT-4o or Claude? Gemini? Something open-source?

The model choice is, in practice, one of the least consequential decisions in an AI integration. What determines whether a production AI feature is reliable, maintainable, and economically viable is everything around the model call: the six layers that wrap it.

Before reading this, it is worth checking whether your codebase is ready to receive an AI integration at all; we keep a separate checklist for that.

For the integration work itself, this is what we cover under AI integration services. Here is how we think about it.


Layer 1: Input validation and guardrails

What it is: The set of checks that happen before any user input reaches the model. This includes format validation, length limits, content filtering, and rate limiting per user or session.

Common mistake: Sending raw user input directly to the model. This creates prompt injection vulnerabilities (users can craft inputs that override your system prompt), excessive token consumption (no length limits), and wildly inconsistent behaviour (no input normalisation).

What good looks like: A validation layer that checks input length, strips or escapes characters that could affect prompt structure, applies content policy rules appropriate to your use case, and returns a structured error before spending a token if the input fails validation. Rate limiting at this layer (not just at the API level) is necessary to prevent both abuse and runaway costs.
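As a rough illustration of the validation layer described above, here is a minimal Python sketch. Every name and limit in it (validate_input, RateLimiter, the 2,000-character cap, five requests per minute) is an assumption chosen for the example, not a recommendation, and the in-memory limiter only works within a single process. Content policy rules and prompt-structure escaping depend on your use case and are left out.

```python
import re
import time
from collections import defaultdict, deque
from dataclasses import dataclass

MAX_INPUT_CHARS = 2000        # assumed limit; tune per feature
MAX_REQUESTS_PER_WINDOW = 5   # assumed per-user quota
WINDOW_SECONDS = 60.0

# Control characters other than tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


@dataclass
class ValidationResult:
    ok: bool
    text: str = ""
    error: str = ""


class RateLimiter:
    """Sliding-window limiter kept in memory (single process only)."""

    def __init__(self, limit=MAX_REQUESTS_PER_WINDOW, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def validate_input(raw, user_id, limiter, now=None):
    """Return a structured result before any token is spent."""
    if not limiter.allow(user_id, now):
        return ValidationResult(False, error="rate_limited")
    # Normalise: strip control characters, collapse runs of spaces and tabs.
    text = _CONTROL_CHARS.sub("", raw).strip()
    text = re.sub(r"[ \t]+", " ", text)
    if not text:
        return ValidationResult(False, error="empty_input")
    if len(text) > MAX_INPUT_CHARS:
        return ValidationResult(False, error="too_long")
    return ValidationResult(True, text=text)
```

The caller forwards result.text to the next layer only when result.ok is true; otherwise it returns result.error to the client without making a model call.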


Layer 2: Prompt engineering and context assembly

What it is: The process of constructing what actually gets sent to the model. This includes the system prompt, any context retrieved from your data (retrieval-augmented generation, or RAG), the conversation history (if applicable), and the formatted user input.

Common mistake: Treating the system prompt as a static string written once and never revisited. Production prompts degrade: the model's behaviour shifts with updates, your product requirements change, and edge cases accumulate that the original prompt did not anticipate. Teams that treat prompts like static config ship brittle AI features.

What good looks like: Prompts versioned in source control. A retrieval strategy that is thoughtful about what context actually helps the model: not "add everything" but "add the right things." A context window budget that leaves room for the model's response. And a discipline of prompt iteration tied to evaluation (Layer 5), not intuition.
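One way to picture the context window budget is the sketch below. It assumes a retriever that has already scored chunks for relevance; the Chunk type, the assemble_context function, the 8,000-token window, and the four-characters-per-token estimate are all illustrative assumptions. A real implementation would count tokens with the provider's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a retriever (hypothetical)


def estimate_tokens(text):
    # Crude heuristic (about 4 characters per token), for illustration only.
    return max(1, len(text) // 4)


def assemble_context(system_prompt, user_input, chunks,
                     context_window=8000, reserved_for_response=1000):
    """Pick the highest-scoring chunks that fit the remaining budget."""
    budget = context_window - reserved_for_response
    budget -= estimate_tokens(system_prompt) + estimate_tokens(user_input)
    if budget < 0:
        raise ValueError("system prompt and user input exceed the budget")

    selected = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if cost <= budget:  # skip chunks that do not fit, keep looking
            selected.append(chunk)
            budget -= cost
    return selected
```

The point of the design is that the response reservation is subtracted first, so retrieved context can never crowd out the model's answer.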


Layer 3: The model call (and why model choice is usually the least important decision)

What it is: The actual API call to the language model, including model selection, temperature, max tokens, and any sampling parameters.

Common mistake: Optimising model selection first. Founders often spend significant time comparing benchmarks before they have a working integration, output format, or evaluation harness. The model that performs best in isolation is not necessarily the model that performs best in your specific prompt context, at your specific latency budget, with your specific usage volume.

What good looks like: Start with a capable default (a current mid-tier model from a major provider is usually sufficient), get the integration working end-to-end, then benchmark against your actual use case with your actual prompts. Model choice should be driven by your evaluation results, not vendor marketing. Parameters like temperature should be set conservatively and changed only when there is an observed behavioural reason to adjust.
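To make "benchmark against your actual use case" concrete, the following sketch compares model configurations on your own prompts. It assumes the provider call is wrapped in a function call_model(prompt, config) that you supply; ModelConfig, compare_models, and the default values are hypothetical names and numbers for this example, not any vendor's API.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    model: str                # provider model identifier (placeholder)
    temperature: float = 0.0  # conservative default, an assumption
    max_tokens: int = 512


def compare_models(call_model, configs, cases):
    """Run every (prompt, check) case against every config.

    Returns pass rate and mean wall-clock time per case for each model.
    The timing includes the check function, so keep checks cheap.
    """
    if not cases:
        raise ValueError("need at least one evaluation case")
    results = {}
    for cfg in configs:
        passed = 0
        started = time.perf_counter()
        for prompt, check in cases:
            if check(call_model(prompt, cfg)):
                passed += 1
        elapsed = time.perf_counter() - started
        results[cfg.model] = {
            "pass_rate": passed / len(cases),
            "mean_latency_s": elapsed / len(cases),
        }
    return results
```

Because the model is just a field on the config, swapping providers or tiers becomes a one-line change once the evaluation cases exist.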


Layer 4: Output parsing and fallback handling

What it is: The work that happens after the model returns a response. This covers structured output extraction (for example, when you asked for JSON and the model returned prose), validation of the response against the expected format, and the fallback path when the output is unusable.

Common mistake: Assuming the model always returns what you asked for in the format you asked for it. It does not. Even well-instructed models occasionally return malformed JSON, truncated output, or responses that fail your business logic. Teams that do not handle this end up surfacing model errors as application errors: confusing crashes, empty states, or silent data corruption.

What good looks like: Every structured output is validated. If the model returns prose when you expected JSON, there is a parsing retry or a fallback response path, not an unhandled exception. The fallback does not have to be clever. It can be "sorry, something went wrong, here is a default response." But it has to exist and it has to be tested.
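A minimal sketch of the validate, retry, fall back pattern follows. The expected shape (a "summary" string and a "confidence" number between 0 and 1), the function names, and the two-attempt limit are assumptions made for the example; call_model stands in for whatever wrapper you have around the provider call. Network and provider errors are out of scope here and would need their own handling.

```python
import json

# Default response used when every attempt fails (contents are illustrative).
FALLBACK = {"summary": "Sorry, something went wrong.",
            "confidence": 0.0, "fallback": True}


def parse_response(raw):
    """Return a validated dict, or raise ValueError if the output is unusable."""
    # Tolerate prose around a single JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc

    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("missing or empty summary")
    conf = data.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("confidence is not a number")
    if not 0 <= conf <= 1:
        raise ValueError("confidence out of range")
    return {"summary": summary, "confidence": float(conf), "fallback": False}


def get_structured(call_model, prompt, max_attempts=2):
    """Retry parsing a bounded number of times, then fall back."""
    for _ in range(max_attempts):
        try:
            return parse_response(call_model(prompt))
        except ValueError:
            continue  # in practice, log the failure before retrying
    return dict(FALLBACK)
```

The "fallback" flag lets downstream code and monitoring tell a real answer from the default one, which makes the fallback path itself testable.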


Layer 5: Evaluation and monitoring

What it is: The ongoing work of measuring whether the AI feature is doing what you want it to do. This includes automated evaluation (does the output pass a set of test cases?), production monitoring (are users accepting or rejecting outputs?), and drift detection (is the feature's behaviour changing over time?).

Common mistake: Shipping an AI feature with no evaluation harness. Teams that do this cannot answer the question "is this working?" They rely on anecdote: a teammate who tried it and thought it was good, a founder who liked the demo. When model providers update models, the feature's behaviour can change silently, and without evaluation, no one notices until users complain.

What good looks like: A set of evaluation test cases that cover the happy path, known edge cases, and cases where the feature should decline to respond. These run in CI. Production monitoring captures the signal users are sending: explicit (thumbs up/down) or implicit (did they use the output or discard it?). Regressions in model behaviour are caught before they reach all users.
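The evaluation set described above might look something like this sketch, which could run under any CI test runner. EvalCase, run_evals, gate, and the three category names are hypothetical; feature stands for the function that wraps your whole AI feature, and each check is a plain predicate on its output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    category: str  # "happy_path", "edge_case", or "should_decline"
    prompt: str
    check: Callable[[str], bool]


def run_evals(feature, cases):
    """Return {category: (passed, total)} for the given feature function."""
    report = {}
    for case in cases:
        passed, total = report.get(case.category, (0, 0))
        try:
            ok = bool(case.check(feature(case.prompt)))
        except Exception:
            ok = False  # a crash counts as a failed case
        report[case.category] = (passed + int(ok), total + 1)
    return report


def gate(report, min_pass_rate=1.0):
    """True when every category meets the threshold; use it to fail CI."""
    return all(total > 0 and passed / total >= min_pass_rate
               for passed, total in report.values())
```

Reporting per category rather than as a single score keeps a regression in the "should decline" cases from hiding behind a strong happy path.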


Layer 6: Latency and cost management

What it is: The engineering work that keeps AI features economically viable and fast enough that users stay. This includes caching, streaming, async patterns, and usage monitoring.

Common mistake: Treating latency and cost as post-launch problems. A feature that returns in twelve seconds is not a usable feature for most product contexts. A feature that costs a dollar per use at any meaningful volume is not a viable feature. These constraints need to be part of the design, not the retrospective.

What good looks like: Caching for deterministic or near-deterministic inputs (if the same question will be asked repeatedly, the answer should be cached). Streaming for long outputs (users tolerate waiting for text to appear; they do not tolerate staring at a spinner for ten seconds). Async for non-blocking use cases (if the AI output is not needed immediately, process it in the background and notify). Cost monitoring per user, per feature, and in aggregate, so that a sudden cost spike is an alert, not a billing surprise.
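The caching and per-user cost tracking can be sketched together as below. This assumes call_model(prompt) returns the answer plus the number of tokens used, and that prompts can be treated as case-insensitive for cache purposes; CachedModel, the price parameter, and the budget check are illustrative names, not a real library. The cache is unbounded and in-memory, so a production version would add expiry and shared storage.

```python
import hashlib
from collections import defaultdict


class CachedModel:
    """Wraps a model call with a prompt cache and per-user cost tracking."""

    def __init__(self, call_model, price_per_1k_tokens):
        self._call_model = call_model      # returns (answer, tokens_used)
        self._price = price_per_1k_tokens  # assumed flat price, for illustration
        self._cache = {}
        self.cache_hits = 0
        self.cost_by_user = defaultdict(float)

    @staticmethod
    def _key(prompt):
        # Normalise case and whitespace so near-identical prompts share a key.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def ask(self, user_id, prompt):
        key = self._key(prompt)
        if key in self._cache:
            self.cache_hits += 1  # served from cache: no tokens spent
            return self._cache[key]
        answer, tokens_used = self._call_model(prompt)
        self.cost_by_user[user_id] += tokens_used / 1000 * self._price
        self._cache[key] = answer
        return answer

    def users_over_budget(self, budget):
        """User ids whose accumulated cost exceeds the budget (alert input)."""
        return sorted(u for u, cost in self.cost_by_user.items() if cost > budget)
```

Feeding users_over_budget into an alerting job on a schedule is one way to turn a cost spike into an alert rather than a billing surprise.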


Where implementations go wrong

Most AI integrations that fail in production do so at layers 1, 4, and 5. Input validation is skipped because it feels like overhead. Output parsing is treated as optional because the model "usually returns the right thing." And evaluation is deferred indefinitely because there is always a more urgent feature.

The model itself is almost never the problem.

If you are planning an AI integration and want to understand what this looks like applied to your specific use case, that is the core of what we do under AI integration. The six layers are the same across features; the right implementation for each layer depends on your product.


This reflects our engineering approach at Basetool Labs as of mid-2026. The tooling evolves; the layer model is stable.