Checklist · June 24, 2026 · 5 min read

Auditing AI-Generated Code: What Actually Breaks in Production

When an app is built mostly by AI, a specific set of failures recurs. The checks we run on AI-generated code, and why each one bites once real users arrive.

Contents

The checklist
1. Authorization that was generated but never wired
2. Phantom and mismatched dependencies
3. Mock data shipped as if it were real
4. Secrets inlined where they can leak
5. Happy-path-only error handling
6. Prompt-injection and unsafe LLM-call surfaces
7. A data model shaped for the demo, not the product
8. The same problem solved five different ways
9. No cost or rate-limit discipline on model calls
10. Tests that assert the output, not the requirement
What to do with the results

This is the companion to "Is your codebase AI-ready?" That checklist asks whether your code is ready to hand to an AI. This one asks the opposite: is the code the AI already wrote ready for users? The two share a foundation (tests, secrets, dependencies, and observability all still matter), so for general engineering hygiene, read that one first. Here we list the failures that are specific to AI-generated code, the ones that show up precisely because a model wrote it by feel and a human accepted it by feel.

These patterns are a representative composite. We do not name clients or invent metrics. But every item below is something we have repeatedly found when auditing apps built mostly by prompting, and each maps to a real production failure.

The checklist

1. Authorization that was generated but never wired

The model builds a login and hides the admin button, and it looks like access control. It is not. The check that matters (can this user touch this record, enforced on the server) is routinely missing, because the demo only ever ran as one user. Look for endpoints that trust the client, row-level security that was scaffolded but left permissive, and "is admin" flags read from a value the browser can set. This is the failure that becomes an incident report.
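As a minimal sketch of what "enforced on the server" means, the check below runs next to the data access rather than in the UI. The `User`, `Invoice`, and `authorize_invoice_access` names are hypothetical, and we assume the user object is loaded from a server-side session, never from anything the browser sends.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when the authenticated user may not touch the record."""


@dataclass(frozen=True)
class User:
    id: int
    is_admin: bool  # assumed to come from the server-side user store, not the request


@dataclass(frozen=True)
class Invoice:
    id: int
    owner_id: int


def authorize_invoice_access(user: User, invoice: Invoice) -> Invoice:
    """Return the invoice only if this user owns it or is a server-verified admin."""
    if invoice.owner_id != user.id and not user.is_admin:
        raise Forbidden(f"user {user.id} may not access invoice {invoice.id}")
    return invoice
```

Every handler that reads or writes a record would go through a check like this before returning anything. Hiding the button in the client then becomes a convenience, not the control.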

2. Phantom and mismatched dependencies

Generators confidently import packages that do not exist, pin versions that were never released, or pull a name one character off a real one (a typosquat). Check that every dependency resolves, is the package you think it is, and is actually used. A hallucinated import that someone later "fixed" by installing whatever matched the name is a supply-chain hole, not a typo.

3. Mock data shipped as if it were real

To make the demo move, the model seeds placeholder data and stubs responses. Those stubs have a habit of surviving into production: a hardcoded list where a query should be, a "TODO: replace" that shipped, a function that returns the same sample object regardless of input. Trace each data path to a real source. The ones that dead-end in a fixture are features that only ever worked in the demo.
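A crude first pass at tracing data paths is to grep for the markers that stubs tend to leave behind. This is a sketch under loose assumptions: the marker list is our own guess at common naming, and `find_stub_markers` is a hypothetical helper, so expect both false positives and misses.

```python
import re

# Illustrative markers only; tune the list to the codebase being audited.
SUSPECT = re.compile(r"TODO: replace|lorem ipsum|\bmock|\bfixture|\bsample", re.IGNORECASE)


def find_stub_markers(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for every line that looks like a stub."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if SUSPECT.search(line)
    ]
```

A hit is only a starting point. The real check is still following each flagged function to see whether it ever reaches a database or an API.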

4. Secrets inlined where they can leak

A model optimising for "make it run" will paste an API key into client-side code, commit a real .env, or hardcode a token in a config file. Beyond the generic secrets check, look specifically in the client bundle and the git history: AI-generated commits often include the working key that made the local run succeed.
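A small pattern scan illustrates the idea. The patterns here are examples, not a complete rule set: the `AKIA` prefix for AWS access key IDs and the PEM private-key header are well-known formats, while the generic "credential assignment" pattern and the `scan_for_secrets` name are our own assumptions.

```python
import re

# Example patterns only; a dedicated secret scanner covers far more formats.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "hardcoded_credential": re.compile(
        r"\b(?:api[_-]?key|secret|token|password)\b\s*[:=]\s*['\"][^'\"]{12,}['\"]",
        re.IGNORECASE,
    ),
}


def scan_for_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern label) for each suspicious line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings
```

The same function can be pointed at the built client bundle and at the output of `git log -p`, since a key removed in a later commit is still in the history. Any key found this way should be treated as leaked and rotated, not just deleted.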

5. Happy-path-only error handling

The generated code handles the case it was shown and nothing else. The external call that times out, the payment that half-completes, the upload that fails midway: those branches were never written, so they fail silently or crash. Search for external calls with no failure path, promises with no rejection handling, and try/catch blocks that swallow the error and continue. This is the single most common source of "it works for me, breaks for them."
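To make "a failure path" concrete, here is a sketch of a payment call where every outcome becomes an explicit result. `charge_customer` and `ChargeResult` are hypothetical names, and the provider call is passed in as a plain function so the example stays independent of any real payment SDK.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ChargeResult:
    ok: bool
    detail: str


def charge_customer(call_provider: Callable[[], str]) -> ChargeResult:
    """Run a payment call and turn each failure mode into an explicit result."""
    try:
        reference = call_provider()
    except TimeoutError:
        # Outcome unknown: the charge may or may not have gone through.
        return ChargeResult(ok=False, detail="timeout: reconcile before retrying")
    except ConnectionError as exc:
        return ChargeResult(ok=False, detail=f"provider unreachable: {exc}")
    return ChargeResult(ok=True, detail=reference)
```

The design choice is that only the failures we know how to handle are caught, and each one says what the caller should do next. Anything unexpected still raises, which is the opposite of a try/catch that swallows the error and continues.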

6. Prompt-injection and unsafe LLM-call surfaces

If the app itself calls a model, the generator rarely defends it. User input flows straight into a prompt, tool calls run without validation, and model output is trusted as if it were code you wrote. Check that untrusted input cannot rewrite the instruction, that any action the model can trigger is authorised independently, and that output is treated as data, not as a command.
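Treating output as data can look like the sketch below: the model's reply is parsed as untrusted JSON and accepted only if it names an allow-listed tool with exactly the expected arguments. The JSON shape, the `lookup_order` tool, and `parse_tool_call` are all assumptions made for illustration, not any provider's tool-calling format.

```python
import json

# Hypothetical allow-list: tool name -> the exact argument names it accepts.
ALLOWED_TOOLS: dict[str, set[str]] = {"lookup_order": {"order_id"}}


def parse_tool_call(model_output: str) -> tuple[str, dict]:
    """Treat model output as untrusted data; accept only allow-listed tool calls."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(call, dict):
        raise ValueError("model output is not a JSON object")
    tool, args = call.get("tool"), call.get("args")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ValueError(f"tool not allowed: {tool!r}")
    if not isinstance(args, dict) or set(args) != ALLOWED_TOOLS[tool]:
        raise ValueError(f"unexpected arguments for {tool}")
    return tool, args
```

Validation like this only narrows what the model can ask for. Whether the current user may actually look up that order is a separate server-side authorisation check, made with the user's identity and not the model's say-so.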

7. A data model shaped for the demo, not the product

The schema fits the one flow the founder showed. It lacks constraints, indexes, and the relationships the product actually grew into, and it often stores as text what should be typed. This is the item most likely to force a partial rebuild, because everything sits on top of it. Judge the schema against what the product is now, not the prototype it started as.
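For contrast with a demo-shaped schema, here is a small sketch of the constraints, relationships, and indexes we look for. The `customers` and `orders` tables are invented for the example, and SQLite is used only so the snippet runs anywhere; the same ideas apply to whatever database the product actually uses.

```python
import sqlite3

# Hypothetical schema: constraints, a foreign key, and an index on the lookup column.
SCHEMA = """
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    id           INTEGER PRIMARY KEY,
    customer_id  INTEGER NOT NULL REFERENCES customers(id),
    amount_cents INTEGER NOT NULL CHECK (amount_cents >= 0),
    status       TEXT NOT NULL CHECK (status IN ('pending', 'paid', 'refunded'))
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
"""


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with foreign-key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn
```

With this in place, an order for a customer that does not exist, a negative amount, or an unknown status is rejected by the database with `sqlite3.IntegrityError`, whichever code path tried to write it.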

8. The same problem solved five different ways

Without a sense of the whole, a generator re-solves each task locally. You find three date formatters, two ways to call the API, and the same validation copy-pasted with small drift. It is not just untidy: when a rule changes, you now have to find every copy, and you will miss one. Map the duplication before you estimate any change, because it sets the true cost of every future edit.
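Mapping duplication can begin with something as blunt as grouping functions whose parameters and bodies are structurally identical. The sketch below uses Python's `ast` module for that; `duplicate_function_bodies` is a hypothetical helper, and it only finds exact structural copies, so copies that have already drifted still need a fuzzier tool or a manual read.

```python
import ast
from collections import defaultdict


def duplicate_function_bodies(source: str) -> list[list[str]]:
    """Group function names whose parameters and bodies are structurally identical."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.dump omits line numbers by default, so position does not matter.
            body_shape = "".join(ast.dump(statement) for statement in node.body)
            groups[(ast.dump(node.args), body_shape)].append(node.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Each group it returns is one rule implemented more than once, which is a useful number to have before estimating a change to that rule.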

9. No cost or rate-limit discipline on model calls

AI features generated without an eye on the bill have no token caps, no caching of repeated work, and retry logic that storms the provider on failure. Look for unbounded loops around model calls, missing timeouts, and the absence of any cache. One viral day or one bad retry path arrives as an invoice.
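Two of the missing pieces, bounded retries and a cache, fit in a short sketch. `call_with_backoff` and `CachedModel` are hypothetical names, the model call is passed in as a plain function so no provider API is assumed, and the in-memory dictionary stands in for whatever cache the app would really use.

```python
import time
from typing import Callable


def call_with_backoff(
    call: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry a call a bounded number of times, doubling the wait after each timeout."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # give up instead of storming the provider
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


class CachedModel:
    """Wrap a model call so an identical prompt is only paid for once."""

    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model
        self._cache: dict[str, str] = {}

    def complete(self, prompt: str) -> str:
        if prompt not in self._cache:
            self._cache[prompt] = call_with_backoff(lambda: self._call_model(prompt))
        return self._cache[prompt]
```

The important property is that both the number of attempts and the total wait have a ceiling. A per-request token cap and a request timeout belong alongside this, set through whatever options the provider's client actually exposes.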

10. Tests that assert the output, not the requirement

If tests exist at all, they often assert exactly what the model produced ("the function returns this string") rather than what the feature is supposed to do. They lock in the current behaviour, bugs included, and give a false sense of safety. Read the assertions: a test that would still pass if the requirement were wrong is not protecting you.
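The difference is easier to see side by side. In this sketch `slugify` is a hypothetical function under test, and the requirement we assume for it is "slugs contain only lowercase letters, digits, and single hyphens between words".

```python
import re


def slugify(title: str) -> str:
    """Hypothetical function under test: turn a title into a URL slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def test_snapshot_only() -> None:
    # Pins one observed output; it says nothing about what the rule is.
    assert slugify("Hello World") == "hello-world"


def test_requirement() -> None:
    # States the rule and checks it across inputs the demo never used.
    for title in ["Hello World", "  Spaces  ", "Q3 report (final)!", "a--b"]:
        slug = slugify(title)
        assert re.fullmatch(r"[a-z0-9]+(?:-[a-z0-9]+)*", slug), (title, slug)
```

The first test only records one output the function happened to produce, so it protects that single input and nothing else. The second fails for any listed input that breaks the stated rule, including ones nobody tried by hand, which is what makes it worth keeping.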

What to do with the results

From audit to decision

01. Audit against this list. Find the AI-specific failures.
02. Rank by severity and blast radius. Auth and data model first.
03. Harden or rebuild. Item 7 (the data model) is the rebuild trigger.
04. Re-check after each fix. Confirm the failure is actually gone.

Most of these are fixable without a rewrite. Found early and worked in priority order, they move a vibe-coded app from fragile to dependable while it keeps running. The exception is item 7: when the data model is wrong for what the product became, a targeted rebuild of the data layer is usually cheaper than patching around it forever. That harden-or-rebuild call is the subject of "Your vibe-coded MVP just got users", and it should be made from this list, not from a stressful week.

If you are taking on a codebase you did not write, or inheriting legacy code that happens to be a few months old, the same audit applies. We run a technical audit as a standalone engagement for exactly this: an honest read of what is sound, what is risky, and what it costs to fix, before anyone commits to building on top.

This checklist reflects our current practice at Basetool Labs. It will change as the tools that generate this code mature.
