Why your inline prompts drift faster than your code

Skelf-Research · May 12, 2026

engineering · prompt-ops

A prompt is a string. Strings inside source files have all the formal properties of source code: they are versioned in git, they show up in diffs, they get reviewed alongside everything else in a PR. From the outside, that looks like a perfectly serviceable workflow. The string is there. The review process touched it. What more could you want?

What you want is for the property “the string in production today is the string the team last agreed on” to actually hold. In practice, on any team that ships LLM features under deadline pressure, that property breaks down within weeks. This post is about why.

The four drift vectors

There are four distinct ways an inline prompt rots, and they happen in roughly this order.

1. Co-located edits inside unrelated PRs

The first time you watch this happen is the moment the workflow stops being credible. Someone is debugging a flaky JSON parse error in the chat handler. They notice the prompt is asking for “JSON” but the model occasionally returns it wrapped in markdown. They add three words: respond with raw JSON. The PR title is “fix: handle empty response in chat handler”. The reviewer reads the handler logic, approves, merges.

A week later, someone else looks at the prompt and notices it now says respond with raw JSON. They wonder when that changed. They check the blame. The commit message is about an empty-response bug.

This is the cleanest example because nothing was done maliciously. The reviewer did their job; the editor did their job; both of them knew they were touching the prompt. There was just no place in the PR description where prompt changes got separate attention. They are file changes like any other, and the cost of a sloppy file change to chat.py is small, except for the one line in chat.py where it isn’t.

2. Copy-paste forks

The second pattern shows up when the prompt that handles summarize for the dashboard gets copied into a new endpoint that does summarize for an email digest. The two strings start identical. Then someone tunes one of them, say for an outdoor-newsletter audience that wants more sentences. The email-digest variant gets longer. A month later, the dashboard variant gets a different tweak: shorter, more bulleted. The two prompts have now diverged, and there is no record anywhere that they were ever supposed to be the same prompt.

The deeper problem is that the variable names disagree before the team notices. The dashboard prompt uses text. The email one uses body. A junior on the team writes a third call-site, picks one of the two as a template, and now you have three.
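A fork like this is easier to catch if placeholder names are ignored when comparing prompts. The following is a minimal sketch, not how any particular tool does it: the call-site names, the prompt bodies, and the 0.9 threshold are all hypothetical, and it assumes placeholders are written in {name} style.

```python
import difflib
import re
from itertools import combinations

# Hypothetical call-site names and prompt bodies, for illustration only.
PROMPTS = {
    "dashboard.summarize": "Summarize the following text in three bullets:\n{text}",
    "email_digest.summarize": "Summarize the following text in three bullets:\n{body}",
    "chat.system": "You are a support assistant. Respond with raw JSON.",
}

# Assumes placeholders look like {name}; other templating styles need another pattern.
PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def normalize(prompt: str) -> str:
    """Replace every {placeholder} with a generic token, collapse whitespace, lowercase."""
    without_names = PLACEHOLDER.sub("{VAR}", prompt)
    return " ".join(without_names.split()).lower()


def near_duplicates(prompts: dict[str, str], threshold: float = 0.9):
    """Return (name_a, name_b, ratio) for each pair at or above the threshold."""
    normalized = {name: normalize(body) for name, body in prompts.items()}
    pairs = []
    for a, b in combinations(sorted(normalized), 2):
        ratio = difflib.SequenceMatcher(None, normalized[a], normalized[b]).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs


if __name__ == "__main__":
    for a, b, ratio in near_duplicates(PROMPTS):
        print(f"{a} and {b} look like the same prompt (similarity {ratio})")
```

With the sample data, the dashboard and email-digest prompts are flagged as one prompt even though one says text and the other says body. The threshold is a judgment call; set it too low and unrelated prompts start to pair up.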

3. Silent SDK upgrades

The third vector is the most underappreciated. The prompt does not change. The SDK does. A minor version bump in openai==1.x.y adjusts default temperature handling, or changes how system messages are merged with user messages when both are present, or tweaks the way tool-call schemas get serialized. The string is exactly the string it was yesterday. The behaviour is not.

If your prompt is an inline string, there is no way for your audit trail to even notice this happened. The diff in the SDK upgrade PR is in pyproject.toml. The downstream effect on prompt behaviour is in chat.py, which did not change.
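One way to make an upgrade like this visible is to fingerprint the prompt together with the SDK version and the call parameters you set explicitly, so that a version bump changes the fingerprint even when the string does not. This is a sketch under assumptions: the function names, the parameter dictionary, and the version strings in the usage lines are hypothetical, and nothing here is claimed to be how an existing tool behaves.

```python
import hashlib
import json
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not-installed' if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not-installed"


def behaviour_fingerprint(prompt: str, sdk_version: str, params: dict) -> str:
    """Hash the prompt together with the SDK version and explicit call parameters."""
    payload = json.dumps(
        {"prompt": prompt, "sdk_version": sdk_version, "params": params},
        sort_keys=True,  # stable key order so equal inputs give equal hashes
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    prompt = "Summarize the following text:\n{text}"
    params = {"temperature": 0.2}  # set explicitly rather than inherited from defaults
    # Hypothetical version strings; in practice use installed_version("openai").
    before = behaviour_fingerprint(prompt, "1.0.0", params)
    after = behaviour_fingerprint(prompt, "1.0.1", params)
    print("fingerprint changed:", before != after)
```

The fingerprint does not tell you what changed in the SDK, only that the combination you last evaluated is no longer the combination you are running. That is enough to trigger a rerun of the evals.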

4. Quiet evaluation drift

The fourth vector is the cumulative version of all the others. The team has an eval suite. The eval suite is run quarterly, sometimes more often, sometimes less. Between runs, the prompt has been touched twelve times: three for bug fixes, two for new model support, four for “small wording tweaks,” and three inadvertently. The next eval shows a five-point quality drop. Bisecting twelve interleaved edits across three engineers and a quarter's worth of commits is, in practice, not done. The team writes a Notion page about the regression and moves on.

What good would look like

If you wanted to stop all four of these without rebuilding your stack, what would the minimum set of operations be?

You would need, at a minimum:

A way to separate prompt edits from code edits in a PR so reviewers know which kind of change they are evaluating.
A way to identify duplicates before they fork, so the two summarize prompts in two files become one prompt referenced twice.
A way to pin the prompt body itself so a downstream change — to the SDK, to the model, to anything — fails loud on the prompt rather than silently.
A way to make drift a build failure rather than a thing that gets noticed at eval time.

These four operations correspond very precisely to four blogus commands. scan enumerates inline prompts and tells you where they live. init and the .prompt file format extract them into a named, deduplicated artifact. lock writes a content hash for each. verify is the CI gate that fails when the hash on disk and the hash in the lock disagree.
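To make the enumeration step concrete, here is a rough sketch of what finding inline prompts in Python source can look like. It is not the blogus implementation: the function name, the sample source, and the length heuristic are assumptions made for illustration, and a real scanner would more likely look at which strings reach an LLM call.

```python
import ast

# Hypothetical source file contents, for illustration only.
SAMPLE_SOURCE = '''
SYSTEM_PROMPT = "You are a support assistant. Respond with raw JSON only."

def handler(text):
    label = "ok"
    return f"Summarize the following text in three bullets: {text}"
'''


def scan_source(source: str, filename: str = "<memory>", min_length: int = 40):
    """Return (filename, line, preview) for string literals long enough to be prompts."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        # The literal parts of f-strings also appear as ast.Constant nodes.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if len(node.value) >= min_length:
                findings.append((filename, node.lineno, node.value[:30]))
    findings.sort(key=lambda finding: finding[1])  # order by line number
    return findings


if __name__ == "__main__":
    for filename, line, preview in scan_source(SAMPLE_SOURCE, "chat.py"):
        print(f"{filename}:{line}: {preview!r}")
```

A length cutoff is a crude filter: it will also catch long docstrings and log messages, so the output is a list of candidates for a human to confirm, not a verdict.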

That is not an accident. The tool was designed by working backwards from those four drift vectors and asking what the smallest set of operations would be that closes each one.

Why this is “package-lock for prompts” and not something heavier

You might reasonably ask why you cannot solve this with a prompt registry — a hosted service that owns the prompt, exposes it via a REST endpoint, and version-bumps when it changes. That model works. It also reintroduces a category of problems that the source-tree model does not have:

Network dependency at runtime. Your app now needs to reach a registry to render a prompt.
Cache invalidation. The registry’s version cache and your app’s deployed code can disagree.
Two source-of-truth systems. Auditing a prompt change now means correlating registry history with git history.
Separate identity and access management. The team that can change a prompt is now defined in a second system.

A lockfile in the repo solves all four. The runtime never asks anything; it reads from disk. The registry-vs-deploy mismatch cannot happen because there is no registry. The audit history is a single git log. The “who can change a prompt” question reduces to “who has merge rights,” which you already have an answer to.
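The lock-and-verify idea needs nothing beyond the local filesystem, which a short sketch can show. The lock file name, the JSON layout, and the function names below are hypothetical and should not be read as the actual blogus format; .prompt files are treated as opaque bytes.

```python
import hashlib
import json
from pathlib import Path

LOCK_NAME = "prompts.lock.json"  # hypothetical file name, not a real tool's format


def content_hash(path: Path) -> str:
    """SHA-256 of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_lock(prompt_dir: Path) -> dict:
    """Record a content hash for every .prompt file directly under prompt_dir."""
    lock = {p.name: content_hash(p) for p in sorted(prompt_dir.glob("*.prompt"))}
    (prompt_dir / LOCK_NAME).write_text(json.dumps(lock, indent=2, sort_keys=True))
    return lock


def verify_lock(prompt_dir: Path) -> list[str]:
    """Return names whose on-disk hash differs from the lock, or that exist on one side only."""
    lock = json.loads((prompt_dir / LOCK_NAME).read_text())
    on_disk = {p.name: content_hash(p) for p in prompt_dir.glob("*.prompt")}
    names = sorted(set(lock) | set(on_disk))
    return [name for name in names if lock.get(name) != on_disk.get(name)]


if __name__ == "__main__":
    import sys

    # Assumes the lock was written earlier with write_lock() and committed.
    mismatches = verify_lock(Path(sys.argv[1] if len(sys.argv) > 1 else "."))
    for name in mismatches:
        print(f"prompt drift: {name} does not match the lock")
    sys.exit(1 if mismatches else 0)  # a non-zero exit fails the CI job
```

Run as a CI step, the non-zero exit code is what turns a mismatch into a build failure. Updating the lock then becomes a deliberate act that shows up as its own line in the diff.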

The package-lock analogy is doing real load-bearing work here. npm install and uv sync solved this exact shape of problem for source dependencies. The lessons are reusable: declared name, resolved hash, machine-readable manifest, fail-loud verification, no network in the critical path. blogus is the same shape applied to a different artifact.

What you still need a human for

A lockfile is not an opinion. It does not say whether a prompt is good, whether the new wording is more accurate, whether the temperature change makes sense for a customer-facing surface. It only says that the change was made deliberately and that the on-disk content matches the agreed-upon hash.

The argument for the lockfile is not that it replaces review. The argument is that without it, review of prompt changes does not really happen — the changes are invisible inside other changes. Once they are extracted into a .prompt file and a hash, the diff in a PR is unambiguous: it is the prompt that changed; the reviewer is looking at exactly the artifact that will run. That is the moment review becomes possible.

Inline prompts drift faster than code because the existing review machinery treats them as code, and the cost-of-attention model that works fine for if/else logic does not survive contact with strings whose effect lives entirely outside the codebase. Pull the strings into a named artifact, hash them, and make the build fail on mismatch. Everything else — eval suites, A/B tests, prompt-engineering ops — works better once the lower layer is solid.