Auditing prompt changes the same way you audit dep upgrades

Skelf-Research · June 1, 2026

review · ci · prompt-ops

When a Python team reviews a uv lock update, the review has a recognizable shape. Someone opens the PR. The reviewer reads the human-facing change (the version bump or the new package) and skims the lockfile for surprises. They look for transitive jumps they did not expect, for new packages they have never heard of, and for hashes that changed without a corresponding upstream version change. They check the changelog of the bumped package. They look at the test run. They approve.

That entire ritual exists because the lockfile is the artifact. The reviewer trusts the lock because they trust the process around the lock. The hash is not the security mechanism; the review is. The hash just makes the change legible.

A prompts.lock update should be reviewed in the same shape. This post is about what that ritual looks like, what shows up that does not show up in a dep review, and how to set up the surrounding tooling so the review can actually happen.

What a prompts.lock diff looks like

A PR that bumps one prompt looks like two file changes. The first is the .prompt file: a human-readable diff in the template body, possibly a frontmatter change (a new variable, a temperature tweak, a model id update). The second is the corresponding line in prompts.lock:

prompts:
  summarize:
-   hash: sha256:a1b2c3d4...
+   hash: sha256:f7e8d9c0...
    commit: 4903f76

That is it. The hash change is the receipt. The reviewer’s eyes go to the .prompt file, where the actual semantic change lives.

This is structurally identical to a uv lock PR: the human-meaningful change (a version bump in pyproject.toml) sits next to the machine receipt (the recomputed hashes in uv.lock). The reviewer reads the meaningful side; the receipt is for the build.
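To make the "receipt" idea concrete, here is a minimal sketch that assumes the lock entry is simply the SHA-256 of the .prompt file's raw bytes. How blogus actually derives its hashes is not described in this post, so prompt_hash and the sample prompt bodies are hypothetical.

```python
import hashlib


def prompt_hash(content: bytes) -> str:
    """Return a lock-style digest for the raw bytes of a .prompt file.

    Assumption: the lock stores a plain SHA-256 of the file contents.
    """
    return "sha256:" + hashlib.sha256(content).hexdigest()


# Hypothetical before/after bodies for the summarize prompt.
old = b"Summarize the following review:\n{{text}}\n"
new = b"Summarize the following review in German:\n{{text}}\n"

old_hash = prompt_hash(old)
new_hash = prompt_hash(new)

# Any edit to the file yields a different digest, so the lock line must change.
print(old_hash != new_hash)
```

The digest says nothing about whether the change is good. It only guarantees that a change to the file cannot happen without a visible change to the lock.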

The four review questions

When you actually audit a prompt change, four questions get asked, in roughly this order.

Is the change consistent with the intent of the PR? A PR titled “support German for the summarize endpoint” should change the prompt in a way that obviously enables that. If the diff also tweaks the temperature, the reviewer wants to know why — was that load-bearing for the German case, or did it sneak in?

Are the variables still right? Did the prompt gain or lose a variable? If it gained one, do the callers all pass it, or does it have a default? If it lost one, is the call site cleaned up? The frontmatter and the body in a .prompt file are read together, so this is one glance, not a multi-file investigation.

Is the model still the right model? A change from gpt-4o to gpt-4o-mini is a real change. Sometimes it is the right one — a prompt got simpler, the cost matters, the team is moving down-tier deliberately. Sometimes it is an accident. The frontmatter is explicit, so the question is answerable.

Is the eval evidence attached? This is the question that does not show up in a dep review. Source dependencies have changelogs and test suites. Prompts have neither, unless the team has built an eval. If the PR is changing the wording of a customer-facing prompt and there is no link to an eval run, the reviewer is supposed to push back.

The reviewer is doing all four of those checks against artifacts that, before blogus, were not really artifacts — they were strings embedded in arbitrary source files. The work to make the review possible was the work of pulling the prompt into a named file with a hash. After that, the review is recognizable.
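The variable and model questions are mechanical enough that a small script can surface them before a human looks. The sketch below assumes a .prompt file with `---`-delimited frontmatter made of flat `key: value` lines and `{{name}}` placeholders in the body. Both are assumptions for illustration, not the documented blogus format, and split_prompt and review_flags are hypothetical helpers.

```python
import re

# Assumed placeholder syntax: {{name}}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def split_prompt(text: str) -> tuple:
    """Split a prompt file into (frontmatter fields, body)."""
    lines = text.splitlines()
    fields = {}
    if not lines or lines[0].strip() != "---":
        return fields, text  # no frontmatter: everything is body
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return fields, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, ""  # unterminated frontmatter: treat the body as empty


def review_flags(old: str, new: str, watch=("model", "temperature")) -> list:
    """List watched frontmatter changes and added/removed body variables."""
    old_fields, old_body = split_prompt(old)
    new_fields, new_body = split_prompt(new)
    flags = []
    for key in watch:
        if old_fields.get(key) != new_fields.get(key):
            flags.append(f"{key}: {old_fields.get(key)} -> {new_fields.get(key)}")
    old_vars = set(PLACEHOLDER.findall(old_body))
    new_vars = set(PLACEHOLDER.findall(new_body))
    for name in sorted(new_vars - old_vars):
        flags.append(f"variable added: {name}")
    for name in sorted(old_vars - new_vars):
        flags.append(f"variable removed: {name}")
    return flags


OLD = "---\nmodel: gpt-4o\ntemperature: 0.4\n---\nSummarize: {{text}}\n"
NEW = (
    "---\nmodel: gpt-4o-mini\ntemperature: 0.4\n---\n"
    "Summarize for {{audience}}: {{text}}\n"
)

flags = review_flags(OLD, NEW)
for flag in flags:
    print(flag)
```

A script like this only raises the questions. Whether the move to a smaller model was deliberate, and whether every caller passes the new variable, remain for the reviewer to answer.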

What CI should be doing during this review

Three things, all of which are non-controversial:

Re-verify the lock. blogus verify runs on every commit. If the developer hand-edited the lock or pushed a stale one, the build fails. This is the role uv sync --locked plays in a Python build: it errors out when the lockfile is out of date, whereas --frozen installs from the lockfile without checking it.
Re-render the prompt with a known set of inputs. If your eval suite already exists, run it on the changed prompts. If it does not, the cheap version is to run blogus exec <name> with a representative variable bundle and, at minimum, confirm that the model still returns a response your code can parse. Failing fast on a prompt that produces a non-parseable response is worth doing.
Block merges with stale locks. GitHub branch protection on the build status is the standard mechanism. Same as everywhere else.

None of those steps require new infrastructure. They are the steps your team already runs for source dependencies; the new tool is just generating the lock to verify against.
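For readers who want to see what the lock re-verification amounts to, here is a rough sketch of the idea. It is not the blogus implementation. It assumes an already-parsed lock of the form {"prompts": {name: {"hash": ...}}}, files named <name>.prompt in a single directory, and a plain SHA-256 over the file bytes. The names file_hash and stale_entries are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Assumed hashing scheme: SHA-256 over the raw file bytes."""
    return "sha256:" + hashlib.sha256(path.read_bytes()).hexdigest()


def stale_entries(lock: dict, prompt_dir: Path) -> list:
    """Return prompt names whose lock hash does not match the file on disk."""
    stale = []
    for name, entry in sorted(lock["prompts"].items()):
        path = prompt_dir / f"{name}.prompt"
        # A missing file counts as stale: the lock describes something absent.
        if not path.is_file() or file_hash(path) != entry["hash"]:
            stale.append(name)
    return stale


with tempfile.TemporaryDirectory() as tmp:
    prompt_dir = Path(tmp)
    prompt = prompt_dir / "summarize.prompt"
    prompt.write_text("Summarize: {{text}}\n", encoding="utf-8")
    lock = {"prompts": {"summarize": {"hash": file_hash(prompt)}}}

    fresh = stale_entries(lock, prompt_dir)  # lock matches the file

    prompt.write_text("Summarize briefly: {{text}}\n", encoding="utf-8")
    stale = stale_entries(lock, prompt_dir)  # file edited, lock not regenerated

print(fresh, stale)
```

A CI wrapper around a check like this would exit non-zero whenever the returned list is non-empty, which is what lets branch protection block the merge.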

What does not transfer from dep reviews

A few things are genuinely different and worth naming.

There is no upstream changelog. When pydantic bumps a minor version, there is a changelog. When prompts/summarize.prompt bumps its hash, there is whatever the PR author wrote in the description. The review burden shifts toward the PR description. A team norm of “the PR description explains the why, not the what” closes the gap.

The blast radius is harder to predict. A library upgrade has known surface area: it can break things that call the library. A prompt change can break anything downstream of the model’s output, including things that parse it loosely (the JSON-in-markdown problem). Eval coverage is the only real defence; the lockfile just makes the change auditable.
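The JSON-in-markdown problem is easy to reproduce. The sketch below contrasts a strict parser with a loose one that also accepts a fenced response. Both functions are hypothetical, and the response strings are made up for illustration rather than taken from a real model.

```python
import json
import re

TICKS = "`" * 3  # a Markdown code fence, built indirectly to keep this listing clean

# Matches a response that consists entirely of one fenced block.
FENCE = re.compile(
    r"^" + TICKS + r"(?:json)?\s*\n(.*?)\n" + TICKS + r"\s*$", re.DOTALL
)


def parse_strict(response: str) -> dict:
    """Accept only a bare JSON object; raise ValueError otherwise."""
    value = json.loads(response)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object")
    return value


def parse_loose(response: str) -> dict:
    """Also accept a JSON object wrapped in a Markdown code fence."""
    match = FENCE.match(response.strip())
    return parse_strict(match.group(1) if match else response)


bare = '{"summary": "ok"}'
fenced = TICKS + "json\n" + bare + "\n" + TICKS

print(parse_loose(bare) == parse_loose(fenced))

try:
    parse_strict(fenced)
    strict_accepts_fenced = True
except ValueError:
    strict_accepts_fenced = False
print(strict_accepts_fenced)
```

A prompt edit that nudges the model from bare JSON to fenced JSON passes unnoticed through the loose parser and breaks the strict one. Neither outcome shows up in the prompt diff, which is why eval coverage has to carry that weight.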

Rollback is faster. This one is in your favour. A prompt rollback is git revert of the .prompt file plus a re-lock. No package republish, no dependency-graph reasoning, no version negotiation. The unit of rollback is one file.

A workable review checklist

Lifted from internal use; adapt freely.

The .prompt change matches the PR title and description.
Variables added / removed are consistent with all call sites.
Any change to the model id or temperature is intentional (or the fields are unchanged).
The prompts.lock diff is exactly one hash line per changed .prompt, no surprises.
If the prompt is customer-facing, an eval run is linked.
CI is green, including blogus verify.

Six checks. Three of them are mechanical and most of those can be automated. The other three are judgement calls and that is what the reviewer is for.
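As one example of automating a mechanical item, the "one hash line per changed .prompt" check can be sketched as a comparison between the old lock, the new lock, and the set of prompts the PR touched. The lock shape ({"prompts": {name: {"hash": ...}}}) and the function name lock_surprises are assumptions for illustration. In practice the changed set would come from the PR's file list.

```python
def lock_surprises(old_lock: dict, new_lock: dict, changed_prompts: set) -> list:
    """Report lock entries that disagree with the set of changed prompts."""
    old, new = old_lock["prompts"], new_lock["prompts"]
    problems = []
    for name in sorted(set(old) | set(new)):
        hash_changed = old.get(name, {}).get("hash") != new.get(name, {}).get("hash")
        if hash_changed and name not in changed_prompts:
            problems.append(f"{name}: hash changed but .prompt file did not")
        if not hash_changed and name in changed_prompts:
            problems.append(f"{name}: .prompt file changed but hash did not")
    # Changed prompts that appear in neither lock were never locked at all.
    for name in sorted(changed_prompts - set(old) - set(new)):
        problems.append(f"{name}: changed .prompt has no lock entry")
    return problems


old_lock = {"prompts": {"summarize": {"hash": "sha256:aaa"},
                        "classify": {"hash": "sha256:bbb"}}}
new_lock = {"prompts": {"summarize": {"hash": "sha256:ccc"},
                        "classify": {"hash": "sha256:bbb"}}}

clean = lock_surprises(old_lock, new_lock, {"summarize"})
suspicious = lock_surprises(old_lock, new_lock, {"classify"})
print(clean)
print(suspicious)
```

An empty list means the lock diff is exactly the receipt for the prompts the PR claims to change, and nothing more.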

Why the analogy holds

The deepest reason dep-upgrade review works is that the upgrade is forced into the open: a separate artifact (the lockfile) records the change, a separate tool (the resolver) generates the artifact, and a separate process (CI) verifies the artifact. None of those steps gives the developer or the reviewer a place to be casual. The same three pieces are what prompt changes have lacked: separate artifact, separate tool, separate verification. Once they exist, the review you already know how to run starts working on prompts too.

A worked example

Take a real-shaped PR. The author needs the customer-facing summarize prompt to handle long-form support transcripts. Before the change, the prompt was tuned for short product reviews. The PR touches one .prompt file: the body is longer, there is a new optional variable audience with a default of general, and the temperature drops from 0.4 to 0.2.

What the reviewer sees, in order:

The PR description names the use case (“support transcripts, 5-10 paragraphs of dialogue”).
The .prompt diff shows the new audience variable in the frontmatter, the temperature drop, and the body change.
The prompts.lock diff shows a single new hash for summarize.
The call-site diff shows one updated load_prompt call that passes the new variable on the support route only; other callers are unchanged because the variable has a default.
The eval run linked from the PR covers both the existing dataset, where the results are unchanged, and a new dataset of representative support transcripts, which the new prompt passes.

That review takes the reviewer about five minutes. The thing that made it possible was the structure: the artifact is in front of them, the receipt is next to it, the call sites are updated correctly, and the eval evidence is one click away. None of that would be possible if the prompt were still an inline string.
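The reason only one call site changes in this example is the default on audience. A minimal sketch of that behaviour follows, using a hypothetical render helper and an assumed `{{name}}` placeholder syntax. It is not the signature of load_prompt, which this post does not spell out.

```python
import re

# Assumed placeholder syntax: {{name}}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(body: str, defaults: dict, **variables: str) -> str:
    """Fill {{name}} placeholders, falling back to declared defaults."""
    values = {**defaults, **variables}  # explicit arguments override defaults

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing variable: {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, body)


BODY = "Summarize for a {{audience}} reader:\n{{text}}"
DEFAULTS = {"audience": "general"}

# Existing callers keep working without passing the new variable.
existing_caller = render(BODY, DEFAULTS, text="Great product.")

# Only the support route opts in to the new behaviour.
support_route = render(BODY, DEFAULTS, text="Agent: hello", audience="support")

print(existing_caller)
print(support_route)
```

Had audience been added without a default, every existing caller would fail with a missing-variable error, and the call-site diff in the PR would need to be correspondingly larger.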

The lockfile is small. The discipline around it is what does the work.