
Why Running an LLM Once Is Never Enough

How non-determinism quietly undermines AI-powered code review — and a practical way to fix it

By Shiva Kumar Pati  ·  March 2026

Here's something most people discover the hard way: ask an LLM to review the same piece of code twice and you'll get two different answers. Not dramatically different, but different enough to matter. One run flags a SQL injection vulnerability and misses a data exposure issue. The next run catches the data exposure but glosses over error handling. Neither run catches everything.

This isn't a bug or a sign that the model is underperforming. It's just how these systems work. LLMs are probabilistic — every output is a sample from a probability distribution, not a lookup from a table. The problem is that we often treat a single run as if it were the complete answer. For casual use, that's fine. For code review, where missing a security issue has real consequences, it isn't.


Where the Non-Determinism Actually Comes From

Temperature and Sampling

The most obvious source is temperature — the parameter that controls how "creative" the model is. At higher temperatures, the model spreads probability mass more evenly across token choices, leading to more varied outputs. At temperature zero, you get greedy decoding: always pick the most likely next token.

Most people assume that setting temperature to zero solves the problem. It reduces variance significantly, but doesn't eliminate it entirely. Which brings us to the less obvious sources.
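To see what temperature actually does, here's a toy sketch of the sampling step: divide the model's raw token scores (logits) by the temperature before converting them to probabilities. The logit values below are made up; the shape of the effect is the point.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature.
    Lower temperature sharpens the distribution toward the top token;
    higher temperature flattens it across alternatives."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.2)  # near-greedy: top token dominates
flat = softmax_with_temperature(logits, 2.0)   # exploratory: mass spreads out
```

At temperature 0.2 the first token takes nearly all the probability mass; at 2.0 the alternatives become live options, which is exactly the run-to-run variation described above.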

Hardware and Floating-Point Arithmetic

Modern LLM inference involves billions of floating-point operations. Different GPU models, different CUDA versions, and even different batching configurations introduce tiny numerical differences at each layer. These compound. Two identical requests hitting different servers in a load-balanced API can produce subtly different probability distributions, and therefore different outputs, even at temperature zero.
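You can see a miniature version of this on a CPU, no GPU required. Floating-point addition isn't associative, so the order in which a reduction sums its terms changes the low-order bits, and GPU kernels choose that order based on batch size and scheduling:

```python
# Floating-point addition is not associative: regrouping the same sum
# changes the low-order bits of the result.
vals = [0.1, 0.2, 0.3]

forward = (vals[0] + vals[1]) + vals[2]   # 0.6000000000000001
backward = vals[0] + (vals[1] + vals[2])  # 0.6
```

The difference is around one part in 10^16, but in a deep network those discrepancies accumulate layer by layer, and a flipped low-order bit can flip which token is "most likely" at temperature zero.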

Prompt Sensitivity

LLMs are surprisingly sensitive to phrasing. "Review this code for bugs" and "Identify issues in this code" are semantically equivalent to us, but they activate slightly different patterns in the model's attention. This means that even minor upstream formatting differences — a trailing newline, a slightly different variable name in the snippet — can shift which issues the model focuses on.


What This Looks Like in Practice

To make this concrete, consider this Python function. It has six distinct problems:

def get_user_data(user_id, db_conn):
    query = f"SELECT * FROM users WHERE id = {user_id}"  # SQL injection risk
    result = db_conn.execute(query)  # no exception handling
    data = result.fetchall()
    if data:
        user = data[0]
        return {
            "name": user[1],           # positional indexing — fragile
            "email": user[2],
            "password_hash": user[3],  # should never be returned
            "last_login": user[4]
        }
    return None

When I ran a standard code review prompt against this function four separate times, coverage varied run to run. Each of the following issues was surfaced by at least one run and missed by at least one other:

- SQL injection via f-string
- password_hash returned to caller
- No exception handling on execute()
- SELECT * instead of named columns
- Positional indexing on result rows
- No input validation on user_id

No single run finds everything. On average, each run surfaces about half the issues. But across all four runs, every issue appears at least once. The information is there — it just isn't reliably surfaced in a single pass.


The Fix: Sample Multiple Times and Aggregate

The solution isn't a smarter prompt or a lower temperature. It's treating the LLM like the probabilistic system it is. Run the review multiple times, collect all the findings, then merge and deduplicate them. Each run samples a different slice of the model's attention. Together, they converge on something much closer to complete coverage.

There are three components to making this work:

1. Structured Output

Free-form text reviews are hard to aggregate. Prompt the model to return findings as a JSON array — one object per issue, with category, severity, line number, and a description. This makes deduplication tractable and also nudges the model toward more specific, actionable findings rather than narrative summaries.

2. Multi-Run Collection

Run the structured review prompt N times independently. Three to five runs works well for typical functions. Each call should be a fresh request with no shared context — you want independent samples, not the model continuing from where it left off.

3. Aggregation via a Merge Pass

Once you have N sets of findings, feed them all back to the model in a single aggregation call, asking it to cluster duplicates and produce a unified list with a confidence score for each issue — based on how many independent runs surfaced it.

This step is worth pausing on: you're using the model to synthesize its own outputs. It turns out LLMs are quite good at recognizing when two differently-phrased findings describe the same underlying problem, which is exactly the capability you need here.

import json

import anthropic

client = anthropic.Anthropic()

REVIEW_PROMPT = """Review this code and return ONLY a JSON array of issues.
Each item must have: category, severity (high/medium/low), line (int or null),
issue (description), fix (suggested fix). No preamble, no explanation.

Code:
{code}"""

def multi_run_review(code: str, runs: int = 4) -> list[dict]:
    all_findings = []
    for i in range(runs):
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1000,
            messages=[{"role": "user",
                       "content": REVIEW_PROMPT.format(code=code)}]
        )
        try:
            findings = json.loads(resp.content[0].text)
            for f in findings:
                f["_run"] = i
            all_findings.extend(findings)
        except json.JSONDecodeError:
            continue  # skip runs that didn't return valid JSON
    return all_findings


def aggregate(raw: list[dict], total_runs: int) -> list[dict]:
    prompt = f"""These code review findings came from {total_runs} independent
LLM runs on the same code. Many are duplicates with different wording.

Merge them into a single deduplicated list. For each unique issue:
- Use the clearest description
- Add a "confidence" field: fraction of runs that surfaced it (e.g. 0.75)
- Keep all other fields

Return ONLY a JSON array. No extra text.

Findings:
{json.dumps(raw, indent=2)}"""

    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(resp.content[0].text)


# Usage — my_code is the source string you want reviewed
raw = multi_run_review(my_code, runs=4)
report = aggregate(raw, total_runs=4)

# Sort: highest confidence first, then by severity
sev = {"high": 0, "medium": 1, "low": 2}
report.sort(key=lambda x: (-x.get("confidence", 0), sev.get(x.get("severity"), 2)))

How Many Runs Is Enough?

For a typical 50–100 line function, three or four runs usually saturates the finding space — meaning additional runs stop producing new unique issues. For larger modules with many concerns, five to seven runs is more appropriate. A practical stopping rule: keep running until two consecutive runs add nothing new after deduplication.
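That stopping rule is straightforward to sketch. This version assumes a `review_once` callable that returns a list of finding dicts, and treats two findings with the same (category, line) pair as duplicates; both of those names are illustrative choices, not part of any real API:

```python
def review_until_saturated(code, review_once, max_runs=7, patience=2):
    """Run reviews until `patience` consecutive runs add no new findings.

    `review_once` is assumed to return a list of finding dicts with
    "category" and "line" keys; two findings sharing that pair are
    treated as duplicates. A real system would use fuzzier matching.
    """
    seen = set()
    findings = []
    stale = 0  # consecutive runs that added nothing new
    for _ in range(max_runs):
        new_this_run = 0
        for f in review_once(code):
            key = (f.get("category"), f.get("line"))
            if key not in seen:
                seen.add(key)
                findings.append(f)
                new_this_run += 1
        stale = stale + 1 if new_this_run == 0 else 0
        if stale >= patience:
            break
    return findings
```

The (category, line) key is deliberately crude; it's the cheap programmatic complement to the model-driven merge pass described earlier, useful for deciding when to stop sampling rather than for producing the final report.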

Cost scales linearly with runs, but finding coverage follows a curve that flattens quickly. The first run might catch 50% of issues. The second adds another 25%. By the fourth, you're typically above 90%. For most teams, three to four runs is the practical sweet spot.
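That flattening curve falls out of basic probability. If each run independently catches any given issue with probability p of about 0.5 (roughly what the experiment above showed), expected coverage after n runs is 1 - (1 - p)^n:

```python
def expected_coverage(p_per_run: float, n_runs: int) -> float:
    """Expected fraction of issues found after n independent runs,
    if each run catches any given issue with probability p_per_run."""
    return 1 - (1 - p_per_run) ** n_runs

coverage = [expected_coverage(0.5, n) for n in range(1, 5)]
# [0.5, 0.75, 0.875, 0.9375]: the 50% / +25% / >90% shape described above
```

The independence assumption is optimistic (runs from the same model are correlated), so treat this as an upper bound on how fast coverage grows, not a guarantee.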


A Bonus: Vary the Prompt Angle

There's a complementary technique worth combining with multiple runs: vary the prompt framing across runs rather than repeating the same prompt. A model asked to "review this code with the eyes of a security engineer" attends to different things than when asked to "review this as someone who will have to maintain it in six months." The code is the same. The attention is different.

This expands the effective sampling distribution beyond what temperature variation alone achieves, and often surfaces a class of findings — particularly maintainability and design concerns — that pure security-focused prompts routinely miss.
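One minimal way to wire this into the multi-run loop is to cycle through a list of framings, one per run. The framings below are examples, not a canonical set:

```python
# Hypothetical review angles, cycled across runs. The code under review
# stays the same; only the framing (and thus the model's focus) changes.
ANGLES = [
    "Review this code as a security engineer hunting for vulnerabilities.",
    "Review this code as the person who must maintain it in six months.",
    "Review this code as a performance specialist.",
    "Review this code as a new team member reading it for the first time.",
]

def prompt_for_run(code: str, run_index: int) -> str:
    """Build the review prompt for a given run, rotating the angle."""
    angle = ANGLES[run_index % len(ANGLES)]
    return f"{angle}\nReturn ONLY a JSON array of issues.\n\nCode:\n{code}"
```

Dropping `prompt_for_run(code, i)` in place of the single fixed prompt in the earlier `multi_run_review` loop is the whole integration.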


The Confidence Score Is Actually Useful

One underappreciated side effect of this approach is that the confidence score derived from cross-run frequency isn't just a completeness safety net — it's a useful signal in its own right.

A finding that appears in 4 out of 4 runs is almost certainly real and significant. A finding that appears in only 1 out of 4 runs might be a genuine edge case the model caught once, or it might be noise. Surfacing both, with their frequencies attached, lets reviewers make that judgment rather than hiding low-confidence findings or treating them with the same urgency as high-confidence ones.

This maps naturally onto a triage workflow: high-confidence findings go directly into the PR review; low-confidence findings get flagged for a human to evaluate. It's a cleaner handoff than "the AI said there might be an issue here, maybe."
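The triage split itself is tiny once the aggregated report carries confidence scores. The 0.75 threshold here is a policy choice to tune per team, not a recommendation:

```python
def triage(report: list[dict], threshold: float = 0.75) -> tuple[list, list]:
    """Split aggregated findings into two buckets: post directly to the
    PR review, or flag for a human to evaluate first.

    Assumes each finding carries a "confidence" field (fraction of runs
    that surfaced it), as produced by the aggregation step above."""
    auto_post, human_review = [], []
    for finding in report:
        if finding.get("confidence", 0) >= threshold:
            auto_post.append(finding)
        else:
            human_review.append(finding)
    return auto_post, human_review
```

With four runs, a 0.75 threshold means "seen in at least three of four runs goes straight to the PR," which matches the intuition in the paragraph above.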


Closing Thought

The instinct to run an LLM once and trust the output is understandable — it's fast, and the answers usually sound confident and complete. But for anything where missing something has a real cost, that instinct is worth resisting.

The multi-run pattern doesn't make the model more reliable. It makes your process more reliable by accounting for what the model actually is: a probabilistic system that samples intelligently from a large space of possible observations, not a deterministic function that exhaustively enumerates them.

Run it again. You'll be surprised what you missed the first time.

Related Reading

For a deeper look at engineering determinism into LLM systems in regulated environments — including ensemble validation, RAG architectures, and audit trail design — see Engineering Determinism into LLMs for Financial Applications. Many of the same principles apply, with stricter requirements around reproducibility and governance.
