A normal completion writes code once and hands it back. Code Mode treats code as something to get right, not just to produce: one model writes a solution, a separate critic model reviews it, and the coder repairs against that feedback — looping until the review passes or a budget/iteration cap is hit. It’s the Reflexion pattern (Shinn et al., 2023), applied across different models so the critic’s blind spots aren’t the coder’s. You set the objective with a policy. PRYSM picks the coder and the critic, runs the loop, and returns the files, the final review verdict, and a PrysmProof v2 that binds the task to the code it produced.
PRYSM never executes the generated code. The default critic is a pure LLM reviewer — it reads the code, it doesn’t run it. That keeps the endpoint safe to call with any task. See Safety below.
from prysm import Prysm

client = Prysm()
r = client.code("Write a thread-safe LRU cache in Python with tests", policy="depth")
for f in r["files"]:
    print(f["path"])
    print(f["content"])
print(r["passed"], r["prysm"]["proof"]["proof_hash"])

The loop

Code Mode runs four kinds of stage. Generate and review alternate until the code passes or the loop runs out of budget.
1. plan: Detect the target language (or honor the one you pass) and select the coder and critic models for the policy.

2. generate: The coder writes the complete solution as one or more fenced files. In depth, several diverse coders write in parallel and the best candidate is kept.

3. review: A separate critic model reads the code and returns an explicit verdict: PASS/FAIL, a score, and a list of concrete issues. If it passes, the loop stops.

4. repair: The coder rewrites the solution against the reviewer’s issues. The new draft goes back to review. Repeat up to max_iters total iterations.
The result reports how many iterations ran, the final passed verdict, and the full per-stage trace (when include_trace is on) so it’s never a black box.
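
The four stages above can be sketched as a small driver loop. This is a minimal illustration, not PRYSM's implementation: `run_loop`, `Review`, and the `coder`/`critic` callables are hypothetical names, and the stub functions stand in for real model calls. The plan stage (model selection) is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Review:
    passed: bool
    score: float
    issues: list[str] = field(default_factory=list)


# Hypothetical signatures: a coder receives the task plus the issues from the
# previous review (None for the first draft); a critic only reads the code.
Coder = Callable[[str, Optional[list[str]]], str]
Critic = Callable[[str, str], Review]


def run_loop(task: str, coder: Coder, critic: Critic, max_iters: int = 3) -> dict:
    """Generate, then alternate review and repair until PASS or max_iters drafts."""
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    trace = ["generate"]
    code = coder(task, None)                 # generate the first draft
    for iteration in range(1, max_iters + 1):
        review = critic(task, code)          # review: reads the code, never runs it
        trace.append("review")
        if review.passed or iteration == max_iters:
            break
        code = coder(task, review.issues)    # repair against the critic's issues
        trace.append("repair")
    return {"code": code, "passed": review.passed,
            "iterations": iteration, "trace": trace}


def stub_coder(task: str, issues: Optional[list[str]]) -> str:
    # The first draft forgets the empty-list case; the repair handles it.
    if issues is None:
        return "def mean(xs): return sum(xs) / len(xs)"
    return "def mean(xs): return sum(xs) / len(xs) if xs else 0.0"


def stub_critic(task: str, code: str) -> Review:
    if "if xs" in code:
        return Review(True, 0.9)
    return Review(False, 0.4, ["mean([]) divides by zero"])


result = run_loop("Write mean(xs)", stub_coder, stub_critic)
print(result["iterations"], result["passed"], result["trace"])
# 2 True ['generate', 'review', 'repair', 'review']
```

In this sketch an iteration is one reviewed draft, so max_iters=3 allows the initial generation plus at most two repairs.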

Policies: the objective dial

A policy says what you’re optimizing for. It chooses how many coders run and how strong the critic is.

efficiency

One cheap, capable coder and a lightweight critic. The fewest tokens that still gets a reviewed, repaired solution.

balanced

The default. A solid mid-tier coder paired with a strong reasoning critic — the right trade-off for everyday code generation.

depth

Maximum robustness. Several diverse coders (different providers) write in parallel, the best candidate is kept, and the strongest available critic reviews it.
| Policy | Coders | Critic | Default max_iters |
| --- | --- | --- | --- |
| efficiency | 1 (cheapest capable) | lightweight | 3 |
| balanced | 1 (mid-tier) | strong reasoning model | 3 |
| depth | several, diverse providers | strongest available | 3 |
depth is where PRYSM’s multi-model edge shows: independent coders from different providers rarely share the same failure mode, so crossing them and keeping the best candidate beats any single model on hard tasks.
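
One way to picture the generate stage under depth is to run several coders concurrently, score each draft, and keep the highest-scoring one. This is a sketch under assumptions: how PRYSM ranks candidates is not described here, so `best_of`, the scorer, and the coder stubs are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def best_of(task: str,
            coders: dict[str, Callable[[str], str]],
            score_candidate: Callable[[str, str], float]) -> tuple[str, str]:
    """Run every coder in parallel; return (coder_name, code) for the top-scoring draft."""
    with ThreadPoolExecutor(max_workers=len(coders)) as pool:
        futures = {name: pool.submit(fn, task) for name, fn in coders.items()}
        drafts = {name: fut.result() for name, fut in futures.items()}
    winner = max(drafts, key=lambda name: score_candidate(task, drafts[name]))
    return winner, drafts[winner]


# Stub coders standing in for models from different providers.
coders = {
    "coder-a": lambda task: "def add(a, b): return a + b",
    "coder-b": lambda task: 'def add(a: int, b: int) -> int:\n    """Return a + b."""\n    return a + b',
}


def toy_score(task: str, code: str) -> float:
    # Toy scorer (an assumption): prefer drafts that include a docstring.
    return 1.0 if '"""' in code else 0.5


name, code = best_of("Write add(a, b)", coders, toy_score)
print(name)  # coder-b
```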

Single-shot mode

Set review to false (review=False in the Python SDK) to skip the critic loop entirely: one coder, one draft, no repair. It’s the fastest, cheapest shape, for when you just want a quick generation and will review it yourself.
r = client.code("Write a Python hello world", review=False)
print(r["iterations"])  # 1

Overriding the models

PRYSM picks sensible coder and critic models for the policy, but you can force either:
r = client.code(
    "Implement rate limiting middleware",
    coder_model="claude-sonnet-4.5",   # who writes the code
    reviewer_model="deepseek-r1",      # who critiques it
    max_iters=4,                       # generate + up to 3 repairs
    max_cost_usd=0.10,                 # soft budget; stops the repair loop early
)
Unknown model ids are ignored and PRYSM falls back to its own selection, so a typo never breaks the call.
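
The fallback amounts to validating a requested id against a known set before using it. The sketch below uses hypothetical names (`KNOWN_MODELS`, `resolve_model`); the real catalogue and selection logic are PRYSM's own.

```python
from typing import Optional

# Hypothetical catalogue; only these two ids appear in the example above.
KNOWN_MODELS = {"claude-sonnet-4.5", "deepseek-r1"}


def resolve_model(requested: Optional[str], policy_default: str) -> str:
    """Honor a recognized override; otherwise fall back to the policy's pick."""
    if requested in KNOWN_MODELS:
        return requested
    return policy_default


# A misspelled id is ignored, so the policy default is used instead.
print(resolve_model("claude-sonet-4.5", "deepseek-r1"))  # deepseek-r1
```

Because a misspelled id is ignored silently, it is worth checking the reviewer field in the response to confirm which critic actually ran.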

Confidence and the verdict

The critic returns an explicit VERDICT: PASS|FAIL and a SCORE between 0 and 1. PRYSM blends that score with the running confidence so the final number reflects the whole loop, not just the last turn. The verdict, score, and the concrete issues the critic raised are all returned on the response:
"review": {
  "passed": true,
  "score": 0.92,
  "issues": [],
  "reviewer": "deepseek-r1"
}
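
If the critic's reply is free text carrying those two markers, extracting them might look like the sketch below. The reply layout, the `parse_verdict` helper, and the equal-weight `blend` are assumptions for illustration; the text above says the score is blended with a running confidence but does not say how.

```python
import re


def parse_verdict(reply: str) -> tuple[bool, float]:
    """Pull the PASS/FAIL verdict and the 0..1 score out of a critic reply."""
    verdict = re.search(r"VERDICT:\s*(PASS|FAIL)", reply)
    score = re.search(r"SCORE:\s*([01](?:\.\d+)?)", reply)
    if verdict is None or score is None:
        raise ValueError("critic reply is missing VERDICT or SCORE")
    value = min(max(float(score.group(1)), 0.0), 1.0)  # clamp into [0, 1]
    return verdict.group(1) == "PASS", value


def blend(running: float, score: float, weight: float = 0.5) -> float:
    """Assumed blend: a weighted average of running confidence and the new score."""
    return (1 - weight) * running + weight * score


reply = "VERDICT: PASS\nSCORE: 0.92\nISSUES: none"
passed, score = parse_verdict(reply)
print(passed, score, round(blend(0.60, score), 2))  # True 0.92 0.76
```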

PrysmProof v2

Every run is hashed into a PrysmProof v2: a SHA-256 that binds the "prysm-code-v1" tag, the policy, the language, the task itself, and every produced file (path + content hash) together with the execution stages. Because the task and the files are part of the hash, the proof attests to this code solving this task — change either and the hash changes. It’s logged to the same store as v1, so you can verify any run later:
curl https://api.prysm1.com/v1/proof/{request_id}
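
To make the binding concrete, here is a sketch that hashes those ingredients together. The canonical encoding (sorted-key JSON) and the field names are assumptions, so this will not reproduce a real proof_hash; it only shows why changing the task or any file changes the digest.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def proof_hash(policy: str, language: str, task: str,
               files: list[dict], stages: list[str]) -> str:
    """Hash the tag, policy, language, task, per-file hashes, and stages together."""
    payload = {
        "tag": "prysm-code-v1",
        "policy": policy,
        "language": language,
        "task": task,
        # Each file contributes its path plus a hash of its content.
        "files": [
            {"path": f["path"], "sha256": sha256_hex(f["content"])}
            for f in sorted(files, key=lambda f: f["path"])
        ],
        "stages": stages,
    }
    # Assumed canonical form: compact JSON with sorted keys.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical)


files = [{"path": "lru.py", "content": "class LRU: ...\n"}]
stages = ["plan", "generate", "review"]
original = proof_hash("depth", "python", "Write an LRU cache", files, stages)
tampered = proof_hash("depth", "python", "Write an LRU cache",
                      [{"path": "lru.py", "content": "class LRU: pass\n"}], stages)
print(original != tampered)  # True
```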

Safety: PRYSM never runs your code

Generating code is safe; executing untrusted, model-written code on a server is not. So Code Mode’s default critic is a pure LLM reviewer — it reasons about correctness, edge cases, and style from the source alone. Nothing PRYSM ships runs the generated code. The reviewer is a pluggable interface: a deployment that owns a real, isolated sandbox can supply its own runner to execute tests, but that’s an explicit, self-hosted choice — never the hosted default.
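
The pluggable reviewer can be pictured as a small interface. The `Reviewer` protocol and `SourceOnlyReviewer` below are hypothetical, not PRYSM's actual extension point. The stand-in reviewer only parses the source, which, like the hosted default, never executes it.

```python
import ast
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Review:
    passed: bool
    issues: list[str] = field(default_factory=list)


class Reviewer(Protocol):
    """Hypothetical interface: anything with this method can act as the critic."""

    def review(self, task: str, files: list[dict]) -> Review: ...


class SourceOnlyReviewer:
    """Checks that Python files parse; ast.parse builds a syntax tree without running code."""

    def review(self, task: str, files: list[dict]) -> Review:
        issues = []
        for f in files:
            try:
                ast.parse(f["content"], filename=f["path"])
            except SyntaxError as exc:
                issues.append(f"{f['path']}: line {exc.lineno}: {exc.msg}")
        return Review(passed=not issues, issues=issues)


reviewer: Reviewer = SourceOnlyReviewer()
ok = reviewer.review("demo", [{"path": "a.py", "content": "x = 1\n"}])
bad = reviewer.review("demo", [{"path": "b.py", "content": "def f(:\n"}])
print(ok.passed, bad.passed)  # True False
```

A self-hosted deployment with its own sandbox would implement the same method and run tests inside that sandbox instead.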

When to use Code Mode

Use chat / routing

Quick snippets, explanations, or code you’ll review yourself. One model, lowest cost, OpenAI drop-in.

Use Code Mode (/v2/code)

Self-contained tasks where you want a reviewed, repaired solution — and a verifiable proof of how it was produced.
Code Mode is available in both SDKs (client.code(...)), the CLI (prysm code), and directly via POST /v2/code.