01
The metric that actually pays: cost-per-result
A high benchmark score feels good — but it isn't what you pay for. What you pay for is each correct answer the model gives you on your task. That's cost-per-result: take the price of running the model on a representative batch, divide by the number of useful, correct outputs. A model can win a leaderboard and still be the wrong choice if every correct answer costs you five times more than a cheaper model that's almost as good. Stop optimizing for benchmark vanity. Optimize for cost-per-result on YOUR task.
02
The proof point (directional — attribute it)
This isn't theory. In a Semgrep benchmark (2026-06-28), the open-weight GLM-5.2 (MIT-licensed, runnable on your own machines) was compared to Claude Code on detecting IDOR (Insecure Direct Object Reference) vulnerabilities:
| On IDOR bug detection | GLM-5.2 (open, MIT) | Claude Code (frontier) |
|---|
| Accuracy (F1) | ~39% | ~32% |
| Cost per vulnerability found | ~$0.17 | ~$1.00 |
| Runs on-prem? | Yes (open weights) | No (hosted API) |
These are Semgrep's directional figures on one task, not universal truth — Claude isn't 'worse,' and on other tasks the ranking flips. The point is the shape: for this specialized, high-volume task, a cheap open model gave more correct answers per dollar. That's a routing signal, not a verdict.
03
The routing rule (the whole map in one line)
You don't pick one model to rule them all. You route each job to the right model at the right cost. The decision comes down to three dials:
- Volume — how many times will you run this task? High volume multiplies every cent of cost-per-result, so cheap-and-good-enough wins big.
- Error tolerance — can a wrong answer be caught cheaply (a test, a reviewer, a retry)? If yes, a slightly-less-accurate cheap model is fine. If a wrong answer is expensive or unrecoverable, pay up for the frontier.
- Cost-per-result, measured — don't guess. Run a representative batch on two candidates, count correct outputs, divide. Route to whichever is cheaper per correct result at acceptable accuracy.
Rule of thumb: high volume + recoverable errors → cheap open model. Low volume + expensive errors → frontier model. The rest is measuring.
04
The pairing map
A starting map you can adapt. 'Cheap open model' = a good-enough open-weight model (often on-prem); 'frontier' = your most capable hosted model.
| Task | Route to | Why |
|---|
| High-volume scanning (security/lint/triage loops) | Cheap open model | Runs constantly; errors caught by the next stage; cost-per-result dominates |
| Bulk classification / tagging / extraction | Cheap open model | Huge volume, low per-item stakes, easy to spot-check |
| Draft generation (first pass, boilerplate) | Cheap open model | A human or a frontier model edits the 10% that matters |
| On-prem / data-residency-sensitive work | Cheap open model (open weights) | Open weights run inside your boundary — no data leaves |
| The hard 10% (novel reasoning, final review, high-stakes) | Frontier model | Accuracy is worth the premium when an error is expensive |
| Anything customer-facing & unrecoverable | Frontier model | The cost of one bad answer outweighs the per-call savings |
The pattern: a cheap model does the volume, the frontier model does the judgement. Most pipelines should be a blend, not a single model.
05
How to actually measure cost-per-result
Make it mechanical so you decide on data, not vibes:
- Take a representative batch of your real task (50–200 items with known correct answers).
- Run it on two candidates: one cheap open model, one frontier model. Log total cost and total correct outputs for each.
- Compute cost-per-result = total cost ÷ number of correct/useful outputs (not ÷ total outputs).
- Check accuracy is above your floor (the minimum quality the downstream step needs).
- Route to the lowest cost-per-result that clears the floor. Re-measure when models or prices change — they move monthly.
Open-weight models change this math fast: when a free, on-prem model clears your accuracy floor, its cost-per-result can undercut a hosted frontier model by 5–10x on high-volume work.
06
Why open-weight changes the routing math
Open-weight models (like GLM-5.2, MIT-licensed) matter to this map for three operator reasons beyond raw price:
- No per-token bill on volume — once it's running on your hardware, high-volume loops don't meter you to death.
- Data stays in your boundary — on-prem open weights are often the only option for sensitive code or regulated data.
- No vendor lock on your cheapest tier — you control the model that does 90% of the volume; the frontier vendor only sees the hard 10%.
You don't have to love open models to use them well. You just route the work where cost-per-result is lowest and accuracy still clears the floor.
Get the next drop
One operator AI move a week, plus the occasional bonus template. No spam, unsubscribe anytime.
By submitting you agree to our Privacy Policy & Terms. Unsubscribe anytime.
You're in — check your inbox to confirm.
Frequently asked questions
What is cost-per-result, exactly?
The total cost of running a model on a representative batch divided by the number of correct, useful outputs (not total outputs). It's the real price of an answer you can actually use — which is what you should optimize for, instead of a benchmark score.
Does this mean GLM-5.2 is better than Claude?
No. On one task (IDOR detection), in one test (Semgrep, 2026-06-28, directional), the open-weight GLM-5.2 scored ~39% F1 vs Claude Code's ~32% at roughly $0.17 vs $1.00 per bug found. On other tasks the ranking flips. The lesson is the routing principle — route by cost-per-result on your task — not 'this model beats that one everywhere.'
Which tasks should go to a cheap open-weight model?
High-volume, specialized work where errors are cheap to catch downstream — security/lint/triage scan loops, bulk classification and extraction, first-pass drafts, and anything that must stay on-prem for data residency. Keep the frontier model for the hard 10%: novel reasoning, final review, and high-stakes or unrecoverable, customer-facing answers.
How do I measure it without a big eval setup?
Take 50–200 real items with known answers, run them on a cheap open model and a frontier model, log total cost and total correct outputs for each, and divide cost by correct outputs. Route to the lowest cost-per-result that still clears your accuracy floor. Re-measure when models or prices change.
Why does open-weight specifically change the math?
Once an open-weight model runs on your hardware, high-volume loops don't meter you per token, your data stays inside your boundary, and you're not locked to one vendor for the 90% of work that's routine. When a free on-prem model clears your accuracy floor, its cost-per-result can undercut a hosted frontier model many times over on volume.