Kno2gether (kno2gether.com)
Decision map

The Cost-Per-Result Model-Pairing Map

A free open-weight model just outperformed Claude Code at finding real security bugs, at a fraction of the cost per result. The lesson isn't 'switch models.' It's to route every task by cost-per-RESULT, not by benchmark score. Here's the map of which tasks go to a cheap open model and which go to the frontier.

01

The metric that actually pays: cost-per-result

A high benchmark score feels good — but it isn't what you pay for. What you pay for is each correct answer the model gives you on your task. That's cost-per-result: take the price of running the model on a representative batch, divide by the number of useful, correct outputs. A model can win a leaderboard and still be the wrong choice if every correct answer costs you five times more than a cheaper model that's almost as good. Stop optimizing for benchmark vanity. Optimize for cost-per-result on YOUR task.
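As a minimal sketch of that definition, the division can be written as one small function. The function name and the batch figures below are hypothetical, chosen only to reproduce the "five times more" scenario described above; they are not measurements.

```python
def cost_per_result(total_cost: float, correct_outputs: int) -> float:
    """Total spend on a batch divided by the number of correct, useful outputs."""
    if correct_outputs <= 0:
        # No usable answers: treat the cost per result as unbounded.
        return float("inf")
    return total_cost / correct_outputs


# Hypothetical batch figures, for illustration only.
leaderboard_winner = cost_per_result(total_cost=50.00, correct_outputs=100)  # 0.50 per correct answer
cheaper_model = cost_per_result(total_cost=9.00, correct_outputs=90)         # 0.10 per correct answer

print(f"leaderboard winner: ${leaderboard_winner:.2f} per correct answer")
print(f"cheaper model:      ${cheaper_model:.2f} per correct answer")
```

Note that the denominator is correct outputs, not total outputs, so a model that answers everything but gets little right is penalized.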
02

The proof point (directional, attributed to Semgrep)

This isn't theory. In a Semgrep benchmark (2026-06-28), the open-weight GLM-5.2 (MIT-licensed, runnable on your own machines) was compared to Claude Code on detecting IDOR (Insecure Direct Object Reference) vulnerabilities:
| On IDOR bug detection | GLM-5.2 (open, MIT) | Claude Code (frontier) |
| --- | --- | --- |
| Accuracy (F1) | ~39% | ~32% |
| Cost per vulnerability found | ~$0.17 | ~$1.00 |
| Runs on-prem? | Yes (open weights) | No (hosted API) |
These are Semgrep's directional figures on one task, not universal truth — Claude isn't 'worse,' and on other tasks the ranking flips. The point is the shape: for this specialized, high-volume task, a cheap open model gave more correct answers per dollar. That's a routing signal, not a verdict.
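To make "more correct answers per dollar" concrete, here is the arithmetic on the approximate per-finding costs quoted above. The inputs are the rounded, directional figures from this section; the helper name is hypothetical.

```python
# Approximate cost per vulnerability found, as quoted above (directional figures).
COST_PER_FINDING = {"GLM-5.2": 0.17, "Claude Code": 1.00}


def findings_per_dollar(cost_per_finding: float) -> float:
    """Invert cost-per-result to get results per dollar spent."""
    return 1.0 / cost_per_finding


# How many times more one finding costs on the frontier model than on the open model.
ratio = COST_PER_FINDING["Claude Code"] / COST_PER_FINDING["GLM-5.2"]

for name, cost in COST_PER_FINDING.items():
    print(f"{name}: about {findings_per_dollar(cost):.1f} findings per dollar")
print(f"Cost ratio: about {ratio:.1f}x")
```

On these rounded inputs the gap is a little under 6x per finding, for this one task only.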
03

The routing rule (the whole map in one line)

You don't pick one model to rule them all. You route each job to the right model at the right cost. The decision comes down to three dials:
  • Volume — how many times will you run this task? High volume multiplies every cent of cost-per-result, so cheap-and-good-enough wins big.
  • Error tolerance — can a wrong answer be caught cheaply (a test, a reviewer, a retry)? If yes, a slightly-less-accurate cheap model is fine. If a wrong answer is expensive or unrecoverable, pay up for the frontier.
  • Cost-per-result, measured — don't guess. Run a representative batch on two candidates, count correct outputs, divide. Route to whichever is cheaper per correct result at acceptable accuracy.
Rule of thumb: high volume + recoverable errors → cheap open model. Low volume + expensive errors → frontier model. The rest is measuring.
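The rule of thumb can be sketched as a small routing function. Everything here is an assumption for illustration: the `Task` fields, the `HIGH_VOLUME` threshold, and the route labels are hypothetical and should be replaced with values from your own pipeline.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    runs_per_month: int       # the volume dial
    errors_recoverable: bool  # the error-tolerance dial


# Hypothetical cut-off for "high volume"; pick one that fits your workload.
HIGH_VOLUME = 10_000


def route(task: Task) -> str:
    """Apply the rule of thumb; anything in between falls through to measurement."""
    high_volume = task.runs_per_month >= HIGH_VOLUME
    if high_volume and task.errors_recoverable:
        return "cheap-open"
    if not high_volume and not task.errors_recoverable:
        return "frontier"
    # Mixed cases: measure cost-per-result on a real batch before deciding.
    return "measure"


print(route(Task("lint triage loop", 500_000, True)))     # cheap-open
print(route(Task("contract final review", 40, False)))    # frontier
print(route(Task("support reply drafts", 50_000, False))) # measure
```

The third return value is deliberate: the rule only settles the two clear corners, and the remaining cases are decided by the measured cost-per-result.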
04

The pairing map

A starting map you can adapt. 'Cheap open model' = a good-enough open-weight model (often on-prem); 'frontier' = your most capable hosted model.
| Task | Route to | Why |
| --- | --- | --- |
| High-volume scanning (security/lint/triage loops) | Cheap open model | Runs constantly; errors caught by the next stage; cost-per-result dominates |
| Bulk classification / tagging / extraction | Cheap open model | Huge volume, low per-item stakes, easy to spot-check |
| Draft generation (first pass, boilerplate) | Cheap open model | A human or a frontier model edits the 10% that matters |
| On-prem / data-residency-sensitive work | Cheap open model (open weights) | Open weights run inside your boundary; no data leaves |
| The hard 10% (novel reasoning, final review, high-stakes) | Frontier model | Accuracy is worth the premium when an error is expensive |
| Anything customer-facing & unrecoverable | Frontier model | The cost of one bad answer outweighs the per-call savings |
The pattern: a cheap model does the volume, the frontier model does the judgement. Most pipelines should be a blend, not a single model.
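One way to picture the blend is a two-stage pipeline: the cheap model handles every item, and only the items flagged as needing judgement go to the frontier model. This is a sketch under stated assumptions. The function name, the stand-in models, and the `needs_judgement` check are all hypothetical; in practice the models would be real calls and the check might be a test, a rule, or a reviewer.

```python
from typing import Callable, Iterable


def blended_pipeline(
    items: Iterable[str],
    cheap_model: Callable[[str], str],
    frontier_model: Callable[[str], str],
    needs_judgement: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Return (route, output) pairs: cheap model first, frontier only on escalation."""
    results = []
    for item in items:
        draft = cheap_model(item)
        if needs_judgement(item, draft):
            # Escalate the hard or high-stakes item to the frontier model.
            results.append(("frontier", frontier_model(item)))
        else:
            results.append(("cheap", draft))
    return results


# Stand-in models and escalation check so the sketch runs without any API.
def cheap(text: str) -> str:
    return f"draft: {text}"


def frontier(text: str) -> str:
    return f"reviewed: {text}"


def escalate(item: str, draft: str) -> bool:
    return "high-stakes" in item


routed = blended_pipeline(
    ["tag invoice 17", "high-stakes refund reply"], cheap, frontier, escalate
)
for route_taken, output in routed:
    print(route_taken, "->", output)
```

The design choice to pass the models in as callables keeps the routing logic independent of any particular vendor or runtime.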
05

How to actually measure cost-per-result

Make it mechanical so you decide on data, not vibes:
  1. Take a representative batch of your real task (50–200 items with known correct answers).
  2. Run it on two candidates: one cheap open model, one frontier model. Log total cost and total correct outputs for each.
  3. Compute cost-per-result = total cost ÷ number of correct/useful outputs (not ÷ total outputs).
  4. Check accuracy is above your floor (the minimum quality the downstream step needs).
  5. Route to the lowest cost-per-result that clears the floor. Re-measure when models or prices change — they move monthly.
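Open-weight models change this math fast: when a free, on-prem model clears your accuracy floor, its cost-per-result can undercut a hosted frontier model by 5–10x on high-volume work. The Semgrep figures above work out to roughly 6x ($0.17 vs $1.00 per finding). Count hardware and operations in the total cost when you measure the on-prem side.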
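The five steps above can be sketched in a few lines. This assumes you have already run the batch and logged totals; the `BatchRun` type, the `pick_route` function, and every number below are hypothetical placeholders, not benchmark results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatchRun:
    model: str
    total_cost: float  # total spend for the batch, in dollars
    total_items: int   # items in the batch
    correct: int       # outputs judged correct/useful against known answers

    @property
    def accuracy(self) -> float:
        return self.correct / self.total_items if self.total_items else 0.0

    @property
    def cost_per_result(self) -> float:
        # Divide by correct outputs, not total outputs (step 3).
        return self.total_cost / self.correct if self.correct else float("inf")


def pick_route(runs: list[BatchRun], accuracy_floor: float) -> Optional[BatchRun]:
    """Steps 4 and 5: drop candidates below the floor, then take the cheapest per result."""
    eligible = [r for r in runs if r.accuracy >= accuracy_floor]
    return min(eligible, key=lambda r: r.cost_per_result, default=None)


# Hypothetical logged totals for one 100-item batch.
runs = [
    BatchRun("cheap-open", total_cost=2.00, total_items=100, correct=80),
    BatchRun("frontier", total_cost=12.00, total_items=100, correct=92),
]

for floor in (0.75, 0.90, 0.95):
    choice = pick_route(runs, accuracy_floor=floor)
    print(floor, "->", choice.model if choice else "no candidate clears the floor")
```

Raising the floor changes the answer: the cheap model wins at 0.75, the frontier model is the only eligible candidate at 0.90, and at 0.95 neither qualifies, which is a signal to try other candidates rather than lower the bar silently.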
06

Why open-weight changes the routing math

Open-weight models (like GLM-5.2, MIT-licensed) matter to this map for three operator reasons beyond raw price:
  • No per-token bill on volume — once it's running on your hardware, high-volume loops don't meter you to death.
  • Data stays in your boundary — on-prem open weights are often the only option for sensitive code or regulated data.
  • No vendor lock on your cheapest tier — you control the model that does 90% of the volume; the frontier vendor only sees the hard 10%.
You don't have to love open models to use them well. You just route the work where cost-per-result is lowest and accuracy still clears the floor.
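"No per-token bill" does not mean zero cost, so a break-even check helps when comparing tiers. The sketch below assumes a simplified model that is not from the source: on-prem cost is a fixed monthly amount (hardware, power, operations) with no marginal cost per call, and the hosted model has a flat cost per correct result. All names and figures are hypothetical.

```python
import math


def on_prem_cost_per_result(monthly_fixed_cost: float, correct_results_per_month: int) -> float:
    """Spread the fixed monthly cost over the correct results produced that month."""
    if correct_results_per_month <= 0:
        return float("inf")
    return monthly_fixed_cost / correct_results_per_month


def break_even_results(monthly_fixed_cost: float, hosted_cost_per_result: float) -> int:
    """Smallest monthly count of correct results at which on-prem is no more
    expensive than the hosted model, under the simplified assumptions above."""
    return math.ceil(monthly_fixed_cost / hosted_cost_per_result)


# Hypothetical figures: $1,500/month to run on-prem, $1.00 per correct result hosted.
print(break_even_results(1500.0, 1.00))          # 1500
print(on_prem_cost_per_result(1500.0, 10_000))   # 0.15
```

Below the break-even volume the hosted model is cheaper per result even though it meters every call; above it, each extra result pulls the on-prem cost-per-result down further.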


Frequently asked questions

What is cost-per-result, exactly?
The total cost of running a model on a representative batch divided by the number of correct, useful outputs (not total outputs). It's the real price of an answer you can actually use — which is what you should optimize for, instead of a benchmark score.
Does this mean GLM-5.2 is better than Claude?
No. On one task (IDOR detection), in one test (Semgrep, 2026-06-28, directional), the open-weight GLM-5.2 scored ~39% F1 vs Claude Code's ~32% at roughly $0.17 vs $1.00 per bug found. On other tasks the ranking flips. The lesson is the routing principle — route by cost-per-result on your task — not 'this model beats that one everywhere.'
Which tasks should go to a cheap open-weight model?
High-volume, specialized work where errors are cheap to catch downstream — security/lint/triage scan loops, bulk classification and extraction, first-pass drafts, and anything that must stay on-prem for data residency. Keep the frontier model for the hard 10%: novel reasoning, final review, and high-stakes or unrecoverable, customer-facing answers.
How do I measure it without a big eval setup?
Take 50–200 real items with known answers, run them on a cheap open model and a frontier model, log total cost and total correct outputs for each, and divide cost by correct outputs. Route to the lowest cost-per-result that still clears your accuracy floor. Re-measure when models or prices change.
Why does open-weight specifically change the math?
Once an open-weight model runs on your hardware, high-volume loops don't meter you per token, your data stays inside your boundary, and you're not locked to one vendor for the 90% of work that's routine. When a free on-prem model clears your accuracy floor, its cost-per-result can undercut a hosted frontier model many times over on volume.

Want this running under YOUR brand?

Knotie is a white-label AI platform — resell voice agents, chat agents, and automations under your own brand, your domain, your prices. Built-in credit billing means you keep the margin, and you can route the heavy lifting to whatever model is cheapest-per-result. Start free.
