01
It happened todayThe morning that proved nobody is immune
On the morning of
2026-06-15, per
status.claude.com, an incident titled
"Claude Opus 4.8 — Elevated errors" ran roughly
06:20–08:56 UTC and is now
resolved. For about two and a half hours, calls to that model could come back as errors. If you were chatting in an app, you shrugged and retried. But if you had an
agent — a loop calling that one model with no alternative — that was ~2.5 hours of
dead automation: jobs failing, queues backing up, customers staring at a spinner, and you finding out from a support ticket instead of a dashboard. This is the only incident I'll state as fact here, and I'm attributing it on purpose. The point isn't that one model had a bad morning. The point is that
every model, on
every provider, eventually will — and the only question that matters is whether your agent notices and routes around it, or just stops.
Resilience isn't a knock on any provider. The best-engineered API on earth still publishes a status page, because outages are a when, not an if. The teams that stay up aren't the ones who picked the perfect provider — they're the ones who assumed their provider would fail and built for it before they needed to.
02
Name the risk: SPOFA single model is a single point of failure
Here's the uncomfortable architecture truth: the quality of your model has nothing to do with the availability of your agent. You can pick the smartest model in the world, and if it's the only thing your loop can call, then its uptime is your uptime. One endpoint, one set of credentials, one region, one rate-limit pool — every one of those is a wire that, when cut, takes your whole automation down with it. That's a single point of failure (SPOF), and it's an availability risk regardless of how good the model is.
- One endpoint = one failure domain. A 5xx, a timeout, a regional blip, a rate-limit spike, an expired key — any single one stops a single-model agent cold.
- Interactive use hides it. A human in the loop retries, switches tabs, comes back later. An unattended agent loop has no human to improvise — it just throws and dies.
- Recurring + unattended = highest blast radius. A scheduled job or a webhook-driven loop can fail silently every fire for hours before anyone notices.
- Quality doesn't buy availability. "Best model" and "always available" are different axes. You want both, and only one of them is something you architect.
The fix isn't a better single model. It's making sure no single model can take you down — which means having somewhere to fall back to before the morning you need it.
03
Vendor-neutral, evergreenThe pattern: try a primary, fall back to a secondary
The fallback-model pattern is deliberately boring, which is why it works. You define a primary model and one or more fallbacks (ideally on a different provider, so a single provider's outage can't take out both). On every call you try the primary; if it returns a retryable failure — a 5xx, a timeout, or your circuit breaker is open — you fall back to the secondary so the loop keeps running. When the primary recovers, you prefer it again. That's the whole idea: no single provider is ever the only thing standing between your agent and a response.
- Order your providers. Primary (your preferred model) → fallback 1 (a different provider, so one vendor's outage doesn't sink both) → optional fallback 2. Different provider matters more than different model.
- Classify failures. Retryable/failover-worthy: timeouts, connection errors, 429s, 5xx, circuit-open. NOT failover-worthy: 400s/422s (bad request — same on every provider) and auth 401s for that key. Don't fail over a bug.
- Retry the same provider briefly, THEN fail over. A quick retry with backoff absorbs a transient blip; if it keeps failing, move to the next provider rather than hammering a sick endpoint.
- Wrap each provider in a circuit breaker. After N consecutive failures, mark it "open" and skip it entirely for a cool-down window — so you stop wasting time (and timeouts) on a provider that's clearly down.
- Prefer the primary again after cool-down. When the breaker half-opens, send a trial request; if it succeeds, close it and route back to your preferred model. Failover should be temporary, not permanent.
Keep it vendor-neutral on purpose. The pattern shouldn't know or care which providers you use — it just knows "try the list in order, skip the ones that are failing, prefer the top of the list when it's healthy." That's what lets you swap providers later without rewriting your agent.
04
This is the reward — take itWire it into your agent loop (copy-paste)
Here is the pattern as something you can paste in today. First the language-agnostic shape, then a concrete example using a generic OpenAI-compatible client with two base URLs / two keys — because most providers expose an OpenAI-compatible endpoint, this same code points at almost any of them by changing a base URL. Retry-with-backoff and a tiny circuit breaker are built in; nothing here is tied to a single vendor.
- Pseudocode — the shape of it:
providers = [primary, fallback1, fallback2] # ordered; mix vendors
function complete(request):
for p in providers:
if breaker[p].is_open(): # skip providers we know are down
continue
for attempt in 1..MAX_RETRIES:
try:
resp = p.call(request, timeout=T)
breaker[p].record_success()
return resp # first healthy provider wins
catch err:
if not is_retryable(err): # 400/422/401: don't fail over, it's a bug
raise err
breaker[p].record_failure()
sleep(backoff(attempt)) # e.g. 0.5s, 1s, 2s + jitter
# this provider exhausted its retries: fall through to the next provider
raise AllProvidersFailed # every provider down: alert + degrade gracefully
- Python — generic OpenAI-compatible client, two providers:
import time, random
from openai import OpenAI # any OpenAI-compatible SDK
PROVIDERS = [
{"name": "primary", "client": OpenAI(base_url=PRIMARY_URL, api_key=PRIMARY_KEY), "model": PRIMARY_MODEL},
{"name": "fallback", "client": OpenAI(base_url=FALLBACK_URL, api_key=FALLBACK_KEY), "model": FALLBACK_MODEL},
]
MAX_RETRIES, TIMEOUT = 2, 30
_fails, _open_until = {}, {}
def _open(name): # simple circuit breaker
return time.time() < _open_until.get(name, 0)
def _retryable(e):
code = getattr(e, "status_code", None)
return code is None or code == 429 or code >= 500 # timeout/conn or 5xx: retry; other 4xx: don't
def complete(messages):
for p in PROVIDERS:
if _open(p["name"]):
continue
for attempt in range(MAX_RETRIES + 1):
try:
r = p["client"].chat.completions.create(
model=p["model"], messages=messages, timeout=TIMEOUT)
_fails[p["name"]] = 0 # healthy again
return r
except Exception as e:
if not _retryable(e):
raise # bad request / auth: same on every provider, don't fail over
_fails[p["name"]] = _fails.get(p["name"], 0) + 1
if _fails[p["name"]] >= 3:
_open_until[p["name"]] = time.time() + 60 # cool down 60s, prefer primary after
time.sleep(min(pow(2, attempt), 8) + random.random()) # backoff + jitter
raise RuntimeError("all providers failed") # alert here
- TypeScript — same idea, ordered list + failover:
import OpenAI from "openai";
const PROVIDERS = [
{ name: "primary", client: new OpenAI({ baseURL: PRIMARY_URL, apiKey: PRIMARY_KEY }), model: PRIMARY_MODEL },
{ name: "fallback", client: new OpenAI({ baseURL: FALLBACK_URL, apiKey: FALLBACK_KEY }), model: FALLBACK_MODEL },
];
const openUntil: Record<string, number> = {}, fails: Record<string, number> = {};
const retryable = (s?: number) => s === undefined || s === 429 || s >= 500;
export async function complete(messages: any[]) {
for (const p of PROVIDERS) {
if (Date.now() < (openUntil[p.name] ?? 0)) continue; // breaker open: skip
for (let attempt = 0; attempt <= 2; attempt++) {
try {
const r = await p.client.chat.completions.create({ model: p.model, messages });
fails[p.name] = 0; return r; // first healthy provider wins
} catch (e: any) {
if (!retryable(e?.status)) throw e; // other 4xx: real bug, don't fail over
fails[p.name] = (fails[p.name] ?? 0) + 1;
if (fails[p.name] >= 3) openUntil[p.name] = Date.now() + 60000; // cool down, prefer primary after
await new Promise(r => setTimeout(r, Math.min(Math.pow(2, attempt), 8) 1000 + Math.random() 1000));
}
}
}
throw new Error("all providers failed"); // alert + degrade gracefully
}
- Health-check / prefer-primary-again note. The breaker's cool-down is your health check: when the window expires you try the primary again first, so failover is temporary. If you want eager recovery, run a tiny periodic background ping to the primary and close its breaker early the moment it answers — so you spend as little time as possible on the fallback.
Two honest caveats. (1) The retryable/non-retryable split is the part to get right — failing over on a 400 just runs your bug twice. (2) Keep the call shape identical across providers (same messages, same tools schema) so the fallback is a drop-in; if a provider needs a different shape, normalise it behind the provider object, not in your loop.
05
Five minutes, onceThe resilience checklist
Run down this list before you call an agent production-ready. None of it is exotic; all of it is the difference between a two-and-a-half-hour outage you slept through and a blip your users never noticed.
- Timeouts are set on every call. No unbounded waits — a hung request is an outage with no error. Pick a timeout shorter than your users' patience.
- A secondary provider's keys are provisioned NOW. Not "when we need it." The key, the base URL, and a smoke-test call all working before the outage. You can't onboard a provider during one.
- A circuit breaker per provider. Stop hammering a provider that's clearly down; skip it for a cool-down, then re-test. Prefer the primary again when it's healthy.
- Alerting on failover + on all-providers-failed. You should learn from a notification, not a customer. Alert when you start using the fallback (early warning) and page when everything fails.
- A periodic failover drill. Force the primary to fail in staging (bad key / blocked URL) and confirm the agent rides the fallback. An untested fallback is just a comment in your code.
- Graceful degradation for total failure. When every provider is down, fail in a way that's safe — queue and retry, return a clear message, don't lose the work. Down is bad; losing data is worse.
The whole job is small and it's a one-time cost. Build the fallback on a calm afternoon, drill it once, and you've turned every future provider outage from an incident into a footnote.
Get the next drop
New AI build guides + the occasional bonus template. No spam, unsubscribe anytime.
By submitting you agree to our Privacy Policy & Terms. Unsubscribe anytime.
You're in — check your inbox to confirm.
Frequently asked questions
Does adding a fallback double my cost?
No. You only call the fallback when the primary is failing — which is rare. In normal operation every request goes to your primary and the fallback costs nothing. The only ongoing cost is provisioning a second provider's keys (usually free until used) and a tiny optional health-check ping. You're paying near-zero for insurance that turns a multi-hour outage into a non-event.
How do I keep outputs consistent across two different models?
Keep the call shape identical (same system prompt, same messages, same tool/JSON schema) so the fallback is a drop-in. Constrain the output: ask for structured JSON and validate it the same way regardless of which model answered. Accept that a fallback response may be slightly different in style — during an outage, a slightly-different correct answer beats no answer. For anything strict, validate-and-repair the output rather than trusting either model blindly.
What about streaming responses?
Stream from whichever provider you've selected, but only commit to a provider once the stream actually starts. If the connection fails before the first token, fall over and start the stream fresh on the next provider. If it fails mid-stream, you generally restart the request on the fallback rather than trying to resume — so make the consuming side idempotent (don't act on partial output until the stream completes cleanly).
How do I actually test that failover works?
Force the failure on purpose. In staging, point the primary at a bad base URL or an invalid key, or use a mock that returns 503s, and confirm the agent rides the fallback and still completes. Then test recovery: restore the primary and confirm the breaker closes and routing returns to it. Put this drill on a schedule — an untested fallback path is the one that fails when you finally need it.
Should the fallback be a smaller or cheaper model?
It can be, and often should be on a different provider so a single vendor outage can't take out both. A capable-enough fallback that's available beats a perfect one that's down. The trade-off is your call: some teams fall back to a comparable model on another provider (consistency), others to a cheaper/smaller one (cost) and accept slightly lower quality during the outage window. Either is fine as long as the fallback can actually complete the task.
When should I fail over versus just retry the same provider?
Retry the same provider for a brief transient blip (a single timeout or 429) using a couple of attempts with backoff — most blips clear in milliseconds. Fail over to the next provider when retries are exhausted or the circuit breaker is open, i.e. the provider is sustainably unhealthy, not just briefly busy. Never fail over on a 400/422/401: those are bugs or auth problems that will fail identically on every provider.
Won't a circuit breaker make things worse if it trips wrongly?
Only if it's tuned badly. Set the failure threshold high enough that a single blip doesn't trip it (e.g. 3+ consecutive failures), keep the cool-down short (tens of seconds), and use a half-open trial request to re-test before fully closing. Tuned that way, the breaker only skips a provider that's genuinely down and quickly returns to it once it recovers — it reduces wasted timeouts, it doesn't cause outages.
Where should this live — in my app or in front of it?
Either works. In-app (the pattern in this guide) is the fastest to ship and keeps control in your code. A gateway/proxy in front of your app centralises failover for many services and keeps provider logic out of every codebase. Start in-app to get protected today; graduate to a shared gateway when more than one service needs the same resilience.