The bug that looks like nothing
- The symptom: empty/blank responses on tasks that have nothing to do with anything sensitive.
- The usual wrong guess: "the model is down" or "my code broke."
- The actual cause: text in the system prompt that a provider's OUTPUT classifier reads and reacts to — often something you never wrote by hand.
The cautionary tale: two skill descriptions that blocked 95% of sessions
godmode (in the official catalog: "Jailbreak LLMs: Parseltongue, GODMODE, ULTRAPLINIAN") and obliteratus (mlops: "abliterate LLM refusals"). Harmless to have — but their descriptions contained words like "jailbreak" and "GODMODE." When Hermes ran on a model behind Anthropic, the provider's output classifier read those words sitting in the prompt and, on totally unrelated sessions — a normal PR review — returned an empty response. The agent wasn't doing anything wrong. The vocabulary in its own system prompt was the trigger.- The descriptions were auto-injected into every session's system prompt — not just sessions that used those skills.
- Anthropic's OUTPUT classifier (it reads the model's context, not just the user's message) saw "jailbreak" / "GODMODE" and returned EMPTY — no refusal, no explanation.
- The blocked sessions were unrelated work. The builder had no reason to suspect the skill catalog.
How the maintainers proved it — and the fix
chore(skills): move red-team skills (godmode, obliteratus) to optional-skills, testing showed that with those two description lines present, 19 of 20 sessions (95%) got blocked by the output classifier. After removing the lines from the default injection, 5 of 20 (25%) — a normal, much lower baseline. Same agent, same tasks; the only variable was whether two skill descriptions were in the prompt. The fix shipped in early June 2026: both skills were moved to opt-in (optional-skills/), so their descriptions only load into a session if you deliberately install them. The default prompt is now clean.| Condition | Sessions blocked (empty response) | What changed |
|---|---|---|
| Risky skill descriptions IN the default system prompt | 19/20 (95%) | Baseline — the bug |
| Descriptions moved to opt-in (optional-skills/) | 5/20 (25%) | Only the normal classifier baseline remains |
Step 1 — Dump what's ACTUALLY in your agent's system prompt
- If your framework has a debug/verbose flag, turn it on and capture the assembled system prompt (e.g. many CLIs print it with a
--debug/--verbose/--print-system-promptstyle flag). - If not, intercept at the API boundary: log the
systemfield (or the first message) of the actual request payload your agent sends to the provider. A 5-line logging wrapper around your LLM call is enough. - Include skills/plugins/MCP tool descriptions — these are the lines you didn't write and most need to see. Confirm whether your framework injects ALL available skill descriptions or only the ones in use.
- Save the dump to a file and read it top to bottom once. You're looking for text you didn't author and wouldn't want a safety filter to see out of context.
Step 2 — The vocabulary checklist (words that trip output classifiers)
- Jailbreak / bypass / circumvent safety / DAN / GODMODE / "ignore your guidelines" — classic jailbreak vocabulary, even when quoted descriptively.
- Abliterate / uncensor / remove refusals / unfiltered model — red-team and model-surgery terms.
- Exploit / malware / payload / weaponize / exfiltrate — security-tool descriptions read as intent when stripped of context.
- Explicit categories tied to provider usage policies (self-harm, CBRN, illicit) named in a tool/skill description — a classifier sees the category word, not your benign use.
- Anything instructing the model to disregard, override, or work around its own safety or the provider's rules — even as an example or a quoted template.
Step 3 — Move risky skill descriptions to opt-in / lazy-load
- Identify which skills carry trigger vocabulary in their descriptions (from Step 2). Those are the candidates to move out of the default load.
- Move them to an opt-in tier. In Hermes that's
optional-skills/— not active by default; you install them explicitly:hermes skills install official/security/godmode(andhermes skills install official/mlops/obliteratus). The description only loads once installed. - In your own framework, mirror the pattern: don't concatenate ALL skill descriptions into the system prompt. Inject a description only when its skill is enabled for that session (lazy-load), or keep a short neutral catalog and fetch the full description on demand.
- For genuinely sensitive tools, gate them behind an explicit flag AND keep their verbose descriptions out of the shared prompt entirely — pass them only to the sub-call that uses the tool.
hermes skills install official/<category>/<skill>. Generalised: default-load the minimum; lazy-load the rest. Your always-on system prompt should contain only what every session genuinely needs.Step 4 — The 30-second test: is your agent being silently blocked?
- Give the agent the most boring safe task you can: "Reply with the single word: ok." Run it on a normal session (with your full system prompt loaded).
- If it returns empty or refuses — on "say ok" — a classifier is reacting to your context, not the request. That's your silent-block signal.
- Bisect: temporarily strip the suspect skill/tool descriptions from the prompt and re-run the same trivial task. If it now returns "ok," you've found the trigger.
- Confirm the rate, like the Hermes maintainers did: run the trivial task ~20 times with the lines IN vs OUT. A big gap (e.g. most blocked → mostly fine) proves the description is the cause, not luck.
- Wire it into CI: a startup smoke test that sends one trivial prompt and fails the build if the response is empty. Catches a silent-block regression the moment a new skill introduces it.
Where this gets unavoidable: agents you run for clients
base_url to https://api.knotie.ai), so the audit discipline in this guide becomes something you apply once at the gateway, not per client.
The audit checklist (run it every time you add a skill or tool)
- DUMP: do I have the agent's full, assembled system prompt in front of me — including every injected skill/tool description, not just my own instructions?
- SCAN: did I read it for trigger vocabulary (jailbreak, abliterate, exploit, bypass-safety, policy-category words) sitting in the always-on prompt?
- OPT-IN: are risky skill descriptions moved to opt-in / lazy-load (e.g. Hermes
optional-skills/, installed only when needed) instead of injected into every session? - MINIMISE: does my default system prompt contain ONLY what every session needs — and load everything else on demand?
- PROBE: does a trivial "reply ok" task come back non-empty on a normal session? Did I check the rate IN vs OUT, not just once?
- AUTOMATE: is there a CI/startup smoke test that fails on an empty response, so a future skill can't reintroduce a silent block unnoticed?
- BLAST-RADIUS: if a classifier started blocking, would I find out from my own monitoring before a customer does?
Get the next drop
New AI build guides + the occasional bonus template. No spam, unsubscribe anytime.
By submitting you agree to our Privacy Policy & Terms. Unsubscribe anytime.
Frequently asked questions
Why does my AI agent return an empty response on a normal task?
What's the Hermes Agent godmode / obliteratus case?
godmode ("Jailbreak LLMs: Parseltongue, GODMODE, ULTRAPLINIAN") and obliteratus ("abliterate LLM refusals"). Their catalog descriptions were injected into every session's system prompt, so Anthropic's output classifier returned empty responses on unrelated sessions. Maintainer testing in the fix commit showed 19/20 (95%) sessions blocked with the lines present, dropping to 5/20 (25%) after removal. The fix (early June 2026) moved both skills to opt-in.How do I see what's actually in my agent's system prompt?
system field of the actual request payload your agent sends to the provider (a small logging wrapper around the LLM call is enough). Crucially, make sure the dump includes injected skill/plugin/MCP tool descriptions — those are the lines most builders never wrote and most need to inspect.Which words in a system prompt can trip a provider's safety classifier?
How do I move risky skill descriptions to opt-in?
optional-skills/ and aren't active by default; you install them explicitly with hermes skills install official/<category>/<skill> (e.g. official/security/godmode), so the description only loads when deliberately enabled. In your own framework, inject a tool's description only when that tool is enabled for the session, and keep verbose descriptions of sensitive tools out of the shared prompt.