Why this checklist exists
You can demo a slick agent that books appointments, edits files, and runs commands. The moment the room has a security lead in it, the questions change. Not "does it work" — "what's the blast radius if it goes wrong, and who decided that?" If your honest answer is "it can run any shell command and I trust the model not to," you've lost the deal before the demo ends. The good news: the controls are concrete and you can put them in place in an afternoon. This is the list to run BEFORE you walk in — three risk areas, each with a fix you can show on screen.
- Risk 1 — Shell access: what can the agent actually execute, and what stops it?
- Risk 2 — Auto / no-confirm modes: who approves dangerous actions when nobody's watching?
- Risk 3 — Third-party skills/plugins: whose code did you just give your agent's permissions to?
Risk 1 — Shell access: the 5 permission levels (and which one you can defend)
The most useful framing here comes from engineer Daniel Isler (IndyDevDan), who maps bash-tool security for coding agents onto five levels — each one trusting something different. It transfers cleanly to ANY agent you deploy for a client. Walk up the ladder until you hit the level you'd be comfortable demoing to a CISO. Most DIY agents sit at Level 1 or 2 and don't know it.
| Level | Control | What you're trusting |
| --- | --- | --- |
| L1 | Rules in a skill / instructions file | The model's own judgement (it can override itself) |
| L2 | Same rules in the system prompt | The model again — louder, same attack surface |
| L3 | Blacklist hook: regex blocks dangerous commands before they run | Your imagination (agent can write a script and run THAT) |
| L4 | Whitelist hook: deny all shell, allow ~10 exact patterns (e.g. only npm test) | Your discipline in maintaining the allow-list |
| L5 | No raw shell at all — purpose-built tools only (run_tests, git_status) | Only what you built. Nothing else is callable |
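To make Level 5 concrete, here is a minimal Python sketch of "purpose-built tools only". It is an illustration under assumptions, not any particular framework's API: the names call_tool, TOOLS, and ToolNotAllowed are hypothetical, and the two tools simply mirror the run_tests and git_status examples in the table. The point is that the agent chooses a tool name from a fixed registry and never gets to pass a command string.

```python
import subprocess
from typing import Callable


class ToolNotAllowed(Exception):
    """Raised when the agent asks for anything outside the registry."""


def _run_fixed(argv: list[str]) -> str:
    # argv is hard-coded by us, never supplied by the model.
    # Passing a list (no shell=True) means no shell parses the command.
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=300, check=False
    )
    return result.stdout + result.stderr


def run_tests() -> str:
    return _run_fixed(["npm", "test"])


def git_status() -> str:
    return _run_fixed(["git", "status", "--short"])


# The complete set of things the agent can do. Hypothetical registry name.
TOOLS: dict[str, Callable[[], str]] = {
    "run_tests": run_tests,
    "git_status": git_status,
}


def call_tool(name: str) -> str:
    """Dispatch a tool by name; anything not registered is refused."""
    if name not in TOOLS:
        raise ToolNotAllowed(f"'{name}' is not a registered tool")
    return TOOLS[name]()
```

With this shape, a request for "bash" or "python" fails at dispatch because there is nothing to call. The trade-off is that every new capability is a function you have to write and review yourself.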
The Level-3 trap, with a worked example
Blacklists feel safe and aren't. Say you block destructive commands with a regex hook that denies anything matching rm -rf. Looks airtight. But the agent isn't limited to typing that string — it can write a two-line Python file that does the same deletion and then run python cleanup.py, which sails straight past your rm -rf rule. That's the whole reason Level 4 inverts the logic: instead of guessing every bad command (impossible), you allow a short list of known-good ones and deny everything else by default. Same idea as a firewall: default-deny beats blacklist-everything.
- Blacklist (L3): deny what you can think of → misses what you didn't (scripts, aliases, encodings).
- Whitelist (L4): allow ~10 exact, anchored patterns → everything else is denied automatically.
- Bonus, verified: a deniedPaths rule that blocks Read(./.env) does NOT block cat .env via the shell: the path rule isn't enforced on bash (anthropics/claude-code issue #45992). Test your OWN deny rules through the shell before you trust them.
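The gap between the two approaches is easy to see in a few lines of Python. This is a sketch under assumptions: the function names and the three allow-list entries are made up for illustration, and wiring the check into your agent's actual hook mechanism is left out. The blacklist searches for a bad pattern anywhere in the command; the allow-list requires the whole command to match a known-good pattern.

```python
import re

# Level 3: deny anything that contains a known-bad pattern.
BLACKLIST = [re.compile(r"\brm\s+-rf\b")]

# Level 4: allow only these commands; everything else is denied.
# Hypothetical entries, keep your own list short and specific.
ALLOWLIST = [
    re.compile(r"npm test"),
    re.compile(r"git status"),
    re.compile(r"git diff( --stat)?"),
]


def blacklist_allows(command: str) -> bool:
    """Level 3: allowed unless a bad pattern appears somewhere in the command."""
    return not any(p.search(command) for p in BLACKLIST)


def allowlist_allows(command: str) -> bool:
    """Level 4: allowed only if the ENTIRE command matches a listed pattern."""
    return any(p.fullmatch(command.strip()) for p in ALLOWLIST)


if __name__ == "__main__":
    samples = [
        "rm -rf build",
        "python cleanup.py",
        "npm test",
        "npm test && python cleanup.py",
    ]
    for cmd in samples:
        print(
            f"{cmd!r}: blacklist allows={blacklist_allows(cmd)}, "
            f"allow-list allows={allowlist_allows(cmd)}"
        )
```

Note the use of fullmatch rather than search. An unanchored allow-list would accept "npm test && python cleanup.py" because the string contains "npm test", which puts you back at Level 3.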
Risk 2 — Auto-mode and no-confirm: the blind spot to disclose
"It runs fully autonomously, no babysitting" sounds great in a pitch and terrifying to a security buyer. Skipping the human-in-the-loop means a classifier — not a person — decides whether each action is safe. Those classifiers are good, not perfect, and the honest move is to know the number. When Anthropic built auto-mode for Claude Code, they published their own evaluation: on a set of REAL over-eager actions, the full safety pipeline still let through about 17% — roughly one in six genuinely-dangerous, beyond-authorization actions slipped past. Their own words: auto-mode "is not a drop-in replacement for careful human review on high-stakes infrastructure." That's not a reason to never use it. It's the reason a confirmation step belongs on anything irreversible.
- Reported figure (Anthropic's own post): ~17% false-negative on a sample of real over-eager actions, full pipeline — not a blanket "misses 17% of everything," but enough to matter.
- The misses were usually the classifier KNOWING an action was risky but mis-judging whether the user had consented to it.
- Pre-demo move: list your agent's irreversible actions (delete, send money, email a client, push to prod) and require an explicit confirm on each — even in auto-mode.
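One way to implement that pre-demo move is a gate that sits in front of every action, sketched here in Python. Everything in it is an assumption for illustration: the action names, the execute function, and the confirm callback are hypothetical, and a real deployment would route the confirmation to whatever approval channel the client uses. The rule it encodes is simple: auto-mode can skip confirmation for ordinary actions, but never for the irreversible ones.

```python
from typing import Callable

# Hypothetical action names; replace with your agent's own irreversible list.
IRREVERSIBLE = {"delete", "pay", "send", "deploy"}


def needs_human(action: str, auto_mode: bool) -> bool:
    """Irreversible actions always need approval; others only outside auto-mode."""
    return action in IRREVERSIBLE or not auto_mode


def execute(
    action: str,
    run: Callable[[], str],
    confirm: Callable[[str], bool],
    auto_mode: bool = True,
) -> str:
    """Run the action only if no approval is needed or a human approved it."""
    if needs_human(action, auto_mode) and not confirm(action):
        return f"{action}: blocked (no human approval)"
    return run()


def ask_on_terminal(action: str) -> bool:
    """Simplest possible confirm callback: a typed 'yes' at the console."""
    answer = input(f"Agent wants to '{action}'. Type 'yes' to approve: ")
    return answer.strip().lower() == "yes"
```

In a demo, calling execute("deploy", run=..., confirm=ask_on_terminal) and declining at the prompt gives you a visible refusal to show the security lead, while a read-only action in auto-mode goes through without interruption.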
Risk 3 — Third-party skills & plugins: an unaudited supply chain
Every skill or plugin you install runs with your agent's permissions. A marketplace makes that one click — and that's exactly the problem. In early 2026, the OpenClaw skill marketplace (ClawHub) was hit by a poisoning campaign nicknamed ClawHavoc: security firm Koi Security audited 2,857 skills and flagged 341 as malicious — roughly one in eight — with the bulk traced to a single coordinated operation. The payload on macOS was an info-stealer that lifted credentials, keychains, and crypto wallets, often by tricking the user into pasting a base64 command. (Other audits put the malicious rate higher; the exact percentage is contested, but the lesson isn't.) Treat a third-party skill registry the way you'd treat npm or PyPI: useful, and an attack surface.
- Pin versions. Don't auto-update skills/plugins into a client environment.
- Read what it can reach (file paths, network, secrets) before granting it. If it wants your .env, that's the whole game.
- Prefer first-party or audited skills for anything touching a customer. A clever skill isn't worth a stealer in your customer's stack.
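Pinning can be enforced rather than just promised. The Python sketch below assumes skills are plain files in one directory and that you keep your own record of the SHA-256 hash of each file at the time you reviewed it. Both are assumptions for illustration, as are the function names; no marketplace or agent framework is being described here. It reports any skill that changed since review, appeared without review, or went missing.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file's bytes so any change to its contents is detectable."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_skills(skills_dir: Path, pinned: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every file matches its pin."""
    problems = []
    present = {p.name: p for p in skills_dir.iterdir() if p.is_file()}
    for name, path in sorted(present.items()):
        expected = pinned.get(name)
        if expected is None:
            problems.append(f"{name}: not in the pinned list")
        elif sha256_of(path) != expected:
            problems.append(f"{name}: contents changed since review")
    for name in sorted(set(pinned) - set(present)):
        problems.append(f"{name}: pinned but missing")
    return problems


if __name__ == "__main__":
    # Throwaway demo: pin one file, then simulate a silent update.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        skill = root / "calendar_skill.md"  # hypothetical skill file
        skill.write_text("reviewed version")
        pins = {skill.name: sha256_of(skill)}
        print(verify_skills(root, pins))
        skill.write_text("auto-updated version")
        print(verify_skills(root, pins))
```

Run a check like this at agent startup and refuse to load on any reported problem. That turns "we don't auto-update" into something you can demonstrate.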
The pre-demo checklist (print this, run it the morning of)
Ten minutes, the morning of the demo. Every box you can tick is an answer you can give with a straight face.
- Shell: am I at Level 4 (whitelist) or Level 5 (no raw shell)? If I'm at L1–L3, raise it before the demo.
- Prove it: try one off-list command live and show it gets denied. A working denial is the best slide.
- deniedPaths: test a denied path THROUGH the shell (cat / grep), not just the file tool — confirm it's actually blocked.
- Auto-mode: is there a human-confirm step on every irreversible action? List them out loud: delete, pay, send, deploy.
- Sandbox: is the agent in an isolated container/VM with scoped network, not on a machine with prod creds in env?
- Skills/plugins: are all third-party skills pinned, reviewed, and from a source I'd vouch for?
- Secrets: are API keys scoped (least-privilege, per-customer) and rotatable — or is one god-key wired in?
- Blast-radius answer: can I say, in one sentence, the worst thing this agent can do — and why that's acceptable?