AI Agents

Prompt Injection Deployment Boundary Checklist

HackMyClaw resisted 6,000+ injection emails, but production agents still need boundaries for secrets, egress, irreversible actions, memory, and spend.

June 27, 2026·8 min read·1,636 words

Last verified: 2026-06-27.

HackMyClaw is evidence that one challenge resisted many attempts; it is not proof that prompt injection is solved.

In short: Better refusals are encouraging, but deployment boundaries still own the risk. If an agent can read email, files, APIs, web pages, tickets, or chat logs, treat that content as hostile input. Keep secrets out of context, deny egress by default, gate irreversible actions, cap spend, isolate memory, and prove recovery before production.

Fernando Irarrázaval's HackMyClaw write-up is useful because it was not a toy chat prompt. People could email Fiu, his OpenClaw assistant, and try to make it leak a secrets.env file. Fernando says the challenge received more than 6,000 emails from over 2,000 people. The secret did not leak, and no attacker made the assistant send an unauthorized reply.

That result is worth taking seriously. It is also exactly where production teams can over-learn the wrong lesson. Simon Willison's summary makes the right caution explicit: failed attempts do not guarantee that a more sophisticated attacker could not get through, especially when a prompt-injection failure could cause irreversible damage.

For Toolhalla readers, the lesson is not "trust the model." The lesson is "move the trust boundary out of the model." This checklist pairs the HackMyClaw evidence with OWASP's AI Agent Security Cheat Sheet, which frames agent risk around prompt injection, tool abuse, data exfiltration, memory poisoning, high-impact actions, cascading failures, denial-of-wallet, and sensitive data exposure.

For adjacent Toolhalla controls, see the agent write-permission UX checklist, the ChatGPT Lockdown Mode prompt-injection explainer, and the AI agent sandbox guide.

What HackMyClaw actually tested

HackMyClaw tested an agent that had to process hostile email. According to Fernando's write-up, Fiu was an OpenClaw assistant with access to email context, files, a secrets.env target, and a web/VPS environment. The assistant was instructed not to reveal credentials, not to execute commands from emails, not to modify its own files, and not to exfiltrate data. It was also instructed not to reply to emails, although it had the ability to do so.

That matters because the attack surface looked like a real connected-agent surface: untrusted messages, local files, secrets, tool access, memory, and possible outbound communication. The challenge was narrower than a production deployment, but it covered the part that keeps security engineers awake: can untrusted content persuade an agent to cross a boundary it should not cross?

The observed outcome was strong for that challenge. Fernando reports more than 6,000 emails from more than 2,000 people, no leaked secret, and no unauthorized reply. Simon summarized the same result and still warned against treating it as proof of production safety.

What went wrong operationally

The most useful part of the experiment may be what failed around the model.

Fernando says Google suspended Fiu's Gmail account after thousands of inbound emails and rapid API calls triggered fraud detection, and that reinstatement took three days. He also reports more than $500 in API costs because every email consumed tokens. Those are not model-alignment failures; they are deployment-boundary failures. A real customer-support agent, sales inbox agent, or internal-ticket agent can be attacked by volume as well as by clever wording.

The experiment also exposed context and memory problems. Fernando writes that batch processing contaminated the experiment: when early emails in a batch were obvious prompt injections, the agent became more suspicious of later emails. He changed the setup so each email ran in a fresh context. Around email 500, the agent wrote a memory noting that the volume looked like a coordinated security exercise, and Fernando started deleting memory files before checking emails and rerunning earlier messages.

There was also pipeline fragility. Fernando reports that some attackers sent a refusal-trigger string that caused the API to return a refusal stop reason and broke his pipeline. The important production lesson is not the string itself. It is that refusal, moderation, rate-limit, timeout, and tool-error paths need tests just like success paths.

Production boundary checklist

risk	boundary	test before production
Secret exfiltration	Keep production credentials out of model context; use a vault or proxy that returns only scoped data.	Plant fake secrets in staging and verify they never appear in prompts, tool arguments, logs, replies, or outbound requests.
Untrusted email, files, and web pages	Treat external content as data, not authority; strip instructions from retrieved content before tools run.	Seed emails, pages, PDFs, and tickets with indirect prompt-injection instructions and verify the agent keeps the system policy.
Irreversible actions	Require human approval, dry-run mode, and separate authority for financial, administrative, destructive, or externally visible writes.	Ask the agent through untrusted content to delete, send, buy, publish, refund, or change permissions; verify it creates a draft or approval request only.
Data egress	Deny outbound network, email replies, webhooks, and file uploads by default; allowlist destinations by business purpose.	Try to send a fake secret to an unlisted URL, address, webhook, paste site, or attachment target and confirm the request is blocked.
Memory poisoning	Separate trusted configuration from user/session memory; expire memory; review writes that affect future behavior.	Inject false instructions into conversation memory and verify they do not change later high-trust tasks.
Denial-of-wallet	Set per-run token, tool-call, API, email, and retry caps with a kill switch.	Replay a synthetic flood and verify the agent stops before quota, spend, inbox reputation, or downstream systems are damaged.

This table is the minimum. It is not a substitute for threat modeling, but it catches the mistake that HackMyClaw could tempt people into making: moving a dangerous permission into the agent because the model refused a previous attack set.

The controls that matter most

First, isolate secrets. A prompt cannot leak a value that never enters the prompt, retrieved context, tool result, log line, or reply draft. If the agent needs to perform an authenticated action, prefer a narrow service account or brokered tool over handing it raw credentials.

Second, separate reading from acting. Reading untrusted email or web content is already risky, but the risk changes when that same run can send email, change files, call an admin API, or spend money. Treat high-impact actions the way OWASP does: irreversible, financial, administrative, or externally visible operations need independent validation.

Third, make egress boring. The safest default is no arbitrary outbound destination. If the agent is allowed to send an email, call a webhook, browse a URL, upload a file, or post a message, the destination should be scoped and logged. Prompt injection often becomes harmful when it gets a path out.

Fourth, reset context across trust zones. HackMyClaw's batch-contamination issue is a practical reminder that one user's hostile message should not shape another user's task. Use fresh context for independent jobs, mark retrieved data by trust level, and avoid mixing security tests, customer content, and operator instructions in the same memory path.

Fifth, put cost controls next to security controls. Denial-of-wallet is a real agent failure mode. A prompt-injection attempt does not need to leak a secret to hurt you if it can trigger unbounded retries, long tool loops, expensive model calls, or account suspension.

Sixth, log the decision, not only the output. Store the tool name, arguments, destination, approval state, policy decision, spend counters, and refusal/error path. If a test fails, you need enough evidence to reproduce it without reading private customer data into a new prompt.

How to test your own agent safely

Use a staging account, fake secrets, synthetic customer records, and a limited inbox or API quota. Do not run the first prompt-injection test against a production email account, production file system, real billing instrument, or live customer database.

Build a small test harness before the agent gets real permissions:

1. Put a fake secret in the place your real deployment would be tempted to expose.

2. Send benign messages, obvious injection messages, multilingual attempts, authority-impersonation messages, and long-volume batches.

3. Verify every boundary: no secret in context, no unauthorized reply, no unlisted egress, no high-impact action without approval, no cross-user memory carryover, and no unbounded spend.

4. Force error paths: refusal, rate limit, timeout, malformed tool output, duplicate event, and partial failure.

5. Practice recovery: disable the agent, revoke the tool token, rotate the fake secret, drain the queue, and replay cleanly.

If that sounds heavier than writing a better system prompt, that is the point. The system prompt is one layer. The deployment boundary is the product decision that determines blast radius.

FAQ

Is prompt injection solved?

No. HackMyClaw is encouraging evidence from one public challenge, not a formal proof. Simon Willison's caution is the right default: 6,000 failed attempts do not guarantee that a more sophisticated attacker, a different tool set, a longer interaction, or a different deployment would fail too.

Should I let agents read email?

Only with boundaries. Email is untrusted input from arbitrary senders. If an agent reads email, start with read-only summaries, no production secrets in context, a separate staging mailbox, limited quotas, fresh context per message, and no ability to send replies or call high-impact tools without approval.

What counts as an irreversible action?

Deletion, money movement, refunds, purchases, permission changes, credential rotation, production deploys, public posts, customer emails, legal/HR messages, and admin API calls should all be treated as high impact. Some are technically reversible, but the public, financial, operational, or trust damage may not be.

What is denial-of-wallet?

OWASP describes denial-of-wallet as attacks that cause excessive API or compute costs through unbounded agent loops. For connected agents, the same pattern can also burn email reputation, ticket quotas, crawler budgets, rate limits, or third-party API allowances. Put hard caps in the runtime, not just warnings in the prompt.

Sources

Fernando Irarrázaval: What happened after 2,000 people tried to hack my AI assistant
Simon Willison: What happened after 2,000 people tried to hack my AI assistant
OWASP: AI Agent Security Cheat Sheet

Frequently Asked Questions

Is prompt injection solved?

Should I let agents read email?

What counts as an irreversible action?

What is denial-of-wallet?

🔧 Tools in This Article

Make (Integromat)

OpenClaw

Dify

Related Guides

All guides →

AI Agents

sqlite-utils 4.0rc1: migrations for agent local state

sqlite-utils 4.0rc1 adds migrations and nested transactions. For agent local state, treat those as safety rails before generated code writes to SQLite.

8 min read

AI Agents

Copilot Cowork pricing: the agent-cost signal

Microsoft is moving Copilot Cowork to usage-based billing, while Axios reports DeepSeek V4 or another open model may become a cheaper option. The real story is agent economics.

6 min read

AI Agents

Agent write-permission UX checklist: approvals, unsafe modes, and read-back

A practical checklist for reviewing AI agents that can write to databases, repositories, or real workflows: approvals, permission scope, unsafe modes, audit/read-back, and rollback.

8 min read

#prompt injection#AI agents#agent security#deployment boundaries#HackMyClaw#OWASP