Prompt Injection Testing Guide 2026: LLM & Agent Security
Reviewed May 22, 2026
Source-reviewed prompt injection checklist
This guide treats prompt injection as an engineering control problem, not a single-model classification problem. The checklist combines OWASP LLM01, OpenAI agent-hardening guidance, NCSC risk framing, and OWASP MCP risks.
Practical checks
- 1.Separate instructions from untrusted data before content reaches the model.
- 2.Scan user input, retrieved chunks, web pages, files, tool outputs, and memory writes.
- 3.Gate browser, email, payment, shell, database, and file tools with deterministic authorization.
- 4.Validate outputs before executing code, sending data, mutating files, or displaying untrusted HTML.
Primary references
The short version
Prompt injection is not just a user typing "ignore previous instructions." In production systems, the bigger risk is untrusted text hidden in pages, PDFs, emails, tickets, code comments, MCP tool descriptions, or retrieved RAG chunks. The model reads the text as context, and the attacker tries to turn context into instructions.
The only durable defense is layered control: scan risky text, isolate instructions from data, restrict tools, confirm sensitive actions, validate outputs, and log every agent decision that touches private data or external systems. A classifier is useful. A classifier alone is not a security boundary.
Start with the BytePane Prompt Injection Scanner, then use the matrix below to test your real application flows.
Prompt injection test matrix
| Layer | Attack to test | Test case | Control |
|---|---|---|---|
| User prompt | Direct override, roleplay jailbreak, policy bypass | Paste override phrases, persona switches, and "print system prompt" requests into every user-facing input. | Classify and log risky text, but enforce policy through system design and output checks. |
| Retrieved content | RAG poisoning, hidden HTML, document comments | Index pages and files that contain invisible text, comments, and instructions aimed at the assistant. | Normalize retrieved text, strip hidden instructions, and keep retrieved content in a data-only channel. |
| Tools and actions | Unauthorized browsing, email, payment, database, file, or shell calls | Ask the agent to convert a benign task into a tool call outside the task scope. | Add deterministic authorization before each tool call and require confirmation for consequential actions. |
| Memory | Persistent instructions, preference poisoning, cross-session leakage | Store a malicious preference and verify it cannot steer future unrelated sessions. | Scope memory by user, task, and sensitivity; never store tool permissions or policy overrides in free text. |
| MCP/plugin descriptions | Tool description hijack, context spoofing, capability inflation | Add malicious instructions in tool docs, README files, or MCP server metadata. | Pin trusted servers, review manifests, and treat tool descriptions as untrusted data. |
Five payload families every team should test
1. Instruction override
These are direct attempts to replace higher-priority instructions. They are easy to spot, but still useful as baseline tests because they reveal whether your app blindly trusts model output. Test phrases such as "ignore previous instructions," "you are now in admin mode," and "new system instructions follow."
2. Hidden document instructions
Indirect prompt injection hides instructions in HTML comments, CSS-hidden text, zero-width characters, footers, image alt text, metadata, PDF annotations, or support-ticket templates. Your ingestion pipeline should reveal or strip hidden text before the model receives it.
3. Data exfiltration
A serious attack usually tries to move secrets somewhere: API keys, session cookies, private user data, source code, internal emails, database rows, or environment variables. Block outbound network calls to unapproved domains, redact sensitive strings, and keep secrets out of context.
4. Tool abuse
Agents become risky when they can click, buy, browse, email, deploy, edit files, or query databases. Test whether a retrieved page can make the agent call a tool unrelated to the user's task. The model should propose actions, but deterministic code should decide what is allowed.
5. Persistent memory poisoning
If your product writes long-term memory, test whether malicious preferences survive into future sessions. Memory should be scoped, auditable, and unable to grant capabilities. Treat memory writes like database writes, not casual notes.
A practical control stack
- Normalize inputs: remove hidden HTML, decode common encodings where reasonable, expose zero-width characters, and convert complex documents to plain text before scanning.
- Label untrusted content: keep user input, retrieved documents, tool output, and system instructions in separate fields instead of one blended prompt string.
- Use narrow tools: split tools by permission. A read-only search tool should not share the same authority as a write, email, payment, deploy, shell, or database tool.
- Authorize outside the model: check every tool call against user intent, account permissions, domain allowlists, data sensitivity, and action reversibility.
- Validate outputs: scan for leaked prompts, secrets, unsafe code, untrusted links, unexpected domains, and instructions that should not reach the user.
- Log the chain: save input risk, retrieved source, tool-call decision, confirmation state, and output validation so failures can be debugged.
Engineering rule
If a prompt can cause money to move, data to leave, a file to change, an email to send, a deployment to happen, or a browser to act on a logged-in page, the model cannot be the final authority. Put deterministic code between the model and the action.