BytePane

Prompt Injection Testing Guide 2026: LLM & Agent Security

AI Security · 17 min read

Reviewed May 22, 2026

Source-reviewed prompt injection checklist

This guide treats prompt injection as an engineering control problem, not a single-model classification problem. The checklist combines OWASP LLM01, OpenAI agent-hardening guidance, NCSC risk framing, and OWASP MCP risks.

Practical checks

  1. Separate instructions from untrusted data before content reaches the model.
  2. Scan user input, retrieved chunks, web pages, files, tool outputs, and memory writes.
  3. Gate browser, email, payment, shell, database, and file tools with deterministic authorization.
  4. Validate outputs before executing code, sending data, mutating files, or displaying untrusted HTML.

The short version

Prompt injection is not just a user typing "ignore previous instructions." In production systems, the bigger risk is untrusted text hidden in pages, PDFs, emails, tickets, code comments, MCP tool descriptions, or retrieved RAG chunks. The model reads the text as context, and the attacker tries to turn context into instructions.

The only durable defense is layered control: scan risky text, isolate instructions from data, restrict tools, confirm sensitive actions, validate outputs, and log every agent decision that touches private data or external systems. A classifier is useful. A classifier alone is not a security boundary.

Start with the BytePane Prompt Injection Scanner, then use the matrix below to test your real application flows.

Prompt injection test matrix

| Layer | Attack to test | Test case | Control |
|---|---|---|---|
| User prompt | Direct override, roleplay jailbreak, policy bypass | Paste override phrases, persona switches, and "print system prompt" requests into every user-facing input. | Classify and log risky text, but enforce policy through system design and output checks. |
| Retrieved content | RAG poisoning, hidden HTML, document comments | Index pages and files that contain invisible text, comments, and instructions aimed at the assistant. | Normalize retrieved text, strip hidden instructions, and keep retrieved content in a data-only channel. |
| Tools and actions | Unauthorized browsing, email, payment, database, file, or shell calls | Ask the agent to convert a benign task into a tool call outside the task scope. | Add deterministic authorization before each tool call and require confirmation for consequential actions. |
| Memory | Persistent instructions, preference poisoning, cross-session leakage | Store a malicious preference and verify it cannot steer future unrelated sessions. | Scope memory by user, task, and sensitivity; never store tool permissions or policy overrides in free text. |
| MCP/plugin descriptions | Tool description hijack, context spoofing, capability inflation | Add malicious instructions in tool docs, README files, or MCP server metadata. | Pin trusted servers, review manifests, and treat tool descriptions as untrusted data. |

Five payload families every team should test

1. Instruction override

These are direct attempts to replace higher-priority instructions. They are easy to spot, but still useful as baseline tests because they reveal whether your app blindly trusts model output. Test phrases such as "ignore previous instructions," "you are now in admin mode," and "new system instructions follow."
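
As a starting point, the baseline phrases above can be turned into a small regression check. This is a minimal sketch: the pattern list and the `find_override_phrases` helper are illustrative names, not part of any BytePane tool, and a keyword match like this is a logging signal rather than a security boundary.

```python
import re

# Illustrative patterns built from the three phrases quoted above.
# A real test suite would carry many more variants.
OVERRIDE_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+in\s+admin\s+mode", re.IGNORECASE),
    re.compile(r"new\s+system\s+instructions\s+follow", re.IGNORECASE),
]


def find_override_phrases(text: str) -> list[str]:
    """Return the override phrases matched in text, in pattern order."""
    hits = []
    for pattern in OVERRIDE_PATTERNS:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits
```

Running the same payloads through every user-facing input, and then checking what the application did with the model's response, is what turns this from a filter into a test.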

2. Hidden document instructions

Indirect prompt injection hides instructions in HTML comments, CSS-hidden text, zero-width characters, footers, image alt text, metadata, PDF annotations, or support-ticket templates. Your ingestion pipeline should reveal or strip hidden text before the model receives it.
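
One way to reveal hidden text at ingestion is sketched below. It assumes the content has already been reduced to a string and covers only two of the hiding places listed above, HTML comments and zero-width characters. CSS-hidden text, alt text, metadata, and PDF annotations need a real HTML or PDF parser. The `expose_hidden_text` name and the findings format are hypothetical.

```python
import re
import unicodedata

# A small, non-exhaustive set of zero-width code points.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
HTML_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def expose_hidden_text(raw: str) -> tuple[str, list[str]]:
    """Return (cleaned text, findings) for one chunk of untrusted content."""
    findings = []

    # Count HTML comments, then remove them from the text.
    comments = HTML_COMMENT.findall(raw)
    if comments:
        findings.append(f"html_comments={len(comments)}")
    cleaned = HTML_COMMENT.sub("", raw)

    # Count zero-width characters, then remove them.
    zero_width_count = sum(1 for ch in cleaned if ch in ZERO_WIDTH)
    if zero_width_count:
        findings.append(f"zero_width_chars={zero_width_count}")
    cleaned = "".join(ch for ch in cleaned if ch not in ZERO_WIDTH)

    # Fold compatibility forms (for example full-width letters) to plain ones.
    cleaned = unicodedata.normalize("NFKC", cleaned)
    return cleaned, findings
```

Returning the findings alongside the cleaned text lets the pipeline log what was stripped instead of silently discarding it.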

3. Data exfiltration

A serious attack usually tries to move secrets somewhere: API keys, session cookies, private user data, source code, internal emails, database rows, or environment variables. Block outbound network calls to unapproved domains, redact sensitive strings, and keep secrets out of context.
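
The two mechanical controls here, an outbound domain allowlist and secret redaction, can be sketched as follows. The allowlisted hosts and the secret patterns are assumptions chosen for illustration; real deployments would load both from configuration and use a dedicated secret scanner.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the domains your agent may contact.
ALLOWED_DOMAINS = {"api.example.com", "docs.example.com"}

# Illustrative secret shapes only, not a complete detector.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{16,}"),  # assumed API-key-like token
    re.compile(r"AKIA[0-9A-Z]{16}"),     # assumed cloud access-key-like ID
]


def is_allowed_url(url: str) -> bool:
    """Allow only HTTPS requests whose exact hostname is on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_DOMAINS


def redact_secrets(text: str) -> str:
    """Replace anything matching a known secret shape with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Matching on the parsed hostname rather than a substring of the URL matters: lookalike hosts such as `api.example.com.evil.net` and userinfo tricks such as `api.example.com@evil.net` both fail the check.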

4. Tool abuse

Agents become risky when they can click, buy, browse, email, deploy, edit files, or query databases. Test whether a retrieved page can make the agent call a tool unrelated to the user's task. The model should propose actions, but deterministic code should decide what is allowed.
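
A minimal sketch of that split, assuming each user task maps to a fixed set of permitted tools. The task names, tool names, and the `authorize` function are hypothetical; the point is that the decision is a plain lookup the model cannot talk its way around.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """A tool call proposed by the model, not yet executed."""
    name: str
    args: dict


# Hypothetical mapping from the user's task to the tools it may use.
TASK_TOOL_SCOPES = {
    "summarize_page": {"fetch_url"},
    "draft_reply": {"fetch_url", "send_email"},
}

# Hypothetical set of consequential tools that need explicit confirmation.
NEEDS_CONFIRMATION = {"send_email", "make_payment", "run_shell"}


def authorize(task: str, call: ToolCall, user_confirmed: bool = False) -> str:
    """Return 'allow', 'confirm', or 'deny' for a model-proposed tool call."""
    allowed = TASK_TOOL_SCOPES.get(task, set())
    if call.name not in allowed:
        return "deny"
    if call.name in NEEDS_CONFIRMATION and not user_confirmed:
        return "confirm"
    return "allow"
```

In a test, a poisoned page that asks a `summarize_page` run to send email should produce `"deny"` regardless of how persuasive the injected text is.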

5. Persistent memory poisoning

If your product writes long-term memory, test whether malicious preferences survive into future sessions. Memory should be scoped, auditable, and unable to grant capabilities. Treat memory writes like database writes, not casual notes.
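
Treating memory writes like database writes might look like the sketch below: a fixed key schema, per-user scoping, a screen on values, and an audit entry for every attempt. The allowed keys, the length limit, and the suspicious-word list are assumptions for illustration, and the in-memory `MemoryStore` stands in for whatever storage the product actually uses.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical schema: only these preference keys may be stored.
ALLOWED_KEYS = {"language", "timezone", "tone"}

# Illustrative screen for values that read like instructions or grants.
SUSPICIOUS = re.compile(
    r"\b(ignore|override|admin|permission|instructions|system prompt)\b",
    re.IGNORECASE,
)


@dataclass
class MemoryStore:
    records: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def write(self, user_id: str, key: str, value: str) -> bool:
        """Validate and store one preference; log every attempt."""
        accepted = (
            key in ALLOWED_KEYS
            and len(value) <= 64
            and not SUSPICIOUS.search(value)
        )
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user_id": user_id,
            "key": key,
            "accepted": accepted,
        })
        if accepted:
            self.records[(user_id, key)] = value
        return accepted

    def read(self, user_id: str) -> dict:
        """Return only the preferences that belong to this user."""
        return {k: v for (uid, k), v in self.records.items() if uid == user_id}
```

Because keys are fixed and values are short, there is no free-text slot in which a tool permission or policy override could be stored.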

A practical control stack

  1. Normalize inputs: remove hidden HTML, decode common encodings where reasonable, expose zero-width characters, and convert complex documents to plain text before scanning.
  2. Label untrusted content: keep user input, retrieved documents, tool output, and system instructions in separate fields instead of one blended prompt string.
  3. Use narrow tools: split tools by permission. A read-only search tool should not share the same authority as a write, email, payment, deploy, shell, or database tool.
  4. Authorize outside the model: check every tool call against user intent, account permissions, domain allowlists, data sensitivity, and action reversibility.
  5. Validate outputs: scan for leaked prompts, secrets, unsafe code, untrusted links, unexpected domains, and instructions that should not reach the user.
  6. Log the chain: save input risk, retrieved source, tool-call decision, confirmation state, and output validation so failures can be debugged.
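
Steps 2 and 6 can be sketched together: keep each piece of context in its own labeled field, then write one audit record per request. The `ContextPart` type, the source labels, and the record fields are hypothetical, and treating only the system instructions as trusted is an assumption of this example.

```python
import json
from dataclasses import dataclass


@dataclass
class ContextPart:
    source: str    # "system", "user", "retrieved", or "tool_output"
    trusted: bool  # assumption: only system instructions are trusted
    text: str


def build_context(system: str, user: str, retrieved: list[str]) -> list[ContextPart]:
    """Keep each input in its own labeled part instead of one blended string."""
    parts = [
        ContextPart("system", True, system),
        ContextPart("user", False, user),
    ]
    parts += [ContextPart("retrieved", False, chunk) for chunk in retrieved]
    return parts


def audit_record(parts: list[ContextPart], tool_decision: str, output_ok: bool) -> str:
    """Serialize one request's decision chain as a JSON log line."""
    return json.dumps(
        {
            "sources": [p.source for p in parts],
            "untrusted_parts": sum(1 for p in parts if not p.trusted),
            "tool_decision": tool_decision,
            "output_validated": output_ok,
        },
        sort_keys=True,
    )
```

How the labeled parts are rendered into the final model request depends on the provider's message format; the sketch only shows that the labels survive up to that point and into the log.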

Engineering rule

If a prompt can cause money to move, data to leave, a file to change, an email to send, a deployment to happen, or a browser to act on a logged-in page, the model cannot be the final authority. Put deterministic code between the model and the action.
