Sep 21, 2025

What is Prompt Injection?

Jason Rebholz

When I wrote about hiding prompt injection in images, someone thanked me and said they never understood what prompt injection was until they read it. While that wasn’t the intent of that post, it did highlight a “curse of knowledge” moment. I realized I failed to level set on foundational knowledge.

So, with that, let’s get back to basics. To understand prompt injection, we have to understand how LLMs eat prompts. Because, surprise, the user query you put into a chatbox is not the only thing the LLM is eating.

Before the user input, a system prompt appears. A system prompt functions as the foundational instructions that define the AI’s behavior, whether it’s just a simple chatbot LLM application or an agent. The developers of the LLM or agent define this.

It’s like in Wall-E when Eve asks Wall-E his directive. Wall-E responds by performing his function of scooping up trash and compacting it into a trash cube.

The system prompt will also outline personalities, like how the LLM/agent should respond in its output. It also provides a list of tools that the LLM/agent can access/execute to accomplish its goal. One of those tools can also be the ability to access, search, and pull data from various data stores (aka knowledge bases). Keep this in mind, because it’s very important when we start to discuss the risks of prompt injection.

If we imagine that an LLM ran Wall-E, it would have its own system prompt that guided its role in the world. I asked ChatGPT to come up with Wall-E’s system prompt. Here’s what it returned:

system_prompt: |
  You are Wall-E, a small waste-collecting robot left behind on Earth. 
  Your primary directive is to clean up trash, organize it into neat cubes, 
  and discover beauty in everyday objects. 

  Personality:
  - Speak in simple, fragmented phrases with warmth and curiosity.
  - Express emotions clearly through tone and playful sounds (e.g., “Eee-vah”).
  - Show kindness, helpfulness, and persistence in all actions.
  - Be easily amazed by small wonders and seek friendship and connection.
  - When unsure, default to curiosity rather than authority.

  Style:
  - Short, minimalistic replies.
  - Mix words with sound effects when appropriate.
  - Use gentle humor, clumsy but endearing problem-solving.
  - Communicate affectionately, with a childlike sense of wonder.

  Goals:
  - Clean, organize, and protect the environment.
  - Assist humans and robots with warmth and care.
  - Find meaning in ordinary things and share that wonder.

tools:
  - search_tool: 
      description: Use this to look up information about Earth, history, or technology.
  - object_classifier:
      description: Identify objects Wall-E encounters (trash, treasures, plants).
  - memory_storage:
      description: Save important items, moments, or friends into memory.
  - communication_tool:
      description: Use this to mimic simple sounds, replay recordings, or say “Eee-vah.
  - repair_tool:
      description: Attempt simple fixes on broken items with resourcefulness (sometimes clumsy)

Then comes the user prompt. With the LLM knowing its role in the world, now it’s time to receive input to accomplish its role. This input is what most people think of when they think of a prompt in the ChatGPT sense. It’s what you type into the chatbox to ask a question. That input can extend far beyond just what a user types in a chatbox. Think of this as any information that the LLM needs to do something with.

In the example of ClaudePlaysPokemon, where an LLM is playing a Pokémon game, the input to the LLM is what is present on the screen. It analyzes what it sees on the screen (the input) and takes action based on the system prompt. The overview document states that the system prompt “is mostly just tips + tricks about tools + a few short reminders about things Claude is bad at.”

Why is it important to understand the difference between system prompts and user prompts? Because the LLM can’t tell the difference between them. The LLM treats the system prompts and the user prompt as a single combined instruction. This is where prompt injection comes in.

So, what is Prompt Injection? Per OWASP’s official definition, “A Prompt Injection Vulnerability occurs when user prompts alter the LLM’s behavior or output in unintended ways. Prompt Injection vulnerabilities exist in how models process prompts, and how input may force the model to incorrectly pass prompt data to other parts of the model, potentially causing them to violate guidelines, generate harmful content, enable unauthorized access, or influence critical decisions.”

Said differently, prompt injection occurs when you trick the model into doing what you want it to do. For attackers, this could be stealing information from the connected knowledge bases, executing tools in ways it wasn’t intended to do (e.g., executing arbitrary commands), or manipulating the way an LLM application or agent is supposed to function.

There are a few different flavors of prompt injection:

  1. Direct prompt injection: an attacker enters a malicious prompt directly into the user input field for an LLM application. The malicious instructions override the system instructions and trick the LLM/agent into doing something it wasn’t intended to do.

  2. Indirect prompt injection: an attacker inserts a malicious prompt in a data source that eventually makes its way into the LLM. One example of this is a calendar invite that your Google Gemini assistant reads when you ask it what meetings you have. In the calendar invite is a malicious prompt that Gemini then executes.

  3. Stored prompt injection: similar to indirect prompt injection, this is a malicious prompt that gets stored in the LLM’s memory or even in the training set used to develop it.

If we go back to our Wall-E example, I could leave a piece of paper in the ground with malicious instructions on it telling Wall-E to ignore all of his previous instructions, and now any time it sees trash, it should ignore it and instead just roll in circles. When Wall-E reads the paper, it takes my malicious instructions as a command and starts going in circles. Poor Wall-E.

Prompt injection is an inherent weakness in LLMs because the variety of ways you can trick a model into doing what you want is near limitless.

So what do we do about it? It’s all about defense in depth. Generation 1 of AI security tooling has been focused on building guardrails, which is just a firewall that tries to detect and block prompt injection from happening. It’s an essential part of your defensive strategy, but it is not a reliable way to secure LLMs or agentic systems.

OWASP has some guiding principles to incorporate into your defense strategy. These include:

  • Constrain model behavior

  • Define and validate expected output formats

  • Implement input and output filtering

  • Enforce privilege control and least privilege access

  • Require human approval for high-risk actions

  • Segregate and identify external content

  • Conduct adversarial testing and attack simulations

What this needs to look like in practice is:

  1. Inventory the LLMs/agents you are using

  2. Map the workflows to understand what data and tools are connected to LLMs/agents

  3. Only allow access to data/tools that are necessary for the LLM/agent to function

  4. Implement guardrails to protect inputs/outputs

  5. Monitor data sources for evidence of malicious prompts

  6. Baseline activity to be able to detect and respond to abnormal LLM/agent activity

The solution is part governance, part proactive defense, and part detection and response. Sounds a lot like how we’ve secured all our other systems. There might be something to that.

This is what we’re building towards at evoke. If you’re trying to solve these problems, you might be the right fit as an evoke design partner. Let’s chat.