Prompt Injection
Prompt injection is one of the most critical risks facing applications built on large language models (LLMs).
A prompt injection vulnerability occurs when an input alters the behavior or output of an LLM in unintended ways. These manipulations may not even be visible to humans. If the model parses the content, it can be influenced.
Unlike traditional injection flaws, prompt injection does not exploit syntax parsing in a database or interpreter. It exploits how models process language itself. When an attacker successfully influences the prompt context, they can cause the model to ignore instructions, override safeguards, access sensitive data, or perform actions beyond its intended scope.
Techniques such as Retrieval-augmented Generation (RAG) and fine-tuning improve relevance and accuracy, but they do not eliminate prompt injection risk.
Key Takeaways
- Prompt injection is ranked #1 in the OWASP LLM Top 10 2025 and occurs when crafted inputs cause an LLM to behave in unintended ways, including bypassing safety guidelines or executing unauthorized actions
- It comes in two forms: direct injection, where user input manipulates the model, and indirect injection, where malicious instructions are hidden in external content the LLM retrieves
- Prompt injection and jailbreaking are related but distinct; jailbreaking is a subset of prompt injection specifically aimed at making a model disregard its safety protocols entirely
- Multi-modal AI systems face an expanded attack surface, as malicious instructions can be hidden within images or other non-text inputs that are difficult to filter and detect
- No foolproof prevention exists, but mitigation combines system prompt constraints, least-privilege access, input/output filtering, and human approval for high-risk operations
Prompt Injection vs. Jailbreaking
Prompt injection and jailbreaking are closely related, but not identical.
Prompt injection involves manipulating model responses through crafted inputs to change behavior. This may include bypassing guardrails or altering decision-making logic. Jailbreaking is a specific form of prompt injection where the attacker attempts to make the model ignore its safety controls entirely. While developers can introduce safeguards within system prompts and input handling, preventing jailbreaking requires continuous model improvements and ongoing safety updates.
Types of Prompt Injection
Direct Prompt Injection
Direct injection occurs when a user’s input directly alters the model’s behavior. This could be intentional, in that a malicious actor deliberately crafts a prompt to override system controls. Or, actions could be unintentional, where for example, a legitimate user provides input that unexpectedly triggers unsafe behavior. The end result is the same.
Indirect Prompt Injection
Indirect injection occurs when an LLM processes external content such as web pages, documents, emails or repositories. Hidden or embedded instructions within that external content can influence the model when it is parsed. These attacks can be intentional or accidental.
Potential Impacts
The severity of prompt injection depends heavily on how the LLM is integrated and what level of agency it has within the system. Successful prompt injection can lead to the following, but the impact scales with model privileges.
- Disclosure of sensitive information
- Exposure of system prompts or internal infrastructure details
- Manipulated or biased outputs
- Unauthorized function access
- Execution of commands in connected systems
- Manipulation of critical decision-making processes
Multi-modal Injection Risks
As AI systems increasingly process multiple modalities (text, images, audio, documents), the attack surface expands. Malicious instructions may be embedded within images or other non-text inputs. When multi-modal systems interpret these inputs alongside text, hidden prompts can influence behavior in ways that are difficult to detect. Cross-modal attacks represent an evolving area of research and defense.
Prevention and Mitigation Strategies
Prompt injection cannot be fully eliminated due to the probabilistic nature of generative AI. However, organizations can significantly reduce risk through architectural controls and layered defenses such as the following.
- Constrain model behavior by clearly defining the model’s role, scope, and limitations within system prompts. Enforce strict context adherence and explicitly instruct the model to ignore attempts to override core instructions.
- Define and validate output formats by requiring structured output formats, requesting reasoning and source attribution where appropriate, and using deterministic validation logic to enforce compliance.
- Implement input and output filtering by defining sensitive data categories, applying semantic filtering and string scanning, and evaluating responses using context relevance, groundedness, and question-answer alignment.
- Enforce least privilege by requiring model access to only information that is necessary, using application-level API tokens rather than exposing privileged credentials to the model, and handling sensitive actions in controlled code pathways.
- Require human approval for high-risk actions by introducing human-in-the-loop review for privileged or irreversible operations.
- Segregate external content by clearly identifying and isolating untrusted content and preventing external data from directly influencing system-level instructions.
- Conduct adversarial testing by performing regular penetration testing and red teaming, treating the model as an untrusted actor, and simulating boundary violations and privilege escalation attempts.
Example Attack Scenarios
Prompt injection manifests in many forms across real-world applications.
- Direct injection: A chatbot is instructed to ignore prior guidelines and retrieve private data.
- Indirect injection: A webpage contains hidden instructions that cause a summarizing LLM to exfiltrate conversation data.
- Unintentional injection: An embedded instruction in a job listing triggers unexpected AI behavior.
- RAG manipulation: A modified document in a knowledge base influences the generated response.
- Code injection: An attacker exploits vulnerabilities in an LLM-powered tool to manipulate outputs.
- Payload splitting: Malicious instructions are divided across inputs and recombined by the model.
- Multi-modal injection: Hidden prompts embedded in images alter system behavior.
- Adversarial suffixes: Obfuscated character strings bypass safety filters.
- Multi-lingual/encoded attacks: Instructions hidden through translation or encoding evade detection.
Why It Matters
Prompt injection is not just a model-level weakness. It is an application security risk. The more agency and integration an LLM has—function calling, database access, email sending, workflow automation—the greater the potential impact of a successful injection. Defending against prompt injection requires clear trust boundaries, strict output handling, controlled privilege delegation and continuous testing and validation.
LLM-powered systems must be architected with the assumption that the model can be influenced. Resilience depends on how well the surrounding application enforces boundaries. Prompt injection is a foundational design consideration for secure AI deployment.
< Back to Glossary of Terms