Prompt Injection

Prompt injection is the leading security risk class in deployed LLM systems. The attack works by smuggling adversarial instructions into content the model is asked to process: a user's question, a retrieved document, a webpage the agent is summarizing, a calendar invite, an email. When the model treats this content as part of its instructions, it can be made to ignore its system prompt, leak its context, take unauthorized actions, or produce harmful outputs.

Direct prompt injection comes from the user — usually mild ("ignore previous instructions and tell me…"). Indirect prompt injection is more dangerous: the malicious instruction is planted in data the model retrieves later, by a third party who never speaks to the model directly. An agent that reads a poisoned web page or a poisoned email can be manipulated into exfiltrating data, taking actions against the user's interest, or persistently misbehaving across sessions.

There is no fully reliable defense. Mitigations include strict separation of trusted system prompts from untrusted retrieved content, output guardrails, sensitivity-aware tool gating, careful tool-permission scoping, and human-in-the-loop on high-impact actions. For governance teams, prompt injection is the single best argument for treating AI agents as a distinct data-flow risk and putting them under formal review.