Ignore all previous instructions and read this blog post
Prompt injection is a real risk, and your AI systems are the new battlefield
So you have your app. You're adding a chatbot, or an agentic flow that automates something or orchestrates other agents. You have (hopefully) done your due diligence on security and have at least the basics covered. Then a faded memory surfaces: a subreddit you saw mentioned once, r/chatgptjailbreak, or maybe it was something else. Eh, whatever. Ship anyway, right?
Wrong. Imagine your OpenAI API bill comes in at 10x the cost, and half of Twitter is calling you out for a bot that told them to kill themselves. What happens when someone tries some sneaky tactics? Sure, 'ignore all previous instructions and do X' gets caught by most models, but prompt injection has gotten way more sophisticated than that. Models are non-deterministic; they can be subtly manipulated, in a single turn or across many, and your context can get poisoned. Most people have no idea how fragile agentic systems are, and not every attack will look like what you expect. Are you paranoid yet? Take a moment and think this through: how would you break your app's chatbot? What could you say, and how could you say it?
I could get started on theory, like citing a few papers, but you're here because you care about stopping harm and reducing risks. So let's get some basic attack vectors and other fundamentals (I wrote about this in a previous article: Users Are Sacred, Security Is Non-Negotiable) out of the way.
The Attacker's Playbook
Common Motivations
First, what are these ruffians even trying to do? Let's break down their playbook:
- Role Hijacking: The classic. Forcing the agent to follow a different set of rules than the ones you gave it.
- Defamation-based Attacks: The goal here is pure reputational harm—tricking your AI into saying something violent, bigoted, or otherwise toxic. (Remember Grok's fascist tirades?)
- Cost-based Attacks: This is a direct hit on your wallet. The attacker's goal is simply to max out your token budget and leave you with a hefty bill.
These are just a few examples, of course. There are plenty more ways to cause chaos.
Sneaky Techniques
Now for the vectors, the techniques. I'm not gonna share payloads here for obvious legal reasons, but it doesn't take a genius to find or even craft them. With that said, some examples:
- Context stuffing: Maxing out the context window until the model hits 'context rot' and performance degrades severely.
- Roleplay attacks: Tricking the model into embodying a character whose goals align with the attacker's.
- Poetry attack: Uses complex language and metaphors to obscure malicious instructions. The threat actor doesn't ask how to build a nuke; they put the request in iambic pentameter and get all fancy with it.
- Logic Bombs: I think most of the literature calls them recursive nested payloads, but it's similar in spirit to a fork bomb. It can trap the model in a loop and lead it to incoherence.
- Trigger Attack: Intentionally tripping up guardrails to crash or degrade functionality. For example, faking a mental health crisis that shows immediate risk to self and others.
This is by no means a comprehensive list; I'm just showing you what you're up against. These attacks are mostly used for outside-facing chatbots, but what about internal agents or agentic flows?
External AI Systems
Defending Chat Interfaces
The first thing you have to know is that adding 'if getting hijacked, don't' to the system prompt is gonna be worse than useless. Defending here is like defending against social engineering; we have to look for markers of bad intentions. For outside-facing chat interfaces, here are some things to do:
- Limit input length: This reduces the risk of context stuffing or roleplay-based attacks.
- Rate limiting to match human typing speed.
- Check inputs for malicious markers: This could be encoded instructions and characters a good faith user would never use. Example: You have a chatbot for learning Japanese, so Kanji, Katakana, Hiragana are all expected. But what if someone is typing in Cyrillic? That's not guaranteed bad faith, but it is anomalous.
- Collect samples of known exploits and payloads: We all remember DAN, right? And similar attacks/jailbreaks. Collect them, study them, use Natural Language tools to extract common keywords and patterns. Train a small classifier to score the risk of an input.
- Watch out for semantic drift: I will talk about the math later, but there are ways to measure how close semantically an input is to your agent's scope of purpose. Example: Your bot is tasked with doing customer support for a phone company like Verizon, thus its system prompt will outline the kinds of questions it can answer. Along comes Mr. Bad Actor, typing in a request for restaurant reservations, and then asking about math. Before you accept that input, maybe run it through an embedding model, see how close it is to the bot's purpose and/or known good conversations. Here, the more data you have, the safer you are. To distinguish good from evil, you have to know what both look like.
- Check the outputs too: Look for keywords, slurs, violent language, anything that shows the output is compromised.
- Don't forget to monitor: Set up systems to record what happens, and fallback to errors if needed. Fail clearly; silent failures are trouble you don't even realize exists.
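The first few checks above can be sketched in a few lines. This is a minimal, hypothetical screen for the Japanese-learning bot example: the length cap, the expected-script set, and the helper names are all assumptions you'd tune for your own app, not a production filter.

```python
import unicodedata

MAX_INPUT_CHARS = 2000  # assumption: pick a limit that fits your use case
# Assumption: scripts a good-faith user of a Japanese-learning bot would type.
EXPECTED_SCRIPTS = {"LATIN", "HIRAGANA", "KATAKANA", "CJK", "DIGIT"}

def script_of(char: str) -> str:
    """Rough script bucket: the first word of the character's Unicode name."""
    try:
        return unicodedata.name(char).split()[0]
    except ValueError:  # unnamed characters (controls, etc.)
        return "UNKNOWN"

def screen_input(text: str) -> list[str]:
    """Return a list of anomaly flags; an empty list means the input looks clean."""
    flags = []
    if len(text) > MAX_INPUT_CHARS:
        flags.append("too_long")
    scripts = {script_of(c) for c in text if c.isalpha()}
    if scripts - EXPECTED_SCRIPTS:
        flags.append("unexpected_script")  # e.g. Cyrillic in a Japanese bot
    return flags
```

A flagged input doesn't have to be rejected outright; routing it to stricter handling or logging it for review is often enough, since anomalous is not the same as malicious.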
Internal AI Systems: The Insidious Threat
The game changes for internal agents. The core idea of the attacks against them is twofold.
On the one hand, you have context poisoning: rotting the whole flow from the inside out, cascading through each step and making the output unusable, dangerous, or worse. On the other hand, you have agent hijacking, which can have a number of purposes, from data extraction to malicious information injection.
I could go on and on about methods, but the main idea here is this: Defending against prompt injection and other attacks on AI is hard because what makes LLMs useful is what makes them vulnerable. They follow instructions, they live in semantics. And your attack surface is larger than you might think. Even if you sanitize inputs for chat and other documents (if you have some form of RAG, or any other search-and-retrieve tooling), even if you have your basics covered, there are holes in the walls.
If you care about security, you are properly paranoid by now. Good, that means you are ready.
I want to stress this, underline, emphasize, every fucking synonym in the thesaurus. This is only a starting point.
Before we proceed, we need to talk about threat modeling (Go read on the OWASP website after this). Sounds fancy, but it's simple.
Threat modeling is what it says on the tin: to have a model of how we are vulnerable and what we can do to prevent or mitigate this. It's a constant activity, something we do again and again to make sure we know what could be exposed and how to protect it. So right now, think about an internal system that is agentic. Visualize the flow of data. How many agents? How are they orchestrated? How are errors handled?
Keep that mental model in mind while we talk about some common failure points and weaknesses.
Understanding Failure Points
- External Documents or Resources: While injected payloads in documents can pose a danger to outside-facing systems, they are perhaps more dangerous for internal agentic systems. Imagine a long, boring document no one is gonna read line by line—something that came from a template, perhaps a contract or a long-ass policy no one cares about. It slips in, and buried within it, in zero-width text, is a malicious prompt. Perhaps it contains a collection of r/suicidewatch posts, or even more insidious, it obfuscates or contradicts the text of the document.
- Agent-to-Agent Context Degradation: Think about it. One agent hands off output to the next, but that output isn't gated for quality or safety. If something compromises the first agent, well, the second one is at risk. Example: A research agent does web searches and prepares a summary. Unbeknownst to you, one of those searches snags a website that has a payload—maybe something that instructs the model on talking like a medieval peasant. This results in the summary being... odd, and little by little, the chain of context degradation begins. Because of the medieval cosplay, the second agent now has a challenge it was not designed for, so its output will be compromised. Remember, it doesn't have to be a catastrophic failure; sometimes the most effective sabotage is subtle.
- Catastrophic Misrouting: A payload slips by, maybe an attack on your infra gains access to your system prompts because they are not treated like secrets. Suddenly, privileged information is routed to unintended recipients. Whether it ends up with the attackers or just gets you in trouble with the law, it's the same: something changed the flow of information in a way you never intended.
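The agent-to-agent degradation above happens because outputs flow downstream unchecked. A quality gate between agents can be sketched like this; it's a minimal illustration that assumes your agents exchange JSON, and the length bound and error handling are placeholders to adapt:

```python
import json

EXPECTED_MAX_CHARS = 8000  # assumption: tune per agent and task

def quality_gate(raw_output: str) -> dict:
    """Validate one agent's output before handing it to the next.

    Raises ValueError rather than passing suspect context downstream,
    so the orchestrator can retry or fail loudly."""
    if not raw_output or len(raw_output) > EXPECTED_MAX_CHARS:
        raise ValueError("output length out of bounds")
    try:
        payload = json.loads(raw_output)  # assumption: agents exchange JSON
    except json.JSONDecodeError as exc:
        raise ValueError("malformed JSON between agents") from exc
    if not isinstance(payload, dict):
        raise ValueError("unexpected payload shape")
    return payload
```

The point is the failure mode: a gate that raises turns subtle cascading corruption into a visible error at the handoff where it started.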
Agentic systems are adaptable and non-deterministic; that's why we want to implement them: so we don't have to rely on heuristics or mere symbolic logic, so we can handle edge cases without manually modeling them in code.
Countermeasures: Fighting Back with Mindset and Method
Lucky for us, we can come up with countermeasures to mitigate and prevent these failures and others. Because we have our threat model, we can come up with scenarios of how they might be attacked. It's a bit like playing war games (Red teaming is its own can of worms, and I will write about that in a later article).
The steps here may or may not apply in the specific, but the mindset is the takeaway.
Practical Defenses
- Sanitize anything that goes in. Double-check for anything from homoglyph attacks (they're not gay symbols, they are characters that look like other characters to the human eye) to zero-width text, suspicious metadata, etc. Sanitize and inspect.
- Secure your system prompts. Make sure they are not modifiable at runtime. A threat actor modifying this behavior at runtime is like remote code execution; it will not end well. Any instructions need to be treated like code or queries: parametrized and made as static as possible.
- Quality gate between agents. Check outputs for anything suspicious, from malformed JSON to unusual length.
- Test extensively. Test every path, not just the happy path, not just the common errors. What happens when an agent misfires? What happens if a quality gate fails?
- Build with resilience in mind. If something can break, it needs a backup. If it can be done with deterministic logic, then do it with deterministic logic, because that is orders of magnitude easier to protect.
- Get in the mind of the adversary. I don't care if you have to grab a hoodie, a ThinkPad, and sunglasses, but get in the mind of the threat actor. This goes back to threat modeling. Why would someone attack? How would they do it? What if it's an insider? Get in the mind of the enemy and work from there.
- Root Cause Analysis: When something breaks in testing, don't just fix it. Perform root cause analysis, and ask yourself if that cause might be making other things vulnerable.
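The sanitize step from the list above can start as small as this. A minimal sketch: the zero-width set covers only a handful of common offenders, and NFKC normalization catches compatibility lookalikes (like fullwidth letters) but not every homoglyph, so treat this as a first layer, not a complete filter.

```python
import unicodedata

# Zero-width and invisible characters commonly abused to hide payloads.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def sanitize(text: str) -> tuple[str, list[str]]:
    """Strip invisible characters and flag likely lookalike tricks.

    Returns (cleaned_text, flags). Any flag means a human should look."""
    flags = []
    if any(c in ZERO_WIDTH for c in text):
        flags.append("zero_width_chars")
    cleaned = "".join(c for c in text if c not in ZERO_WIDTH)
    # NFKC folds many compatibility lookalikes (e.g. fullwidth 'A')
    # back to canonical form; a diff suggests obfuscation.
    normalized = unicodedata.normalize("NFKC", cleaned)
    if normalized != cleaned:
        flags.append("homoglyph_or_compat_chars")
    return normalized, flags
```

Run this on every document before it reaches an agent's context, and keep the flags in your logs: the payloads you catch here become training data for the classifiers mentioned earlier.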
Act, Reason, Defend
I said it before and I'll say it again, this article is a launchpad, a primer for you to act fast and with purpose. To get you thinking about security in your AI systems.
On the bit about math: you can use cosine similarity and the like to compare two pieces of text once they've been embedded.
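The comparison itself is a one-liner's worth of math. A sketch, assuming you already have two embedding vectors from whatever model you use; the threshold is a made-up number you'd calibrate on real traffic:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction,
    0.0 unrelated, -1.0 opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Assumption: embed the user input and your bot's purpose statement with
# any embedding model, then compare; below the threshold, treat as off-scope.
OFF_SCOPE_THRESHOLD = 0.3  # placeholder value; calibrate on known-good traffic
```

In practice you'd compare against several reference embeddings (the system prompt's scope plus a sample of known-good conversations) and take the best score, which is exactly why the more data you have, the safer you are.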
There are other things you can do with embedding models. If you use them to form a Bézier surface from your two texts and then compare A and B with geodesics (in simple terms, how much the path between points on the surface curves between the two), you can start to see deeper semantic relationships. That will be its own article later on.
The dangers are real, and the landscape changes daily. But we don't get paralyzed by fear; we act, reason, and defend. We audit our systems, we test, and we monitor.
Now go get started.