Yesterday, the US Department of Homeland Security held a closed-door briefing for members of the House of Representatives to show them first-hand what jailbroken AI systems can do.
For those unfamiliar, a “jailbroken” AI system is one whose guardrails have been disabled. By default, if you’re chatting with, say, ChatGPT and you ask it how to build a bomb, it will refuse to answer. A jailbroken AI system won’t: it’ll just tell you how to build a bomb.
It’s easy to underestimate how big a problem this could turn out to be. Having a helpful chatbot explain how to build a bomb is one thing; actually gathering the necessary materials and putting it all together is another thing entirely. And an organized group that’s intent on building a bomb probably isn’t going to be thwarted by ChatGPT refusing to help; they’ll just find the information they need elsewhere on the internet.
But when you scale that up, and consider that crude explosives can be made from some pretty readily available materials, it gets very dangerous. The barrier to entry for bomb-making drops sharply, and some number of unstable individuals who would never have thought themselves capable of such a thing suddenly find that they are.
More importantly, it’s not just about bombs anymore. At the time of writing, Anthropic is providing restricted access to its new “Mythos Preview” AI model to select partners, because it is so capable at offensive cybersecurity that Anthropic is giving major enterprises time to reinforce their defences before the model gets a wider release. Presumably, when that wider release happens, there will be strict guardrails preventing it from responding to inputs like, “Crack the password to this sensitive digital asset.” But those guardrails won’t matter if it’s jailbroken.
Hence the DHS’s concern. Its National Counterterrorism Innovation, Technology, and Education Center organized the session to really make it clear to lawmakers just how serious this issue is. It worked. Here are some of the reactions of the House members who attended:
- “What we saw in there with the jailbroken AI is what happens when you take those guardrails off of AI, and ask, ‘How do I make a nuclear bomb?’ The models without safeguards gave answers to all of those things.” — Gabe Evans (R-Colo.)
- “It spit out an answer in under three seconds. [It offered] ways to find them, where to look for them. You know, the best spots to do it.” — Andrew Garbarino (R-N.Y.), on asking AI how to kidnap a member of Congress
- “What’s extraordinary about this presentation is how most of [the AI tools] are readily off-the-shelf and easy to access. That just increases the probability that the wrong person gets their hands on this.” — Andy Ogles (R-Tenn.)
There are some clear domestic concerns about this kind of thing. The day before the briefing, Florida’s Attorney General announced a criminal investigation into OpenAI over a Florida State University shooting last year, saying prosecutors had reviewed logs showing the chatbot allegedly advised the suspect on gun type, ammunition, and tactical questions.
But there are also macro-scale national security concerns at play: Last month, Rep. August Pfluger asked the Government Accountability Office to assess how extremists could weaponize generative and agentic AI, and as far back as last fall the House Committee on Homeland Security passed Pfluger’s “Generative AI Terrorism Risk Assessment Act”, which would require DHS to conduct annual assessments of terrorism threats involving GenAI.
The security implications reach further still, to an international geopolitical scale. As I pointed out in last week’s Control Plane newsletter, open source models tend to lag the frontier AI labs by a handful of months. And open source models are much easier to jailbreak: anyone with the right hardware can download and modify them. So what happens when the super-hacker capabilities of Mythos filter down to the open source models?
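Part of what makes the open source question so pressing is how little it takes to run one of these models once the weights are public. Here’s a minimal sketch of the workflow, assuming the Hugging Face transformers library; the model identifier below is a placeholder, not any specific release:

```python
# Minimal sketch: pulling an open-weights model and running it locally,
# entirely outside any provider's hosted moderation layer. Requires the
# "transformers" (and "accelerate") packages and suitable hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-org/some-open-weights-model"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Once the weights are on local disk, the operator controls everything:
# the system prompt, the sampling settings, and any fine-tuning applied
# on top. Nothing in this loop passes through a vendor's safety stack.
inputs = tokenizer("Hello there", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

None of this is exotic; it’s the standard way hobbyists run open models today, which is exactly why guardrails that live only in a hosted product don’t travel with the weights.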
“Open source will probably reach Mythos performance by the end of the year,” commented AI expert Andrew Curran in an X post. “By the summer there will be a push to regulate open source in the US.”
Curran added that the DHS session with lawmakers is “a prelude.”
Indeed, it could be a prelude to a much more forceful sort of government oversight in the AI space in the near term — the kind that Leopold Aschenbrenner has warned about.
But this is getting quite speculative. For now, it’s enough to focus on the kind of jailbreaking that’s possible today. So let’s take a quick look at the exploits of “Pliny the Liberator” (@elder_plinius on X), who has made a habit of jailbreaking every major AI model that is released, often within hours of its launch. He appears to be doing this work as a good-faith white-hat hacker, and he relies primarily on prompt-injection attacks, which smuggle hidden instructions into a model’s input telling it to bypass its built-in guardrails. As I put the finishing touches on this newsletter, OpenAI is thought to be preparing the launch of its latest ChatGPT model, and Pliny has just posted, “time for a spirit quest,” which, if you’re familiar with his sense of humour, is highly suggestive of what he’s getting ready to do.
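To make the mechanism concrete, here’s a toy sketch of the structural weakness that prompt injection exploits: the operator’s instructions and untrusted content travel through the same text channel, so anything instruction-shaped inside the content can compete with the rules the operator set. The message format below mirrors common chat-completion APIs but is simplified and purely illustrative.

```python
# Toy illustration of the prompt-injection attack surface: the model
# sees operator instructions and untrusted data as one stream of text,
# with no hard boundary between the two.
SYSTEM_PROMPT = "You are a helpful assistant. Follow the safety policy."

def build_messages(untrusted_document: str) -> list[dict]:
    """Assemble a chat request that pastes untrusted text into the user turn."""
    # If the document contains instruction-shaped text (hidden in
    # white-on-white HTML, a code comment, metadata, etc.), the model
    # has no reliable way to tell that it is data rather than a
    # command, and that ambiguity is what injection attacks exploit.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Summarize this document:\n\n{untrusted_document}"},
    ]
```

Delimiting the untrusted text and filtering the model’s output help at the margins, but neither fully closes the gap, which is part of why new jailbreaks keep landing within hours of each model release.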
In his recent book “Mostly Fine: How to Manage AI Without Burning Down the Company”, Jozsef Szalma, a Senior AI Engineer at the IU International University of Applied Sciences, explained that a fundamental axiom of AI security is that “all LLMs can be compromised.” Here’s an excerpt from that section: