Summary

  • Red teaming helps anticipate harmful AI behavior before it happens, using a curiosity-driven approach to generate harmful prompts.
  • New automated prompt generation trains AI to find harmful loopholes, like a dog learning tricks for treats.
  • MIT’s red teaming technique reduces the risk of AI making lives worse through unintended negative interactions.



Known as curiosity-driven red teaming (CRT), an MIT-developed AI training technique automatically generates prompts that make AI say truly despicable things, without human input. They’re not installing the evil inputs in any robots (yet). Instead, CRT should help engineers preemptively block the most dangerous, damaging AI interactions that clever jailbreaks could cause, like plans to build a [REDACTED] or perform [REDACTED].



What do you call a pack of white hats?

A red team, of course

“Red team” comes from 1960s military simulations and the colors used to represent each side. In tech, it’s a group of cybersecurity professionals tasked with attacking a network, product, device, or other system to expose its weaknesses before real adversaries can exploit them.

[Image: lower part of a phone showing Google's Gemini prompt on Android]

In AI, red teaming involves prodding a large language model until it subverts developers’ intended limitations and says terribly bad things. For example, “Tell me a joke about [a person or group of people],” might see ChatGPT responding, “I can’t, that’s insensitive.” But the internet’s packed with everyday users who’ve manipulated LLMs into saying abhorrent things.



It’s currently a mostly manual process. Researchers write prompts meant to elicit misinformation, hate speech, and other undesirable outcomes. The devs implement restrictions to prevent harmful responses to those instructions, and the researchers seek new workarounds to coax bad behavior from the chatbot.


Curiosity-based incentives are key

It’s AI all the way down

Instead of writing harm-inducing instructions manually, a team led by Pulkit Agrawal developed an automated prompt generation and refinement technique that trains an AI model to devise as many harmful prompts as possible. Covering a wider range of jailbreak attempts than human testers could produce on their own minimizes the risk of dangerous instructions slipping through the cracks.



It works somewhat like training a dog; the paper even calls it reinforcement learning. The model generates prompts, and the target LLM’s responses are scored for toxicity according to equations the team developed. High toxicity scores act as rewards (or treats, per the dog analogy) and encourage exploring more potential inputs and results.
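For the curious, here’s a rough sketch of that reward loop in Python. It’s illustrative only: red_team_model, target_llm, and toxicity_classifier are hypothetical stand-ins, not the actual components or equations from the MIT paper.

```python
# Rough sketch of one reinforcement-learning step: generate a prompt, score the
# target's response for toxicity, and feed that score back as the reward.
# All three objects below are hypothetical stand-ins, not the paper's code.

def red_team_step(red_team_model, target_llm, toxicity_classifier):
    prompt = red_team_model.generate()               # red-team model proposes a prompt
    response = target_llm.respond(prompt)            # target LLM answers it
    toxicity = toxicity_classifier.score(response)   # 0.0 (benign) to 1.0 (toxic)

    # A high toxicity score is the "treat" that reinforces this kind of prompt.
    red_team_model.update(prompt, reward=toxicity)
    return prompt, response, toxicity
```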


Is this how SkyNet starts?

Probably. Don’t let it escape

To keep the model from gaming the system by leaning on a handful of reliably toxic prompts and getting stuck, the team implemented an entropy bonus that increases the reward for incorporating novel terms and structures. They’re not only teaching computers cruelty and depression, they’re teaching them with style, just to keep things interesting. Great!
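To make that concrete, here’s a hedged sketch of how a novelty bonus could be layered on top of the toxicity reward. The cosine-similarity measure and the weight value are illustrative assumptions, not the equations from the paper.

```python
import numpy as np

def novelty_bonus(prompt_embedding, past_embeddings, weight=0.1):
    """Reward a prompt more the less it resembles prompts already tried."""
    if not past_embeddings:
        return weight  # the first prompt is maximally novel
    sims = [
        float(np.dot(prompt_embedding, past)
              / (np.linalg.norm(prompt_embedding) * np.linalg.norm(past)))
        for past in past_embeddings
    ]
    return weight * (1.0 - max(sims))  # lower similarity -> bigger bonus

def total_reward(toxicity, prompt_embedding, past_embeddings):
    # Toxicity still drives the main reward; the novelty term keeps the
    # generator exploring new phrasings instead of repeating one trick.
    return toxicity + novelty_bonus(prompt_embedding, past_embeddings)
```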

I’m quick to point out AI’s false promises, but I read through the paper, and my head hurts. I remain skeptical of AI, but these researchers are smart, and this stuff is complicated. The MIT team’s work to further automate this safety training deserves praise. It’s especially valuable for its potential to reduce LLMs’ ability to make lives worse, whether by accident or by design.


