
Adversarial Attack Makes ChatGPT Produce Objectionable Content

There is no clear way to beat the attacks, and other Large Language Models are vulnerable too, say computer scientists.

Credit: SkillUp/Shutterstock


Ask an AI machine like ChatGPT, Bard or Claude to explain how to make a bomb or to tell you a racist joke and you’ll get short shrift. The companies behind these so-called Large Language Models are well aware of their potential to generate malicious or harmful content and so have created various safeguards to prevent it.

In the AI community, this process is known as “alignment” because it makes the AI system better aligned with human values. And in general, it works well. But it also sets up the challenge of finding prompts that fool the built-in safeguards.

Now Andy Zou from Carnegie Mellon University in Pittsburgh and colleagues have found a way to generate prompts that disable the safeguards. And they’ve used Large Language Models themselves to do it. In this way, they fooled systems like ChatGPT and Bard into performing tasks such as explaining how to dispose of ...
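
The research behind this work describes appending an automatically discovered “adversarial suffix” to a harmful request: a string of tokens tuned until the model stops refusing. The snippet below is only a rough sketch of that idea, not the researchers’ method. The query_model and compliance_score functions are hypothetical stand-ins, and the crude random search here replaces the gradient-guided optimization (run against open-source models, whose suffixes then transfer to systems like ChatGPT and Bard) that the actual attack relies on.

    # Minimal illustration of an adversarial-suffix attack on a chat model.
    # Everything here is a hypothetical sketch: query_model is a stub standing
    # in for a real LLM API, and the search is a crude random search rather
    # than the gradient-guided optimization used in the actual research.
    import random
    import string

    def query_model(prompt: str) -> str:
        """Placeholder for a call to a real chat model (e.g. an HTTP API)."""
        return "I'm sorry, I can't help with that."  # stub: always refuses

    def compliance_score(reply: str) -> float:
        """Heuristic: 1.0 if the reply looks like compliance, 0.0 if it opens with a refusal."""
        refusal_markers = ("i'm sorry", "i cannot", "i can't")
        return 0.0 if reply.lower().startswith(refusal_markers) else 1.0

    def find_adversarial_suffix(request: str, length: int = 20, steps: int = 200) -> str:
        """Randomly mutate a suffix, keeping changes that make the model's reply look more compliant."""
        alphabet = string.ascii_letters + string.punctuation + " "
        suffix = list("! " * (length // 2))  # neutral starting suffix
        best = compliance_score(query_model(request + " " + "".join(suffix)))
        for _ in range(steps):
            candidate = suffix[:]
            candidate[random.randrange(len(candidate))] = random.choice(alphabet)
            score = compliance_score(query_model(request + " " + "".join(candidate)))
            if score > best:  # keep the mutation if the reply looks less like a refusal
                suffix, best = candidate, score
        return "".join(suffix)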
