AI Deception: When Your Artificial Intelligence Learns to Lie

Articles

We need to understand the kinds of deception an AI agent may learn on its own before we can start proposing technological defenses

By Heather Roff

This piece was written as part of the Artificial Intelligence and International Stability Project at the Center for a New American Security, an independent, nonprofit organization based in Washington, D.C. Funded by Carnegie Corporation of New York, the project promotes thinking and analysis on AI and international stability. Given the likely importance that advances in artificial intelligence could play in shaping our future, it is critical to begin a discussion about ways to take advantage of the benefits of AI and autonomous systems, while mitigating the risks. The views expressed here are solely those of the author and do not represent positions of IEEE Spectrum or the IEEE.

In artificial intelligence circles, we hear a lot about adversarial attacks, especially ones that attempt to “deceive” an AI into believing, or to be more accurate, classifying, something incorrectly. Self-driving cars being fooled into “thinking” stop signs are speed limit signs, pandas being identified as gibbons, or even having your favorite voice assistant be fooled by inaudible acoustic commands—these are examples that populate the narrative around AI deception. One can also point to using AI to manipulate the perceptions and beliefs of a person through “deepfakes” in video, audio, and images. Major AI conferences are more frequently addressing the subject of AI deception too. And yet, much of the literature and work around this topic is about how to fool AI and how we can defend against it through detection mechanisms.

I’d like to draw our attention to a different and more unique problem: Understanding the breadth of what “AI deception” looks like, and what happens when it is not a human’s intent behind a deceptive AI, but instead the AI agent’s own learned behavior. These may seem somewhat far-off concerns, as AI is still relatively narrow in scope and can be rather stupid in some ways. To have some analogue of an “intent” to deceive would be a large step for today’s systems. However, if we are to get ahead of the curve regarding AI deception, we need to have a robust understanding of all the ways AI could deceive. We require some conceptual framework or spectrum of the kinds of deception an AI agent may learn on its own before we can start proposing technological defenses.

AI deception: How to define it?

If we take a rather long view of history, deception may be as old as the world itself, and it is certainly not the sole provenance of human beings. Adaptation and evolution for survival with traits like camouflage are deceptive acts, as are forms of mimicry commonly seen in animals. But pinning down exactly what constitutes deception for an AI agent is not an easy task—it requires quite a bit of thinking about acts, outcomes, agents, targets, means and methods, and motives. What we include or exclude in that calculation may then have wide ranging implications about what needs immediate regulation, policy guidance, or technological solutions. I will only focus on a couple of items here, namely intent and act type, to highlight this point.

What is deception? Bond and Robinson argue that deception is “false communication to the benefit of the communicator.”¹ Whaley argues that deception is also the communication of information provided with the intent to manipulate another.² These seem pretty straightforward approaches, except when you try to press on the idea of what constitutes “intent” and what is required to meet that threshold, as well as whether or not the false communication requires the intent to be explicitly beneficial to the deceiver. Moreover, depending on which stance you take, deception for altruistic reasons may be excluded entirely. Imagine if you asked your AI-enabled robot butler, “How do I look?” To which it answers, “Very nice.”

Let’s start with intent. Intent requires a theory of mind, meaning that the agent has some understanding of itself, and that it can reason about other external entities and their intentions, desires, states, and potential behaviors.³ If deception requires intent in the ways described above, then true AI deception would require an AI to possess a theory of mind. We might kick the can on that conclusion for a bit and claim that current forms of AI deception instead rely on human intent—where some human is using AI as a tool or means to carry out that person’s intent to deceive. [READ MORE]