In machine learning, computers “learn” by being shown lots of examples; for instance, we might use millions of pictures to train a model that can identify pictures of cats. The idea is that if we do a good job training the model, then it should be able to correctly identify new cat pictures (that is, ones it never saw during training) most of the time.

But it’s possible to take a picture that the model classifies correctly as being of a cat, and then tweak that picture very slightly in a way that would be undetectable to a human, but that fools the same model into classifying it with high confidence as, say, a philodendron. These subtly tweaked images are called adversarial examples, and there are known techniques for generating lots of them. Such techniques rely on making tiny, precise perturbations to the original correctly-classified example.

In addition to being hilarious, adversarial examples (and the fact that we know how to generate them) have important implications for computer security. Ian Goodfellow pointed out that although adversarial examples are unlikely to occur naturally, they could come up in practice in situations where there really is an adversary – someone who is intentionally trying to fool the system.1

Suppose I’m an attacker who has something to gain from making your model misclassify an image. Not only that, but let’s say that I want the misclassified image to be one that a human could look at and not suspect that anything fishy was going on. This distinction is important. For instance, one could imagine dramatic makeup and hair styles being able to fool state-of-the-art face recognition systems, but they’re pretty likely to be noticed by human observers. Adversarial examples, on the other hand, have the potential to fool machine learning systems while flying under humans’ radar. In their recent paper, “Adversarial examples in the physical world”, Goodfellow and his co-authors Alexey Kurakin and Samy Bengio write, “An adversarial example for the face recognition domain might consist of very subtle markings applied to a person’s face, so that a human observer would recognize their identity correctly, but a machine learning system would recognize them as being a different person.” Or imagine subtle stickers on a road sign that would go unnoticed by a human driver, but would fool the sign recognition system used by an autonomous vehicle.

But is it really possible to carry out such an attack? An adversarial input is created by making subtle, per-pixel perturbations to the original input. If the attack has to be carried out in the physical world, and the input has to be perceived by the machine learning system through a sensor – like, say, the camera used by a face detection system, or one on a self-driving car that “reads” road signs – then is it still possible to pass in an adversarial input? The “noise” of the physical world would interfere with the fine-grained perturbations that make the image adversarial. And you couldn’t just, say, print out a carefully-crafted adversarial image, take a picture of it with an ordinary camera, and expect the resulting image to still be adversarial…

or could you?

That’s the question that the “Adversarial examples in the physical world” work addresses. I found the paper to be approachable and fun to read, even for someone like me without much machine learning background. Quoting from the introduction:

In this paper we explore the possibility of creating adversarial examples in the physical world for image classification tasks. For this purpose we conducted an experiment with a pre-trained ImageNet Inception classifier (Szegedy et al., 2015). We generated adversarial examples for this model, then we fed these examples to the classifier through a cell-phone camera and measured the classification accuracy. This scenario is a simple physical world system which perceives data through a camera and then runs image classification. We found that a large fraction of adversarial examples generated for the original model remain misclassified even when perceived through a camera.

To do the experiments in this paper, the authors literally just printed out a bunch of known adversarial images with an ordinary office printer and took pictures of them with an ordinary camera phone. They fed those photos back to the image classification system, along with the original adversarial images, and compared the classifier’s accuracy on the two sets. This let them measure the extent to which the print-out-and-take-a-photo process caused “adversarial destruction”. Adversarial destruction is when a transformation on an adversarial image causes it to become less adversarial. (It’s also the name of my metal band.)
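To make the bookkeeping concrete, here’s a minimal sketch (mine, not the authors’) of how one might compute something like the paper’s destruction rate. The `classify` function and the list of (clean, adversarial, photographed-adversarial, true label) tuples are hypothetical stand-ins:

```python
def destruction_rate(examples, classify):
    """Fraction of adversarial examples 'destroyed' by a transformation.

    `examples` is a list of (clean_image, adv_image, transformed_adv_image,
    true_label) tuples; `classify` maps an image to a predicted label.
    Following the spirit of the paper's definition, only examples where the
    clean image was classified correctly and the adversarial image actually
    fooled the model are counted.
    """
    destroyed, eligible = 0, 0
    for clean, adv, adv_photo, true_label in examples:
        if classify(clean) == true_label and classify(adv) != true_label:
            eligible += 1
            if classify(adv_photo) == true_label:
                destroyed += 1  # the transformation undid the attack
    return destroyed / eligible if eligible else 0.0
```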

After that, they did several more experiments to try to tease out which aspects of the process did and didn’t contribute to adversarial destruction. They found that brightness and contrast changes didn’t contribute much, but that Gaussian noise and JPEG conversion did. They also considered different ways to generate adversarial images to learn which ones were more robust to adversarial destruction.
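For a feel of what those artificial transformations look like in code, here’s a rough sketch using numpy and Pillow; the noise level, JPEG quality, and brightness/contrast factors below are illustrative stand-ins, not the values the paper sweeps over:

```python
import io

import numpy as np
from PIL import Image, ImageEnhance


def add_gaussian_noise(image, sigma=8.0):
    """Add pixel-wise Gaussian noise to a uint8 RGB image (numpy array)."""
    noisy = image.astype(np.float32) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def jpeg_roundtrip(image, quality=75):
    """Encode and decode the image as JPEG, as a lossy-compression transform."""
    buffer = io.BytesIO()
    Image.fromarray(image).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return np.array(Image.open(buffer))


def adjust_brightness_contrast(image, brightness=1.1, contrast=0.9):
    """Scale brightness and contrast by the given factors."""
    pil_image = Image.fromarray(image)
    pil_image = ImageEnhance.Brightness(pil_image).enhance(brightness)
    pil_image = ImageEnhance.Contrast(pil_image).enhance(contrast)
    return np.array(pil_image)
```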

How to generate robust adversarial images that cause “interesting mistakes”?

Section 2 of the paper describes three techniques the authors used to generate adversarial images for their experiments. It’s the most technically dense part of the paper, and I don’t understand much of what’s happening here, but I want to make note of a few things that I do understand.

First of all, all of the techniques they present for generating adversarial images rely on knowledge of the internals of the trained model. The authors write, “We intentionally omit network weights (and other parameters) \(\theta\) in the cost function because we assume they are fixed (to the value resulting from training the machine learning model) in the context of the paper.” In other words, they leave \(\theta\) out of their equations because they are always dealing with a particular \(\theta\). To be able to apply their methods, though, you still need to know what \(\theta\) is!2 Furthermore, you need to know the cost function \(J\) that was used to train the model.
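To see why the internals matter, here’s a hedged sketch of the “fast” (fast gradient sign) method, written in PyTorch purely for illustration rather than reflecting the authors’ setup; notice that it needs the trained model (and hence \(\theta\)) in order to compute a gradient, as well as the cost function \(J\) that was used in training. The \(\epsilon\) value is an illustrative guess:

```python
import torch
import torch.nn.functional as F


def fast_method(model, image, true_label, epsilon=8 / 255):
    """Fast gradient sign method: X_adv = X + epsilon * sign(grad_X J(X, y_true)).

    `model` is a trained classifier with fixed weights theta; `image` is a
    float tensor in [0, 1] of shape (1, C, H, W); `true_label` holds the
    correct class index. epsilon bounds the per-pixel change.
    """
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)  # the cost function J
    loss.backward()                                   # gradient w.r.t. the input
    adv = image + epsilon * image.grad.sign()         # one step that increases J
    return adv.clamp(0.0, 1.0).detach()
```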

At first glance, this might seem like a showstopper problem for an adversary, who doesn’t have access to the guts of the trained model that they’re trying to fool. But the authors do point out earlier work by Szegedy et al. that provides evidence that an adversarial example designed to be misclassified by one model is often misclassified by another model. The earlier paper had an experiment that partitioned the set of 60,000 MNIST training images into two parts of 30,000 images each, and trained separate models on them. If I’m understanding the results in Table 4 of Szegedy et al. correctly, they found that adversarial examples created for one of the networks would fool the other network between 5.1% and 8.2% of the time. While 5.1% to 8.2% might not sound like a lot, I imagine plenty of adversaries would be happy with an attack that worked that often. Still, this isn’t an easy attack to pull off: there’s a lot of work involved in training a network, and the adversary doesn’t necessarily know much about what training data was used or how to get their hands on more data like it. If you were trying to fool, say, a commercial road sign recognition system, it seems to me like it’d be hard to come up with training data as similar to the original system’s training data as the two halves of the MNIST data set are to each other.
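Measuring that kind of transferability is straightforward once the adversarial examples are in hand. Here’s a tiny sketch (again in PyTorch, with hypothetical inputs) of the rate at which examples crafted against one model also fool a second, independently trained model, in the spirit of the Szegedy et al. cross-model evaluation:

```python
import torch


@torch.no_grad()
def transfer_rate(target_model, adv_images, true_labels):
    """Fraction of adversarial examples, crafted against some *other* model,
    that are also misclassified by `target_model`.

    `adv_images` is a batch tensor of shape (N, C, H, W); `true_labels` is a
    tensor of the N correct class indices.
    """
    predictions = target_model(adv_images).argmax(dim=1)
    return (predictions != true_labels).float().mean().item()
```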

(Edited to add: See the below update about transferability and black-box attacks!)

It’s also noteworthy that of the three techniques that the “physical world” paper presents for generating adversarial examples, the first two, the “fast” method and the “basic iterative” method, just try to get the model to make any incorrect classification. If you perturb a cat picture using one of the first two techniques, you’re trying to get it misclassified as anything other than a cat. However, an adversary would most likely be trying to get the classifier to make a specific mistake, not just any mistake. The authors address this issue by proposing a third technique that results in “more interesting mistakes”:

On ImageNet, with a much larger number of classes and the varying degrees of significance in the difference between classes, these methods can result in uninteresting misclassifications, such as mistaking one breed of sled dog for another breed of sled dog. In order to create more interesting mistakes, we introduce the iterative least-likely class method. This iterative method tries to make an adversarial image which will be classified as a specific desired target class. […] For a well-trained classifier, the least-likely class is usually highly dissimilar from the true class, so this attack method results in more interesting mistakes, such as mistaking a dog for an airplane.

Although this third method is called the iterative least-likely class method, it seems like it would be possible to use a version of it to target any specific class instead of just the least-likely one. In the procedure they give, \(y_{LL}\) is the least-likely class label, but unless I’m really misunderstanding things, any other class label could be swapped in. So, if you want to fool a face-recognition system into thinking that you’re not merely someone other than you (or the person whose face is least like yours), but some specific other person, then it seems like this third method has you covered. In any case, it’s a great day whenever I get to read a paper containing a sentence that begins, “In order to create more interesting mistakes…”.
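To be clear, what follows is my own sketch of that targeted variant, not something the paper spells out: it’s the iterative least-likely class procedure with an arbitrary `target_label` swapped in for \(y_{LL}\), and the step size, \(\epsilon\), and iteration count are illustrative guesses:

```python
import torch
import torch.nn.functional as F


def iterative_target_class(model, image, target_label, epsilon=16 / 255,
                           alpha=1 / 255, num_steps=10):
    """Iteratively nudge `image` toward being classified as `target_label`.

    Each step moves *against* the gradient of the loss for the target class
    (so the target's predicted probability goes up), then clips the result so
    it stays within an epsilon box around the original image and within [0, 1].
    """
    original = image.clone().detach()
    adv = original.clone()
    for _ in range(num_steps):
        adv = adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(adv), target_label)
        loss.backward()
        with torch.no_grad():
            adv = adv - alpha * adv.grad.sign()  # descend toward the target class
            adv = torch.max(torch.min(adv, original + epsilon), original - epsilon)
            adv = adv.clamp(0.0, 1.0)
    return adv.detach()
```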

Finally, the paper’s experiments found that of the three methods, the “fast” method for generating adversarial images was the most robust to adversarial destruction. This is convenient for adversaries who want to generate adversarial images quickly. However, the “fast” adversarial images are also the ones that are most obviously different from the original “clean” image (see, for instance, Figure 2 of the paper), so they might be easier for a suspicious human to detect. (Still, though, although to a human a “fast” adversarial image of a bird might indeed look different from a clean image of a bird, it still doesn’t look like an image of anything else in particular. It just looks like a somewhat noisy image of a bird.) The authors hypothesize that the robustness of the “fast” images “could be explained by the fact that iterative methods exploit more subtle kind of perturbations, and these subtle perturbations are more likely to be destroyed by photo transformation.” One question I was left with after reading this, though, was whether images that remained adversarial after the photo transformation were still adversarial in the same way that they were beforehand: did the model make the same misclassification that it did before? An adversary who wants to cause a specific misclassification would care about this. I’m imagining the answer is probably yes, but from what I can tell, the paper doesn’t say.

The phone in your pocket right now

One thing I really like about this paper is that it includes lots of details about how the physical-world experiments were carried out, like what kind of printer they used (it was a Ricoh MP C5503, set to 600 dpi) and what kind of phone they used (a Nexus 5x). The paper even gives the details of the automatic cropping technique they used to crop correctly-sized images out of each photo after they were taken.

These details are, of course, helpful for anyone who might want to try to reproduce the experiment, and I appreciated them on an intellectual level for that reason. But they also just delighted me as a reader in a much less intellectual, much more visceral way. In my line of work, I spend so much time dealing with the abstract, “only slightly removed from pure thought-stuff”, and most of the time, that suits me fine. But then along comes this paper, saying, “We printed out these pictures on a printer like the one you’re sitting a few feet away from, then took pictures of them with a phone like the one in your pocket right now!” It was incredibly refreshing.

I felt the same way recently when I came across this paper about anomaly detection in autonomous robots. The authors had done experiments with a mobile robot to evaluate how it would react to anomalous situations. They had done awful, hilarious things to the robot, like taping a coin to one of its wheels that would cause it to change its heading every time the coin touched the floor. As I read the paper, I couldn’t help but empathize with the poor robot, thumping along on its bum wheel, desperately trying to correct its heading. Maybe to a roboticist, this sort of thing is unremarkable, but for me, getting something so wonderfully visceral out of a computer science paper doesn’t happen all that often. As I continue to explore areas that are new to me with my new lab, I’m looking forward to getting to learn a lot more about systems that touch the physical world.


Update (November 15, 2016): The property that adversarial examples produced to mislead one model can also mislead another model is called transferability. It turns out that there are a lot of results on transferability that are more recent than Szegedy et al.’s that I had been unaware of when I wrote this post, including, embarrassingly, in the Goodfellow et al. “Explaining and Harnessing Adversarial Examples” paper from 2015 that I already cite. (Maybe I should have read as far as section 8.) More recently, Papernot et al. demonstrated a “black-box” attack in which an adversary can train a local “substitute” model using only the results of queries to a remote target model, then generate adversarial examples for the substitute model which are then misclassified by the remote model at rates of 84% to 96%. That paper considers situations where the remote model is a deep neural network, but their even more recent follow-up paper considers a broader variety of machine learning models and concludes that “black-box attacks are generally applicable to machine learning and can effectively target classifiers not built using deep neural networks” – which underscores what we already know from the “Explaining and Harnessing” work about what causes vulnerability to adversarial examples. And, since I wrote this post a month and a half ago, the “physical world” paper has been updated with a new experiment showing that printed adversarial images can fool an open-source image classification app that uses a differently-trained model.
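As a very rough outline (and definitely not Papernot et al.’s actual code, which also grows the substitute’s training set with Jacobian-based augmentation), the black-box workflow looks something like the sketch below; `query_remote`, the substitute model, the optimizer, the seed data, and \(\epsilon\) are all hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F


def black_box_attack_sketch(query_remote, substitute, optimizer, seed_images,
                            epsilon=8 / 255, epochs=5):
    """Simplified outline of a substitute-model ('black-box') attack.

    `query_remote` is assumed to return the remote model's predicted labels
    for a batch of images; `substitute` is a local model we can differentiate
    through; `optimizer` updates the substitute's parameters.
    """
    # 1. Label local data by querying the remote model as an oracle.
    oracle_labels = query_remote(seed_images)

    # 2. Train the local substitute to imitate the remote model.
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.cross_entropy(substitute(seed_images), oracle_labels)
        loss.backward()
        optimizer.step()

    # 3. Craft adversarial examples against the substitute (fast method);
    #    by transferability, many of them also fool the remote model.
    images = seed_images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(substitute(images), oracle_labels)
    loss.backward()
    return (images + epsilon * images.grad.sign()).clamp(0.0, 1.0).detach()
```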

Another brand-new paper from Liu et al. investigates transferability of adversarial examples crafted to fool ensembles of models. They find that an adversarial example designed to fool an ensemble of four models is usually misclassified by a fifth model. (That’s what they show in the paper, although it looks like they now have more recent results showing that an example designed to fool an ensemble of five models is almost never classified correctly by a sixth.) Furthermore, instead of looking at models trained on relatively small datasets such as MNIST, Liu et al. look at models trained on the ImageNet dataset, a huge image classification dataset with millions of labeled images.
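One simplified way to picture the ensemble idea: sum the losses of several models and take a single signed-gradient step against all of them at once. Liu et al.’s actual attacks are targeted and more sophisticated than this, so treat the following as a sketch of the flavor rather than their method:

```python
import torch
import torch.nn.functional as F


def ensemble_fast_method(models, image, true_label, epsilon=8 / 255):
    """Craft one adversarial image against several models at once by summing
    their losses and taking a single signed-gradient step."""
    image = image.clone().detach().requires_grad_(True)
    total_loss = sum(F.cross_entropy(m(image), true_label) for m in models)
    total_loss.backward()
    adv = (image + epsilon * image.grad.sign()).clamp(0.0, 1.0)
    return adv.detach()
```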

Finally, a new paper that appeared at CCS ‘16 just a couple weeks ago demonstrates an astonishing physical-world attack on face recognition systems. Get this: they printed adversarial images onto eyeglass frames, had people wear them, and were able to successfully impersonate celebrities to face recognition systems. Although most of their experiments are done in a “white-box” setting in which they have access to the internals of the system they’re trying to fool, they also show some promising results for a black-box attack on a commercial face recognition system.

Thanks to Joshua Yanovski, Ian Goodfellow, and my colleague Javier Turek for pointing out all these papers to me. There are almost certainly more relevant papers that I’ve missed. Coming from the programming languages world, I’m a bit stunned by how fast the research in this area is moving. I’m not used to writing a blog post and having it be obsolete a month later!


  1. Even if you aren’t worried about encountering an adversary in practice, there’s evidence that training on adversarial inputs can improve your model’s performance on clean inputs, too. 

  2. It may or may not be possible to recover information about the weights of a trained model based on observing inputs and outputs – if anyone wants to point me to any work that’s been done on that, I’d be interested. (Edited to add: See the above update about transferability and black-box attacks!) 
