Why those "training data poisoning" gimmicks don't really work
The human visual cortex uses an optimal representation of the world
One of my favorite science papers is Olshausen and Field (1996). It's about receptive fields in the primate brain. Receptive fields, roughly, are the area of the world that a neuron in the visual cortex cares about. In the human brain area V1, which is the first level of the hierarchically organized visual cortex, each neuron activates when a specific pattern of light and dark hits a specific part of the retina. These patterns of light and dark look like parallel dark and light stripes, sometimes thick and sometimes thin. They traverse a small, round area at some angle or other. V1 contains neurons that care about pretty much every orientation and thickness of stripe for pretty much every location on the retina, which is to say for every part of the world that your eye is able to take in. The output of V1's neurons, called feature detectors, feeds into the next level of hierarchy in the visual cortex. The way to think about these receptive fields—there is a mathematical function called a Gabor wavelet that approximates them, so vision scientists tend to just call them "Gabors"—is as the most basic building block that the brain uses to construct semantic knowledge of the visual world.
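To make "Gabor" concrete, here is a minimal sketch in Python (assuming NumPy is available) of the kind of function that approximates these receptive fields: a sinusoidal grating, the light and dark stripes, windowed by a Gaussian, the small round area. The size, orientation, and stripe width below are arbitrary choices for illustration.

```python
# A minimal Gabor patch: oriented stripes inside a soft circular window.
import numpy as np

size, sigma, wavelength, theta = 32, 6.0, 8.0, np.pi / 4   # illustrative values
y, x = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]

x_r = x * np.cos(theta) + y * np.sin(theta)     # rotate into the preferred orientation
gaussian = np.exp(-(x**2 + y**2) / (2 * sigma**2))          # the small, round window
stripes = np.cos(2 * np.pi * x_r / wavelength)              # parallel light/dark bands

gabor = gaussian * stripes
print(gabor.shape)   # a 32x32 patch you could plot to see the striped receptive field
```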
What Olshausen and Field wanted to know was if there was a mathematical way of explaining why the V1 receptive fields looked that way. If there was something about the visual organization of the world in general that made Gabors a particularly useful way of encoding it. They took a large collection of equally-sized small bitmap images of the natural world—mountains, fields, animals, bodies of water—and generated what's called a set of "basis functions" for encoding them. A set of basis functions, in this context, is a collection of images—bitmaps, of the same size as the natural images—that you can layer on top of each other to recreate any of the natural images. There are a huge number of different ways to do this. The simplest solution would be to make each basis function image entirely transparent except for one black pixel. Then, by layering those images at different levels of opacity, you could individually "color in" each of the pixels in your output image however you pleased, and generate any image you wanted. In the mathematical language of image processing you would refer to each of those opacity levels as a "coefficient", and every ordered collection of coefficients, one per basis function, uniquely describes an image.
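A minimal sketch of that layering idea, in Python with NumPy, using the trivial one-pixel basis just described (with "on" pixels standing in for the black dots): every image is a weighted sum of the basis images, and the weights are the coefficients.

```python
import numpy as np

target = np.array([[0.2, 0.9],
                   [0.0, 0.5]])                # the 2x2 image we want to recreate

# The simplest possible basis: four images, each blank except for one pixel.
basis = [np.zeros((2, 2)) for _ in range(4)]
for i, image in enumerate(basis):
    image[i // 2, i % 2] = 1.0

coefficients = [0.2, 0.9, 0.0, 0.5]            # one "opacity" per basis image
reconstruction = sum(c * b for c, b in zip(coefficients, basis))

print(np.allclose(reconstruction, target))     # True: the layers add up exactly
```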
The reason these images get called "basis functions" is that Olshausen and Field were actually treating those images as vectors. Every bitmap is made up of pixels. In a black and white bitmap each of those pixels is a number within a certain range—let's say 0 to 1. You might notice that those numbers are the same as the coefficients we just described. The sort of central mathematical trick of image processing is to treat those coefficients as describing a point. Each coefficient is one component of a set of coordinates in a high-dimensional space. Each image is represented by a location. Think about the example I gave above, where each basis function consisted of a single black dot in one location. What each of those corresponds to is a point that is as far as you can go in one of the dimensions. If you had images made up of only three pixels, the three basis functions would be points at the far end of the X, Y, and Z axes. Thinking of images this way allows you to think about the relationship of different images geometrically—whether they're close to each other or far away, whether they're in a line—and to do math on them accordingly.
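The same toy images, treated as points (again a sketch assuming NumPy): flatten each bitmap into a vector of coordinates, and geometric questions, like how far apart two images are, become ordinary arithmetic.

```python
import numpy as np

image_a = np.array([[0.2, 0.9], [0.0, 0.5]]).ravel()   # a point in 4-dimensional space
image_b = np.array([[0.3, 0.8], [0.1, 0.5]]).ravel()   # a slightly different image
image_c = np.array([[0.9, 0.1], [0.8, 0.2]]).ravel()   # a very different image

print(np.linalg.norm(image_a - image_b))   # small distance: the images are "close"
print(np.linalg.norm(image_a - image_c))   # large distance: the images are "far apart"
```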
Olshausen and Field were interested in geometric regularities among all those natural images. They were looking for the basis functions—images, vectors—that could most concisely characterize the differences between all the natural images. Think of all of those images as points in space. They're going to form a sort of cloud. Maybe it's sort of a football shape. If you wanted to describe that football shape, you could start by identifying its long axis, tilted just so, out just here. In other words, what's the longest line you can draw between two of those points? That's your first dimension. Then you find the longest line perpendicular to that one. Then the third. In three dimensions, you'd be done. Remember, though, that each image is a point in a very high dimensional space. The number of basis functions is equal to the number of dimensions, which is equal to the number of pixels in the image.
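For the mechanically inclined, here is a sketch of the "find the longest axes of the cloud" procedure, assuming NumPy and using random pixels as a stand-in for Olshausen and Field's flattened natural-image patches. Finding the longest axis, then the longest perpendicular axis, and so on, is what a singular value decomposition (or principal component analysis) does.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.random((1000, 64))      # placeholder: 1000 flattened 8x8 image patches

centered = patches - patches.mean(axis=0)     # put the cloud's center at the origin
# SVD of the data matrix returns the cloud's axes, longest first.
_, lengths, axes = np.linalg.svd(centered, full_matrices=False)

print(axes.shape)     # (64, 64): one 64-pixel "image" per axis of the cloud
print(lengths[:5])    # how stretched the cloud is along each of the first five axes
```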
Olshausen and Field were looking for the set of images where the basis functions were able to recreate the natural scene images most efficiently. That is, they wanted the set of bitmaps where, on average, it took as few of them as possible to meaningfully recreate any given image: the basis functions for which the most coefficients were zero. This is called a sparse code. For our purposes what's important is that the solution they found is the set of bitmaps that is in some sense the "best" for recreating natural scene images. And when they looked at those images, what they saw were, to a first approximation, Gabor wavelets, the same kinds of features the early visual cortex uses.
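The sparse-code objective can be sketched in a few lines with scikit-learn's dictionary learner, which solves a very similar problem (this assumes scikit-learn is installed and uses random data as a placeholder; run it on whitened natural-image patches and the learned basis images come out Gabor-like).

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
patches = rng.standard_normal((5000, 64))    # placeholder for flattened 8x8 patches

learner = MiniBatchDictionaryLearning(
    n_components=64,    # one basis image ("atom") per pixel dimension
    alpha=1.0,          # how heavily sparsity of the coefficients is rewarded
    random_state=0,
)
codes = learner.fit_transform(patches)       # coefficients for each patch
atoms = learner.components_                  # the learned basis images

print(atoms.shape)             # (64, 64): sixty-four flattened 8x8 basis images
print(np.mean(codes == 0))     # most coefficients are exactly zero: a sparse code
```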
That is, the features that the primary visual cortex uses to encode and represent views of the visual world are likely in some sense the most efficient possible features for doing that.
One thing this implies is that any system for encoding feature information about the natural world—for instance, an artificial neural network capable of describing or generating images—should, if it is to work on any image, have features in it which look something like Gabor features. This insight was central to the development of convolutional neural networks (CNNs), the first kind of artificial neural network to show real success in image classification and similar tasks. These networks were structured around small, local receptive fields, and when trained on natural images their early layers reliably develop Gabor-like filters.
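You can see this for yourself by pulling the first convolutional layer out of a pretrained CNN and plotting its filters. A sketch, assuming PyTorch, torchvision, and matplotlib are installed; resnet18 is just a convenient stand-in for "a CNN trained on natural images."

```python
import matplotlib.pyplot as plt
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
filters = model.conv1.weight.detach()          # shape (64, 3, 7, 7): 64 small filters

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    f = (f - f.min()) / (f.max() - f.min())    # rescale each filter to [0, 1]
    ax.imshow(f.permute(1, 2, 0).numpy())      # channels last for display
    ax.axis("off")
plt.show()   # many of the filters are small patches of oriented stripes: Gabor-like
```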
This insight about V1 also implies that visual recognition of all kinds of categories of objects involves, at its root, a similar process, and that techniques which make a visual task like recognition difficult for one kind of object, things like adding certain kinds of noise to an image, should make it difficult for all kinds of objects, in fairly predictable ways. And indeed, in humans, that insight can be verified experimentally. You'd expect to see the same thing in any kind of system that is optimized for general-purpose recognition.
Except with artificial neural networks—CNNs and the more modern transformer architectures—you don't see precisely that. Adding noise to images in the same controlled way that degrades human recognition has unexpected effects. Models fail more quickly than you'd expect them to, and in ways that are different from human failures. Whatever is going on internally in these models, it's not precisely the same kind of sparse coding that happens in the human visual cortex.
You can manipulate computer vision algorithms because the way they represent the world is idiosyncratic and NOT optimal
You can test this another way, by forcing these models to misclassify images. Take an image in one category—let's say it's a picture of a horse—and feed it to the model. The model's output is essentially a list of percentages attached to different categories which represent confidence that the image belongs to that category. For our horse picture, you'd expect that the "horse" category would be really high, close to 100%, and all the rest of the categories—including, let's say, "toaster"—would be really low. Then you perturb the image ever so slightly and see if it nudges the scores of "horse" and "toaster" in the direction you want. If it doesn't, try another very, very slight perturbation. If it does, keep that perturbation and try again. Over tens of thousands of perturbations, you'll end up with an image that the model is absolutely convinced is a toaster. If what ML models are doing to extract features from images is pretty much the same as what the brain is doing, you'd expect the final image to look like a toaster. That isn't what happens. Sometimes the image will look a bit like a toaster, but most of the time it'll just look like junk. In fact, you can constrain your perturbations so that you are only changing the image in ways that are not visible to a person. You can end up with a final image that looks pretty much identical to the original horse picture to any person looking at it, but where the ML model you're testing is absolutely convinced that it's a toaster.
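In practice the search usually isn't blind trial and error: if you have access to the model's internals, its gradients tell you directly which tiny change pushes the "toaster" score up (a targeted, projected-gradient-style attack). A minimal sketch, assuming PyTorch and torchvision are installed; "horse.png" and class index 859 (the usual ImageNet index for "toaster") are placeholders.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)   # ImageNet normalization
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
x = preprocess(Image.open("horse.png").convert("RGB")).unsqueeze(0)
target = torch.tensor([859])      # "toaster"
epsilon = 4 / 255                 # cap on how far any pixel may move: invisible to people
step = 1 / 255

x_adv = x.clone()
for _ in range(100):
    x_adv.requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model((x_adv - mean) / std), target)
    grad, = torch.autograd.grad(loss, x_adv)
    with torch.no_grad():
        x_adv = x_adv - step * grad.sign()                  # nudge toward "toaster"
        x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)    # keep the change invisible
        x_adv = x_adv.clamp(0, 1)                           # stay a valid image

probs = model((x_adv - mean) / std).softmax(dim=1)
print(probs[0, 859].item())       # the model's confidence that the "horse" is a toaster
```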
What does this mean? It means that when the ML model was trained, it was not necessarily learning to care about the kinds of universally present natural image features that the brain is sensitive to. It discovered other features—ones that can be invisible to people—that existed in the training images and were able to distinguish categories from each other. This seems, on its face, unlikely. Why would there be some invisible characteristic in the particular images of horses you chose that would be as useful for grouping those images together as regular old visible horse features that people pay attention to? The reason is that the model isn't really seeing images as images. Recall again that a bitmap image, for a machine learning model, actually represents a point in a very high dimensional space. Each pixel of that training image is a dimension. If you were training your model on three pixel images, you could make a graph showing where each of the images landed, and see with your eyes which ones tended to cluster near each other, and you could figure out how to draw a line through the space that would capture all the images in a certain category: you could find a feature (or basis function) that identifies that category. Three pixel images aren't terribly interesting. If you're training on, let's say, 256x256 pixel images, each image is a point in a space with a bit more than sixty-five thousand dimensions. That's not as easy to visualize. But the same intuition—the model is trying to figure out the spatial relationship of images that are similar to each other—carries through. In this kind of approach, there's no reason to weight certain "kinds" of features—certain directions in this very high dimensional space—over others unless those directions are important in the training set. And the number of directions is so high, because the number of dimensions is so high, that you can end up with ways of grouping images together that are perfectly effective, for your training set, but which don't rely on visual features that make any real sense to people looking at the images. For a given model, there are going to be thousands or hundreds of thousands of ways to get it to misclassify an image that exploit some high-dimensional quirk of grouping the model latched on to when trying to find a good enough solution to the problem of classifying its training set. These correspondences in high-dimensional backwaters of the training region are contingent not on the visual similarity of the images, in any generalizable way, but on the vicissitudes of how the training process approximates the best answer and idiosyncrasies in the particular collection of several million (or whatever) 256x256 pixel images you've chosen to train on.
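A toy version of that argument, assuming NumPy and scikit-learn: when pixel dimensions vastly outnumber training images, a linear classifier can perfectly separate even randomly assigned labels. The separating direction it finds is a quirk of that particular sample, not a visual feature, and it doesn't transfer to new images. (The images are shrunk to 64x64 here just to keep the sketch fast.)

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_images, n_pixels = 200, 64 * 64             # 200 images, 4096 pixel dimensions

images = rng.random((n_images, n_pixels))     # stand-in "images" with no real structure
labels = rng.integers(0, 2, size=n_images)    # labels assigned completely at random

clf = LinearSVC(C=1000, max_iter=10_000).fit(images, labels)
print(clf.score(images, labels))              # 1.0: a perfect, meaningless separator

fresh_images = rng.random((n_images, n_pixels))          # new images, new random labels
fresh_labels = rng.integers(0, 2, size=n_images)
print(clf.score(fresh_images, fresh_labels))  # ~0.5: the "feature" doesn't generalize
```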
Which brings us to Nightshade. A well-meaning and in many ways thoughtful academic group at the University of Chicago released Nightshade as a tool for artists to "fight back" against the proliferation of generative models that are both potentially extremely destructive to the professional prospects of working artists and notably prone to generating images that make a mockery of intellectual property, lifting the style, content, and even literal images of copyrighted art posted on the internet. Nightshade was developed using a process like the one I described above: perturbing inputs to a machine learning model to find modifications that are invisible to humans but which cause that model to misclassify images. By using this system, they aver, you can "poison" your images so that they degrade the training efficacy of generative models.
This is a noble idea, and people get excited about it whenever it comes up. The problem is, it's vanishingly unlikely to work. Or rather, it's certain to work on the model that it was trained on and overwhelmingly likely not to work on any other model. Remember what we said about the features the model finds in its training set. They're directions in super high dimensional space along which images with similar labels seem to congregate. To the model, any one of them is as good as any other. But they're the products of specific interactions between the approximations involved in training one model and the training set used. They're unexpected, high-dimensional backwaters very unlike the generic features that underlie human judgments of similarity. Unless another model uses pretty much the same approximations and the same training set, it is virtually certain that the unusual, high-level, invisible-to-humans features it ends up caring about will be unique. If the features were common between all data sets—like the natural image basis functions Olshausen and Field found—then they would be part of that most efficient sparse code, and they'd be visible to humans. The very fact that these features are not visible to humans, the most efficient and best-tuned system for perceiving the visual world that we know about, means that they are not likely to generalize between different models. The unfortunate truth is that any "data poisoning" approach that relies on perturbations invisible to humans is going to have a hard time generalizing between models, and we know this because, by design, those perturbations don't generalize to the system that's the very best at understanding and classifying the natural world for us: the visual cortex of our own brains.