Computer vision is badly defined
The perception systems in autonomous cars have struggled to provide the information necessary for driving. One reason is the way that computer vision is defined. The space of possible questions that computer vision—actually, machine learning systems in general—can be said to have successfully answered is constrained. That constraint exists for very good reasons, but as the problems facing machine-learning-powered systems get harder, it makes progress harder too.
I'm going to talk about supervised learning. That's not the only way to do machine learning, but it is by far the most common. It's also a reasonable synecdoche for machine learning in general. I'm also going to call it machine learning, not artificial intelligence. AI is basically a marketing term, at this point. It's not a completely inaccurate marketing term, but one of the freedoms afforded me now that I'm not selling anything is that I can call things what they are. Machine learning is the use of statistical learning techniques to produce algorithms that can identify patterns in data. That’s also what AI is, these days, but people like to use the fancier term. Machine learning is what you do under the hood.
Machine learning in its supervised form works by progressively learning statistical regularities in collections of data. Let's say you're trying to learn to identify a bus. You collect a few hundred thousand images of buses—maybe via a CAPTCHA system—and a few hundred thousand images that are roughly similar but don't contain buses. Then you feed most of those images to your statistical learning engine, holding 10% or so back for testing. For the past ten years or so that statistical learning engine has tended to be a deep neural network, but it doesn't really matter what kind of learning engine you use; the technique is the same. The learning engine identifies statistical regularities in the two sets and learns to differentiate one from the other. Whatever happens to be common in the "bus" images in the training set, the learning engine will eventually come to correlate with the "bus" label. Likewise, if there are statistical regularities in the not-"bus" set, the learning system will come to correlate those with the "not bus" label.
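To make the shape of that workflow concrete, here's a minimal sketch in Python. The data is invented (random feature vectors standing in for images, with one feature doing the work of "bus-ness"), and I'm using a simple linear classifier where a real system would use a deep network, but the train-on-most, hold-some-back pattern is the same.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-ins for feature vectors extracted from images; in reality
# these would come from a few hundred thousand labeled photographs.
X = rng.normal(size=(20_000, 64))
# Pretend one statistical regularity (feature 0) is what separates the two sets.
y = (X[:, 0] + 0.5 * rng.normal(size=20_000) > 0).astype(int)  # 1 = "bus", 0 = "not bus"

# Hold roughly 10% back for testing; train on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# The "learning engine" here is logistic regression; a deep network slots into
# exactly the same workflow.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Held-out accuracy is the standard measure of success, and it quietly assumes
# every one of those human-provided labels was correct.
print("held-out accuracy:", model.score(X_test, y_test))
```

Swap in real images and a convolutional network and the structure is identical.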
This method for training machine learning systems works extremely well in practice. In fact, you can make some pretty strong guarantees that if your training set gets big enough and your labels are correct, it will converge on working perfectly.
I sneaked an important caveat into the previous sentence: "if your labels are correct". Take our simple example of "bus" vs. "not-bus": the theory that says supervised learning will work assumes that every image labeled "bus" actually contains a bus, and every image labeled "not-bus" does not contain a bus. If that's not true, then the statistical regularities that indicate bus-ness and the statistical regularities that indicate not-bus-ness get mixed up with each other.
Even in this simple case, you can see potential problems. What if an image contains only a portion of a bus? What if it's just the wheel of a bus? What if it's just the wheel of a bus and buses use the same kind of wheels as delivery trucks? What if it's a drawing of a bus? Remember, each of those labels was provided by a human. What if somebody was lying, or wasn't paying attention? What if—and in practice, this happens all the time—a meaningful percentage of your labels, say 10% (the size of your held-back test set!), is simply wrong?
Problems like these, with bad or ambiguous labels for a well-understood task with a definite right answer, are actually pretty tractable these days. There has been careful work done on what percentage of labels can be wrong before your performance suffers. There are datasets that have been combed extensively for incorrect labeling (there are other problems with some of the big datasets, but let’s leave that aside for now). The accuracy of labels is taken very seriously in the industry.
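If you want a feel for that kind of work, here's a toy version of the experiment, using the same invented stand-in data as the sketch above: corrupt a growing fraction of the training labels and watch what happens to held-out accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# The same hypothetical "bus"/"not-bus" stand-in data as before.
X = rng.normal(size=(20_000, 64))
y = (X[:, 0] + 0.5 * rng.normal(size=20_000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Flip an increasing fraction of the training labels, as if some annotators were
# lying or not paying attention, and see where held-out accuracy falls apart.
for noise in (0.0, 0.1, 0.3, 0.5):
    corrupted = y_train.copy()
    flip = rng.random(len(corrupted)) < noise
    corrupted[flip] = 1 - corrupted[flip]
    model = LogisticRegression(max_iter=1000).fit(X_train, corrupted)
    print(f"{noise:>4.0%} labels wrong -> held-out accuracy "
          f"{model.score(X_test, y_test):.3f}")
```

In this toy, at least, a modest amount of random label noise barely dents the result on a clean, well-posed task; it's only as the labels approach coin flips that things fall apart.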
There's a different problem, though, that's a lot less well understood. It's the problem I mentioned in the first paragraph: constraining the set of questions you can answer to only those questions with an unambiguous right and wrong answer. That's a constraint that people doing computer vision and machine learning apply intentionally. In some ways it's central to the practice of machine learning, because it lets you know that your system actually works. However, it's also a problem, and an increasingly important one, because it turns out that many of the most useful questions the human visual system can answer about the world are not questions with an unambiguous correct answer. When you're trying to answer those questions, it's hard to even know what "accuracy" means.
The issue goes all the way back to how computer vision is conceptualized. There was a (very good) computer vision class for graduate students in the computer science department at Harvard; the title of the day-one lecture was "the ill-posed inverse problem". That problem is defined as: given a two-dimensional representation of the world on some kind of sensor, how do you recover the semantically meaningful facts about the three-dimensional world? It's a nice, clean way to define the problem, and it makes it easy to tell whether you've gotten it right. Take a world with a known set of facts, point a camera at it, and see if your system can report the correct facts. For computer vision practitioners, it defines the world of problems you can usefully solve.
But as I've said before, humans know a lot of things about the world that aren't precisely facts. Humans perceive, and act on, aspects of the world where one person might see something completely different from somebody else. Subjective judgments are central to our picture of the world. So are judgments that we think are objective but aren't. When you start really digging into how people answer questions about what they see in the world—that's a high-level way to describe vision science, the discipline in which I did my PhD—you discover that more often than not, what people are seeing and describing isn't precisely the same between people, or even easy to pin down in one individual. That isn't usually obvious, because we're really good at using our ambiguous perceptions of the world to guide our behavior. Remember, that's what human vision is for.
This presents a big problem for people who are trying to use computer vision to emulate complex human behavior. Not only do people see much more of the world than computers do; much of what they see is subjective, and varies from person to person. The inputs into the human behavioral system are not just easily verifiable facts about the world. The standard techniques for evaluating the performance of machine learning models—checking that those models have successfully applied the correct factual labels to a held-out test set—are no longer sufficient. The question of whether the labels are accurate or not becomes less meaningful. Instead, the question is whether the labels are behaviorally useful. Techniques for training machine learning models based not on whether their outputs are accurate, but on whether they're useful as input that can guide behavior in some other task, don't really exist. A lot of what we ended up having to develop at Perceptive Automata was systems for evaluating our models in terms of their usefulness for vehicle behavior, because there weren't existing tools to do that.
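To give a flavor of how the yardstick changes, here's a tiny, invented illustration (not a description of our actual tooling): score a model not against a single "correct" label but against the spread of judgments a panel of people gave. Hard-label accuracy can't tell an overconfident model from one that matches how divided people actually were; a measure of divergence from the human judgment distribution can.

```python
import math

# Suppose five annotators were asked a subjective question, say "does this
# pedestrian intend to cross?", and split three to two. All numbers here are
# invented for illustration; this is not any particular company's pipeline.
human_judgments = [1, 1, 1, 0, 0]
soft_target = sum(human_judgments) / len(human_judgments)  # 0.6: the spread of opinion
majority_label = round(soft_target)                        # 1: the conventional "ground truth"

def hard_accuracy(pred):
    # The conventional yardstick: does the rounded prediction match the majority label?
    return int(round(pred) == majority_label)

def divergence(p, q, eps=1e-9):
    # Cross-entropy between the human judgment distribution p and the model's
    # predicted probability q: how well the model matches the spread of opinion.
    return -(p * math.log(q + eps) + (1 - p) * math.log(1 - q + eps))

overconfident = 0.99   # "definitely yes"
calibrated = 0.60      # matches how divided people actually were

for name, pred in [("overconfident model", overconfident),
                   ("calibrated model", calibrated)]:
    print(f"{name}: hard-label accuracy {hard_accuracy(pred)}, "
          f"divergence from human judgments {divergence(soft_target, pred):.3f}")
```

Both models get perfect accuracy against the majority label; only one of them captures what people actually saw.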
Some of the newer players in autonomous cars are using a technique called end-to-end learning. This involves training all of the vehicle's systems—perception, planning, controls—together. That way, hopefully, the perception system can learn to provide the information that's most useful for the planning system, without worrying about distracting questions of whether what the perception system is providing is right in some explicit or objective sense. It's a promising approach, but a very new one, and I suspect it's still going to run into trouble.
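Structurally, the idea looks something like the sketch below: perception and planning form one differentiable pipeline, and the only training signal is how close the output comes to recorded human driving. The architecture and loss here are assumptions I've made up for illustration, not any particular company's system.

```python
import torch
import torch.nn as nn

class Perception(nn.Module):
    # Turns camera images into an internal representation; no named "facts".
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 32))

    def forward(self, image):
        return self.net(image)

class Planner(nn.Module):
    # Turns that representation into driving commands (say, steering and throttle).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, features):
        return self.net(features)

perception, planner = Perception(), Planner()
optimizer = torch.optim.Adam(
    list(perception.parameters()) + list(planner.parameters()), lr=1e-3)

# One hypothetical training step on (camera frame, expert driving command) pairs.
images = torch.randn(8, 3, 64, 64)   # stand-in camera frames
expert = torch.randn(8, 2)           # stand-in recorded human driving

commands = planner(perception(images))
loss = nn.functional.mse_loss(commands, expert)
loss.backward()                      # the driving loss flows back into perception
optimizer.step()
```

Notice that the perception module never gets labels of its own; whatever it learns to represent is whatever turns out to be useful for producing the driving commands.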
If you don't have an explicit understanding of the questions you need to be asking about the world, it's hard to be confident that the vehicle's behavior system will have the information it needs. You also need an explicit understanding of the target the planning system is trying to hit, or else you won't be able to work backwards to what the perception system has to provide. Understanding how humans do these things—what we are seeing, in all its individuality and ambiguity, and how we behave in response when we're driving—might not be the only way to get there, but from my perspective the alternatives are not obvious.
The first step, I argue, is changing our idea of what computer vision—and a lot of machine learning in general—is trying to accomplish. Don't think about systems that are trying to recover factual information about the world. Don't think about systems that take the truth of the labels being fed to them as a given. Think about systems that are trying to see the world the way people do, with all the attendant ambiguity, so they can behave the way people do. Once you do that, you still need techniques like end-to-end learning to build a practical autonomous car. But without feeding into those systems a deep, empirical understanding of the link between human perception and human behavior when driving, the raw material the end-to-end systems have to learn from will be insufficient.