Understanding (and) psychology
I've talked in the past about the difficulty of knowing what you're measuring, and how important that is to machine learning. It came up a lot at my company, but it is also incredibly important to understanding what LLMs are and are not, and it is embedded in an often disturbing history of ill-motivated and biased measurements being promoted as dispassionate, empirical indicators of the inferiority of some humans. In all of these cases, I think, the failure to fully understand what is happening and why is rooted in a lack of intuition for what behavioral measurement – psychology – is, how it works, what it's good for, and what it's not good for. This misunderstanding is a substrate underlying a lot of things – misapprehensions around autonomous cars, overly optimistic scenarios around LLMs, and the continuing elite promotion of long-since debunked "race science", to name a few – and it's worth trying to understand the connections that make this the case.
The machine learning methods that my collaborators and co-founders and I developed, around which we built our company’s product, were based on using a portion of the methodological toolbox of experimental psychology to improve machine learning. The specific thing we realized is that when you are using supervised learning to train a machine learning model, what you're really doing is teaching it to predict a human response—the application of a label—to a stimulus—a training example—that you've shown them. In other words, the process of generating a labeled data set is a process of controlled human behavioral measurement, just at massive scale. Psychology, and specifically the subfields of psychophysics and psychometrics, is the science of human behavioral measurement. When people do machine learning with large labeled data sets, they don't generally think of themselves as doing experimental psychology. What we realized was that, whether they intended to be or not, they were. They were simply oblivious to it, and thus doing it badly.
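To make the analogy concrete, here is a minimal sketch of what a labeled data set looks like once you admit what it is: a table of behavioral trials, each recording a stimulus shown to a person and the response they produced. The field names and numbers are mine, purely illustrative, not anything from our product.

```python
# A labeled dataset, viewed as behavioral data: every row is one trial in
# which a person (the annotator) was shown a stimulus (the training example)
# and produced a response (the label). Field names here are hypothetical.
from collections import defaultdict
from statistics import mean

trials = [
    {"annotator": "a1", "stimulus": "img_001.jpg", "response": "cat", "rt_ms": 850},
    {"annotator": "a2", "stimulus": "img_001.jpg", "response": "cat", "rt_ms": 1210},
    {"annotator": "a3", "stimulus": "img_001.jpg", "response": "dog", "rt_ms": 2890},
    {"annotator": "a1", "stimulus": "img_002.jpg", "response": "dog", "rt_ms": 640},
    {"annotator": "a2", "stimulus": "img_002.jpg", "response": "dog", "rt_ms": 700},
]

# Group trials by stimulus, exactly as a psychophysicist would group by condition.
by_stimulus = defaultdict(list)
for t in trials:
    by_stimulus[t["stimulus"]].append(t)

for stim, ts in by_stimulus.items():
    responses = [t["response"] for t in ts]
    modal = max(set(responses), key=responses.count)
    agreement = responses.count(modal) / len(responses)
    # Agreement and response time are behavioral measurements of the annotators,
    # not properties of the image, which is the point of the analogy.
    print(stim, modal, f"agreement={agreement:.2f}",
          f"mean_rt={mean(t['rt_ms'] for t in ts):.0f}ms")
```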
This presented—still does present, I think—a huge opportunity. What it means is that just about everybody doing machine learning is ignoring an absolutely central aspect of what they're doing. We were seemingly alone in understanding that the techniques of psychological experimentation were immediately and directly relevant to the tasks of training and evaluating machine learning models, and we alone seemed to be seriously trying to unlock the potential available by training models with these factors in mind. The immediate question for us—back when we were first developing these techniques, publishing papers and filing patents—was why nobody else was paying attention to this.
I got some clarity on that question as we built the company. As we talked to potential customers, and even more so as I talked to the software industry veterans and machine learning experts we hired, I realized that there was a fundamental lack of understanding of what is difficult about psychological experimentation. When I walked through our methodology with very smart and capable software engineers, I saw their blank looks as I talked about the construction of our annotation tasks. Their implicit question was clear: sure, you show people a picture and ask them to label it, so what's the big deal?
I knew that psychological experimentation was in fact a very big and difficult deal. But one of the most confounding aspects of its difficulty is that it is hard for reasons that are neither obvious nor intuitive for non-experts, more so than in other fields of science. In physical experimentation you are controlling for measurable, and often visible, confounds. In computer science you generally (well, often) are able to build in tests that flag incorrect or inappropriate information passing through to your measurement. Even in fields as involved as neuroscience the ability to confirm that you are measuring what you think you're measuring—and seeing what you think you're seeing—is mostly a matter of working levers that are visible and quantifiable.
Psychology is different. There's a fantastic blog post that I am absolutely not going to dig up that glossed psychophysics experiments using response time as single-voxel neuroimaging. You put some kind of input into the brain, the brain whirrs and clunks for some amount of time, and an answer gets spit out. Then you refer to that amount of time, that single number, and you draw conclusions about how the brain works. That it’s possible to do this well at all is surprising. It is possible, though, and in fact over the hundred years or so when behavioral experimentation was the only method available to learn about human brains, psychologists were able to learn an astonishing amount about what was really going on with the hidden machinery of cognition simply by looking at simple measures like response time and the answers to multiple-choice questions.
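To make the "single number" point concrete, here is a toy example, not drawn from any particular study, of how an inference about cognition gets hung on nothing more than a difference in response times between two conditions.

```python
# Toy illustration: the entire "output" of a behavioral experiment is often a
# single number per trial (here, response time in milliseconds), and the
# scientific claim rests on how that number differs across conditions.
from statistics import mean, stdev

# Hypothetical response times for two stimulus conditions (ms).
congruent = [412, 398, 441, 377, 405, 390, 420]
incongruent = [498, 512, 470, 530, 489, 505, 477]

diff = mean(incongruent) - mean(congruent)
print(f"mean RT difference: {diff:.0f} ms "
      f"(congruent sd={stdev(congruent):.0f}, incongruent sd={stdev(incongruent):.0f})")

# From roughly 80-100 ms of extra processing time, a psychologist infers
# something about the hidden machinery: for example, that resolving conflicting
# information requires an additional cognitive step. Everything hangs on whether
# the two conditions differ only in the thing you intended to vary.
```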
The secret is that you have to be maniacally thoughtful and careful about how you structure the input and how you measure the output. You have to structure your stimulus (the thing you're showing them) and your response measure so that the only way the person you're testing can answer the question you're asking is by using the cognitive faculty of interest. I can give an example. My adviser and some collaborators developed the gold standard test of face recognition. It's called the Cambridge Face Memory Test. It's designed to characterize people's ability to identify strangers based on their face appearance. Some of the characteristics that had to be controlled in the development of this test included: hair, makeup, skin tone, head position, head size, head angle, gender, age, order of trials, number of trials, length of trial blocks, presentation time, the test subject's eye location, instructions, test subject seating distance... I could go on. The instructions in particular took many iterations. Test subjects tend not to read words that you put in front of them, so how few words were necessary to get the point across? You needed example trials that were both fast and impossible to misunderstand; how do you make images for those trials that don't interfere with the results you're looking for on the experimental trials? In the event, the test instructions used around twenty words total, relying instead on sample trials. Those sample trials used a universally recognized face that could not possibly confound the recognition of real faces later on: Bart Simpson.
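For a sense of how many of those controls end up baked into the experiment itself, here is a hypothetical sketch of a trial specification; none of the field names or values come from the actual CFMT, but each one stands in for a factor listed above.

```python
# Hypothetical sketch of what "controlling the stimulus" looks like once it is
# written down. These field names and values do not come from the CFMT itself;
# they are stand-ins for the factors a face-recognition test has to pin down.
from dataclasses import dataclass

@dataclass(frozen=True)
class FaceTrialSpec:
    image_id: str
    presentation_ms: int = 3000      # fixed exposure time for every subject
    head_angle_deg: int = 0          # pose is a controlled variable, not noise
    grayscale: bool = True           # removes skin-tone and lighting cues
    hair_cropped: bool = True        # hair is an identity shortcut; remove it
    image_height_px: int = 256       # constant size at a fixed seating distance
    is_practice: bool = False        # practice trials use a face no real face resembles

# The practice block: a universally recognized face that cannot be confused
# with any of the experimental faces.
practice = [FaceTrialSpec(image_id=f"bart_simpson_{i}", is_practice=True) for i in range(3)]
experimental = [FaceTrialSpec(image_id=f"target_face_{i:02d}") for i in range(18)]
```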
The factors that are controlled for, and the ways they are controlled, can seem arbitrary. Does it really have to be Bart Simpson? From the outside that can seem like a choice both arbitrary and capricious. It is neither. Finding a universally recognized face image with multiple available poses that would not resemble the face of any living human took effort and creativity. Knowing that it was necessary took real expertise. Yet the result seems simple, even banal. That counter-intuitiveness is what I struggled to capture when helping the non-experts at my company understand the heart of the value we were creating, and it also explains the lack of interest in the machine learning community at large in really engaging with the task of measuring what they are doing. The set of skills and the expertise required, and the iterations that are necessary, come from a totally different methodological and scholarly tradition than that of computer science, and they involve intense, single-minded care about aspects of human cognition as it exists in the world that computer scientists—and, indeed, most people, even many psychologists—are neither predisposed nor equipped to understand.
One of the interesting side effects of this realization—that machine learning is held back by its inability to understand the ways psychological measurement is both inherent to the task and phenomenally difficult—is that you start to see this kind of pattern everywhere. Across a broad swath of fields of inquiry, particularly those that are both nominally empirical and widely applied, you see non-experts vastly overestimating the reliability and validity of psychological measurements because they do not have the intuition for the ways those measurements go wrong. In truth, the failure to understand the importance of psychology to machine learning is a relatively minor lacuna. Vastly greater harm has come from elite acceptance of measurements of cognitive capacity like IQ.
The truth about IQ is that IQ tests have been carefully designed and improved over the course of a century to very precisely measure a person's ability to do well on an IQ test. The idea that they are measuring anything like inherent general intelligence has long since been disproved by failed replications and failed predictive power. Cosma Shalizi's series of blog posts titled "g, a Statistical Myth" lays out, with tremendous clarity and in vastly more detail than I could, the fallaciousness of believing we can measure general intelligence. What I am interested in is the perverse staying power of belief in not just the abstract concept of g, but the actual validity of IQ tests as meaningful measurements of differences in cognitive capacity.
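To give a flavor of the argument, here is a minimal simulation in the spirit of what Shalizi discusses (Godfrey Thomson's sampling model): every test draws on many small, fully independent abilities, and a dominant first factor appears in the correlations anyway.

```python
# A minimal simulation in the spirit of Thomson's "sampling" model: performance
# on every test is driven by many small, fully independent abilities, yet a
# dominant first factor (a "g") falls out of the correlation matrix anyway.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_abilities, n_tests = 2000, 500, 10

abilities = rng.normal(size=(n_people, n_abilities))   # independent by construction

tests = np.empty((n_people, n_tests))
for j in range(n_tests):
    used = rng.choice(n_abilities, size=200, replace=False)  # each test samples many abilities
    tests[:, j] = abilities[:, used].sum(axis=1) + rng.normal(scale=5.0, size=n_people)

eigvals = np.linalg.eigvalsh(np.corrcoef(tests, rowvar=False))[::-1]
print(f"first factor explains {eigvals[0] / n_tests:.0%} of test variance")
# A large, positive first factor emerges even though nothing general exists in
# the data-generating process, which is why its existence proves so little.
```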
Two of the reasons IQ has remained popular are, of course, racism and sexism. It also provides a helpful narrative hook for people who have achieved and been given much: it explains their success without recourse to luck or privilege. But I think there is also a general underlying misapprehension that the measurement of a cognitive capacity like general intelligence should be not only possible but relatively straightforward. At a high level it seems obvious: we can tell whether somebody is smart or not, and different people have different levels of intellectual achievement, so why is it hard? The answer is that controlling for all of the factors necessary to measure something as amorphous and ill-defined as general intelligence accurately and without bias is orders of magnitude more difficult than measuring something as bounded and well-defined as the ability to recognize unfamiliar faces. From the perspective of expertise in and experience with psychological measurement, and looking back on decades of occasional successes and much more common failures in using those tools to capture aspects of human cognition, the measurement of general intelligence starts to look like one of the hardest possible things you could try to do with the toolbox of experimental psychology.
So when somebody like Richard Hanania claims that he has abandoned his bog-standard, brutish racism for a more elevated and pragmatic understanding of the cognitive differences between groups of humans, his continued acceptance among elites is a product of their comfort with ideas that are fundamentally racist and sexist, yes, but it is also due to their misunderstanding of the difficulty and limits of empirical measurements of human cognition and behavior. The same inadequate intuitions which I have described as underlying people's overestimation of the power of LLMs support the continued disingenuous efforts of those who would judge others by their gender or the color of their skin to import their ideologies into the mainstream under the cover of scientific respectability.