How the field of "AI" got like this
The AI world can be a deeply hostile place for people who are not white men. This is inarguable, it is a huge problem for the field, and it is one that should be solved not just for reasons of equity and fairness but because it holds the field back in essential ways. I’m not going to write about those issues explicitly—there are others who speak to them far more effectively and cogently than I could—but they are extremely real, and urgent, in most ways more so than the issues that I will write about. To my eye, the intolerance of heterogeneity is part and parcel of a certain limited mindset, one that adopts some of the worst habits of mind of computer engineering, exacerbated in its effects by the particular problems facing machine learning today. There is a fundamental incuriousness about the world beyond immediate, measurable engineering problems that, while not universal, suffuses "AI", in ways that are not only limiting to the field but increasingly dangerous and unfortunate for the world at large.
I should say up front that this essay is more than usually (for me) opinionated. I am giving my gloss on historical events which I believe I understand, but which I was mostly not present for and around which my archival reading is not exhaustive. I could easily be wrong. I don't think I am. I think that the narrative I'll lay out captures the nature of the field, and a part of how it came to be that way.
The field that has become AI (an ill-defined term I continue to hate) was born of an interdisciplinary ferment in the 1950s. The nascent science of information theory seemed to offer both computer scientists and psychologists a path to implementing, in a machine, the algorithms that defined the way humans think. By doing this, a machine could be created that was intelligent, as the word was then defined for humans. This effort failed, and failed dramatically, largely because the mechanisms which psychologists and other students of human reasoning confidently asserted lay beneath human cognition turned out to be comprehensively incorrect.
To say that the field of AI was scarred by this failure would be a tremendous understatement. After the failures of early AI, funding for the field dried up almost completely. Books were written dismissing out of hand the possibility of accomplishing the goal of a computer that thought like a human. Whole departments, and, indeed, a whole industry, centered on MIT and closely identified with its flagship corporation Symbolics, disappeared overnight. This history has been carried like an epigenetic marker through computer science to this day. I’ve known computer science students with strong and ready-to-hand opinions about the Lighthill Report, even though its vituperatively negative conclusions about the promise of late-‘60s “AI” were offered fully fifty years ago.
As a consequence of the succeeding AI winters, the subfields of computer science that attempted to design algorithms capable of learning from data first rebranded—no longer AI, now "computer vision", "machine learning", "machine translation" and so forth—and refocused themselves on self-definitions that were as rigorously grounded in solid engineering principles as possible. The airy, philosophical looseness of the concept of “intelligence” was jettisoned. For computer vision—subject of one of the most famously mockable early failures in AI (they figured solving vision would take about two months of a student’s summer)—this meant strict adherence to a definition of the field known as "solving the ill-posed inverse problem": the goal—the only goal, definitionally—of computer vision was to take a 2D impingement of light from the world on a sensor and use it to recover the position, shape, and semantic identity of the objects in that world. For machine learning in general, the goal was to predict the labels on an unlabeled data set given a labeled training set. No more pretension to sentience, no more dreams of erudite humanoid robots discussing matters of philosophy. You are given a set of labels A on set X; can you predict labels B on set Y?
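To make that framing concrete, here is a minimal sketch of the paradigm as the rebranded field defined it (my own illustration, using scikit-learn and one of its bundled toy datasets, neither of which has anything to do with the history above): fit a model on a labeled training set, predict labels on a held-out set, and report a single accuracy number.

```python
# A minimal sketch of the "predict labels B on set Y" framing of machine learning.
# Library, model, and dataset are illustrative choices, nothing canonical.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = load_digits()
# Set X with its labels A, and the held-out set Y whose labels we must predict.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)   # the particular model is almost beside the point
model.fit(X_train, y_train)                 # learn from the labeled training set
y_pred = model.predict(X_test)              # predict labels on the held-out set

# The single number the field's gatekeeping revolves around.
print(f"held-out accuracy: {accuracy_score(y_test, y_pred):.3f}")
```

Everything that counts, under this definition, lives in that last line; everything else about the model is merely instrumental to it.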
The reviewing standards for machine learning conferences are built around this definition. Reviewers look for a novel approach and interesting ideas, to some degree, in deciding which papers to accept to the most prestigious conferences, but fundamentally, if your paper does not perform better at predicting the labels on the held-out test set (corresponding to set Y in the previous paragraph) of a known, published dataset, your chances of acceptance are slim. Advancement of the field as a whole is gauged by improvements in performance on well-known and well-studied benchmarks, full stop. Every other consideration—algorithmic efficiency, interesting patterns of failure, biological plausibility of algorithms—however interesting to the researchers, is irrelevant if the approach does not work better than the extant state of the art on known benchmarks.
Publication at these conferences is a big deal. Academic and professional careers have been made from single papers published at the most high-profile conferences like NeurIPS, CVPR, and ICML. Three such conference publications constitute sufficient scholarly work to be awarded a PhD from the most demanding departments. To succeed in the field, to advance the state of the art, and fundamentally to be a genuine part of the intellectual conversation requires publishing approaches that improve performance on reference datasets.
This is essentially, in the context of the history of AI, a field-wide defense mechanism, a way to draw a methodological line in the sand against the kind of unverifiable and ultimately unprofitable model building that afflicts cognitive science, economics, and other less engineering-aligned sciences. It also fits very well with the mindset of software engineering: if something works, you should be able to run it and show that it works. Theoretical advances are fine, but they get no special dispensation; they are weighed on an even scale against approaches that leverage brute force, whether via scale of computing or scale of data.
Perhaps ironically, this firm adherence to the necessity of advancing the practical state of the art kept deep learning—the technique that is the taproot of all the advances in ML today, from LLMs to end-to-end trained transformer models of driving—a backwater for decades. Although the techniques behind it were developed in the 1980s, they were not practical at the scale of computing power available at the time. It took the pressures of Moore's Law to bring the necessary speed of operations to bear. Once the hardware caught up, the same techniques originally published during the Reagan administration became, without modification, the hottest topic in computer science, the subject of intensive study and fantastically rapid iteration that continues to the present day.
Helpfully for computer scientists, once the computing power met the theoretical needs of deep learning, the problems resolved even more neatly to engineering problems. Activation functions, network architectures, dataset throughput, model optimization: these are engineering problems. These are problems that a field culturally conditioned to reject the non-literal and untestable as out of scope could get a handle on.
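A toy sketch of what those knobs look like in practice (PyTorch here, with synthetic data and arbitrary sizes, all of my own choosing): the activation function, the layer widths, the data loader's throughput settings, and the choice of optimizer are each a concrete, tunable, testable engineering decision.

```python
# Each commented group below is one of the engineering knobs named above;
# the specific values are illustrative, not recommendations.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 1024 examples of a 784-dimensional input, 10 classes.
X = torch.randn(1024, 784)
y = torch.randint(0, 10, (1024,))

# Dataset throughput: batch size, shuffling, worker processes.
loader = DataLoader(TensorDataset(X, y), batch_size=128, shuffle=True, num_workers=0)

# Network architecture and activation function.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),            # swap for nn.GELU(), nn.Tanh(), ... and measure
    nn.Linear(256, 10),
)

# Model optimization: optimizer choice and learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```

Every line of that sketch can be varied, run, and scored against a benchmark, which is exactly why this kind of work fit the field's temperament so well.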
There are other questions in machine learning, though. As deep learning and its rapidly evolving instantiations – particularly transformers and generative models – reach a level of performance no longer easily measured against existing reference datasets, and especially as machine learning establishes its commercial presence in the broader world, it is raising questions that are fundamentally not engineering questions in character. The most important questions in machine learning – “AI” – right now are unlikely ever to be concordant with a publication process that requires specific, measurable advancements on extant, well-understood datasets. Questions about the true nature of the representations contained within these models. Questions about their role in society. Questions about the facility with which they engage with subjective topics, or creative ones, or with art: none of these are amenable to the field's traditional gatekeeping.
This doesn't mean people don't ask these questions. Increasingly, many do, from a broad range of fields. But the sense among the people actually building and deploying these systems that these kinds of questions are an interesting, but ultimately irrelevant, sideshow is likely to be difficult to eradicate; the sense memory within the field of having been burned before by engaging too deeply with scholars and thinkers outside of computer science is strong, and the limited, demographically and imaginatively constrained culture of “hardcore” engineering—the culture that insists on practical, deployed performance on comprehensively quantifiable tasks as the only truly relevant desideratum—is even stronger. Moving away from those biases, not just at the margins but at the core of the field of ML, is an ever-more-intensely necessary transformation.