Science of Chess: Can you tell a human opponent from a machine?
Do chess engines pass the Turing Test?
When I first started learning how to play chess as a kid, I was also really excited about computers and programming. This meant that it didn't take long for me to set my sights on getting a computer training partner that I could run on my Apple IIc. In 1985, I was saving up for programs like Sargon III or Chessmaster 2000, both of which I played against countless times to train for scholastic tournaments. Whenever I upgraded my computer, I made sure that I had an upgraded chess program too, even though I more or less stopped playing competitive chess in 8th grade (1995 or so, which is important for what comes next).
Screen shot of Sargon III gameplay on the Apple IIc.
It wasn't long after I quit competitive play that computer chess took a huge leap forward. In 1996, Deep Blue managed to take a game from then World Champion Garry Kasparov, and the next year it would outright beat him under tournament conditions. Though I had mostly left chess behind, the news that a computer could beat the best player in the world was equal parts exciting and disappointing. Part of me couldn't help but want Kasparov to win, but part of me was intrigued and inspired by the idea of building an algorithm that could play chess at the highest levels.
Kasparov facing off against Deep Blue - Chessbase.com
Fast-forward to 2022 and I decided to give chess a try again. I was prepared to have to re-learn a ton of what I had forgotten during years of not playing, but what I wasn't prepared for was just how thoroughly incredibly strong chess engines had been woven into the game. The idea that I could analyze my games with the help of an engine more powerful than Deep Blue was startling. The more I started to consume online chess content, the more I was amazed by the way commentary worked now: Streamers and commentators didn't have to speculate about a position the way I remembered, but could now look at an eval bar or an engine line in real time. Gone were the days of wondering if a certain line was best - now the question was whether a player would find the move that we, with the benefit of the engine, all knew was strongest. Could a GM see what the machine saw?
What are "computer moves?"
The more chess content I watched, the more I kept hearing a phrase that fascinated me. That phrase was computer move. I'd watch a recap video, for example, and the commentator might point out the best line by saying something like, "But that's such a computer move - there's no way a human is going to play that." I also saw discussions like this in videos exposing chess cheaters: In many of these videos, a player with a rating not far from mine would uncork a move that seemed impossibly strange, but ended up putting enormous pressure on the opponent. A "computer move" strikes again!
Really? h4 is best?
From a cognitive science perspective, I thought this was fascinating. What this idea of "computer moves" suggested to me was that chess engines aren't just better, but that they also play chess differently than people. What could that mean? Why would some computer moves seem "hard to play" or "impossible to see" when they're right there on the board? One possibility is that while people bring a lot of cognitive biases to their decision-making (including confirmation bias, which I've written about elsewhere), computers have no such constraints on their thinking.
A classic example of Confirmation Bias: If I tell you that every even-numbered card is red on the other side, which cards do you turn over to check if that's right? Most people will say "8" and "Red," but it's really "8" and "Blue" that you need to check!
This idea of "computer moves" raises an interesting question, however: Can chess players really tell the difference between a machine opponent and a human opponent? Those examples I gave you above are compelling, but also aren't quite enough to provide a firm answer. To really find out if we can tell a person from a bot, we need to know if chess engines can pass the Turing Test.
What is the Turing Test?
Alan Turing (pictured below) is widely considered one of the founders of computer science, having made foundational theoretical contributions to the field (like his work on computability) and developed practical applications of computer algorithms (like his wartime work on cryptanalysis). His work was wide-ranging, including topics like pattern formation in biological and chemical systems alongside his more mathematical work. Perhaps his most enduring idea is the concept of the Turing Test, which he suggested as a simple criterion for deciding whether or not a machine could be considered "intelligent."
Public domain, via Wikimedia Commons
The key idea behind the Turing Test is that if a person cannot distinguish between a person and a machine in some setting, then we may as well say that the machine is intelligent. There is a LOT of argument about whether that's an acceptable stance to take, but the Turing Test has nonetheless been an important benchmark for AI ever since. The setting Turing described to introduce his test was one in which a person and a machine would each be asked questions by a third party he called the Interrogator. The Interrogator could ask whatever they liked of either the person or the machine; the machine would try to act person-like, while the person would try to help the Interrogator identify who was who. At the end of the questioning, the Interrogator would have to guess which entity was the real person.
A schematic view of the Turing Test: Can the Interrogator (C) determine which of A or B is the machine? Schoeneh, CC0, via Wikimedia Commons
Since this initial proposal, there have been different attempts to conduct Turing Tests in widely different settings. These include conversation (like the original proposal), visual art (Daniele et al., 2021), music composition (Ariza, 2009) and even driving behavior (Bazilinskyy et al., 2021). But what about chess? This question is the subject of the study I'd like to tell you about, in which researchers examined how well chess players could succeed at an over-the-board version of Turing's "Imitation Game."
A Turing Test for chess (Eisma et al., 2024)
In this study, the researchers set up a straightforward "Imitation Game" scenario for evaluating how detectable machine play would be over-the-board. However, this included a number of important design choices to ensure that their version of a chess Turing Test wouldn't be trivially easy for their participants. For example, we all know that the best engines are simply far stronger than the vast majority of human players. Besides their strength, they will also choose good moves much faster than nearly any human, especially in difficult positions. Noticing that an opponent is playing with superhuman accuracy and speed is an easy way of working out that you're playing a machine, so what can we do to make a Turing Test for chess more meaningful?
The strongest engines are just better than humans - how do we make a Turing Test about more than superhuman chess strength?
To look more specifically at the nature of computer play vs. human play, the authors of the current study set up their Turing Test to balance properties of human and computer play carefully. First, all participants were given the same set of starting positions to begin playing from, all of which were selected from games played by GM Drvitman on Lichess. Each of these positions was the result of the first 10 moves of a Rapid game and was selected so that there were no tactical forcing moves. You can see one of these positions below:
Figure 2 from Eisma et al., (2024) illustrating one of the starting positions used in their Turing Test experiment.
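The paper doesn't spell out exactly how the authors screened for "no tactical forcing moves," but if you wanted to assemble a similar pool of quiet starting positions yourself, here's a minimal sketch using python-chess and a local Stockfish binary. The file paths, search depth, and the one-pawn "quietness" threshold are my own assumptions for illustration, not the authors' procedure.

```python
import chess
import chess.engine
import chess.pgn

# Hypothetical paths - point these at your own Stockfish binary and PGN file.
STOCKFISH_PATH = "/usr/local/bin/stockfish"
PGN_PATH = "rapid_games.pgn"


def position_after_ten_moves(game):
    """Return the position reached after each side's first 10 moves (20 plies)."""
    board = game.board()
    for ply, move in enumerate(game.mainline_moves()):
        if ply >= 20:
            break
        board.push(move)
    return board


def looks_quiet(board, engine, gap_cp=100):
    """Crude stand-in for 'no tactical forcing moves': the engine's top two
    choices should be close in evaluation (here, within about one pawn)."""
    infos = engine.analyse(board, chess.engine.Limit(depth=18), multipv=2)
    if len(infos) < 2:
        return False  # only one legal move, so the position is clearly forcing
    best = infos[0]["score"].relative.score(mate_score=10000)
    second = infos[1]["score"].relative.score(mate_score=10000)
    return abs(best - second) < gap_cp


if __name__ == "__main__":
    engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
    with open(PGN_PATH) as pgn:
        while (game := chess.pgn.read_game(pgn)) is not None:
            board = position_after_ten_moves(game)
            if looks_quiet(board, engine):
                print(board.fen())
    engine.quit()
```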
The participants (24 of 'em in all) had an average Lichess Blitz rating of about 1550 and were told that the objective was NOT to win, but to try and work out who they were playing in each of 8 positions: A person or a machine? In reality, they would face one human player (with a Lichess Blitz rating of 1136) and 3 different engines across these conditions. The engines included two versions of Stockfish (one set to Level 4 and one set to Level 20, the highest you can choose) and Maia, an engine trained to play like a human at different playing strengths (McIlroy-Young et al., 2020). I need to point out that there are definitely people here on Lichess who know MUCH more about the way that Maia works than I do, so apologies if I am doing grave violence to it with this description. The important feature is that Maia is trained on games played by people of varying skill, with the goal of predicting what a person will play when faced with a particular position. Because of this, users can choose different strengths for Maia that are intended to reflect specific human ratings. For the current study, the researchers selected a level meant to match an Elo rating of ~1100.
Unlike Stockfish, Maia uses deep learning to mimic human play at varying strengths: it is trained on data from human games to match the moves people actually play rather than the objectively best ones. (image from Maiachess.com)
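If you'd like to poke at the difference yourself, both kinds of opponent can be driven through the standard UCI protocol. Here's a rough sketch with python-chess; it assumes you have a Stockfish binary, the lc0 engine, and a downloaded Maia weights file locally (every path below is a placeholder). Stockfish's strength is dialed down via its "Skill Level" option, while Maia is typically queried with a single node so that it returns the move its network predicts a human would play.

```python
import chess
import chess.engine

# Placeholder paths - substitute your own binaries and weights file.
STOCKFISH_PATH = "/usr/local/bin/stockfish"
LC0_PATH = "/usr/local/bin/lc0"
MAIA_WEIGHTS = "maia-1100.pb.gz"  # network trained on ~1100-rated human games


def stockfish_move(board, skill_level=4, think_time=0.1):
    """Get a move from Stockfish with its strength reduced via 'Skill Level' (0-20)."""
    engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
    engine.configure({"Skill Level": skill_level})
    result = engine.play(board, chess.engine.Limit(time=think_time))
    engine.quit()
    return result.move


def maia_move(board):
    """Get a move from a Maia network run through lc0. Searching a single node
    returns the raw policy output, i.e. the move Maia predicts a human would play."""
    engine = chess.engine.SimpleEngine.popen_uci([LC0_PATH, f"--weights={MAIA_WEIGHTS}"])
    result = engine.play(board, chess.engine.Limit(nodes=1))
    engine.quit()
    return result.move


if __name__ == "__main__":
    board = chess.Board()
    print("Stockfish (skill 4):", stockfish_move(board))
    print("Maia-1100:", maia_move(board))
```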
Subjective ratings of human-ness
With these human and machine opponents lined up for each participant, players were each given 5 minutes on the clock. Participants were informed that their opponent's moves in each position would be made every 10 seconds, minimizing the usefulness of playing speed as a way of guessing who was the machine. After playing through each of the 8 positions (2 for each opponent in a randomized order), the researchers asked the participants a number of questions about who they thought they were playing. These included asking whether the opponent was a human or machine, their confidence in that guess, and the estimated strength of the opponent. You can see the results of those questions below for the 4 different opponents used here - in order: Human, Maia, Stockfish (weak), and Stockfish (strong).
Figure 4 from Eisma et al., illustrating how players responded to various questions about the 4 opponents they faced.
There is a lot to see in this picture, so let me walk you through the main results. First, in general players thought both versions of Stockfish seemed more machine-like, while Maia was not rated as any less human than a real person. This suggests that even when Stockfish is relatively easy to beat (like the Level 4 opponent), there is still something machine-like about the way it plays that is different from what Maia will do. Second, players were consistently confident about their guesses across all opponents. They were also able to identify the difference in playing strength between Level 4 and Level 20 Stockfish.
Play-based descriptors of human vs. engine performance
Besides these ratings of participants' impressions of who they were playing, the researchers also collected data about how the games unfolded to compare objective performance across different opponents. You can see these data in the bar graph below, which includes a number of clear differences across conditions in how many moves were played, the outcome of the games, and a "win rate" descriptor that the authors estimated using the engine evaluation of each game's final position (this may not match the game outcome due to time-out losses with a superior position).
Figure 3 from Eisma et al. (2024) displaying their various objective measurements of chess play across different opponents.
To my eye, the most important results here are that players fared more poorly against both versions of Stockfish (especially the stronger one, which is no surprise) and needed to take more time to plan their moves, too. Looking at these data can't help but make me a bit concerned that players' subjective impression of who they're playing likely depends a great deal on simple playing strength. Maia may appear more human-like simply because it isn't much stronger than its human opponents. I'd love to see more piloting work to help match playing strength between the human opponent, Maia, and the weaker version of Stockfish, as the current study seems to have a bit of a confound built into the design. Still, it's interesting to see how these objective descriptors relate to the subjective ratings of playing strength up above, where Stockfish at Level 4 wasn't rated as drastically stronger than Maia. There is something worth studying further here, I think, insofar as it's not clear to me how subjective impressions of playing strength relate to objective outcomes and/or impressions of human-ness - is there some threshold of relative playing strength that makes people switch their assignment of human vs. machine, for example?
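As an aside, the paper doesn't give the formula behind that "win rate" descriptor, but the usual trick is to squash a centipawn evaluation of the final position through a logistic curve to get an expected score. Here's a tiny sketch of that idea; the constant is the one Lichess uses for its accuracy metric, and the authors' actual conversion may well differ.

```python
import math


def expected_win_percent(centipawns):
    """Map a centipawn evaluation (from the side to move's point of view)
    to an expected score between 0 and 100 using a logistic curve."""
    return 50 + 50 * (2 / (1 + math.exp(-0.00368208 * centipawns)) - 1)


# A few illustrative values: an equal position, a 1-pawn edge, a clear
# advantage, and a winning position.
for cp in (0, 100, 300, 800):
    print(f"{cp:+5d} cp -> {expected_win_percent(cp):5.1f}%")
```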
Kibitzing about who your opponent is
One analysis I thought was particularly neat is the data they have from recording what players said to themselves while playing (this is in the last two panels of the blue bar graph up above). This allowed the researchers to measure how much players spoke when facing different opponents, but also how often they expressed surprise or confusion about their opponents' moves. Players tended to speak less when facing the stronger version of Stockfish, but expressed more surprise during play against the weaker version. Machine-like play appears to be more unexpected to humans and to elicit either commentary about the weirdness of a move or perhaps a bit of stunned silence as you realize you're done for.
Besides this, the authors included another natural language analysis of how participants made decisions about the identity of their opponent. What seem to be the key clues to who you're playing, both when you're right and when you're wrong? You can see their GPT-enabled summary of these comments in the table below.
Table 1 from Eisma et al. (2024) - What makes players think their opponent is a human or an engine?
I find this particularly interesting and I'd love to see a more granular look at individual moves coupled with more NLP descriptors of participants thinking aloud about their opponent at different points of the game. What you can see here is that blunders tend to be understood as a dead giveaway for humanity...except when they're not! Maia, for example, seems human because the engine makes various blunders and misses checkmates. The weaker Stockfish, however, seems like an engine because it - wait for it - makes blunders and misses obvious checkmates.
What are we to make of this? I think a ton of work is being done in these descriptions by words like "cold" and "unnatural" moves and "strange" blunders. What exactly are those? Why exactly do they seem strange to us? What is the difference between a blunder that I can understand my opponent making and a blunder that seems like it must be the result of an engine's calculation? You may have your own intuitions about this and these may point the way for future work. My feeling is that there is a nice indication from this work that there are meaningful differences between the kind of play an engine engages in and the moves that a human makes, but the criteria we use to reach those judgments are still rather elusive.
Conclusions
Overall, the data suggest that there really are such things as "computer moves," at least with regard to how Stockfish plays. The different training used to build Maia leads to more human-like play, with participants generally being tricked into thinking Maia was a human more often than they thought the same of Stockfish. Chess competence still looms as a potential spoiler here, however, so future work would absolutely benefit from closer matching of engine and human ability.
Given that there may be "computer moves," another big question for cognitive science in this domain (at least, I think!) is why even strong players have biases away from these optimal moves that they need to unlearn to get better. Many current GMs (Carlsen and Nakamura included) have talked about studying engines with the goal of understanding why certain puzzling moves do serve a purpose and incorporating those plans into their play. What makes some moves seem more natural than others such that a machine plays like a machine and a human tends not to? For now, Stockfish can clearly beat the living daylights out of us all over the board, but doesn't pass the Turing Test.
Thanks from NDPatzer!
With 2024 nearly over, I can't help but look back on the past year of writing these articles. I first started blogging about chess and cognitive science in 2023, but this year I decided I really wanted to commit to working on my science communication skills while learning more about the science of the game. I've been very excited to see so many people reading these and your comments and questions in the forum have taught me a lot about writing for a wider audience than I've been used to in my professional life. So: Thank you all very much for a great 2024 - this will be my last post for the year, but I'm hoping to have more Science of Chess content for you in the New Year!
Support Science of Chess posts!
Thanks for reading! If you're enjoying these Science of Chess posts and would like to send a small donation my way ($1-$5), you can visit my Ko-fi page here: https://ko-fi.com/bjbalas - Never expected, but always appreciated!
References
Ariza, C. (2009) The Interrogator as Critic: The Turing Test and the Evaluation of Generative Music Systems. Computer Music Journal, 33, 48-70.
Bazilinskyy, P., Sakuma, T. & de Winter, J. (2021) What driving style makes pedestrians think a passing vehicle is driving automatically? Applied Ergonomics, 95.
Daniele, A., Di Bernardi Luft, C. & Bryan-Kinns, N. (2021) What is Human? A Turing Test for Artistic Creativity. Proceedings of EvoMUSART 2021.
Eisma, Y.B., Koerts, R. & de Winter, J. (2024) Turing tests in chess: An experiment revealing the role of human subjectivity. Computers in Human Behavior Reports, 16, 100496.
McIlroy-Young, R., Sen, S., Kleinberg, J. & Anderson, A. (2020) Aligning superhuman AI with human behavior: Chess as a model system. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1677-1687.