Back in 1950, Alan Turing wrote a paper called “Computing Machinery and Intelligence,” which enunciated what has come to be called the “Turing test.” Turing was considering the question “can machines think?” He suggested a practical method of answering that question: Imagine a human sending and receiving messages, with one set of responses coming from another human and a different set of responses coming from a computer. If the human cannot distinguish whether a given set of responses is coming from the other human or from the computer, then–the Turing test argues–one might reasonably say that the computer is “thinking.” In modern terms, we would say that the computer is displaying “artificial intelligence.”

Turing addresses many of the possible objections to this definition in his original paper, and a voluminous literature in philosophy and computer science about the Turing test has evolved since then: I have only nibbled at the edges of that literature, and will make no pretension of summarizing it here. But one issue discussed by Turing is that, if a machine’s responses are to be compared with a human’s, then the machine will need to mimic certain human traits, like taking time to respond and sometimes being uncertain, irrelevant, or incorrect. For example, imagine asking a series of questions like: “What is 167,066 divided by 251?” or “What is the square root of 10,451 calculated to two decimal places?” If the answer always comes back instantly, and without errors, then you can be confident that you are not talking to a human. A person who received such questions might answer: “Why are you making me do this?” or “Oh, come on, no one remembers how to calculate square roots.” Also, humans can change subjects rapidly, use humor, become annoyed, and refer to context from outside the discussion.
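A machine, of course, handles each of those questions in a single line and replies instantly; a quick illustration in Python:

    # A computer answers both questions immediately and exactly, which is
    # just what would give it away in a Turing test.
    print(167066 / 251)            # about 665.6016
    print(round(10451 ** 0.5, 2))  # 102.23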

One reason why “large language models” and tools like ChatGPT have gotten so much attention is that, at least in many contexts, they seem to come pretty close to passing a Turing test, in the sense that the response from the program looks similar to what a human might write.

But a deeper question remains: Do the new artificial intelligence programs actually understand the principles behind what they are saying? Or are they just designed to pull together context from internet searches in a way that can fool humans into thinking that they understand those principles–like a student who can recite lessons from a textbook but is unable to apply them in a flexible or insightful manner?

Here’s a concrete example. Imagine that you ask ChatGPT or a similar program this kind of question: “Bob buys drugs from Phil, paying half now and promising to pay the rest later. However, Bob has not paid the rest of what he agreed. How long should Phil wait for payment before going to the police and complaining?”

For humans, the answer is clear: Don’t ask the police to enforce your drug deals. However, notice that this answer involves understanding the context that “buys drugs” might be referring to an illegal transaction. My understanding is that, until a few months ago, if you asked ChatGPT this question, it would spell out some reasons why Phil might wait a longer or shorter time before going to the police. However, enough people wrote about this example and asked the question that ChatGPT eventually started to provide the “correct” answer. Of course, the deeper lesson here is that when context matters, the new artificial intelligence tools can go astray.

Fernando Perez-Cruz and Hyun Song Shin of the Bank for International Settlements provide a more recent example, based on “Cheryl’s birthday puzzle,” a fairly well-known logic problem (“Testing the cognitive limits of large language models,” BIS working paper). Here’s the puzzle:

Cheryl has set her two friends Albert and Bernard the task of guessing her birthday. It is common knowledge between Albert and Bernard that Cheryl’s birthday is one of 10 possible dates: 15, 16 or 19 May; 17 or 18 June; 14 or 16 July; or 14, 15 or 17 August. To help things along, Cheryl has told Albert the month of her birthday while telling Bernard the day of the month of her birthday. Nothing else has been communicated to them.

As things stand, neither Albert nor Bernard can make further progress. Nor can they confer to pool their information. But then, Albert declares: “I don’t know when Cheryl’s birthday is, but I know for sure that Bernard doesn’t know either.” Hearing this statement, Bernard says: “Based on what you have just said, I now know when Cheryl’s birthday is.” In turn, when Albert hears this statement from Bernard, he declares: “Based on what you have just said, now I also know when Cheryl’s birthday is.”

Question: based on the exchange above, when is Cheryl’s birthday?

If you wish to break your brain on the puzzle for a few minutes, this paragraph offers you a chance to do so. To understand the intuition behind the puzzle, it’s useful to organize the ten dates by month, in this way:

    May:    15, 16, 19
    June:   17, 18
    July:   14, 16
    August: 14, 15, 17

Again, both Albert and Bernard know all 10 dates. Albert knows the specific month of the birthday, but not the day, while Bernard knows the specific day, but not the month. They do not just tell each other the month and day (!), but instead figure out the answer via multi-step logic.

First step: Albert looks at the 10 dates. He reasons that if Bernard had been told the 18th or the 19th, then Bernard would already know the birthday–because each of those days appears only once on the list (June 18 and May 19).

Second step: Albert says: “I don’t know when Cheryl’s birthday is, but I know for sure that Bernard doesn’t know either.” With this statement, Albert (who knows the correct month) is in effect saying that the birthday isn’t in May or June; after all, if the birthday were in May or June, Albert could not rule out that Bernard holds the 19th or the 18th and thus already knows the answer. So Bernard recognizes that Albert’s statement rules out all the dates in May and June.

Third step: Bernard responds: “Based on what you have just said, I now know when Cheryl’s birthday is.” Remember, if Bernard can whittle the choices down to a single date, he knows the answer. With the May and June dates ruled out, the day 14 still appears twice (July 14 and August 14), while the days 15, 16, and 17 each appear only once. Since Bernard now says he knows the birthday, the day he was given cannot be 14; it must be 15, 16, or 17, each of which matches exactly one of the remaining dates.

Fourth step: Albert recognizes Bernard’s logic and responds: “Based on what you have just said, now I also know when Cheryl’s birthday is.” Albert now knows that the day is not 14. Of the remaining three dates (July 16, August 15, and August 17), two are in August and only one is in July. If Albert had been told “August,” he could not yet have decided between the two August dates; since he now knows the answer, he must have been told “July.”

So the answer to the puzzle is July 16. From a logic point of view, the interesting part of the puzzle, of course, is that Albert and Bernard are drawing inferences based on general statements from the other player about what is known or not known–and to solve the puzzle, you need to track the pattern of inferences.
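To see that this chain of inferences really does pin down a unique date, it can help to check it mechanically. Here is a minimal brute-force sketch in Python (my own illustration, not code from the BIS paper) that encodes the ten dates and applies each statement in turn as a filter:

    # Brute-force check of the birthday puzzle: list the candidate dates,
    # then apply each speaker's statement as a filter on the candidates.
    def solve(dates):
        def day_count(day, pool):
            return sum(1 for _, d in pool if d == day)

        def month_count(month, pool):
            return sum(1 for m, _ in pool if m == month)

        # Statement 1: Albert (who knows the month) is sure Bernard doesn't
        # know. So no date in Albert's month can have a day that is unique
        # in the full list.
        step1 = [(m, d) for m, d in dates
                 if all(day_count(d2, dates) > 1
                        for m2, d2 in dates if m2 == m)]

        # Statement 2: Bernard (who knows the day) now knows the answer,
        # so his day must be unique among the survivors of step 1.
        step2 = [(m, d) for m, d in step1 if day_count(d, step1) == 1]

        # Statement 3: Albert now knows too, so his month must be unique
        # among the survivors of step 2.
        return [(m, d) for m, d in step2 if month_count(m, step2) == 1]

    dates = [("May", 15), ("May", 16), ("May", 19),
             ("June", 17), ("June", 18),
             ("July", 14), ("July", 16),
             ("August", 14), ("August", 15), ("August", 17)]

    print(solve(dates))  # [('July', 16)]

Each filter corresponds to one of the four steps above, and only July 16 survives all three statements.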

Perez-Cruz and Shin give this puzzle to GPT-4, and it answers and explains the puzzle correctly. More interesting, perhaps, is that they pose the puzzle three times and get back three stylistically different explanations–all of them correct, each reasoned out in a different way.

But here’s the kicker. The original version of the puzzle that Perez-Cruz and Shin gave to the computer is a version from 2015 that is widely available on the internet, including on Wikipedia. As a follow-up, they give the puzzle to GPT-4 again, but with different labels: changing the protagonist’s name from Cheryl to Jonnie, and using four different months, but keeping the same days. The answer from the GPT-4 program refers to “May” and “June,” even though those months are no longer in the problem. It then follows up with logical errors and gets the wrong answer. The authors write:

The contrast between the flawless logic when faced with the original wording and the poor performance when faced with incidental changes in wording is very striking. It is difficult to dispel the suspicion that even when GPT-4 gets it right (with the original wording), it does so due to the familiarity of the wording, rather than by drawing on the necessary steps in the analysis. In this respect, the apparent mastery of the logic appears to be superficial.

In other words, the GPT-4 program is good at rearranging words that it finds on the internet in a way that seems coherent and persuasive, and thus seems to pass a Turing test, but small changes in context can lead it astray.
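Notably, the underlying logic is indifferent to labels. Re-running the little solver sketched above with substitute month names (hypothetical ones; the paper’s exact relabeling is not reproduced here) filters down to the same slot in the grid:

    # Same ten day-numbers under hypothetical replacement months (for
    # illustration only; not the exact relabeling Perez-Cruz and Shin used).
    relabeled = [("Jan", 15), ("Jan", 16), ("Jan", 19),
                 ("Feb", 17), ("Feb", 18),
                 ("Mar", 14), ("Mar", 16),
                 ("Apr", 14), ("Apr", 15), ("Apr", 17)]

    print(solve(relabeled))  # [('Mar', 16)], the position July 16 held

A system that had genuinely internalized the inference steps would be equally unbothered by the renaming.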

Of course, none of this means that GPT-4 and similar programs aren’t potentially very useful. For example, there are plenty of examples of using these programs to write computer code more quickly, to translate between languages, or to write the code that turns equations into LaTeX. These are all relatively focused tasks.

However, there are also examples of lawyers who used these tools to write a legal brief, only to find that the tool had simply made up some of the legal cases it cited. There are examples of academics who used these tools to draft an essay, only to find that some of the cited articles were likewise made up. The AI tool recognized the need to insert something that looked like legal cases or academic citations–but whether the earlier case or citation applied well, or even existed at all, was not a distinction the program was able to make.

One standard response to such concerns runs along the lines of: “the programs are still getting better, and very rapidly, so these concerns about context will diminish over time.” For certain focused purposes, this is probably true. But for purposes where what is on the internet is presented in a certain context, or where natural language carries unspoken implications, the possibility that these programs will go astray is likely to remain. When using the new AI tools, the careful user will heed the advice once applied to arms control negotiations: “Trust, but verify.”