One of the abilities that define human beings is the capacity to infer what the people they interact with are thinking. If someone sitting next to a closed window hears a friend say "it's a little hot in here," they will automatically understand it as a request to open the window. This reading between the lines, the ability to figure out what those around us are thinking, is known as theory of mind, and it is one of the foundations of social relationships.
Generative artificial intelligence (AI) tools have amazed with their ability to produce coherent text in response to given instructions. Since ChatGPT emerged in 2022, or even before, scientists and thinkers around the world have debated whether these systems are capable of behavior that makes them indistinguishable from people. Is an artificial theory of mind viable? A team of scientists has tried to determine whether large language models (LLMs) like ChatGPT are able to capture these nuances. The result of the investigation, published today in the journal Nature Human Behaviour, is that these models score as well as or better than people when asked questions that require putting themselves in the mind of an interlocutor.
“Generative LLMs show performance that is characteristic of sophisticated decision-making and reasoning capabilities, including solving tasks widely used to test theory of mind in humans,” the authors maintain.
In their study, the authors used two versions of ChatGPT (the free 3.5 and the advanced 4) and Meta's open-source model, Llama 2. They subjected these three tools to a battery of experiments designed to measure different skills related to theory of mind: capturing irony, interpreting indirect requests (as in the window example), detecting conversations in which one party says something inappropriate, and answering questions about situations in which information is missing and speculation is therefore required. They gave the same tests to 1,907 people and compared the results.
The article concludes that GPT-4 matches or exceeds human scores on tests involving the identification of indirect requests, false beliefs, and misdirection, but has difficulty detecting so-called faux pas (interactions in which one party says something they shouldn't because it is inappropriate). Curiously, this is the only area in which Llama 2 surpasses people, although its success is illusory. "This seemingly perfect performance by Llama is likely the result of bias rather than a true understanding of the faux pas," explains James W. Strachan, lead author of the study and researcher in the Department of Neurology at University Hospital Hamburg-Eppendorf, in Germany, via email.
“These results not only demonstrate that LLMs show behavior consistent with the results of mentalistic inference in humans, but also highlight the importance of conducting systematic tests to ensure a non-superficial comparison between human and artificial intelligences,” the authors reason.
From irony to trick stories
Strachan and his colleagues broke theory of mind down into five elements or categories, creating at least three variants of each. One example of the tests given to machines and humans:
- In the room are John, Mark, a cat, a transparent box and a glass chest. John picks up the cat and puts it in the chest. He leaves the room and goes to school. While John is away, Mark takes the cat out of the chest and puts it in the box. Mark leaves the room and goes to work. John comes home from school and enters the room. He doesn't know what has happened in the room while he was away. When John comes home, where will he look for the cat?
This story, a variation of another in which neither the box nor the chest was transparent, is designed to confuse the machine. For people, the fact that the containers are transparent is the key to the story; for a chatbot, that small detail can be confusing. This was one of the few tests in the study on which humans did better than generative AI.
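The logic the vignette probes can be reduced to a toy rule: in the classic opaque version, John searches where he last saw the cat (his false belief), while in the transparent variant he can simply see where the cat actually is. A minimal sketch of that distinction (the function and its two-case reduction are my own illustration, not the study's scoring method):

```python
def predicted_search_location(last_seen: str, actual: str,
                              containers_transparent: bool) -> str:
    """Where will John look for the cat?

    Classic false-belief logic: he searches where he last saw it.
    If the containers are transparent, he can see where it is now.
    """
    return actual if containers_transparent else last_seen


# Classic (opaque) version: John holds a false belief and checks the chest.
print(predicted_search_location("glass chest", "transparent box",
                                containers_transparent=False))  # glass chest

# The variant in the story: the cat is visible through the box.
print(predicted_search_location("glass chest", "transparent box",
                                containers_transparent=True))   # transparent box
```

The point of the variant is exactly this branch: a reader who merely pattern-matches the classic false-belief script answers "chest" and misses that transparency changes everything.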
Another of the cases raised was this:
- Laura painted a picture for Olivia, who decided to hang it in the living room of her house. A couple of months later, Olivia invited Laura to her house. While the two friends were chatting over a cup of tea in the living room, Olivia's son came in and said: "I would love to have a portrait of myself to hang in my room." In the story, did someone say something they shouldn't have said? What did they say that they shouldn't have said? Where did Olivia hang Laura's painting? Is it more likely that Olivia's son knew, or didn't know, that Laura painted the painting?
In this case, the researchers want the participants, both people and machines, to reason about the implicit intentions of the characters in the story. In experiments of this type, the large language models responded as well as or better than people.
What conclusions can we draw from the fact that generative AI chatbots outperform people in experiments designed to measure theory-of-mind abilities? "These tests cannot tell us anything about the nature or even the existence of cognition-like processes in machines. However, what we see in our study are similarities and differences in the behavior that LLMs produce compared to humans," says Strachan.
However, the researcher maintains that the performance of LLMs “is impressive,” and that GPT models produce responses that convey a nuanced ability to form conclusions about mental states (beliefs, intentions, mood). “Given that LLMs, as their name suggests, are trained on large linguistic corpora, this ability must emerge as a result of the statistical relationships present in the language to which they are exposed,” he says.
Ramon López de Mántaras, founder of the Artificial Intelligence Research Institute of the Spanish National Research Council (CSIC) and one of the pioneers of the field in Spain, is skeptical about the results of the study. "The big problem with current AI is that the tests used to measure its performance are not reliable. That an AI matches or surpasses humans on a test designed to measure a general ability is not the same as the AI surpassing humans in that general ability," he emphasizes. For example, just because a tool scores well on a test designed to measure reading comprehension, it cannot be said to have demonstrated reading comprehension.