SAN FRANCISCO — When OpenAI introduced its chatbot ChatGPT late last year, millions were amazed by the human-like way it answered questions, wrote poetry and discussed almost any topic. But it took most people a while to realize that this new type of chatbot often makes things up.

When Google launched a chatbot several weeks later, it spouted nonsense about the James Webb Space Telescope. The next day, Microsoft’s new Bing chatbot offered false information about Mexican nightlife and the singer Billie Eilish. Then, in March, ChatGPT cited a half-dozen bogus court cases while crafting a 10-page legal brief that a lawyer submitted to a federal judge in New York.

Now, a startup called Vectara, founded by former Google employees, is trying to find out how often chatbots deviate from the truth. The company’s research estimates that even in situations designed to prevent it, chatbots make up information at least 3 percent of the time and as often as 27 percent of the time. Experts call this “hallucination.”

Because these chatbots can respond to almost any request in an unlimited number of ways, there is no way to definitively determine how often they hallucinate. “You would have to look at all the information in the world,” said Simon Hughes, the Vectara researcher who led the project.

Hughes and his team asked these systems to perform a simple, easily verifiable task: summarize news articles. Even then, the chatbots persistently invented information.

“We gave the system 10 to 20 pieces of data and asked for a summary,” said Amr Awadallah, CEO of Vectara. “That the system can still introduce errors is a fundamental problem.”

The researchers argue that when these chatbots perform tasks that go beyond simple summarization, hallucination rates may be higher.

In the research, OpenAI’s technologies had the lowest hallucination rate, around 3 percent. Systems from Meta, which owns Facebook and Instagram, had a rate of around 5 percent. The Claude 2 system from Anthropic, a San Francisco-based OpenAI rival, topped 8 percent. A Google system, Palm Chat, had the highest rate, at 27 percent.

Google declined to comment, and OpenAI and Meta did not respond to requests for comment.

The researchers hope their methods will spur industry-wide efforts to reduce hallucinations. OpenAI, Google and others are working to minimize the problem using a variety of techniques, although it is unclear whether they will be able to eliminate it.

Because the internet is full of false information, these systems, which learn from that text, repeat the same falsehoods. They also rely on probabilities, estimating, for instance, how likely it is that the next word in a sentence is “playwright.” Occasionally they guess incorrectly.

To determine how often chatbots stumbled when summarizing news articles, Vectara researchers used another large language model to verify the accuracy of each summary.

But James Zou, a computer science professor at Stanford University in California, said the language model that performs the verification can also make mistakes.

“The hallucination detector could be fooled—or hallucinate itself,” he said.

By: CADE METZ