

AI Models Still Hallucinating: Study Finds Generative AI Often Struggles With Facts


By Mackenzie Ferguson, AI Tools Researcher & Implementation Consultant

Edited by Mackenzie Ferguson

A new study reveals that even the most advanced AI models, including OpenAI's GPT-4o and Google's Gemini, still hallucinate frequently. The research, led by teams from Cornell and other institutions, tested a range of AI models on challenging questions without direct Wikipedia coverage, and found that no model consistently delivers factual accuracy.


A recent study by researchers from Cornell University, the University of Washington, the University of Waterloo, and the Allen Institute for AI has shed light on the persistence of hallucinations in generative AI models. Despite assertions from leading AI companies like OpenAI and Anthropic, the research indicates that these models still have a high propensity to produce hallucinated (i.e., incorrect or fabricated) content.

The study evaluated over a dozen modern AI models, including popular ones such as OpenAI's GPT-4, Google's Gemini, Anthropic's Claude, and Meta's Llama. No model consistently generated accurate information across topics including law, health, history, and geography. Interestingly, the models that hallucinated the least achieved this not by being more accurate, but by opting to avoid answering questions they were uncertain about.


Wenting Zhao, a doctoral student at Cornell and a co-author of the study, emphasized that trust in AI-generated content remains elusive. She highlighted that the top-performing models could only produce hallucination-free text about 35% of the time. This statistic underscores a significant gap between the current capabilities of AI models and the expectations set by their developers.
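The 35% figure treats a response as hallucination-free only if every claim in it holds up. As a rough sketch of how such a per-response rate could be computed (the graded data below is invented for illustration, not the study's actual grading pipeline):

```python
# Hypothetical claim-level grades for each model response:
# True = claim supported by evidence, False = hallucinated.
graded_responses = [
    [True, True, True],   # every claim checks out -> hallucination-free
    [True, False, True],  # one fabricated claim taints the whole response
    [True, True],
    [False],
]

def hallucination_free_rate(responses):
    """Fraction of responses in which *all* claims are supported."""
    clean = sum(1 for claims in responses if all(claims))
    return clean / len(responses)

print(hallucination_free_rate(graded_responses))  # 0.5 for this toy data
```

Under this all-or-nothing scoring, a single fabricated detail disqualifies an otherwise accurate answer, which is why even strong models score low.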

In an effort to create a rigorous benchmark for testing, the researchers designed questions that were challenging and often lacked direct references on Wikipedia, a common source for AI training. This approach aimed to simulate real-world scenarios where users seek answers to questions that can't easily be found in widely available datasets. The study revealed that models tend to perform worse when they can't lean on information from Wikipedia, indicating a heavy reliance on the platform for generating responses.
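In the simplest case, a benchmark built this way might drop any candidate question whose subject has direct Wikipedia coverage. The title index and questions below are placeholders, not the study's data:

```python
# Hypothetical stand-in for an index of Wikipedia article titles.
wikipedia_titles = {"Eiffel Tower", "Python (programming language)"}

candidate_questions = [
    {"text": "How tall is the Eiffel Tower?", "subject": "Eiffel Tower"},
    {"text": "What did the 2023 Smithville council vote decide?",
     "subject": "Smithville council vote"},
]

def non_wikipedia_questions(questions, titles):
    """Keep only questions whose subject lacks a direct Wikipedia article."""
    return [q for q in questions if q["subject"] not in titles]

hard_set = non_wikipedia_questions(candidate_questions, wikipedia_titles)
print([q["subject"] for q in hard_set])  # only the non-Wikipedia subject survives
```

A real pipeline would need fuzzier matching than exact title lookup, but the filtering idea is the same: remove questions the model could answer by recall of its most common training source.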

Among the models tested, OpenAI's GPT-4 and GPT-3.5 were noted for their comparatively lower rates of hallucination. However, the difference was marginal, and even these models struggled significantly with topics outside of common Wikipedia knowledge. Moreover, the study found that model size did not correlate strongly with accuracy; smaller models like Anthropic's Claude 3 Haiku hallucinated just as frequently as larger models like Claude 3 Opus.

This ongoing issue of hallucination has several implications for businesses relying on AI. For companies integrating AI into their operations, whether for customer service, data analysis, or content generation, the unreliability of AI outputs can pose significant risks. Erroneous output can lead to poor decision-making, misinform users, or even trigger legal repercussions, depending on the context of the misinformation.


The current study is one among many efforts to scrutinize the factual accuracy of AI models. Earlier studies, often criticized for their simplistic approach, typically focused on questions easily answered by scraping Wikipedia. By contrast, this study's more complex questioning strategy offers a better gauge of AI performance in realistic scenarios, making its findings particularly relevant for practitioners and academics alike.

The findings suggest that improvements in AI's ability to reduce hallucinations have not kept pace with the hype surrounding these technologies. While some vendors might exaggerate the capabilities of their models, Zhao suggests that part of the issue lies in the benchmarks used for evaluation. Many AI evaluations lack essential context and are not designed to capture the full spectrum of real-world queries.

To mitigate hallucinations, the study proposes that models be designed to decline to answer more often when they are uncertain, much as a person might admit gaps in their knowledge, which could increase the overall reliability of the responses they do provide. Indeed, in the researchers' tests, models like Anthropic's Claude 3 Haiku that abstained from answering more often were the most accurate on the questions they did answer.
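The trade-off the study points to can be made concrete with two numbers per model: how often it answers at all, and how accurate it is when it does. The records below are invented to illustrate the bookkeeping:

```python
# Each record: did the model attempt an answer, and if so, was it correct?
# (Invented data; abstentions are recorded as attempted=False.)
records = [
    {"attempted": True,  "correct": True},
    {"attempted": True,  "correct": False},
    {"attempted": False, "correct": None},   # model declined to answer
    {"attempted": True,  "correct": True},
]

def answer_rate(recs):
    """Fraction of questions the model attempted at all."""
    return sum(r["attempted"] for r in recs) / len(recs)

def accuracy_when_answering(recs):
    """Accuracy computed only over attempted answers, ignoring abstentions."""
    answered = [r for r in recs if r["attempted"]]
    return sum(r["correct"] for r in answered) / len(answered)

print(answer_rate(records))              # 0.75
print(accuracy_when_answering(records))  # 2/3: scored only over attempts
```

A model can raise the second number by lowering the first, which is exactly the abstention behavior the study observed in the least-hallucinating models.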

Looking forward, the researchers advocate for human-in-the-loop systems where human experts are involved in verifying and validating AI-generated content. This could be coupled with advanced fact-checking tools and mandatory citations for all factual content generated by AI. Zhao also notes the importance of policy and regulation to ensure accountability in the use of generative AI.

Businesses should be cautious when deploying AI technologies and should consider investing in robust fact-checking mechanisms to verify the outputs of AI models. By doing so, they can guard against the risks posed by hallucinations and ensure that the technology enhances rather than undermines their operations.
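One low-tech way to act on this advice is to gate AI output on a verification step, releasing a response only once every extracted claim passes a checker. The checker here is a deliberately toy placeholder for whatever fact-checking mechanism a business actually uses:

```python
def release_if_verified(response, claims, verify):
    """Release the response only if every extracted claim passes the checker;
    otherwise route it to human review. `verify` is a placeholder callable."""
    if all(verify(c) for c in claims):
        return {"status": "released", "text": response}
    return {"status": "needs_human_review", "text": response}

# Toy checker: pretend only claims in this set are verifiable.
known_facts = {"water boils at 100 C at sea level"}

result = release_if_verified(
    "Water boils at 100 C at sea level.",
    ["water boils at 100 C at sea level"],
    verify=lambda claim: claim in known_facts,
)
print(result["status"])  # released
```

The point of the gate is the failure path: anything the checker cannot confirm goes to a human instead of a customer, which is the human-in-the-loop pattern the researchers advocate.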

Overall, the study underscores the limitations of current AI models in generating reliable information. At the same time, it highlights ongoing efforts and potential pathways to improve the factual accuracy of generative AI, issuing a call to action for continued research and development in this crucial area.


