In this Q&A, we speak with Nikil Patel about his work as a Research Assistant on an Innovate UK-sponsored project focused on reducing hallucinations in Generative AI for conversational AI (cAI) applications. This collaborative project involved Algomo, YuLife, NatWest and the Universities of Edinburgh and Essex.
Nikil's role in the project involved researching hallucination detection techniques from academic literature, validating their performance with internal data, integrating them into the product’s workflow and overseeing the project's planning and execution.
What challenge or problem was the business facing and how did your project aim to solve it?
Generative AI technology can sometimes generate incorrect or fabricated answers, known as 'hallucinations.' These errors can undermine customer trust and harm a company's reputation.
Our project aimed to approach this problem in two phases:
- hallucination detection - identifying when AI-generated responses may be inaccurate
- hallucination mitigation - reducing hallucinations with simple, practical and resource-effective techniques suitable for enterprise settings
This project was a perfect fit for Algomo, the lead participant, as it enhanced their core product, a cAI platform, and addressed the market demand for trustworthy and reliable generative cAI.
How did the project address this challenge?
This project provided a dual-faceted experience, integrating perspectives from both industry and academia. In the industry setting, I engaged with real-world data and practical challenges, while academia offered exposure to advanced research and emerging innovations. This combination ensured that our approach was both research-driven and applicable in real-world business settings.
To detect hallucinations in responses generated by Large Language Models (LLMs), AI systems trained on vast amounts of text data to understand and generate human language, I used two primary methodologies: Natural Language Inference (NLI) models and LLMs as evaluators.
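For illustration only, the NLI approach can be sketched as follows: the retrieved context is treated as the premise and the generated answer as the hypothesis, and an off-the-shelf NLI model scores whether the answer is entailed by, neutral to, or contradicted by that context. The model named below is an illustrative choice, not necessarily the one used in the project.

```python
# Hedged sketch: NLI-based hallucination check.
# Assumes Hugging Face `transformers` is installed; the model choice is
# illustrative, not the project's actual configuration.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def nli_hallucination_scores(context: str, answer: str) -> dict:
    """Treat the retrieved context as the premise and the generated answer
    as the hypothesis; low entailment / high contradiction suggests the
    answer is not grounded in the context."""
    results = nli({"text": context, "text_pair": answer}, top_k=None)
    return {r["label"]: r["score"] for r in results}

scores = nli_hallucination_scores(
    context="Our refund policy allows returns within 30 days of purchase.",
    answer="You can return items within 90 days.",
)
print(scores)  # keys: CONTRADICTION, NEUTRAL, ENTAILMENT
```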
My work involved leveraging a range of AI models, including:
- closed commercial models - OpenAI's GPT-4, GPT-4o and GPT-4o-mini, and Anthropic's Claude 3.5 Sonnet
- open smaller-scale models - Meta's Llama 3 70B on Groq and Patronus AI's Lynx 8B
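The second methodology, using an LLM as an evaluator (often called "LLM-as-a-judge"), can be roughly sketched as below with one of the closed models named above. The prompt wording and JSON output schema are assumptions made for the sketch, not the project's actual implementation.

```python
# Hedged sketch: LLM-as-evaluator hallucination check with GPT-4o-mini.
# The prompt and output schema are illustrative assumptions only.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def llm_judge(context: str, answer: str) -> dict:
    prompt = (
        "You are checking a customer-support answer for hallucinations.\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "Reply as JSON with keys 'faithful' (true/false) and 'reason', "
        "judging only whether the answer is supported by the context."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

verdict = llm_judge(
    context="Premium cover includes two dental check-ups per year.",
    answer="Premium cover includes unlimited dental treatment.",
)
print(verdict)
```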
I also explored integrating these methods with traditional Natural Language Processing (NLP) techniques, including similarity matching and exact regex matching, to enhance detection accuracy. To ensure detailed performance monitoring, I incorporated LangWatch, an open-source tool. For data labelling, I utilised Label Studio, another open-source platform, to systematically annotate and evaluate the model outputs.
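As a rough illustration of these lighter-weight checks, the sketch below combines an exact regex test for concrete facts (amounts, years) with a cosine-similarity threshold over sentence embeddings. The embedding model, regex pattern and threshold are assumptions for the sketch, not project settings.

```python
# Hedged sketch: exact regex matching plus embedding similarity as cheap
# complements to NLI / LLM-based checks. Model name, pattern and the 0.6
# threshold are illustrative assumptions.
import re
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def exact_facts_present(answer: str, context: str) -> bool:
    """Check that concrete tokens such as amounts and years quoted in the
    answer literally appear in the retrieved context."""
    facts = re.findall(r"£?\d+(?:\.\d+)?%?", answer)
    return all(fact in context for fact in facts)

def semantically_similar(answer: str, context: str, threshold: float = 0.6) -> bool:
    """Flag answers whose embedding drifts too far from the context."""
    emb = embedder.encode([answer, context], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

answer = "Your plan covers up to £500 of dental work per year."
context = "The plan covers up to £500 of dental work per calendar year."
print(exact_facts_present(answer, context), semantically_similar(answer, context))
```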
What were the most important findings from this project?
The key finding from this project is that the existing hallucination detection methods, even when combined, do not perform as effectively in real-world applications as they do on established benchmarks. This highlights a critical gap between research-driven evaluations and real-world business needs.
To bridge this gap, the research community needs more complex, robust and industry-compatible datasets. This would help to accurately assess the effectiveness of hallucination detection methods and ensure their real-world applicability.
Did anything unexpected come up in your research?
One of the most intriguing findings from my research is the growing consensus that LLM hallucinations may never be fully eliminated. As LLMs continue to advance in their capabilities, the nature of their hallucinations has also evolved, shifting from obvious, easily detectable errors to more subtle and elusive ones.
This presents an ongoing challenge for hallucination detection, requiring increasingly sophisticated methods to identify and mitigate these nuanced inaccuracies.
What challenges did you face, and how did you overcome them?
One of the key challenges was handling Algomo’s multilingual real-world data to create a suitable experimental dataset. This process was time-intensive, requiring meticulous curation and preprocessing to ensure data quality and relevance.
Despite the effort involved, this challenge was addressed by prioritising a well-structured, case-specific dataset, even if limited in size, to maintain high quality and ensure meaningful evaluation outcomes.
What areas still need improvement in hallucination detection and mitigation?
Our findings suggest two key areas for improvement: better datasets and improved AI reasoning.
The research community needs more robust, industry-relevant datasets to accurately benchmark the performance of existing hallucination detection methods. For effective hallucination mitigation, having reliable confidence scores and supplementary hallucination-related information is crucial.
Integrating advanced reasoning capabilities within LLMs may offer a promising direction for reducing and mitigating hallucinations, potentially leading to more reliable and context-aware outputs.
How do you see this work influencing future AI-powered customer support systems?
My work in this project provides valuable insights into hallucinations in real-world enterprise applications, offering a detailed analysis of their occurrence and impact.
By assessing the feasibility and limitations of existing hallucination detection methods, our research lays the foundation for developing more reliable AI-powered customer support systems.
These findings can guide future advancements in integrating more effective hallucination detection and mitigation strategies in enterprise applications, making AI-driven customer service smarter and more trustworthy.
For more information about the Institute for Analytics and Data Science (IADS) and its work, visit our web pages.