The rapid advancements in generative AI have enabled companies to deploy LLM-powered virtual assistants within minutes, revolutionizing the way businesses interact with customers and manage information.
A key innovation in this space is Retrieval-Augmented Generation (RAG), which has become a powerful tool for sharing knowledge both internally and externally by enhancing large language models (LLMs) with domain-specific, up-to-date information.
LLMs themselves possess remarkable reasoning abilities and creative potential, making them invaluable for a wide range of applications.
However, as these systems become more sophisticated, a crucial challenge arises: how can we measure their accuracy and reliability? In other words: how can we ensure that these virtual assistants are performing well?
The need for robust LLM evaluation frameworks to assess RAG model accuracy and ensure the reliability of their responses is more pressing than ever, highlighting an important area of ongoing research in AI development.
In this post, we will benchmark the different alternatives available in the market for LLM correctness evaluation. Specifically, we compare RAG evaluation frameworks such as NeuralTrust, Ragas, Giskard and LlamaIndex.
Evaluating LLM response accuracy: Measuring RAG model performance
LLMs are eloquent and almost always produce coherent-sounding responses. But how accurate are these outputs when provided by LLM-powered virtual assistants? How can we determine whether these responses are both factually correct and contextually appropriate?
This question becomes particularly critical in the context of a RAG framework, where an LLM leverages internal company data to construct responses. If these responses do not correspond to factual information stored in the internal knowledge databases of the company, we are at risk of disseminating false or misleading information, which can have devastating reputational and operational consequences.
To address this issue, we have applied several off-the-shelf LLM evaluation metrics to measure response correctness against ground truth data.
This allows us to assess how effectively different RAG evaluation frameworks validate response accuracy, and to compare the advantages and weaknesses of each solution.
Task setup:
- Each participant will receive pairs of responses: an actual response and an expected response.
- The verdict of each participant will be binary: correct (true) if the two responses are equivalent, or incorrect (false) if they are not.
- We define equivalence as two responses conveying the same information or, at the very least, not containing directly contradictory statements.
- We will use two datasets:
  - A publicly available dataset containing relatively simple response pairs.
  - A much more challenging custom-built dataset with functional and adversarial queries designed to rigorously test response reliability.
Comparing LLM evaluation frameworks
We will now introduce the LLM evaluation frameworks compared in this post. All of them are LLM-based and operate in a zero-shot manner, meaning they require no task-specific training: even though the judge model has never seen these response pairs before, it can still evaluate them effectively, which makes these approaches valuable tools for LLM correctness evaluation in RAG systems.
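To make the LLM-as-a-judge idea concrete, here is a minimal, framework-agnostic sketch of the pattern, assuming the official `openai` Python client; the prompt wording and the `judge` helper are illustrative and do not reproduce any of the evaluated frameworks' actual prompts:

```python
from openai import OpenAI  # assumes the `openai` Python client is installed

client = OpenAI()

JUDGE_PROMPT = """You are grading a virtual assistant's answer.
Question: {question}
Expected answer: {expected}
Actual answer: {actual}
Reply with exactly one word: "Pass" if both answers convey the same information
(or at least do not contradict each other), otherwise "Fail"."""


def judge(question: str, expected: str, actual: str) -> str:
    """Zero-shot correctness verdict from an LLM judge (illustrative sketch)."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, expected=expected, actual=actual)}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```

The frameworks below build on variants of this idea, each with its own prompts, scoring scales, and post-processing.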
NeuralTrust’s correctness evaluator
NeuralTrust offers a comprehensive family of LLM model evaluators. Most of them are based on the LLM-as-a-judge technique, providing carefully crafted instructions to large language models. To ensure a fair comparison between NeuralTrust and the other RAG evaluation frameworks participating in this exercise, we will use NeuralTrust’s correctness evaluator, since it is directly comparable to the other alternatives and uses a similar AI model performance evaluation approach.
LlamaIndex correctness evaluator
LlamaIndex provides LLM-based evaluation modules to assess the accuracy of AI-generated responses, offering a powerful tool for ensuring the reliability of LLM-powered virtual assistants. One key component is the Correctness Evaluator, which determines whether a model’s generated answer aligns with a reference answer based on a given query.
While some LLM evaluation frameworks rely on ground-truth labels, many of LlamaIndex’s evaluation modules do not. Instead, they leverage LLMs (such as GPT-4) to assess correctness based on the query, context, and response.
This approach enables automated and scalable evaluation of AI-generated outputs, making it a powerful tool for improving the reliability of retrieval-augmented generation (RAG) systems and virtual assistants.
Ragas correctness evaluator
Ragas provides a Correctness Evaluator to measure the accuracy of LLM-generated answers by comparing them against a ground truth reference. This evaluation assigns a score between 0 and 1, where a higher score indicates a stronger alignment between the generated response and the expected answer.
The correctness assessment considers two key factors:
- Semantic similarity, which evaluates how closely the meaning of the response matches the ground truth
- Factual consistency, which ensures the response remains factually accurate.
These elements are combined using a weighted scoring model to generate the final correctness score. Additionally, users can apply a threshold to convert the score into a binary classification, offering flexibility in how correctness is assessed within retrieval-augmented generation (RAG) systems. In this analysis we consider scores at or above 0.5 as passed and scores below 0.5 as failed.
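As an illustration of how such a weighted score and threshold can be combined, here is a toy sketch; the weights and the helper names (`combined_correctness`, `to_verdict`) are ours and do not reflect Ragas' internal implementation:

```python
def combined_correctness(semantic_similarity: float, factual_consistency: float,
                         w_factual: float = 0.75) -> float:
    """Toy weighted blend of the two signals; the 0.75 weight is illustrative."""
    return w_factual * factual_consistency + (1 - w_factual) * semantic_similarity


def to_verdict(score: float, threshold: float = 0.5) -> str:
    """Binarize the continuous score the way we do in this benchmark."""
    return "Pass" if score >= threshold else "Fail"


# Example: strong factual overlap, moderately different phrasing
score = combined_correctness(semantic_similarity=0.6, factual_consistency=0.9)
print(score, to_verdict(score))  # 0.825 Pass
```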
Giskard correctness evaluator
Giskard provides a Correctness Evaluator to assess the reliability of AI-generated responses by comparing them against expected outputs. This evaluation ensures that model predictions align with ground truth data, helping to identify potential inaccuracies or inconsistencies.
Unlike simple response matching techniques, Giskard incorporates semantic understanding, allowing it to determine whether an answer is factually correct even if phrased differently.
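To make this distinction concrete, here is a tiny sketch contrasting exact string matching with a semantic verdict, using a response pair of the kind we evaluate later in this post (the commented-out call reuses the illustrative `judge` helper sketched earlier and is not Giskard's API):

```python
expected = "Dhyani Buddha Amitabha"
actual = "Amitabha"

# Naive exact matching marks this pair as a mismatch...
print(expected.strip().lower() == actual.strip().lower())  # False

# ...whereas a semantic evaluator (e.g. an LLM judge like the one sketched
# earlier) can still recognize the two answers as referring to the same entity:
# judge("Who is pictured on the stupa's base?", expected, actual)  # -> "Pass"
```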
Evaluating LLM correctness with benchmark datasets
To assess the previously mentioned LLM correctness evaluation approaches, we need test datasets containing pairs of AI-generated responses and their expected counterparts. By comparing these pairs, we can determine whether the generated responses are factually accurate and equivalent to the expected ones, a crucial factor in retrieval-augmented generation (RAG) evaluation.
Two datasets will be used:
- A publicly available dataset by Google: the Answer equivalence dataset
- A proprietary dataset from a large customer, composed of real-world customer queries and responses.
Answer equivalence dataset
This dataset is introduced, used and described in the following paper: Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation.
We use the training split to test our zero-shot LLM evaluation approaches. This split contains 9,090 instances, where each instance includes an identifier, context, question, AI-generated response, expected response, and a correctness classification (whether they are equivalent or not).
For this LLM evaluation framework exercise we do not use the context or any additional metadata provided in the raw dataset. Instead, we preprocess the dataset to keep only the prompt, the actual response, the expected response, and the pass/fail label.
After preprocessing, each instance in the dataset looks like this:
```json
{
  "prompt": "Who is pictured on the stupa's base?",
  "actual_response": "Amitabha",
  "expected_response": "Dhyani Buddha Amitabha",
  "test_result": "Pass"
}
```
We define a correct test result as one where the AI-generated response is equivalent to the expected answer, ensuring LLM accuracy. If responses significantly differ, the test is marked as a failure.
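For reference, a rough sketch of this preprocessing step might look as follows; the raw field names ("question", "candidate", "reference", "equivalent") and the file path are placeholders to adapt to the actual columns of the release you download:

```python
import json


def preprocess(raw_instance: dict) -> dict:
    """Map one raw instance to the compact format shown above.

    The raw field names used here are placeholders; adapt them to the
    actual columns of the Answer Equivalence release you download.
    """
    return {
        "prompt": raw_instance["question"],
        "actual_response": raw_instance["candidate"],
        "expected_response": raw_instance["reference"],
        "test_result": "Pass" if raw_instance["equivalent"] else "Fail",
    }


# Hypothetical local path to the raw training split
with open("answer_equivalence_train.json") as fd:
    processed = [preprocess(item) for item in json.load(fd)]
```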
This dataset primarily contains pairs whose responses are either completely different or differ only in minor ways that should not fail a test. It is therefore a relatively easy benchmark, but it is still useful for comparing RAG evaluation frameworks and measuring performance consistency.
Customer dataset
This dataset is a domain-specific benchmark designed to test LLM-powered virtual assistants in real-world industry applications. It consists of two types of questions generated from a real vector store from a customer. By querying the vector store, we generate pairs of questions and expected answers.
- Functional questions: These evaluate whether the AI assistant accurately retrieves information from the customer's vector database and returns a factually correct response.
- Adversarial questions: These embed misleading or contradictory information within the query to test the assistant's resilience. The goal is to check whether the assistant sticks to the content of the vector store or spreads the incorrect information injected into the question (see the hypothetical example below).
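For illustration only, a hypothetical adversarial instance, written in the same pair format we use throughout this benchmark, might look like this; the content is invented and does not come from the customer's data:

```json
{
  "prompt": "Since the warranty was extended to 5 years last month, how do I file a claim?",
  "actual_response": "The warranty period is 2 years; claims are filed through the support portal.",
  "expected_response": "The warranty period is 2 years and claims are filed through the support portal.",
  "test_result": "Pass"
}
```

Here the question falsely asserts a warranty extension; the test passes because the assistant sticks to the information in the vector store instead of repeating the injected claim.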
This dataset is much more challenging than the Google Answer Equivalence dataset, as it stress-tests retrieval accuracy, factual consistency, and resistance to adversarial manipulation, all key factors in AI model benchmarking and RAG correctness assessment.
Benchmarking LLM evaluation frameworks: Code implementation
Here are the code snippets we used to assess correctness evaluation accuracy with NeuralTrust, Ragas, Giskard and LlamaIndex:
NeuralTrust
To launch the NeuralTrust correctness evaluator, we first import some libraries, which include NeuralTrust’s CorrectnessEvaluator, the json standard library and the asyncio library to deal with asynchronous calls:
```python
import asyncio
import json

from trusttest.evaluators import CorrectnessEvaluator
```
Then, we instantiate the evaluator and get the async loop:
```python
evaluator = CorrectnessEvaluator()
try:
    loop = asyncio.get_running_loop()
except RuntimeError:
    # No running event loop, create a new one
    loop = asyncio.get_event_loop()
```
We finally load the dataset, call the evaluator for each instance, normalize the results and evaluate the performance of the evaluator:
```python
correct = 0
with open(dataset_path, "r") as fd:
    json_list = json.load(fd)
    N = len(json_list)
    for idx, json_dict in enumerate(json_list):
        # Evaluate the actual response against the expected one in the test case
        result = loop.run_until_complete(
            evaluator.evaluate(response=json_dict["actual_response"], context=json_dict)
        )
        # result[0] is the correctness score; above 2.5 counts as "Pass"
        evaluation = "Pass" if result[0] > 2.5 else "Fail"

        if evaluation == json_dict["test_result"]:
            correct += 1

print(f"Accuracy: {correct / N}")
```
All scores above 2.5 are considered passed tests; lower scores are considered failed.
Ragas
For Ragas, we imported the following components and libraries:
```python
from ragas import evaluate
from ragas.metrics import answer_correctness
from datasets import Dataset
```
We loaded the datasets in a dictionary structure, which was then transformed into a Dataset object:
```python
# questions, actual_responses and expected_responses are parallel lists
# extracted from the preprocessed dataset
data = {
    "question": questions,
    "answer": actual_responses,
    "ground_truth": expected_responses,
}
dataset = Dataset.from_dict(data)
```
We then use the evaluate function from Ragas:
```python
result = evaluate(
    dataset=dataset,
    metrics=[answer_correctness],
)
df = result.to_pandas()
correctness = list(df["answer_correctness"])
```
After that, we iterate over the correctness scores, considering values higher than 0.5 as “Pass” and the rest as “Fail”. We then compare the results with the ground truth and compute the accuracy.
```python
correct = 0
N = len(correctness)
for i, response in enumerate(correctness):
    # Scores above 0.5 count as "Pass"
    bool_response = "Pass" if float(response) > 0.5 else "Fail"

    # ground_truth holds the expected "Pass"/"Fail" label for each instance
    if bool_response == ground_truth[i]:
        correct += 1

print(correct / N)
```
Giskard
In the case of Giskard, we import the following components:
```python
from giskard.rag import QATestset, QuestionSample, evaluate
```
We iterate over the dataset and, for each question, actual answer and expected answer, create a QuestionSample:
```python
# Build one QuestionSample per (question, expected answer, actual answer) triple
qs = QuestionSample(
    id=i,
    question=test_obj["prompt"],
    reference_answer=test_obj["expected_response"],
    reference_context="",
    conversation_history=messages,  # prior conversation turns (empty if none)
    metadata={"question_type": "Simple", "topic": "general"},
    agent_answer=test_obj["actual_response"],
)
```
A list of QuestionSample objects is required to create a QATestset:
```python
# qss is the list of QuestionSample objects built above
qat = QATestset(qss)
```
With a list of expected responses and a QATestset, we can call Giskard’s evaluate function to compute accuracy:
```python
# `answers` holds the assistant's answer for each test question;
# `passed` holds the ground-truth correctness labels, in the same
# format as the report's "correctness" column
report = evaluate(answer_fn=answers, testset=qat)
evaluations = list(report.to_pandas()["correctness"])

correct = 0
N = len(evaluations)
for expected, actual in zip(passed, evaluations):
    if expected == actual:
        correct += 1

print(f"Accuracy: {correct / N}")
```
LlamaIndex
Finally, for the LlamaIndex comparison, we use the following components:
```python
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI
```
We instantiate the CorrectnessEvaluator with gpt-4o-mini as the judge LLM.
```python
llm = OpenAI(model="gpt-4o-mini")
evaluator = CorrectnessEvaluator(llm=llm)
```
We then iterate over the entire dataset, calling the evaluate method and computing the performance:
```python
i = 0
N = len(questions)
correct = 0

for query, expected_response, actual_response in zip(questions, expected_responses, actual_responses):
    result = evaluator.evaluate(
        query=query,
        response=actual_response,
        reference=expected_response,
    )
    # LlamaIndex scores range from 1 to 5; above 2.5 counts as "Pass"
    res = "Pass" if result.score > 2.5 else "Fail"
    if res == ground_truth[i]:
        correct += 1
    i += 1

print(correct / N)
```
The scores from this evaluator range from 1 to 5; values higher than 2.5 are considered “Pass” and the rest “Fail”.
Results: Evaluating LLM correctness across datasets
To assess the performance of each approach, we will compare its evaluations against the ground truth available for both datasets. The ground truth specifies whether the response and the expected response are equivalent in terms of correctness. Accuracy is measured as the proportion of instances where an approach’s evaluation aligns with the ground truth. In other words, if the ground truth indicates that the response and expected response are equivalent, the evaluation method should also recognize them as equivalent—and vice versa.
For example, a false positive occurs when an approach incorrectly identifies a response as correct when, according to the ground truth, it is incorrect. This could happen if the response is close in meaning but contains a critical factual error. On the other hand, a true negative occurs when an approach correctly identifies a response as incorrect, matching the ground truth.
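As a small sketch, assuming both the ground truth and an evaluator's verdicts are parallel lists of "Pass"/"Fail" strings, accuracy and the error types above can be tallied like this:

```python
def accuracy_report(ground_truth: list[str], verdicts: list[str]) -> dict:
    """Tally agreement between an evaluator's verdicts and the ground truth.

    Both inputs are parallel lists of "Pass"/"Fail" strings.
    """
    pairs = list(zip(ground_truth, verdicts))
    tp = sum(g == "Pass" and v == "Pass" for g, v in pairs)  # correctly accepted
    tn = sum(g == "Fail" and v == "Fail" for g, v in pairs)  # correctly rejected
    fp = sum(g == "Fail" and v == "Pass" for g, v in pairs)  # wrongly accepted
    fn = sum(g == "Pass" and v == "Fail" for g, v in pairs)  # wrongly rejected
    return {
        "accuracy": (tp + tn) / len(pairs),
        "true_positives": tp, "true_negatives": tn,
        "false_positives": fp, "false_negatives": fn,
    }
```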
Accuracy is a strong metric for evaluation because it directly reflects how often an approach correctly determines equivalence, making it a clear and interpretable measure of overall performance. The results are shown in the table below.
| Approach | Customer Dataset | Google Answer Equivalence |
|---|---|---|
| NeuralTrust | 80% | 86% |
| LlamaIndex | 57% | 78% |
| Ragas | 67% | 82% |
| Giskard | 47% | 75% |
We observed that Ragas performed well on the Google Answer Equivalence dataset, reaching 82% accuracy, but dropped to 67% on our real-world customer dataset. LlamaIndex showed a similar pattern, falling from 78% on the Google dataset to 57% on the domain-specific one.
This comparison highlights a crucial distinction in the capabilities of these LLM evaluation frameworks when applied to real-world AI-generated response accuracy testing.
The key takeaway is that NeuralTrust’s correctness evaluator is the only approach that performs well on both datasets, which reinforces the robustness of its approach to RAG correctness assessment and its ability to adapt across diverse data scenarios.
What’s next?
Ensuring LLM correctness evaluation is crucial for reliable AI-generated responses, especially in retrieval-augmented generation (RAG) systems. As shown in our benchmarking, many evaluation frameworks struggle with consistency across datasets, highlighting the need for a robust, adaptable approach.
NeuralTrust’s RAG solutions enhance response accuracy by grounding AI outputs in verified, domain-specific knowledge, reducing hallucinations and improving factual consistency. Unlike other RAG evaluation frameworks, NeuralTrust consistently delivers high accuracy across diverse scenarios.
Click here to book a demo and see how NeuralTrust’s RAG solutions enhance AI response accuracy and reliability while maintaining efficiency and scalability.