Testing the Tests: How to Measure RAG Evaluation Completeness

By Max Struever, Noah Broestl, Hersh Gupta, Adel Abdalla, and Rajprakash Bale
Blog Post

Enterprise GenAI projects are everywhere, and Retrieval-Augmented Generation (RAG) has become one of the most widely used methods for connecting large language models (LLMs) to company data. By pulling relevant information from internal knowledge bases before generating a response, RAG helps organizations improve factual accuracy by grounding model outputs in trusted, domain-specific knowledge.
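For readers newer to the pattern, the sketch below shows retrieve-then-generate at its simplest. The embedding model, the toy knowledge base, and the `call_llm` helper are illustrative assumptions, not a prescribed stack.

```python
# Minimal sketch of the retrieve-then-generate pattern described above.
# Assumes the sentence-transformers package; the model name, toy corpus,
# and the call_llm helper are illustrative, not prescriptive.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

knowledge_base = [
    "Employees accrue 20 days of paid leave per year.",
    "Expense reports must be filed within 30 days of purchase.",
    "Remote work requires manager approval and a signed agreement.",
]
doc_vecs = model.encode(knowledge_base, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q_vec = model.encode([question], normalize_embeddings=True)
    scores = doc_vecs @ q_vec.T  # cosine similarity (vectors are normalized)
    top = np.argsort(scores.ravel())[::-1][:k]
    return [knowledge_base[i] for i in top]

context = retrieve("How long do I have to submit an expense report?")
prompt = "Answer using only this context:\n" + "\n".join(context)
# The prompt would then go to an LLM of your choice, e.g.:
# answer = call_llm(prompt)  # hypothetical helper
```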

As organizations move from experimentation to production deployment, testing these systems for quality becomes a key step. Yet despite the need for robust testing, we rarely ask the question: How do we know whether our tests actually cover what the model should know?

Current Evaluation Frameworks Don’t Tell the Whole Story

Most RAG evaluation frameworks measure how well a model performs on the test questions it’s given; whether those questions sufficiently cover the organization’s data and use case is usually left to human judgment alone.

And if the tests themselves aren’t comprehensive, none of the metrics they generate can be trusted. A system might appear to perform well not because it’s genuinely reliable, but because its test set didn’t touch every part of the knowledge base.

Without that assurance, even the most rigorous evaluation frameworks leave room for uncertainty.

A Framework for Quantifying Semantic Test Coverage

Semantic test coverage, introduced in our recent paper published on arXiv, addresses this gap by quantifying the overlap between your test questions and the semantic concepts in your corpus. It provides a systematic way to measure how thoroughly your RAG tests cover your data, find under-tested areas, and identify where to add new test questions.

The approach borrows from traditional software testing: just as code coverage measures which parts of a system are executed during tests, semantic test coverage quantifies how well your evaluation questions reach across the key topics and concepts in your data.
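To make the analogy concrete, here is the simplest possible version of that idea, with made-up topic names. The paper defines the actual coverage metrics; treat this as intuition rather than the formula.

```python
# Intuition only: if each topic cluster plays the role of a "line of code",
# basic semantic coverage is the share of clusters reached by at least one
# test question. Topic names are invented; the paper gives the real metrics.
corpus_topics = {"leave-policy", "expenses", "remote-work", "retirement"}
topics_hit_by_tests = {"leave-policy", "expenses"}

basic_coverage = len(topics_hit_by_tests & corpus_topics) / len(corpus_topics)
print(f"Basic coverage: {basic_coverage:.0%}")  # 50% -> half the topics untested
```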

The result is a new, quantitative foundation for evaluating RAG systems, one that moves beyond just testing for correctness to also testing for completeness.

Why This New Approach Is Needed Now

With many organizations now using GenAI in production, the risks of incomplete testing have become impossible to ignore. The MIT AI Risk Repository, which tracks AI incidents, recorded roughly 250 incidents in 2024 alone, and the trend is still rising. Since RAG allows companies to augment foundation models with their own internal data, many have embedded RAG systems into critical workflows that affect customers, employees, and even regulators. In these contexts, an inaccurate or incomplete response can carry significant reputational, legal, or financial consequences.

Regulators and standards bodies like NIST are also raising expectations for transparent testing. It’s no longer enough to say a model performs well; organizations need to know what they’ve tested and where the gaps are.

Scale adds another challenge. Public benchmarks like MMLU can rely on large human review teams, but enterprise data is proprietary and often too big for manual inspection. Moreover, public benchmarks don’t evaluate an organization’s specific use cases, compounding the challenge of ensuring test coverage at scale. Companies need application-specific benchmarks that evolve with their knowledge bases.

Our semantic test coverage framework provides exactly that. It offers a repeatable, data-driven way to measure what your tests cover, and what they miss, so organizations can deploy RAG systems responsibly and with confidence.

Test coverage for an example corporate policies RAG app

How the Framework Works

Frameworks like RAGAS and ARES make it easy to judge response quality and even generate synthetic queries. Still, they don’t quantify how thoroughly your tests span your knowledge base or flag irrelevant documents and questions that skew results.

Semantic test coverage complements those frameworks by auditing the coverage of your evaluation inputs, giving teams a way to see not just how well their RAG performs, but how completely it’s been tested.

Here’s a high-level overview of the methodology, which we describe in more detail in the paper; a simplified code sketch follows the list:

  1. Map the territory: Break your knowledge base into smaller chunks and cluster them by meaning to reveal the main topics and themes in your data.
  2. Place your tests on the same map: Plot your test questions in that same semantic space to see how well they align with the content they’re meant to cover.
  3. Filter out irrelevant tests: Use outlier detection to remove questions that don’t belong, such as off-topic or mislabeled items so your coverage scores stay accurate.
  4. Measure coverage: Generate scores (basic, content-weighted, and multi-topic) that show how thoroughly your tests reach across topics and concepts in your corpus.
  5. Turn scores into action: Identify clusters (which represent related topics of source documents) with low coverage, extract the missing themes, and suggest new test questions to close the gap.
  6. Repeat: Regularly monitor to catch new areas of source data that aren’t covered by tests, as knowledge bases frequently evolve.
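As a rough illustration of steps 1 through 6, the sketch below uses off-the-shelf sentence embeddings and k-means clustering. The model name, toy corpus, cluster count, distance threshold, and weighting scheme are all assumptions made for the example; the paper specifies the clustering, outlier-detection, and scoring methods precisely, so this is not a reference implementation.

```python
# Simplified end-to-end sketch of the coverage workflow above.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy data; in practice these are your chunked knowledge base and your
# existing evaluation questions.
chunks = [
    "Employees accrue 20 days of paid leave per year.",
    "Unused leave may carry over up to five days.",
    "Expense reports must be filed within 30 days of purchase.",
    "Meals over $75 require itemized receipts.",
    "401(k) rollovers must be requested through the benefits portal.",
    "Early retirement withdrawals may incur a 10% penalty.",
]
questions = [
    "How many vacation days do I get each year?",
    "When do expense reports need to be submitted?",
]

# 1. Map the territory: embed and cluster the corpus chunks by meaning.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
n_topics = 3  # in practice, choose via silhouette score or domain review
km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(chunk_vecs)

# 2. Place tests on the same map: embed questions, assign nearest cluster.
q_vecs = model.encode(questions, normalize_embeddings=True)
q_clusters = km.predict(q_vecs)

# 3. Filter out irrelevant tests: drop questions far from every centroid
#    (a simple distance cutoff standing in for proper outlier detection).
dists = np.linalg.norm(q_vecs - km.cluster_centers_[q_clusters], axis=1)
keep = dists < 1.2  # illustrative cutoff for unit-normalized embeddings
q_clusters = q_clusters[keep]

# 4. Measure coverage: basic (share of clusters touched) and
#    content-weighted (share of chunks living in touched clusters).
covered = np.unique(q_clusters)
cluster_sizes = np.bincount(km.labels_, minlength=n_topics)
basic_coverage = len(covered) / n_topics
weighted_coverage = cluster_sizes[covered].sum() / cluster_sizes.sum()
print(f"basic={basic_coverage:.0%}, content-weighted={weighted_coverage:.0%}")

# 5. Turn scores into action: surface clusters with no test questions.
tests_per_cluster = np.bincount(q_clusters, minlength=n_topics)
for c in range(n_topics):
    if tests_per_cluster[c] == 0:
        print(f"Cluster {c}: {cluster_sizes[c]} chunks, 0 test questions")

# 6. Repeat: re-run this audit as the corpus and the test set evolve.
```

Run on the toy data above, the retirement cluster surfaces as untested, which is exactly the kind of gap the workflow is meant to expose before it matters in production.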

Turning Evaluation into Assurance

Semantic test coverage helps teams see what their evaluations miss and focus attention where it’s most needed. It turns testing into a more disciplined, repeatable process that builds confidence in a system’s reliability before it reaches production.

This framework gives organizations a scalable, systematic way to assess how well their test sets align with their underlying data. It doesn’t replace human oversight; it directs it. By making coverage measurable, teams can deploy expert reviews more intelligently, allowing organizations to scale human ingenuity.

With semantic test coverage, business and technology leaders can see how thoroughly their tests cover their data, pinpoint under-tested areas of the knowledge base, and direct expert review where it matters most.

To put this into context, the graphic below shows how semantic test coverage fits into a typical RAG lifecycle, from setting up your corpus to maintaining ongoing assurance:

How the coverage workflow fits into your RAG lifecycle

As an example, consider a bank launching a RAG assistant for retail customers. Early evaluations look strong, but coverage analysis tells a different story: the “retirement accounts and rollovers” cluster has almost no test representation, despite accounting for 12% of the corpus and being a top driver of call center traffic. The team adds a focused set of questions about fees, eligibility, and exceptions, re-runs the evaluations, and immediately sees both coverage and answer faithfulness improve. That’s the power of this framework: helping teams uncover and correct blind spots before they affect customers or business outcomes.
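To see why the per-cluster view matters, here is a back-of-the-envelope version of that scenario. Only the 12% share comes from the example; the other shares and question counts are invented, and the scoring mirrors the simplified metrics sketched earlier rather than the paper’s exact definitions.

```python
# Illustrative numbers: only the 12% retirement share comes from the
# example above; everything else is invented for this sketch.
corpus_share = {"cards": 0.30, "mortgages": 0.28, "savings": 0.30,
                "retirement_rollovers": 0.12}
questions_per_topic = {"cards": 40, "mortgages": 35, "savings": 30,
                       "retirement_rollovers": 0}

basic = sum(n > 0 for n in questions_per_topic.values()) / len(corpus_share)
weighted = sum(share for topic, share in corpus_share.items()
               if questions_per_topic[topic] > 0)
print(f"basic={basic:.0%}, content-weighted={weighted:.0%}")  # 75%, 88%
# Headline scores still look respectable, but the per-topic counts expose
# zero questions against 12% of the corpus -- the blind spot the team closes.
```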

Putting the Framework into Action

This framework isn’t about chasing a single “magic score” that guarantees safety or perfection. It’s a diagnostic tool for seeing how fully your test questions reflect the knowledge your RAG system depends on. The results are directional, showing where coverage is thin so teams can focus their improvements where they’ll have the greatest impact.

It doesn’t replace human judgment or output-quality checks. Those remain essential. Instead, it helps focus that human effort, pointing experts toward the areas that need closer attention.

It’s also not a replacement for existing RAG evaluation frameworks. It works alongside them, adding the missing piece that makes those assessments more reliable by ensuring you’re testing the right things.

To apply this framework, build on your existing evaluation workflows and governance practices rather than standing up a separate process.
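One practical starting point is a coverage gate in the evaluation jobs you already run. The sketch below is hypothetical: `coverage_gate`, its inputs, and the 85% threshold are placeholders to adapt, and `compute_coverage` stands in for whichever coverage implementation you adopt (such as the one sketched earlier).

```python
# Hypothetical coverage gate for an existing evaluation pipeline.
# `compute_coverage` is a placeholder for your chosen implementation;
# the 0.85 threshold is an example value to tune, not a recommendation.
def coverage_gate(weighted_coverage: float, uncovered_topics: list[str],
                  min_coverage: float = 0.85) -> None:
    """Block promotion when semantic test coverage falls below the bar."""
    if uncovered_topics:
        print("Untested topics:", ", ".join(uncovered_topics))
    if weighted_coverage < min_coverage:
        raise SystemExit(
            f"Coverage gate failed: {weighted_coverage:.0%} < {min_coverage:.0%}"
        )

# Example wiring inside a scheduled evaluation job:
# weighted, uncovered = compute_coverage(corpus_chunks, test_questions)
# coverage_gate(weighted, uncovered)
```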

Integrating It into Responsible GenAI Practices

As GenAI systems move deeper into core business operations, organizations need more than performance metrics; they need proof of completeness. Semantic test coverage provides that missing layer of visibility. By embedding coverage analysis into evaluation pipelines, companies can strengthen governance, improve reliability, and scale GenAI responsibly without slowing innovation.

If you can measure what your tests miss, you can fix it before your customers ever notice. While RAG made enterprise GenAI practical, semantic test coverage makes it predictable.

BCG X has published the full mathematical foundation of this framework and is already helping organizations put it into practice. Our team is working with clients to assess the completeness of their GenAI testing, integrate coverage analysis into existing evaluation workflows, and design governance processes that align with emerging standards.

This work is about more than improving performance. It’s about creating systems that people can trust. And trust in GenAI starts with knowing what you’ve tested, and what you haven’t. As GenAI becomes increasingly embedded in everyday business workflows, it’s completeness that will define confidence.