Testing the Tests: How to Measure RAG Evaluation Completeness

By Max Struever, Noah Broestl, Hersh Gupta, Adel Abdalla, and Rajprakash Bale
Blog Post

Enterprise GenAI projects are everywhere, and Retrieval-Augmented Generation (RAG) has become one of the most widely used methods for connecting large language models (LLMs) to company data. By pulling relevant information from internal knowledge bases before generating a response, RAG helps organizations improve factual accuracy by grounding model outputs in trusted, domain-specific knowledge.
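For readers newer to the pattern, the sketch below shows retrieve-then-generate at its simplest. The embedding model, the toy knowledge base, and the `call_llm` helper are illustrative assumptions, not a prescribed stack.

```python
# Minimal sketch of the retrieve-then-generate pattern described above.
# Assumes the sentence-transformers package; the model name, toy corpus,
# and the call_llm helper are illustrative, not prescriptive.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

knowledge_base = [
    "Employees accrue 20 days of paid leave per year.",
    "Expense reports must be filed within 30 days of purchase.",
    "Remote work requires manager approval and a signed agreement.",
]
doc_vecs = model.encode(knowledge_base, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q_vec = model.encode([question], normalize_embeddings=True)
    scores = doc_vecs @ q_vec.T  # cosine similarity (vectors are normalized)
    top = np.argsort(scores.ravel())[::-1][:k]
    return [knowledge_base[i] for i in top]

context = retrieve("How long do I have to submit an expense report?")
prompt = "Answer using only this context:\n" + "\n".join(context)
# The prompt would then go to an LLM of your choice, e.g.:
# answer = call_llm(prompt)  # hypothetical helper
```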

As organizations move from experimentation to production deployment, testing these systems for quality becomes a key step. Yet despite the need for robust testing, we rarely ask the question: How do we know whether our tests actually cover what the model should know?

Current Evaluation Frameworks Don’t Tell the Whole Story

Most RAG evaluation frameworks measure how well a model performs on the test questions it’s given; whether those questions sufficiently cover the organization’s data and use case is usually left to human judgment alone.

And if the tests themselves aren’t comprehensive, none of the metrics they generate can be trusted. A system might appear to perform well not because it’s genuinely reliable, but because its test set didn’t touch every part of the knowledge base.

Without that assurance, even the most rigorous evaluation frameworks leave room for uncertainty.

A Framework for Quantifying Semantic Test Coverage

Semantic test coverage, introduced in our recent paper published on arXiv, addresses this gap by quantifying the overlap between your test questions and the semantic concepts in your corpus. It provides a systematic way to measure how thoroughly your RAG tests cover your data, find under-tested areas, and identify where to add new test questions.

The approach borrows from traditional software testing: just as code coverage measures which parts of a system are executed during tests, semantic test coverage quantifies how well your evaluation questions reach across the key topics and concepts in your data.
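To make the analogy concrete, here is the simplest possible version of that idea, with made-up topic names. The paper defines the actual coverage metrics; treat this as intuition rather than the formula.

```python
# Intuition only: if each topic cluster plays the role of a "line of code",
# basic semantic coverage is the share of clusters reached by at least one
# test question. Topic names are invented; the paper gives the real metrics.
corpus_topics = {"leave-policy", "expenses", "remote-work", "retirement"}
topics_hit_by_tests = {"leave-policy", "expenses"}

basic_coverage = len(topics_hit_by_tests & corpus_topics) / len(corpus_topics)
print(f"Basic coverage: {basic_coverage:.0%}")  # 50% -> half the topics untested
```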

The result is a new, quantitative foundation for evaluating RAG systems, one that moves beyond just testing for correctness to also testing for completeness.

Why This New Approach Is Needed Now

With many organizations now using GenAI in production, the risks of incomplete testing have become impossible to ignore. The MIT AI Risk Repository, which tracks AI incidents, recorded roughly 250 incidents in 2024 alone, and the trend is still rising. Since RAG allows companies to augment foundation models with their own internal data, many have embedded RAG systems into critical workflows that affect customers, employees, and even regulators. In these contexts, an inaccurate or incomplete response can carry significant reputational, legal, or financial consequences.

Regulators and standards bodies like NIST are also raising expectations for transparent testing. It’s no longer enough to say a model performs well; organizations need to know what they’ve tested and where the gaps are.

Scale adds another challenge. Public benchmarks like MMLU can rely on large human review teams, but enterprise data is proprietary and often too big for manual inspection. Moreover, public benchmarks don’t evaluate an organization’s specific use cases, compounding the challenge of ensuring test coverage at scale. Companies need application-specific benchmarks that evolve with their knowledge bases.

Our semantic test coverage framework provides exactly that. It offers a repeatable, data-driven way to measure what your tests cover, and what they miss, so organizations can deploy RAG systems responsibly and with confidence.

Test coverage for an example corporate policies RAG app

How the Framework Works

Frameworks like RAGAS and ARES make it easy to judge response quality and even generate synthetic queries. Still, they don’t quantify how thoroughly your tests span your knowledge base or flag irrelevant documents and questions that skew results.

Semantic test coverage complements those frameworks by auditing the coverage of your evaluation inputs, giving teams a way to see not just how well their RAG performs, but how completely it’s been tested.

Here’s a high-level overview of the methodology, which we describe in more detail in the paper; a simplified code sketch follows the list:

  1. Map the territory: Break your knowledge base into smaller chunks and cluster them by meaning to reveal the main topics and themes in your data.
  2. Place your tests on the same map: Plot your test questions in that same semantic space to see how well they align with the content they’re meant to cover.
  3. Filter out irrelevant tests: Use outlier detection to remove questions that don’t belong, such as off-topic or mislabeled items so your coverage scores stay accurate.
  4. Measure coverage: Generate scores (basic, content-weighted, and multi-topic) that show how thoroughly your tests reach across topics and concepts in your corpus.
  5. Turn scores into action: Identify clusters (which represent related topics of source documents) with low coverage, extract the missing themes, and suggest new test questions to close the gap.
  6. Repeat: Regularly monitor to catch new areas of source data that aren’t covered by tests, as knowledge bases frequently evolve.
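As a rough illustration of steps 1 through 6, the sketch below uses off-the-shelf sentence embeddings and k-means clustering. The model name, toy corpus, cluster count, distance threshold, and weighting scheme are all assumptions made for the example; the paper specifies the clustering, outlier-detection, and scoring methods precisely, so this is not a reference implementation.

```python
# Simplified end-to-end sketch of the coverage workflow above.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy data; in practice these are your chunked knowledge base and your
# existing evaluation questions.
chunks = [
    "Employees accrue 20 days of paid leave per year.",
    "Unused leave may carry over up to five days.",
    "Expense reports must be filed within 30 days of purchase.",
    "Meals over $75 require itemized receipts.",
    "401(k) rollovers must be requested through the benefits portal.",
    "Early retirement withdrawals may incur a 10% penalty.",
]
questions = [
    "How many vacation days do I get each year?",
    "When do expense reports need to be submitted?",
]

# 1. Map the territory: embed and cluster the corpus chunks by meaning.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
n_topics = 3  # in practice, choose via silhouette score or domain review
km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(chunk_vecs)

# 2. Place tests on the same map: embed questions, assign nearest cluster.
q_vecs = model.encode(questions, normalize_embeddings=True)
q_clusters = km.predict(q_vecs)

# 3. Filter out irrelevant tests: drop questions far from every centroid
#    (a simple distance cutoff standing in for proper outlier detection).
dists = np.linalg.norm(q_vecs - km.cluster_centers_[q_clusters], axis=1)
keep = dists < 1.2  # illustrative cutoff for unit-normalized embeddings
q_clusters = q_clusters[keep]

# 4. Measure coverage: basic (share of clusters touched) and
#    content-weighted (share of chunks living in touched clusters).
covered = np.unique(q_clusters)
cluster_sizes = np.bincount(km.labels_, minlength=n_topics)
basic_coverage = len(covered) / n_topics
weighted_coverage = cluster_sizes[covered].sum() / cluster_sizes.sum()
print(f"basic={basic_coverage:.0%}, content-weighted={weighted_coverage:.0%}")

# 5. Turn scores into action: surface clusters with no test questions.
tests_per_cluster = np.bincount(q_clusters, minlength=n_topics)
for c in range(n_topics):
    if tests_per_cluster[c] == 0:
        print(f"Cluster {c}: {cluster_sizes[c]} chunks, 0 test questions")

# 6. Repeat: re-run this audit as the corpus and the test set evolve.
```

Run on the toy data above, the retirement cluster surfaces as untested, which is exactly the kind of gap the workflow is meant to expose before it matters in production.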

Turning Evaluation into Assurance

Semantic test coverage helps teams see what their evaluations miss and focus attention where it’s most needed. It turns testing into a more disciplined, repeatable process that builds confidence in a system’s reliability before it reaches production.

This framework gives organizations a scalable, systematic way to assess how well their test sets align with their underlying data. It doesn’t replace human oversight; it directs it. By making coverage measurable, teams can deploy expert reviews more intelligently, allowing organizations to scale human ingenuity.

With semantic test coverage, business and technology leaders can see how thoroughly their tests cover their data, pinpoint under-tested areas of the knowledge base, and direct expert review where it matters most.

To put this into context, the graphic below shows how semantic test coverage fits into a typical RAG lifecycle, from setting up your corpus to maintaining ongoing assurance:

How the coverage workflow fits into your RAG lifecycle

As an example, consider a bank launching a RAG assistant for retail customers. Early evaluations look strong, but coverage analysis tells a different story: the “retirement accounts and rollovers” cluster has almost no test representation, despite accounting for 12% of the corpus and being a top driver of call center traffic. The team adds a focused set of questions about fees, eligibility, and exceptions, re-runs the evaluations, and immediately sees both coverage and answer faithfulness improve. That’s the power of this framework: helping teams uncover and correct blind spots before they affect customers or business outcomes.
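To see why the per-cluster view matters, here is a back-of-the-envelope version of that scenario. Only the 12% share comes from the example; the other shares and question counts are invented, and the scoring mirrors the simplified metrics sketched earlier rather than the paper’s exact definitions.

```python
# Illustrative numbers: only the 12% retirement share comes from the
# example above; everything else is invented for this sketch.
corpus_share = {"cards": 0.30, "mortgages": 0.28, "savings": 0.30,
                "retirement_rollovers": 0.12}
questions_per_topic = {"cards": 40, "mortgages": 35, "savings": 30,
                       "retirement_rollovers": 0}

basic = sum(n > 0 for n in questions_per_topic.values()) / len(corpus_share)
weighted = sum(share for topic, share in corpus_share.items()
               if questions_per_topic[topic] > 0)
print(f"basic={basic:.0%}, content-weighted={weighted:.0%}")  # 75%, 88%
# Headline scores still look respectable, but the per-topic counts expose
# zero questions against 12% of the corpus -- the blind spot the team closes.
```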

Putting the Framework into Action

This framework isn’t about chasing a single “magic score” that guarantees safety or perfection. It’s a diagnostic tool for seeing how fully your test questions reflect the knowledge your RAG system depends on. The results are directional, showing where coverage is thin so teams can focus their improvements where they’ll have the greatest impact.

It doesn’t replace human judgment or output-quality checks. Those remain essential. Instead, it helps focus that human effort, pointing experts toward the areas that need closer attention.

It’s also not a replacement for existing RAG evaluation frameworks. It works alongside them, adding the missing piece that makes those assessments more reliable by ensuring you’re testing the right things.

To apply this framework, build on your existing evaluation workflows and governance practices rather than standing up a separate process.
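One practical starting point is a coverage gate in the evaluation jobs you already run. The sketch below is hypothetical: `coverage_gate`, its inputs, and the 85% threshold are placeholders to adapt, and `compute_coverage` stands in for whichever coverage implementation you adopt (such as the one sketched earlier).

```python
# Hypothetical coverage gate for an existing evaluation pipeline.
# `compute_coverage` is a placeholder for your chosen implementation;
# the 0.85 threshold is an example value to tune, not a recommendation.
def coverage_gate(weighted_coverage: float, uncovered_topics: list[str],
                  min_coverage: float = 0.85) -> None:
    """Block promotion when semantic test coverage falls below the bar."""
    if uncovered_topics:
        print("Untested topics:", ", ".join(uncovered_topics))
    if weighted_coverage < min_coverage:
        raise SystemExit(
            f"Coverage gate failed: {weighted_coverage:.0%} < {min_coverage:.0%}"
        )

# Example wiring inside a scheduled evaluation job:
# weighted, uncovered = compute_coverage(corpus_chunks, test_questions)
# coverage_gate(weighted, uncovered)
```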

Integrating It into Responsible GenAI Practices

As GenAI systems move deeper into core business operations, organizations need more than performance metrics; they need proof of completeness. Semantic test coverage provides that missing layer of visibility. By embedding coverage analysis into evaluation pipelines, companies can strengthen governance, improve reliability, and scale GenAI responsibly without slowing innovation.

If you can measure what your tests miss, you can fix it before your customers ever notice. While RAG made enterprise GenAI practical, semantic test coverage makes it predictable.

BCG X has published the full mathematical foundation of this framework and is already helping organizations put it into practice. Our team is working with clients to assess the completeness of their GenAI testing, integrate coverage analysis into existing evaluation workflows, and design governance processes that align with emerging standards.

This work is about more than improving performance. It’s about creating systems that people can trust. And trust in GenAI starts with knowing what you’ve tested, and what you haven’t. As GenAI becomes increasingly embedded in everyday business workflows, it’s completeness that will define confidence.