Managing Director & Partner
Berlin
Generative AI is rapidly increasing efficiency, productivity, and quality across industries and, in the process, uncovering new and innovative revenue streams. From deploying GenAI in everyday tasks to inventing new business models for entire functions from customer service to engineering, companies that implement GenAI stand to gain clear market advantage. Achieving full value from GenAI applications, however, requires the careful implementation of a combination of human-based plus automated testing and evaluation (T&E). Only then can companies be sure that their GenAI applications maximize value and minimize risk.
Transitioning from POC to Market-Ready Enterprise Solution
As companies integrate GenAI into their tools and operations, they are discovering a critical gap between developing innovative proofs of concept (POC) and launching them into the market as reliable, enterprise-level solutions that drive tangible impact. A key demand of such scaling is that GenAI systems operate as intended, producing accurate and reliable outputs and demonstrating:
Human Testing Is No Longer Enough
To reap the full benefits of enterprise-level GenAI systems, companies must scale their current approaches to testing and evaluation. The challenge is that as GenAI systems evolve and are able to manage a broad range of use cases, the associated risks also increase and evolve. Human-based T&E alone is not powerful enough to map such a rapidly increasing risk landscape. At the same time, automated T&E can execute at a very high level, but it cannot fully capture the human nuances, insights, and expertise necessary for effective risk management. The solution lies in leveraging the unique strengths of both humans and machines.
Human testing, for example, may reveal a specific user input that yields a toxic output. Automated testing can then create hundreds or thousands of variations of that input to measure how often that failure is likely to occur. In this example, the human has played a vital role in discovering the fault but would need hours or days to compile a similar list of input variations. Automated testing can accomplish the task in a matter of minutes.
In the “Human + Machine” T&E scenario, humans contribute the ability to:
… while automated testing:
GenAI Must be Challenged, Assailed
The T&E process is often accomplished using red teaming, defined as the organized process of probing, testing, and attacking GenAI systems from an adversarial stance. To make sure these systems both deliver maximum value and are safe, data scientists and engineers need tools that enable them to conduct red teaming using this combination of human-based and automated testing.
ARTKIT, an open-source T&E toolkit by BCG X, helps BCG clients conduct red teaming. A BCG client recently used ARTKIT to subject a chatbot that automatically processes specific HR requests from among its 26,000 employees to both human-based and automated testing. The toolkit’s proficiency tests identified nonsensical responses to specific, well-intentioned questions and ultimately led to meaningful improvements in response accuracy and completeness. It simultaneously observed that existing guardrails were far too stringent which sacrificed usability by refusing both malicious and well-intentioned prompts. Iterative evaluation eventually helped the client find and maintain the right balance between safety and function.
ARTKIT is also used to scale testing of extended, multi-turn interactions between GenAI systems and end users. GenAI systems can have difficulty maintaining context, coherence, and appropriate responses across multi-turn interactions. A user might introduce new information, changes of subject, contradictions, ambiguities, or combine text and images across multiple turns. Research has also shown that GenAI systems are more likely to fail as the length of a conversation increases due to inherent challenges in managing context with long interaction histories. Inconsistent responses or loss of context by the GenAI system can frustrate users, reduce the effectiveness of the extended interaction, and reduce confidence in system reliability.
Get GenAI Apps to Market Quickly and Proficiently
A hybrid solution like ARTKIT is a key enabler for successful test automation. Data scientists and engineers also need critical thinking, creativity, and an understanding of the full risk surface of their use case and domain to quickly identify issues and proactively derisk builds. This, in turn, can lead to accelerated user acceptance and greater confidence in the final product. The end goal is to help business decision makers and leaders harness the full power of GenAI, knowing that the results will be safe and equitable — and will deliver measurable, meaningful business impact.