" "

Text-to-SQL: Giving Users Natural Language Access to Data

In the world of data and analytics, the ability to converse with data using natural language can seem almost like science fiction. But in this article, we’ll showcase a real-life implementation of a GenAI solution that enables business users to access data using natural language, a technique known as text-to-SQL.

Context and Objective

Our journey began with a challenge: How could we empower business users, irrespective of their technical skills, to access and analyze data through simple natural language queries? This was not just about technological innovation; it was about removing barriers to data and enabling a culture in which data-driven insights are within everyone's reach.

Since our client was a highly regulated entity, we had to engage with regulatory compliance and data security teams during the design of our solution. As a result, using more capable commercial models such as OpenAI GPT-4 and Anthropic Claude was not an option. Instead, we decided to push the AI frontier by fine-tuning open-source Large Language Models (LLMs) to be adept SQL "data analysts" and hosting the models on the client's Virtual Private Cloud (VPC). This decision ensured the security of sensitive data and gave us full control over model behavior.

High-Level Solution Design

Recognizing the complexity and scale of our goal, we began by selecting a leading GenAI platform, Scale AI. By providing a full-stack platform solution, Scale gave us a great starting point to host, fine-tune, evaluate, and deploy the text-to-SQL models.

Our first step was to design the solution architecture, which included the frontend application, the LLM layer, and the target database. Leveraging the capabilities of Scale's GenAI platform and using public text-to-SQL datasets, we were able to rapidly test a variety of top-performing, open-source models including Llama 2, Falcon, Mistral, and Code Llama.
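To make this concrete, the sketch below shows the request path implied by that architecture, a minimal illustration rather than our production code: a SQLite connection stands in for the target database, and the `generate_sql` callable is a hypothetical stand-in for the fine-tuned model.

```python
# Minimal sketch of the end-to-end request path: the frontend passes a
# natural-language question to the LLM layer, whose SQL is executed against
# the target database. `generate_sql` is a hypothetical stand-in for the
# fine-tuned text-to-SQL model; SQLite is used purely for illustration.
import sqlite3
from typing import Callable

def answer_question(question: str,
                    generate_sql: Callable[[str], str],
                    conn: sqlite3.Connection) -> tuple[str, list]:
    sql = generate_sql(question)          # LLM layer produces SQL
    rows = conn.execute(sql).fetchall()   # target database executes it
    return sql, rows                      # frontend renders both to the user
```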

In parallel, we defined a compact but representative real-world evaluation dataset that allowed us to rapidly experiment and improve our solution. Soon, however, we encountered several unforeseen challenges.

Tackling Complications Head-On

Benchmarks

Our first challenge came before we even started. During the preliminary phase of exploring benchmarks for text-to-SQL capabilities, we initially turned to the most widely recognized public benchmark for this problem: the Yale Spider Challenge. At first glance, the Spider leaderboard suggested that the text-to-SQL problem was largely solved, with teams consistently achieving accuracies above 80%.

This perception proved to be wrong. The Spider dataset, while rigorous, is predominantly tailored to academic research and fails to reflect the messiness and complexity of a real-world enterprise data model, which might contain inconsistent data structures from legacy systems, unclear foreign key relationships, semantically ambiguous column names, inconsistent field names, or data quality issues. When we evaluated models against real-world data, performance dropped sharply, revealing a significant gap between academic benchmarks and real-world complexity.
 
Accuracy Gaps

We also found that open-source models, unlike top commercial models reported to have on the order of 1.5-2 trillion parameters, generally have significantly fewer parameters, which leads to major gaps in accuracy. Despite pre-training on public text-to-SQL datasets, these open-source models lack the out-of-the-box proficiency observed in their top-tier commercial counterparts.

To improve accuracy, we designed a state-of-the-art Retrieval-Augmented Generation (RAG) system, complemented by model fine-tuning. The RAG framework enhanced our text-to-SQL model by injecting pertinent domain knowledge, including translations of business terminology, formulas, and database schema details, directly into the prompt. We then added model fine-tuning and In-Context Learning (ICL) examples to further boost accuracy on high-complexity requests.
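As an illustration of the RAG step, the sketch below assembles a prompt from retrieved schema DDL, business-glossary translations, and in-context examples. The `RetrievedContext` structure and all names are our own illustrative assumptions, not the client's actual implementation; in practice, retrieval would come from a vector store keyed on the user's question.

```python
# Illustrative sketch of RAG-style prompt assembly for text-to-SQL.
# All structures and names here are hypothetical assumptions.
from dataclasses import dataclass

@dataclass
class RetrievedContext:
    schema_ddl: str        # CREATE TABLE statements for the relevant tables
    glossary: list[str]    # business-term -> column/formula translations
    examples: list[str]    # similar question/SQL pairs for in-context learning

def build_prompt(question: str, ctx: RetrievedContext) -> str:
    """Inject retrieved domain knowledge directly into the model prompt."""
    glossary = "\n".join(f"- {item}" for item in ctx.glossary)
    examples = "\n\n".join(ctx.examples)
    return (
        "You are a SQL data analyst. Answer using only the tables below.\n\n"
        f"### Schema\n{ctx.schema_ddl}\n\n"
        f"### Business glossary\n{glossary}\n\n"
        f"### Examples\n{examples}\n\n"
        f"### Question\n{question}\n### SQL\n"
    )

# Example usage with a toy schema and a single ICL example.
ctx = RetrievedContext(
    schema_ddl="CREATE TABLE orders (order_id INT, region TEXT, net_rev REAL);",
    glossary=["'net revenue' maps to orders.net_rev"],
    examples=["Q: Total net revenue by region\n"
              "SQL: SELECT region, SUM(net_rev) FROM orders GROUP BY region;"],
)
print(build_prompt("What was net revenue in the West region?", ctx))
```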

Data Quality

Furthermore, generative AI solutions, compared to traditional machine learning systems, rely heavily on unstructured data sources often produced by humans, and inherent human subjectivity introduces a significant degree of ambiguity into that data. Beyond ambiguity in training and evaluation data, the performance of human-generated SQL queries can also cause downstream challenges. Inefficiently written queries, for instance, can lead to extended run times and degrade the overall user experience.

To overcome these challenges, we dedicated considerable time and effort to meticulously review and refine our data, and then re-trained our models on this improved data. We also established a robust data management and continuous improvement process designed to incorporate human-in-the-loop data governance, a strategy particularly well-suited for generative AI solutions.
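Part of such a review process can be automated. As one illustration, the sketch below flags query patterns that commonly cause slow run times so a human reviewer can rewrite them before the data is used for training; the rules shown are hypothetical examples, not our actual checks.

```python
# Hypothetical lint pass for training-data review: flag SQL patterns that
# commonly cause slow execution so a human reviewer can rewrite them.
# The rules shown are illustrative, not the actual checks used.
import re

INEFFICIENCY_RULES = {
    "select_star": re.compile(r"\bSELECT\s+\*", re.IGNORECASE),
    "leading_wildcard_like": re.compile(r"\bLIKE\s+'%", re.IGNORECASE),
    "implicit_cross_join": re.compile(r"\bFROM\s+\w+\s*,\s*\w+", re.IGNORECASE),
}

def flag_inefficiencies(sql: str) -> list[str]:
    """Return the names of every rule a candidate training query violates."""
    return [name for name, pattern in INEFFICIENCY_RULES.items()
            if pattern.search(sql)]

# This candidate trips all three rules and would be routed to a reviewer.
print(flag_inefficiencies(
    "SELECT * FROM orders, customers WHERE name LIKE '%corp'"))
# -> ['select_star', 'leading_wildcard_like', 'implicit_cross_join']
```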

Model Evaluation

Finally, finding a metric that could accurately measure output quality proved to be as difficult as it was critical. Traditional accuracy metrics such as Execution Accuracy and Exact Matching could not fully capture the nuance and complexity of user interactions with the system. We needed a metric that could capture both the binary nature of correctness and the nuanced spectrum of human judgment. (See below for an example.)

To evaluate the solution appropriately, we deployed a metric system that covered diverse aspects of "accuracy." The more stringent metrics (e.g., Execution Accuracy) were used to assess baseline accuracy, while custom-built auto-eval metrics leveraging a separate LLM were used to approximate human interpretation. Other metrics such as Exact Matching and Answer Similarity offered additional ways to assess model quality.
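For example, a predicted query that returns the correct rows with an extra descriptive column, or in a different order, scores zero under Exact Matching yet would typically be accepted by a human analyst. The sketch below illustrates this layered approach with a strict execution check alongside an LLM-based judge; `judge_llm` is a hypothetical callable, not a specific library API, and SQLite stands in for the target database.

```python
# Illustrative sketch of layered evaluation: a strict, binary execution check
# plus a softer LLM-based judge. `judge_llm` is a hypothetical callable that
# returns a numeric score as text; SQLite is used for illustration.
import sqlite3
from typing import Callable

def execution_accuracy(conn: sqlite3.Connection,
                       predicted_sql: str, gold_sql: str) -> bool:
    """Strict metric: do both queries return the same result set?"""
    try:
        pred = sorted(conn.execute(predicted_sql).fetchall(), key=repr)
        gold = sorted(conn.execute(gold_sql).fetchall(), key=repr)
        return pred == gold
    except sqlite3.Error:
        return False  # a query that fails to execute counts as wrong

def llm_judged_score(question: str, predicted_sql: str, gold_sql: str,
                     judge_llm: Callable[[str], str]) -> float:
    """Softer metric: a separate LLM approximates human judgment, e.g. giving
    partial credit for a correct answer with extra columns or ordering."""
    rubric = (
        f"Question: {question}\n"
        f"Gold SQL: {gold_sql}\n"
        f"Predicted SQL: {predicted_sql}\n"
        "On a 0.0-1.0 scale, does the prediction answer the question?"
    )
    return float(judge_llm(rubric))
```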

Our Approach in Summary

A combination of data enhancement, RAG, fine-tuning, and Reinforcement Learning from Human Feedback (RLHF) enabled us to tackle all of these issues head-on. Through iterative experimentation and refinement, our team improved text-to-SQL model accuracy, reaching acceptable performance levels in a relatively short timeframe.

Beyond Technology: Change Management and Responsible AI

This journey wouldn't be complete without acknowledging the human element. Integrating this cutting-edge AI solution into business users' everyday workflows required careful change management and training. Partnering closely with the client team at each step, we conducted user interviews, crafted solution features, developed the solution, selected testers, orchestrated User Acceptance Testing (UAT), planned rollouts, and designed trainings.

Our commitment to Responsible AI (RAI) was underpinned by a rigorous assessment of RAI risks and the implementation of risk mitigations. Our robust RAI framework and detailed action plans ensured that our solution not only excelled technically but was also ethical, unbiased, and transparent.

What We Learned

Reflecting on our journey, we would like to share a few key takeaways:

  1. Data is the cornerstone of LLM fine-tuning. It is crucial to establish a robust approach and process to collect/create, review, and enhance training and evaluation datasets.
  2. LLM evaluation can be more art than science. It requires flexibility to encapsulate both the binary nature of correctness and the nuanced spectrum of human judgment; we deployed a metric system that covered diverse aspects of "accuracy."
  3. Supervised Fine-Tuning (SFT) is just the beginning. Only limited data can typically be gathered for initial fine-tuning, so it is important to quickly set up a process for continuous learning based on tester and user input (e.g., RLHF).
  4. Actively experimenting and leveraging open-source communities can bring surprises. Newly released models and techniques can lead to solution breakthroughs with little effort.
  5. A strong RAI framework is crucial for GenAI solution roll-out. Hallucination and inaccuracy are inevitable: Clients must understand the associated risks and then, through change management, design guardrails and implement mitigants.

Conclusion

Text-to-SQL technology holds significant potential for enabling the entire organization to unlock valuable insights from data. But the path to effective implementation is intricate, requiring a tailored approach to design, deploy, and scale these solutions. To unlock text-to-SQL’s full value potential, enterprises must also design appropriate change management processes, RAI guardrails, and human-in-the-loop flywheels.