Evolution of Text-to-SQL Technology - An Analysis of Alibaba Cloud OpenSearch-SQL

By Wenyun

1. Introduction

The Text-to-SQL task aims to convert natural language queries into Structured Query Language (SQL), enabling non-technical users to access and manipulate databases easily. Recently, Alibaba Cloud's OpenSearch engine has achieved top ranking on the BIRD dataset leaderboard for Text-to-SQL tasks, thanks to its consistency alignment technology. This article will introduce the evolution of Text-to-SQL technology and provide an in-depth analysis of the OpenSearch-SQL approach.

Source: 2024.8.29

Additionally, the Text-to-SQL feature is now available in Alibaba Cloud OpenSearch. Welcome to try it out.

Platform: https://www.alibabacloud.com/help/en/open-search/search-platform/

2. Background

Research in the Text-to-SQL field has a long history, with the core focus being on how to enable people to ask questions in natural language and receive precise SQL query results. In the past, due to the complex grammar and logical requirements, the Text-to-SQL field was mainly of importance in academia. However, with the development of large model technologies, industrial-grade Text-to-SQL solutions began to emerge.

The main challenge in Text-to-SQL technology lies in accurately parsing user intent, identifying entities and relationships in the question, and mapping them to the tables, columns, and corresponding SQL operations in the database. This process not only requires the model to have a strong language comprehension capability but also a deep understanding of SQL syntax, along with good generalization across various database structures.

To advance the field, a series of public datasets and benchmark tests, such as WikiSQL, Spider, and BIRD, have been introduced. They provide standards for model training and evaluation, promoting technical exchange and competition. These rich resources have enabled Text-to-SQL systems to evolve from handling simple queries to managing complex SQL queries involving multiple inferences, comparison operations, and aggregate functions, greatly expanding their application scenarios.

Below is a simple example of Text-to-SQL:

A Survey on Deep Learning Approaches for Text-to-SQL

3. Technical Evolution

3.1 Traditional Methods

3.1.1 Sketch-based

These methods are based on the structure of SQL, decomposing the SQL generation process into multiple sub-modules, such as SELECT, AGG functions, and WHERE conditions. During the subsequent generation, specific methods are chosen for each slot in these modules, enhancing the flexibility and accuracy of the generated SQL.

Seq2SQL is an early representative of this approach, using a neural network's classification task to predict and fill in the content of slots within the decomposed structure. Although this architecture-based decomposition significantly simplifies the complexity of the Text-to-SQL task, it also limits its ability to handle more complex SQL. For example, for relatively simple datasets such as WikiSQL, this method can achieve good results; however, for problems involving complex syntax such as Spider, this method often fails to achieve the desired results.

Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning

Other similar approaches include Coarse2Fine and RYANSQL. These not only generate the contents of slots within a specific structure but also attempt to have the model first generate the structure of the natural language query (NLQ). This enhancement improves their scalability when handling complex syntax and diverse database environments.

3.1.2 Intermediate Language-based

Some researchers have found that, compared with directly generating SQL, it is easier to maintain consistency with natural language by generating other content during the decoding process. This involves first generating an intermediate language, which could be an existing language or one specifically constructed for the task, before generating the SQL, known as the intermediate language-based approach.

Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation

For example, IRNet, as shown in the figure, constructs an intermediate language specifically for SemQL, allowing the model to simplify the process of generating SQL through this intermediate representation. Once the intermediate language is generated, the process continues to generate the final SQL.

3.1.3 Summary

Traditional model-based methods also include algorithms that use pre-trained models in place of traditional encoders, analyze syntax trees with graph structures, and focus on filtering database information. Overall, Sketch-based and intermediate language-based methods serve as supplements to address the limitations of model capabilities. However, the effectiveness of the ultimately generated SQL heavily relies on the representational power of the manually designed structures by researchers, which restricts the transferability of these methods.

3.2 LLM Methods

As the LLM capabilities continue to improve, LLM-driven approaches are showing stronger transferability and reasoning abilities compared with traditional methods, bringing the Text-to-SQL task into a new phase. In this phase, more complex SQL tasks can be effectively handled without being constrained by manually designed frameworks. From the perspective of datasets, the difficulty of tasks faced by LLM-driven methods has gradually shifted from Spider to BIRD, indicating a significant improvement in handling more complex queries.

For example, the classic model T5-Base, which is used in traditional methods, achieves 71.1% accuracy on Spider but only 7.06% on BIRD. On the other hand, GPT-4 reaches 83.9% accuracy on Spider and 54.89% on BIRD. This comparison reflects the notable advantages of LLM-driven methods in terms of transferability and tackling more complex problems.

3.2.1 Standard Framework

Although there isn't yet a clearly defined unified framework for LLM-driven Text-to-SQL tasks, the current effective frameworks can generally be summarized into the following four parts:

1. Preparation Phase: Gather the necessary information needed by the database.

Clean the DDLs used in each step of the process.
Process the values stored in the database, maintaining a vector database.
Prepare Few-shot examples.

2. Extraction Phase: Extract the required information based on the specific question to help reduce the complexity of the task for the model.

For large databases and complex tasks, reduce the difficulty in generating the task by filtering fields and values.

3. Generation Phase: Use the large model to generate SQL based on all the prepared information, while also designing some strategies for generating SQL.

Use methods such as COT and task decomposition.
Use Few-shot to drive the LLM to understand the task.

4. Optimization Phase: Automatically refine the SQL based on certain rules or the execution results.

Correct unexecutable SQL based on the execution results.
Select the SQL through the large model.

3.2.2 Representative Methods

Among the LLM-driven methods, some representative models include:

DIN-SQL: Use COT for SQL generation, enhancing the logic and traceability of the generated SQL.
ExSL + Granite-34B-Code: Employ a fine-tuned model for SQL generation, demonstrating the adaptability of pre-trained models to specific tasks.
MAC-SQL: Complete SQL generation through task decomposition, reducing the complexity of difficult tasks and enabling better handling of multi-step questions.
DAIL-SQL: Adopt a dynamic Few-shot strategy, further improving the model's adaptability across different scenarios.
CHESS: Utilize more sophisticated extraction patterns to select key fields, performing well with complex SQL.

These methods have all achieved good results in practical applications, showcasing the potential of large model technology in Text-to-SQL tasks.

4. Analysis of OpenSearch-SQL Methods

After analyzing LLM-based Text-to-SQL methods, we proposed OpenSearch-SQL to provide a standardized process for Text-to-SQL approaches and to address some common issues present in current methods. OpenSearch-SQL comes in two versions, both adhering to a multi-agent framework as follows:

4.1 OpenSearch-SQL, v1

In the OpenSearch-SQL, v1 (hereafter referred to as v1) version, we first defined the aforementioned Text-to-SQL framework.

Preprocessing Agent: Construct few-shot examples, a vector library of values within the database, and information on the database structure.
Generation Agent: Utilize dynamic few-shot to drive the LLM in generating SQL.
Optimization Agent: Based on the execution results of the generated SQL, it corrects and refines the SQL, ultimately producing an optimized SQL query.

Despite the v1 version achieving good results, through in-depth analysis, we found that during the multi-agent collaboration, the complexity of tasks in the generation phase and the failure to follow instructions were the main reasons for the LLM generating inaccurate SQL. The specific issues are as follows:

1. High complexity in the generation phase:

The process of converting SQL components such as tables, columns, and values into a complete SQL query involves complex reasoning. Current methods often require the model to directly complete this conversion, which undoubtedly increases the difficulty of the generation task.

2. Failure to follow instructions:

i. Field and value extraction: The extracted content is incomplete or inconsistent.

ii. Generation phase: The model does not fully utilize the extracted information during the generation phase.

iii. SQL style mismatch: Although the generated SQL may be logically sound, it may not match the style expected by the database.

iv. Ignoring requirements: Even when requirements are clearly stated in the prompt, the LLM might still overlook these requirements.

4.2 OpenSearch-SQL, v2

To solve the problems in the v1 version, in OpenSearch-SQL, v2, we defined two problems at the beginning:

How to reduce the difficulty of SQL generation: progressive generation
How to improve the success rate of LLM instruction adherence: consistency alignment

4.2.1 Progressive Generation

To lower the complexity of the SQL generation process, we introduced a novel approach to decompose the generation task. While some existing methods break down the SQL generation into sub-tasks and then merge them, in real-world scenarios, this decomposition can lengthen the generation chain and introduce errors during the decomposition and recombination processes, making it difficult to effectively combine the individual sub-SQLs. Therefore, our goal was to design a more efficient decomposition method that would simplify the generation process while reducing error accumulation, thus improving the overall accuracy and efficiency of SQL generation.

We propose progressively generating different parts of the SQL, such as SELECT, WHERE, GROUP BY, and others. Therefore, we use a COT approach to gradually generate and analyze the SELECT, Column, and Value in SQL and the SQL itself.

What is the phone number of the school that has the highest average score in Math?

#reason: The question want to know the phone number of the school, so the SQL SELECT schools.Phone and the condition is the school that has the highest average score in Math.
#columns: schools.Phone, schools.CDSCode, satscores.AvgScrMath
#values: highest average score in Math refers to ORDER BY satscores.AvgScrMath DESC LIMIT 1
#SELECT: phone number of the school refers to schools.Phone
#SQL-like: SELECT schools.Phone FROM schools ORDER BY satscores.AvgScrMath DESC LIMIT 1
#SQL: SELECT T1.Phone FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds ORDER BY T2.AvgScrMath DESC LIMIT 1

The benefits of such an approach over decomposing tasks and merging them are:

With minimal gaps between steps, it is easier to identify which part of the SQL generation process causes issues.
Progressive generation allows for temporarily ignoring less syntactically critical information, such as JOINs, until later stages.
Completing the task in one go within the COT framework avoids the inconsistencies that can arise from multi-agent collaboration.

4.2.2 Consistency Alignment

For the second issue, we first analyzed a classic method to improve model performance on subtasks: Post-training.

While post-training methods such as SFT, RLHF, and DPO can lead to a decrease in general capabilities, they are effective in quickly aligning the style of subtasks. However, this alignment often comes with certain challenges:

Data Preparation and Uncertainty: The process of aligning subtasks requires preparing a large amount of data and dealing with uncertain training outcomes. A common practice is to apply SFT only to the SQL generation step in Text-to-SQL tasks, while other parts still rely on a large base model like GPT-4 to complete. This strategy reduces the data requirements but may not fully leverage the potential of the model.
Loss of General Capabilities: This training approach often leads to a reduction in the model's general capabilities. For example, in terms of large models with smaller amounts of parameters, SFT might significantly improve performance on simple problems, but when faced with more complex or challenging problems, performance may actually decline. This suggests that relying solely on task-specific training can harm the model's adaptability across a wide range of applications.

Therefore, when aligning subtasks, it is necessary to carefully balance the advantages and potential risks, seeking a method that achieves alignment without compromising general capabilities.

In practice, we found that large base models often fail to follow instructions accurately when dealing with complex tasks, which is a significant reason why they perform worse than SFT models. So, can we ensure better instruction adherence in the large models within the Agent framework?

With this idea in mind, we conducted a series of experiments and observed two key phenomena:

1. Difficulty Correlation: When a complex task is decomposed into simpler subtasks, the adherence to instructions for these subtasks is higher.

2. Generation Variability: Even for SQL queries that the model can generate correctly, there is still a certain probability of errors, and this probability is positively correlated with the difficulty of the problem. Experimentally, the same prompt can result in about a 10% difference in bad cases between two experiments.

Based on these observations, we implemented a Double Check + Vote mechanism in OpenSearch-SQL. This mechanism works through the following steps:

a) Task Decomposition: Decompose the task into simpler subtasks to reduce the complexity of what the LLM needs to handle. This also makes it easier to check the quality of instruction adherence by the LLM and allows for re-generation if necessary. The steps involved in this task are:

i) Reduce the complexity of instructions through decomposition.

ii) Since the decomposed subtasks are simpler, it is easier to verify the results of the LLM's instruction adherence. In practice, this step can even be done through a simple check for alignment without needing the LLM to work within the Agent.

b) SQL Error Correction: Further correct the SQL based on the analysis of its execution results.

i) Use different prompts for different types of errors.

ii) Execute multiple generated SQL results for voting to reduce the impact of model variability.

c) Vote: With the self-consistency and vote mechanism, choose the SQL result with the highest consistency as the answer, and among those with the same result, select the one with the shortest execution time. This step improves both the accuracy and efficiency of the SQL.

4.3 Outlook

In terms of consistent alignment, there is still much room for optimization. By atomizing model outputs, there is significant potential to raise the upper limit of Text-to-SQL tasks. By atomizing Agent tasks, large models can quickly build chains for various tasks and have the capability for hot-swapping, allowing for flexible integration with different tasks. This mechanism not only enhances the model's adaptability but also helps developers rapidly implement specific functionalities based on particular needs, thereby improving overall efficiency and effectiveness.

5. Getting Started with OpenSearch-SQL

After understanding the specific methods and performance of OpenSearch-SQL, businesses and developers can now quickly get started with its powerful features. Currently, OpenSearch-SQL is officially available, and users can try it out on the OpenSearch-SQL search development platform, experiencing its excellent performance and application effect.

Link: https://www.alibabacloud.com/help/en/open-search/search-platform/product-overview/introduction-to-search-platform

Community

Evolution of Text-to-SQL Technology - An Analysis of Alibaba Cloud OpenSearch-SQL

1. Introduction

2. Background

3. Technical Evolution

3.1 Traditional Methods

3.1.1 Sketch-based

3.1.2 Intermediate Language-based

3.1.3 Summary

3.2 LLM Methods

3.2.1 Standard Framework

3.2.2 Representative Methods

4. Analysis of OpenSearch-SQL Methods

4.1 OpenSearch-SQL, v1

4.2 OpenSearch-SQL, v2

4.2.1 Progressive Generation

4.2.2 Consistency Alignment

4.3 Outlook

5. Getting Started with OpenSearch-SQL

Read previous post:

Alibaba Cloud Data Intelligence

You may also like

Comments

Alibaba Cloud Data Intelligence

Related Products

Big Data Consulting for Data Technology Solution

Big Data Consulting Services for Retail Solution

OpenSearch

ApsaraDB for HBase