This topic describes how to use an AnalyticDB for PostgreSQL vector database to build a text semantic retrieval system.
Background information
Text semantic retrieval retrieves data based on the meaning and intent of the query text instead of matching only keywords, which can significantly improve retrieval quality.
Overview of text semantic retrieval
The following figure shows the architecture of text semantic retrieval, which consists of two components.

Text vectorization and index creation
Use a machine learning model to convert text into a feature vector. The model encodes the text into a vector representation that captures its semantic meaning, so that similar texts have similar vector representations in the vector space. You can use the natural language processing (NLP) capabilities of Alibaba Cloud or other text vectorization services, such as an Embedding API, to implement text vectorization.
Store texts and their feature vectors in an AnalyticDB for PostgreSQL vector database and create vector indexes. Vector indexes greatly accelerate queries for similar feature vectors.
Vector retrieval
Import the query text to the machine learning model to obtain the vector representation of the query text.
Use the query vector to query texts from the AnalyticDB for PostgreSQL vector database.
The system obtains and returns texts whose feature vectors are the most similar to the query vector from the AnalyticDB for PostgreSQL vector database.
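The following minimal sketch illustrates the two phases end to end. It is conceptual only: the model call matches the setup used later in this topic, but the texts and the query in the comments are placeholders for the concrete steps that follow.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Phase 1: vectorize the texts in the library (storing them in the database is covered later).
texts = ['How do I learn Python?', 'Can a snake kill a rabbit?']
vectors = model.encode(texts)  # One 384-dimensional vector per text.

# Phase 2: vectorize the query text and search for the nearest stored vectors,
# for example: SELECT ... ORDER BY vector <-> query_vector LIMIT k;
query_vector = model.encode('How to get started with Python?')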
Use an AnalyticDB for PostgreSQL vector database to build a text semantic retrieval system
To build a text semantic retrieval system by using an AnalyticDB for PostgreSQL vector database, perform the following steps:
Step 1: Install a Python environment
Step 2: Preprocess data
Step 3: Create a document library and a vector index
Step 4: Query data
Install a Python environment
We recommend that you use Conda to manage different versions of Python environments. In this example, a Python virtual environment is created and all the required libraries are installed in it. Run the following commands in the terminal:
# Create a Python virtual environment and set the Python version to 3.8.
conda create -n adbpg_text_env python=3.8
# Activate the Python virtual environment.
conda activate adbpg_text_env
# Install the required Python packages in the virtual environment.
pip install psycopg2==2.9.3
pip install wget==3.2
pip install pandas==1.2.4
pip install datasets==2.12.0 sentence-transformers==2.2.2
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
If you use macOS and the Error: pg_config executable not found error is reported when you install psycopg2, perform the following operations:
Install postgresql.
brew install postgresql
Install psycopg2.
pip install psycopg2==2.9.3
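Alternatively, you can install the precompiled psycopg2-binary wheel, which does not require pg_config or a local PostgreSQL installation. Note that the psycopg2 project recommends the source package for production use.
pip install psycopg2-binary==2.9.3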
Preprocess data
Before you use an AnalyticDB for PostgreSQL vector database to build a text semantic retrieval system, you must select a batch of texts as the text library. In this example, the Quora dataset is used.
Download a dataset
Download the Quora dataset by using the datasets package that is provided by Hugging Face. The Quora dataset contains about 400,000 data entries. In this example, the first 10,000 data entries are used.
from datasets import load_dataset
dataset = load_dataset('quora', split='train[0:10000]')
print(dataset[0])
Each data entry consists of the following content:
The questions parameter contains two questions from Quora.
id: specifies the serial numbers of the two questions.
text: specifies the texts of the two questions.
The is_duplicate parameter specifies whether the two questions have the same meaning.
True: yes.
False: no.
{'questions':
{'id': [1, 2],
'text': ['What is the step by step guide to invest in share market in india?', 'What is the step by step guide to invest in share market?']
},
'is_duplicate': False
}
Extract feature vectors from texts
Extract the questions from the downloaded dataset to obtain a list of texts. Use a set to filter duplicate texts.
sentences = []
for data in dataset['questions']:
    sentences.extend(data['text'])

# Remove the duplicate texts.
sentences = list(set(sentences))
print('\n'.join(sentences[1:5]))
print(len(sentences))
Sample result:
How can I know if my spouse is cheating?
Can a snake kill a rabbit?
How i get hair on bald head?
How can I get my name off the first page on Google search?
19413
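Note that iterating over a set does not guarantee a stable order across runs, so the IDs assigned later may differ between executions. If you want reproducible IDs, you can sort the deduplicated texts, for example:
# Optional: sort the deduplicated texts so that IDs are reproducible across runs.
sentences = sorted(set(sentences))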
Use the SentenceTransformer library to extract the feature vectors of the texts. In this example, the all-MiniLM-L6-v2 model is used.
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')
model
The all-MiniLM-L6-v2 model provides the following information. The max_seq_length parameter specifies the maximum text length that the model can process, which is 256. The system truncates the part of a text that exceeds this length. The word_embedding_dimension parameter specifies the feature vector dimension that the model produces, which is 384.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
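You can also read these properties directly from the model object. The following optional check assumes the sentence-transformers API shown above:
# Optional: confirm the model properties that the following steps rely on.
print(model.max_seq_length)                      # 256
print(model.get_sentence_embedding_dimension())  # 384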
Extract the feature vectors of each text in sequence and save the texts and their feature vectors to a CSV file. The CSV file contains three columns:
ID: the serial number of the text.
Sentences: the content of the text.
Vectors: the feature vectors of the text.
import pandas as pd

vectors = []
for sentence in sentences:
    vector = model.encode(sentence)
    # Convert the feature vectors into the '{0.04067191854119301, ..., -0.012967484071850777}' format
    # to import them into the AnalyticDB for PostgreSQL vector database.
    vector_str = "{" + ", ".join(str(x) for x in vector.tolist()) + "}"
    vectors.append(vector_str)

# Generate the ID column.
ids = [i + 1 for i in range(len(sentences))]

# Combine the values of ID, Sentences, and Vectors into a DataFrame.
df = pd.DataFrame({'ID': ids, 'Sentences': sentences, 'Vectors': vectors})
df.to_csv('sentences_vectors.csv', index=False)
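Encoding texts one by one is simple but slow. As an optional optimization, model.encode() also accepts a list of texts and encodes them in batches, which is typically much faster, especially on a GPU. The batch size below is an illustrative value that you can tune:
# Optional: encode all texts in batches instead of one by one.
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)
vectors = ["{" + ", ".join(str(x) for x in v.tolist()) + "}" for v in embeddings]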
Create a document library and a vector index
Configure temporary environment variables, connect to the database, and then execute a simple statement to check whether the connection is successful.
import os
import psycopg2

# Run the following code to configure temporary environment variables:
# os.environ["PGHOST"] = "xxx.xxx.xxx.xxx"
# os.environ["PGPORT"] = "58894"
# os.environ["PGDATABASE"] = "postgres"
# os.environ["PGUSER"] = "vector_test"
# os.environ["PGPASSWORD"] = "password"

connection = psycopg2.connect(
    host=os.environ.get("PGHOST", "localhost"),
    port=os.environ.get("PGPORT", "5432"),
    database=os.environ.get("PGDATABASE", "postgres"),
    user=os.environ.get("PGUSER", "user"),
    password=os.environ.get("PGPASSWORD", "password")
)
cursor = connection.cursor()

# Execute a simple statement to check whether the connection is successful.
cursor.execute("SELECT 1;")
result = cursor.fetchone()
if result == (1,):
    print("Connection successful!")
else:
    print("Connection failed.")
If the Connection successful! message is returned, the connection to the database is successful.
Create a table to store texts and feature vectors, and then create a vector index.
When you create a vector index, make sure that the value of the dim parameter is the same as the feature vector dimension that is produced by the all-MiniLM-L6-v2 model, which is 384.
# Specify SQL statements to create a table and change the storage format of the vector column to PLAIN.
create_table_sql = '''
CREATE TABLE IF NOT EXISTS public.articles (
    id INTEGER NOT NULL,
    sentence TEXT,
    vector REAL[],
    PRIMARY KEY(id)
) DISTRIBUTED BY (id);
ALTER TABLE public.articles ALTER COLUMN vector SET STORAGE PLAIN;
'''

# Specify an SQL statement to create a vector index.
create_indexes_sql = '''
CREATE INDEX ON public.articles USING ann (vector)
WITH (dim = '384', hnsw_m = '100', pq_enable = '0');
'''

# Execute the preceding SQL statements.
cursor.execute(create_table_sql)
cursor.execute(create_indexes_sql)
connection.commit()
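To confirm that the table and the vector index were created, you can query the standard pg_indexes view. This check is optional:
# Optional: list the indexes that exist on the public.articles table.
cursor.execute("SELECT indexname FROM pg_indexes WHERE schemaname = 'public' AND tablename = 'articles';")
print(cursor.fetchall())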
Import data from the sentences_vectors.csv file to the table.
import io

# Define a generator function to process the data in the file line by line.
def process_file(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            yield line

# Specify a COPY statement to import data.
copy_command = '''
COPY public.articles (id, sentence, vector)
FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');
'''

# Execute the COPY statement.
modified_lines = io.StringIO(''.join(process_file('sentences_vectors.csv')))
cursor.copy_expert(copy_command, modified_lines)
connection.commit()
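To verify that the import succeeded, you can count the rows in the table. The expected value is the number of deduplicated texts, such as 19413 in this example:
# Optional: verify that all rows were imported.
cursor.execute("SELECT COUNT(*) FROM public.articles;")
print(cursor.fetchone()[0])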
Query data
After you import data, you can use the AnalyticDB for PostgreSQL vector database to query the text that is the most semantically similar to the query text from the document library. To query texts, perform the following steps:
Obtain the vector representation of the query text.
Use the vector engine to query the texts whose feature vectors are the most similar to the query vector from the document library.
Examples
Randomly select a data entry as the query text. In this example, "What courses must be taken along with CA course?" is used. The system accurately finds the most semantically similar texts in the AnalyticDB for PostgreSQL vector database.
def query_analyticdb(collection_name, query, query_embedding, top_k=20):
    # Specify a vector index column.
    vector_col = "vector"
    # Specify a query SQL statement.
    query_sql = f"""
    SELECT id, sentence, dp_distance({vector_col}, Array{query_embedding}::real[]) AS similarity
    FROM {collection_name}
    ORDER BY {vector_col} <-> Array{query_embedding}::real[]
    LIMIT {top_k};
    """
    # Connect to the database, execute the query, and return the results.
    connection = psycopg2.connect(
        host=os.environ.get("PGHOST", "localhost"),
        port=os.environ.get("PGPORT", "5432"),
        database=os.environ.get("PGDATABASE", "postgres"),
        user=os.environ.get("PGUSER", "user"),
        password=os.environ.get("PGPASSWORD", "password")
    )
    cursor = connection.cursor()
    cursor.execute(query_sql)
    results = cursor.fetchall()
    connection.close()
    return results

# Specify a query text.
query = "What courses must be taken along with CA course?"
# Obtain the feature vector of the query text.
query_vector = model.encode(query)
print('query: {}'.format(query))
query_results = query_analyticdb('articles', query, query_vector.tolist(), 10)
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[1]} (Score: {round(result[2], 2)})")
Sample result:
query: What courses must be taken along with CA course?
1. What courses must be taken along with CA course? (Score: 1.0)
2. What is the best combination of courses I can take up along with CA to enhance my career? (Score: 0.81)
3. Is it possible to do CA after 12th Science? (Score: 0.66)
4. What are common required and elective courses in philosophy? (Score: 0.56)
5. What are common required and elective courses in agriculture? (Score: 0.56)
6. Which course is better in NICMAR? (Score: 0.53)
7. Suggest me some free online courses that provides certificates? (Score: 0.52)
8. I have only 2 months for my CA CPT exams how do I prepare? (Score: 0.51)
9. I want to crack CA CPT in 2 months. How should I study? (Score: 0.5)
10. How one should know that he/she completely prepare for CA final exam? (Score: 0.48)
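The scores are easy to interpret. The model pipeline ends with a Normalize() module (see the model summary in the Preprocess data section), so every feature vector has unit length, and the dot product score returned by dp_distance coincides with cosine similarity for unit vectors: an exact match scores 1.0 and unrelated texts score close to 0. You can check the normalization with numpy, which is installed as a dependency of sentence-transformers:
import numpy as np

# The embedding of any text is a unit vector because of the Normalize() module.
v = model.encode('What courses must be taken along with CA course?')
print(np.linalg.norm(v))  # Approximately 1.0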
You can continue to select other texts that you want to query. After you query texts, you can delete the document library and the index to release resources.
# Specify an SQL statement to delete the document library.
drop_table_sql = '''drop table public.articles;'''
cursor.execute(drop_table_sql)
connection.commit()
connection.close()