This topic describes how to perform text search in AnalyticDB for PostgreSQL by using Python code and how to build a conversational search system with LangChain.
Document search
In this example, plain text search is performed. Sample code:
def query_content(namespace, namespace_password, collection, top_k,
                  content,
                  filter_str: str = None,
                  metrics: str = None,
                  use_full_text_retrieval: bool = None):
    # Build the QueryContent request against the target document collection.
    request = gpdb_20160503_models.QueryContentRequest(
        region_id=ADBPG_INSTANCE_REGION,
        dbinstance_id=ADBPG_INSTANCE_ID,
        namespace=namespace,
        namespace_password=namespace_password,
        collection=collection,
        content=content,
        filter=filter_str,
        top_k=top_k,
        metrics=metrics,
        use_full_text_retrieval=use_full_text_retrieval,
    )
    response = get_client().query_content(request)
    print(f"query_content response code: {response.status_code}, body: {response.body}")
if __name__ == '__main__':
    query_content('ns1', 'Ns1password', 'dc1', 10, 'What is AnalyticDB for PostgreSQL? ')
    # output: body:
    # {
    #   "Matches":
    #   {
    #     "MatchList":
    #     [{
    #       "Content": "ADBPG...",
    #       "FileName": "test.pdf",
    #       "Id": "9368a9aa-8a26-4200-b84b-cab4e06dbbd4_20",
    #       "LoaderMetadata": "{\"page\":1.0,\"total_pages\":15.0,\"format\":\"PDF 1.4\",\"title\":\"\",\"author\":\"\",\"subject\":\"\",\"keywords\":\"\",\"creator\":\"Chromium\",\"producer\":\"Skia/PDF m93\",\"creationDate\":\"D:20231213060903+00\\u002700\\u0027\",\"modDate\":\"D:20231213060903+00\\u002700\\u0027\",\"trapped\":\"\"}",
    #       "Metadata": {},
    #       "RetrievalSource": 1,
    #       "Score": 0.7038057130604151
    #     },....]
    #   }
    # }
The following list describes the parameters of the query_content function:
namespace: the name of the namespace where the document collection is located.
namespace_password: the password of the namespace.
collection: the name of the document collection.
top_k: the number of search results with the highest similarity to return.
content: the text content to be searched.
filter_str: the statement that is used to filter search results.
metrics: the vector distance algorithm. We recommend that you do not specify this parameter. The vector distance is calculated by using the algorithm that is used to create the index.
use_full_text_retrieval: indicates whether to use full-text search. Valid values:
true: uses full-text search.
false (default): does not use full-text search.
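The filter_str parameter takes a condition on metadata fields. As an illustration only, the helper below assembles a SQL-style condition from a dict; the helper and its field names are hypothetical, and the filter syntax is assumed to accept simple comparison expressions such as page = 1, so check the QueryContent API reference for the exact syntax your instance supports.

```python
# Hypothetical helper: build a SQL-style filter condition from a metadata dict.
# Assumes filter_str accepts expressions such as "page = 1 AND author = 'bob'".
def build_filter(conditions: dict) -> str:
    parts = []
    for field, value in conditions.items():
        if isinstance(value, str):
            # Quote string values; escape embedded single quotes.
            escaped = value.replace("'", "''")
            parts.append(f"{field} = '{escaped}'")
        else:
            parts.append(f"{field} = {value}")
    return " AND ".join(parts)

print(build_filter({"page": 1, "author": "bob"}))
# page = 1 AND author = 'bob'
```

The resulting string can then be passed as the filter_str argument of query_content.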
The returned search result list contains the following information:
Id: the UUID of the split chunk.
FileName: the name of the document.
Content: the matched content, which is a chunk obtained after document splitting.
LoaderMetadata: the metadata generated during document uploading.
Metadata: custom metadata.
RetrievalSource: the source of the search result. Valid values:
1: hit by vector search.
2: hit by full-text search.
3: hit by both vector search and full-text search.
Score: the similarity score obtained by using the specified similarity algorithm.
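To show how these fields can be consumed, the sketch below ranks matches by Score and labels each RetrievalSource. The SimpleNamespace objects are hypothetical stand-ins for the match objects in response.body.matches.match_list; this is an illustration, not SDK code.

```python
from types import SimpleNamespace

# Hypothetical stand-ins that mimic the fields of returned match objects.
matches = [
    SimpleNamespace(id="a_1", score=0.70, retrieval_source=1, content="chunk A"),
    SimpleNamespace(id="b_2", score=0.85, retrieval_source=3, content="chunk B"),
    SimpleNamespace(id="c_3", score=0.55, retrieval_source=2, content="chunk C"),
]

# Labels for the RetrievalSource values described above.
SOURCE_LABELS = {1: "vector", 2: "full-text", 3: "both"}

# Sort by similarity score, highest first.
ranked = sorted(matches, key=lambda m: m.score, reverse=True)
for m in ranked:
    print(f"{m.id}: score={m.score:.2f}, source={SOURCE_LABELS[m.retrieval_source]}")
```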
Integrate LangChain
LangChain is an open source framework for developing applications powered by large language models (LLMs). It enables you to connect models to external data through a set of interfaces and tools. The following section shows how to integrate the search capabilities of AnalyticDB for PostgreSQL into LangChain to implement a conversational search system.
Install modules
pip install --upgrade langchain openai tiktoken
Build AdbpgRetriever
from typing import List

from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document


class AdbpgRetriever(BaseRetriever):
    namespace: str = None
    namespace_password: str = None
    collection: str = None
    top_k: int = None
    use_full_text_retrieval: bool = None

    def query_content(self, content) -> List[gpdb_20160503_models.QueryContentResponseBodyMatchesMatchList]:
        request = gpdb_20160503_models.QueryContentRequest(
            region_id=ADBPG_INSTANCE_REGION,
            dbinstance_id=ADBPG_INSTANCE_ID,
            namespace=self.namespace,
            namespace_password=self.namespace_password,
            collection=self.collection,
            content=content,
            top_k=self.top_k,
            use_full_text_retrieval=self.use_full_text_retrieval,
        )
        response = get_client().query_content(request)
        return response.body.matches.match_list

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        # Wrap each matched chunk in a LangChain Document.
        match_list = self.query_content(query)
        return [Document(page_content=i.content) for i in match_list]
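The conversion step in _get_relevant_documents can be exercised in isolation before the retriever is wired into a chain. The sketch below mirrors that conversion with a stub match list; the Document dataclass is a minimal stand-in for langchain_core.documents.Document, so no SDK or LangChain installation is required.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class Document:
    # Minimal stand-in for langchain_core.documents.Document.
    page_content: str

def to_documents(match_list):
    # Same conversion as in AdbpgRetriever._get_relevant_documents.
    return [Document(page_content=m.content) for m in match_list]

# Hypothetical match objects with a .content attribute.
stub_matches = [SimpleNamespace(content="ADBPG is ..."),
                SimpleNamespace(content="It supports ...")]
docs = to_documents(stub_matches)
print([d.page_content for d in docs])
```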
Create a chain
import os

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI()
def format_docs(docs):
    # Concatenate retrieved chunks into a single context string.
    return "\n\n".join(d.page_content for d in docs)
retriever = AdbpgRetriever(namespace='ns1', namespace_password='Ns1password', collection='dc1', top_k=10, use_full_text_retrieval=True)
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
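In this chain, the leading dict runs its two branches on the same input: the query flows through retriever | format_docs to produce the context, and through RunnablePassthrough to fill the question, before the prompt, model, and output parser run in sequence. The plain-Python sketch below mirrors that data flow with stubs so the composition is explicit; the retriever and its returned chunks are illustrative stand-ins, and no LangChain or OpenAI calls are made.

```python
TEMPLATE = """Answer the question based only on the following context:
{context}
Question: {question}
"""

def stub_retriever(query):
    # Stand-in for AdbpgRetriever: returns chunk texts for the query.
    return ["AnalyticDB for PostgreSQL is an OLAP service.",
            "It is compatible with PostgreSQL."]

def format_docs(docs):
    # Concatenate chunks, as format_docs does in the chain.
    return "\n\n".join(docs)

def build_prompt(question):
    # Mirrors {"context": retriever | format_docs,
    #          "question": RunnablePassthrough()} | prompt
    return TEMPLATE.format(context=format_docs(stub_retriever(question)),
                           question=question)

prompt_text = build_prompt("What is AnalyticDB for PostgreSQL?")
print(prompt_text)
```

In the real chain, prompt_text would then be sent to the chat model and the reply parsed by StrOutputParser.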
Conversational search
chain.invoke("What is AnalyticDB for PostgreSQL?")
# Answer:
# AnalyticDB for PostgreSQL is a cloud-native online analytical processing (OLAP) service provided by Alibaba Cloud. It is developed based on the open source PostgreSQL database and provides a high-performance, high-capacity data warehousing solution.
# It combines the flexibility and compatibility of PostgreSQL with high-concurrency and high-speed search capabilities for data analysis and reporting.
#
# AnalyticDB for PostgreSQL is especially suitable for processing large-scale data sets and supports real-time analysis and decision-making. It is a powerful tool for enterprises to perform data mining, business intelligence (BI), reporting, and data visualization.
# As a managed service, it simplifies the management and O&M of data warehouses, allowing users to focus on data analysis instead of the underlying infrastructure.
# Key benefits:
#
# High-performance analytics: It uses column-oriented storage and a massively parallel processing (MPP) architecture to quickly query and analyze large amounts of data.
# Easy to scale: It is easy to scale resources horizontally and vertically based on the data volume and search performance requirements.
# Compatible with PostgreSQL: It supports the PostgreSQL language and most tools in the ecosystem to facilitate migration and adaptation for existing PostgreSQL users.
# Secure and reliable: It provides features such as data backup, recovery, and encryption to ensure data security and reliability.
# Cloud-native integration: It is closely integrated with other Alibaba Cloud services such as Data Integration and data visualization tools.
# In summary, AnalyticDB for PostgreSQL is a high-performance, scalable cloud data warehousing service that allows enterprises to perform complex data analysis and reporting in cloud environments.