Manage documents - AnalyticDB - Alibaba Cloud Documentation Center

After you create a document collection, you can upload documents to it, view the document list and document details, and delete documents. This topic describes these operations.

Upload documents

The following sample code uploads a local document in asynchronous mode and then polls the upload job until it is complete:

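import time
import io
from typing import Dict, List, Any
from alibabacloud_gpdb20160503 import models as gpdb_20160503_models
from alibabacloud_tea_util import models as util_models

# get_client(), ADBPG_INSTANCE_REGION, and ADBPG_INSTANCE_ID are not defined in
# this topic. They are assumed to come from your environment setup code.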


def upload_document_async(
        namespace,
        namespace_password,
        collection,
        file_name,
        file_path,
        metadata: Dict[str, Any] = None,
        chunk_overlap: int = None,
        chunk_size: int = None,
        document_loader_name: str = None,
        text_splitter_name: str = None,
        dry_run: bool = None,
        zh_title_enhance: bool = None,
        separators: List[str] = None):
    with open(file_path, 'rb') as f:
        file_content_bytes = f.read()
    request = gpdb_20160503_models.UploadDocumentAsyncAdvanceRequest(
        region_id=ADBPG_INSTANCE_REGION,
        dbinstance_id=ADBPG_INSTANCE_ID,
        namespace=namespace,
        namespace_password=namespace_password,
        collection=collection,
        file_name=file_name,
        metadata=metadata,
        chunk_overlap=chunk_overlap,
        chunk_size=chunk_size,
        document_loader_name=document_loader_name,
        file_url_object=io.BytesIO(file_content_bytes),
        text_splitter_name=text_splitter_name,
        dry_run=dry_run,
        zh_title_enhance=zh_title_enhance,
        separators=separators,
    )
    response = get_client().upload_document_async_advance(request, util_models.RuntimeOptions())
    print(f"upload_document_async response code: {response.status_code}, body:{response.body}")
    return response.body.job_id


def wait_upload_document_job(namespace, namespace_password, collection, job_id):
    def job_ready():
        request = gpdb_20160503_models.GetUploadDocumentJobRequest(
            region_id=ADBPG_INSTANCE_REGION,
            dbinstance_id=ADBPG_INSTANCE_ID,
            namespace=namespace,
            namespace_password=namespace_password,
            collection=collection,
            job_id=job_id,
        )
        response = get_client().get_upload_document_job(request)
        print(f"get_upload_document_job response code: {response.status_code}, body:{response.body}")
        return response.body.job.completed
    while True:
        if job_ready():
            print("Document loaded successfully")
            break
        time.sleep(2)


if __name__ == '__main__':
    job_id = upload_document_async("ns1", "Ns1password", "dc1",
                                   "test.pdf", "/root/test.pdf")
    wait_upload_document_job("ns1", "Ns1password", "dc1", job_id)


# upload_document_async output:
# {
#    "JobId":"95de2856-0cd4-44bb-b216-ea2f0ebcc57b",
#    "Message":"Successfully create job",
#    "RequestId":"9F870770-C402-19EC-9E26-ED7E4F539C3E",
#    "Status":"success"
# }

# get_upload_document_job output:
# {
#    "ChunkResult":{
#        "ChunkFileUrl":"http://knowledge-base-gp-xx.oss-cn-beijing.aliyuncs.com/ns1/dc1/produce-files/test.pdf/chunks.jsonl?Expires=1706530707&OSSAccessKeyId=ak&Signature=6qUSwBtuthr0L9OxKoTh7kEohxQ%3D",
#        "PlainChunkFileUrl":"http://knowledge-base-gp-xx.oss-cn-beijing.aliyuncs.com/ns1/dc1/produce-files/test.pdf/plain_chunks.txt?Expires=1706530707&OSSAccessKeyId=ak&Signature=sxc5iiGUDE2M%2FV0JikFvQE7FdBM%3D"
#    },
#    "Job":{
#        "Completed":true,
#        "CreateTime":"2024-01-29 18:15:27.364484",
#        "Id":"95de2856-0cd4-44bb-b216-ea2f0ebcc57b",
#        "Progress":100,
#        "Status":"Success",
#        "UpdateTime":"2024-01-29 18:15:53.78808"
#    },
#    "Message":"Success get job info",
#    "RequestId":"64487F02-5A02-1CD9-BA5C-B59E9D3A68CC",
#    "Status":"success"
# }
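
The wait_upload_document_job function above polls without an upper time limit. The following is a minimal sketch of a bounded polling helper, assuming you prefer to give up after a deadline. The name poll_until and its parameters are hypothetical and are not part of the SDK. The check argument stands in for a function such as job_ready.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool], timeout: float = 600.0, interval: float = 2.0) -> bool:
    """Call check() every `interval` seconds until it returns True or `timeout` elapses.

    Returns True if check() returned True, or False if the deadline was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        # Stop if the next sleep would run past the deadline.
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)
```

With this helper, the caller decides how to handle a False result, for example by logging the job ID and retrying later.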

Descriptions of the parameters in the upload_document_async function:

namespace: the name of the namespace where the document collection is located.
namespace_password: the password of the namespace.
collection: the name of the document collection in which you want to store the document.
file_name: the name of the document, including the file extension.
file_path: the local document path. The maximum size of the document is 200 MB.
metadata: the metadata of the document, which must be the same as the metadata specified when the document collection is created.
chunk_overlap: the amount of data that is overlapped between consecutive chunks when large amounts of data are split into chunks. The value cannot exceed chunk_size.
chunk_size: the size of each chunk if you split large amounts of data into chunks. The maximum value is 2048.
document_loader_name: the name of the document loader. This parameter is optional. If you do not specify it, a document loader is automatically selected based on the file extension. For more information, see Document understanding.
text_splitter_name: the name of the splitter. For more information about document splitting, see Document splitting.
dry_run: specifies whether to perform only document understanding and splitting, without vectorization and storage. Valid values:
- true: performs only document understanding and splitting.
- false (default): performs document understanding and splitting first, and then performs vectorization and storage.
zh_title_enhance: specifies whether to enable title enhancement. Valid values:
- true: enables title enhancement.
- false: disables title enhancement.
separators: the separators that are used to split large amounts of data. In most cases, you do not need to specify this parameter.
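
The limits in the parameter descriptions above can be checked locally before a request is sent. The following is a minimal sketch under stated assumptions: check_upload_args and the two constants are hypothetical names, and 200 MB is interpreted as 200 * 1024 * 1024 bytes, which the source does not specify.

```python
from typing import List, Optional

MAX_FILE_BYTES = 200 * 1024 * 1024  # 200 MB document limit (assumes 1 MB = 1024 * 1024 bytes)
MAX_CHUNK_SIZE = 2048               # documented maximum value of chunk_size


def check_upload_args(file_size: int,
                      chunk_size: Optional[int] = None,
                      chunk_overlap: Optional[int] = None) -> List[str]:
    """Return a list of problems. An empty list means the arguments look valid."""
    problems = []
    if file_size > MAX_FILE_BYTES:
        problems.append("file exceeds 200 MB")
    if chunk_size is not None and chunk_size > MAX_CHUNK_SIZE:
        problems.append("chunk_size exceeds 2048")
    if chunk_size is not None and chunk_overlap is not None and chunk_overlap > chunk_size:
        problems.append("chunk_overlap exceeds chunk_size")
    return problems
```

For a local file, the size can be obtained with os.path.getsize(file_path) and passed as file_size.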

Document understanding

The document_loader_name parameter is optional. If you do not specify it, a document loader is automatically selected based on the file extension. The following list shows the file extensions that each loader supports:

UnstructuredHTMLLoader: .html
UnstructuredMarkdownLoader: .md
PyMuPDFLoader: .pdf
PyPDFLoader: .pdf
RapidOCRPDFLoader: .pdf
JSONLoader: .json
CSVLoader: .csv
RapidOCRLoader: .png, .jpg, .jpeg, or .bmp
UnstructuredFileLoader: .eml, .msg, .rst, .txt, .xml, .docx, .epub, .odt, .pptx, or .tsv

When multiple loaders support a file type, such as .pdf, you can specify any of them. To recognize text in images, we recommend RapidOCRPDFLoader.
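
The list above can be expressed as a lookup table, for example to choose a value for document_loader_name on the client side. This is only an illustrative sketch: candidate_loaders and LOADERS_BY_EXTENSION are hypothetical names, the case-insensitive match is an assumption, and the logic that the service uses to select a loader is not described in this topic.

```python
import os

# Candidate loaders for each file extension, copied from the list above.
LOADERS_BY_EXTENSION = {
    ".html": ["UnstructuredHTMLLoader"],
    ".md": ["UnstructuredMarkdownLoader"],
    ".pdf": ["PyMuPDFLoader", "PyPDFLoader", "RapidOCRPDFLoader"],
    ".json": ["JSONLoader"],
    ".csv": ["CSVLoader"],
}
for _ext in (".png", ".jpg", ".jpeg", ".bmp"):
    LOADERS_BY_EXTENSION[_ext] = ["RapidOCRLoader"]
for _ext in (".eml", ".msg", ".rst", ".txt", ".xml", ".docx", ".epub", ".odt", ".pptx", ".tsv"):
    LOADERS_BY_EXTENSION[_ext] = ["UnstructuredFileLoader"]


def candidate_loaders(file_name: str) -> list:
    """Return the loaders listed for the extension of file_name, or [] if none is listed."""
    ext = os.path.splitext(file_name)[1].lower()  # assumption: extensions match case-insensitively
    return LOADERS_BY_EXTENSION.get(ext, [])
```

For test.pdf, the function returns all three PDF loaders, from which you can pick RapidOCRPDFLoader if the document contains text in images.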

Document splitting

The results of document splitting are determined by chunk_overlap, chunk_size, and text_splitter_name. The text_splitter_name parameter has the following values:

ChineseRecursiveTextSplitter: inherits from RecursiveCharacterTextSplitter, uses ["\n\n","\n", ".|!|?", "\.\s|\!\s|\?\s", ";|;\s", ",|,\s"] as separators by default, and matches the separators as regular expressions.
RecursiveCharacterTextSplitter: uses ["\n\n", "\n", " ", ""] as separators by default. Language-aware splitting is supported for the following languages: c++, go, java, js, php, proto, python, rst, ruby, rust, scala, swift, markdown, latex, html, sol, and csharp.
SpacyTextSplitter: uses \n\n as the separator by default and splits text by using the en_core_web_sm model of spaCy, which can produce better splitting results.
MarkdownHeaderTextSplitter: splits text by headers, using [("#", "head1"), ("##", "head2"), ("###", "head3"), ("####", "head4")] as the header mapping. This splitter is suitable for Markdown text.
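
To illustrate how chunk_size and chunk_overlap interact, the following sketch splits a string with a fixed-size sliding window. It is not the algorithm of any splitter listed above, which split on separators first. The function name split_with_overlap is hypothetical, and sizes are counted in characters here as an assumption. The sketch requires chunk_overlap to be strictly smaller than chunk_size so that the window always advances.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


print(split_with_overlap("abcdefghij", 4, 2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous chunk, so a larger chunk_overlap preserves more context across chunk boundaries at the cost of more chunks.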

View the document list

def list_documents(namespace, namespace_password, collection):
    request = gpdb_20160503_models.ListDocumentsRequest(
        region_id=ADBPG_INSTANCE_REGION,
        dbinstance_id=ADBPG_INSTANCE_ID,
        namespace=namespace,
        namespace_password=namespace_password,
        collection=collection,
    )
    response = get_client().list_documents(request)
    print(f"list_documents response code: {response.status_code}, body:{response.body}")


if __name__ == '__main__':
    list_documents("ns1", "Ns1password", "dc1")


# output: body:
# {
#    "Items":{
#        "DocumentList":[
#            {
#                "FileName":"test.pdf",
#                "Source":"OSS"
#            }
#        ]
#    },
#    "RequestId":"08D5E2D6-81E1-1D8A-B864-830538B04991",
#    "Status":"success"
# }

Descriptions of the parameters in the list_documents function:

namespace: the name of the namespace where the document collection is located.
namespace_password: the password of the namespace.
collection: the name of the document collection.

View document details

def describe_document(namespace, namespace_password, collection, file_name):
    request = gpdb_20160503_models.DescribeDocumentRequest(
        region_id=ADBPG_INSTANCE_REGION,
        dbinstance_id=ADBPG_INSTANCE_ID,
        namespace=namespace,
        namespace_password=namespace_password,
        collection=collection,
        file_name=file_name
    )
    response = get_client().describe_document(request)
    print(f"describe_document response code: {response.status_code}, body:{response.body}")


if __name__ == '__main__':
    describe_document("ns1", "Ns1password", "dc1", "test.pdf")


# output: body:
# {
#    "DocsCount":24,
#    "DocumentLoader":"PyMuPDFLoader",
#    "FileExt":"pdf",
#    "FileMd5":"ce16fa68025ebf41649810f0335caf49",
#    "FileMtime":"2024-01-29 11:37:27.270611",
#    "FileName":"test.pdf",
#    "FileSize":8332620,
#    "FileVersion":1,
#    "RequestId":"D05B4CF1-64F0-1D77-AD9C-C54CAB065571",
#    "Source":"OSS",
#    "Status":"success",
#    "TextSplitter":"ChineseRecursiveTextSplitter"
# }

Descriptions of the parameters in the describe_document function:

namespace: the name of the namespace where the document collection is located.
namespace_password: the password of the namespace.
collection: the name of the document collection.
file_name: the name of the document.

Returned document details:

DocsCount: the number of chunks into which the document is split.
TextSplitter: the name of the document splitter.
DocumentLoader: the name of the document loader.
FileExt: the file extension of the document.
FileMd5: the MD5 hash value of the document.
FileMtime: the latest upload time of the document.
FileSize: the size of the document. Unit: bytes.
FileVersion: the version of the document. The value is an integer that indicates how many times the document has been uploaded and updated.
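
One possible use of FileMd5 is to skip an upload when the local file has not changed. The following is a minimal sketch, assuming that FileMd5 is the hexadecimal MD5 digest of the raw file bytes, which this topic does not state explicitly. The function names file_md5 and needs_upload are hypothetical.

```python
import hashlib


def file_md5(path: str, block_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a local file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_upload(path: str, remote_md5: str) -> bool:
    """Return True if the local file differs from the FileMd5 value of the stored document."""
    return file_md5(path) != remote_md5.lower()
```

Here remote_md5 would be the FileMd5 value returned for the document, such as "ce16fa68025ebf41649810f0335caf49" in the sample output.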

Delete a document

def delete_document(namespace, namespace_password, collection, file_name):
    request = gpdb_20160503_models.DeleteDocumentRequest(
        region_id=ADBPG_INSTANCE_REGION,
        dbinstance_id=ADBPG_INSTANCE_ID,
        namespace=namespace,
        namespace_password=namespace_password,
        collection=collection,
        file_name=file_name
    )
    response = get_client().delete_document(request)
    print(f"delete_document response code: {response.status_code}, body:{response.body}")


if __name__ == '__main__':
    delete_document("ns1", "Ns1password", "dc1", "test.pdf")


# output: body:
# {
#    "Message":"success",
#    "RequestId":"DC735368-02DD-48A4-8A26-C8DEB53C5B56",
#    "Status":"success"
# }

Descriptions of the parameters in the delete_document function:

namespace: the name of the namespace where the document collection is located.
namespace_password: the password of the namespace.
collection: the name of the document collection.
file_name: the name of the document.