Cloud-Native Data Warehouse AnalyticDB: UploadDocumentAsync - Asynchronous Document Upload

Last updated: Dec 20, 2024

Uploads a document asynchronously.

Operation description

The server loads and chunks a document based on the file extension, performs vectorization by using the embedding model that is specified when you call the CreateDocumentCollection operation, and then writes the document to the specified document collection. This operation supports multi-modal embedding for various formats of text and images.

Related operations:

  • You can call the GetUploadDocumentJob operation to query the progress and result of a document upload job.
  • You can call the CancelUploadDocumentJob operation to cancel a document upload job.
Note
  • After a document upload request is submitted, it is queued for processing. Each Resource Access Management (RAM) user or Alibaba Cloud account can have up to 20 documents in the Pending or Running state at a time.

  • A text document can be split into up to 100,000 chunks.

  • If a document collection uses the OnePeace model, each RAM user or Alibaba Cloud account can upload and query up to 10,000 images.
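
The submit-then-poll flow described above can be sketched as follows. This is a minimal sketch rather than SDK code: `fetch_status` is a hypothetical callable standing in for a GetUploadDocumentJob call, and because this page names only the Pending and Running states, any other state is assumed here to be terminal.

```python
import time
from typing import Callable

# Only Pending and Running are named on this page; anything else is
# assumed to mean the job has finished (successfully or not).
IN_PROGRESS_STATES = {"Pending", "Running"}


def wait_for_upload_job(
    fetch_status: Callable[[str], str],
    job_id: str,
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job leaves the Pending/Running states or the timeout expires.

    fetch_status is a hypothetical wrapper around GetUploadDocumentJob that
    returns the current job state as a string.
    """
    waited = 0.0
    while True:
        status = fetch_status(job_id)
        if status not in IN_PROGRESS_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} is still {status} after {timeout} s")
        sleep(poll_interval)
        waited += poll_interval
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays. To stop a job that is still queued or running, the CancelUploadDocumentJob operation mentioned above would be called instead of waiting.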

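Debugging

You can run this operation directly in OpenAPI Explorer, which saves you the trouble of calculating signatures. After the operation runs successfully, OpenAPI Explorer can automatically generate SDK sample code.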

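Authorization information

The following authorization information applies to this operation. You can use it in the Action element of a RAM policy statement to grant a RAM user or RAM role permission to call this operation. The fields are described as follows:

  • Operation: the specific permission.
  • Access level: the access level of each operation. Valid values: Write, Read, and List.
  • Resource type: the type of resource on which the operation can be authorized. Details:
    • Required resource types are highlighted.
    • Operations that do not support resource-level authorization are shown as All Resources.
  • Condition key: a condition key defined by the cloud service itself.
  • Associated operation: other permissions required to perform the operation successfully. The caller must also hold the permissions for the associated operations.

  • Operation: gpdb:UploadDocumentAsync
  • Access level: create
  • Resource type: *Document
    acs:gpdb:{#regionId}:{#accountId}:document/{#DBInstanceId}
  • Condition key: (none listed)
  • Associated operation: (none listed)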
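
The authorization entry above can be turned into a RAM policy document. The sketch below assumes the standard RAM policy structure (Version, Statement, Effect, Action, Resource). The function name `build_upload_policy` and the account ID are placeholders for illustration only.

```python
import json


def build_upload_policy(region_id: str, account_id: str, db_instance_id: str) -> dict:
    """Build a RAM policy that allows UploadDocumentAsync on one instance's documents."""
    # Resource pattern taken from the authorization information above.
    resource = f"acs:gpdb:{region_id}:{account_id}:document/{db_instance_id}"
    return {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "gpdb:UploadDocumentAsync",
                "Resource": resource,
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_upload_policy("cn-hangzhou", "1234567890123456", "gp-bp12ga6v69h86****")
print(json.dumps(policy, indent=2))
```

Scoping the Resource to a single instance keeps the grant narrow. Whether a wildcard is acceptable in place of the instance ID depends on your own access requirements.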

Request parameters

DBInstanceId (string)
The ID of an instance for which vector engine optimization is enabled. You can call the DescribeDBInstances operation to query the details of all AnalyticDB for PostgreSQL instances in the target region, including the instance IDs.
Example: gp-bp12ga6v69h86****

Collection (string)
The name of the document collection.
Note: The document collection is created by the CreateDocumentCollection operation. You can call the ListDocumentCollections operation to query the document collections that already exist.
Example: document

Namespace (string)
The namespace. Default value: public. You can call the CreateNamespace operation to create a namespace and the ListNamespaces operation to query existing namespaces.
Example: mynamespace

NamespacePassword (string)
The password of the namespace.
Note: This value is specified when you call the CreateNamespace operation.
Example: testpassword

RegionId (string)
The region ID of the instance.
Example: cn-hangzhou

FileName (string)
The file name of the document.
Note
  • We recommend that you add an extension to the file name. Examples: .json, .md, and .pdf. If you do not add an extension, the default loader designed for unstructured data is used.
  • For an image file, the file name must contain an extension. Supported extensions: .bmp, .jpg, .jpeg, .png, and .tiff.
  • You can upload images in a compressed package. The package file name must contain an extension. Supported package extensions: .tar, .gz, and .zip.
Example: mydoc.txt

FileUrl (string)
The publicly accessible URL of the document.
Note
  • We recommend that you call this operation by using the SDK, which provides the UploadDocumentAsyncAdvance method for uploading local files directly.
  • If the URL points to an image archive, the archive can contain up to 100 images.
Example: https://xx/mydoc.txt

Metadata (object)
The metadata. It must match the Metadata fields that were specified when you called the CreateDocumentCollection operation. Values can be of any type.
Example: {"title":"mytitle","page":1}

ChunkSize (integer)
The size of each chunk when a large document is split into smaller parts. Maximum value: 2048.
Example: 250

ChunkOverlap (integer)
The amount of data shared between consecutive chunks. The value cannot exceed the value of ChunkSize.
Note: This parameter prevents context from being lost when data is truncated at a chunk boundary. For example, when you upload a long text, keeping some overlapping text between consecutive chunks preserves the context around each boundary.
Example: 50

Separators (array of string)
The separators that are used to split large amounts of data. Each array element is one separator.
Note
  • This parameter strongly affects the chunking result. It is related to the splitter that is specified by the TextSplitterName parameter.
  • In most cases, you do not need to specify this parameter. The server assigns separators based on the value of the TextSplitterName parameter.
Example (one element): .

DryRun (boolean)
Specifies whether to perform only document understanding and chunking, without vectorization and storage. Default value: false.
Note: You can set this parameter to true, check the chunking result, and then adjust the chunking parameters if needed.
Example: false

ZhTitleEnhance (boolean)
Specifies whether to enable title enhancement.
Note: Title enhancement identifies title text, marks it in the metadata, and combines text with its upper-level title to enrich the text.
Example: false

TextSplitterName (string)
The name of the splitter. Valid values:
  • ChineseRecursiveTextSplitter: inherits from RecursiveCharacterTextSplitter, uses ["\n\n","\n", "。|!|?", "\.\s|\!\s|\?\s", ";|;\s", ",|,\s"] as separators by default, and uses regular expressions to match text.
  • RecursiveCharacterTextSplitter: uses ["\n\n", "\n", " ", ""] as separators by default. The splitter supports splitting code in languages such as C++, Go, Java, JS, PHP, Proto, Python, RST, Ruby, Rust, Scala, Swift, Markdown, LaTeX, HTML, Sol, and C Sharp.
  • SpacyTextSplitter: uses \n\n as the separator by default and uses the en_core_web_sm model of spaCy. This splitter can produce better splitting results.
  • MarkdownHeaderTextSplitter: splits text by headers in the [("#", "head1"), ("##", "head2"), ("###", "head3"), ("####", "head4")] format. The splitter is suitable for Markdown text.
Example: ChineseRecursiveTextSplitter

DocumentLoaderName (string)
The name of the document loader. You do not need to specify this parameter, because a document loader is selected automatically based on the file extension. Valid values:
  • UnstructuredHTMLLoader: .html
  • UnstructuredMarkdownLoader: .md
  • PyMuPDFLoader: .pdf
  • PyPDFLoader: .pdf
  • RapidOCRPDFLoader: .pdf
  • PDFWithImageRefLoader: .pdf (with the text-image association feature)
  • JSONLoader: .json
  • CSVLoader: .csv
  • RapidOCRLoader: .png, .jpg, .jpeg, and .bmp
  • UnstructuredFileLoader: .eml, .msg, .rst, .txt, .docx, .epub, .odt, .pptx, and .tsv
Example: PyMuPDFLoader
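
The constraints stated for the request parameters can be checked on the client before a request is sent. The following is a minimal sketch that uses plain dictionaries rather than the SDK request class. The function name `build_upload_params` is hypothetical, the lower bounds on ChunkSize and ChunkOverlap are assumptions, and which parameters are mandatory is not stated on this page, so the choice of positional arguments is also an assumption.

```python
from typing import Any, Dict, Optional

MAX_CHUNK_SIZE = 2048  # documented maximum for ChunkSize
SPLITTER_NAMES = {
    "ChineseRecursiveTextSplitter",
    "RecursiveCharacterTextSplitter",
    "SpacyTextSplitter",
    "MarkdownHeaderTextSplitter",
}


def build_upload_params(
    db_instance_id: str,
    region_id: str,
    collection: str,
    namespace_password: str,
    file_name: str,
    file_url: str,
    namespace: str = "public",  # documented default
    chunk_size: Optional[int] = None,
    chunk_overlap: Optional[int] = None,
    text_splitter_name: Optional[str] = None,
    metadata: Optional[Dict[str, Any]] = None,
    dry_run: bool = False,  # documented default
) -> Dict[str, Any]:
    """Validate the documented constraints and return the request parameters."""
    # Assumption: ChunkSize must be positive; only the upper bound is documented.
    if chunk_size is not None and not (0 < chunk_size <= MAX_CHUNK_SIZE):
        raise ValueError(f"ChunkSize must be between 1 and {MAX_CHUNK_SIZE}")
    if chunk_overlap is not None:
        # Assumption: a negative overlap is meaningless.
        if chunk_overlap < 0:
            raise ValueError("ChunkOverlap must not be negative")
        if chunk_size is not None and chunk_overlap > chunk_size:
            raise ValueError("ChunkOverlap must not exceed ChunkSize")
    if text_splitter_name is not None and text_splitter_name not in SPLITTER_NAMES:
        raise ValueError(f"unknown TextSplitterName: {text_splitter_name}")

    params: Dict[str, Any] = {
        "DBInstanceId": db_instance_id,
        "RegionId": region_id,
        "Collection": collection,
        "Namespace": namespace,
        "NamespacePassword": namespace_password,
        "FileName": file_name,
        "FileUrl": file_url,
        "DryRun": dry_run,
    }
    optional = {
        "ChunkSize": chunk_size,
        "ChunkOverlap": chunk_overlap,
        "TextSplitterName": text_splitter_name,
        "Metadata": metadata,
    }
    # Leave unset options out so the server applies its own defaults.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params
```

A call with `dry_run=True` produces parameters for a chunking-only run, which matches the workflow suggested in the DryRun note: inspect the chunks first, then resubmit with `dry_run=False`.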

Response parameters

The response is an object with the following fields.

RequestId (string)
The request ID.
Example: ABB39CC3-4488-4857-905D-2E4A051D0521

Message (string)
The returned message.
Example: success

Status (string)
The creation status. Valid values:
  • success: The request succeeded.
  • fail: The request failed.
Example: success

JobId (string)
The job ID, which you can use later to query the job status or cancel the job.
Example: 231460f8-75dc-405e-a669-0c5204887e91

Examples

Sample success response

JSON format

    {
      "RequestId": "ABB39CC3-4488-4857-905D-2E4A051D0521",
      "Message": "success",
      "Status": "success",
      "JobId": "231460f8-75dc-405e-a669-0c5204887e91"
    }
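
A response like the one above can be handled as in the following sketch. It assumes the raw JSON body is available as a string. The helper name `extract_job_id` is hypothetical and not part of the SDK.

```python
import json


def extract_job_id(response_body: str) -> str:
    """Return JobId from an UploadDocumentAsync response, or raise if Status is not success."""
    data = json.loads(response_body)
    if data.get("Status") != "success":
        raise RuntimeError(
            f"upload request failed: {data.get('Message')} "
            f"(RequestId={data.get('RequestId')})"
        )
    return data["JobId"]


sample = """
{
  "RequestId": "ABB39CC3-4488-4857-905D-2E4A051D0521",
  "Message": "success",
  "Status": "success",
  "JobId": "231460f8-75dc-405e-a669-0c5204887e91"
}
"""
print(extract_job_id(sample))  # prints 231460f8-75dc-405e-a669-0c5204887e91
```

Including the RequestId in the error message makes a failed request easier to trace. The returned JobId is the value to pass to GetUploadDocumentJob or CancelUploadDocumentJob.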

Error codes

For more error codes, visit the Error Center.