UploadDocumentAsync - Asynchronous Document Upload - Cloud-Native Data Warehouse AnalyticDB

Uploads a document asynchronously.

Operation description

The server loads and chunks a document based on the file extension, performs vectorization by using the embedding model that is specified when you call the CreateDocumentCollection operation, and then writes the document to the specified document collection. This operation supports multi-modal embedding for various formats of text and images.

Related operations:

You can call the GetUploadDocumentJob operation to query the progress and result of a document upload job.
You can call the CancelUploadDocumentJob operation to cancel a document upload job.
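
Because the upload runs asynchronously, a caller typically polls GetUploadDocumentJob until the job finishes. The following is a minimal sketch of such a loop, not an SDK example: `get_status` is a hypothetical callable that wraps GetUploadDocumentJob and returns a status string, and the only status names taken from this page are Pending and Running. Any other value is treated as terminal, which is an assumption.

```python
import time
from typing import Callable

# Only Pending and Running are named on this page; every other status value is
# treated as terminal here (assumption, check the GetUploadDocumentJob reference).
IN_PROGRESS = {"Pending", "Running"}


def wait_for_job(
    get_status: Callable[[str], str],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job leaves the in-progress states or the timeout expires."""
    waited = 0.0
    while True:
        status = get_status(job_id)
        if status not in IN_PROGRESS:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still {status} after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop testable without real delays. If the timeout expires, the caller can decide whether to call CancelUploadDocumentJob.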

Note

After a document upload request is submitted, the request is queued for processing. Each Resource Access Management (RAM) user or Alibaba Cloud account can have up to 20 documents in the Pending or Running state at a time.
A text document can be split into up to 100,000 chunks.
If a document collection uses the OnePeace model, each RAM user or Alibaba Cloud account can upload and query up to 10,000 images.
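
When many files are uploaded in a batch, the limit of 20 unfinished documents can be respected on the client side. The sketch below is one possible approach under stated assumptions: `submit`, `is_in_progress`, and `wait` are hypothetical callables that the caller supplies (for example, wrappers around UploadDocumentAsync and GetUploadDocumentJob). It only tracks jobs submitted by this process, so it does not account for other clients that use the same account.

```python
from typing import Callable, Iterable, List

MAX_IN_FLIGHT = 20  # documented limit on documents in the Pending or Running state


def submit_with_limit(
    files: Iterable[str],
    submit: Callable[[str], str],
    is_in_progress: Callable[[str], bool],
    wait: Callable[[], None],
    max_in_flight: int = MAX_IN_FLIGHT,
) -> List[str]:
    """Submit every file while keeping at most max_in_flight unfinished jobs."""
    active: List[str] = []
    job_ids: List[str] = []
    for name in files:
        # Forget finished jobs, then wait while the quota is still exhausted.
        active = [j for j in active if is_in_progress(j)]
        while len(active) >= max_in_flight:
            wait()
            active = [j for j in active if is_in_progress(j)]
        job_id = submit(name)  # hypothetical wrapper that returns the JobId
        active.append(job_id)
        job_ids.append(job_id)
    return job_ids
```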

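Debugging

You can run this operation directly in OpenAPI Explorer, which saves you from calculating signatures. After the operation runs successfully, OpenAPI Explorer can automatically generate SDK sample code.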

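Authorization information

The following table lists the authorization information for this API. You can use it in the Action element of a RAM policy statement to grant a RAM user or RAM role the permission to call this API. The columns are described as follows:

Operation: the specific permission.
Access level: the access level of each operation. Valid values: Write, Read, and List.
Resource type: the type of resource on which the operation can be authorized:
- Required resource types are shown with a highlighted background.
- Operations that do not support resource-level authorization are shown as All resources.
Condition key: a condition key defined by the cloud product itself.
Associated operation: other permissions required for the operation to succeed. The caller must also have the permissions for the associated operations.

Operation	Access level	Resource type	Condition key	Associated operation
gpdb:UploadDocumentAsync	create	*Document `acs:gpdb:{#regionId}:{#accountId}:document/{#DBInstanceId}`	None	None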
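
As an illustration of how the action and resource pattern above fit into a policy, the following sketch builds a RAM policy document that allows this operation on a single instance. The account ID and instance ID in the usage line are placeholders, and the policy is a minimal example rather than a recommended production policy.

```python
import json


def upload_policy(region_id: str, account_id: str, instance_id: str) -> dict:
    """Build a RAM policy document that allows UploadDocumentAsync on one instance."""
    # Resource ARN follows the pattern listed in the authorization table.
    resource = f"acs:gpdb:{region_id}:{account_id}:document/{instance_id}"
    return {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "gpdb:UploadDocumentAsync",
                "Resource": resource,
            }
        ],
    }


# Placeholder account and instance IDs, for illustration only.
print(json.dumps(upload_policy("cn-hangzhou", "1234567890123456", "gp-example"), indent=2))
```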

Request parameters

Name	Type	Required	Description	Example
DBInstanceId	string	Yes	The ID of the instance for which vector engine optimization is enabled. You can call the DescribeDBInstances operation to query the details of all AnalyticDB for PostgreSQL instances in the region, including instance IDs.	gp-bp12ga6v69h86****
Collection	string	Yes	The name of the document collection. Note: The collection is created by the CreateDocumentCollection operation. You can call the ListDocumentCollections operation to list the document collections that already exist.	document
Namespace	string	No	The namespace. Default value: public. You can call the CreateNamespace operation to create a namespace and the ListNamespaces operation to list namespaces.	mynamespace
NamespacePassword	string	Yes	The password of the namespace. Note: The password is specified when you call the CreateNamespace operation.	testpassword
RegionId	string	Yes	The region ID of the instance.	cn-hangzhou
FileName	string	Yes	The file name of the document. Note: We recommend that you add an extension to the file name. Examples: `.json`, `.md`, and `.pdf`. If you do not add an extension, the default loader designed for unstructured data is used. If an image file is involved, the file name must contain an extension. The following extensions are supported: `.bmp`, `.jpg`, `.jpeg`, `.png`, and `.tiff`. You can use a compressed package to upload images. The package file name must contain an extension. Supported package file extensions: `.tar`, `.gz`, and `.zip`.	mydoc.txt
FileUrl	string	Yes	The publicly accessible URL of the document. Note: We recommend that you call this operation through the SDK, which provides the UploadDocumentAsyncAdvance method for uploading local files directly. If the URL points to an image archive, the archive should contain no more than 100 images.	https://xx/mydoc.txt
Metadata	object	No	The metadata. The value of this parameter must be the same as the Metadata parameter that is specified when you call the CreateDocumentCollection operation.
	any	No	The metadata. The value must be the same as the Metadata field that is specified when the document collection is created (CreateDocumentCollection).	{"title":"mytitle","page":1}
ChunkSize	integer	No	The size of each chunk when large data is split into smaller parts. Maximum value: 2048.	250
ChunkOverlap	integer	No	The amount of data that overlaps between consecutive chunks. The value cannot be greater than the value of the ChunkSize parameter. Note: This parameter prevents the loss of context caused by truncation. For example, when you upload a long text, keeping some overlapping text between consecutive chunks preserves context across chunk boundaries.	50
Separators	array	No	The separators that are used to split large amounts of data. Note: This is an important parameter that determines the chunking effect. This parameter is related to the splitter that is specified by the TextSplitterName parameter. In most cases, you do not need to specify this parameter. The server assigns separators based on the value of the TextSplitterName parameter.
	string	No	The separator.	.
DryRun	boolean	No	Specifies whether to perform only document understanding and chunking, but not vectorization and storage. Default value: false. Note: You can set this parameter to true, check the chunking effect, and then perform optimization if needed.	false
ZhTitleEnhance	boolean	No	Specifies whether to enable title enhancement. Note: Title enhancement identifies title text, marks it in the metadata, and then combines the text with its upper-level title to enrich the text.	false
TextSplitterName	string	No	The name of the splitter. Valid values: (1) ChineseRecursiveTextSplitter: inherits from RecursiveCharacterTextSplitter, uses `["\n\n","\n", "。|!|?", "\.\s|\!\s|\?\s", ";|;\s", ",|,\s"]` as separators by default, and uses regular expressions to match text. (2) RecursiveCharacterTextSplitter: uses `["\n\n", "\n", " ", ""]` as separators by default. The splitter supports splitting code in languages such as `C++, Go, Java, JS, PHP, Proto, Python, RST, Ruby, Rust, Scala, Swift, Markdown, LaTeX, HTML, Sol, and C Sharp`. (3) SpacyTextSplitter: uses `\n\n` as the separator by default and uses the en_core_web_sm model of spaCy. The splitter can produce better splitting results. (4) MarkdownHeaderTextSplitter: splits text in the `[("#", "head1"), ("##", "head2"), ("###", "head3"), ("####", "head4")]` format. The splitter is suitable for Markdown text.	ChineseRecursiveTextSplitter
DocumentLoaderName	string	No	The name of the document loader. You do not need to specify this parameter because a document loader is automatically selected based on the file extension. Valid values: UnstructuredHTMLLoader (`.html`); UnstructuredMarkdownLoader (`.md`); PyMuPDFLoader (`.pdf`); PyPDFLoader (`.pdf`); RapidOCRPDFLoader (`.pdf`); PDFWithImageRefLoader (`.pdf`, with the text-image association feature); JSONLoader (`.json`); CSVLoader (`.csv`); RapidOCRLoader (`.png`, `.jpg`, `.jpeg`, and `.bmp`); UnstructuredFileLoader (`.eml`, `.msg`, `.rst`, `.txt`, `.docx`, `.epub`, `.odt`, `.pptx`, and `.tsv`).	PyMuPDFLoader
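
Several of the constraints in the table can be checked on the client before a request is sent. The following sketch assembles a parameter dictionary and validates the documented limits on ChunkSize, ChunkOverlap, and image file extensions. The function name `build_upload_params`, the `contains_images` flag, and the lower bound of 1 for ChunkSize are assumptions made for illustration. The dictionary keys are the parameter names from the table, and sending the request is left to whichever SDK or HTTP client you use.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Optional

MAX_CHUNK_SIZE = 2048  # documented maximum for ChunkSize
IMAGE_EXTS = {".bmp", ".jpg", ".jpeg", ".png", ".tiff"}
ARCHIVE_EXTS = {".tar", ".gz", ".zip"}


def build_upload_params(
    db_instance_id: str,
    collection: str,
    namespace_password: str,
    region_id: str,
    file_name: str,
    file_url: str,
    namespace: Optional[str] = None,
    chunk_size: Optional[int] = None,
    chunk_overlap: Optional[int] = None,
    dry_run: bool = False,
    contains_images: bool = False,  # hypothetical flag, not an API parameter
) -> Dict[str, Any]:
    """Return a request-parameter dict after checking the documented limits."""
    # Lower bound of 1 is an assumption; only the maximum is documented.
    if chunk_size is not None and not 1 <= chunk_size <= MAX_CHUNK_SIZE:
        raise ValueError(f"ChunkSize must be between 1 and {MAX_CHUNK_SIZE}")
    # The overlap is only checked when ChunkSize is given explicitly, because
    # the server-side default chunk size is not stated in this reference.
    if chunk_size is not None and chunk_overlap is not None and chunk_overlap > chunk_size:
        raise ValueError("ChunkOverlap cannot be greater than ChunkSize")
    if contains_images:
        ext = PurePosixPath(file_name).suffix.lower()
        if ext not in IMAGE_EXTS | ARCHIVE_EXTS:
            raise ValueError(f"unsupported image or archive extension: {ext!r}")

    params: Dict[str, Any] = {
        "DBInstanceId": db_instance_id,
        "Collection": collection,
        "NamespacePassword": namespace_password,
        "RegionId": region_id,
        "FileName": file_name,
        "FileUrl": file_url,
        "DryRun": dry_run,
    }
    optional = {"Namespace": namespace, "ChunkSize": chunk_size, "ChunkOverlap": chunk_overlap}
    # Omit optional parameters that were not set so server defaults apply.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params
```

Setting `dry_run=True` first is a convenient way to inspect the chunking result before vectors are written to the collection.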

Response parameters

Name	Type	Description	Example
	object
RequestId	string	The request ID.	ABB39CC3-4488-4857-905D-2E4A051D0521
Message	string	The returned message.	success
Status	string	The creation status. Valid values: success and fail.	success
JobId	string	The job ID, which is used to query the job status or cancel the job.	231460f8-75dc-405e-a669-0c5204887e91

Examples

Sample success response

JSON format

{
  "RequestId": "ABB39CC3-4488-4857-905D-2E4A051D0521",
  "Message": "success",
  "Status": "success",
  "JobId": "231460f8-75dc-405e-a669-0c5204887e91"
}
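
A caller usually needs only the JobId from this response, provided that the request was accepted. The following sketch parses a response body shaped like the sample above and returns the job ID. The function name `extract_job_id` is hypothetical, and raising an error for any Status other than success is a design choice made for this example.

```python
import json


def extract_job_id(body: str) -> str:
    """Return the JobId from a response body, or raise if Status is not success."""
    data = json.loads(body)
    if data.get("Status") != "success":
        raise RuntimeError(
            f"upload request failed: {data.get('Message')} "
            f"(RequestId={data.get('RequestId')})"
        )
    return data["JobId"]


sample = """{
  "RequestId": "ABB39CC3-4488-4857-905D-2E4A051D0521",
  "Message": "success",
  "Status": "success",
  "JobId": "231460f8-75dc-405e-a669-0c5204887e91"
}"""
print(extract_job_id(sample))
```

The returned ID can then be passed to GetUploadDocumentJob or CancelUploadDocumentJob.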

Error codes

For more error codes, visit the Error Center.