Use OpenSearch LLM-Based Conversational Search Edition SDKs to push unstructured documents

To upload data in push mode, first generate the datasets in a valid format and load them into the client buffer. Then, call the push method to submit all of the buffered datasets to the application at one time.
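
The buffer-then-push flow described above can be sketched in a few lines of Python. This is only an illustration: PushBuffer and its methods are hypothetical names, not part of the OpenSearch SDK, and the send argument stands in for whatever function actually submits the request.

```python
from typing import Callable, Dict, List


class PushBuffer:
    """Hypothetical client-side buffer; not part of the OpenSearch SDK."""

    def __init__(self) -> None:
        self._commands: List[Dict] = []

    def __len__(self) -> int:
        return len(self._commands)

    def add(self, command: Dict) -> None:
        # Reject commands that lack the two top-level keys used in this topic.
        if "cmd" not in command or "fields" not in command:
            raise ValueError("a command needs both 'cmd' and 'fields'")
        self._commands.append(command)

    def push(self, send: Callable[[List[Dict]], object]) -> object:
        # Submit everything buffered so far in a single call.
        result = send(list(self._commands))
        # Clear the buffer only after send returns without raising.
        self._commands.clear()
        return result
```

In practice, send could be a small function that wraps the docBulk method from the Python demo code in this topic.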

Dependencies

To use OpenSearch SDKs to upload files, add the following dependencies.

The Python demo code also imports Config and Client from BaseRequest. For information about BaseRequest, see Sample code for the Python client.

Java

<dependency>
 <groupId>com.aliyun.opensearch</groupId>
 <artifactId>aliyun-sdk-opensearch</artifactId>
 <version>4.0.0</version>
</dependency>

Python

pip install alibabacloud_tea_util 
pip install alibabacloud_opensearch_util
pip install alibabacloud_credentials

Configure environment variables

Configure the ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables.

Important

The AccessKey pair of an Alibaba Cloud account can be used to access all API operations. We recommend that you use a Resource Access Management (RAM) user to call API operations or perform routine O&M. For information about how to use a RAM user, see Create a RAM user.
For information about how to create an AccessKey pair, see Create an AccessKey pair.
If you use the AccessKey pair of a RAM user, make sure that the required permissions are granted to the AliyunServiceRoleForOpenSearch role by using your Alibaba Cloud account. For more information, see AliyunServiceRoleForOpenSearch and Access authorization rules.
We recommend that you do not include your AccessKey pair in materials that are easily accessible to others, such as the project code. Otherwise, your AccessKey pair may be leaked, which compromises the security of all resources in your account.

Linux and macOS
Run the following commands. Replace <access_key_id> and <access_key_secret> with the AccessKey ID and AccessKey secret of the RAM user that you use.
```
export ALIBABA_CLOUD_ACCESS_KEY_ID=<access_key_id> 
export ALIBABA_CLOUD_ACCESS_KEY_SECRET=<access_key_secret>
```
Windows
1. Create an environment variable file, add the ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables to the file, and then set the environment variables to your AccessKey ID and AccessKey secret.
2. Restart Windows for the AccessKey pair to take effect.
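
Before running the demo code, it can help to confirm that both variables are actually visible to the process. The following Python sketch does that; missing_credentials and REQUIRED_VARS are hypothetical names introduced here for illustration and are not part of any SDK.

```python
import os
from typing import List, Mapping

# The two variable names used throughout this topic.
REQUIRED_VARS = ("ALIBABA_CLOUD_ACCESS_KEY_ID", "ALIBABA_CLOUD_ACCESS_KEY_SECRET")


def missing_credentials(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_credentials() with no arguments checks the current process environment. An empty list means that both variables are set.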

Demo code

Java

package com.leiyu.push;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

import com.aliyun.opensearch.OpenSearchClient;
import com.aliyun.opensearch.sdk.generated.OpenSearch;
import com.aliyun.opensearch.sdk.generated.commons.OpenSearchClientException;
import com.aliyun.opensearch.sdk.generated.commons.OpenSearchException;
import com.aliyun.opensearch.sdk.generated.commons.OpenSearchResult;


public class PushNonStructuralLLM {
    private static String appName = "The name of the OpenSearch application to which you want to push documents";
    private static String host = "The API endpoint of the OpenSearch application";
    private static String path = "/apps/%s/actions/knowledge-bulk";

    public static void main(String[] args) throws IOException {
        // Specify your AccessKey pair.
        // Obtain the AccessKey ID and AccessKey secret from environment variables.
        // You must configure the environment variables before you run this code.
        String accesskey = System.getenv("ALIBABA_CLOUD_ACCESS_KEY_ID");
        String secret = System.getenv("ALIBABA_CLOUD_ACCESS_KEY_SECRET");
        
        String appPath = String.format(path, appName);

        // Create an OpenSearch object.
        OpenSearch openSearch = new OpenSearch(accesskey, secret, host);
        // Use the OpenSearch object as a parameter to create an OpenSearchClient object.
        OpenSearchClient openSearchClient = new OpenSearchClient(openSearch);

        // Create a JSON object for adding a single document.
        // The local file to upload. It is named filePath so that it does not shadow the static path field.
        Path filePath = Paths.get("C:/Users/LEIYU/Desktop/Word/test.docx");
        JSONObject oneRequest = new JSONObject();
        // Set the cmd parameter to BASE64 to upload an unstructured document such as a PDF, Word, or HTML file.
        oneRequest.put("cmd", "BASE64");
        JSONObject fields = new JSONObject();
        // The primary key of the document. The value must be unique.
        fields.put("id", "50");
        // The name of the file, including the file extension.
        fields.put("title", "test.docx");
        // The document URL.
        fields.put("url", "www.baidu.com");
        // The Base64-encoded content of the file.
        fields.put("content", Base64.getEncoder().encodeToString(Files.readAllBytes(filePath)));
        fields.put("category", "docs");
        oneRequest.put("fields", fields);

        // Create a JSON array. You can use the JSON array to add multiple documents at a time.
        final JSONArray request = new JSONArray();
        request.add(oneRequest);
        //request.add(twoRequest);

        Map<String, String> params = new HashMap<String, String>() {{
            put("format", "full_json");
            put("_POST_BODY", request.toString());
        }};
        try {
            OpenSearchResult openSearchResult = openSearchClient.callAndDecodeResult(appPath, params, "POST");
            // Display the returned result.
            System.out.println(openSearchResult.getResult());
        } catch (OpenSearchException e) {
            e.printStackTrace();
        } catch (OpenSearchClientException e) {
            e.printStackTrace();
        }
    }

}

Python

# -*- coding: utf-8 -*-

import os
import base64
from Tea.exceptions import TeaException
from Tea.request import TeaRequest
from alibabacloud_tea_util import models as util_models
from BaseRequest import Config, Client


class knowledge:
    def __init__(self, config: Config):
        self.Clients = Client(config=config)
        self.runtime = util_models.RuntimeOptions(
            connect_timeout=10000,
            read_timeout=10000,
            autoretry=False,
            ignore_ssl=False,
            max_idle_conns=50,
            max_attempts=3
        )
        self.header = {}

    def docBulk(self, app_name: str,doc_content: list):
        try:
            response = self.Clients._request(method="POST",
                                             pathname=f'/v3/openapi/apps/{app_name}/actions/knowledge-bulk',
                                             query={}, headers=self.header,
                                             body=doc_content, runtime=self.runtime)
            return response
        except Exception as e:
            print(e)


if __name__ == "__main__":
    # Specify the endpoint of the OpenSearch API. The value does not contain the http:// prefix.
    endpoint = "<endpoint>"
    # Specify the request protocol. Valid values: HTTPS and HTTP.
    endpoint_protocol = "HTTP"
    # Specify your AccessKey pair.
    # Obtain the AccessKey ID and AccessKey secret from environment variables. 
    # You must configure environment variables before you run this code. For more information, see the "Configure environment variables" section of this topic.
    access_key_id = os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_ID")
    access_key_secret = os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET")
    # Specify the authentication method. Default value: access_key. A value of sts specifies authentication based on Resource Access Management (RAM) and Security Token Service (STS).
    # Valid values: sts and access_key.
    auth_type = "access_key"
    # If you use authentication based on RAM and STS, you must specify the security_token parameter. You can call the AssumeRole operation of Alibaba Cloud RAM to obtain an STS token.
    security_token = "<security_token>"
    # Specify common request parameters.
    # Note: The security_token and type parameters are required only if you use the SDK as a RAM user.
    Configs = Config(endpoint=endpoint, access_key_id=access_key_id, access_key_secret=access_key_secret,
                     security_token=security_token, type=auth_type, protocol=endpoint_protocol)
    # Create a client for the OpenSearch LLM-Based Conversational Search Edition instance.
    ops = knowledge(Configs)
    # Replace <Application name> with the name of your OpenSearch LLM-Based Conversational Search Edition instance.
    app_name = "<Application name>"

    # ---------------Push unstructured documents to an OpenSearch LLM-Based Conversational Search Edition instance---------------
    # Change the path to that of your local file.
    with open('/Users/liu/Downloads/test.docx', 'rb') as file:
        data = file.read()
        # b64encode returns bytes. Decode the result to a string so that the request body can be serialized as JSON.
        data_b64 = base64.b64encode(data).decode('utf-8')

        document = [
            {
                "fields": {
                    "id": "1",
                    "title": "test.docx",
                    "url": "www.baidu.com",
                    "content": data_b64,
                    "category": "opensearch",
                    "timestamp": 1691722088645,
                    "score": 0.8821945219723084
                },
                "cmd": "BASE64"
            }
        ]


        # Sample DELETE command, shown for reference only. It is not sent in this demo.
        deletedocument = {"cmd": "DELETE", "fields": {"id": 2}}
        # Push the documents.
        documents = document
        res5 = ops.docBulk(app_name=app_name, doc_content=documents)
        print(res5)

Note

The cmd parameter must be set to BASE64.
The content parameter specifies the Base64-encoded content of the unstructured document to be pushed. For more information, see the preceding demo code.
The title parameter specifies the name of the document to be pushed, including the file extension.
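
The notes above can be folded into a small helper that builds one command per local file. The following Python sketch assumes the same field names as the demo code; build_base64_command is a hypothetical name, not an SDK function, and the optional timestamp and score fields from the demo are left out for brevity.

```python
import base64
from pathlib import Path
from typing import Dict


def build_base64_command(doc_id: str, file_path: str, url: str, category: str) -> Dict:
    """Build one BASE64 command for a local file (hypothetical helper)."""
    path = Path(file_path)
    # b64encode returns bytes; decode to str so the command can be serialized as JSON.
    content = base64.b64encode(path.read_bytes()).decode("ascii")
    return {
        "cmd": "BASE64",  # must be BASE64 for unstructured documents
        "fields": {
            "id": doc_id,        # primary key, must be unique
            "title": path.name,  # file name including the extension
            "url": url,
            "content": content,
            "category": category,
        },
    }
```

A list of such commands has the same shape as the document list that the Python demo code passes to docBulk.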