Use MetaSearch to search for OSS objects based on metadata attributes

MetaSearch is an Object Storage Service (OSS) feature that indexes the metadata of objects so that you can use metadata attributes as search conditions to query objects. This way, you can understand and manage your data structure, perform queries, collect statistics, and manage objects in an efficient manner.

Scenarios

Data audit

You can use MetaSearch to quickly find objects to meet data audit or regulatory requirements. For example, in the financial industry, you can filter objects by using metadata such as custom tags and access control lists (ACLs). This way, you can search for objects that have a specific sensitivity level or specific ACLs to improve the efficiency of data audits.

Enterprise data backup and archiving

When an enterprise backs up or archives data, it can use MetaSearch to quickly find objects based on object metadata such as the creation date, storage class, and custom tags. This way, the enterprise can quickly restore historical data or archived objects.

Usage notes

Supported regions
MetaSearch is supported for buckets that are located in the China (Hangzhou), China (Shanghai), China (Qingdao), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Guangzhou), China (Chengdu), China (Hong Kong), Singapore, Indonesia (Jakarta), Germany (Frankfurt), US (Virginia), US (Silicon Valley), and UK (London) regions.
Object quantity
By default, MetaSearch is supported only for a bucket that contains up to 10 billion objects.

Billing rules

MetaSearch is in public preview, during which you can use it free of charge. To use MetaSearch, you must enable the metadata management feature. After the public preview ends, you will be charged for metadata management and metadata retrievals. For more information about the billable items of MetaSearch, see Data indexing fees.

In addition to the preceding billable items, you are charged based on the number of API operation calls that are made when you use MetaSearch. The following table describes the related API operations:

Description	API operation	Number of API operation calls
Indexes are built for objects in the bucket	HeadObject	One call for each object
The bucket contains tagged objects	GetObjectTag	One call for each tagged object
The bucket contains symbolic links	GetSymlink	One call for each symbolic link
The bucket is scanned	ListObjects	One call for every batch of 1,000 objects that are scanned

For more information, see API operation calling fees.
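
The per-call rules in the preceding table can be turned into a rough estimate of how many API calls are made when an index is built. The following Python sketch is illustrative only: the function name, its parameters, and the sample object counts are assumptions made for this example, and the result is a call count, not a fee.

```python
import math


def estimate_index_api_calls(total_objects, tagged_objects=0, symlinks=0, list_batch=1000):
    """Estimate API calls made while an index is built, based on the table above."""
    return {
        # One HeadObject call for each object.
        "HeadObject": total_objects,
        # One GetObjectTag call for each tagged object.
        "GetObjectTag": tagged_objects,
        # One GetSymlink call for each symbolic link.
        "GetSymlink": symlinks,
        # One ListObjects call for every batch of 1,000 objects that are scanned.
        "ListObjects": math.ceil(total_objects / list_batch),
    }


# Hypothetical bucket: 2.5 million objects, 40,000 tagged, 1,200 symbolic links.
calls = estimate_index_api_calls(2_500_000, tagged_objects=40_000, symlinks=1_200)
print(calls)
```

Multiply each count by the unit price listed in API operation calling fees to approximate the cost.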

Time required for indexing
After you enable MetaSearch, OSS creates an index. The time required to create the index increases with the number of objects stored in the bucket. In most cases, creating an index for the first time takes approximately 1 hour for 10 million objects, approximately 1 day for 1 billion objects, and approximately 2 to 3 days for 10 billion objects. These figures are provided only for reference.
Multipart upload
If a bucket contains objects that are uploaded by using multipart upload, the search results include only the complete objects that are assembled by calling the CompleteMultipartUpload operation. Parts that belong to multipart upload tasks that are initiated but neither completed nor canceled are not included in the search results.

Methods

Use the OSS console

In this example, the following search conditions are used to search for objects:
- Object size: less than 500 KB.
- Last modified time: from 00:00:00 on September 11, 2024 to 00:00:00 on September 12, 2024.
- Sort order: sort objects by object size in ascending order.
- Data aggregation: display the maximum size of the objects that meet the preceding conditions.

Buckets in the China (Guangzhou) region

Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, find and click the desired bucket.
In the left-side navigation tree, choose Object Management > Data Indexing.
On the Data Indexing page, click Enable Now.
In the Data Indexing dialog box, select MetaSearch and click Enable.
Note
The amount of time required for MetaSearch to take effect varies based on the number of objects in the bucket.
Specify the parameters based on your business requirements in the OSS Metadata Condition section and retain the default settings for other parameters.
- Set Start Date to 00:00:00 on September 11, 2024 and End Date to 00:00:00 on September 12, 2024 for the Last Modified At parameter.
- Select Less Than from the drop-down list and enter 500 in the second field for the Object Size parameter.
Specify the search result display method in the Search Result Settings section.
- Set Object Sort Order to Ascending and select Object Size from the Sorted By drop-down list.
- Select Object Size from the Output drop-down list and Maximum from the By drop-down list for the Data Aggregation parameter.
Click Query Now.

For more information, see Search conditions and search result settings.

Buckets in the China (Hangzhou), China (Shanghai), China (Qingdao), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Chengdu), China (Hong Kong), Singapore, Indonesia (Jakarta), Germany (Frankfurt), US (Virginia), US (Silicon Valley), and UK (London) regions

Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, find and click the desired bucket.
In the left-side navigation tree, choose Object Management > Data Indexing.
Enable Metadata Management.
Note
The amount of time required for MetaSearch to take effect varies based on the number of objects in the bucket.
Configure the Basic Filtering Conditions according to the following instructions. Retain the default settings for other parameters.
- Set the Start Time to 00:00:00 on October 20, 2024 and the End Time to 00:00:00 on October 21, 2024 for the Last Modified At parameter.
- Select Less Than from the drop-down list and enter 1600 in the second field for the Object Size parameter.
Click Show more filtering conditions and configure the following parameters.
- Set Object Sort Order to Ascending and select Object Size from the Sorted By drop-down list.
- Select Object Size from the Output drop-down list and Maximum from the By drop-down list for the Data Aggregation parameter.
Two objects that meet the search conditions are returned as shown in the following figure. The maximum size of the objects is 1.54 MB.
For more information about the search conditions and search result settings, see Search conditions and search result settings.

Use OSS SDKs

Currently, only OSS SDK for Java, OSS SDK for Python, and OSS SDK for Go allow you to use MetaSearch to query objects that meet specific conditions. Before you use MetaSearch to search for objects in a bucket, you must enable the metadata management feature for the bucket. For more information about how to use MetaSearch to search for objects by using OSS SDKs for other programming languages, see Overview.

Java

import com.aliyun.oss.ClientBuilderConfiguration;
import com.aliyun.oss.ClientException;
import com.aliyun.oss.OSS;
import com.aliyun.oss.common.auth.*;
import com.aliyun.oss.common.comm.SignVersion;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.OSSException;
import com.aliyun.oss.model.*;
import java.util.ArrayList;
import java.util.List;

public class Demo {

    // In this example, the endpoint of the China (Hangzhou) region is used. Specify your actual endpoint. 
    private static String endpoint = "https://oss-cn-hangzhou.aliyuncs.com";
    // Specify the name of the bucket. Example: examplebucket. 
    private static String bucketName = "examplebucket";

    public static void main(String[] args) throws Exception {
        // Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are configured. 
        EnvironmentVariableCredentialsProvider credentialsProvider = CredentialsProviderFactory.newEnvironmentVariableCredentialsProvider();
        // Specify the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the region to cn-hangzhou.
        String region = "cn-hangzhou";

        // Create an OSSClient instance. 
        ClientBuilderConfiguration clientBuilderConfiguration = new ClientBuilderConfiguration();
        clientBuilderConfiguration.setSignatureVersion(SignVersion.V4);
        OSS ossClient = OSSClientBuilder.create()
                .endpoint(endpoint)
                .credentialsProvider(credentialsProvider)
                .clientConfiguration(clientBuilderConfiguration)
                .region(region)
                .build();

        try {
            // Query objects that meet specific conditions and list information about the objects based on specific fields and sorting methods. 
            int maxResults = 20;
            // Query objects that are smaller than 1,048,576 bytes in size, return up to 20 objects at a time, and sort the objects in ascending order. 
            String query = "{\"Field\": \"Size\",\"Value\": \"1048576\",\"Operation\": \"lt\"}";
            String sort = "Size";
            DoMetaQueryRequest doMetaQueryRequest = new DoMetaQueryRequest(bucketName, maxResults, query, sort);
            Aggregation aggregationRequest = new Aggregation();
            Aggregations aggregations = new Aggregations();
            List<Aggregation> aggregationList = new ArrayList<Aggregation>();
            // Specify the name of the field that is used in the aggregate operation. 
            aggregationRequest.setField("Size");
            // Specify the operator that is used in the aggregate operation. max indicates the maximum value. 
            aggregationRequest.setOperation("max");
            aggregationList.add(aggregationRequest);
            aggregations.setAggregation(aggregationList);

            // Specify the aggregate operation. 
            doMetaQueryRequest.setAggregations(aggregations);
            doMetaQueryRequest.setOrder(SortOrder.ASC);
            DoMetaQueryResult doMetaQueryResult = ossClient.doMetaQuery(doMetaQueryRequest);
            if(doMetaQueryResult.getFiles() != null){
                for(ObjectFile file : doMetaQueryResult.getFiles().getFile()){
                    System.out.println("Filename: " + file.getFilename());
                    // Query the ETag values that are used to identify the content of the objects. 
                    System.out.println("ETag: " + file.getETag());
                    // Query the access control list (ACL) of the objects.
                    System.out.println("ObjectACL: " + file.getObjectACL());
                    // Query the type of the objects. 
                    System.out.println("OssObjectType: " + file.getOssObjectType());
                    // Query the storage class of the objects. 
                    System.out.println("OssStorageClass: " + file.getOssStorageClass());
                    // Query the number of tags of the objects. 
                    System.out.println("TaggingCount: " + file.getOssTaggingCount());
                    if(file.getOssTagging() != null){
                        for(Tagging tag : file.getOssTagging().getTagging()){
                            System.out.println("Key: " + tag.getKey());
                            System.out.println("Value: " + tag.getValue());
                        }
                    }
                    if(file.getOssUserMeta() != null){
                        for(UserMeta meta : file.getOssUserMeta().getUserMeta()){
                            System.out.println("Key: " + meta.getKey());
                            System.out.println("Value: " + meta.getValue());
                        }
                    }
                }
            } else if(doMetaQueryResult.getAggregations() != null){
                for(Aggregation aggre : doMetaQueryResult.getAggregations().getAggregation()){
                    // Query the name of the aggregation field. 
                    System.out.println("Field: " + aggre.getField());
                    // Query the aggregation operator. 
                    System.out.println("Operation: " + aggre.getOperation());
                    // Query the values of the aggregate operations. 
                    System.out.println("Value: " + aggre.getValue());
                    if(aggre.getGroups() != null && aggre.getGroups().getGroup().size() > 0){
                        // Query the values of the aggregation operations by group. 
                        System.out.println("Groups value: " + aggre.getGroups().getGroup().get(0).getValue());
                        // Query the total number of the aggregation operations by group. 
                        System.out.println("Groups count: " + aggre.getGroups().getGroup().get(0).getCount());
                    }
                }
            } else {
                System.out.println("NextToken: " + doMetaQueryResult.getNextToken());
            }
        } catch (OSSException oe) {
            System.out.println("Error Message:" + oe.getErrorMessage());
            System.out.println("Error Code:" + oe.getErrorCode());
            System.out.println("Request ID:" + oe.getRequestId());
            System.out.println("Host ID:" + oe.getHostId());
        } catch (ClientException ce) {
            System.out.println("Error Message: " + ce.getMessage());
        } finally {
            // Shut down the OSSClient instance. 
            ossClient.shutdown();
        }
    }
}

Python

# -*- coding: utf-8 -*-
import oss2
from oss2.credentials import EnvironmentVariableCredentialsProvider
from oss2.models import MetaQuery, AggregationsRequest
# Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are configured. 
auth = oss2.ProviderAuthV4(EnvironmentVariableCredentialsProvider())

# Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com. 
endpoint = "https://oss-cn-hangzhou.aliyuncs.com"
# Specify the ID of the region that maps to the endpoint. Example: cn-hangzhou. This parameter is required if you use the signature algorithm V4.
region = "cn-hangzhou"

# Specify the name of the bucket. Example: examplebucket. 
bucket = oss2.Bucket(auth, endpoint, "examplebucket", region=region)

# Query objects that meet specific conditions and list the object information based on specific fields and sorting methods. 
# Query objects that are smaller than 1 MB, return up to 10 objects at a time, and sort the objects in ascending order. 
do_meta_query_request = MetaQuery(max_results=10, query='{"Field": "Size","Value": "1048576","Operation": "lt"}', sort='Size', order='asc')
result = bucket.do_bucket_meta_query(do_meta_query_request)

# Display the information about each object that is returned. 
for file in result.files:
    # Display the object name. 
    print(file.file_name)
    # Display the ETag of the object. 
    print(file.etag)
    # Display the type of the object. 
    print(file.oss_object_type)
    # Display the storage class of the object. 
    print(file.oss_storage_class)
    # Display the CRC-64 value of the object. 
    print(file.oss_crc64)
    # Display the access control list (ACL) of the object. 
    print(file.object_acl)

Go

package main

import (
	"fmt"
	"os"

	"github.com/aliyun/aliyun-oss-go-sdk/oss"
)

func main() {
	// Obtain access credentials from environment variables. Before you run the sample code, make sure that the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables are configured. 
	provider, err := oss.NewEnvironmentVariableCredentialsProvider()
	if err != nil {
		fmt.Println("Error:", err)
		os.Exit(-1)
	}

	// Create an OSSClient instance. 
	// Specify the endpoint of the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the endpoint to https://oss-cn-hangzhou.aliyuncs.com. Specify your actual endpoint. 
	// Specify the region in which the bucket is located. For example, if the bucket is located in the China (Hangzhou) region, set the region to cn-hangzhou. Specify the actual region.
	clientOptions := []oss.ClientOption{oss.SetCredentialsProvider(&provider)}
	clientOptions = append(clientOptions, oss.Region("yourRegion"))
	// Specify the version of the signature algorithm.
	clientOptions = append(clientOptions, oss.AuthVersion(oss.AuthV4))
	client, err := oss.New("yourEndpoint", "", "", clientOptions...)
	if err != nil {
		fmt.Println("Error:", err)
		os.Exit(-1)
	}
	// Query objects that are larger than 30 bytes in size, return up to 10 objects at a time, and sort the objects in ascending order. 
	query := oss.MetaQuery{
		NextToken:  "",
		MaxResults: 10,
		Query:      `{"Field": "Size","Value": "30","Operation": "gt"}`,
		Sort:       "Size",
		Order:      "asc",
	}
	// Query objects that match the specified conditions and list object information based on the specified fields and sorting methods. 
	result, err := client.DoMetaQuery("examplebucket", query)
	if err != nil {
		fmt.Println("Error:", err)
		os.Exit(-1)
	}
	fmt.Printf("NextToken:%s\n", result.NextToken)
	for _, file := range result.Files {
		fmt.Printf("File name: %s\n", file.Filename)
		fmt.Printf("size: %d\n", file.Size)
		fmt.Printf("File Modified Time:%s\n", file.FileModifiedTime)
		fmt.Printf("Oss Object Type:%s\n", file.OssObjectType)
		fmt.Printf("Oss Storage Class:%s\n", file.OssStorageClass)
		fmt.Printf("Object ACL:%s\n", file.ObjectACL)
		fmt.Printf("ETag:%s\n", file.ETag)
		fmt.Printf("Oss CRC64:%s\n", file.OssCRC64)
		fmt.Printf("Oss Tagging Count:%d\n", file.OssTaggingCount)
		for _, tagging := range file.OssTagging {
			fmt.Printf("Oss Tagging Key:%s\n", tagging.Key)
			fmt.Printf("Oss Tagging Value:%s\n", tagging.Value)
		}
		for _, userMeta := range file.OssUserMeta {
			fmt.Printf("Oss User Meta Key:%s\n", userMeta.Key)
			fmt.Printf("Oss User Meta Value:%s\n", userMeta.Value)
		}
	}
}
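
The preceding samples return at most MaxResults objects per call, and the Go sample shows that both the request and the response carry a NextToken. The following Python sketch outlines one way to page through all results. It assumes that an empty token marks the last page, and it uses a stand-in function instead of a real SDK call so that it runs offline. The names iter_all_files, run_query, and fake_query are hypothetical.

```python
def iter_all_files(run_query):
    """Yield every file across pages.

    run_query(next_token) must return a (files, next_token) tuple.
    Assumption: an empty next_token means that no more pages exist.
    """
    token = ""
    while True:
        files, token = run_query(token)
        yield from files
        if not token:
            break


# Stand-in for an SDK call such as a MetaSearch query. Hypothetical data.
PAGES = {
    "": (["a.txt", "b.txt"], "t1"),
    "t1": (["c.txt"], ""),
}


def fake_query(token):
    return PAGES[token]


names = list(iter_all_files(fake_query))
print(names)  # ['a.txt', 'b.txt', 'c.txt']
```

In real code, run_query would wrap the SDK call, pass the token in the request, and read the next token from the result.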

Use the OSS API

If your business requires a high level of customization, you can directly call RESTful APIs. To directly call an API, you must include the signature calculation in your code. For more information, see DoMetaQuery.
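
The SDK samples above pass a single condition as a JSON string with the Field, Value, and Operation keys. The following Python sketch shows how the conditions from the console example (size less than 500 KB, modified on September 11, 2024) might be combined into one query string. Several details are assumptions that are not confirmed by this topic: the SubQueries key and the and operation for combining conditions, the use of FileModifiedTime as a query field, the gte operation, the +08:00 time zone, and 1 KB being equal to 1,024 bytes. Check the DoMetaQuery reference before you rely on any of them.

```python
import json

KB = 1024  # Assumption: the KB unit in the console means 1,024 bytes.


def condition(field, value, operation):
    """Build one condition in the Field/Value/Operation form used by the SDK samples."""
    return {"Field": field, "Value": str(value), "Operation": operation}


def all_of(*conditions):
    """Combine conditions. The SubQueries key and the and operation are assumptions."""
    return {"Operation": "and", "SubQueries": list(conditions)}


query = all_of(
    condition("Size", 500 * KB, "lt"),
    # Assumed field name and timestamp format for the last modified time.
    condition("FileModifiedTime", "2024-09-11T00:00:00+08:00", "gte"),
    condition("FileModifiedTime", "2024-09-12T00:00:00+08:00", "lt"),
)
query_json = json.dumps(query)
print(query_json)
```

The resulting string can be passed wherever the samples pass their query value.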

Search conditions and search result settings

Search conditions

The following table describes all search conditions. You can specify one or more search conditions based on your business requirements.

OSS metadata conditions

Search condition	Description
Storage Class	By default, the following storage classes supported by OSS are selected: Standard, Infrequent Access (IA), Archive, Cold Archive, and Deep Cold Archive. You can specify the storage class based on your business requirements.
ACL	By default, the following ACLs supported by OSS are selected: Inherited from Bucket, Private, Public Read, and Public Read/Write. You can specify the ACL based on your business requirements.
Object Name	You can select Fuzzy Match or Equal To. For example, to find an object named exampleobject.txt, use one of the following methods: (1) Select Equal To and enter the full name of the object. Example: `exampleobject.txt`. (2) Select Fuzzy Match and enter the prefix or suffix of the object name. Example: `example` or `.txt`. Important: Fuzzy Match matches all object names that contain the specified characters. For example, if you enter `test` next to Fuzzy Match, localfolder/test/.example.jpg and localfolder/test.jpg both meet the search condition and are displayed in the search results.
Upload Type	By default, all of the following upload types are selected. You can specify the upload type based on your business requirements. Normal: objects uploaded by using simple upload. Multipart: objects uploaded by using multipart upload. Appendable: objects uploaded by using append upload. Symlink: symbolic links.
Last Modified At	You can specify Start Date and End Date for Last Modified At. The values of Start Date and End Date are accurate to seconds.
Object Size	You can select Equal To, Greater Than, Greater Than or Equal To, Less Than, or Less Than or Equal To for Object Size. Unit: KB.
Object Versions	You can search for only the current versions of objects.
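
The difference between Equal To and Fuzzy Match for Object Name can be illustrated with a small local emulation. The following Python sketch treats Fuzzy Match as plain substring containment, which agrees with the `test` example in the table but is an assumption about the matching behavior of the service. The function name and mode strings are hypothetical.

```python
def name_matches(name, pattern, mode):
    """Emulate the Object Name condition locally (not the actual service logic)."""
    if mode == "equal":
        # Equal To: the full object name must be identical.
        return name == pattern
    if mode == "fuzzy":
        # Fuzzy Match: assumed to match any name that contains the pattern.
        return pattern in name
    raise ValueError(f"unknown mode: {mode}")


names = [
    "localfolder/test/.example.jpg",
    "localfolder/test.jpg",
    "exampleobject.txt",
]

fuzzy_hits = [n for n in names if name_matches(n, "test", "fuzzy")]
equal_hits = [n for n in names if name_matches(n, "exampleobject.txt", "equal")]
print(fuzzy_hits)
print(equal_hits)
```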

Object ETag and tag conditions

If you want to search for objects based on their ETags and tags, you can enter the ETags or tags of the objects that you want to display in the search results.

ETags support only exact match. An ETag must be enclosed in double quotation marks ("). Example: "5B3C1A2E0563E1B002CC607C6689". If you want to specify multiple ETags, separate them with line feeds.
Specify Object Tags by using key-value pairs. The keys and values of object tags are case-sensitive. For more information about tag rules, see Add tags to an object.

Search result settings

You can sort the search results and view statistics on search results based on specific conditions.

Object Sort Order: You can sort the search results in Ascending, Descending, or Default order by Last Modified Time, Object Name, or Object Size based on your business requirements.
Data Aggregation: You can view statistics on the search results based on specific conditions, such as de-duplication, group count, maximum, minimum, average, and sum. This facilitates efficient data analysis and management.
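
To show what each aggregation type reports, the following Python sketch computes the same kinds of statistics locally over a small, made-up result set. It only illustrates the meaning of maximum, minimum, average, sum, de-duplication, and group count. It does not reflect how OSS computes them, and the sample objects and key names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical search results: (object name, size in bytes, storage class).
results = [
    ("logs/a.log", 1200, "Standard"),
    ("logs/b.log", 800, "IA"),
    ("img/c.jpg", 1200, "Standard"),
    ("img/d.jpg", 4000, "Archive"),
]

sizes = [size for _, size, _ in results]
classes = [storage_class for _, _, storage_class in results]

stats = {
    "maximum": max(sizes),
    "minimum": min(sizes),
    "average": mean(sizes),
    "sum": sum(sizes),
    # De-duplication: the distinct values of a field.
    "distinct_storage_classes": sorted(set(classes)),
    # Group count: the number of objects for each value of a field.
    "group_count": dict(Counter(classes)),
}
print(stats)
```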

FAQ

When hundreds of millions of objects are stored in a bucket, why are data indexes not created for a long period of time?

Approximately 1 second is required to create indexes for 600 objects. You can estimate the period of time required to create indexes based on the number of objects in the bucket.
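
The estimate described above is a simple division. The following Python sketch applies the rate of approximately 600 objects per second from this answer. The function name and the sample object count are assumptions, and the result is only a rough estimate because actual indexing throughput varies.

```python
from datetime import timedelta

# Approximate rate from the FAQ answer: 600 objects indexed per second.
OBJECTS_PER_SECOND = 600


def estimate_index_time(object_count, rate=OBJECTS_PER_SECOND):
    """Return the estimated time to index object_count objects at the given rate."""
    return timedelta(seconds=object_count / rate)


# Hypothetical bucket that contains 300 million objects.
estimate = estimate_index_time(300_000_000)
print(estimate)  # 5 days, 18:53:20
```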

References

MetaSearch supports multiple filtering conditions, such as the last modified time, storage class, ACL, and size of objects. If you want to search for OSS objects whose last modified time is within a specific period of time from a large number of objects in a bucket, see How to filter OSS objects whose last modified time is within a specific period of time.