All Products
Search
Document Center

Tablestore:Parallel scan

Last Updated:Aug 20, 2024

If you do not have requirements on the order of query results, you can use the parallel scan feature to obtain query results in an efficient manner.

Prerequisites

Parameters

Parameter

Description

table_name

The name of the data table.

index_name

The name of the search index.

scan_query

query

The type of the query. The operation supports term query, fuzzy query, range query, geo query, and nested query, which are similar to those supported by the Search operation.

limit

The maximum number of rows that can be returned by each ParallelScan call.

max_parallel

The maximum number of parallel scan tasks per request. The maximum number of parallel scan tasks per request varies based on the data volume. A larger volume of data requires more parallel scan tasks per request. You can use the ComputeSplits operation to query the maximum number of parallel scan tasks per request.

current_parallel_id

The ID of the parallel scan task in the request. Valid values: [0, max_parallel).

token

The token that is used to paginate query results. The results of the ParallelScan request contain the token for the next page. You can use the token to retrieve the next page.

alive_time

The validity period of the current parallel scan task. This validity period is also the validity period of the token. Unit: seconds. Default value: 60. We recommend that you use the default value. If the next request is not initiated within the validity period, no more data can be queried. The validity time of the token is refreshed each time you send a request.

Note

Sessions expire ahead of time if the schemas of the source index and canary index are switched, a single server fails, or a server-end load balancing is performed. In this case, you must recreate sessions.

columns_to_get

The names of the columns that you want to return for each row that meets the query conditions.

To return all columns in the search index, set the return_type parameter to RETURN_ALL_FROM_INDEX.

session_id

The session ID of the parallel scan task. You can call the ComputeSplits operation to create a session and query the maximum number of parallel scan tasks that are supported by the parallel scan request.

Examples

The following sample code provides an example on how to perform parallel scan:

def fetch_rows_per_thread(query, session_id, current_thread_id, max_thread_num):
    token = None

    while True:
        try:
            scan_query = ScanQuery(query, limit = 20, next_token = token, current_parallel_id = current_thread_id,
                                   max_parallel = max_thread_num, alive_time = 30)

            response = client.parallel_scan(
                table_name, index_name, scan_query, session_id,
                columns_to_get = ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

            for row in response.rows:
                print("%s:%s" % (threading.currentThread().name, str(row)))

            if len(response.next_token) == 0:
                break
            else:
                token = response.next_token
        except OTSServiceError as e:
            print (e)
        except OTSClientError as e:
            print (e)

def parallel_scan(table_name, index_name):
    response = client.compute_splits(table_name, index_name)

    query = TermQuery('d', 0.1)

    params = []
    for i in range(response.splits_size):
        params.append((([query, response.session_id, i, response.splits_size], None)))

    pool = threadpool.ThreadPool(response.splits_size)
    requests = threadpool.makeRequests(fetch_rows_per_thread, params)
    [pool.putRequest(req) for req in requests]
    pool.wait()