Optimizing Random Key Value Queries for Large Data Sets

11.11 The Biggest Deals of the Year. 40% OFF on selected cloud servers with a free 100 GB data transfer! Click here to learn more.

After indexes are created for records in a data table, even if the number of records in the table is huge (hundreds of millions or even billions), it is fast to locate a single record by using the key value query method because the times of comparison is only logN (with 2 as the base) after indexes are created. That is, for one billion data records, the system has to compare data for only 30 times (one billion is approximately equal to 2^30), which takes only tens of milliseconds on a modern computer.

However, if the number of key values to be queried is large (thousands or even tens of thousands) and each key value is searched independently, the read count and comparison count will accumulate to tens or hundreds of thousands, and the latency will increase to dozens of minutes or even hours. In this case, simply using the database indexes for query will be an unbearable experience for users.

Case Study: esProc

The following introduces the combination table function of esProc, which can effectively cope with the preceding problem by searching based on high performance indexes and mass key values. The function is described in the following order:

Single-field key
Multi-field key
Multi-thread query
Data appending

Note that this article discusses only the implementation on a single machine, and more articles will be published later to further discuss the implementation on a cluster.

Single-Field Key

The following table with a typical data structure is used as an example:

Create a Combination Table

Create a combination table, and import the source data into the combination table.

A1: Creates and opens the specified file single.ctx, of which the extension of the combination table is ctx. For detailed definition and use of a combination table, see the esProc tutorials.

A2: Sets the data structure of the combination table to (id, data). The field starting with "#" is called a dimension field, based on which data is sequenced in the combination table. That is, the combination table A2 is ordered by "id". A combination table is stored by column by default.

A3: Opens the data file (assume that the source data is stored by text). In this txt file, the table header and data in the front few rows are shown in the following figure. The source data can also come from other types of data sources, such as databases.

A4: Creates a cursor for the source data, where, "@t" indicates that the first row of the file are the field names.

A5: Appends the records pointed by the cursor A4 to the combination table.

The preceding script assumes that the primary key "id" is already ordered in the original data. If the primary key is not ordered in practice, you must sequence the primary key before creating a combination table. In this case, use the cs.sortx() function for sequencing.

In the esProc designer, you can use three lines of code as shown in the following figure to view the first 10 data records:

A1: Opens the combination table file object with the specified file name.

A2: Directly opens the combination table if no parameter is specified in the f.create() function.

A3: Uses the cursor to access the first 10 data records in A2, as shown in the following figure.

Create an Index

Create an index for the combination table to improve the search performance.

A1: Opens the combination table file object with the specified file name.

A2: Uses the create function without any parameter specified to open the combination table.

A3: Creates an index. In the T.index() function, "id_idx" is the index name, "id" is the dimension, and "data" is the data column. Generally, no data column is needed when you create an index. When searching data using an index, the system accesses the original data table and then finds the data. In this example, however, the data column is also included in the index when the index is being created. In this way, the data column is no longer referenced in the search process. This method consumes more space, but increases the search speed.

When an index is created based on the dimension field, sequencing is not performed again because the combination table is already ordered. If "#" is not used when you start to create a combination table, you need to perform sequencing again when creating the index.

Query Data

Complete the query by calling the main program and subprogram.

The query subprogram script is search.dfx.

A3: "keys" indicate parameters that are transferred when the following main program is called.

A4: In the icursor() function of the combination table, uses the index "id_idx" and the condition "A3.contain(id)" to filter the combination table. esProc automatically identifies that the condition "A3.contain(id)" can reference the index, sequences the A3 content, and searches data from the front to the back.

A5: Exports the query result of A4 to result.txt. "@t" specifies that the field names are output when the query result is exported.

Main program script:

A1: Obtains the queried key-value sequence from keys.txt. As the result contains only one column, you can use "@i" to return the result as a sequence:

This sequence is the random key-value set to be queried. In this example, keys.txt is used to store the random key values in advance. In practice, you can use other data sources to store the random key values.

A2: Calls the subprogram serach.dfx to transfer the key-value set obtained in A1 to the subprogram.

The following figure shows part of the content in result.txt:

Moreover, you can embed esProc in a Java application to provide the flexible and convenient query capability for this application. After embedding esProc, you can access the esProc script like using JDBC to access a database. For more information about coding, see "Calling esProc in Java" in the tutorial.

In this single-field key query example, the data structure is simple. The queried key value field is "id", and the data to be obtained is in a single column. If there are some other columns, for example:

More data column fields should be included in the index creation step. The data column parameters are written as follows:

In the subsequent multi-field key scenario, you need to create multiple index fields, and write the parameters as follows: index(id_idx;id1,id2,…,idN;data1,data2,…,dataN).

Multi-Field Key

A multi-field key refers to a composite primary key. For example,

The two fields, "type" and "id", serve as the composite primary key to determine a record.

Method 1 (Common Method)

Create a combination table

In this example, A2 needs to specify two dimensions, "type" and "id". Other part of the code is the same as that in the single-field key scenario.

Create an index

As there are two dimensions "type" and "id" in this example, you need to include them when creating the index, as shown in A3.

Query data

A3 prepares two data records as the key-value sets to be searched, which are 2-tuples composed of "type" and "id". The structures of the two records are as follows:

A4: "A3.contain([type,id])" indicates that data is filtered based on the sequence of 2-tuples. Therefore, parameters in the contain() function must be changed to 2-tuples.

The following shows the content of the result file that is finally exported:

Method 2 (merged primary key)

Although the multi-field keys can be directly used, the storage speed and comparison speed are slow if these keys are used. To achieve a high performance, a multiple-field key is often changed into a single-field key by means of merging.

For the data structure in this example, although "type" is a string, it is enumerated. Therefore, "type" can be digitalized and merged with "id" as a new primary key field. As the maximum long-type value is (2^63 - 1), the wide value range can accommodate the result after "id" is merged with the digitalized "type". The new primary key obtained after "type" and "id" are merged is called "nid". Based on the data size, you need to determine how many digits in "nid" can be used to represent "type" and "id", respectively.

For example, the value of "id" contains nine digits, and three digits are sufficient to represent the number of enumerated "type" values. Therefore, "nid" only needs 13 digits. To prevent the case that the first few digits are 0s, you can set the first digit to 1 for alignment purpose. In this way, the composite primary key can be changed into a unique single-field primary key. For the next 12 digits after the first digit "1", the first three digits represent the digitalized "type" and the remaining nine digits represent the original "id".

The code is as follows:

A1: sequence composed of the enumerated "type" values. In practice, the enumerated list may come from a file or a database.

A2: Allocates a "tid" for every "type" value in the enumerated value sequence to prepare for the subsequent merging of the digitalized primary key.

A3 to A6: Obtain data from multi_source.txt and change the enumerated strings in the "type" column into numbers based on the mapping in A2, and merge "type" and "id" into the new primary key "nid".

A7 and A8: Check the data after primary key merging. After skipping the first 99,999,995 records before the cursor A4, you can obtain the following 10 records:

In this way, you can obtain the following data structure of the new single-field key:

Next, proceed using the same method described in the single-field key scenario. Ensure that "nid" must be ordered.

Multi-Thread Query

On the basis of the preceding methods, you can use the multi-thread concurrent method to further improve the query performance.

The multi-thread concurrent method means that data is divided into N portions and queried by N threads. However, dividing data into N portions at random may fail to truly improve the performance. This is because a key-value set to be queried is unknown and theoretically it cannot be ensured that the target data can be evenly distributed to every combination table. A better way is to observe the characteristics of the key-value set, on the basis of which data is divided as evenly as possible.

After changing a multi-key field into a single-field key by means of merging in the preceding example, you can use the merged primary key "nid" to perform a modulo operation on 4, store the data with the same remainder in the same combination table, and eventually use four combination tables to load all the existing data. With this file splitting method, you can distribute data to be queried more evenly.

If the key-value data has obvious business characteristics, you can split files based on the fields used in the actual business scenarios, such as the date and department. For example, if you distribute 1,000 records of department A into 10 files evenly, each file has 100 records. When you use multiple threads to query records that belong to department A, each thread extracts these 100 records from the corresponding file.

Let's take a look at actual examples.