This topic describes the data compression tests that are performed in different scenarios to compare Lindorm with open source HBase, open source MySQL, and open source MongoDB.
Background information
Lindorm is a multi-model hyper-converged database service that is compatible with multiple open source services and provides cloud-native scalability. In addition, Lindorm allows you to efficiently compress the data in your database. Lindorm supports the zstd algorithm for data compression and optimizes dictionary sampling to reduce storage costs.
The following database services of specific versions are used in the data compression test described in this topic:
Lindorm: The latest version of Lindorm is used. By default, the optimized zstd compression algorithm is used and dictionary-based compression is supported.
Open source HBase: HBase V2.3.4 is used. HBase supports the zstd compression algorithm when a later version of Hadoop is used. However, zstd compression in HBase is unstable and prone to core dumps. Most open source HBase users use the SNAPPY compression algorithm.
Open source MySQL: MySQL 8.0 is used. By default, data compression is disabled. MySQL supports the ZLIB compression algorithm. However, the query performance of MySQL databases is significantly degraded after data compression is enabled. Therefore, MySQL users rarely enable data compression for their databases.
Open source MongoDB: MongoDB 5.0 is used. By default, the SNAPPY compression algorithm is used. Users can also choose the zstd algorithm instead of the SNAPPY algorithm for data compression.
In this topic, the data compression test is performed in the following scenarios where Lindorm is commonly used: orders, Internet of Vehicles (IoV), logs, and user behaviors. In each test scenario, the data compression capability of the following database services that use different compression algorithms is tested and compared: Lindorm that uses the default zstd algorithm, Lindorm for which dictionary-based compression is enabled, open source HBase that uses the SNAPPY algorithm, open source MySQL for which data compression is disabled, open source MongoDB that uses the SNAPPY algorithm, and open source MongoDB that uses the zstd algorithm.
For more information about the test results and conclusions in different scenarios, see Summary.
Orders
Prepare data
In this scenario, a TPC-H dataset is used for the data compression test. TPC-H is a benchmark that is commonly used in the industry. It is defined and released by the Transaction Processing Performance Council (TPC) to evaluate the analytical query capabilities of database engines.
This test follows only parts of the TPC benchmark specifications. Therefore, the test results are not equivalent to and cannot be compared with results that are obtained from tests that fully comply with the TPC benchmark specifications.
Download the TPC-H tool
Download the following file: TPC-H_Tools_v3.0.0.zip.
Generate test data
# unzip TPC-H_Tools_v3.0.0.zip
# cd TPC-H_Tools_v3.0.0/dbgen
# cp makefile.suite makefile
# vim makefile
################Generate scripts and data for Oracle databases. Modify the values of the following fields:
CC = gcc
DATABASE = ORACLE
MACHINE = LINUX
WORKLOAD = TPCH
################
# make -- Generate the dbgen program.
# ./dbgen -s 10 -- Generate 10 GB of test data.
After the commands are executed, eight .tbl files are generated in the current directory. Each file contains the test data for one table. In this test, the ORDERS.tbl file is used. This file contains 15 million rows of data and is 1.76 GB in size. The following table describes the fields in the file and the data types of the fields.
Field | Type |
O_ORDERKEY | INT |
O_CUSTKEY | INT |
O_ORDERSTATUS | CHAR(1) |
O_TOTALPRICE | DECIMAL(15,2) |
O_ORDERDATE | DATE |
O_ORDERPRIORITY | CHAR(15) |
O_CLERK | CHAR(15) |
O_SHIPPRIORITY | INT |
O_COMMENT | VARCHAR(79) |
Create a test table
HBase
create 'ORDERS', {NAME => 'f', DATA_BLOCK_ENCODING => 'DIFF', COMPRESSION => 'SNAPPY', BLOCKSIZE => '32768'}
MySQL
CREATE TABLE ORDERS ( O_ORDERKEY INTEGER NOT NULL,
O_CUSTKEY INTEGER NOT NULL,
O_ORDERSTATUS CHAR(1) NOT NULL,
O_TOTALPRICE DECIMAL(15,2) NOT NULL,
O_ORDERDATE DATE NOT NULL,
O_ORDERPRIORITY CHAR(15) NOT NULL,
O_CLERK CHAR(15) NOT NULL,
O_SHIPPRIORITY INTEGER NOT NULL,
O_COMMENT VARCHAR(79) NOT NULL);
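The preceding statement uses the MySQL default configuration in which compression is disabled, which is the configuration that is tested in this topic. For reference only, InnoDB table compression (ZLIB-based) could be enabled with a statement similar to the following. This statement is an illustration and is not used in this test:
-- Not used in this test: enable the ZLIB-based COMPRESSED row format for InnoDB.
ALTER TABLE ORDERS ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;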
MongoDB
db.createCollection("ORDERS")
Lindorm
# lindorm-cli
CREATE TABLE ORDERS ( O_ORDERKEY INTEGER NOT NULL,
O_CUSTKEY INTEGER NOT NULL,
O_ORDERSTATUS CHAR(1) NOT NULL,
O_TOTALPRICE DECIMAL(15,2) NOT NULL,
O_ORDERDATE DATE NOT NULL,
O_ORDERPRIORITY CHAR(15) NOT NULL,
O_CLERK CHAR(15) NOT NULL,
O_SHIPPRIORITY INTEGER NOT NULL,
O_COMMENT VARCHAR(79) NOT NULL,
primary key(O_ORDERKEY));
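This topic does not show how the generated data is imported. As one example, the pipe-delimited ORDERS.tbl file could be loaded into the MySQL table with commands similar to the following. The trailing '|' that dbgen appends to each line is stripped first, and local_infile is assumed to be enabled:
# Remove the trailing field separator that dbgen appends to each line.
sed -i 's/|$//' ORDERS.tbl
-- Load the file into MySQL. Requires local_infile to be enabled on the client and server.
LOAD DATA LOCAL INFILE 'ORDERS.tbl' INTO TABLE ORDERS
  FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';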
Data compression result
Database | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Table size | 784 MB | 639 MB | 1.23 GB | 2.10 GB | 1.63 GB | 1.32 GB |
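This topic does not state how the table sizes are measured. One possible way to check the on-disk size of the ORDERS table in each system is shown below. Database names and paths depend on your deployment:
-- MySQL: query the data and index size from information_schema. Filter by table_schema as needed.
SELECT ROUND((data_length + index_length) / 1024 / 1024, 2) AS size_mb
  FROM information_schema.tables WHERE table_name = 'ORDERS';
// MongoDB: check the compressed storage size of the collection, in bytes.
db.ORDERS.stats().storageSize
# HBase: check the size of the table directory in HDFS (default layout).
hdfs dfs -du -h /hbase/data/default/ORDERS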
IoV
In this scenario, a Next Generation Simulation (NGSIM) dataset is used for the data compression test. NGSIM is a data collection project initiated by the Federal Highway Administration of the United States. The project data is widely used in studies of driving behaviors such as car following and lane changing, traffic flow analysis, microscopic traffic model construction, vehicle trajectory prediction, and autonomous driving decision planning. All NGSIM data is collected from the trajectories of actual vehicles on U.S. Route 101.
Prepare data
Download the following dataset file: NGSIM_Data.csv. The file contains 11.85 million rows of data and is 1.54 GB in size. Each row contains 25 columns. For more information about the structure of NGSIM datasets, see NGSIM dataset.
Create a test table
HBase
create 'NGSIM', {NAME => 'f', DATA_BLOCK_ENCODING => 'DIFF', COMPRESSION => 'SNAPPY', BLOCKSIZE => '32768'}
MySQL
CREATE TABLE NGSIM ( ID INTEGER NOT NULL,
Vehicle_ID INTEGER NOT NULL,
Frame_ID INTEGER NOT NULL,
Total_Frames INTEGER NOT NULL,
Global_Time BIGINT NOT NULL,
Local_X DECIMAL(10,3) NOT NULL,
Local_Y DECIMAL(10,3) NOT NULL,
Global_X DECIMAL(15,3) NOT NULL,
Global_Y DECIMAL(15,3) NOT NULL,
v_length DECIMAL(10,3) NOT NULL,
v_Width DECIMAL(10,3) NOT NULL,
v_Class INTEGER NOT NULL,
v_Vel DECIMAL(10,3) NOT NULL,
v_Acc DECIMAL(10,3) NOT NULL,
Lane_ID INTEGER NOT NULL,
O_Zone CHAR(10),
D_Zone CHAR(10),
Int_ID CHAR(10),
Section_ID CHAR(10),
Direction CHAR(10),
Movement CHAR(10),
Preceding INTEGER NOT NULL,
Following INTEGER NOT NULL,
Space_Headway DECIMAL(10,3) NOT NULL,
Time_Headway DECIMAL(10,3) NOT NULL,
Location CHAR(10) NOT NULL,
PRIMARY KEY(ID));
MongoDB
db.createCollection("NGSIM")
Lindorm
# lindorm-cli
CREATE TABLE NGSIM ( ID INTEGER NOT NULL,
Vehicle_ID INTEGER NOT NULL,
Frame_ID INTEGER NOT NULL,
Total_Frames INTEGER NOT NULL,
Global_Time BIGINT NOT NULL,
Local_X DECIMAL(10,3) NOT NULL,
Local_Y DECIMAL(10,3) NOT NULL,
Global_X DECIMAL(15,3) NOT NULL,
Global_Y DECIMAL(15,3) NOT NULL,
v_length DECIMAL(10,3) NOT NULL,
v_Width DECIMAL(10,3) NOT NULL,
v_Class INTEGER NOT NULL,
v_Vel DECIMAL(10,3) NOT NULL,
v_Acc DECIMAL(10,3) NOT NULL,
Lane_ID INTEGER NOT NULL,
O_Zone CHAR(10),
D_Zone CHAR(10),
Int_ID CHAR(10),
Section_ID CHAR(10),
Direction CHAR(10),
Movement CHAR(10),
Preceding INTEGER NOT NULL,
Following INTEGER NOT NULL,
Space_Headway DECIMAL(10,3) NOT NULL,
Time_Headway DECIMAL(10,3) NOT NULL,
Location CHAR(10) NOT NULL,
PRIMARY KEY(ID)) ;
Data compression result
Database | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Table size | 995 MB | 818 MB | 1.72 GB | 2.51 GB | 1.88 GB | 1.50 GB |
Logs
In this scenario, the following dataset of web server logs is used for the data compression test: Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1.
Prepare data
Download the log data file access.log from the Online Shopping Store - Web Server Logs page. The data file contains 10.36 million rows of data and is 3.51 GB in size. The following example shows a row of data in the log file.
54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
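The log file has no key column. Before the file can be loaded into the two-column (ID, CONTENT) tables that are created below, a sequential ID must be attached to each line. The following is a minimal sketch that assumes the log lines contain no tab characters:
# Prepend the line number as the ID, separated from the log line by a tab.
awk '{ printf "%d\t%s\n", NR, $0 }' access.log > access.tsv
The resulting tab-separated file can then be imported into each database.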
Create a test table
HBase
create 'ACCESS_LOG', {NAME => 'f', DATA_BLOCK_ENCODING => 'DIFF', COMPRESSION => 'SNAPPY', BLOCKSIZE => '32768'}
MySQL
CREATE TABLE ACCESS_LOG ( ID INTEGER NOT NULL,
CONTENT VARCHAR(10000),
PRIMARY KEY(ID));
MongoDB
db.createCollection("ACCESS_LOG")
Lindorm
# lindorm-cli
CREATE TABLE ACCESS_LOG ( ID INTEGER NOT NULL,
CONTENT VARCHAR(10000),
PRIMARY KEY(ID));
Data compression result
Database | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Table size | 646 MB | 387 MB | 737 MB | 3.99 GB | 1.17 GB | 893 MB |
User behaviors
In this scenario, the following dataset obtained from Alibaba Cloud Tianchi is used for the data compression test: Shop Info and User Behavior data from IJCAI-15.
Prepare data
Download the data_format1.zip package from the user behavior dataset page and use the user_log_format1.csv file for the test. This file contains 54.92 million rows of data and is 1.91 GB in size. The following table shows the data structure of the file.
user_id | item_id | cat_id | seller_id | brand_id | time_stamp | action_type |
328862 | 323294 | 833 | 2882 | 2661 | 829 | 0 |
328862 | 844400 | 1271 | 2882 | 2661 | 829 | 0 |
328862 | 575153 | 1271 | 2882 | 2661 | 829 | 0 |
Create a test table
HBase
create 'USER_LOG', {NAME => 'f', DATA_BLOCK_ENCODING => 'DIFF', COMPRESSION => 'SNAPPY', BLOCKSIZE => '32768'}
MySQL
CREATE TABLE USER_LOG ( ID INTEGER NOT NULL,
USER_ID INTEGER NOT NULL,
ITEM_ID INTEGER NOT NULL,
CAT_ID INTEGER NOT NULL,
SELLER_ID INTEGER NOT NULL,
BRAND_ID INTEGER,
TIME_STAMP CHAR(4) NOT NULL,
ACTION_TYPE CHAR(1) NOT NULL,
PRIMARY KEY(ID));
MongoDB
db.createCollection("USER_LOG")
Lindorm
# lindorm-cli
CREATE TABLE USER_LOG ( ID INTEGER NOT NULL,
USER_ID INTEGER NOT NULL,
ITEM_ID INTEGER NOT NULL,
CAT_ID INTEGER NOT NULL,
SELLER_ID INTEGER NOT NULL,
BRAND_ID INTEGER,
TIME_STAMP CHAR(4) NOT NULL,
ACTION_TYPE CHAR(1) NOT NULL,
PRIMARY KEY(ID));
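The CSV file does not contain the ID column that is defined in the preceding tables. As an illustration, the file could be loaded into MySQL while a sequential ID is generated with a user variable. Empty brand_id values may require additional handling:
-- Generate a sequential ID while loading the CSV file. Requires local_infile to be enabled.
SET @row := 0;
LOAD DATA LOCAL INFILE 'user_log_format1.csv' INTO TABLE USER_LOG
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  IGNORE 1 LINES
  (USER_ID, ITEM_ID, CAT_ID, SELLER_ID, BRAND_ID, TIME_STAMP, ACTION_TYPE)
  SET ID = (@row := @row + 1);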
Data compression result
Database | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Table size | 805 MB | 721 MB | 1.48 GB | 2.90 GB | 3.33 GB | 2.74 GB |
Summary
Based on the test results in the preceding scenarios, Lindorm can compress large amounts of data of different types at a high compression ratio. Compared with the other open source databases, Lindorm provides a significantly higher compression ratio even if dictionary-based compression is not enabled. If you enable dictionary-based compression, Lindorm compresses data at a compression ratio that is 1 to 2 times as high as that of open source HBase, 2 to 4 times as high as that of open source MongoDB, and 3 to 10 times as high as that of open source MySQL.
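For example, for the 3.51 GB log dataset, Lindorm with dictionary-based compression stores the data in 387 MB, which is a compression ratio of about 9.3:1. In comparison, open source HBase with SNAPPY stores the data in 737 MB (about 4.9:1), open source MongoDB with zstd stores the data in 893 MB (about 4:1), and open source MySQL stores the data in 3.99 GB (no reduction). In this scenario, the compression ratio of Lindorm is about 1.9 times that of HBase, about 2.3 times that of MongoDB, and more than 10 times that of MySQL.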
The following table shows the results of the test in different scenarios.
Data | Data size | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Order data (TPC-H dataset) | 1.76 GB | 784 MB | 639 MB | 1.23 GB | 2.10 GB | 1.63 GB | 1.32 GB |
IoV data (NGSIM dataset) | 1.54 GB | 995 MB | 818 MB | 1.72 GB | 2.51 GB | 1.88 GB | 1.50 GB |
Log data (server log dataset) | 3.51 GB | 646 MB | 387 MB | 737 MB | 3.99 GB | 1.17 GB | 893 MB |
User behavior (IJCAI-15 dataset) | 1.91 GB | 805 MB | 721 MB | 1.48 GB | 2.90 GB | 3.33 GB | 2.74 GB |