This topic describes the data compression tests that are performed in different scenarios to compare Lindorm with open source HBase, open source MySQL, and open source MongoDB.
Background information
Lindorm is a multi-model hyper-converged database service that is compatible with multiple open source services and provides cloud-native scalability. In addition, Lindorm allows you to efficiently compress the data in your database. Lindorm supports the zstd algorithm for data compression and optimizes dictionary sampling to reduce storage costs.
The following database services of specific versions are used in the data compression test described in this topic:
Lindorm: The latest version of Lindorm is used. By default, the optimized zstd compression algorithm is used and dictionary-based compression is supported.
Open source HBase: HBase V2.3.4 is used. HBase supports the zstd compression algorithm when a later version of Hadoop is used. However, zstd compression in HBase is unstable and prone to core dumps. Most open source HBase users use the SNAPPY compression algorithm.
Open source MySQL: MySQL 8.0 is used. By default, data compression is disabled. MySQL supports the ZLIB compression algorithm. However, the query performance of MySQL databases is significantly degraded after data compression is enabled. Therefore, MySQL users rarely enable data compression for their databases.
Open source MongoDB: MongoDB 5.0 is used. By default, the SNAPPY compression algorithm is used. Users can also choose the zstd algorithm instead of the SNAPPY algorithm for data compression.
In this topic, the data compression test is performed in the following scenarios where Lindorm is commonly used: orders, Internet of Vehicles (IoV), logs, and user behaviors. In each test scenario, the data compression capability of the following database services that use different compression algorithms is tested and compared: Lindorm that uses the default zstd algorithm, Lindorm for which dictionary-based compression is enabled, open source HBase that uses the SNAPPY algorithm, open source MySQL for which data compression is disabled, open source MongoDB that uses the SNAPPY algorithm, and open source MongoDB that uses the zstd algorithm.
For more information about the test results and conclusions in different scenarios, see Summary.
Orders
Prepare data
In this scenario, a TPC-H dataset is used for the data compression test. TPC-H is a benchmark that is commonly used in the industry. It is defined and released by the Transaction Processing Performance Council (TPC) to evaluate the analytical query capabilities of database engines.
This test follows only parts of the TPC benchmark specifications. Therefore, the test results are not equivalent to and cannot be compared with results that are obtained from tests that fully comply with the TPC benchmark specifications.
Download the TPC-H tool
Download the following file: TPC-H_Tools_v3.0.0.zip.
Generate test data
# unzip TPC-H_Tools_v3.0.0.zip
# cd TPC-H_Tools_v3.0.0/dbgen
# cp makefile.suite makefile
# vim makefile
################Generate scripts and data for Oracle databases. Modify the values of the following fields:
CC = gcc
DATABASE = ORACLE
MACHINE = LINUX
WORKLOAD = TPCH
################
# make -- Generate the dbgen program.
# ./dbgen -s 10 -- Generate 10 GB of test data.
After the commands are executed, eight .tbl files are generated in the current directory. Each file contains the test data for one table. In this test, the ORDERS.tbl file is used. This file contains 15 million rows of data and is 1.76 GB in size. The following table describes the fields in the file and the data types of the fields.
Field | Type |
O_ORDERKEY | INT |
O_CUSTKEY | INT |
O_ORDERSTATUS | CHAR(1) |
O_TOTALPRICE | DECIMAL(15,2) |
O_ORDERDATE | DATE |
O_ORDERPRIORITY | CHAR(15) |
O_CLERK | CHAR(15) |
O_SHIPPRIORITY | INT |
O_COMMENT | VARCHAR(79) |
Create a test table
HBase
create 'ORDERS', {NAME => 'f', DATA_BLOCK_ENCODING => 'DIFF', COMPRESSION => 'SNAPPY', BLOCKSIZE => '32768'}
MySQL
CREATE TABLE ORDERS ( O_ORDERKEY INTEGER NOT NULL,
O_CUSTKEY INTEGER NOT NULL,
O_ORDERSTATUS CHAR(1) NOT NULL,
O_TOTALPRICE DECIMAL(15,2) NOT NULL,
O_ORDERDATE DATE NOT NULL,
O_ORDERPRIORITY CHAR(15) NOT NULL,
O_CLERK CHAR(15) NOT NULL,
O_SHIPPRIORITY INTEGER NOT NULL,
O_COMMENT VARCHAR(79) NOT NULL);
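The preceding statement uses the MySQL default configuration in which compression is disabled, which is the configuration that is tested in this topic. For reference only, InnoDB table compression (ZLIB-based) could be enabled with a statement similar to the following. This statement is an illustration and is not used in this test:
-- Not used in this test: enable the ZLIB-based COMPRESSED row format for InnoDB.
ALTER TABLE ORDERS ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;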
MongoDB
db.createCollection("ORDERS")
Lindorm
# lindorm-cli
CREATE TABLE ORDERS ( O_ORDERKEY INTEGER NOT NULL,
O_CUSTKEY INTEGER NOT NULL,
O_ORDERSTATUS CHAR(1) NOT NULL,
O_TOTALPRICE DECIMAL(15,2) NOT NULL,
O_ORDERDATE DATE NOT NULL,
O_ORDERPRIORITY CHAR(15) NOT NULL,
O_CLERK CHAR(15) NOT NULL,
O_SHIPPRIORITY INTEGER NOT NULL,
O_COMMENT VARCHAR(79) NOT NULL,
primary key(O_ORDERKEY));
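This topic does not show how the generated data is imported. As one example, the pipe-delimited ORDERS.tbl file could be loaded into the MySQL table with commands similar to the following. The trailing '|' that dbgen appends to each line is stripped first, and local_infile is assumed to be enabled:
# Remove the trailing field separator that dbgen appends to each line.
sed -i 's/|$//' ORDERS.tbl
-- Load the file into MySQL. Requires local_infile to be enabled on the client and server.
LOAD DATA LOCAL INFILE 'ORDERS.tbl' INTO TABLE ORDERS
  FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';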
Data compression result
Database | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Table size | 784 MB | 639 MB | 1.23 GB | 2.10 GB | 1.63 GB | 1.32 GB |
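This topic does not state how the table sizes are measured. One possible way to check the on-disk size of the ORDERS table in each system is shown below. Database names and paths depend on your deployment:
-- MySQL: query the data and index size from information_schema. Filter by table_schema as needed.
SELECT ROUND((data_length + index_length) / 1024 / 1024, 2) AS size_mb
  FROM information_schema.tables WHERE table_name = 'ORDERS';
// MongoDB: check the compressed storage size of the collection, in bytes.
db.ORDERS.stats().storageSize
# HBase: check the size of the table directory in HDFS (default layout).
hdfs dfs -du -h /hbase/data/default/ORDERS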
IoV
In this scenario, a Next Generation Simulation (NGSIM) dataset is used for the data compression test. NGSIM is a data collection project initiated by the Federal Highway Administration of the United States. The project data is widely used in studies of driving behaviors such as car following and lane changing, traffic flow analysis, microscopic traffic model construction, vehicle trajectory prediction, and autonomous driving decision planning. All NGSIM data is collected from the trajectories of actual vehicles on U.S. Route 101.
Prepare data
Download the following dataset file: NGSIM_Data.csv. The file contains 11.85 million rows of data and is 1.54 GB in size. Each row contains 25 columns. For more information about the structure of NGSIM datasets, see NGSIM dataset.
Create a test table
HBase
create 'NGSIM', {NAME => 'f', DATA_BLOCK_ENCODING => 'DIFF', COMPRESSION => 'SNAPPY', BLOCKSIZE => '32768'}
MySQL
CREATE TABLE NGSIM ( ID INTEGER NOT NULL,
Vehicle_ID INTEGER NOT NULL,
Frame_ID INTEGER NOT NULL,
Total_Frames INTEGER NOT NULL,
Global_Time BIGINT NOT NULL,
Local_X DECIMAL(10,3) NOT NULL,
Local_Y DECIMAL(10,3) NOT NULL,
Global_X DECIMAL(15,3) NOT NULL,
Global_Y DECIMAL(15,3) NOT NULL,
v_length DECIMAL(10,3) NOT NULL,
v_Width DECIMAL(10,3) NOT NULL,
v_Class INTEGER NOT NULL,
v_Vel DECIMAL(10,3) NOT NULL,
v_Acc DECIMAL(10,3) NOT NULL,
Lane_ID INTEGER NOT NULL,
O_Zone CHAR(10),
D_Zone CHAR(10),
Int_ID CHAR(10),
Section_ID CHAR(10),
Direction CHAR(10),
Movement CHAR(10),
Preceding INTEGER NOT NULL,
Following INTEGER NOT NULL,
Space_Headway DECIMAL(10,3) NOT NULL,
Time_Headway DECIMAL(10,3) NOT NULL,
Location CHAR(10) NOT NULL,
PRIMARY KEY(ID));
MongoDB
db.createCollection("NGSIM")
Lindorm
# lindorm-cli
CREATE TABLE NGSIM ( ID INTEGER NOT NULL,
Vehicle_ID INTEGER NOT NULL,
Frame_ID INTEGER NOT NULL,
Total_Frames INTEGER NOT NULL,
Global_Time BIGINT NOT NULL,
Local_X DECIMAL(10,3) NOT NULL,
Local_Y DECIMAL(10,3) NOT NULL,
Global_X DECIMAL(15,3) NOT NULL,
Global_Y DECIMAL(15,3) NOT NULL,
v_length DECIMAL(10,3) NOT NULL,
v_Width DECIMAL(10,3) NOT NULL,
v_Class INTEGER NOT NULL,
v_Vel DECIMAL(10,3) NOT NULL,
v_Acc DECIMAL(10,3) NOT NULL,
Lane_ID INTEGER NOT NULL,
O_Zone CHAR(10),
D_Zone CHAR(10),
Int_ID CHAR(10),
Section_ID CHAR(10),
Direction CHAR(10),
Movement CHAR(10),
Preceding INTEGER NOT NULL,
Following INTEGER NOT NULL,
Space_Headway DECIMAL(10,3) NOT NULL,
Time_Headway DECIMAL(10,3) NOT NULL,
Location CHAR(10) NOT NULL,
PRIMARY KEY(ID)) ;
Data compression result
Database | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Table size | 995 MB | 818 MB | 1.72 GB | 2.51 GB | 1.88 GB | 1.50 GB |
Logs
In this scenario, the following dataset of web server logs is used for the data compression test: Zaker, Farzin, 2019, "Online Shopping Store - Web Server Logs", https://doi.org/10.7910/DVN/3QBYB5, Harvard Dataverse, V1.
Prepare data
Download the log data file access.log from the Online Shopping Store - Web Server Logs page. The data file contains 10.36 million rows of data and is 3.51 GB in size. The following example shows a row of data in the log file.
54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
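The log file has no key column. Before the file can be loaded into the two-column (ID, CONTENT) tables that are created below, a sequential ID must be attached to each line. The following is a minimal sketch that assumes the log lines contain no tab characters:
# Prepend the line number as the ID, separated from the log line by a tab.
awk '{ printf "%d\t%s\n", NR, $0 }' access.log > access.tsv
The resulting tab-separated file can then be imported into each database.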
Create a test table
HBase
create 'ACCESS_LOG', {NAME => 'f', DATA_BLOCK_ENCODING => 'DIFF', COMPRESSION => 'SNAPPY', BLOCKSIZE => '32768'}
MySQL
CREATE TABLE ACCESS_LOG ( ID INTEGER NOT NULL,
CONTENT VARCHAR(10000),
PRIMARY KEY(ID));
MongoDB
db.createCollection("ACCESS_LOG")
Lindorm
# lindorm-cli
CREATE TABLE ACCESS_LOG ( ID INTEGER NOT NULL,
CONTENT VARCHAR(10000),
PRIMARY KEY(ID));
Data compression result
Database | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Table size | 646 MB | 387 MB | 737 MB | 3.99 GB | 1.17 GB | 893 MB |
User behaviors
In this scenario, the following dataset obtained from Alibaba Cloud Tianchi is used for the data compression test: Shop Info and User Behavior data from IJCAI-15.
Prepare data
Download the data_format1.zip package from the user behavior dataset page and use the user_log_format1.csv file for the test. This file contains 54.92 million rows of data and is 1.91 GB in size. The following table shows the data structure of the file.
user_id | item_id | cat_id | seller_id | brand_id | time_stamp | action_type |
328862 | 323294 | 833 | 2882 | 2661 | 829 | 0 |
328862 | 844400 | 1271 | 2882 | 2661 | 829 | 0 |
328862 | 575153 | 1271 | 2882 | 2661 | 829 | 0 |
Create a test table
HBase
create 'USER_LOG', {NAME => 'f', DATA_BLOCK_ENCODING => 'DIFF', COMPRESSION => 'SNAPPY', BLOCKSIZE => '32768'}
MySQL
CREATE TABLE USER_LOG ( ID INTEGER NOT NULL,
USER_ID INTEGER NOT NULL,
ITEM_ID INTEGER NOT NULL,
CAT_ID INTEGER NOT NULL,
SELLER_ID INTEGER NOT NULL,
BRAND_ID INTEGER,
TIME_STAMP CHAR(4) NOT NULL,
ACTION_TYPE CHAR(1) NOT NULL,
PRIMARY KEY(ID));
MongoDB
db.createCollection("USER_LOG")
Lindorm
# lindorm-cli
CREATE TABLE USER_LOG ( ID INTEGER NOT NULL,
USER_ID INTEGER NOT NULL,
ITEM_ID INTEGER NOT NULL,
CAT_ID INTEGER NOT NULL,
SELLER_ID INTEGER NOT NULL,
BRAND_ID INTEGER,
TIME_STAMP CHAR(4) NOT NULL,
ACTION_TYPE CHAR(1) NOT NULL,
PRIMARY KEY(ID));
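The CSV file does not contain the ID column that is defined in the preceding tables. As an illustration, the file could be loaded into MySQL while a sequential ID is generated with a user variable. Empty brand_id values may require additional handling:
-- Generate a sequential ID while loading the CSV file. Requires local_infile to be enabled.
SET @row := 0;
LOAD DATA LOCAL INFILE 'user_log_format1.csv' INTO TABLE USER_LOG
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  IGNORE 1 LINES
  (USER_ID, ITEM_ID, CAT_ID, SELLER_ID, BRAND_ID, TIME_STAMP, ACTION_TYPE)
  SET ID = (@row := @row + 1);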
Data compression result
Database | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Table size | 805 MB | 721 MB | 1.48 GB | 2.90 GB | 3.33 GB | 2.74 GB |
Summary
Based on the test results in the preceding scenarios, Lindorm can compress large amounts of data of different types at a high compression ratio. Compared with the other open source databases, Lindorm provides a significantly higher compression ratio even if dictionary-based compression is not enabled. If you enable dictionary-based compression, Lindorm compresses data at a compression ratio that is 1 to 2 times as high as that of open source HBase, 2 to 4 times as high as that of open source MongoDB, and 3 to 10 times as high as that of open source MySQL.
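For example, for the 3.51 GB log dataset, Lindorm with dictionary-based compression stores the data in 387 MB, which is a compression ratio of about 9.3:1. In comparison, open source HBase with SNAPPY stores the data in 737 MB (about 4.9:1), open source MongoDB with zstd stores the data in 893 MB (about 4:1), and open source MySQL stores the data in 3.99 GB (no reduction). In this scenario, the compression ratio of Lindorm is about 1.9 times that of HBase, about 2.3 times that of MongoDB, and more than 10 times that of MySQL.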
The following table shows the results of the test in different scenarios.
Data | Data size | Lindorm (zstd) | Lindorm (dictionary-based compression) | HBase (SNAPPY) | MySQL | MongoDB (SNAPPY) | MongoDB (zstd) |
Order data (TPC-H dataset) | 1.76 GB | 784 MB | 639 MB | 1.23 GB | 2.10 GB | 1.63 GB | 1.32 GB |
IoV data (NGSIM dataset) | 1.54 GB | 995 MB | 818 MB | 1.72 GB | 2.51 GB | 1.88 GB | 1.50 GB |
Log data (server log dataset) | 3.51 GB | 646 MB | 387 MB | 737 MB | 3.99 GB | 1.17 GB | 893 MB |
User behavior (IJCAI-15 dataset) | 1.91 GB | 805 MB | 721 MB | 1.48 GB | 2.90 GB | 3.33 GB | 2.74 GB |