By Zhaofeng Zhou (Muluo)
InfluxDB ranks first among time series databases on DB-Engines, and the position is well deserved. In terms of functional richness, usability, and underlying implementation, it has many highlights and is worth an in-depth analysis.
First, I'll briefly summarize several important attributes:
The following sections mainly analyze the basic concepts, the TSM storage engine, continuous queries and TimeSeries indexes in detail.
First, let's learn some basic concepts in InfluxDB through a specific example:
INSERT machine_metric,cluster=Cluster-A,hostname=host-a cpu=10 1501554197019201823
This is a command-line statement that writes one data point into InfluxDB. The components of this data point are as follows:
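- measurement: machine_metric, the name of the metric, comparable to a table name.
- tags: cluster=Cluster-A and hostname=host-a, key-value pairs that describe the data source.
- field: cpu=10, the measured value itself.
- timestamp: 1501554197019201823, the time of the data point in nanoseconds.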
Finally, the data in each measurement is logically organized into one large data table.
When querying, InfluxDB supports criteria queries on any dimension of a measurement; you can filter on any tag or field. Based on the data written above, the following queries can be constructed:
SELECT * FROM "machine_metric" WHERE time > now() - 1h;
SELECT * FROM "machine_metric" WHERE "cluster" = "Cluster-A" AND time > now() - 1h;
SELECT * FROM "machine_metric" WHERE "cluster" = "Cluster-A" AND cpu > 5 AND time > now() - 1h;
From the perspective of the data model and the query syntax, tags and fields look no different. Semantically, tags describe the measurement while fields carry the values. In the internal implementation, tags are fully indexed while fields are not, so a criteria query on tags is much more efficient than one on fields.
The bottom-layer storage engine of InfluxDB has evolved from LevelDB to BoltDB, and then to the self-developed TSM. The reasoning behind the whole selection and migration process can be found in the documents on the official website, and it is well worth learning from: the considerations behind a technology selection and migration are always more instructive than a simple list of a product's attributes.
Let me briefly summarize the entire process of selecting and migrating the storage engine. The first stage was LevelDB. The main reason for choosing LevelDB is that its underlying data structure is an LSM tree, which is very write-friendly, provides high write throughput, and matches the characteristics of time series data well. In LevelDB, data is stored as KeyValue pairs sorted by Key. The Key used by InfluxDB is a combination of SeriesKey + Timestamp, so the data of the same SeriesKey is sorted by Timestamp, which enables very efficient scans over a time range.
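As a minimal sketch of this key layout (the helper function and the exact series-key format below are illustrative, not InfluxDB's actual code), encoding the timestamp big-endian after the SeriesKey makes byte-wise key comparison order the points of one series chronologically:

package main

import (
    "encoding/binary"
    "fmt"
)

// makeKey appends the timestamp in big-endian order after the SeriesKey, so
// that LevelDB's byte-wise key ordering sorts points of the same series by
// time. Assumes non-negative nanosecond timestamps, as InfluxDB uses.
func makeKey(seriesKey string, ts int64) []byte {
    key := make([]byte, 0, len(seriesKey)+8)
    key = append(key, seriesKey...)
    var buf [8]byte
    binary.BigEndian.PutUint64(buf[:], uint64(ts))
    return append(key, buf[:]...)
}

func main() {
    k := makeKey("machine_metric,cluster=Cluster-A,hostname=host-a#cpu", 1501554197019201823)
    fmt.Printf("%x\n", k)
}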
However, one of the biggest problems of LevelDB for InfluxDB is the automatic deletion of historical data (Retention Policy). In time series scenarios, automatic deletion usually means deleting large, contiguous blocks of historical data. LevelDB supports neither Range Delete nor TTL, so data can only be deleted one key at a time, which creates a large amount of deletion traffic. Moreover, in an LSM structure, physical deletion is not instantaneous and only takes effect during compaction. The ways various TSDBs delete data can be broadly divided into two categories:
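- Data partitioning: data is split into partitions by time range, and an expired partition is dropped as a whole, which is a very cheap operation.
- TTL: each data entry carries a time-to-live, and the storage engine expires individual keys asynchronously in the background.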
InfluxDB adopts the first policy: data is divided into different shards by a 7-day cycle, and each shard is an independent database instance. As the running time increases, the number of shards grows. Because each shard is an independent database instance with its own LevelDB storage engine underneath, each one keeps a fairly large number of files open, and as shards accumulate, the number of file handles opened by the process soon reaches the system's upper limit. LevelDB's level compaction policy is one of the reasons for the large number of files; in fact, level compaction is not well suited to time series writes, though InfluxDB does not explain why.
Because of significant customer feedback about excessive file handles, InfluxDB chose BoltDB to replace LevelDB as the new storage engine. The underlying data structure of BoltDB is an mmap'ed B+ tree. The reasons for choosing it included: 1. an API with the same semantics as LevelDB; 2. a pure Go implementation, easy to integrate and cross-platform; 3. a single file per database, which solves the problem of excessive file-handle consumption and was the most important reason for the choice. However, the B+ tree of BoltDB is not as write-friendly as an LSM tree, because a B+ tree generates a large number of random writes. As a result, InfluxDB quickly ran into IOPS problems after switching to BoltDB: once a database grows to a few GBs, it often hits an IOPS bottleneck that greatly hurts write throughput. InfluxDB subsequently adopted some write optimizations, such as adding a WAL layer in front of BoltDB so that incoming data is first written sequentially to the WAL, but the eventual writes into BoltDB still consumed a large amount of IOPS.
After several minor versions on BoltDB, InfluxDB finally decided to develop TSM in-house. The first design goal of TSM was to solve the file-handle problem of LevelDB, and the second was to solve the write performance problem of BoltDB. TSM is short for Time-Structured Merge Tree. Its idea is similar to LSM, but with some special optimizations based on the characteristics of time series data. The following are the important components of TSM:
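- Write Ahead Log (WAL): newly written data is first appended to the WAL to make it durable.
- Cache: the in-memory representation of the data in the WAL; queries read it and merge the result with the data in TSM files.
- TSM Files: read-only, compressed, columnar data files, playing a role similar to SSTables in LSM.
- Compactor: a background process that merges and optimizes TSM files, turning write-optimized data into a read-optimized layout.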
InfluxDB recommends two approaches for the pre-aggregation and precision reduction of data: one is Kapacitor, the stream data processing engine of the InfluxData stack, and the other is continuous queries, which are built into InfluxDB itself.
CREATE CONTINUOUS QUERY "mean_cpu" ON "machine_metric_db"
BEGIN
  SELECT mean("cpu") INTO "average_machine_cpu_5m" FROM "machine_metric" GROUP BY time(5m), cluster, hostname
END
The above is a simple CQ statement that configures a continuous query. It makes InfluxDB start a scheduled task every 5 minutes that aggregates all the data in the measurement "machine_metric" by the dimensions cluster + hostname, computes the average of the field "cpu", and writes the result into the new measurement "average_machine_cpu_5m".
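The downsampled data can then be queried from the new measurement like any other one. Note that the output field of mean("cpu") is named "mean" by default:

SELECT "mean" FROM "average_machine_cpu_5m" WHERE time > now() - 1d;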
The continuous queries of InfluxDB are similar to the auto-rollup function of KairosDB: both are scheduled on a single node, and the aggregation is a delayed batch computation rather than real-time stream computation, putting significant read pressure on the storage layer while it runs.
In addition to supporting the storage and computation of time series data, time series databases also need to be able to provide multi-dimensional queries. InfluxDB indexes TimeSeries for faster multi-dimensional queries. Regarding data and indexes, InfluxDB is described as follows:
InfluxDB actually looks like two databases in one: a time series data store and an inverted index for the measurement, tag, and field metadata.
Prior to InfluxDB 1.3, the TimeSeries index (hereinafter referred to as TSI) supported only a memory-based implementation, that is, all TimeSeries indexes were stored in memory, which is fast but also brings many problems. The latest InfluxDB 1.3, however, offers an alternative indexing method: the new index is stored on disk, which is less efficient than the memory-based index but solves many of its problems.
// Measurement represents a collection of time series in a database. It also
// contains in memory structures for indexing tags. Exported functions are
// goroutine safe while un-exported functions assume the caller will use the
// appropriate locks.
type Measurement struct {
    database string
    Name     string `json:"name,omitempty"`
    name     []byte // cached version as []byte

    mu         sync.RWMutex
    fieldNames map[string]struct{}

    // in-memory index fields
    seriesByID          map[uint64]*Series              // lookup table for series by their id
    seriesByTagKeyValue map[string]map[string]SeriesIDs // map from tag key to value to sorted set of series ids

    // lazily created sorted series IDs
    sortedSeriesIDs SeriesIDs // sorted list of series IDs in this measurement
}

// Series belong to a Measurement and represent unique time series in a database.
type Series struct {
    mu          sync.RWMutex
    Key         string
    tags        models.Tags
    ID          uint64
    measurement *Measurement
    shardIDs    map[uint64]struct{} // shards that have this series defined
}
The above is the definition of the memory-based index data structures in the source code of InfluxDB 1.3. The index mainly consists of two important data structures:
Series: corresponds to one TimeSeries. It stores basic attributes of the TimeSeries and the shards it belongs to.
Measurement: each measurement corresponds to one Measurement structure in memory, which contains several internal indexes to speed up queries.
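As a rough illustration, assuming the Measurement and SeriesIDs types quoted above, a hypothetical helper (not InfluxDB's actual query path) shows how a tag equality filter such as "cluster" = 'Cluster-A' resolves to series IDs with two map lookups:

// seriesIDsForTag shows how the in-memory inverted index answers a tag
// equality filter: one lookup by tag key, one by tag value, yielding a
// sorted set of matching series IDs.
func (m *Measurement) seriesIDsForTag(key, value string) SeriesIDs {
    m.mu.RLock()
    defer m.mu.RUnlock()
    valuesToSeries, ok := m.seriesByTagKeyValue[key]
    if !ok {
        return nil
    }
    return valuesToSeries[value] // nil if the tag value does not exist
}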
The advantage of the fully memory-based index structure is that it provides highly efficient multi-dimensional queries, but it also has some problems:
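- The entire index must reside in memory, so memory consumption grows with the number of TimeSeries; high-cardinality tags can exhaust memory.
- The index is not persisted, so after a restart it has to be rebuilt by scanning all the underlying data, which makes startup slow.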
To address the problems of the fully memory-based index, the latest InfluxDB 1.3 provides an additional index implementation. Thanks to the good extensibility of the code design, both the index module and the storage engine module are pluggable, and you can choose which index to use in the configuration.
InfluxDB implements a special storage engine to store the index data, and its structure is also similar to LSM. It is a disk-based index structure; for details, see the design documents.
The index data is first written to the Write-Ahead-Log (WAL). The data in the WAL is organized as LogEntries; each LogEntry corresponds to one TimeSeries and contains the measurement, tags, and a checksum. After a successful write to the WAL, the data is inserted into a memory-based index structure. When the WAL accumulates to a certain size, the LogFile is flushed into an IndexFile. The logical structure of the IndexFile is consistent with that of the memory-based index, expressing the maps from measurement to tag key, tag key to tag value, and tag value to TimeSeries. InfluxDB accesses the file through mmap, and a HashIndex is stored for each map in the file to speed up queries.
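A schematic sketch of such a LogEntry follows; the field names are illustrative, mirroring only the description above, not the actual on-disk format:

// LogEntry is a schematic view of one record in the index WAL.
// Field names are illustrative, not InfluxDB's actual layout.
type LogEntry struct {
    Flag     byte        // operation type, e.g. adding or deleting a series
    Name     []byte      // measurement name
    Tags     models.Tags // tag key-value pairs identifying the TimeSeries
    Checksum uint32      // integrity check over the entry
}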
When IndexFiles accumulate to a specified amount, InfluxDB also provides a compaction mechanism to merge multiple IndexFiles into one, saving storage space and speeding up queries.
All InfluxDB components are self-developed. The advantage of self-development is that each component can be designed around the characteristics of time series data, maximizing performance. The community is very active, but major functional changes happen often, such as changes to the storage format or to the index implementation, which is rather inconvenient for users. Overall, I am optimistic about the development of InfluxDB; unfortunately, its cluster version is not open source.