Kudu is a distributed, scalable, column-oriented data storage system that enables fast analytics on rapidly changing data.
Scenarios
Kudu is suitable for the following scenarios:
Near real-time computing
Time series data
Predictive modeling
Coexistence with large volumes of historical data
In most cases, the production environment already holds a large volume of historical data. The historical data can be stored in Hadoop Distributed File System (HDFS), a relational database management system (RDBMS), or Kudu. If you only need to access or query the historical data, you can use Impala to do so without migrating the data to Kudu.
Components
Kudu consists of the following components:
Master servers: manage metadata, including information about tablet servers and tablets. Master servers use the Raft algorithm to achieve high availability (HA).
Tablet servers: store tablets. Each tablet has multiple replicas, and the Raft algorithm keeps the replicas consistent and highly available.
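Because Raft is a majority-based consensus algorithm, the number of replicas determines how many failures a tablet or the master quorum can survive. The following is a minimal sketch of that arithmetic only; the function names are illustrative and are not part of any Kudu API.

```python
def raft_majority(num_replicas: int) -> int:
    """Return the smallest number of replicas that forms a majority."""
    if num_replicas < 1:
        raise ValueError("num_replicas must be at least 1")
    return num_replicas // 2 + 1


def tolerated_failures(num_replicas: int) -> int:
    """Return how many replicas can fail while a majority stays available."""
    return num_replicas - raft_majority(num_replicas)


# Print the majority size and tolerated failures for common replica counts.
for n in (1, 3, 5):
    print(f"replicas={n} majority={raft_majority(n)} "
          f"tolerated_failures={tolerated_failures(n)}")
```

With three replicas, a majority of two must remain reachable, so one failure can be tolerated. An even replica count adds no extra tolerance over the next lower odd count, which is why odd counts are the usual choice.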
Terms
Term | Description |
master server | Manages the metadata of the entire cluster, including information about tablet servers, tables, and tablets. |
tablet server | Stores tablets and serves them to clients. |
column-oriented storage | Kudu is a column-oriented data storage system. Data in the same column is stored in adjacent locations in the underlying storage. |
table | Kudu stores data in tables. A table has a schema and a globally ordered primary key. A table can be split into multiple segments that are called tablets. |
tablet | A contiguous segment of a table. A given tablet is replicated on multiple tablet servers. At any given time, one of these replicas is the leader. |
Raft | A consensus algorithm that is used to ensure high availability of master servers and data consistency among tablet replicas. |
catalog table | The central location for metadata in Kudu. The catalog table stores information about tables and tablets. |
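The terms column-oriented storage, table, and tablet can be illustrated with a small toy model. The sketch below is conceptual only and does not use the Kudu client or reflect Kudu's on-disk format. The sample rows, the column names, and the `split_keys` boundary are hypothetical. It shows a single range split on the primary key, whereas Kudu itself supports both hash and range partitioning.

```python
from bisect import bisect_right

# Hypothetical sample rows: (host_id, metric, value). host_id is the primary key.
rows = [
    (3, "mem", 0.37),
    (1, "cpu", 0.42),
    (4, "cpu", 0.15),
    (2, "cpu", 0.91),
]


def to_columns(rows, names):
    """Transpose row tuples into one list per column (column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def split_into_tablets(rows, split_keys):
    """Assign rows, ordered by primary key, to contiguous key ranges.

    A row lands in tablet i when split_keys[i - 1] <= key < split_keys[i],
    so len(split_keys) + 1 tablets are produced.
    """
    tablets = [[] for _ in range(len(split_keys) + 1)]
    for row in sorted(rows):
        tablets[bisect_right(split_keys, row[0])].append(row)
    return tablets


ordered = sorted(rows)
columns = to_columns(ordered, ["host_id", "metric", "value"])
tablets = split_into_tablets(rows, split_keys=[3])

# Scanning one column touches only that column's values.
print(columns["metric"])
# Each tablet holds a contiguous range of primary keys.
print([[row[0] for row in tablet] for tablet in tablets])
```

An analytic query that reads only `metric` scans one list instead of every full row, which is the main benefit of storing a column's values adjacently.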