By Qinxia
Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统. All rights reserved to the original author.
I have repeatedly mentioned the phrase divide and conquer in the previous articles.
The purpose of divide and conquer is to solve the problem of data that is too large to store or too slow to process. This is the core problem that a basic distributed system has to solve.
In other words, we need to solve the problem of (horizontal) scalability from the perspective of programs and services.
Scalability, the most basic problem of distributed systems, is solved with the general method of partitioning.
Partitions go by other names in different systems, such as block, region, bucket, shard, stage, and task, but they are essentially the same thing.
As we mentioned in the fifth article, dividing the computing logic is equivalent to dividing the data. Therefore, we will focus on data scalability to understand the common partitioning schemes.
Typically, we can divide the data into three types to study: file data, key-value data, and document data.
Specifically, for each type we focus on how the data is partitioned, how partitions are placed on nodes (localization), and how rebalancing works when nodes are added or removed.
File data is the lowest-level and most free-form type of data. Any other type of data must ultimately exist as files at the bottom layer (leaving aside pure in-memory data that is never persisted).
Because of this, file data can only be partitioned at the bottom layer, in a way that has nothing to do with the application.
Therefore, whether it is the various file systems supported by Linux or HDFS in the distributed world, they all divide data into fixed-size blocks.
For a distributed system, you must add metadata to record the correspondence between files and blocks and the correspondence between blocks and machines.
Blocks have no meaning at the application layer. Therefore, the placement (localization) of file data is close to random, and the main consideration is simply available storage space.
This is easy to understand and brings several benefits. For one, rebalancing after nodes are added or removed is easy to implement.
Essentially, you only need to copy the blocks to the target machines and update the mapping in the metadata.
Updating the metadata is very lightweight, but moving the data generates a large amount of IO. You can avoid peak business hours or throttle the movement speed to reduce resource contention with business workloads.
For example, HDFS provides a configurable trigger threshold and an automatic balancer, which makes scaling much easier.
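To make this concrete, here is a minimal Python sketch of fixed-size block partitioning. The names are invented for illustration, not HDFS's real internals: a metadata table maps files to blocks and blocks to nodes, and placement only looks at free space. Rebalancing then amounts to copying blocks and updating the block-to-node map.

```python
import uuid

BLOCK_SIZE = 128 * 1024 * 1024   # fixed block size; 128 MB is HDFS's default

class Metadata:
    """Toy metadata service: remembers file -> blocks and block -> node."""

    def __init__(self, free_space):
        self.free_space = dict(free_space)   # node -> free bytes
        self.file_blocks = {}                # file path -> [block ids]
        self.block_location = {}             # block id -> node

    def add_file(self, path, size):
        blocks = []
        remaining = size
        while remaining > 0:
            block_id = str(uuid.uuid4())
            # placement ignores the content entirely: just pick the node
            # with the most free space
            node = max(self.free_space, key=self.free_space.get)
            self.free_space[node] -= min(remaining, BLOCK_SIZE)
            self.block_location[block_id] = node
            blocks.append(block_id)
            remaining -= BLOCK_SIZE
        self.file_blocks[path] = blocks

meta = Metadata({"node1": 10 * BLOCK_SIZE, "node2": 10 * BLOCK_SIZE})
meta.add_file("/logs/2020-01-01.log", 300 * 1024 * 1024)   # stored as 3 blocks
print(meta.file_blocks)        # file -> blocks
print(meta.block_location)     # block -> node
```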
We deal with this type of data all the time. The biggest difference between key-value data and file data is that the key-value structure carries application-level meaning, so we can break free from the constraints of the lower layer and do many things at the application layer.
We no longer have to partition at the block level. Key-value data revolves around the key, so we partition in units of keys.
The first way is to divide by key range.
For example, data with a mobile phone number as the key can be divided easily this way.
HBase uses this partitioning method.
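As an illustration only (not HBase's actual code), a range partitioner can be a sorted list of split points plus a binary search; keys that fall in the same range end up in the same partition, so range scans stay local:

```python
import bisect

# Three split points define four partitions:
# (-inf, "135"), ["135", "150"), ["150", "170"), ["170", +inf)
SPLIT_POINTS = ["135", "150", "170"]

def range_partition(key: str) -> int:
    """Return the index of the partition whose range contains the key."""
    return bisect.bisect_right(SPLIT_POINTS, key)

print(range_partition("13412345678"))   # 0
print(range_partition("15612345678"))   # 2
print(range_partition("18912345678"))   # 3
```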
However, this can easily lead to uneven data distribution. For example, number segments like 135 will have a lot of data, while number segments like 101 may have no data at all.
The root cause is that the partitioning key has business meaning, and the business itself may be unbalanced.
The fix, then, is to make the partition key irrelevant to the business.
In some scenarios, this can be achieved with a simple transformation.
For example, in the mobile phone number case, reversing the number and using the result as the partitioning key can resolve the data imbalance.
If you want to scatter data in more general scenarios, the most common approach is hashing: hash the key first, and then partition by ranges of the hash value.
Broadly speaking, reversing a mobile phone number can be regarded as a crude hash function. More commonly used standard hash algorithms include MD5 and SHA.
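A small sketch of this idea, using MD5 purely for illustration: the key is hashed first, and it is the hash value, not the raw key, that is split into equal ranges, so adjacent business keys spread out evenly.

```python
import hashlib

NUM_PARTITIONS = 16

def hash_partition(key: str) -> int:
    # MD5 gives a uniformly distributed 128-bit value; scaling it down to
    # NUM_PARTITIONS buckets is equivalent to range-partitioning the hash value
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest * NUM_PARTITIONS >> 128

phones = ["13512340001", "13512340002", "13512340003"]
print([hash_partition(p) for p in phones])   # adjacent keys are scattered across partitions
```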
Hashing solves the problem of uneven distribution, but it also loses one of the benefits of range partitioning: querying by range.
After hashing, a range query has to be scattered across multiple partitions, which significantly hurts query performance, and the data loses its ordering.
As a compromise, there is the compound primary key. With a key like key1_key2, only key1 is used for hash partitioning, while key2 keeps the data ordered within the partition to satisfy range queries.
For example, consider a forum scenario. You can design a primary key such as (user_id, timestamp) to query all posts of a user on a certain day: a range query like scan(user_id, start_timestamp, end_timestamp) then easily returns the results.
Cassandra uses this partitioning method.
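Here is a rough sketch of the compound-key idea applied to the forum example above (invented names, not Cassandra's implementation): the partition is chosen only from user_id, while rows inside a partition are kept sorted by timestamp, so the scan touches exactly one partition.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8

def partition_of(user_id: str) -> int:
    # only the first part of the compound key decides the partition
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

# partition -> list of (user_id, timestamp, post), kept sorted so that a
# (user_id, start_ts, end_ts) scan is a cheap in-partition range read
partitions = defaultdict(list)

def write(user_id, timestamp, post):
    rows = partitions[partition_of(user_id)]
    rows.append((user_id, timestamp, post))
    rows.sort()   # real stores keep data sorted (e.g. in SSTables) instead of re-sorting

def scan(user_id, start_ts, end_ts):
    rows = partitions[partition_of(user_id)]   # exactly one partition is read
    return [r for r in rows if r[0] == user_id and start_ts <= r[1] <= end_ts]

write("alice", 1000, "hello")
write("alice", 2000, "world")
write("bob", 1500, "hi")
print(scan("alice", 900, 1500))   # [('alice', 1000, 'hello')]
```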
Since the partitioning of key-value data carries business meaning, you can no longer consider only storage space, as with file data.
The localization strategy must not break the rule that data in the same range stays in the same partition.
Typically, there are several options:
First, hash mod N determines where the data is placed, where N is the number of nodes.
The benefit of this approach is that there is no metadata management cost, since the mapping is no longer stored as data but computed from the formula.
The disadvantage is that it is very inflexible: once the number of nodes changes, a large amount, or even all, of the data may have to be moved to rebalance.
The root cause is that the formula contains the variable N, so whenever the number of nodes changes, placement changes and a rebalance is forced.
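A quick sketch with made-up keys shows the cost: going from 4 to 5 nodes under hash mod N relocates roughly four out of five keys.

```python
import hashlib

def node_of(key: str, n: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user-{i}" for i in range(10000)]
moved = sum(node_of(k, 4) != node_of(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when N goes from 4 to 5")   # about 80%
```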
The fix is to remove the variable, which leads to the second method: a fixed number of partitions. The number of partitions stays the same regardless of whether nodes are added or removed.
This way, when nodes are added or removed, only a small number of partitions need to be moved. The downside is the overhead of metadata management.
However, metadata is usually not large, so Elasticsearch and Couchbase have adopted this scheme.
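A minimal sketch of the fixed-partition scheme (illustrative only, not how Elasticsearch or Couchbase implement it): the key-to-partition mapping is a constant formula, and the only metadata is the small partition-to-node table; adding a node just reassigns some partitions in that table.

```python
import hashlib

NUM_PARTITIONS = 64   # fixed once and never changed by cluster size

def partition_of(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

# the only metadata to manage: partition -> node (3 nodes to start with)
assignment = {p: f"node{p % 3}" for p in range(NUM_PARTITIONS)}

def add_node(new_node: str, current_nodes: int):
    """Hand a share of partitions to the new node; keys never change partitions."""
    share = NUM_PARTITIONS // (current_nodes + 1)
    for p in range(share):
        assignment[p] = new_node   # only these partitions' data is moved

add_node("node3", current_nodes=3)
key = "user-42"
print(partition_of(key), "->", assignment[partition_of(key)])
```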
There is another problem with a fixed number of partitions: the number has to be chosen appropriately, or you still end up redistributing the data.
How much is appropriate?
Set it to 100,000 at the beginning, and you are dragged down by excessive metadata management and synchronization costs from day one.
Set it to 100, and a year later, when the data volume has grown tenfold, it is no longer enough.
There is no standard answer; it depends on the business scenario, and even the same scenario may need a different number over time.
If you can't find the right value, what should you do?
The third method is a dynamic number of partitions.
When nodes are added, existing partitions are split and some of the pieces are moved to the new nodes. When nodes are removed, partitions are merged and moved to the remaining nodes.
Data movement is limited to the affected partitions, so rebalancing is nothing to be afraid of.
However, there are also some problems. For example, a table usually starts with only one partition, so early traffic concentrates on a single node (which is why pre-splitting is often used), and the splits and merges themselves bring extra overhead.
HBase uses this method.
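Below is an illustrative sketch of the split half of dynamic partitioning (not HBase's real region-split code): a partition owns a sorted key range, and once it grows past a threshold it splits at its middle key; only the split-off half needs to move, and the split point is recorded in the metadata.

```python
MAX_KEYS_PER_PARTITION = 4   # toy threshold; real systems usually split by size in bytes

class Partition:
    def __init__(self, node):
        self.node = node
        self.rows = {}                      # key -> value

    def put(self, key, value):
        self.rows[key] = value
        if len(self.rows) > MAX_KEYS_PER_PARTITION:
            return self.split()
        return None

    def split(self):
        keys = sorted(self.rows)
        mid = keys[len(keys) // 2]
        sibling = Partition(self.node)      # rebalance may later move it to another node
        for k in keys:
            if k >= mid:
                sibling.rows[k] = self.rows.pop(k)
        return mid, sibling                 # the split point goes into the metadata

p = Partition("node1")
for i in range(5):
    result = p.put(f"row-{i:03d}", i)
    if result:
        split_key, new_partition = result
        print("split at", split_key, "->", sorted(new_partition.rows))
```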
Although document data does not have a primary key in the business sense, it usually has a unique internal doc_id, just as we often have an auto-increment id as the primary key in a relational database.
As such, the document data becomes a key-value structure, such as {doc_id: doc}, and you can reuse the partitioning methods described above for key-value data.
In practice, the mod-style method is the more common choice; Elasticsearch, for example, routes each document by hashing its id and taking it modulo the (fixed) number of shards.
Similar to relational databases, document data is also queried through secondary indexes (searching, for example).
Data with the same secondary index value may then be spread across different partitions, so queries can only take a map-and-reduce style approach.
This is unfriendly to read-heavy scenarios, because each query has to be broadcast to all partitions.
Therefore, one technique is to use the secondary index value as a routing key when reading and writing data; this localization optimization in turn affects partitioning.
This way, data with the same secondary index value is written to a fixed partition, which solves the read amplification problem.
In scenarios with multiple secondary indexes, you may have to write multiple copies of the data, one per routing key.
As for the secondary indexes themselves, there are two implementations, the document-partitioned (local) index and the term-partitioned (global) index, which we will not expand on here.
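To make the two read paths concrete, here is a hedged sketch with invented names (not Elasticsearch's API): without routing, a query on a secondary index is broadcast to every partition and the results are merged; when the index value is used as the routing key at write time, the same query touches exactly one partition.

```python
import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]   # each partition holds a list of docs

def partition_of(routing: str) -> int:
    return int(hashlib.md5(routing.encode()).hexdigest(), 16) % NUM_PARTITIONS

def write(doc, routing=None):
    # default routing is the doc id; a secondary-index value can be used instead
    partitions[partition_of(routing or doc["id"])].append(doc)

def query_scatter(field, value):
    # no routing key: broadcast to all partitions and merge (map/reduce style)
    return [d for part in partitions for d in part if d.get(field) == value]

def query_routed(field, value):
    # routing by the index value: read exactly one partition
    return [d for d in partitions[partition_of(value)] if d.get(field) == value]

# route by the "tag" field so that all docs with the same tag share a partition
for i in range(8):
    tag = "python" if i % 2 else "java"
    write({"id": f"doc-{i}", "tag": tag}, routing=tag)

print(len(query_routed("tag", "python")))    # 4, served by a single partition
print(len(query_scatter("tag", "python")))   # 4, but every partition was touched
```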
This article ties the previous ones together: they have essentially been solving one of the core problems of distributed systems, the scalability problem, and the solution is partitioning.
The next article will discuss another core issue of distributed systems: availability.
I have been talking about the advantages of a distributed system. Now, it is time to discuss its problems.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a solid grasp of the core of distributed systems in a storytelling way. Stay tuned for the next one!