By Taosu
Tables are split by user ID, with each table capped at 3 million rows. Databases are sharded by business scenario and region, with each database handling no more than 2,000 concurrent operations.
Sharding-jdbc: a client-layer solution. Its advantages are that there is nothing extra to deploy, O&M cost is low, and there is no proxy layer forwarding requests a second time, so performance is high. The drawback is that every system must couple in the Sharding-jdbc dependency, which makes upgrades troublesome.
Mycat: a proxy-layer solution. Its disadvantages are that you must deploy and operate a set of middleware yourself, which raises O&M cost. The advantage is that it is transparent to each project: when an upgrade is needed, it is done once in the middleware.
Horizontal splitting: one table's data is distributed across multiple databases to share high concurrency and speed up queries. Common sharding keys (see the routing sketch after this list):
• ID: ensures that the business can operate within the same database when multiple associated tables are involved.
• Range: convenient for expansion and data statistics.
• Hash: makes the data distribution more even.
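As a minimal illustration of hash-based routing, the sketch below maps a user ID to a database and table index. The shard counts and naming scheme are assumptions for the example, not prescribed by the text.

// Minimal sketch of hash-based shard routing (shard counts are assumptions).
public final class ShardRouter {
    private static final int DB_COUNT = 32;     // number of database shards
    private static final int TABLE_COUNT = 32;  // tables per database shard

    // Returns e.g. "db_07.user_21" for a given user ID.
    public static String route(long userId) {
        long hash = Long.hashCode(userId) & Long.MAX_VALUE; // force non-negative
        int dbIndex = (int) (hash % DB_COUNT);
        int tableIndex = (int) ((hash / DB_COUNT) % TABLE_COUNT);
        return String.format("db_%02d.user_%02d", dbIndex, tableIndex);
    }
}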
Vertical splitting: one table is split into multiple tables, and cold data can be moved out into separate databases.
When writes are not the bottleneck, prioritize table sharding over database sharding.
• Data in different database shards can no longer be queried directly through a single database, and the deep paging problem appears.
• The more database shards there are, the more likely problems become and the higher the maintenance cost.
• Cross-database transactions cannot be guaranteed after sharding; you can only rely on other middleware to achieve eventual consistency.
When sharding databases, the requirements of the core business scenarios should be considered first:
Four questions:
• Rich query: how do you satisfy queries that span database shards after sharding databases and tables? One answer is an ElasticSearch (ES) wide table.
You can also implement multi-dimensional query capabilities through a sharding gateway or a dedicated sharding service, but overall performance is poor and normal write requests are affected. The most common way in the industry to serve multi-dimensional real-time queries is ES.
• Data skew: data should be distributed evenly when it is sharded into tables across the database shards.
• Distributed transactions: these arise from modifications spanning multiple databases and from write operations spread across multiple microservices.
• Deep paging: query by cursor, or sort by ID and pass the maximum ID of the previous query into each subsequent query.
• Everywhere the online system writes to the database (insert, delete, or update), the operation should be performed on the new database in addition to the old one.
• After this double-write version is deployed, run a program that reads data from the old database and writes it into the new one; when writing, compare update times so that newer rows are not overwritten (a backfill sketch follows this list).
• Loop until the two databases hold identical data, then deploy the final sharded version of the code.
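A minimal sketch of the backfill step, under the assumption that every row carries an update timestamp; Dao and Row are hypothetical stand-ins, not a real API.

import java.time.Instant;
import java.util.List;

// Hypothetical minimal abstractions over the old and new databases.
interface Row { long id(); Instant updatedAt(); }
interface Dao {
    List<? extends Row> scanAll();
    Row findById(long id);
    void upsert(Row row);
}

class Backfill {
    // Copy rows old -> new, but never overwrite data that the double-write
    // path has already written with a newer update time.
    void run(Dao oldDb, Dao newDb) {
        for (Row oldRow : oldDb.scanAll()) {
            Row existing = newDb.findById(oldRow.id());
            if (existing == null || existing.updatedAt().isBefore(oldRow.updatedAt())) {
                newDb.upsert(oldRow);
            }
        }
    }
}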
Suppose a customer has 100 million users, with 100,000 write concurrency, 1 million read concurrency, and 6 billion rows of data.
Design for the extreme case: 32 databases with 32 to 64 tables each, i.e. roughly 1,000 to 2,000 tables in total.
• 30,000 concurrent writes can be supported, and combined with MQ this reaches a write speed of 100,000 per second.
• With read/write splitting, 60,000 concurrent reads are supported, and distributed caching raises read capacity to the required 1 million per second.
• At 3 million rows per table, 2,000 tables can hold up to 6 billion rows (a quick arithmetic check follows this list).
• Thirty-two user tables can support hundreds of millions of users, and the capacity only needs to be expanded once later on.
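As a back-of-the-envelope check on the figures above (the numbers come from the text; the code is only a convenience):

// Quick sanity check of the capacity figures quoted above.
public class CapacityCheck {
    public static void main(String[] args) {
        int databases = 32;
        int tablesPerDb = 64;               // upper bound of the 32-64 range
        long rowsPerTable = 3_000_000L;     // per-table cap
        long tables = (long) databases * tablesPerDb;  // 2,048 tables
        long rows = tables * rowsPerTable;             // ~6.1 billion rows
        System.out.printf("tables=%d, max rows=%,d%n", tables, rows);
    }
}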
• You can use Redis (for example, its atomic INCR command) to generate IDs.
• If the concurrency is not high, you can run a separate service that hands out auto-increment IDs.
• Set different database auto-increment steps (in MySQL, auto_increment_increment and auto_increment_offset) to support horizontal scaling.
• UUIDs are suitable for file names and similar identifiers, but not for primary keys.
• The Snowflake algorithm combines a 41-bit millisecond timestamp, 10 machine bits, and a 12-bit sequence number that auto-increments within a millisecond (a sketch follows below).
The 10 bits reserved for the machine ID can be configured based on your business scenario.
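A minimal Snowflake-style generator, following the 41/10/12 bit layout described above. The custom epoch and machine ID are assumptions, and production code would also need clock-rollback handling.

// Minimal Snowflake sketch: 41-bit timestamp | 10-bit machine ID | 12-bit sequence.
public class Snowflake {
    private static final long EPOCH = 1640995200000L; // 2022-01-01, an arbitrary custom epoch
    private final long machineId;                     // 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;                       // 0..4095 within one millisecond

    public Snowflake(long machineId) {
        if (machineId < 0 || machineId > 1023) throw new IllegalArgumentException("machineId");
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 4095;   // wrap the sequence within this millisecond
            if (sequence == 0) {                // sequence exhausted: busy-wait for the next ms
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << 22) | (machineId << 12) | sequence;
    }
}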
We did once handle an online bug caused by primary-secondary synchronization latency; it was a minor production incident.
Here is what happened. Someone wrote logic that first inserted a row, then read it back, and then updated it. During peak hours in production, write concurrency reached 2,000 per second, and the primary-secondary replication lag was around tens of milliseconds. At peak times we kept finding rows whose status should have been updated but never was. Users complained to customer service, and customer service escalated the feedback to us.
Using the MySQL command show slave status and checking the Seconds_Behind_Master field, we could see that the secondary database was lagging behind the primary by a few milliseconds.
Usually, if the primary-secondary latency is severe, we use the following solutions:
• Shard the database: after splitting into multiple primary databases, the write concurrency on each primary drops several-fold, and the primary-secondary latency becomes negligible.
• Rewrite the code: whoever writes the code should keep in mind that a row inserted just now may not be found if it is queried immediately afterwards.
• If you really must insert a row, read it back, and act on it immediately, connect directly to the primary database for that read, or delay the query (a sketch follows this list). Primary-secondary replication latency generally does not exceed 50 milliseconds.
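A minimal "read your own write from the primary" sketch, assuming two DataSources wired by hand; the table and datasource names are illustrative.

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// After an insert, read the row back from the primary rather than a replica,
// so replication lag cannot hide the freshly written data.
public class ReadYourOwnWrite {
    private final DataSource primary;   // primary (writer) datasource
    private final DataSource replica;   // read replica, used for ordinary reads

    public ReadYourOwnWrite(DataSource primary, DataSource replica) {
        this.primary = primary;
        this.replica = replica;
    }

    public String insertThenRead(long id, String status) throws Exception {
        try (Connection c = primary.getConnection();
             PreparedStatement ps = c.prepareStatement(
                     "INSERT INTO orders (id, status) VALUES (?, ?)")) {
            ps.setLong(1, id);
            ps.setString(2, status);
            ps.executeUpdate();
        }
        // The immediate read-back also goes to the primary.
        try (Connection c = primary.getConnection();
             PreparedStatement ps = c.prepareStatement(
                     "SELECT status FROM orders WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}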
We have a table of online traffic records. Because the data volume was large, we sharded it across databases and tables. Problems occurred frequently in the early days after sharding, and the typical case was deep paging in traffic record queries. Using tools such as MAT and jstack, we traced the cause to references held internally by sharding-jdbc.
The traffic record data is stored in two databases. If no sharding key is supplied, a query is broadcast to all databases. For example, for a statement with limit 10 offset 1000, even though only 10 records are needed in the final result, the database middleware has to pull (1000 + 10) × 2 = 2020 records to complete the computation. If the offset is very large, memory usage skyrockets. Although sharding-jdbc applies a merge algorithm with some optimizations, deep paging still causes memory and performance problems in real-world scenarios.
Merging and aggregating on intermediate nodes is very common in distributed frameworks. ES, for example, has similar data-fetching logic, and unrestricted deep paging causes memory problems in ES as well.
Method 1 (global merge):
(1) Rewrite order by time offset X limit Y as order by time offset 0 limit X+Y and send it to every shard.
(2) The service layer sorts the N × (X + Y) returned rows in memory and then takes the Y records after offset X.
With this method, performance degrades as the page number grows.
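A minimal sketch of the service-layer merge step; Record and the surrounding names are illustrative, and the per-shard queries (each rewritten to offset 0 limit X+Y) are assumed to have run already.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Service-layer merge for "offset 0 limit X+Y" results gathered from N shards.
class GlobalMergePager {
    record Record(long id, long time) {}

    List<Record> page(List<List<Record>> shardResults, int offset, int limit) {
        List<Record> all = new ArrayList<>();
        for (List<Record> rows : shardResults) {
            all.addAll(rows);                      // each shard contributed up to offset+limit rows
        }
        all.sort(Comparator.comparingLong(Record::time));
        int from = Math.min(offset, all.size());
        int to = Math.min(offset + limit, all.size());
        return all.subList(from, to);              // the true global page
    }
}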
Method 2 (continuous paging, a business compromise):
(1) Fetch the first page in the normal way and record time_max, the largest time value on that page.
(2) For every subsequent page, rewrite order by time offset X limit Y as where time > $time_max order by time limit Y.
In this way, only one page of data is returned at a time, so performance stays constant.
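The cursor-style rewrite as two concrete statements, assuming a traffic_record table with an indexed time column (names are illustrative):

// Cursor-style paging: carry the last seen time forward instead of using OFFSET.
String firstPage = "SELECT id, time FROM traffic_record ORDER BY time LIMIT 10";
String nextPage  = "SELECT id, time FROM traffic_record WHERE time > ? ORDER BY time LIMIT 10";
// Bind the previous page's time_max to the placeholder for each subsequent page.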
Method 3 (secondary query):
(1) Rewrite order by time offset X limit Y as order by time offset X/N limit Y and send it to each of the N shards.
(2) Among all the returned records, find the minimum time value, time_min.
(3) Issue a secondary query to each shard: between $time_min and $time_i_max, where time_i_max is the largest time that shard returned in step (1).
(4) The secondary results reveal the offset of time_min within each shard; treating time_min as a virtual global starting point, summing the per-shard offsets gives the global offset of time_min.
(5) Once the global offset of time_min is known, the true global offset X limit Y page follows naturally from the merged secondary results.
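A small worked example under assumed numbers may help. With N = 2 shards and a query order by time offset 1000 limit 10, each shard is first queried with offset 500 limit 10. Suppose the smallest time across both result sets, time_min, came from shard A; by construction exactly 500 rows precede it on A. The secondary between query on shard B shows, say, 13 rows lying between time_min and B's first returned row, so time_min sits at local offset 500 − 13 = 487 on B. The global offset of time_min is therefore 500 + 487 = 987, and the desired page starts 1000 − 987 = 13 rows after time_min in the merged secondary results.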
Before database and table sharding, there was an SQL statement that queried a user by username:
select * from user where name = "xxx" and community="other";
A colleague modified this SQL statement as follows to splice the conditions dynamically. His intention was to drop a query condition whenever name or community was passed in empty. This pattern is also very common in MyBatis configuration files, and in most cases it causes no problem because the result set stays bounded. However, as the system ran, the user table accumulated more and more records, and when both name and community were passed in empty, something went wrong:
select * from user where 1=1
All records in the table were queried and loaded into JVM memory, and because there were so many rows, memory was simply blown up. Out-of-memory errors from this cause are very common; importing Excel files, for example, triggers them in the same way.
The common solutions are to force pagination or to validate that the required parameters are present, as in the sketch below.
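A minimal guard, assuming a MyBatis-style mapper; userMapper and findUsers are hypothetical names.

// Reject unbounded queries and always paginate.
public List<User> findUsers(String name, String community, int page, int size) {
    if ((name == null || name.isEmpty()) && (community == null || community.isEmpty())) {
        throw new IllegalArgumentException("at least one filter is required");
    }
    int offset = (page - 1) * size;   // never return the whole table
    return userMapper.findUsers(name, community, offset, size);   // hypothetical mapper call
}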
At present, many projects adopt a frontend-backend separated architecture, so controller methods generally rely on the @ResponseBody annotation to serialize query results into JSON. This consumes a lot of memory when the data set is large: if the result set occupies 10 MB before serialization, 20 MB or more may be used while it is converted to JSON.
Therefore, it is necessary to keep the result set compact, which is one reason the Data Transfer Object (DTO) exists. In an Internet environment, highly concurrent requests for small result sets are not worrying, but time-consuming requests for large result sets are frightening.
The service layer handles concrete business logic to meet functional requirements. A service may be called by several controller-layer methods, and it may combine the query results of several DAOs for computation and assembly.
int getUserSize() {
    // Loads the entire user table into memory just to count the rows.
    List<User> users = dao.getAllUser();
    return null == users ? 0 : users.size();
}
This ticking time bomb was found in a code review; code like this only exposes the problem once the data volume reaches a certain level.
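A safer version pushes the counting into the database; countUsers is a hypothetical DAO method backed by SELECT COUNT(*).

int getUserSize() {
    // Let the database do the counting; only a single integer crosses the wire.
    return dao.countUsers();
}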
For example, there was a batch-import service built with MyBatis. When MyBatis performed the batch insert, it caused a memory overflow, even though this kind of insert normally adds no significant extra memory. The problem was finally traced through the source code.
The cause: when MyBatis processes a batch in a loop, the object it operates on is an array, whereas our interface was defined with a List. When a very large list is passed in, its toArray method is called to convert it into an array (a shallow copy), and in the final assembly stage a StringBuilder splices together the full SQL, so the actual memory used is far more than the list itself.
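A common mitigation, sketched here with a hypothetical insertBatch mapper method, is to split the list into bounded chunks so no single array copy or spliced SQL statement grows without limit.

// Insert in fixed-size chunks to bound both the array copy and the spliced SQL.
private static final int CHUNK = 500;   // chunk size is an assumption; tune for your row size

public void batchInsert(List<User> users) {
    for (int i = 0; i < users.size(); i += CHUNK) {
        List<User> chunk = users.subList(i, Math.min(i + CHUNK, users.size()));
        userMapper.insertBatch(chunk);  // hypothetical mapper method
    }
}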
It turns out that both insert operations and queries are prone to problems whenever the data set involved is very large. Because the project pulls in many frameworks, analyzing the exact memory footprint becomes very difficult, so keeping operations in small batches and result sets clean is a very good habit.
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.