×
Community Blog Detailed Explanation of MySQL Character Set and Collation

Detailed Explanation of MySQL Character Set and Collation

This article explains the relationship between the Unicode standard and the various character sets in MySQL, with a detailed introduction to the defau.

By Xizhe Zhang

Overview

Have you ever dealt with garbled text when using MySQL? Have you tried to query an English string only to retrieve a special character instead? Noticed that sometimes strings are sorted with lowercase letters first, other times with uppercase? If you overlook the character set, collation, and related settings in MySQL, you might run into these issues. This article begins by illustrating such scenarios with examples.

Example 1

An incorrectly configured client can lead to disordered binary storage of strings, which in turn might cause garbled text in queries from otherwise normal clients, as shown below:

-- 1. Create a table named t1 with two columns. One column uses the utf8mb4 charset and the other column uses the latin1 charset.
mysql> create table t1 (a char(5) character set utf8mb4, b char(5) character set latin1);
-- 2. The client actually uses the utf8mb4 charset. This simulates that the client misconfigures the server and inserts data.
mysql> set names latin1;
-- Even if the latin1 charset of Column b does not have the character '张', it can be inserted successfully. In normal circumstances, the column cannot be inserted. From here on, the error begins.
mysql> insert into t1 values('张','张');
-- 3. Query data on a normal client.
mysql> set names utf8mb4;
-- The content of Column a (utf8mb4 charset) is completely disordered, and the content of Column b (latin1 charset) is utf8mb4 encoding of '张'.
mysql> select a,hex(a),b,hex(b) from t1;
+--------+--------------+--------+--------+
| a      | hex(a)       | b      | hex(b) |
+--------+--------------+--------+--------+
| å¼     | C3A5C2BCC2A0 | å¼     | E5BCA0 |
+--------+--------------+--------+--------+

Example 2

When you query a string column with an equivalent condition, a completely different string is returned. Is there a bug in MySQL? In fact, the answer is no. The comparison of strings is closely related to the collation configuration. If you don't pay attention to it, you may get confusing results.

-- 1. Create a table named t1 that contains one column. By default, the utf8mb4 charset and utf8mb4_0900_ai_ci Collation are used.
mysql> create table t1 (a char(5));
-- 2. Insert two rows of data.
mysql> insert into t1 values ('ß'),('ss');
-- 3. Query the table based on the equivalence condition. Two strings with completely different encoding results are returned!
mysql> select a,hex(a) from t1 where a='ss';
+------+--------+
| a    | hex(a) |
+------+--------+
| ß    | C39F   |
| ss   | 7373   |
+------+--------+
-- 4. When you try another collation, it's normal again.
mysql> select a,hex(a) from t1 where a='ss' collate utf8mb4_0900_as_cs;
+------+--------+
| a    | hex(a) |
+------+--------+
| ss   | 7373   |
+------+--------+

Example 3

This example shows the influence of collation on string sorting. Why is it sometimes lowercase first and sometimes uppercase first? How is the sequence of different strings determined? This is all determined by the collation of the configuration.

-- 1. Create a table named t1 that contains one column. By default, the utf8mb4 charset and utf8mb4_0900_ai_ci Collation are used.
mysql> create table t1 (a char(5));
-- 2. Insert two rows of data.
mysql> insert into t1 values ('abc'),('ABC');
-- 3. Return the sorted result of strings with lowercase first.
mysql> select * from t1 order by a;
+------+
| a    |
+------+
| abc  |
| ABC  |
+------+
-- 4. Change a collation, and it becomes uppercase first.
mysql> select * from t1 order by a collate utf8mb4_0900_bin;
+------+
| a    |
+------+
| ABC  |
| abc  |
+------+

As can be seen from the above three examples, when we use strings in MySQL, the charset and collation are important configuration items, and once the configuration is wrong, it may not be used as expected. When we use MySQL to view table definitions, we may often see words such as CHARSET and COLLATE as shown in the following example, so you may be familiar with them. However, most of the time we might not know the specific meaning of these charsets and collations, or how to configure them. We may either follow the default configuration or copy them from the definition of existing library tables. But are these configurations really suitable for current applications? Reasonable selection of charsets and collations, and an understanding of how strings are compared can greatly help us avoid the problems described in the previous example.

mysql> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `a` char(10) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `b` varchar(10) CHARACTER SET latin2 COLLATE latin2_general_ci DEFAULT NULL,
  `c` char(10) COLLATE utf8mb4_0900_as_cs DEFAULT NULL,
  `d` char(5) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
  `e` varchar(5) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_as_cs

MySQL itself supports many charsets and collations, but these configurations sometimes dazzle users. This article will introduce them progressively, aiming to answer the following questions:

  1. What are a charset and a collation?
  2. How do we configure different levels of charsets and collations? How far are their effects?
  3. How does the most commonly used Unicode charset store and compare strings?
  4. How do we select the_bin Collation of a binary charset and a non-binary charset?

This article is divided into two parts, and the structure is as follows:

● Part 1 (Concept Introduction and MySQL Configuration Items)

• Overview

• What are a charset and a collation?

• View charsets and collations supported by MySQL

• Unicode and UTF-8

• Configure the charset and collation

● Part 2 (Charset Conversion and Sorting Algorithms for Unicode and Binary Charsets)

• Convert charsets

• Sorting algorithms for Unicode strings

• Binary charset and _bin Collation

• Summary

What Are a Charset and a Collation?

A charset is a set of symbols and encodings. A collation is a set of rules for comparing characters in a charset. Let's take the MySQL default utf8mb4 charset and the utf8mb4_0900_ai_ci collation as examples to clarify this difference. The following is the encoding of six characters in the utf8mb4 charset. The utf8mb4 charset allows the encoding of characters to be non-fixed-length, and the length can be 1 to 4 bytes. The specific encoding method will be described in the Unicode and UTF-8 section.

Character utf8mb4 encoding
E 0x45 (1 byte)
a 0x61 (1 byte)
e 0x65 (1 byte)
0xe1baaf (3 bytes)
ế 0xe1babf (3 bytes)
𝔸 0xf09d94b8 (4 bytes)

The above can be thought of as a character-to-encoding mapping, that is, a charset. The role of the charset is to provide the mapping of characters to encodings, but the charset does not define the comparison relationship between characters, and this part of the work is defined by the collation.

For the above 6 characters, if we only take the numerical sizes corresponding to the binary encodings and compare them, then their relationship is E<a<e<ắ<ế< 𝔸. This is the simplest comparison method and is referred to as the binary collation ending with "_bin" in MySQL.

However, most of the time, we hope that a should be ranked in front of e and E regardless of the case. For some European languages or the Chinese phonetic alphabet, a should also be ranked in front of e regardless of whether there are tone symbols above letters. Therefore, more refined collations emerge as the times require, which contain many definitions of rules that are in line with human language habits. For example, in the above example, the size relationship of 6 characters in utf8mb4_0900_ai_ci collation is a= ắ= 𝔸<E=e=ế. In this collation, all characters ignore tones and cases, which is more complicated than the binary collation.

The above example only uses 6 characters in the utf8mb4 charset. In fact, the utf8mb4 charset contains characters from almost all languages in the world. The customized comparison rules between these characters will be more complicated, including not only the tone and case but also sometimes multi-character mapping (for example, ß = ss in utf8mb4_0900_ai_ci). For this reason, Unicode specifically defines the string comparison algorithm and explains how to compare characters. MySQL also implements its own comparison algorithm accordingly.

From the above description, we can see that a charset is a set of one-to-one mappings from symbols to encodings and a set of collation is a series of rules for comparing characters. A charset can have many sets of collations. For ease of description, 'charset' is used to represent a character set and 'collation' to a set of collation rules.

View Charsets and Collations Supported by MySQL

View Charsets

MySQL supports many types of charsets and collations. You can execute the following statements to view the supported charsets. The Column Charset represents the name of the charset, the Column Description represents the description, the Column Default collation represents the default collation of the charset, and the Column Maxlen represents the maximum number of encoded bytes in the charset.

mysql> SHOW CHARACTER SET; -- The corresponding view INFORMATION_SCHEMA.CHARACTER_SETS
+----------+---------------------------------+---------------------+--------+
| Charset  | Description                     | Default collation   | Maxlen |
+----------+---------------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese        | big5_chinese_ci     |      2 |
| binary   | Binary pseudo charset           | binary              |      1 |
...
| latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
...
| ucs2     | UCS-2 Unicode                   | ucs2_general_ci     |      2 |
...
| utf8mb3  | UTF-8 Unicode                   | utf8mb3_general_ci  |      3 |
| utf8mb4  | UTF-8 Unicode                   | utf8mb4_0900_ai_ci  |      4 |
...

View Collations

The default charset used by MySQL 8.0 is utf8mb4. In this example, use the following statements to view the collation that can be used by the charset, where the Column Compiled represents whether the collation can be implemented in the MySQL source code:

mysql> SHOW COLLATION WHERE Charset = 'utf8mb4'; -- The corresponding view INFORMATION_SCHEMA.COLLATIONS
+----------------------------+---------+-----+---------+----------+---------+---------------+
| Collation                  | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+----------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci         | utf8mb4 | 255 | Yes     | Yes      |       0 | NO PAD        |
| utf8mb4_0900_as_ci         | utf8mb4 | 305 |         | Yes      |       0 | NO PAD        |
| utf8mb4_0900_as_cs         | utf8mb4 | 278 |         | Yes      |       0 | NO PAD        |
| utf8mb4_0900_bin           | utf8mb4 | 309 |         | Yes      |       1 | NO PAD        |
| utf8mb4_bin                | utf8mb4 |  46 |         | Yes      |       1 | PAD SPACE     |
...
| utf8mb4_general_ci         | utf8mb4 |  45 |         | Yes      |       1 | PAD SPACE     |
...
| utf8mb4_spanish2_ci        | utf8mb4 | 238 |         | Yes      |       8 | PAD SPACE     |
...
| utf8mb4_unicode_520_ci     | utf8mb4 | 246 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_unicode_ci         | utf8mb4 | 224 |         | Yes      |       8 | PAD SPACE     |
...
| utf8mb4_zh_0900_as_cs      | utf8mb4 | 308 |         | Yes      |       0 | NO PAD        |
+----------------------------+---------+-----+---------+----------+---------+---------------+

In MySQL, charsets and collations follow these rules:

● The same collation cannot be used by multiple charsets, that is, a collation is attached to a charset.

● Each charset has a default collation.

Pad_attribute of Collation

When viewing the collation information, you can find that there is a Column Pad_attribute with the value NO PAD or PAD SPACE, which defines the attitude of collation to the trailing spaces of a string. The attribute of most collations in MySQL is PAD SPACE, and the attribute of collations implemented based on UCA 9.0.0 (with 0900 in the name) is NO PAD:

PAD SPACE: When you compare (excluding LIKE) two strings, trailing spaces will be ignored. For example, "a" is considered equal to " a ".

NO PAD: When comparing two strings, trailing spaces cannot be ignored. For example, "a" is not equal to " a ".

● When the LIKE operator is used, it is not affected by this value and trailing spaces cannot be ignored.

mysql> SELECT COLLATION_NAME, PAD_ATTRIBUTE
       FROM INFORMATION_SCHEMA.COLLATIONS
       WHERE COLLATION_NAME LIKE 'utf8mb4%bin';
+------------------+---------------+
| COLLATION_NAME   | PAD_ATTRIBUTE |
+------------------+---------------+
| utf8mb4_bin      | PAD SPACE     |
| utf8mb4_0900_bin | NO PAD        |
+------------------+---------------+
-- 1. utf8mb4_bin with 'PAD SPACE' attribute thinks 'a ' = 'a'
mysql> SELECT 'a ' = 'a' COLLATE utf8mb4_bin;
+--------------------------------+
| 'a ' = 'a' COLLATE utf8mb4_bin |
+--------------------------------+
|                              1 |
+--------------------------------+
-- 2. utf8mb4_0900_bin with 'NO PAD' attribute thinks 'a ' != 'a'
mysql> SELECT 'a ' = 'a' COLLATE utf8mb4_0900_bin;
+-------------------------------------+
| 'a ' = 'a' COLLATE utf8mb4_0900_bin |
+-------------------------------------+
|                                   0 |
+-------------------------------------+
-- 3. The LIKE operator is not affected by 'PAD SPACE', and trailing spaces cannot be ignored
mysql> SELECT 'a ' LIKE 'a' COLLATE utf8mb4_bin;
+-----------------------------------+
| 'a ' LIKE 'a' COLLATE utf8mb4_bin |
+-----------------------------------+
|                                 0 |
+-----------------------------------+

Naming Rules for Collations

● Collation names begin with their associated charset name, usually followed by one or more suffixes to indicate additional characteristics. For example, utf8mb4_0900_ai_ci and latin1_swedish_ci are the collations of utf8mb4 and latin1, respectively. A binary charset has only one collation, which is also named binary with no suffix.

● Collations for a specific language include a local code or language name. For example, _tr and _hu in the utf8mb4 _tr_0900_ai_ci and utf8mb4 _hu_0900_ai_ci represent sorting the characters in utf8mb4 by using the rules of the Turkish and Hungarian languages, respectively.

● The collation suffix indicates whether the collation distinguishes different cases, tones, hiragana/katakana, or whether they are binary. The following table shows the suffixes used to indicate these characteristics. If the collation name does not contain _ai or _as, the _ci in the name is metaphorical to _ai (for example, utf8mb4_unicode_ci is neither tone-sensitive nor case-sensitive), and the _cs in the name is metaphorical to _as (for example, latin1_general_cs is both tone-sensitive and case-sensitive):

Suffix Description
_ai Do not distinguish between tones. For example, it thinks 'a' = 'á'
_as Distinguish between tones. For example, it thinks 'a' != 'á'
_ci Do not distinguish between cases. For example, it thinks 'a' = 'A'
_cs Distinguish between cases. For example, it thinks 'a' != 'A'
_ks Distinguish between Hiragana/Katakana in Japanese
_bin Compare binary encodings

● The collation name for a Unicode charset may include a version number to indicate which version of the Unicode sorting algorithm (UCA, which will be introduced later) the collation is based on. Collations with no version number in the name are based on the 4.0.0 version of the UCA by default. For example:

• utf8mb4_0900_ai_ci is based on UCA 9.0.0

• utf8mb4 _unicode_520_ci is based on UCA 5.2.0

• utf8mb4_unicode_ci is based on UCA 4.0.0

Unicode and UTF-8

To make the later content easier to understand, here's an introduction to the Unicode standard and UTF-8 encoding.

Unicode Standard

The Unicode Standard is maintained by the Unicode Consortium, a non-profit organization dedicated to organizing and encoding most of the world's writing systems. The latest version of the Unicode Standard, Version 15.0.0, released in September 2022, contains more than 140,000 characters, and each is assigned a unique code point (numeric number). Unicode has become part of the International Standard ISO. The most common Unicode encoding formats are ASCII-compatible UTF-8 and UCS-2-compatible UTF-16.

The Unicode Standard divides the encoding area into 17 planes, numbered from 0 to 16, of which plane 0 is called the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), and planes 1 to 16 are called supplementary planes (U+10000 to U+10FFFF). These planes, together with the BMP plane, require at least 21 bits of encoding area, slightly less than 3 bytes. The code point of the BMP plane can be represented by a single UTF-16 code unit (2 bytes), or encoded by a UTF-8 of 1 to 3 bytes, while the code point of the supplementary planes is encoded by 4 bytes in the UTF-8 and 4 bytes in the UTF-16.

UTF-8 Encoding

To save the area, there are different implementations of encoding the Unicode Standard. The implementation of Unicode is called the Unicode Transformation Format (UTF). The most common one is UTF-8 encoding. Other implementations include UTF-16 (characters are represented by two bytes or four bytes) and UTF-32 (characters are represented by four bytes). Some examples are given below:

Character Unicode code point UTF-8 encoding
a 0x0061 0x61 (1 byte)
0x1EAF 0xe1baaf (3 bytes)
𝔸 0x1D538 0xf09d94b8 (4 bytes)

It can be seen that the original code point (number) assigned by Unicode to characters is quite different from UTF-8 specific codes because UTF-8 encodes Unicode characters in a variable-width encoding form with 8-bit (1 byte) as a code unit. The specific conversion rules are shown in the following table (the first row in the cell is hexadecimal and the second row is binary):

Unicode code point range (represented by 3 bytes) UTF-8 encoding Annotations
0x000000 - 0x00007F
00000000 00000000 0zzzzzzz
0x00 - 0x7F
0zzzzzzz
ASCII character range, 1 byte
0x000080 - 0x0007FF
00000000 00000yyy yyzzzzzz
0xC080 - 0xDFBF
110yyyyy 10zzzzzz
The first byte starts with 110 and the second byte starts with 10, totaling 2 bytes
0x000800 - 0x00D7FF
0x00E000 - 0x00FFFF
00000000 xxxxyyyy yyzzzzzz
0xE08080 - 0xEFBFBF
1110xxxx 10yyyyyy 10zzzzzz
The first byte starts with 1110, and the second to third bytes start with 10, totaling 3 bytes
0x010000 - 0x10FFFF
000wwwxx xxxxyyyy yyzzzzzz
0xF0808080 - 0xF7BFBFBF
11110www 10xxxxxx 10yyyyyy 10zzzzzz
The first byte starts with 11110, and the second to fourth bytes start with 10, totaling 4 bytes

It can be seen that the conversion rule is actually very intuitive, that is, the value corresponding to the Unicode code point is stored in 3 bytes (up to 21 bits), and then the bits are populated in UTF-8 corresponding empty bits according to their own range.

The code point area of Chinese characters is as follows (source). UTF-16 uses 2-byte encoding for BMP characters (Unicode code points only need 2 bytes in the following table), while UTF-8 uses 3-byte encoding for the Chinese characters on the BMP, and both UTF-16 and UTF-8 for the Chinese characters on non-BMP need 4-byte encoding. Therefore, if the stored characters are basically Chinese characters, the encoding length of the UTF-16 charset is always better than or equal to the encoding length of the UTF-8 charset, which can effectively help reduce the storage space, up to 1/3 of the storage space.

Charset Characters Unicode code point
Basic Chinese characters 20,902 characters 4E00-9FA5
Basic Chinese character supplement 38 characters 9FA6-9FCB
Extension A 6,582 characters 3400-4DB5
Extension B 42,711 characters 20000-2A6D6
Extension C 4,149 characters 2A700-2B734
Extension D 222 characters 2B740-2B81D
Kangxi radicals 214 characters 2F00-2FD5
Radical extension 115 characters 2E80-2EF3
Compatible Chinese characters 477 characters F900-FAD9
Compatible extension 542 characters 2F800-2FA1D
PUA (GBK) components 81 characters E815-E86F
Component extension 452 characters E400-E5E8
PUA supplement 207 characters E600-E6CF
Chinese character strokes 36 characters 31C0-31E3
Chinese character structure 12 characters 2FF0-2FFB
Chinese phonetic notation 22 characters 3105-3120
Phonetic notation extension 22 characters 31A0-31BA
1 character 3007

Charsets Supporting Unicode Standard in MySQL

MySQL supports multiple Unicode charsets:

utf8mb4: UTF-8 encoding for the Unicode charset using one to four bytes for each character.

utf8mb3: UTF-8 encoding for the Unicode charset using one to four bytes for each character. Only BMP representation is supported. In MySQL 8.0, this charset has been deprecated and you should switch to utf8mb4 as soon as possible.

utf8: the alias of utf8mb3. In MySQL 8.0, this alias has been deprecated and you should switch to utf8mb4 instead. It is expected that in future releases, utf8 will become an alias for utf8mb4.

ucs2: UCS-2 encoding for each character using two bytes. Only BMP representation is supported. It has been deprecated in MySQL 8.0.28 and is expected to be removed in a future release.

utf16: UTF-16 encoding for each character using two or four bytes. It is similar to ucs2 but it includes extensions to the supplementary plane.

utf16le: similar to utf16, but little-endian instead of big-endian.

utf32: UTF-32 encoding for each character using four bytes.

Configure the Charset and Collation

System Variables

Use the show variables command to easily view related variables:

mysql> show variables like "character\_set\_%";
+--------------------------+---------+
| Variable_name            | Value   |
+--------------------------+---------+
| character_set_client     | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database   | utf8mb4 |
| character_set_filesystem | binary  |
| character_set_results    | utf8mb4 |
| character_set_server     | utf8mb4 |
| character_set_system     | utf8mb3 |
+--------------------------+---------+
mysql> show variables like "collation%";
+-------------------------------+--------------------+
| Variable_name                 | Value              |
+-------------------------------+--------------------+
| collation_connection          | utf8mb4_0900_ai_ci |
| collation_database            | utf8mb4_0900_ai_ci |
| collation_server              | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+

The reason why MySQL has so many variables related to charsets is that these variables will work in different dimensions. The most important of these system variables are the connection-related variables: character_set_client, character_set_results, character_set_connection, and collation_connection, which are described in detail later in the Configure Connection section. The meaning of the remaining variables is as follows:

character_set_server and collation_server: If the charset and collation of the database are not specified in the CREATE DATABASE statement, the default charset and collation of the database are character_set_server and collation_server, and they have no other purpose.

character_set_database and collation_database: This group of variables is actually used to represent the default charset and collation of the current database (switching through use db_name). It is set by the database itself. It can be found that these two variables will change when you switch the database. When there is no database in use, the values of these two variables are the same as those of character_set_server and collation_server. Therefore, it can be found that this group of variables is actually used to represent information. Although it can still be modified by users, it is not recommended. Starting from 8.0.14, only users with permissions can modify it. In future versions, it will become read-only variables.

character_set_filesystem: This variable is used for string literals involving file paths, such as in LOAD DATA and SELECT... INTO OUTFILE statements and the LOAD_FILE() function. Such file names are converted from character_set_client to character_set_filesystem before attempting to open the file. The default value is binary, which means that no conversion is performed. If the file system uses some encodings such as UTF-8 to represent file names, you should set the character_set_filesystem to utf8mb4.

character_set_system: This variable is a global read-only variable and is set to utf8mb3. All metadata uses this charset. The metadata includes the column name, database name, user name, version name, and most of the string results of the SHOW command. Using utf8mb3 to store metadata does not mean that the server returns the column names in the character_set_system charset. When 'SELECT column1 FROM t' is executed, the column name 'column1' itself is converted to the charset determined by the character_set_results system variable and then returned from the server to the client.

The following is an example of an attempt to include non-utf8mb3 characters in the table name, which is unsuccessful:

-- 1. Although utf8mb3 is created successfully, it is converted into a question mark because utf8mb3 does not contain the character '𠀫'. The charset conversion process is described later.
mysql> create table `𠀫` (a int);
Query OK, 0 rows affected, 1 warning (0.26 sec)
mysql> show warnings;
+---------+------+------------------------------------------------------------------+
| Level   | Code | Message                                                          |
+---------+------+------------------------------------------------------------------+
| Warning | 1300 | Cannot convert string '\xF0\xA0\x80\xAB' from utf8mb4 to utf8mb3 |
+---------+------+------------------------------------------------------------------+
mysql> select name, hex(name) from mysql.tables where schema_id=6;
+--------+--------------+
| name   | hex(name)    |
+--------+--------------+
| ?      | 3F           | # 0x3F is the encoding of '?'
+--------+--------------+
-- 2. Try to create another table named "𠀪". The creation is unsuccessful because it is converted to "?" and the table name is duplicate.
mysql> create table `𠀪` (a int);
ERROR 1050 (42S01): Table '?' already exists

Configure Connection

Before really introducing the four variables of character_set_client, character_set_results, character_set_connection, and collation_connection, we will first introduce the concept of string literal. For example, in the simplest example below, the string of 'string' is the string literal, that is, the string in the statement sent by the client. All string literals have their own charsets and collations, which can be specified explicitly (specific methods will be introduced in detail later, and here is a concept introduction first). character_set_connection and collation_connection are the default charset and collation for string literals.

SELECT 'string';

For example:

Here's an extreme example of how these variables work. Assume that the client uses the utf8mb4 charset and sets the character_set_client and character_set_results of this session to utf8mb4 when establishing a connection. The character_set_connection and collation_connection are set to latin1 and latin1_swedish_ci, respectively, because the string literals are expected to use the collation of the latin1 charset. The charset and collation of the columns accessed by the query statement are ucs2 and ucs2_general_ci. Then in the whole process, multiple charset conversions will occur according to the settings of these variables, as shown in the following figure:

1

The following text describes the steps in the above figure:

  1. The string literal 'á' in the query statement is first encoded on the client by the utf8mb4 charset, which is encoded as 0xC3A1.
  2. After the server-side reception, it will choose whether to convert according to the character_set_client and character_set_connection values and because the two variable values in the example are different, the encoding of 'a' is converted to use the latin1 charset and becomes 0xE1.
  3. After parsing the statement, it is found that the queried column uses ucs2 charset, so the encoding of 'á' has changed again to use ucs2 charset, and becomes 0x00E1.
  4. After the statement is actually executed, the result will be obtained. The string 'á' is also included in the result, and the ucs2 charset is used.
  5. Before the result is sent to the client, the part of the result that is not encoded in character_set_results will be converted. In this example, the string 'á' in the result will be converted to utf8mb4 charset and becomes 0xC3A1.
  6. After receiving the result, the client decodes the result according to its own utf8mb4 charset and then performs subsequent processing.

Here are the steps to reproduce the example:

-- 1. Create a table that contains one column and uses the ucs2 charset
mysql> create table t1 (a varchar(5) character set ucs2);
-- 2. Insert a row of data 'á'
mysql> insert into t1 values ('á');
-- 3. Set character_set_connection
mysql> set character_set_connection=latin1;
-- 4. Query the encodings of the 'á' charset with the encoding of ucs2, latin1, and utf8mb4 charset from front to back
mysql> select a, hex(a), hex('á'), hex(_utf8mb4'á') from t1 where a='á';
+------+--------+-----------+-------------------+
| a    | hex(a) | hex('á')  | hex(_utf8mb4'á')  |
+------+--------+-----------+-------------------+
| á    | 00E1   | E1        | C3A1              |
+------+--------+-----------+-------------------+

Specific Meanings of Variables

After the example just now, I believe you have a concept of the general purpose of several variables. The specific meanings of these variables are as follows:

character_set_connection and collation_connection: This group of variables is used for string literals. When the string literal itself does not specify a charset and collation, character_set_connection and collation_connection are used as the charset and collation of the literal. At the same time, this group of variables is also used when the number is converted into a string, and the charset and collation of the target string are also the group of variables.

character_set_client: This variable is the charset used by statements from clients. The session value of this variable is set by the client when the client is connecting to the server (many clients support a default-character-set option to explicitly specify this charset). When the value requested by the client is unknown or unavailable (ucs2, utf16, utf16le, and utf32), or no charset is requested at all, or the server configuration --skip-character-set-client-handshake ignores the client request, the global value of the variable is used to set the session value.

character_set_results: This variable is the charset used when returning query results to the client, including result data of column values, result metadata of column names, and error messages, which does not affect the query process, except that a charset conversion may be performed when the query results are finally sent to the client.

In most cases, garbled characters are caused by the client's failure to set the character_set_client and character_set_results correctly, because the server can only rely on the information from the client to determine the values of these two variables. When the client does not transmit or mis-transmits, it will lead to a huge discrepancy between "what the server thinks" and "what the client actually means". After a series of charset conversions, unexpected garbled characters occur. This is also the reason for the garbled characters in Example 1 mentioned at the beginning of this article.

Command Palette

It can be seen that the variables related to a connection are mainly character_set_client, character_set_results, character_set_connection, and collation_connection. For MySQL's convenience, SET NAMES and SET CHARACTER SET commands of MySQL can set four variables at one time.

SET NAMES

SET NAMES 'charset_name' is equivalent to the following three statements:

• SET character_set_client = charset_name;

• SET character_set_results = charset_name;

• SET character_set_connection = charset_name;

Setting the character_set_connection will implicitly set the collation_connection as the default collation of the charset. If you want to set the collation in more detail, you can use the SET NAMES 'charset_name' COLLATE 'collation_name' to specify it.

SET CHARACTER SET

SET CHARACTER SET 'charset_name' is equivalent to the following three statements:

• SET character_set_client = charset_name;

• SET character_set_results = charset_name;

• SET collation_connection = @@collation_database;

The only difference with SET NAMES is that the last sentence sets the collation_connection. The variable @@collation_database represents the default collation of the current database (switching the database by db_name). The setting of collation_connection here also implicitly changes the character_set_connection to the corresponding charset.

Configure Databases, Tables, and Columns

The following syntax is used to set charsets and collations when you create or modify a database, table, or column:

-- Create a database.
CREATE DATABASE db_name CHARACTER SET charset_name COLLATE collation_name;
-- Modify a database.
ALTER DATABASE db_name CHARACTER SET charset_name COLLATE collation_name;
-- Create a table.
CREATE TABLE tbl_name (column_list) CHARACTER SET charset_name COLLATE collation_name;
--  Modify a table.
ALTER TABLE tbl_name CHARACTER SET charset_name COLLATE collation_name;
-- Create a column.
CREATE TABLE tbl_name(col_name VARCHAR(5) CHARACTER SET charset_name COLLATE collation_name);
-- Modify a column.
ALTER TABLE tbl_name MODIFY col_name VARCHAR(5) CHARACTER SET charset_name COLLATE collation_name;

Follow these rules during configuration:

● If you do not explicitly specify a charset and a collation during creation:

Database: Use character_set_server and collation_server as its default charset and collation

Table: Use the default charset and collation of the database as its default charset and collation

Column: Use the default charset and collation of the table as its default charset and collation

● If you explicitly specify a charset and a collation, use the explicitly specified charset and collation.

● If you explicitly specify a charset but do not explicitly specify a collation, use the explicitly specified charset and its default collation.

● If you explicitly specify a collation but do not explicitly specify a charset, use the explicitly specified collation and the charset to which it belongs.

● All of the CHAR, VARCHAR, TEXT, ENUM, and SET columns support specifying charsets and collations.

● When you modify the charset of a column, MySQL attempts to map the data value. However, if the charset is incompatible before and after the modification, data loss may occur.

String Literals

All string literals have their own charsets and collations, which can be explicitly specified as follows. When specifying a charset, you need to add an underscore before it. At the same time, this syntax can also be used with b and X prefixes:

_charset_name'string' COLLATE collation_name

Here are some examples:

SELECT 'abc';
SELECT _latin1'abc';
SELECT _binary'abc';
SELECT _utf8mb4'abc' COLLATE utf8mb4_danish_ci;
SELECT _latin1 X'4D7953514C';
SELECT _utf8mb4 0x4D7953514C COLLATE utf8mb4_danish_ci;
SELECT _latin1 b'1000001';
SELECT _utf8mb4 0b1000001 COLLATE utf8mb4_danish_ci;

Follow these rules during configuration:

• If the charset and collation are not explicitly specified, character_set_connection and collation_connection will be used as their charset and collation, and for strings prefixed with b and X, charsets and collations will be set to binary.

• If you explicitly specify a charset and a collation, use the explicitly specified charset and collation.

• If you explicitly specify a charset but do not explicitly specify a collation, use the explicitly specified charset and its default collation.

• If you explicitly specify a collation but do not explicitly specify a charset, the explicitly specified collation needs to be a collation of character_set_connection.

• _charset_name is referred to as Introducer. Without the introducer, the parser converts the charset of the string literal into character_set_connection. The role of the introducer is to tell the parser that the following string uses charset_name as the charset, but it does not convert the string to the target charset like the CONVERT() function, nor does it change the binary encoding of the string. It only performs the padding in some cases, and may lead to garbled characters:

-- 1. Create table t1 with one column that uses the latin1 charset.
mysql> create table t1(a char(5) character set latin1);
-- 2. The direct inserting of '张' fails because latin1 does not have this character, but after adding the_latin1 prefix, it is successful.
--   This is because the introducer allows the MySQL parser to mistakenly assume that the utf8mb4 string is the latin1 charset.
--   No check or conversion is required when inserting the corresponding column later.
mysql> insert into t1(a) values('张');
ERROR 1366 (HY000): Incorrect string value: '\xE5\xBC\xA0' for column 'a' at row 1
mysql> insert into t1(a) values(_latin1'张');
Query OK, 1 row affected (0.00 sec)
-- 3. The code 0xE5BCA0 of '张' on utf8mb4 is selected and stored, resulting in garbled characters.
mysql> select a, hex(a) from t1;
+--------+--------+
| a      | hex(a) |
+--------+--------+
| å¼     | E5BCA0 |
+--------+--------+

• In most cases, _binary'11110000' and b'11110000' and X'F0' are equivalent and are considered binary strings. They can all be used, but there are some differences when using bit operations. As shown in the following example, the bit operation results of binary strings prefixed with b and X are numbers, so there is no leading 0. The bit operation result of the binary string with a binary prefix is also a string, so there is a leading 0.

mysql> SET @v1 = X'000D' | X'0BC0';
mysql> SET @v2 = _binary X'000D' | X'0BC0';
mysql> SELECT HEX(@v1), HEX(@v2);
+----------+----------+
| HEX(@v1) | HEX(@v2) |
+----------+----------+
| BCD      | 0BCD     |
+----------+----------+

Charsets of Error Messages

MySQL builds error messages as follows:

• The error message template uses the utf8mb3 charset.

• When the parameters in the template are replaced with some values:

• Meta information such as table names or column names are also used internally by utf8mb3, so they are copied.

• The string values of non-binary charsets are converted to utf8mb3.

• Bytes with binary string values in the range of 0x20 to 0x7E are copied, and bytes outside the range are hexadecimal encoded using x. For example, if a duplicate key error occurs when trying to insert 0x41CF9F into the VARBINARY unique column, the resulting error message becomes "Duplicate entry 'AxCFx9F' for key 1", the last two bytes 0xCF9F are output as a hexadecimal string, and the first byte 0x41 is converted to the character A.

Once the error message is built, it can be written to an error log by MySQL or sent to the client:

● When error messages are written to the error log, they are written in the utf8mb3 charset at build time without conversion.

● When an error message is sent to a client program, the error message is converted from utf8mb3 to the charset specified by the character_set_results system variable. If the value of the character_set_results is NULL or binary, no conversion occurs. If the variable value is utf8mb3 or utf8mb4, no conversion occurs.

● If there are characters that cannot be represented in the character_set_results, some encodings may occur during the conversion process, and the encoding uses Unicode code points:

• Characters in the BMP range (0x0000 to 0xFFFF) are written by the nnnn notation.

• Characters outside the BMP range (0x10000 to 0x10FFFF) are written by the +nnnnnn notation

Convert charsets

In MySQL, there are many cases where the behavior of converting charset occurs, for example:

● Assign one column of data to another column that uses a different charset:

-- Convert latin1 to utf8mb4
UPDATE t1 SET utf8mb4_bin_column=latin1_column;
-- Convert utf8mb4 to latin1
INSERT INTO t1 (latin1_column) SELECT utf8mb4_bin_column FROM t2;

● Use a string literal to INSERT or UPDATE a column that uses a different charset:

SET NAMES latin1; -- The literal uses latin1.
-- Convert latin1 to utf8mb4
INSERT INTO t1 (utf8mb4_bin_column) VALUES ('string-in-latin1');

● The database sends the result to the client:

SET NAMES latin1; -- The client uses latin1.
-- The result string uses utf8mb4, which needs to be converted to latin1 and then sent to the client.
SELECT utf8mb4_bin_column FROM t2;

● Built-in functions CONVERT and CAST:

SELECT CONVERT(_latin1 'Müller' USING utf8mb4);
SELECT CAST('test' AS CHAR CHARACTER SET utf8mb4);

● ALTER TABLE modifies the charset in the column definition, which is described in detail in the subsequent section.

-- Convert utf8mb4 to latin1
ALTER TABLE t1 MODIFY utf8mb4_bin_column VARCHAR(10) CHARACTER SET latin1;
-- Convert the charset values of all columns in the table to latin1.
ALTER TABLE t1 CONVERT TO CHARACTER SET latin1;

● Strings of different charsets exist in the expression, which is described in detail in the subsequent section.

-- You need to determine the priority. In this example, the data in the latin1 column is converted from latin1 to utf8mb4.
SELECT CONCAT(utf8mb4_column, latin1_column) FROM t1;

Modify the Charset of a Column

To convert a binary or non-binary string column to use a specific charset, you need to use ALTER TABLE. To make the conversion successful, one of the following conditions must be met:

• If the data type of a column is BINARY, VARBINARY, or BLOB, all values it contains must be encoded with a single charset, which must match the target charset. If a binary column stores information encoded with multiple charsets, MySQL cannot know which values use which charset and cannot convert the data correctly.

• If the data type of the column is CHAR, VARCHAR, or TEXT, its content should be encoded by the original column definition of the charset instead of other charsets. It should be noted that if a character in the column data is not contained by both the new and old charsets, data loss may occur during the conversion process.

After you use ALTER TABLE to modify a column definition, there are two additional points to note:

• If the original column type is BINARY(50), you can convert it to CHAR(50), but the end of the converted value will be populated with 0x00 bytes, which may not meet your expectations. To delete these bytes, you can use the TRIM() function:

UPDATE t SET col1 = TRIM(TRAILING 0x00 FROM col1);

• If you specified attributes such as the charset and collation when you created the column, you should also specify these attributes when you modify the table by ALTER TABLE. For example, if NOT NULL and an explicit DEFAULT value are specified in the column definition, they should also be provided in the ALTER TABLE statement, otherwise, the final column definition will not include these attributes.

Charset Conversions in Expressions

SELECT x FROM T WHERE x = 'Y';
SELECT concat(x, 'Y', z) FROM T;

In the above example, if column x, literal, and column z all use the same charset and collation, then the above statement will not have any ambiguity, but if their charsets or collations are different, then whose charset or collation prevails? For this reason, MySQL defines some rules to eliminate these ambiguities. You can use the COERCIBILITY function to check the priority. The lower the value is, the higher the priority is.

• The coercibility of an explicit COLLATE clause is 0.

SELECT COERCIBILITY(_utf8mb4'A' COLLATE utf8mb4_bin); -- Return 0

• The coercibility of the string obtained by concatenating two strings with different collations is 1.

• The coercibility of the collation of a column, stored procedure parameter, or local variable is 2.

• The coercibility of a system constant, such as a string returned by the USER() or VERSION() function is 3.

SELECT COERCIBILITY(VERSION()); -- Return 3

• The coercibility of the literal collation is 4.

SELECT COERCIBILITY('A'); -- Return 4

• The coercibility of the collation of numeric or time values is 5. For example, in CONCAT(1, 'abc'), numbers are implicitly converted to strings.

SELECT COERCIBILITY(1000); -- Return 5

• The coercibility of NULL or an expression derived from NULL is 6.

SELECT COERCIBILITY(NULL); -- Return 6

Based on the above rules, the string with the lowest coercibility, that is, the highest priority, can be selected. Its charset and collation will be used for expressions and results, but what if the coercibilities of the two strings are the same? Then you need the following rules:

  1. Return an error if neither or both strings use Unicode charsets.
  2. If one string uses Unicode charsets and the other uses non-Unicode charsets, then the string using Unicode charsets prevails and the string using non-Unicode charsets will be automatically converted to Unicode charsets. The reason is that MySQL considers any character contained in a charset as a subset of the characters contained in a Unicode charset.

Here are some specific examples:

Expression Collation used
column1 = 'A' Use the collation of column1
column1 = 'A' COLLATE x Use collation x
column1 COLLATE x = 'A' COLLATE y Report an error

Source Code Implementation

The handler function is the String::copy function located in the sql-common/sql_string.cc, which mainly does three things:

  1. Check whether the charset needs to be converted, and copy the string directly if not.
  2. Check whether the original charset is binary, and copy and align it at the byte level if so.
  3. Other cases mean that charset conversion is required. Perform the conversion and copy it.
bool String::copy(const char *str, size_t arg_length,
                  const CHARSET_INFO *from_cs, const CHARSET_INFO *to_cs,
                  uint *errors) {
  ...
  /* 1. Copy it directly without conversion. */
  if (!needs_conversion(arg_length, from_cs, to_cs, &offset)) {
    *errors = 0;
    return copy(str, arg_length, to_cs);
  }
  /* 2. The original charset is binary and copy and align it at the byte level. */
  if ((from_cs == &my_charset_bin) && offset) {
    *errors = 0;
    return copy_aligned(str, arg_length, offset, to_cs);
  }
  size_t new_length = to_cs->mbmaxlen * arg_length;
  if (alloc(new_length)) return true;
  /* 3. Convert the charset and copy it. */
  m_length = copy_and_convert(m_ptr, new_length, to_cs, str, arg_length,
                              from_cs, errors);
  m_charset = to_cs;
  return false;
}

The following are the details about how charset conversion is performed. The function copy_and_convert calls the function my_convert through the transmission to check character by character. If it is an ASCII character, it will be copied directly. If not, it will call my_convert_internal for subsequent processing. This function is the key to charset conversion. It does the following things for each character:

1.  Parse the binary code based on the original charset rules and obtain the corresponding Unicode code point based on the parsing result:

a) If the parsing succeeds and the Unicode code point is obtained, go to step 2.

b) If the parsing fails, the corresponding byte is regarded as the '?' character, and its Unicode code point U+003F is used.

c) The parsing is successful, but Unicode does not contain the corresponding character, so the Unicode code point cannot be obtained, the character is regarded as the '?' character, and its Unicode code point U+003F is used. This generally does not happen here, because Unicode characters are usually the most comprehensive.

2.  Convert the Unicode code point obtained in the previous step to the encoding of the target charset. If the target charset does not contain the character, convert it to the '?' character.

static size_t my_convert_internal(char *to, size_t to_length,
                                  const CHARSET_INFO *to_cs, const char *from,
                                  size_t from_length,
                                  const CHARSET_INFO *from_cs, uint *errors) {
  ...
  /* Each charset defines its own mapping function from character encodings to Unicode code points, and Unicode code points to character encodings.
     Here you need the first function of the original charset and the second function of the new charset. */
  my_charset_conv_mb_wc mb_wc = from_cs->cset->mb_wc;
  my_charset_conv_wc_mb wc_mb = to_cs->cset->wc_mb;
  ...
  while (true) { /* The loop process of each character */
    /* 1. Parse the binary code to the Unicode code point. */
    if ((cnvres = (*mb_wc)(from_cs, &wc, pointer_cast<const uchar *>(from),
                           from_end)) > 0) /* a. Succeed */
      from += cnvres;
    else if (cnvres == MY_CS_ILSEQ) { /* b. Parsing fails */
      error_count++;
      from++;
      wc = '?';
    } else if (cnvres > MY_CS_TOOSMALL) { /* c. The parsing succeeds but the Unicode code point fails to be obtained */
      error_count++;
      from += (-cnvres);
      wc = '?';
    } else
      break;
  outp:
    /* 2. Convert the Unicode code point obtained in the previous step to the target charset encoding. */
    if ((cnvres = (*wc_mb)(to_cs, wc, (uchar *)to, to_end)) > 0) /* Succeed */
      to += cnvres;
    else if (cnvres == MY_CS_ILUNI && wc != '?') { /* Fail. Assign wc to '?'(0x3F) and retry the conversion */
      error_count++;
      wc = '?';
      goto outp;
    } else
      break;
  }
  ...
}

In the source code, there are two points to be noted:

• When the original charset is binary, you need to pay special attention to it. It will be copied byte by byte without checking, which may lead to garbled characters, as shown in the following example:

-- 1. Create table t1 with a column that uses the latin1 charset.
mysql> create table t1 (a varchar(5) character set latin1);
-- 2. Insert '张'. Because the latin1 charset does not contain '张', an error is reported directly and the insertion fails.
mysql> insert into t1 values('张');
ERROR 1366 (HY000): Incorrect string value: '\xE5\xBC\xA0' for column 'a' at row 1
-- 3. Use binary literal string insertion, and it is successful, even if the latin1 charset has no character encoding 0xE5BCA0 at all.
mysql> insert into t1 values(X'E5BCA0'); -- 0xE5BCA0 is the encoding of '张' in the utf8mb4 charset
Query OK, 1 row affected (0.00 sec)
-- 4. Query the results and return the garbled characters. The stored content is what is passed.
mysql> select a, hex(a) from t1;
+--------+--------+
| a      | hex(a) |
+--------+--------+
| å¼     | E5BCA0 |
+--------+--------+

• If the original charset is not fully compatible with Unicode or the new charset, the incompatible characters are converted into question marks '?', as shown in the following example:

-- The latin1 charset does not contain the character '张', and it is converted to '?'. The hexadecimal value is 0x3F.
mysql> select hex(cast(_utf8mb4'张' as char character set latin1));
+-------------------------------------------------------+
| hex(cast(_utf8mb4'张' as char character set latin1))  |
+-------------------------------------------------------+
| 3F                                                    |
+-------------------------------------------------------+

Sorting Algorithm for Unicode Strings

For Example

Before really introducing the principle of the Unicode string sorting algorithm, let's have a preliminary concept of string sorting through an example in MySQL. First, create a table with only one VARCHAR column. The charset used for this column is utf8mb4. Insert five strings into the table: "rôle", "Role", "role", "roles", and "rule". The following are the sorting results using utf8mb4_0900_as_cs and utf8mb4_bin collations respectively:

mysql> select * from role_table order by str collate utf8mb4_0900_as_cs;
+-------+
| str   |
+-------+
| role  |
| Role  |
| rôle  |
| roles |
| rule  |
+-------+
mysql> select * from role_table order by str collate utf8mb4_bin;
+-------+
| str   |
+-------+
| Role  |
| role  |
| roles |
| rule  |
| rôle  |
+-------+

It can be seen that the difference in collations affects the sorting result. First, we will explain the result of utf8mb4_bin. In fact, the collation sorting rule is very simple, that is, the characters are converted into the original Unicode code points, and then compared from front to back according to the code point array, so the uppercase letter R is ranked before the lowercase letter r.

utf8mb4_0900_as_cs is both tone-sensitive and case-sensitive. The collation is implemented based on UCA (Unicode Collation Algorithm) 9.0.0. This section takes this version of UCA as an example for the introduction. Most cases of comparing two strings are divided into three levels: original characters, tone, and case. The priorities of the three levels decrease in turn. Only when the previous levels are completely equal will the next level be compared. Therefore:

• When comparing the first level, "role" = "Role" = "rôle" < "roles" < "rule"

• When comparing the second level, "role" = "Role" < "rôle"

• When comparing the third level, "role" < "Role"

So far, after three levels of comparison, the size relationship of the five strings is completely determined. The collations used in this example have the suffixes of_as and_cs. For the collations with the suffixes of_ai and_ci, the comparison of the corresponding levels will be canceled. For example, utf8mb4_0900_ai_ci thinks "role" = "Role" = "rôle" < "roles" < "rule", and utf8mb4_0900_as_ci thinks "role" = "Role" < "rôle" < "roles" < "rule".

Weight Table

From the introduction of the above example, it can be seen that when UCA 9.0.0 compares strings, the keys of each character at three levels will be obtained. Obviously, the Unicode code point of the character can no longer meet the requirements, so a weight table is needed to record the weight of each character at three levels. Each version of UCA will have a corresponding DUCET (Default Unicode Collation Element Table) to record this information, and the DUCET of UCA 9.0.0 can be seen at https://www.unicode.org/Public/UCA/9.0.0/allkeys.txt. Customized collations for different languages are based on DUCET and some specific weight configurations for this language.

A few examples of table entries are given below:

0061  ; [.1C47.0020.0002] # LATIN SMALL LETTER A
0041  ; [.1C47.0020.0008] # LATIN CAPITAL LETTER A
1D434 ; [.1C47.0020.000B] # MATHEMATICAL ITALIC CAPITAL A
00E1  ; [.1C47.0020.0002][.0000.0024.0002] # LATIN SMALL LETTER A WITH ACUTE
249C  ; [*0317.0020.0004][.1C47.0020.0004][*0318.0020.0004] # PARENTHESIZED LATIN SMALL LETTER A

The five characters in the above example are a, A, 𝐴, á, and ⒜. The first column is the Unicode code point of the character. The array of three numbers enclosed by each square bracket ([]) is called a collation element. A character may correspond to one or more collation elements. The number sign (#) is an annotation character, followed by the introduction information of the character.

The three hexadecimal numbers in the collation element represent the weights of the original character, tone, and case respectively from front to back. The numbers beginning with the asterisk (*) generally correspond to punctuation marks and will be ignored in some comparison methods. However, MySQL does not specially handle punctuation marks, so the parentheses before and after ⒜ also participate in sorting.

The following is a brief explanation of the origin of multiple collation elements corresponding to á and ⒜. UCA defines the rule of decomposing a complex character (see the link). Therefore, as shown in the following table, the collation element list of simple characters decomposed are reorganized to form the final collation element in the above example (not necessarily exactly equal, but may be accompanied by some changes in tone and case weight):

Source character After decomposition Collation Elements
á 'a'、' ́' [.1C47.0020.0002]、[.0000.0024.0002]
'('、'a'、')' [*0317.0020.0002]、[.1C47.0020.0002]、[*0318.0020.0002]

Algorithm Process

With the weight table, the next is how to use the weight table, especially when some characters correspond to more than one collation element. UCA calculates the sort key of the entire string for sorting. The following is a string example 'aáA'. The three characters and their collation elements are as follows:

Character Collation Element
a [.1C47.0020.0002]
á .1C47.0020.0002
A [.1C47.0020.0008]

1.  Obtain the collation elements of all the characters in the string and form a collation element list, as shown in the following example.

aáA .1C47.0020.0002.0000.0024.0002

2.  All the first-level weights (skip items with 0) are taken out from front to back and spliced together, as shown in the following example.

• SortKey1=1C471C471C47

3.  If it is tone-sensitive (_as), all the second-level weights (skip items with 0) are taken out from front to back and spliced together, as shown in the following example.

• SortKey2=0020002000240020

4.  If it is case-sensitive (_cs), all the third-level weights (skip items with 0) are taken out from front to back and spliced together, as shown in the following example.

• SortKey3=0002000200020008

5.  Finally, choose whether to splice SortKey2 and SortKey3 according to whether they are tone- and case-sensitive. Assuming that they are both sensitive, splice the final SortKey as follows, and the splice character is 0000. This example is as follows.

• SortKey=SortKey1+0000+SortKey2+0000+SortKey3=1C471C471C470000002000200024002000000002000200020008

After that, the comparison of strings is based on the sort key, which is essentially a 16-bit integer array. The higher the value in the array is, the higher the priority is. So far, a three-level comparison architecture has been implemented. In MySQL, you can use the WEIGHT_STRING function to obtain the sort key of a string, as shown in the following example:

--  Tone- and case-sensitive
mysql> select hex(weight_string(_utf8mb4'aáA' collate utf8mb4_0900_as_cs));
+---------------------------------------------------------------+
| hex(weight_string(_utf8mb4'aáA' collate utf8mb4_0900_as_cs))  |
+---------------------------------------------------------------+
| 1C471C471C470000002000200024002000000002000200020008          |
+---------------------------------------------------------------+
-- Tone-sensitive but case-insensitive
mysql> select hex(weight_string(_utf8mb4'aáA' collate utf8mb4_0900_as_ci));
+---------------------------------------------------------------+
| hex(weight_string(_utf8mb4'aáA' collate utf8mb4_0900_as_ci))  |
+---------------------------------------------------------------+
| 1C471C471C4700000020002000240020                              |
+---------------------------------------------------------------+
-- Tone- and case-insensitive
mysql> select hex(weight_string(_utf8mb4'aáA' collate utf8mb4_0900_ai_ci));
+---------------------------------------------------------------+
| hex(weight_string(_utf8mb4'aáA' collate utf8mb4_0900_ai_ci))  |
+---------------------------------------------------------------+
| 1C471C471C47                                                  |
+---------------------------------------------------------------+

Therefore, when we set collation to tone- and case-insensitive, there will be many unexpected situations. At this time, "aaaa" and "aáA𝐴" are considered equal, as follows:

mysql> select _utf8mb4'aáA𝐴' collate utf8mb4_0900_ai_ci = 'aaaa';
+-----------------------------------------------------+
| _utf8mb4'aáA?' collate utf8mb4_0900_ai_ci = 'aaaa'  |
+-----------------------------------------------------+
|                                                   1 |
+-----------------------------------------------------+

Characters Not Included in the Weight Table

Although the weight table provided by UCA contains the mapping of many characters to their collation elements, not all characters have an entry in the table. For these characters, UCA generates their collation elements according to certain rules.

UCA 9.0.0

If a character needs to generate a weight, it corresponds to two collation elements, such as .AAAA.0020.0002, where AAAA and BBBB are calculated according to the Unicode code point of the character. MySQL implementation based on the UCA 9.0.0 rule is as follows:

static void set_implicit_weights(MY_UCA_ITEM *item, int code) {
  int base, aaaa, bbbb;
  if (code >= 0x17000 && code <= 0x18AFF)  // Tangut character
  {
    aaaa = 0xFB00;
    bbbb = (code - 0x17000) | 0x8000;
  } else {
    /* non-Core Han Unified Ideographs */
    if ((code >= 0x3400 && code <= 0x4DB5) ||
        (code >= 0x20000 && code <= 0x2A6D6) ||
        (code >= 0x2A700 && code <= 0x2B734) ||
        (code >= 0x2B740 && code <= 0x2B81D) ||
        (code >= 0x2B820 && code <= 0x2CEA1))
      base = 0xFB80;
    /* Core Han Unified Ideographs */
    else if ((code >= 0x4E00 && code <= 0x9FD5) ||
             (code >= 0xFA0E && code <= 0xFA29))
      base = 0xFB40;
    /* All other characters whose weight is unassigned */
    else
      base = 0xFBC0;
    aaaa = base + (code >> 15);
    bbbb = (code & 0x7FFF) | 0x8000;
  }
  item->weight[0] = aaaa;
  item->weight[1] = 0x0020;
  item->weight[2] = 0x0002;
  item->weight[3] = bbbb;
  item->weight[4] = 0x0000;
  item->weight[5] = 0x0000;
  item->num_of_ce = 2;
}

The collation elements of the Chinese, Japanese, and Korean characters are basically generated in this way. For example, if the Unicode code point of '张' is 0x5F20, and the corresponding code base above is FB40, then aaaa can be calculated as 0xFB40 and bbbb can be calculated as 0xDF20, so the collation element of the character '张' is .FB40.0020.0002. As a string, it has the following sort key which is consistent with our calculation:

mysql> select hex(weight_string(_utf8mb4'张' collate utf8mb4_0900_as_cs));
+--------------------------------------------------------------+
| hex(weight_string(_utf8mb4'张' collate utf8mb4_0900_as_cs))  |
+--------------------------------------------------------------+
| FB40DF200000002000000002                                     |
+--------------------------------------------------------------+

UCA 4.0.0

For utf8mb4 non-0900 and 520 collations, the UCA 4.0.0 they use only supports generating weights for BMP characters (U+0000 to U+FFFF) in this version of UCA. The weights of all non-BMP characters (U+10000 to U+10FFFF) are unified to 0xFFFD, so all non-BMP characters are equal in the view of utf8mb4_unicode_ci, as follows:

--  The Unicode code point of '𠀫' is U+2002B, and its weight under utf8mb4_unicode_ci is 0xFFFD.
mysql> select hex(weight_string(_utf8mb4'𠀫' collate utf8mb4_unicode_ci));
+------------------------------------------------------------+
| hex(weight_string(_utf8mb4'?' collate utf8mb4_unicode_ci)) |
+------------------------------------------------------------+
| FFFD                                                       |
+------------------------------------------------------------+
-- In the view of utf8mb4_unicode_ci based on UCA 4.0.0, '𠀫' and '𠀪' are equal.
mysql> select _utf8mb4'𠀫' collate utf8mb4_unicode_ci = '𠀪';
+----------------------------------------------+
| _utf8mb4'?' collate utf8mb4_unicode_ci = '?' |
+----------------------------------------------+
|                                            1 |
+----------------------------------------------+
-- In the view of utf8mb4_0900_ai_ci based on UCA 9.0.0, '𠀫' and '𠀪' are unequal.
mysql> select _utf8mb4'𠀫' collate utf8mb4_0900_ai_ci = '𠀪';
+----------------------------------------------+
| _utf8mb4'?' collate utf8mb4_0900_ai_ci = '?' |
+----------------------------------------------+
|                                            0 |
+----------------------------------------------+

For BMP characters that do not explicitly specify weights, the rules for generating weights are similar to that of UCA 9.0.0, except that non-BMP characters are not processed, as follows:

if (code >= 0x3400 && code <= 0x4DB5)
  base= 0xFB80; /* CJK Ideograph Extension */
else if (code >= 0x4E00 && code <= 0x9FA5)
  base= 0xFB40; /* CJK Ideograph */
else
  base= 0xFBC0; /* All other characters */
aaaa= base +  (code >> 15);
bbbb= (code & 0x7FFF) | 0x8000;

_general and _unicode in MySQL

For the Unicode charset in MySQL, there are generally a _general collation and a _unicode collation. For the utf8mb4 charset, the two collations are utf8mb4_general_ci and utf8mb4_unicode_ci. These two collations are based on UCA 4.0.0, but utf8mb4_general_ci can be regarded as a simplification of utf8mb4_unicode_ci. utf8mb4_general_ci does not support the mapping of one character to multiple collation elements, so some rules specified in DUCET do not take effect on utf8mb4_general_ci. The following is an example.

Character Collation Element
s [.0FEA.0020.0008.0053]
ß .0FEA.0020.0004.00DF[.0FEA.0020.001F.00DF]

In the weight table of UCA 4.0.0, each collation element has four values, the fourth of which is the Unicode code point. The meaning and usage of the first three values are the same as those of UCA 9.0.0. The following is the difference between utf8mb4_general_ci and utf8mb4_unicode_ci:

-- 1. The sort key of utf8mb4_general_ci is the Unicode code point after the character is restored, and 'ß' is restored to 's'.
mysql> select hex(weight_string(_utf8mb4'ß' collate utf8mb4_general_ci));
+-------------------------------------------------------------+
| hex(weight_string(_utf8mb4'ß' collate utf8mb4_general_ci))  |
+-------------------------------------------------------------+
| 0053                                                        |
+-------------------------------------------------------------+
mysql> select hex(weight_string(_utf8mb4's' collate utf8mb4_general_ci));
+------------------------------------------------------------+
| hex(weight_string(_utf8mb4's' collate utf8mb4_general_ci)) |
+------------------------------------------------------------+
| 0053                                                       |
+------------------------------------------------------------+
-- 2. In the view of utf8mb4_general_ci, 'ß' is equal to 's'.
mysql> select _utf8mb4'ß' collate utf8mb4_general_ci = 's';
+-----------------------------------------------+
| _utf8mb4'ß' collate utf8mb4_general_ci = 's'  |
+-----------------------------------------------+
|                                             1 |
+-----------------------------------------------+
-- 3. The sort key of utf8mb4_unicode_ci is similar to that of UCA 9.0.0, except that the mapping table is different.
--    Allow one character to be mapped to multiple collation elements.
mysql> select hex(weight_string(_utf8mb4'ß' collate utf8mb4_unicode_ci));
+-------------------------------------------------------------+
| hex(weight_string(_utf8mb4'ß' collate utf8mb4_unicode_ci))  |
+-------------------------------------------------------------+
| 0FEA0FEA                                                    |
+-------------------------------------------------------------+
mysql> select hex(weight_string(_utf8mb4's' collate utf8mb4_unicode_ci));
+------------------------------------------------------------+
| hex(weight_string(_utf8mb4's' collate utf8mb4_unicode_ci)) |
+------------------------------------------------------------+
| 0FEA                                                       |
+------------------------------------------------------------+
-- 4. In the view of utf8mb4_unicode_ci, 'ß' is equal to 'ss', not to 's'.
mysql> select _utf8mb4'ß' collate utf8mb4_unicode_ci = 'ss';
+------------------------------------------------+
| _utf8mb4'ß' collate utf8mb4_unicode_ci = 'ss'  |
+------------------------------------------------+
|                                              1 |
+------------------------------------------------+

When the CPU was not very powerful, utf8mb4_general_ci had better performance than utf8mb4_unicode_ci due to its simplification (accompanied by loss of accuracy). However, with the development of the CPU, the performance improvement here is no longer obvious. Therefore, it is not recommended to continue to use utf8mb4_general_ci. As for utf8mb4_unicode_ci, it is based on the UCA 4.0.0 which lags behind the UCA 9.0.0 what the utf8mb4_0900_xx is based on. Therefore, we recommend that you use utf8mb4_0900_xx Collation. You need to consider the suffix of _ai_ci, _as_ci, and _as_cs based on your business.

Particularity of LIKE

In the above example, the regular comparisons =, <, and > are described, and the calculated sort key can be used directly. Therefore, there will be the case of 'ß'='ss' on utf8mb4_unicode_ci. One character can also be equal to two characters. However, the LIKE operator does not allow this case. LIKE only allows one-to-one character matching, as shown in the following example:

-- On utf8mb4_unicode_ci, 'ß' is neither LIKE 'ss' nor LIKE 's'.
mysql> select _utf8mb4'ß' collate utf8mb4_unicode_ci like 'ss';
+---------------------------------------------------+
| _utf8mb4'ß' collate utf8mb4_unicode_ci like 'ss'  |
+---------------------------------------------------+
|                                                 0 |
+---------------------------------------------------+
mysql> select _utf8mb4'ß' collate utf8mb4_unicode_ci like 's';
+--------------------------------------------------+
| _utf8mb4'ß' collate utf8mb4_unicode_ci like 's'  |
+--------------------------------------------------+
|                                                0 |
+--------------------------------------------------+
-- On utf8mb4_general_ci, 'ß' is LIKE 's', similar to the "=" comparator because they are one-to-one equal at the character level.
mysql> select _utf8mb4'ß' collate utf8mb4_general_ci like 's';
+--------------------------------------------------+
| _utf8mb4'ß' collate utf8mb4_general_ci like 's'  |
+--------------------------------------------------+
|                                                1 |
+--------------------------------------------------+

Binary Charset and _bin Collation

This section introduces the most straightforward binary comparison methods: binary collations for binary character sets and _bin collations for non-binary character sets, each with some distinctions.

The BINARY, VARBINARY, and BLOB data types use binary charsets and binary collations. A binary string is a sequence of bytes whose values determine the sort order.

The CHAR, VARCHAR, and TEXT data types use non-binary charsets in most cases. For most non-binary charsets, a binary collation comparison ending with _bin is supported. For example, the binary collations of latin1 and big5 are named latin1_bin and big5_bin, respectively. An exception is utf8mb4 which has two binary collations, utf8mb4_bin and utf8mb4_0900_bin.

Basic Comparison Unit

The basic comparison unit of binary charsets is byte, while the basic comparison unit of non-binary charsets is character. The difference between the byte and the character is that the character may be composed of multiple bytes. When binary collations of binary charsets are compared, their values are compared byte by byte. When _bin Collations of non-binary charsets are compared, their values are compared character by character. For Unicode charsets, the following rules are applied:

• _bin Collations except utf8mb4_0900_bin converts characters into Unicode code points for comparison and may add leading 0.

• utf8mb4_0900_bin will use the original binary of the character for comparison, because the UTF-8 encoding order is the same as the Unicode code point order, so you can get the same result faster.

The following three collations are used as examples: binary, utf8mb4_bin, and utf8mb4_0900_bin.

-- Use the following two characters as example
mysql> select _utf8mb4 X'64';
+----------------+
| _utf8mb4 X'64' |
+----------------+
| d              |
+----------------+
mysql> select _utf8mb4 X'e18080';
+--------------------+
| _utf8mb4 X'e18080' |
+--------------------+
| က                 |
+--------------------+
-- 1. binary
mysql> select WEIGHT_STRING(_binary X'64');
+------------------------------------------------------------+
| WEIGHT_STRING(_binary X'64')                               |
+------------------------------------------------------------+
| 0x64                                                       |
+------------------------------------------------------------+
mysql> select WEIGHT_STRING(_binary X'e18080');
+--------------------------------------------------------------------+
| WEIGHT_STRING(_binary X'e18080')                                   |
+--------------------------------------------------------------------+
| 0xE18080                                                           |
+--------------------------------------------------------------------+
-- 2. utf8mb4_bin
mysql> select WEIGHT_STRING(_utf8mb4 X'64' collate utf8mb4_bin);
+------------------------------------------------------------------------------------------------------+
| WEIGHT_STRING(_utf8mb4 X'64' collate utf8mb4_bin)                                                    |
+------------------------------------------------------------------------------------------------------+
| 0x000064                                                                                             |
+------------------------------------------------------------------------------------------------------+
mysql> select WEIGHT_STRING(_utf8mb4 X'e18080' collate utf8mb4_bin);
+--------------------------------------------------------------------------------------------------------------+
| WEIGHT_STRING(_utf8mb4 X'e18080' collate utf8mb4_bin)                                                        |
+--------------------------------------------------------------------------------------------------------------+
| 0x001000                                                                                                     |
+--------------------------------------------------------------------------------------------------------------+
-- 3. utf8mb4_0900_bin
mysql> select WEIGHT_STRING(_utf8mb4 X'64' collate utf8mb4_0900_bin);
+----------------------------------------------------------------------------------------------------------------+
| WEIGHT_STRING(_utf8mb4 X'64' collate utf8mb4_0900_bin)                                                         |
+----------------------------------------------------------------------------------------------------------------+
| 0x64                                                                                                           |
+----------------------------------------------------------------------------------------------------------------+
mysql> select WEIGHT_STRING(_utf8mb4 X'e18080' collate utf8mb4_0900_bin);
+------------------------------------------------------------------------------------------------------------------------+
| WEIGHT_STRING(_utf8mb4 X'e18080' collate utf8mb4_0900_bin)                                                             |
+------------------------------------------------------------------------------------------------------------------------+
| 0xE18080                                                                                                               |
+------------------------------------------------------------------------------------------------------------------------+

It can be seen that:

● A binary charset does not have the concept of characters. In its view, all strings are byte sequences. The byte sequence it sees is the weight key used by binary collations to compare and sort the strings.

● utf8mb4 charset has the concept of characters, but the performance of the two collations utf8mb4_bin and utf8mb4_0900_bin are somewhat different.

• utf8mb4_bin will convert all characters back to Unicode code points and complete them to 3 bytes (because Unicode characters can be completely represented by 3 bytes). In the example, the one-byte character d is converted to 0x000064, and the three-byte character က is converted to 0x001000, so it can be seen that _bin Collation makes a comparison with characters as the basic unit.

• utf8mb4_0900_bin cannot clearly reflect that characters are taken as the basic unit. Its performance is similar to binary collations. The utf8mb4 original code is directly used for comparison. Since the code design itself has already met the requirement that the first-byte value of multi-byte character binary is larger, it is possible to achieve the same goal by direct comparison without filling the byte here.

Case Conversion

Binary strings do not have the concept of character case, while non-binary strings can use case conversion functions even if _bin Collation is used, as shown in the following example:

mysql> SET NAMES utf8mb4 COLLATE utf8mb4_bin;
mysql> SELECT LOWER('aA'), UPPER('zZ');
+-------------+-------------+
| LOWER('aA') | UPPER('zZ') |
+-------------+-------------+
| aa          | ZZ          |
+-------------+-------------+
mysql> SET NAMES binary;
mysql> SELECT LOWER('aA'), LOWER(CONVERT('aA' USING utf8mb4));
+-------------+------------------------------------+
| LOWER('aA') | LOWER(CONVERT('aA' USING utf8mb4)) |
+-------------+------------------------------------+
| aA          | aa                                 |
+-------------+------------------------------------+

CHAR, VARCHAR, BINARY, and VARBINARY

These four data types are highly correlated with charsets and collations, so here are some key points.

• In CHAR(N) and VARCHAR(N), N represents the number of characters. CHAR(N) reserves bytes with the number of the product of N and the maximum number of characters in the charset. In BINARY(N) and VARBINARY(N), N represents the number of bytes. BINARY(N) reserves bytes with the number of N. For example, suppose a column type is CHAR(5), and the utf8mb4 character set is used. Since the number of bytes of the longest character is 4, each CHAR(5) will reserve 5 × 4 = 20 bytes. However, BINARY(5) will only reserve 5 bytes.

• When a column is defined, CHAR BINARY and VARCHAR BINARY do not mean that the column is of the BINARY type. Instead, they mean that the _bin Collation of the charset is used. For example, when the default charset is utf8mb4, CHAR(5) BINARY is an abbreviation for CHAR(5) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin.

• If CHAR and BINARY insert data that is not long enough, padding is performed. The value of the CHAR type padding is 0x20 (space character), and the value of the BINARY type padding is 0x00. When reading CHAR type columns, MySQL will automatically remove all padding 0x20, so in the view of MySQL, after inserting "a" and "a " into CHAR columns, the reading results are both "a", and the original space is lost. When reading BINARY type columns, the complete N bytes (including the padding 0x00) are returned, as shown in the following example:

mysql> CREATE TABLE t (c BINARY(3));
mysql> INSERT INTO t SET c = 'a';
-- The string 'a' is supplemented with two 0x00. When judging, the data in this column is not equal to 'a', but to 'a\0\0'.
mysql> SELECT HEX(c), c = 'a', c = 'a\0\0' from t;
+--------+---------+-------------+
| HEX(c) | c = 'a' | c = 'a\0\0' |
+--------+---------+-------------+
| 610000 |       0 |           1 |
+--------+---------+-------------+

• VARCHAR and VARBINARY do not have the above-mentioned padding step, so VARCHAR is still read as "a " after it stores "a ". The following is an example:

mysql> CREATE TABLE vc (v VARCHAR(4), c CHAR(4));
mysql> INSERT INTO vc VALUES ('ab  ', 'ab  ');
-- The trailing space character of VARCHAR column data can be returned, but the trailing space character of CHAR column data cannot be returned.
mysql> SELECT CONCAT('(', v, ')'), CONCAT('(', c, ')') FROM vc;
+---------------------+---------------------+
| CONCAT('(', v, ')') | CONCAT('(', c, ')') |
+---------------------+---------------------+
| (ab  )              | (ab)                |
+---------------------+---------------------+

Summary

MySQL supports many charsets and collations and allows users to perform fine-grained configurations at multiple levels of connections, servers, databases, tables, columns, and literals, which may sometimes dazzle users. This article provides a detailed description of the concepts, syntax, system variables, and scope of influence, as well as a list of possible charset conversions for strings and the rules to follow when comparing strings from different charsets.

For the most commonly used Unicode-based charsets, this article introduces the relationship between the Unicode standard and each charset in MySQL, especially the default charset utf8mb4 in the current version (8.0.34). To understand how MySQL compares strings, this article also introduces the multi-level sorting algorithm of strings defined by the Unicode Standard and illustrates some examples that may be confusing to Chinese users.

Finally, this article introduces the special binary charsets and clarifies the similarities and differences between them and the _bin Collation of other non-binary charsets.

After reading this article, you might have a clear understanding of MySQL charsets and collations. The following summarizes some key points.

Garbled Characters

• The charset actually used by the client is inconsistent with the character_set_client and character_set_results variables registered on the server, and "what the client uses" and "what the server thinks the client uses" are inconsistent, resulting in subsequent error handling, as shown in the example at the beginning of the article.

-- 1. Create a table named t1 with two columns. One column uses the utf8mb4 charset and the other column uses the latin1 charset.
mysql> create table t1 (a char(5) character set utf8mb4, b char(5) character set latin1);
-- 2. The client actually uses the utf8mb4 charset, but the latin1 charset is registered on the server.
mysql> set names latin1;
-- 2.a Column a uses the utf8mb4 charset. MySQL thinks character_set_client and
-- the character_set_connection are all latin1. Therefore, there is one more conversion from the latin1 charset to the utf8mb4 charset.
-- Even if the client originally passes the utf8mb4 format encoding, after this conversion,
-- results are unpredictable.
-- 2.b Even if the latin1 charset of column b does not have the character '张', it can be successfully inserted because MySQL thinks that
-- character_set_client, character_set_connection, and column b all have the same charset,
-- that is, latin1. It can be inserted directly without checking or conversion.
mysql> insert into t1 values('张','张').
-- 3. Query data on a normal client.
mysql> set names utf8mb4;
-- The content of column a (utf8mb4 charset) is completely disordered, and the content of column b (latin1 charset) is utf8mb4 encoding of '张'.
mysql> select a,hex(a),b,hex(b) from t1;
+--------+--------------+--------+--------+
| a      | hex(a)       | b      | hex(b) |
+--------+--------------+--------+--------+
| å¼     | C3A5C2BCC2A0 | å¼     | E5BCA0 |
+--------+--------------+--------+--------+

• Even if the character_set_client configuration is correct, garbled data may be inserted without checking and conversion if the introducer is not used correctly, as shown in the following example:

-- 1. Create table t1 with one column that uses the latin1 charset.
mysql> create table t1(a char(5) character set latin1);
-- 2. The direct inserting of '张' fails because latin1 does not have this character, but after adding the_latin1 prefix, it is successful. 
-- This is because the introducer allows the MySQL parser to mistakenly assume that the utf8mb4 string is the latin1 charset. 
-- No check or conversion is required when inserting the corresponding column later.
mysql> insert into t1(a) values('张');
ERROR 1366 (HY000): Incorrect string value: '\xE5\xBC\xA0' for column 'a' at row 1
mysql> insert into t1(a) values(_latin1'张');
Query OK, 1 row affected (0.00 sec)
-- 3. The code 0xE5BCA0 of '张' on utf8mb4 is selected and stored, resulting in garbled characters.
mysql> select a, hex(a) from t1;
+--------+--------+
| a      | hex(a) |
+--------+--------+
| å¼     | E5BCA0 |
+--------+--------+

• The configuration is completely correct, but the binary string is used to insert data. When the binary string is converted to a non-binary string, it is copied byte by byte, and there is no validity check. The insertion of garbled data is allowed, as shown in the following example:

-- 1. Create table t1 with a column that uses the latin1 charset.
mysql> create table t1 (a varchar(5) character set latin1);
-- 2. Insert '张'. Because the latin1 charset does not contain '张', an error is reported directly and the insertion fails.
mysql> insert into t1 values('张');
ERROR 1366 (HY000): Incorrect string value: '\xE5\xBC\xA0' for column 'a' at row 1
-- 3. Use binary literal string insertion, and it is successful, even if the latin1 charset has no character encoding 0xE5BCA0 at all.
mysql> insert into t1 values(X'E5BCA0'); -- 0xE5BCA0 is the encoding of '张' in the utf8mb4 charset
Query OK, 1 row affected (0.00 sec)
-- 4. Query the results and return the garbled characters. The stored content is what is passed.
mysql> select a, hex(a) from t1;
+--------+--------+
| a      | hex(a) |
+--------+--------+
| å¼     | E5BCA0 |
+--------+--------+

Notes When Upgrading the Version

The default collation of the utf8mb4 charset in MySQL 5.7 is utf8mb4_general_ci, and the default collation of the utf8mb4 charset in MySQL 8.0 is utf8mb4_0900_ai_ci. The two types of collations have the following differences:

• utf8mb4_general_ci does not support advanced comparison such as combining (treating two or more characters as one character), but only character-by-character comparison, as shown in the following example:

-- 1. in the view of utf8mb4_general_ci, 'ß' is equal to 's'.
mysql> select _utf8mb4'ß' collate utf8mb4_general_ci = 's';
+-----------------------------------------------+
| _utf8mb4'ß' collate utf8mb4_general_ci = 's'  |
+-----------------------------------------------+
|                                             1 |
+-----------------------------------------------+
-- 2. In the view of utf8mb4_0900_ai_ci, 'ß' is equal to 'ss', not to 's', and it supports more complex combination comparisons.
mysql> select _utf8mb4'ß' collate utf8mb4_0900_ai_ci = 'ss';
+------------------------------------------------+
| _utf8mb4'ß' collate utf8mb4_0900_ai_ci = 'ss'  |
+------------------------------------------------+
|                                              1 |
+------------------------------------------------+

• utf8mb4_general_ci has the PAD SPACE attribute and ignores trailing spaces during comparison. utf8mb4_0900_ai_ci has the NO PAD attribute and does not ignore trailing spaces, as shown in the following example:

-- 1. In the view of utf8mb4_general_ci with the 'PAD SPACE' attribute, 'a ' = 'a'.
mysql> SELECT 'a ' = 'a' COLLATE utf8mb4_general_ci;
+---------------------------------------+
| 'a ' = 'a' COLLATE utf8mb4_general_ci |
+---------------------------------------+
|                                     1 |
+---------------------------------------+
-- 2. In the view of utf8mb4_0900_ai_ci with 'NO PAD' attribute, 'a ' != 'a'.
mysql> SELECT 'a ' = 'a' COLLATE utf8mb4_0900_ai_ci;
+---------------------------------------+
| 'a ' = 'a' COLLATE utf8mb4_0900_ai_ci |
+---------------------------------------+
|                                     0 |
+---------------------------------------+

• utf8mb4_general_ci is based on UCA 4.0.0, and utf8mb4_0900_ai_ci is based on UCA 9.0.0. UCA 9.0.0 has more character weight definitions. For non-BMP characters, utf8mb4_0900_ai_ci also calculates different weights for them. utf8mb4_general_ci thinks all of these characters to be equal as shown in the following example:

-- In the view of utf8mb4_general_ci based on UCA 4.0.0, '𠀫' and '𠀪' are equal.
mysql> select _utf8mb4'𠀫' collate utf8mb4_general_ci = '𠀪';
+----------------------------------------------+
| _utf8mb4'?' collate utf8mb4_general_ci = '?' |
+----------------------------------------------+
|                                            1 |
+----------------------------------------------+
-- In the view of utf8mb4_0900_ai_ci based on UCA 9.0.0, '𠀫' and '𠀪' are unequal.
mysql> select _utf8mb4'𠀫' collate utf8mb4_0900_ai_ci = '𠀪';
+----------------------------------------------+
| _utf8mb4'?' collate utf8mb4_0900_ai_ci = '?' |
+----------------------------------------------+
|                                            0 |
+----------------------------------------------+

Usage Suggestions

• In the testing phase of the application, you should test the coding used by the client and whether the character_set_client and character_set_results variables are correctly set after the connection with the server is established. Otherwise, there may be garbled characters.

• In general, the character_set_client, character_set_results, and character_set_connection can be centralized, and the SET NAMES command can be used to complete the setting at once.

• If the string literal is preceded by an introducer, the parser does not convert the string to the charset specified by the character_set_connection. Instead, the parser considers that the string uses the charset specified by the introducer. However, if the client charset is inconsistent with the introducer charset, the introducer does not convert the original string. In this case, the binary code of the string does not match the charset, and garbled characters are likely to occur.

• Metadata such as table names and column names in MySQL uses the utf8mb3 charset. Only Unicode BMP characters are supported. Do not use supplementary plane characters in this metadata information.

• When charset A is converted to charset B, incompatible characters are converted to '?' by MySQL, which may lead to unique key conflicts.

• In most cases, the utf8mb4 charset and its default utf8mb4_0900_ai_ci Collation can meet the needs. However, for Chinese users, if you do not want to see English letters equal to many special characters, you can additionally use utf8mb4_0900_as_ci or utf8mb4_0900_as_cs to complete filtering at different levels.

• If the target is exactly the same byte sequence, you can use utf8mb4_0900_bin Collation, which can directly use bytes for comparison and has better performance. At the same time, you can use supplementary functions such as UPPER() and LOWER() on non-binary charsets such as utf8mb4.

• If you are not sure about the sort order of two strings in a collation, you can use the WEIGHT_STRING function to check the sort key of the string to determine whether the collation is performing as expected.

• When multiple operators compare strings, if the input strings do not come from the same charset or use different collations, the final collation will be selected according to some rules, and charset conversion will occur. If you are not familiar with these rules, it is better to add the COLLATE clause to specify the collation you want to use.

• Strings in binary charsets can be converted byte-by-byte into strings in non-binary charsets without conversion, which may also lead to garbled characters. Therefore, when using INSERT statements, be careful to use the b and X prefixes to insert the literal of binary strings into columns in non-binary charsets.

References

https://dev.mysql.com/doc/refman/8.0/en/charset.html

https://www.unicode.org/reports/tr10/tr10-34.html

https://en.wikipedia.org/wiki/Unicode

https://en.wikipedia.org/wiki/UTF-8

About the Author

Xizhe Zhang is a R&D personnel for RDS for MySQL kernel, primarily focusing on performance optimization and RDS O&M related research and development tasks.

0 1 0
Share on

ApsaraDB

396 posts | 77 followers

You may also like

Comments