This topic provides an overview of the dynamic data masking feature provided by PolarProxy.
Prerequisites
The version of PolarProxy must be V2.4.12 or later. For more information about how to view the current version of PolarProxy and upgrade it, see Minor version update.
Data masking solutions
If you want to authorize third parties to generate reports, analyze data, perform development and test activities, or perform other database-related operations, you may need to obtain the latest customer data from databases in the production environment in real time. To avoid disclosing personal information, data must be masked before it is provided to third parties. Alibaba Cloud provides the following data masking solutions: dynamic data masking and static data masking. PolarProxy uses dynamic data masking.
Data masking solution | Description | Advantage | Limits |
Dynamic data masking | When your application initiates a data query request, PolarDB masks the sensitive data that is queried before PolarDB returns the data to the application. To achieve this, you need to specify the database account, the database name, and the table or column that requires data masking before the data is queried. | | Compared with mirror databases, the query performance of production databases is a bit lower because PolarProxy masks the sensitive data in the production databases in real time. |
Static data masking | PolarProxy exports all data in a production database to a mirror database, and encrypts or masks the sensitive data during the export. | Your application queries data from mirror databases instead of production databases. As a result, data masking does not affect the services that require access to production databases. | |
How it works
After you configure data masking rules in the PolarDB console, the console writes these rules to PolarProxy. When your application connects to a database by using the account specified in the data masking rules and queries the specified columns, PolarProxy masks the data that is queried from the database and returns the masked data to the client.
The preceding figure shows the following data masking rules:
The data masking rules take effect only when you use the testAcc account to query data from a database.
PolarProxy masks only the data that is queried in the name and age columns.
If a column in the query result is masked, all values of the column are masked. If you execute the SELECT * FROM t1 statement and the t1 table contains name and age columns, the values of these two columns in the query result are masked.
If your application uses the testAcc account to connect to a database and queries data in the name, age, and hobby columns of a table, PolarProxy masks data in the name and age columns and returns the masked data together with the unmasked data in the hobby column.
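The rule matching described above can be sketched in Python. This is only an illustration of the behavior, not PolarProxy's implementation: the MaskingRule class, the apply_rule function, the fixed "***" placeholder, and the sample row are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskingRule:
    """Hypothetical rule: one account and the columns to mask for it."""
    account: str
    columns: frozenset


def apply_rule(rule, account, column_names, rows):
    """Mask the rule's columns in every row when the account matches."""
    if account != rule.account:
        # The rule does not apply to this account: return the rows unchanged.
        return [tuple(row) for row in rows]
    masked_idx = {i for i, c in enumerate(column_names) if c in rule.columns}
    # "***" is a placeholder; the real masking method depends on the data type.
    return [
        tuple("***" if i in masked_idx else v for i, v in enumerate(row))
        for row in rows
    ]


rule = MaskingRule(account="testAcc", columns=frozenset({"name", "age"}))
columns = ["name", "age", "hobby"]
rows = [("Alice", 30, "chess")]  # made-up sample data

masked = apply_rule(rule, "testAcc", columns, rows)
unmasked = apply_rule(rule, "otherAcc", columns, rows)
```

With the testAcc account, the name and age values are replaced and hobby is returned as is. With any other account, the row is returned unchanged.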
PolarProxy uses different methods to mask different types of data. The following table describes data masking methods.
Data type | Data masking method | Example |
Integer data types: TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT | PolarProxy returns a random value in the format defined in the data type of the raw data. | |
Decimal data types: DECIMAL, FLOAT, and DOUBLE | PolarProxy returns a random value in the format defined in the data type of the raw data. | |
Date and time data types: DATE, TIME, DATETIME, TIMESTAMP, and YEAR | PolarProxy returns a random value in the format defined in the data type of the raw data. | |
Other data types | PolarProxy replaces the data with asterisks (*). | |
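As a rough illustration of the table above, the following Python sketch picks a masking method based on the type of a value. The value ranges, the date window, and the choice to emit one asterisk per character are assumptions made for the example. This topic does not specify how PolarProxy chooses random values or how many asterisks it returns.

```python
import datetime
import random


def mask_value(value, rng):
    """Return a masked stand-in for value, chosen by its Python type."""
    if isinstance(value, int):
        # Integer types: a random integer (range is an assumption).
        return rng.randint(0, 9999)
    if isinstance(value, float):
        # Decimal types: a random number with two decimal places (assumption).
        return round(rng.uniform(0, 9999), 2)
    if isinstance(value, datetime.date):
        # Date types: a random date in an arbitrary window (assumption).
        start = datetime.date(1970, 1, 1)
        return start + datetime.timedelta(days=rng.randint(0, 20000))
    # Other types: replace the data with asterisks.
    return "*" * len(str(value))


rng = random.Random(0)
masked_row = [
    mask_value(v, rng)
    for v in (42, 3.14, datetime.date(2020, 1, 1), "Alice")
]
```

Each masked value keeps the type of the raw value, so a client that expects an integer or a date in that column still receives one.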
Considerations
The dynamic data masking feature applies only to cluster endpoints. Cluster endpoints consist of the default cluster endpoint and custom cluster endpoints. If you use the primary endpoint to connect to a database and query data from the database, the dynamic data masking feature does not take effect. For more information about how to view a cluster endpoint, see View the endpoint and port number.
If query results contain data that must be masked and the size of a single row exceeds 16 MB, the query session is closed.
For example, you want to query data in the name and description columns of the person table. In this table, the sensitive data in the name column must be masked. The size of the data in a row of the description column exceeds 16 MB. In this case, when you execute the SELECT name, description FROM person statement, the query session is closed.
If a column in which you want to mask the sensitive data is used as the value of an input parameter in a function, data masking does not take effect.
For example, a data masking rule is created to mask the sensitive data in the name column. When you execute the SELECT CONCAT(name, '') FROM person statement, your application can still read the raw values of the name column.
If a column in which you want to mask the sensitive data is used together with the UNION operator, data masking may not take effect.
For example, a data masking rule is created to mask the sensitive data in the name column. When you execute the SELECT hobby FROM person UNION SELECT name FROM person statement, your application can still read the raw values of the name column.
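One way to picture the function and UNION limits is to assume that masking is decided from the column definitions of the result set, by matching the underlying table column of each result column against the rule. This topic does not state that PolarProxy matches columns this way. The sketch below rests on that assumption, and the ColumnDef class and the metadata values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnDef:
    """Hypothetical result-set column definition."""
    name: str      # column name as returned to the client
    org_name: str  # underlying table column; empty for computed columns


MASKED_COLUMNS = {"name"}  # columns named in the data masking rule


def is_masked(col):
    """Assumed matching logic: compare the underlying column with the rule."""
    return col.org_name in MASKED_COLUMNS


# Assumed metadata for the three statements discussed above.
plain = ColumnDef(name="name", org_name="name")          # SELECT name
concat = ColumnDef(name="CONCAT(name, '')", org_name="")  # function result
union = ColumnDef(name="hobby", org_name="")              # UNION result

decisions = [is_masked(c) for c in (plain, concat, union)]
```

Under this assumption, only the plain column is recognized as the name column. The function result and the UNION result carry no reference to the underlying column, so the rule does not match them and the raw values are returned.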
Enable the dynamic data masking feature
For more information, see Manage data masking rules.
Appendix: Impacts on cluster performance
The dynamic data masking feature affects the performance of clusters in the following scenarios.
In this example, the read-only queries per second (QPS) of clusters are used to show the difference in performance.
Account included in the data masking rule | Query hits the data masking rule | Impact on performance |
No | No | Data masking does not take effect on queries made by your account, so the performance of your cluster is not affected. |
Yes | No | PolarProxy analyzes only the column definition data in the result set and does not mask the raw data in the query results. After the dynamic data masking feature is enabled, the read-only QPS decreases by approximately 6%. |
Yes | Yes | PolarProxy analyzes the column definition data in the result set and masks the raw data in the query results. In this case, the performance overhead depends on the size of the result set: the more rows the query returns, the greater the overhead. If the query returns a single row, the performance overhead is approximately 6%. |
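To put the 6% figure in concrete terms, the following Python sketch estimates the read-only QPS after the overhead is applied. The estimated_qps function and the baseline of 10,000 QPS are hypothetical and are not measurements from this topic.

```python
def estimated_qps(baseline_qps, overhead_percent=6):
    """Estimate read-only QPS after a fixed percentage overhead."""
    return baseline_qps * (100 - overhead_percent) / 100


# Hypothetical baseline of 10,000 read-only QPS with dynamic masking disabled.
with_masking = estimated_qps(10000)
```

For this baseline, the estimate is 9,400 QPS. For queries that hit the data masking rule and return many rows, the overhead can be greater than 6%, so treat this estimate as the best case for such queries.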