This topic describes the data de-identification algorithms that are supported by Data Security Center (DSC).
Category | Description | Algorithm | Input | Applicable sensitive data and scenario |
---|---|---|---|---|
Hashing | Raw data cannot be retrieved after it is de-identified by using this type of algorithm. This type of algorithm is applicable to password protection and to scenarios in which sensitive data must be verified by comparison. Common hash algorithms are supported, and you can specify a salt value. | MD5 | Salt value | |
Hashing | | Secure Hash Algorithm 1 (SHA-1) | Salt value | |
Hashing | | SHA-256 | Salt value | |
Hashing | | Hash-based Message Authentication Code (HMAC) | Salt value | |
Redaction by using asterisks (*) or number signs (#) | Raw data cannot be retrieved after it is de-identified by using this type of algorithm. This type of algorithm is applicable to scenarios in which sensitive data is to be shown on a user interface or shared with others. It redacts specified text in sensitive data with asterisks (*) or number signs (#). | Keeps the first N characters and the last M characters | Values of N and M | |
Redaction by using asterisks (*) or number signs (#) | | Keeps characters from the Xth position to the Yth position | Values of X and Y | |
Redaction by using asterisks (*) or number signs (#) | | Redacts the first N characters and the last M characters | Values of N and M | |
Redaction by using asterisks (*) or number signs (#) | | Redacts characters from the Xth position to the Yth position | Values of X and Y | |
Redaction by using asterisks (*) or number signs (#) | | Redacts characters that precede a special character when the special character appears for the first time | At sign (@), ampersand (&), or period (.) | |
Redaction by using asterisks (*) or number signs (#) | | Redacts characters that follow a special character when the special character appears for the first time | At sign (@), ampersand (&), or period (.) | |
Substitution (customization supported) | Raw data can be retrieved after it is de-identified by using some of the algorithms. This type of algorithm can be used to de-identify fields in fixed formats, such as ID card numbers. Mapping-based algorithms substitute the entire value or part of the value of a field with a mapped value from a mapping table. In this case, raw data can be retrieved after it is de-identified. Random algorithms substitute the entire value or part of the value of a field randomly based on a random interval. In this case, raw data cannot be retrieved after it is de-identified. DSC provides multiple built-in mapping tables and allows you to customize substitution algorithms. | Substitutes specific content in ID card numbers with mapped values | Mapping table for substituting the IDs of administrative regions | |
Substitution (customization supported) | | Randomly substitutes specific content in ID card numbers | Code table for randomly substituting the IDs of administrative regions | |
Substitution (customization supported) | | Randomly substitutes specific content in the IDs of military officer cards | Code table for randomly substituting type codes | |
Substitution (customization supported) | | Randomly substitutes specific content in passport numbers | Code table for randomly substituting purpose fields | |
Substitution (customization supported) | | Randomly substitutes specific content in permit numbers of Exit-Entry Permits for Travelling to and from Hong Kong and Macao | Code table for randomly substituting purpose fields | |
Substitution (customization supported) | | Randomly substitutes specific content in bank card numbers | Code table for randomly substituting Bank Identification Numbers (BINs) | |
Substitution (customization supported) | | Randomly substitutes specific content in landline telephone numbers | Code table for randomly substituting the IDs of administrative regions | |
Substitution (customization supported) | | Randomly substitutes specific content in mobile numbers | Code table for randomly substituting mobile network codes | |
Substitution (customization supported) | | Randomly substitutes specific content in unified social credit codes | Code table for randomly substituting the IDs of registration authorities, code table for randomly substituting type codes, and code table for randomly substituting the IDs of administrative regions | |
Substitution (customization supported) | | Substitutes specific content in general tables with mapped values | Mapping table for substituting uppercase letters, mapping table for substituting lowercase letters, mapping table for substituting digits, and mapping table for substituting special characters | |
Substitution (customization supported) | | Randomly substitutes specific content in general tables | Code table for randomly substituting uppercase letters, code table for randomly substituting lowercase letters, code table for randomly substituting digits, and code table for randomly substituting special characters | |
Rounding | Raw data can be retrieved after it is de-identified by using some of the algorithms. This type of algorithm can be used to analyze and collect statistics on sensitive datasets. DSC provides two types of rounding algorithms. One type rounds numbers and dates, and raw data cannot be retrieved after it is de-identified. The other type bit-shifts text, and raw data can be retrieved after it is de-identified. | Rounds numbers | The value of N. Numbers are rounded to the Nth digit before the decimal point. Valid values of N: 1 to 19. | |
Rounding | | Rounds dates | The rounding level. Dates are rounded to the year, month, day, hour, or minute level. | |
Rounding | | Shifts characters | Number of places to shift and the shift direction (left or right) | |
Encryption | Raw data can be retrieved after it is de-identified by using this type of algorithm. This type of algorithm can be used to encrypt sensitive fields that need to be retrieved after encryption. Common symmetric encryption algorithms are supported. | Data Encryption Standard (DES) algorithm | Encryption key | |
Encryption | | Triple Data Encryption Standard (3DES) algorithm | Encryption key | |
Encryption | | Advanced Encryption Standard (AES) algorithm | Encryption key | |
Shuffling | Raw data cannot be retrieved after it is de-identified by using this type of algorithm. This type of algorithm can be used to de-identify structured data columns. In rearrangement mode, it extracts the values of a field within a specified range from the source table and rearranges them within the column. In random selection mode, it randomly selects values from the column within the value range and rearranges the selected values. Either way, the values are mixed up and de-identified. | Randomly shuffles data | Shuffle method: rearrangement or random selection | |
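
The following Python sketch illustrates how a few of the algorithms in the table behave. It is not DSC's implementation. The function names are hypothetical, and several details are assumptions made for illustration only: that the salt is appended to the value before hashing, that the salt serves as the HMAC key, and that date rounding truncates toward the start of the chosen level.

```python
import hashlib
import hmac
from datetime import datetime


def salted_sha256(value: str, salt: str) -> str:
    """SHA-256 of the value with the salt appended (salt placement is an assumption)."""
    return hashlib.sha256((value + salt).encode("utf-8")).hexdigest()


def hmac_sha256(value: str, salt: str) -> str:
    """HMAC-SHA-256 with the salt used as the key (an assumption)."""
    return hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256).hexdigest()


def keep_first_last(value: str, n: int, m: int, mask: str = "*") -> str:
    """Keep the first n and the last m characters and redact everything between."""
    if n + m >= len(value):
        return value  # nothing left to redact
    return value[:n] + mask * (len(value) - n - m) + value[len(value) - m:]


def redact_before(value: str, special: str, mask: str = "*") -> str:
    """Redact the characters that precede the first occurrence of `special`."""
    idx = value.find(special)
    if idx < 0:
        return value  # special character not present
    return mask * idx + value[idx:]


_LEVELS = ["year", "month", "day", "hour", "minute"]


def truncate_date(ts: datetime, level: str) -> datetime:
    """Round a timestamp down to the given level (truncation is an assumption)."""
    keep = _LEVELS.index(level) + 1  # raises ValueError for an unknown level
    parts = [ts.year, ts.month, ts.day, ts.hour, ts.minute][:keep]
    defaults = [1, 1, 1, 0, 0]  # smallest valid value for each field
    return datetime(*(parts + defaults[keep:]))


if __name__ == "__main__":
    print(keep_first_last("13812345678", 3, 4))      # 138****5678
    print(redact_before("alice@example.com", "@"))   # *****@example.com
    print(truncate_date(datetime(2024, 5, 17, 13, 45, 30), "month"))
    print(salted_sha256("secret", "pepper"))
    print(hmac_sha256("secret", "pepper"))
```

The hashing and redaction functions are one-way, which matches the table's statement that raw data cannot be retrieved for those categories. The truncated timestamp likewise discards the finer-grained fields.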