Configure a sensitive data identification rule and run a sensitive data identification task - DataWorks

Data Security Guard allows you to configure a sensitive data identification rule based on a sensitive field type. After the rule is configured, you can use the rule to identify sensitive data of the corresponding type within a tenant. DataWorks provides various built-in sensitive field types and sensitive data identification rules. If the built-in rules do not meet your business requirements, you can configure custom sensitive field types and sensitive data identification rules. This topic describes how to create a sensitive field type and configure a sensitive data identification rule based on the sensitive field type.

Background information

DataWorks allows you to configure a sensitive data identification rule based on a specific sensitivity level and a specific category. This helps you identify sensitive data in your workspace. You can manually correct inaccurate identification results on the Manual Data Correction tab. The Data Discovery page displays statistics on the sensitive fields that recently hit sensitive data identification rules in each workspace. For more information, see Manually correct sensitive data identification results and Identify sensitive data. The following figure shows how a sensitive data identification rule is configured and used. [Figure: logic diagram]

Note

If you want to identify and mask sensitive data in a Cloudera's Distribution including Apache Hadoop (CDH) cluster, you must use the sampling crawler feature of DataWorks to randomly extract sample data from a CDH Hive table. Then, you can identify sensitive data in the sample data in Data Security Guard. To prevent the risk of data leaks, the sample data is not stored in DataWorks. For more information, see Create and manage CDH Hive sampling crawlers.

Go to the Data Identification Rules tab

Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Click the icon in the upper-left corner, choose All Products > Data Governance > Data Security Guard, and then click Try now.
Note
- If your Alibaba Cloud account is granted the required permissions, you can directly access the homepage of Data Security Guard.
- If your Alibaba Cloud account is not granted the required permissions, you are redirected to the authorization page of Data Security Guard. You can use the features of Data Security Guard only after your Alibaba Cloud account is granted the required permissions.
In the left-side navigation pane, choose Rule Configuration > Sensitive Data Identification. The Data Identification Rules tab appears.

Step 1: Configure a category for a sensitive field type

A sensitive field type must belong to a category and be defined with a sensitivity level. Before you create a sensitive field type and configure a sensitive data identification rule based on the sensitive field type, you must specify the category and sensitivity level of the sensitive data that you want to identify.

The first time you use Data Security Guard, the default categories are displayed in the Built-in Classification Template section on the left side of the Data Identification Rules tab. You can search for a category by name. You can also click the icon next to a category name to add a same-level category, add a subcategory, rename the category, or delete the category.
If you are an existing user, you can create a category on the left side of the Data Identification Rules tab based on your business requirements.

Note

The name of the category must be unique. The name must be 1 to 30 characters in length, and can contain only letters and digits.
Before you delete a category, check whether the category contains published sensitive data identification rules that are configured based on sensitive field types. If the category contains published sensitive data identification rules, unpublish all rules in the category before you delete the category. For more information, see the Manage sensitive data identification rules section in this topic.
For information about how to specify the category and sensitivity level of sensitive data, see Specify the category and sensitivity level of sensitive data.

Step 2: Configure a sensitive data identification rule

Sensitive data identification rules must be configured based on sensitive field types. This topic provides an example on how to create a sensitive field type and configure a sensitive data identification rule based on the sensitive field type. You can also configure a sensitive data identification rule based on a built-in sensitive field type.

In the upper-right corner of the Data Identification Rules tab, click + Sensitive Field Type.

Configure the basic information about the sensitive field type.

In the Basic Information step, configure the parameters, such as Sensitive Field Type, Data Category, and Sensitivity Level. [Figure: sensitive field type]

The following table describes the key parameters.

Parameter	Description
Sensitive Field Type	The name of the sensitive field type, such as name, ID number, or phone number. The name must be unique.
Data Category	The category to which the sensitive field type belongs. If the existing categories do not meet your business requirements, you can configure a category on the Data Category and Sensitivity Level page. For more information, see Specify the category and sensitivity level of sensitive data.
Sensitivity Level	The sensitivity level of the sensitive field type. A larger number indicates a higher sensitivity level. If the existing sensitivity levels do not meet your requirements, you can configure a sensitivity level on the Data Category and Sensitivity Level page. For more information, see Specify the category and sensitivity level of sensitive data.

Click Next.

Configure a sensitive data identification rule based on the sensitive field type.

In the Rule Configuration step, configure a sensitive data identification rule and the conditions based on which the rule is triggered, and test the accuracy of the rule. [Figure: rule configuration]

Parameter	Description
Identification Rule Hitting Conditions	Select an option from the drop-down list to specify how the configured conditions are combined to decide whether the rule is hit. A rule is hit if one of the following conditions is met: The sensitive data identification rule is triggered when either the `sensitive content identification` or the `sensitive field name identification` condition is met. A rule is hit if all of the following conditions are met: The sensitive data identification rule is triggered only when both the `sensitive content identification` and the `sensitive field name identification` conditions are met. Note This setting takes effect only for rules that identify sensitive data based on `data content` or `field names`.
Data Content Identification	Identifies sensitive data based on data content, that is, field values. For example, for a data record in which the value of the `name` field is Tom, the rule identifies Tom. Note Sensitive data identification based on data content is available only in DataWorks Professional Edition or a more advanced edition. If you use DataWorks Basic Edition or DataWorks Standard Edition, you must upgrade to the Professional Edition or a more advanced edition. For information about how to upgrade the edition of DataWorks, see Billing of DataWorks editions. Select the identification method that is used to match sensitive text. Valid values: Regular Expression: Enter a regular expression and test data to test the accuracy of the rule. Built-in Identification Rule: Select a built-in sensitive data identification rule and enter test data to test the accuracy of the rule. Note The Built-in Identification Rule option is available only in DataWorks Enterprise Edition or a more advanced edition. Sample Library: Select a sample library and enter test data to test the accuracy of the rule. For information about how to create and manage sample libraries, see Identify sensitive data by using sample libraries. Custom Sensitive Data Identification Model: Select a custom rule model and enter test data to test the accuracy of the rule. For information about how to generate a custom data identification model, see Generate a custom data identification model. Note The Custom Sensitive Data Identification Model option is available only for a MaxCompute compute engine and only in DataWorks Enterprise Edition or a more advanced edition.
Field Name Identification	Identifies sensitive data based on field names. For example, for a data record in which the value of the `name` field is Tom, the rule identifies `name`. Enter the fields used to match sensitive data. You can specify multiple fields. The logical relationship between fields is `OR`. The input format depends on the type of data source: E-MapReduce (EMR) and CDH: `project.table.column` MaxCompute: `project.schema.table.column` (If you do not specify a schema, default is used.) Hologres: `instance_id.project.table.column` In the preceding input formats, you can use an asterisk (*) as a wildcard character in each section of the format. Examples: `a.b.*`: indicates that data in all fields in Table b of Project a is identified as sensitive data. `ab*.c*.salary`: indicates that data in all fields named salary in tables whose names start with c in projects whose names start with ab is identified as sensitive data. `*cd.ef*.sa*ry`: indicates that data in all fields whose names start with sa and end with ry in tables whose names start with ef in projects whose names end with cd is identified as sensitive data.
Field Description Identification	Identifies sensitive data based on field comments. For example, you can enter "phone number" and "contact method" as the comment that is used to identify phone number fields. In this case, if the system identifies that a field comment contains "phone number" or "contact method", the field is considered a phone number field. You can specify up to 10 field comments. Each field comment can be 0 to 100 characters in length. The types of characters in comments are not limited.
Field Exclusion	The fields that you want to ignore. The fields are not used to match sensitive data and do not hit the configured sensitive data identification rule. You can specify multiple fields. The logical relationship between fields is `OR`. The input format depends on the type of data source: E-MapReduce (EMR) and CDH: `project.table.column` MaxCompute: `project.schema.table.column` (If you do not specify a schema, default is used.) Hologres: `instance_id.project.table.column` In the preceding input formats, you can use an asterisk (*) as a wildcard character in each section of the format. Examples: `a.b.*`: indicates that all fields in Table b of Project a are excluded from identification. `ab*.c*.salary`: indicates that all fields named salary in tables whose names start with c in projects whose names start with ab are excluded from identification. `*cd.ef*.sa*ry`: indicates that all fields whose names start with sa and end with ry in tables whose names start with ef in projects whose names end with cd are excluded from identification.
Hit Ratio Configuration	The hit ratio threshold of `sensitive content identification` for a column. If the ratio of identified sensitive data records to non-empty data records of a column exceeds the threshold, such as 50%, the sensitive data identification rule is hit. By default, the hit ratio threshold is set to 50%. The following formula is used to calculate the hit ratio of sensitive content identification for a column: `100% × Number of data records that match the condition of sensitive content identification/Total number of non-empty data records in the column`. Note The hit ratio threshold takes effect only for `sensitive content identification`.
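The parameters in the preceding table can be read together as a small decision procedure. The following Python sketch illustrates that reading and is not the DataWorks implementation. The function names are hypothetical, the threshold is written as a fraction (0.5 for 50%), empty records are assumed to be `None` or an empty string, and a value is assumed to count as a content hit only when the regular expression matches the whole value.

```python
import re

def section_matches(pattern: str, value: str) -> bool:
    """Match one dot-separated section; '*' stands for any run of characters."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.fullmatch(regex, value) is not None

def field_matches(pattern: str, field: str) -> bool:
    """Match a field path such as project.table.column section by section."""
    p_parts, f_parts = pattern.split("."), field.split(".")
    return len(p_parts) == len(f_parts) and all(
        section_matches(p, f) for p, f in zip(p_parts, f_parts)
    )

def hit_ratio(values, content_regex: str) -> float:
    """Matching records divided by non-empty records, as a fraction."""
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if re.fullmatch(content_regex, v))
    return hits / len(non_empty)

def rule_is_hit(field, values, name_patterns, content_regex,
                threshold=0.5, require_all=False) -> bool:
    """Combine the field name condition and the content condition."""
    # Multiple field patterns are combined with OR.
    name_hit = any(field_matches(p, field) for p in name_patterns)
    # The content condition is met when the hit ratio exceeds the threshold.
    content_hit = hit_ratio(values, content_regex) > threshold
    if require_all:
        return name_hit and content_hit
    return name_hit or content_hit
```

For example, `field_matches("ab*.c*.salary", "abc.customers.salary")` is true under these assumptions. A column with three non-empty records, two of which match the expression, has a hit ratio of about 66.7%, which exceeds the default threshold of 50%.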

Publish the sensitive data identification rule.
Click Publish to Use. You can use the rule in a sensitive data identification task to identify sensitive data only after the rule is published.

Note

If you do not want to use the rule, you can click Save as Draft to save the rule.
If data in a column hits multiple sensitive data identification rules that are configured for different sensitive field types, take note of the following items:
- If the number of conditions hit in each sensitive data identification rule is the same, the system identifies sensitive data in the column based on the conditions in the following order: sensitive field name identification, sensitive content identification, and sensitive comment identification.
- If the number and types of conditions hit in each sensitive data identification rule are the same, the system preferentially identifies sensitive data in the column based on the sensitive data identification rule whose sensitive field type has the highest sensitivity level.
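The two tie-breaking items above can be sketched as a sort key. This is one possible reading and not the documented algorithm. The `RuleHit` class and the condition labels are hypothetical, and the sketch covers only the case described in the text, in which every competing rule hit the same number of conditions.

```python
from dataclasses import dataclass

# Order stated in the text: field name first, then content, then comment.
CONDITION_RANK = {"field_name": 0, "content": 1, "comment": 2}

@dataclass(frozen=True)
class RuleHit:
    field_type: str          # name of the sensitive field type
    sensitivity_level: int   # a larger number means a higher sensitivity level
    conditions: frozenset    # kinds of conditions that were hit

def pick_rule(hits):
    """Choose one rule among hits that matched the same number of conditions."""
    if not hits:
        raise ValueError("no rule was hit")
    if len({len(h.conditions) for h in hits}) != 1:
        raise ValueError("this sketch covers only rules with equal hit counts")

    def key(hit):
        # Prefer higher-priority condition kinds, then higher sensitivity.
        ranks = sorted(CONDITION_RANK[c] for c in hit.conditions)
        return (ranks, -hit.sensitivity_level)

    return min(hits, key=key)
```

In this reading, a rule that was hit by its field name condition wins over a rule that was hit by its content condition. If both rules were hit by the same kinds of conditions, the rule whose sensitive field type has the higher sensitivity level wins.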

Step 3: Authorize a sensitive data identification task to identify sensitive data and start the task

After you configure the sensitive data identification rule, you can authorize a sensitive data identification task to identify sensitive data in the tenant based on the sensitive data identification rule and start the task.

Authorize a sensitive data identification task to identify sensitive data.
In the upper-left corner of the Sensitive Data Identification page, click Run Task.
Note
After the sensitive data identification task is started, click Authorization Records in the upper-right corner of the Sensitive Data Identification page to view the authorization details.

Start the sensitive data identification task.

Configure the sensitive data identification task.

In the Enable sensitive data identification tasks panel, configure the task type, scanning method, and scanning range.

The following table describes the parameters.

Parameter	Description
Task Type	The method used to run the sensitive data identification task. If you set this parameter to Automatic Tasks, the task is periodically run to scan data based on the specified scanning range and scanning time after you start the task. If you set this parameter to Manual Tasks, the task is immediately run to scan data based on the specified scanning range after you start the task. This type of task is a one-time task. After the task finishes running, the task ends.
Account Used for Identification	The Alibaba Cloud account or a RAM user used to sample and scan data. The range of data that can be sampled and scanned varies based on the permissions of the account that you use.
Content Identification	Specifies whether content identification conditions and metadata identification conditions in the rule take effect. Valid values: Content recognition and metadata recognition. The corresponding conditions take effect after you select an option. Note If you do not select Content recognition, Data Security Guard does not sample or scan data. Conditions that identify sensitive data based on data content do not take effect. However, conditions that identify sensitive data based on field names or field comments still take effect.
Sampling quantity	The number of sampled data records for sensitive content identification. We recommend that you set this parameter to a value greater than 100. If you select the Content recognition option, you must configure this parameter.
Scanning frequency and Scan time	The scanning frequency for automatic tasks. If you set the Task Type parameter to Automatic Tasks, you must configure this parameter.
Scanning range	The range of data that the sensitive data identification task scans. If you set this parameter to Full, all data of the accounts authorized by the current tenant is scanned. If you set this parameter to Custom Range, the data of tables in a specified project is scanned. Note By default, the project scope is all projects of all compute engines. You can scan data only in specified tables of a MaxCompute project. A table name must be less than `100` characters in length, and all types of characters are supported. If you do not specify a table name, all tables in the project are scanned. Wildcards (`.*`) are supported. For example, `.*name` indicates that `name` is the suffix, and `private.*` indicates that `private` is the prefix. If you specify multiple tables or fields, separate the table names or field names with commas (,). You can click Add Custom Range to add multiple custom scanning ranges. The final scanning range is the union of the custom ranges.
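Several parameters in the preceding table depend on each other. The following Python sketch collects those dependencies in one place so that they are easier to check before you fill in the panel. The class and its field names are hypothetical and do not correspond to a DataWorks API. Only the constraints stated in the table are encoded.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IdentificationTaskConfig:
    task_type: str                           # "automatic" or "manual"
    content_recognition: bool = True
    metadata_recognition: bool = True
    sampling_quantity: Optional[int] = None
    scan_schedule: Optional[str] = None      # hypothetical label for frequency and time
    table_names: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return the constraints from the table that this configuration violates."""
        found = []
        if self.task_type not in ("automatic", "manual"):
            found.append("task_type must be 'automatic' or 'manual'")
        # Sampling quantity is required when content recognition is selected.
        if self.content_recognition and self.sampling_quantity is None:
            found.append("sampling_quantity is required for content recognition")
        # Scanning frequency and scan time are required for automatic tasks.
        if self.task_type == "automatic" and not self.scan_schedule:
            found.append("scan_schedule is required for automatic tasks")
        # A table name must be less than 100 characters in length.
        for name in self.table_names:
            if len(name) >= 100:
                found.append(f"table name is too long ({len(name)} characters)")
        return found
```

For example, `IdentificationTaskConfig(task_type="manual", sampling_quantity=200).problems()` returns an empty list, whereas an automatic task without a sampling quantity and a schedule reports two problems.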

Click Run to start the task.
After the task is started, the task status changes.
- Manual tasks: A progress bar is displayed to the right of Task Status on the Sensitive Data Identification page. When the progress reaches 100%, the scan is complete. The following formula is used to calculate the task progress: 100% × (Number of identified tables in the task/Total number of tables that need to be identified in the task).
- Automatic tasks: Opening is displayed to the right of Task Status on the Sensitive Data Identification page. After the scanning time configured for the task is reached, the platform identifies sensitive data based on the related configurations.
Note
- If you modify the sensitive data identification rule, the new rule takes effect the next time the task is automatically run. If you want the new rule to take effect in real time, you must manually trigger the task to run.
- After the scan is complete, No Task is displayed to the right of Task Status on the Sensitive Data Identification page.

Manage sensitive data identification rules

Copy a rule: To copy an existing rule, click the icon in the Actions column of the rule. By default, the name of the generated rule is suffixed with -copy, and the rule is saved as a draft. You can modify the copied rule based on your business requirements.
Modify a rule: To modify the information about a rule, click the icon in the Actions column of the rule.
Note
- You cannot modify the basic information of sensitive data identification rules that are configured based on built-in sensitive field types.
- If you modify the rule, the identification results obtained based on the original sensitive data identification rule are cleared.
Delete a rule: If you no longer require a rule, you can click the icon in the Actions column of the rule to delete the rule.
Important
Take note of the following impacts of deleting a sensitive data identification rule for a sensitive field type.
- The identification results generated based on the sensitive data identification rule for the sensitive field type are deleted. For more information, see Manually correct sensitive data identification results.
- The statistics on the sensitive field type are no longer displayed on the Data Discovery page. For more information, see Identify sensitive data.
- If the sensitive field type is referenced by a risk identification rule, the sensitive field type is removed from the configurations of the risk identification rule. For more information, see Risk identification rule management (old version).
Publish multiple rules at a time: After you publish a sensitive data identification rule, the system identifies sensitive data based on the sensitive data identification rule. If a large number of sensitive data identification rules are configured, you can publish multiple rules at a time.
1. In the upper-right corner of the Data Identification Rules tab, click Batch Publish and select the rules that you want to publish.
  Note
  You can select only rules that are in the Draft state.
2. Click Publish. After the rules are published, the status of the rules changes to Published.
  Note
  If you do not want to publish the rules, click Cancel. The rules remain in the Draft state.
Unpublish multiple rules at a time: After you unpublish a sensitive data identification rule, the system stops identifying sensitive data based on the rule. Sensitive data identification records of the related sensitive field types on the Data Discovery page and the Manual Data Correction tab are also deleted. Before you unpublish a sensitive data identification rule, check whether the rule is referenced by data masking rules or risk identification rules. If so, change the status of the data masking rule to Inactive and remove the related sensitive field type from the configurations of the risk identification rule. For more information, see Create a data masking rule and Risk identification rule management (old version).
1. On the Data Identification Rules tab, click Batch Unpublish and select the rules that you want to unpublish.
  Note
  You can select only rules in the Published state.
2. Click Unpublish. Then, the status of the rules changes to Draft.
  Note
  If you do not want to unpublish the rules, click Cancel. The rules remain in the Published state.
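The batch operations above amount to two state transitions: only rules in the Draft state can be published, and only rules in the Published state can be unpublished. The following Python sketch models that behavior under these assumptions. The function names and the dictionary of rule states are hypothetical, and the sketch does not call any DataWorks API.

```python
from enum import Enum
from typing import Dict, Iterable

class RuleStatus(Enum):
    DRAFT = "Draft"
    PUBLISHED = "Published"

def _transition(statuses: Dict[str, RuleStatus], selected: Iterable[str],
                required: RuleStatus, target: RuleStatus) -> Dict[str, RuleStatus]:
    """Return a copy of statuses with every selected rule moved to target."""
    selected = list(selected)
    for name in selected:
        if statuses[name] is not required:
            raise ValueError(f"{name} is not in the {required.value} state")
    updated = dict(statuses)
    for name in selected:
        updated[name] = target
    return updated

def batch_publish(statuses, selected):
    """Publish the selected rules; only Draft rules can be selected."""
    return _transition(statuses, selected, RuleStatus.DRAFT, RuleStatus.PUBLISHED)

def batch_unpublish(statuses, selected):
    """Unpublish the selected rules; only Published rules can be selected."""
    return _transition(statuses, selected, RuleStatus.PUBLISHED, RuleStatus.DRAFT)
```

Because each function returns a new dictionary, a failed selection leaves the original states untouched, which mirrors the effect of clicking Cancel.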

Subsequent operations: View task execution records

The Task Execution Records tab displays the execution records of tasks that were completed in the previous week. The execution records of tasks that are still running are not displayed on this tab. You can view the following task information in each record: Start Time, End Time, Duration, Task Type, Owner, and Data Range.