Data Security Center (DSC) provides the data insights feature that allows you to manage sensitive data identification tasks and identify and classify sensitive data in authorized assets. The identification results include the sensitive data locations, sensitive data types, and sensitivity levels. This helps you manage access permissions on assets to improve data security. This topic describes how to use an identification task to identify sensitive data.
Identification task description
An identification task scans connected data assets by using the identification models in an identification template to identify and classify sensitive data. For more information about how to use an identification template, see View and configure an identification template.
Identification task types
DSC provides default identification tasks and custom identification tasks.
Default identification tasks
After authorization is complete, DSC uses the main identification template and the common identification template to create an identification task for each asset instance. The task is called a default identification task. For more information about the main identification template and the common identification template, see Configure identification templates.
For more information about how to authorize DSC to access a data asset, see Asset authorization. The following table describes information about a default identification task.
Configuration item | Description |
Identification template | A default identification task uses the main identification template and the common identification template. You cannot modify the settings. |
Scan cycle (default) | The interval between two scans is at least 24 hours. |
Scan scope | Take note of the following for all authorized assets: If you change the main identification template, the system does not immediately scan data. The new identification template is used in the subsequent run of the default identification task. |
Custom identification tasks
You can create a custom identification task to scan specific data assets by using enabled identification templates. To use a disabled identification template, you must enable the template. For more information about how to create a template, see View and configure an identification template.
Overview
Scan speed
The following content describes the scan speeds of data assets. The scan speeds are only for reference:
Structured data stored in ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, or PolarDB, or data stored in big data systems such as Tablestore or MaxCompute: Large databases that contain more than 1,000 tables are scanned at a rate of 1,000 columns per minute.
Unstructured data stored in OSS or Simple Log Service: The time required to scan 1 TB of data is approximately 6 hours.
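The reference speeds above can be turned into a rough planning estimate. The following is a minimal sketch that only applies the two quoted rates. The function and constant names are illustrative and are not part of DSC, and actual scan times vary.

```python
# Reference rates quoted above; actual throughput varies by workload.
COLUMNS_PER_MINUTE = 1_000  # structured data and big data assets
HOURS_PER_TB = 6            # unstructured data in OSS or Simple Log Service


def estimate_structured_minutes(total_columns: int) -> float:
    """Rough scan time, in minutes, for the given number of columns."""
    return total_columns / COLUMNS_PER_MINUTE


def estimate_unstructured_hours(size_tb: float) -> float:
    """Rough scan time, in hours, for the given volume of unstructured data."""
    return size_tb * HOURS_PER_TB


if __name__ == "__main__":
    # 1,500 tables with 20 columns each is 30,000 columns.
    print(estimate_structured_minutes(1_500 * 20))  # 30.0
    print(estimate_unstructured_hours(2.5))         # 15.0
```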
Scan limits
To prevent excessively large files or tables in databases from compromising the overall scan progress, DSC imposes the following limits on the size of files or fields that can be scanned:
Structured data and data stored in big data systems: The first 200 rows of data in a table are sampled. For each field in a sampled row, only the first 10 KB of data is scanned.
Unstructured data stored in OSS or Simple Log Service:
Files that are larger than 200 MB are not scanned.
For compressed or archived files in OSS, only the first 1,000 child files are scanned.
More than 800 types of OSS files can be scanned, including text files, office files, images, design files, code files, data files, binary files, files for signature verification, archived files, application files, audio files, video files, and chemical structure files. For more information, see Supported OSS files.
For more information about the limits, see Limits.
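To see how these limits apply to your own data before you run a task, you can model them locally. The following is a sketch under stated assumptions: it treats 1 MB as 1,024 x 1,024 bytes and 1 KB as 1,024 bytes, which this topic does not specify, and all names are hypothetical rather than part of a DSC API.

```python
# Limits quoted above. The binary interpretation of MB and KB is an assumption.
MAX_FILE_BYTES = 200 * 1024 * 1024  # larger files are not scanned
MAX_CHILD_FILES = 1_000             # per compressed or archived OSS file
SAMPLE_ROWS = 200                   # rows sampled per table
MAX_FIELD_BYTES = 10 * 1024         # bytes scanned per field value


def is_file_scanned(size_bytes: int) -> bool:
    """Return True if a file of this size falls within the size limit."""
    return size_bytes <= MAX_FILE_BYTES


def scanned_child_files(child_names: list[str]) -> list[str]:
    """Return the child files of an archive that would be scanned."""
    return child_names[:MAX_CHILD_FILES]


def sample_table(rows: list[dict[str, bytes]]) -> list[dict[str, bytes]]:
    """Keep the first 200 rows and truncate each field value to 10 KB."""
    return [
        {name: value[:MAX_FIELD_BYTES] for name, value in row.items()}
        for row in rows[:SAMPLE_ROWS]
    ]
```

A table whose sensitive values appear only after row 200, or beyond the first 10 KB of a field, would not be covered by the sample in this model.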
Scanned data objects
Database assets: <Instance>/<Database>/<Table name>. Each data table is used as a data object.
Big data: <Instance>/<Table name>. Each data table is used as a data object.
OSS: <OSS bucket>/<Object name>. Each object is used as a data object.
Simple Log Service: <Simple Log Service project>/<Logstore>/<Time interval>. Each 5-minute period is considered a time interval. The data stored in each time interval is used as a data object.
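The naming patterns above can be illustrated with a short sketch. It assumes that the 5-minute intervals are aligned to clock boundaries (for example, 10:05 to 10:10) and uses a hypothetical timestamp rendering for the time interval. Neither detail is confirmed by this topic, and the function names are illustrative.

```python
from datetime import datetime, timedelta


def table_data_object(instance: str, database: str, table: str) -> str:
    """Database assets: <Instance>/<Database>/<Table name>."""
    return f"{instance}/{database}/{table}"


def interval_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 5-minute interval."""
    return ts - timedelta(
        minutes=ts.minute % 5, seconds=ts.second, microseconds=ts.microsecond
    )


def sls_data_object(project: str, logstore: str, ts: datetime) -> str:
    """Simple Log Service: <project>/<Logstore>/<Time interval>."""
    # The timestamp format below is a hypothetical choice for this sketch.
    return f"{project}/{logstore}/{interval_start(ts):%Y-%m-%dT%H:%M}"
```

In this model, all log entries written between 10:05:00 and 10:09:59 map to the same data object.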
Scan results
The sensitivity levels in the scan results of an identification task are determined based on the identification models that are hit in the identification templates used by the task. The highest sensitivity level reached prevails. DSC classifies sensitive data from S1 to S10. A higher number indicates a higher sensitivity level. N/A indicates that no sensitive data is identified.
The range of sensitivity levels available for an identification model is based on the associated identification template. For more information, see Configure identification templates.
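The "highest level prevails" rule can be expressed as a small sketch. It assumes that each hit identification model reports a level string such as S1 to S10. The function name is hypothetical and does not reflect a DSC API.

```python
def overall_level(hit_levels: list[str]) -> str:
    """Return the highest sensitivity level among the hit models, or 'N/A'."""
    if not hit_levels:
        # No identification model was hit, so no sensitive data was identified.
        return "N/A"
    # Compare by the numeric part so that S10 ranks above S2.
    return max(hit_levels, key=lambda level: int(level[1:]))


if __name__ == "__main__":
    print(overall_level(["S2", "S10", "S3"]))  # S10
    print(overall_level([]))                   # N/A
```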
Suggestions
Item | Description |
Confirm the scan scope and priority | If you want to classify a large volume of data but cannot immediately scan all data, we recommend that you first evaluate which data assets have a high scan priority. Data assets that have potential high risks, such as data that is frequently accessed, updated, or subject to unknown operations, must be scanned first. |
Specify the scope of the first scan | To achieve optimal scan performance, limit the scope of the first scan, for example, to a database, an OSS bucket, or several files. This way, you can determine the identification features and feature rules that you want to use and identify critical sensitive data first. We recommend that you enable only the identification features that you need, because false positives and invalid identification results make risk evaluation more difficult. For example, identification features for common data types such as date, time, and URL can match a large volume of data, which may not be suitable for large-scale data scans. To scan structured data, make sure that sufficient data is sampled. Otherwise, sensitive data may not be detected. |
Specify a task start time | We recommend that you specify the start time for an identification task to automatically run the task daily, weekly, or monthly based on the update frequency of the data assets. This way, you can detect changes in the data assets from the previous scan and identify sensitive data at the earliest opportunity. You can run periodic scans to identify trends of or abnormal values in the scan results. |
Prerequisites
DSC is authorized to access and identify the required data assets. For more information, see Asset authorization.
Manage default identification tasks
View default identification tasks
Log on to the DSC console.
In the left-side navigation pane, go to the Tasks page.
On the Identification Tasks tab of the Tasks page, click Default Tasks.
On the Identify task monitoring page, view the default identification task list.
You can perform the following operations on a default identification task:
Rescan: If the identification model is upgraded, the main identification template is changed, or your database is updated, start a rescan to obtain scan results at the earliest opportunity.
Suspend: If an exception occurs in your database, find the required data asset and click Suspend in the Actions column to suspend the ongoing default identification tasks.
Terminate: If you terminate a default identification task, the system completes the ongoing task but no longer runs the task in subsequent operations.
Enable: If you enable a default identification task that is terminated, the task is resumed.
Note: Default identification tasks cannot be deleted.
Modify the scan settings of a default identification task
You can configure the periodic scan for a default identification task. We recommend that you set the scan cycle to a value that is approximately the same as the frequency of data updates in your databases. This allows you to detect sensitive information in changed data. The minimum scan cycle is daily.
Log on to the DSC console.
In the left-side navigation pane, go to the Tasks page.
On the Identification Tasks tab of the Tasks page, click Default Tasks.
On the Identify task monitoring page, find the data asset for which you want to specify the scan cycle and click Scan settings.
In the Scan Settings dialog box, specify the scan cycle and scan start time and click OK.
Important: To minimize the impact of the scan operation on databases, we recommend that you set the scan start time to an off-peak hour.
When an identification task is running, we recommend that you monitor the database or service status to check for abnormal spikes in CPU utilization and memory usage. If an exception related to the task occurs, we recommend that you suspend or terminate the task. To stop the scan task, go to the Tasks page, find the required data asset, and then click Suspend or Terminate in the Actions column.
Manage a custom identification task
A custom identification task scans the specified assets by using the enabled identification templates that you select. Create a custom identification task if you want to scan a specific database by using an enabled identification template other than the main identification template.
Create a custom identification task
Log on to the DSC console.
In the left-side navigation pane, go to the Tasks page.
On the Identification Tasks tab of the Tasks page, click Create.
In the Create panel, configure the parameters and click Next. After the configurations are complete, click OK.
Category
Parameter
Description
Basic Information
Task Name
Enter a task name.
Scan Type
Select when the task scans data. Valid values:
Immediate Scan: immediately scans data after you create the identification task.
Periodic Scan: periodically scans data after you create the identification task. You can select the scan frequency and scan period from the Scan Frequency and Scan Time drop-down lists. If you want to immediately scan data, select Scan Once Now.
Note: Scan Time is effective only for structured data.
Scope
Select the scan scope of the identification task. Valid values:
Global Scan: scans all authorized assets that can be connected within the current Alibaba Cloud account. If you enable the multi-account management feature, the assets also include all authorized assets that can be connected within the member accounts.
Data Domain: scans assets in a specific data domain. For more information about a data domain, see Manage assets by using a data domain.
Asset Type: scans the assets of one or more asset types.
Identification Template
Select an identification template that is used for the scan. Only enabled identification templates are supported. You can select up to two enabled identification templates. For more information about how to create a template, see View and configure an identification template.
Config
Identification Scope of Structured Data
Select the scan scope of structured data, such as data stored in ApsaraDB RDS or PolarDB. Valid values:
Global Scan: scans all structured data specified in the Scope parameter.
Specify Scan Scope: allows you to select the instance and database that you want to scan. To add multiple instances for scanning, click Add Identification Scope.
Identification Scope of Unstructured Data
Configure the Scan Scope and Scan Depth parameters for the unstructured data in OSS.
Scan Scope:
Global Scan: scans all unstructured data assets specified in the Scope parameter.
Specify Scan Scope: allows you to select the OSS bucket that you want to scan. You can select only assets specified in the Scope parameter. You can select multiple buckets.
After you specify the objects that you want to scan, you can configure filter conditions to perform fine-grained scan. For example, you can specify inclusive or exclusive values for Prefix, Directory, or Suffix.
Scan Depth:
Global Scan: scans all bucket paths.
Specify Scan Depth: scans only bucket paths within the specified depth. Path levels are separated by forward slashes (/). Valid values: integers from 1 to 10. For example, if you set the scan depth to 5, OSS bucket paths within five layers are scanned.
Data Identification Configuration of Simple Log Service
You can view and configure the Asset Scope and Time Range parameters in Data Identification Configuration of Simple Log Service only if Simple Log Service is included in the data assets specified in the Scope parameter.
Asset Scope:
Global Scan: scans all unstructured data assets specified in the Scope parameter.
Specify Scan Range: allows you to select the projects and Logstores that you want to scan. You can select only assets specified in the Scope parameter. You can select one project and multiple Logstores.
Time Range:
Last 15 Minutes, Last 1 Hour, Yesterday, Last 1 Days, Last 7 Days, and Last 30 Days
Custom: You can specify a custom time range at an increment of 5 minutes. The unit of the time range is minutes.
Other Settings
Tagging Result Overwriting
Specify how to handle manually revised tagging results when new identification results are generated. Valid values:
Skip Manual Tagging Result: retains the original revised results. We recommend that you select this method.
Overwrite Manual Tagging Result: overwrites the original revised results with new identification results.
Task notes
Enter the description of the task.
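The Specify Scan Depth option described above can be pictured with a short sketch. It assumes that the depth of an object is the number of directory levels in its key, counted by forward slashes. This topic does not define how DSC counts levels, so treat the counting rule and the function names as hypothetical.

```python
from typing import Optional


def path_depth(object_key: str) -> int:
    """Number of directory levels in an OSS object key (assumed definition)."""
    return object_key.strip("/").count("/")


def within_scan_depth(object_key: str, scan_depth: Optional[int]) -> bool:
    """Return True if the object falls within the configured scan depth.

    scan_depth=None models the Global Scan option, which scans all paths.
    """
    if scan_depth is None:
        return True
    if not 1 <= scan_depth <= 10:
        raise ValueError("scan depth must be an integer from 1 to 10")
    return path_depth(object_key) <= scan_depth


if __name__ == "__main__":
    print(within_scan_depth("logs/2024/01/app.log", 5))  # True
    print(within_scan_depth("a/b/c/d/e/f/g.txt", 5))     # False
```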
Modify or delete a custom identification task
Edit: You can modify all parameters of a custom identification task.
Delete: You can delete a custom identification task that you no longer require.
Manage identification task status
Perform rescan operation
If you upgrade the identification model or update your database, you can perform the rescan operation to obtain scan results at the earliest opportunity. The rescan operation runs a full scan on the specified asset, and the full scan starts immediately. We recommend that you perform the rescan operation during off-peak hours.
Before you perform the rescan operation, make sure that related identification templates are enabled.
If you set the Scan Type parameter of a custom identification task to Immediate Scan, rescan operations are not supported.
On the Identification Tasks tab, perform the rescan operation.
Perform a rescan operation on a custom identification task: In the task list, find the custom identification task that you want to manage and click Rescan in the Actions column.
Perform a rescan operation on a default identification task: Click the Default Tasks tab. Then, find the required data asset and click Rescan in the Actions column.
View the scan progress in the Scan Status column of a task.
Suspend or terminate an identification task
Suspend: If an exception occurs in your database, find the required custom identification task and click Suspend in the Actions column.
Terminate: This operation terminates the current and subsequent identification tasks. You can terminate default and custom identification tasks.
Revise a hit identification model
You can create a revision task to revise sensitive data that is incorrectly tagged or has no tags. This helps enterprises manage and protect data in a more accurate manner. DSC allows you to revise and restore sensitive data identification models. To create a revision task, perform the following steps:
Log on to the DSC console.
In the left-side navigation pane, go to the Tasks page.
On the Tasks page, click the Revision Tasks tab.
In the left-side navigation pane, click the asset type that you want to manage.
Find the data that you want to manage and click Revise or Resume in the Actions column. Then, perform operations as prompted. Finally, click OK.
After you click Resume, the previous identification model is restored.
View sensitive data identification results
On the Asset Insight and Data Directory pages, you can view the latest sensitive data detected by using the main and common identification templates. For more information, see View sensitive data identification results.
You can create an export task to export sensitive data identification results that are obtained by using the main identification template or another enabled identification template. When you create the export task, specify the identification templates and data assets. After the task is complete, download the exported results.
The identification template and data asset that you specified in the export task must be associated with an identification task that is complete. Otherwise, the downloaded sensitive data identification results are empty.
Create an export task
To create an export task and download the export results, perform the following steps:
Log on to the DSC console.
In the left-side navigation pane, go to the Tasks page.
On the Tasks page, click the Export Tasks tab.
On the Export Tasks tab, click Create.
Configure an export task and click OK.
In the Basic Information section of the Create page, enter a task name and select an identification template.
You can select only an enabled identification template.
In the Export Dimension section of the Create page, select Asset Type or Asset Instance.
Asset Type: Select the asset type that you want to export.
Asset Instance: Select the instances that contain the data that you want to export.
After you create the export task, you can view the status of the task in the export task list. The larger the amount of data, the longer the export takes.
Download the sensitive data identification results
After the Export Status of the task changes to Completed, click Download in the Actions column of the task.
Download the exported data within three days after the export is complete. After three days, the task expires and the exported sensitive data identification results can no longer be downloaded.
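If you track export tasks in your own tooling, the three-day download window can be checked as follows. This is a minimal sketch: the function name is hypothetical, and treating the window as exactly 72 hours from completion with an inclusive end is an assumption.

```python
from datetime import datetime, timedelta

DOWNLOAD_WINDOW = timedelta(days=3)  # assumed to mean 72 hours after completion


def can_download(completed_at: datetime, now: datetime) -> bool:
    """Return True if the export result is still within the download window."""
    return completed_at <= now <= completed_at + DOWNLOAD_WINDOW


if __name__ == "__main__":
    done = datetime(2024, 1, 1, 9, 0)
    print(can_download(done, datetime(2024, 1, 3, 9, 0)))  # True
    print(can_download(done, datetime(2024, 1, 5, 9, 0)))  # False
```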
References
For more information about the identification templates used in identification tasks and the supported sensitive data types, see View and configure an identification template.
For more information about the data asset types that are supported by DSC for sensitive data identification, see Supported data asset types.
For more information about common issues that may occur during identification tasks, see Sensitive data scan and identification.