Data Security Center: Identify sensitive data by using identification tasks

Last Updated: Nov 28, 2024

Data Security Center (DSC) provides the data insights feature that allows you to manage sensitive data identification tasks and identify and classify sensitive data in authorized assets. The identification results include the sensitive data locations, sensitive data types, and sensitivity levels. This helps you manage access permissions on assets to improve data security. This topic describes how to use an identification task to identify sensitive data.

Identification task description

An identification task scans connected data assets by using the identification models in an identification template to identify and classify sensitive data. For more information about how to use an identification template, see View and configure an identification template.

Identification task types

DSC provides default identification tasks and custom identification tasks.

Default identification tasks

After authorization is complete, DSC uses the main identification template and the common identification template to create an identification task for each asset instance. The task is called a default identification task. For more information about the main identification template and the common identification template, see Configure identification templates.

For more information about how to authorize DSC to access a data asset, see Asset authorization. The following table describes information about a default identification task.

Configuration item

Description

Identification template

A default identification task uses the main identification template and the common identification template. You cannot modify the settings.

  • Main identification template: You can specify a built-in industry template, such as the classification template for the Internet industry and the classification template for the Internet of Vehicles (IoV) industry, or a custom identification template as the main identification template.

  • Common identification template: The template is used to protect personal information security and privacy rights in accordance with GB/T 35273-2020 Information security technology - Personal information security specification issued by the Standardization Administration of China. The common identification template helps enterprises and organizations implement personal information management and risk control in an effective manner.

Scan cycle (default)

  • After you connect to a database, an Object Storage Service (OSS) bucket, or a Logstore on the Authorization Management tab, the system automatically creates a default identification task.

    • If you click Connect on the Authorization Management page and select the Immediately scan database assets and identify data option, DSC immediately runs the default identification task.

    • If you click Connect on the Authorization Management page and clear the Immediately scan database assets and identify data option, you must manually run the default identification task. To run the task, go to the Data Insights > Tasks page. On the Identification Tasks tab, click Default Tasks, find the task, and then click Rescan.

  • After you connect to a database by using the account and password of the database, the system automatically creates a default identification task. The system also performs a scan operation every morning starting from the next day.

    Note

    The interval between two scans is at least 24 hours.

Scan scope

Take note of the following items for all authorized assets:

  • Database and OSS assets: All data in authorized data assets is scanned during the first scan, and only incremental data in authorized data assets is scanned during subsequent scans.

  • Simple Log Service assets: Each scan covers the data that was stored in authorized data assets between 00:00 and 24:00 on the day before the scan is performed.

    If you want to scan more data, you can create a custom identification task and specify the scan scope. For more information, see Create a custom identification task.

If you change the main identification template, the system does not immediately scan data. The new identification template is used in the subsequent run of the default identification task.
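The Simple Log Service scan window described above can be sketched as a small date calculation. This is an illustration only: the function name `previous_day_window` is hypothetical, and the sketch assumes naive local timestamps rather than any specific time zone handling used by DSC.

```python
from __future__ import annotations

from datetime import datetime, time, timedelta


def previous_day_window(scan_time: datetime) -> tuple[datetime, datetime]:
    """Return the [start, end) window covering the day before scan_time.

    Assumption: scan_time is a naive local timestamp; the window runs
    from 00:00 of the previous day up to 00:00 (24:00) of the scan day.
    """
    scan_day_start = datetime.combine(scan_time.date(), time.min)
    return scan_day_start - timedelta(days=1), scan_day_start


if __name__ == "__main__":
    start, end = previous_day_window(datetime(2024, 11, 28, 3, 30))
    print(start, end)  # 2024-11-27 00:00:00 2024-11-28 00:00:00
```

Data outside this window is not covered by the default task, which is why a custom identification task is needed to scan a wider range.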

Custom identification tasks

You can create a custom identification task to scan specific data assets by using enabled identification templates. To use a disabled identification template, you must enable the template. For more information about how to create a template, see View and configure an identification template.

Overview

Scan speed

The following content describes the scan speeds of data assets. The scan speeds are only for reference:

  • Structured data stored in ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, or PolarDB, or data stored in big data systems such as Tablestore or MaxCompute: Large databases that contain more than 1,000 tables are scanned at a rate of 1,000 columns per minute.

  • Unstructured data stored in OSS or Simple Log Service: The time required to scan 1 TB of data is approximately 6 hours.
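As a rough planning aid, the reference speeds above can be turned into a back-of-the-envelope estimate. The sketch below simply applies the two quoted rates; the function names and the example workload are hypothetical, and actual scan times vary.

```python
# Reference rates quoted in this topic; actual speeds vary.
COLUMNS_PER_MINUTE = 1_000  # structured data and big data systems
HOURS_PER_TB = 6            # unstructured data in OSS or Simple Log Service


def estimate_structured_minutes(total_columns: int) -> float:
    """Rough scan time in minutes for the given number of columns."""
    return total_columns / COLUMNS_PER_MINUTE


def estimate_unstructured_hours(size_tb: float) -> float:
    """Rough scan time in hours for the given data volume in TB."""
    return size_tb * HOURS_PER_TB


if __name__ == "__main__":
    # Hypothetical workload: 1,200 tables with 25 columns each, plus 2.5 TB in OSS.
    print(estimate_structured_minutes(1_200 * 25))  # 30.0
    print(estimate_unstructured_hours(2.5))         # 15.0
```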

Scan limits

To prevent excessively large files or tables in databases from compromising the overall scan progress, DSC imposes the following limits on the size of files or fields that can be scanned:

  • Structured data and data stored in big data systems: The first 200 rows of data in a table are sampled. In the sampled rows, only the first 10 KB of data in each field is scanned.

  • Unstructured data stored in OSS or Simple Log Service:

    • Files larger than 200 MB are not scanned.

    • For compressed or archived files in OSS, only the first 1,000 child files are scanned.

    • More than 800 types of OSS files can be scanned, including text, office, image, design, code, data, binary, signature verification, archived, application, audio, video, and chemical structure files. For more information, see Supported OSS files.

For more information about the limits, see Limits.
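To see how these limits affect what is actually inspected, the following sketch mimics them on local data. It is not DSC code: the function names are hypothetical, and it assumes binary units (1 KB = 1,024 bytes, 1 MB = 1,024 KB), which this topic does not specify.

```python
from __future__ import annotations

MAX_FILE_BYTES = 200 * 1024 * 1024  # assumed meaning of "200 MB"
MAX_ARCHIVE_CHILDREN = 1_000        # child files scanned per archive
SAMPLE_ROWS = 200                   # rows sampled per table
MAX_FIELD_BYTES = 10 * 1024         # assumed meaning of "10 KB"


def is_file_scanned(size_bytes: int) -> bool:
    """A file is skipped only if its size exceeds the limit."""
    return size_bytes <= MAX_FILE_BYTES


def children_to_scan(child_names: list[str]) -> list[str]:
    """Only the first 1,000 child files of an archive are scanned."""
    return child_names[:MAX_ARCHIVE_CHILDREN]


def sample_table(rows: list[dict[str, bytes]]) -> list[dict[str, bytes]]:
    """Keep the first 200 rows and truncate each field to its first 10 KB."""
    return [
        {column: value[:MAX_FIELD_BYTES] for column, value in row.items()}
        for row in rows[:SAMPLE_ROWS]
    ]
```

One practical consequence: sensitive values that appear only after row 200 of a table, or beyond the first 10 KB of a field, fall outside the sample.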

Scanned data objects

  • Database assets: <Instance>/<Database>/<Table name>. Each data table is used as a data object.

  • Big data: <Instance>/<Table name>. Each data table is used as a data object.

  • OSS: <OSS bucket>/<Object name>. Each object is used as a data object.

  • Simple Log Service: <Simple Log Service project>/<Logstore>/<Time interval>. Each 5-minute period is considered a time interval. The data stored in each time interval is used as a data object.
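The naming patterns above can be illustrated with a few string helpers. This is a sketch under assumptions: the helper names are hypothetical, and the way a Simple Log Service time interval is rendered (the start time of each 5-minute slot, aligned down to a 5-minute boundary) is chosen here for illustration and is not confirmed by this topic.

```python
from __future__ import annotations

from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)  # each 5-minute period is one data object


def database_object(instance: str, database: str, table: str) -> str:
    return f"{instance}/{database}/{table}"


def big_data_object(instance: str, table: str) -> str:
    return f"{instance}/{table}"


def oss_object(bucket: str, object_name: str) -> str:
    return f"{bucket}/{object_name}"


def sls_objects(project: str, logstore: str,
                start: datetime, end: datetime) -> list[str]:
    """List one data object per 5-minute interval that overlaps [start, end)."""
    # Assumption: intervals are aligned to 5-minute boundaries.
    current = start.replace(minute=start.minute - start.minute % 5,
                            second=0, microsecond=0)
    objects = []
    while current < end:
        objects.append(f"{project}/{logstore}/{current:%Y-%m-%dT%H:%M}")
        current += INTERVAL
    return objects


if __name__ == "__main__":
    for name in sls_objects("demo-project", "demo-logstore",
                            datetime(2024, 11, 28, 10, 2),
                            datetime(2024, 11, 28, 10, 15)):
        print(name)
```

With the hypothetical inputs shown, the 13-minute range maps to three data objects, starting at 10:00, 10:05, and 10:10.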

Scan results

The sensitivity levels in the scan results of an identification task are determined based on the identification models that are hit in the identification templates used by the task. The highest sensitivity level reached prevails. DSC classifies sensitive data from S1 to S10. A higher number indicates a higher sensitivity level. N/A indicates that no sensitive data is identified.

The range of sensitivity levels available for an identification model is based on the associated identification template. For more information, see Configure identification templates.
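The "highest sensitivity level prevails" rule can be expressed in a few lines. The sketch below is illustrative only; the function name `overall_sensitivity` is hypothetical, and levels are assumed to be written as the letter S followed by a number from 1 to 10.

```python
from __future__ import annotations


def overall_sensitivity(hit_levels: list[str]) -> str:
    """Return the highest level among the hit models, or N/A if none was hit."""
    if not hit_levels:
        return "N/A"
    # Compare by the numeric part so that S10 ranks above S9.
    return max(hit_levels, key=lambda level: int(level[1:]))


if __name__ == "__main__":
    print(overall_sensitivity(["S2", "S10", "S3"]))  # S10
    print(overall_sensitivity([]))                   # N/A
```

Comparing the numeric part matters because a plain string comparison would rank S9 above S10.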

Suggestions

Item

Description

Confirm the scan scope and priority

If you want to classify a large volume of data but cannot immediately scan all data, we recommend that you first evaluate which data assets have a high scan priority. Data assets that have potential high risks, such as data that is frequently accessed, updated, or subject to unknown operations, must be scanned first.

Specify the scope of the first scan

To achieve optimal scan performance, you can specify the scan scope. For example, you can specify the scope of the first scan to a database, OSS bucket, or multiple files. This way, you can determine the identification features and feature rules that you want to use and identify critical sensitive data.

Enable only the identification features that you need. False positives or invalid identification results increase the difficulty of risk evaluation. For example, identification features for common data types such as date, time, and URL match a large volume of data, which may not be suitable for large-scale data scans.

To scan structured data, make sure that sufficient data is sampled. Otherwise, sensitive data may not be detected.

Specify a task start time

We recommend that you specify the start time for an identification task to automatically run the task daily, weekly, or monthly based on the update frequency of the data assets. This way, you can detect changes in the data assets from the previous scan and identify sensitive data at the earliest opportunity. You can run periodic scans to identify trends of or abnormal values in the scan results.

Prerequisites

DSC is authorized to access and identify the required data assets. For more information, see Asset authorization.

Manage default identification tasks

View default identification tasks

  1. Log on to the DSC console.

  2. In the left-side navigation pane, choose Data Insights > Tasks.

  3. On the Identification Tasks tab of the Tasks page, click Default Tasks.

  4. On the Identify task monitoring page, view the default identification task list.

  5. You can perform the following operations on a default identification task:

    • Rescan: If the identification model is upgraded, you change the main identification template, or your database is updated, start a rescan to obtain scan results at the earliest opportunity.

    • Suspend: If an exception occurs in your database, find the required data asset and click Suspend in the Actions column to suspend the ongoing default identification tasks.

    • Terminate: If you terminate a default identification task, the system completes the ongoing task but no longer runs the task in subsequent operations.

    • Enable: If you enable a default identification task that is terminated, the task is resumed.

    Note

    Default identification tasks cannot be deleted.

Modify the scan settings of a default identification task

You can configure the periodic scan for a default identification task. We recommend that you set the scan cycle to a value that is approximately the same as the frequency of data updates in your databases. This allows you to detect sensitive information in changed data. The minimum scan cycle is daily.

  1. Log on to the DSC console.

  2. In the left-side navigation pane, choose Data Insights > Tasks.

  3. On the Identification Tasks tab of the Tasks page, click Default Tasks.

  4. On the Identify task monitoring page, find the data asset for which you want to specify the scan cycle and click Scan settings.

  5. In the Scan Settings dialog box, specify the scan cycle and scan start time and click OK.

    Important
    • To minimize the impact of the scan operation on databases, we recommend that you set the scan start time to an off-peak hour.

    • When an identification task is running, we recommend that you monitor the database or service status to check for abnormal spikes in CPU utilization and memory usage. If an exception related to the task occurs, we recommend that you suspend or terminate the task. To stop the scan task, go to the Tasks page, find the required data asset, and then click Suspend or Terminate in the Actions column.

Manage a custom identification task

A custom identification task scans the assets that you specify by using the enabled identification templates that you select. To scan a specific database by using an enabled identification template other than the main identification template, create a custom identification task.

Create a custom identification task

  1. Log on to the DSC console.

  2. In the left-side navigation pane, choose Data Insights > Tasks.

  3. On the Identification Tasks tab of the Tasks page, click Create.

  4. In the Create panel, configure the parameters and click Next. After the configurations are complete, click OK.

    Category

    Parameter

    Description

    Basic Information

    Task Name

    Enter a task name.

    Scan Type

    Select when the task runs. Valid values:

    • Immediate Scan: immediately scans data after you create the identification task.

    • Periodic Scan: periodically scans data after you create the identification task. You can select the scan frequency and scan period from the Scan Frequency and Scan Time drop-down lists. If you want to immediately scan data, select Scan Once Now.

      Note

      Scan Time is effective only for structured data.

    Scope

    Select the scan scope of the identification task. Valid values:

    • Global Scan: scans all authorized assets that can be connected within the current Alibaba Cloud account. If you enable the multi-account management feature, the assets include all authorized assets that can be connected within the member accounts.

    • Data Domain: scans assets in a specific data domain. For more information about a data domain, see Manage assets by using a data domain.

    • Asset Type: scans the assets of one or more asset types.

    Identification Template

    Select an identification template that is used for the scan. Only enabled identification templates are supported. You can select up to two enabled identification templates. For more information about how to create a template, see View and configure an identification template.

    Config

    Identification Scope of Structured Data

    Select the scan scope of structured data, such as data stored in ApsaraDB RDS or PolarDB. Valid values:

    • Global Scan: scans all structured data specified in the Scope parameter.

    • Specify Scan Scope: allows you to select the instance and database that you want to scan. To add multiple instances for scanning, click Add Identification Scope.

    Identification Scope of Unstructured Data

    Configure the Scan Scope and Scan Depth parameters for the unstructured data in OSS.

    • Scan Scope:

      • Global Scan: scans all unstructured data assets specified in the Scope parameter.

      • Specify Scan Scope: allows you to select the OSS bucket that you want to scan. You can select only assets specified in the Scope parameter. You can select multiple buckets.

        After you specify the objects that you want to scan, you can configure filter conditions to perform fine-grained scan. For example, you can specify inclusive or exclusive values for Prefix, Directory, or Suffix.

    • Scan Depth:

      • Global Scan: scans all bucket paths.

      • Specify Scan Depth: scans only bucket paths within the specified depth. Path levels are separated by forward slashes (/). Valid values: integers from 1 to 10. For example, if you set the scan depth to 5, OSS bucket paths within five layers are scanned.

    Data Identification Configuration of Simple Log Service

    You can view and configure the Asset Scope and Time Range parameters in Data Identification Configuration of Simple Log Service only if the data assets specified in the Scope parameter include Simple Log Service.

    • Asset Scope:

      • Global Scan: scans all unstructured data assets specified in the Scope parameter.

      • Specify Scan Range: allows you to select the projects and Logstores that you want to scan. You can select only assets specified in the Scope parameter. You can select one project and multiple Logstores.

    • Time Range:

      • Last 15 Minutes, Last 1 Hour, Yesterday, Last 1 Days, Last 7 Days, and Last 30 Days

      • Custom: You can specify a custom time range at an increment of 5 minutes. The unit of the time range is minutes.

    Other Settings

    Tagging Result Overwriting

    Specify how to handle existing manually revised tagging results when a new scan produces different identification results. Valid values:

    • Skip Manual Tagging Result: retains the original revised results. We recommend that you select this method.

    • Overwrite Manual Tagging Result: overwrites the original revised results with new identification results.

    Task notes

    Enter the description of the task.
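The Scan Depth and filter settings for OSS in the table above can be pictured with a small local sketch. Everything here is an assumption made for illustration: the function names are hypothetical, an object directly under the bucket root is counted as depth 1, and only one inclusive prefix and one exclusive suffix are modeled.

```python
from __future__ import annotations


def key_depth(object_key: str) -> int:
    """Count the slash-separated levels in an object key (assumed rule)."""
    return len([part for part in object_key.split("/") if part])


def within_scan_depth(object_key: str, scan_depth: int) -> bool:
    """True if the object key lies within the configured scan depth."""
    if not 1 <= scan_depth <= 10:
        raise ValueError("scan depth must be an integer from 1 to 10")
    return key_depth(object_key) <= scan_depth


def matches_filters(object_key: str,
                    include_prefix: str | None = None,
                    exclude_suffix: str | None = None) -> bool:
    """Apply a hypothetical inclusive prefix and exclusive suffix filter."""
    if include_prefix is not None and not object_key.startswith(include_prefix):
        return False
    if exclude_suffix is not None and object_key.endswith(exclude_suffix):
        return False
    return True


if __name__ == "__main__":
    key = "logs/2024/11/app.log"
    print(within_scan_depth(key, 5))                      # True
    print(matches_filters(key, include_prefix="logs/",
                          exclude_suffix=".tmp"))         # True
```

Under this counting rule, the hypothetical key `logs/2024/11/app.log` has a depth of 4, so it is covered by a scan depth of 5 but not by a scan depth of 3.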

Modify or delete a custom identification task

  • Edit: You can modify all parameters of a custom identification task.

  • Delete: You can delete a custom identification task that you no longer require.

Manage identification task status

Perform rescan operation

If you upgrade the identification model or update your database, you can perform the rescan operation to obtain scan results at the earliest opportunity. The rescan operation immediately runs a full scan on the specified asset. To minimize the impact on your services, we recommend that you perform the rescan operation during off-peak hours.

Before you perform the rescan operation, make sure that related identification templates are enabled.

Note

If you set the Scan Type parameter of a custom identification task to Immediate Scan, rescan operations are not supported.

  1. On the Identification Tasks tab, perform the rescan operation.

    • Perform a rescan operation on a custom identification task: In the task list, find the custom identification task that you want to manage and click Rescan in the Actions column.

    • Perform a rescan operation on a default identification task: Click the Default Tasks tab. Then, find the required data asset and click Rescan in the Actions column.

  2. View the scan progress in the Scan Status column of a task.

Suspend or terminate an identification task

  • Suspend: If an exception occurs in your database, find the required custom identification task and click Suspend in the Actions column.

  • Terminate: This operation terminates the current and subsequent identification tasks. You can terminate default and custom identification tasks.

Revise a hit identification model

You can create a revision task to revise sensitive data that is incorrectly tagged or has no tags. This helps enterprises manage and protect data in a more accurate manner. DSC allows you to revise and restore sensitive data identification models. To create a revision task, perform the following steps:

  1. Log on to the DSC console.

  2. In the left-side navigation pane, choose Data Insights > Tasks.

  3. On the Tasks page, click the Revision Tasks tab.

  4. In the left-side navigation pane, click the asset type that you want to manage.

  5. Find the data that you want to manage and click Revise or Resume in the Actions column. Then, perform operations as prompted. Finally, click OK.

    After you click Resume, the previous identification model is restored.

View sensitive data identification results

On the Asset Insight and Data Directory pages, you can view the latest sensitive data detected by using the main and common identification templates. For more information, see View sensitive data identification results.

You can create an export task to export sensitive data identification results that are obtained by using the main identification template or an active identification template. You can specify identification templates and data assets to create an export task and download the exported sensitive data identification results.

Important

The identification template and data asset that you specified in the export task must be associated with an identification task that is complete. Otherwise, the downloaded sensitive data identification results are empty.

Create an export task

To create an export task and download the export results, perform the following steps:

  1. Log on to the DSC console.

  2. In the left-side navigation pane, choose Data Insights > Tasks.

  3. On the Tasks page, click the Export Tasks tab.

  4. On the Export Tasks tab, click Create.

  5. Configure an export task and click OK.

    1. In the Basic Information section of the Create page, enter a task name and select an identification template.

      You can select only an enabled identification template.

    2. In the Export Dimension section of the Create page, select Asset Type or Asset Instance.

      • Asset Type: Select the asset type that you want to export.

      • Asset Instance: Select the instances that contain the data that you want to export.

    After you create the export task, you can view the status of the task in the export task list. A larger amount of data requires a longer export period.

Download the sensitive data identification results

After the Export Status of the task changes to Completed, click Download in the Actions column of the task.

Important

Download the exported data within three days after the export is complete. After three days, the task expires and the exported sensitive data can no longer be downloaded.
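If you track export tasks in your own tooling, the three-day download window can be checked with a simple date calculation. This is a sketch under assumptions: the function names are hypothetical, and "three days" is interpreted as 72 hours from the completion time, which this topic does not state explicitly.

```python
from datetime import datetime, timedelta

DOWNLOAD_WINDOW = timedelta(days=3)  # assumed to mean 72 hours after completion


def download_deadline(completed_at: datetime) -> datetime:
    """Latest time at which the exported results can still be downloaded."""
    return completed_at + DOWNLOAD_WINDOW


def can_download(completed_at: datetime, now: datetime) -> bool:
    """True while the export task has not expired."""
    return now <= download_deadline(completed_at)


if __name__ == "__main__":
    done = datetime(2024, 11, 28, 9, 0)
    print(download_deadline(done))                          # 2024-12-01 09:00:00
    print(can_download(done, datetime(2024, 12, 2, 9, 0)))  # False
```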

References