Create a CDH Hive sampling crawler

You can use the sampling crawler feature of DataWorks to sample data from Hive tables in a Cloudera's Distribution Including Apache Hadoop (CDH) cluster so that Data Security Guard can identify sensitive data. If you configure data masking rules in Data Security Guard, data in the sensitive fields that match the rules is masked when you preview the data on the details page of the related table in Data Map. This topic describes how to create and manage CDH Hive sampling crawlers.

Prerequisites

A DataWorks new-version resource group (general-purpose resource group) or an exclusive resource group for scheduling is created. For more information, see Create and use a new-version resource group or Create and use an exclusive resource group for scheduling.
A CDH cluster is registered to a DataWorks workspace. For more information, see Register a CDH or CDP cluster to DataWorks.
The Data Security Guard service is activated and sensitive data identification rules are configured. For more information, see Overview and Configure sensitive data identification rules.

Limits

You can use the sampling crawler feature only in the China (Shanghai) and China (Chengdu) regions.
Sampling is configured per cluster. You can create only one sampling crawler for each cluster, but each crawler can collect sample data from one or more databases in that cluster.
By default, if you do not specify a database for data sampling in a cluster, the data of all databases in the cluster is sampled.
Sample data can be collected by an Alibaba Cloud account or by a RAM user to which the AliyunDataWorksFullAccess policy is attached.
After you create, modify, or delete a CDH Hive table, you must recollect the sample data.
Only manual collection is supported.
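
The limits above can be read as a small set of checks on a crawler's settings. The following Python sketch is only an illustration of those rules, not a DataWorks API: the CrawlerPlan record, the check_plan and effective_databases helpers, and the region labels used as plain strings are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field

# Regions named in the Limits section. These are display names used as
# illustrative labels, not official region IDs.
SUPPORTED_REGIONS = {"China (Shanghai)", "China (Chengdu)"}


@dataclass
class CrawlerPlan:
    """Hypothetical record of one sampling crawler's settings."""
    region: str
    cluster: str
    databases: list[str] = field(default_factory=list)  # empty = all databases
    execution_plan: str = "On-demand Execution"


def check_plan(plan: CrawlerPlan, clusters_with_crawler: set[str]) -> list[str]:
    """Return the limits that the plan violates (an empty list means none)."""
    problems = []
    if plan.region not in SUPPORTED_REGIONS:
        problems.append(f"region {plan.region!r} is not supported")
    if plan.cluster in clusters_with_crawler:
        problems.append(f"cluster {plan.cluster!r} already has a sampling crawler")
    if plan.execution_plan != "On-demand Execution":
        problems.append("only manual (on-demand) collection is supported")
    return problems


def effective_databases(plan: CrawlerPlan, all_databases: list[str]) -> list[str]:
    """Resolve which databases are sampled: no selection means every database."""
    return list(plan.databases) if plan.databases else list(all_databases)
```

For example, a plan for a cluster in China (Shanghai) with no databases selected passes all checks and resolves to every database in the cluster, whereas a second plan for the same cluster is reported as a violation of the one-crawler-per-cluster limit.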

Create a sampling crawler

Go to the DataMap page in the DataWorks console.
In the left-side navigation pane of the DataMap page, click Collect Metadata.
On the Data Source Perspective tab of the page that appears, find CDH Hive (used only for data sampling).
Click Manage in the upper-right corner. The Data Sources for Which Crawler Is Created tab appears.
You can click the Data Sources for Which No Crawler Is Created tab and view the data sources for which no crawler is created on this tab.

Click Create Crawler in the upper-right corner. In the Configure Collection Plan dialog box, configure parameters. The following table describes the parameters.

Figure: Create Crawler dialog box

Parameter	Description
Cluster	The CDH cluster for which you want to collect sample data. You can select one from the CDH clusters that are registered to DataWorks workspaces in the current region. For more information, see Integrate and use CDH and CDP.
Database	The database for which data sampling is performed. By default, if you do not configure this parameter, the data of all databases in the cluster is sampled.
Exclusive Resource Group	The exclusive resource group for scheduling that is connected to the CDH cluster.
Sampling collection service	The CDH component that is used to collect sample data. For more information, see Integrate and use CDH and CDP.
Collection account number	The account that is used to collect sample data. This parameter is set automatically based on the mappings between Alibaba Cloud accounts or RAM users and Kerberos accounts that you configured on the cluster management page when you registered the CDH cluster. For more information, see Create and manage workspaces.
Execution Plan	The frequency of collecting sample data. The parameter can be set only to On-demand Execution.

Click OK.

Manage sampling crawlers

On the Data Sources for Which Crawler Is Created tab, you can view the details of a sampling crawler, including the status, execution plan, last run time, time consumed for the last run, and average running time. You can also perform the following operations on the sampling crawler:

Details: View the configurations of the sampling crawler.
Edit: Modify the configurations of the sampling crawler, including the Cluster and Exclusive Resource Group parameters.
Delete: Delete the sampling crawler.
Run: Run the sampling crawler to collect sample data based on the configurations. After you run the sampling crawler, identified sensitive fields are displayed in Data Security Guard. If you configure data masking rules, data of the sensitive fields that match the rules is masked when you preview the data on the details page of the related table in Data Map.
Stop: Stop the sampling crawler.
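
The effect of the Run operation, in which sampled values lead to identified sensitive fields and matching fields are then masked in the preview, can be sketched conceptually in Python. This is a toy model under stated assumptions, not how Data Security Guard is implemented: the rule dictionaries, the regular expressions, the all-values-must-match criterion, and the identify and preview functions are hypothetical.

```python
import re

# Hypothetical identification rules: sensitive type -> pattern for sampled values.
IDENTIFICATION_RULES = {
    "phone": re.compile(r"\d{11}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}

# Hypothetical masking rules: sensitive type -> masking function.
# No masking rule is defined for "email", so email fields stay readable.
MASKING_RULES = {
    "phone": lambda value: value[:3] + "*" * (len(value) - 3),
}


def identify(samples: dict[str, list[str]]) -> dict[str, str]:
    """Assign a sensitive type to each field whose sampled values all match a rule."""
    found = {}
    for field_name, values in samples.items():
        for kind, pattern in IDENTIFICATION_RULES.items():
            if values and all(pattern.fullmatch(v) for v in values):
                found[field_name] = kind
                break
    return found


def preview(row: dict[str, str], sensitive: dict[str, str]) -> dict[str, str]:
    """Mask only the fields whose sensitive type has a masking rule."""
    shown = {}
    for field_name, value in row.items():
        mask = MASKING_RULES.get(sensitive.get(field_name))
        shown[field_name] = mask(value) if mask else value
    return shown
```

In this model, a field that is identified as sensitive but has no matching masking rule is still listed as sensitive, yet its values appear unmasked in the preview, which mirrors the condition "if you configure data masking rules" in the description above.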

What to do next

After sample data is collected for CDH Hive tables, Data Security Guard can identify the sensitive fields in those tables. If you configure data masking rules in Data Security Guard, data of the sensitive fields that match the rules is masked when you preview the data on the details page of the related table in Data Map. For more information, see Overview and View the details of a table.