An Introduction and Best Practice of DataWorks Data Security Module

By Liu Tianyuan and Jia Xi, Product Managers of DataWorks

This article is a part of the One-stop Big Data Development and Governance DataWorks Use Collection.

1. Background of Data Security Protection

The history of data usage can be divided into three phases: 1.0, 2.0, and 3.0. In the 1.0 phase, there is a single type of data user, so data security protection is equally simple and relies on institutional norms and post-event audits. In the 2.0 phase, data work becomes structured and multiple roles emerge, such as data development, BI, and mining modeling. Data security protection relies on virtual machine connections. In the 3.0 phase, data volumes grow and data roles become more complex. Discovering value in the data requires more participants, including operations, product, and research and development personnel. At this stage, data security management is generally based on the classification and grading of data, on top of which permission control, desensitization, encryption, auditing, and other access controls are applied.

Enterprises are facing three main problems:

Because the amount of data is large, the first problem for enterprise managers is knowing where sensitive data is distributed. Only with that knowledge can the enterprise apply permission control, encryption, and desensitization.
Almost 50% of sensitive data leakage worldwide is caused by malicious acts by hackers and internal personnel. However, internal data security controls are often neglected in practice. Data leakage causes economic losses and can lead to serious consequences, such as the loss of customers.
Regulatory compliance requirements are becoming stricter for enterprises, and the state places increasing emphasis on data security. At the same time, data security management is harder because of the variety of people using the data.

2. Data Security Governance System

The data security governance system covers three aspects: system, product, and operation. The three must work together, and none is dispensable.

System: Set standards that define the red lines, such as standards for data classification and grading, controls on account permissions, the data security audit process, and standards for data permission applications. This way, the enterprise's data managers and data users understand the standards and know which lines cannot be crossed. The products are then used to implement the system.
Product: Use product functions to implement the standards for classification and grading, permission control, audit processes, and other rules. DataWorks is a comprehensive product suite that covers encryption, grading, desensitization, security permissions, and a range of other data security requirements.
Operation: Finally, operators perform risk management, policy optimization, and security reinforcement to close the loop.

The system stipulates what can and cannot be done. The products then implement these rules. Finally, the operation department rewards, penalizes, and optimizes according to them. This forms a closed loop of data security governance in which system, product, and operation work together.

Therefore, Alibaba Cloud DataWorks combines various engines to provide enterprises with overall out-of-the-box security capabilities. These capabilities include several important data security processes described in the Data Security Capability Maturity Model GB/T37988-2019 (DSMM): transmission, storage, processing, exchange, and general use.

From another perspective, the product security capabilities also cover the pre-event, in-event, and post-event work cycle. They provide comprehensive data risk control capabilities for enterprises, including pre-event and in-event standardized development and production, data availability and invisibility, data risk behavior control, and post-event sensitive data control.

3. Core Security Capabilities

(1) Help Enterprises Build Basic Security Facilities Quickly

DataWorks' isolation of production and development, RBAC role permission system, and visual data permission management, combined with the engine's security features such as fine-grained authorization, encrypted data storage, and data backup, address the pre-event security and border security problems described above. This lets enterprises solve the first problem of security governance.

At the same time, you can use quick diagnostics to find configuration items that may be risky in daily development work, such as publishing without testing, letting the same person both develop and publish, and leaving download permissions uncontrolled.

Most of the requirements in DSMM are met using simple configurations.

(2) Sensitive Data Management

DataWorks Data Security Guard provides automated and intelligent sensitive data protection in common scenarios. As shown in the figure below, an automated classification and grading management system first identifies and classifies users' sensitive data based on the data itself and its metadata. In this process, security personnel may need to configure the rules. The result of automated classification and grading is a classification and grading database, which business personnel can correct and security officers can manage. Based on this database, security controls can be applied at the top security control layer or throughout the whole process of data usage. For example, data display scenarios need desensitization, data usage scenarios need permission control, data output scenarios need review, and all scenarios need auditing.

4. Best Practices of Enterprise Data Security

DataWorks data security capabilities mainly solve four questions:

Pre-Event: What sensitive data is in the enterprise, and where is the sensitive data distributed? (These correspond to the data classification and grading feature of DataWorks Data Security Guard.)
In-Event: How can private data be protected, and how can data be available and invisible? (These correspond to the data desensitization technology of Data Security Guard by DataWorks.) What kind of data resource access and high-risk behavior operations require the approval of the superior? What approval processes are required to complete the operation? (These correspond to the data approval policy and high-risk behavior approval policy definition feature of DataWorks.)
Post-Event: Who is using the data? How are they using the data? Which of these operations are risky? (These correspond to the monitoring and auditing features of DataWorks Data Security Guard.)
After a data breach, how can we find the cause? (This is source tracing, which corresponds to the data watermarking feature of DataWorks Data Security Guard.)

(1) An Introduction to Best Practices for Major Data Security in DataWorks

Intelligent and Automated Classification and Grading Function

The core advantage of DataWorks Data Security Guard is that it can provide a wide range of identification rules and configuration methods.

First, DataWorks Data Security Guard has 50 built-in identification models for personal sensitive information, such as mobile phone numbers, ID numbers, and bank card numbers. Second, there is a custom recognition function: you can define regular expressions or train recognition models. In addition, you can define metadata-based identification yourself. Some sensitive data types have no clear characteristics in the data itself. Salary, for example, consists only of numbers. For such data, you can adopt special naming conventions when creating tables, or you can specify directly that a given column of a given table in a given project is sensitive data of that type.
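
The combination of content rules (regular expressions over values) and metadata rules (naming conventions over column names) can be illustrated with a minimal sketch. This is not the DataWorks implementation or API. The rule tables, the `classify_column` function, the 80% hit ratio, and the phone number pattern are all assumptions chosen for illustration.

```python
import re

# Hypothetical content rules: sensitive type -> pattern matched against cell values.
CONTENT_RULES = {
    # Assumed format: 11 digits starting with 1, second digit 3-9.
    "mobile_phone": re.compile(r"^1[3-9]\d{9}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

# Hypothetical metadata rules: sensitive type -> pattern matched against column names.
# Useful for types such as salary, whose values are just numbers.
METADATA_RULES = {
    "salary": re.compile(r"(^|_)(salary|wage)(_|$)", re.IGNORECASE),
}


def classify_column(column_name, sample_values, min_hit_ratio=0.8):
    """Return the sensitive type of a column, or None if no rule matches."""
    # Metadata rules first: the column name alone decides.
    for sensitive_type, pattern in METADATA_RULES.items():
        if pattern.search(column_name):
            return sensitive_type

    # Content rules: enough of the non-empty sampled values must match.
    values = [v for v in sample_values if v]
    if not values:
        return None
    for sensitive_type, pattern in CONTENT_RULES.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= min_hit_ratio:
            return sensitive_type
    return None
```

In this sketch, a column named `emp_salary` is classified by its name alone, while a column named `contact` is classified as a mobile phone number only if most of its sampled values match the pattern.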

Then, these rules can be propagated along data lineage. A new table may not match any defined rule, but if its source table hits one of the sensitive data types, that type can be propagated to the new table.
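
Propagation along lineage can be pictured as a graph traversal. The sketch below is an illustration under assumptions, not the product's algorithm: the lineage is given as a hypothetical dictionary from each upstream column to its downstream columns, and `propagate_labels` is a made-up helper name.

```python
from collections import deque


def propagate_labels(lineage, labels):
    """Spread sensitive-type labels from labeled columns to their descendants.

    lineage: dict mapping an upstream column to a list of downstream columns.
    labels:  dict mapping a column to its already identified sensitive type.
    Returns a new dict that also contains the inherited labels.
    """
    result = dict(labels)
    queue = deque(labels)
    while queue:
        column = queue.popleft()
        for child in lineage.get(column, []):
            # Keep labels that were identified directly; only fill in gaps.
            if child not in result:
                result[child] = result[column]
                queue.append(child)
    return result


# Hypothetical column names, for illustration only.
lineage = {
    "ods.user.phone": ["dwd.user.contact"],
    "dwd.user.contact": ["ads.report.contact"],
}
labels = {"ods.user.phone": "mobile_phone"}
inherited = propagate_labels(lineage, labels)
```

Here the two downstream columns inherit the `mobile_phone` label even though no rule matched them directly.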

Finally, classification, desensitization, and watermarking can all be performed on top of these identification results. The statistical results can also be displayed as charts on the DataWorks Data Security Guard page.

Data Desensitization-Security Protection during Data Use

Data access in MaxCompute, E-MapReduce, and other engines is aggregated on the big data platform, DataWorks. For data query, migration, and download scenarios, you can configure desensitization flexibly on the DataWorks Data Security Guard page: which desensitization method to apply to which types of sensitive data, and in which scenarios. DataWorks Data Security Guard offers three methods: covering, hash, and pseudonym. Covering is mainly used in BI scenarios, where staff need to analyze the data. For example, the analyst can still tell that a value is a mobile phone number, but its middle four digits are replaced by four asterisks. ETL scenarios may publish production tasks and perform Join operations without needing to know the data characteristics. Hash desensitization suits this case: the original mobile phone number becomes a hash string. Algorithm model scenarios, however, do need the data characteristics. Pseudonym desensitization suits this case: the original mobile phone number becomes another, fake mobile phone number that still looks like a mobile phone number.
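
The three methods can be compared in a minimal sketch. The functions below are illustrative assumptions and not the algorithms DataWorks uses. In particular, the keyed SHA-256 hash and the way the pseudonym is derived are choices made for this example, and `key` is a hypothetical secret.

```python
import hashlib
import hmac


def mask_phone(phone):
    """Covering: keep the first 3 and last 4 digits, hide the middle 4."""
    return phone[:3] + "****" + phone[-4:]


def hash_value(value, key):
    """Hash: keyed SHA-256. Equal inputs give equal outputs, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_phone(phone, key):
    """Pseudonym: keep the 3-digit prefix and derive 8 replacement digits.

    The result is deterministic and still has the shape of a phone number.
    """
    digest = hmac.new(key, phone.encode("utf-8"), hashlib.sha256).digest()
    number = int.from_bytes(digest[:8], "big") % 10**8
    return phone[:3] + f"{number:08d}"


key = b"example-secret"        # hypothetical key, for illustration only
original = "13812345678"       # made-up 11-digit number

masked = mask_phone(original)  # "138****5678"
hashed = hash_value(original, key)
fake = pseudonymize_phone(original, key)
```

Covering keeps the value readable to a human, hashing preserves equality for joins but not format, and the pseudonym preserves the format so that downstream code expecting an 11-digit number keeps working.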

Operation Fraud Detection

Ordinary audit logs let you see who performed which operation at what time, and every record can be viewed, but they do not tell you which operations are risky. DataWorks Data Security Guard provides behavior detection. Built-in expert models distinguish normal operations from potentially problematic ones based on the characteristics, environment, history, and account of user operations. You can also define custom risk rules.
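
A custom risk rule can be thought of as a predicate over operation records. The following sketch is purely illustrative: the `Operation` fields, the row threshold, and the working hours are assumptions, and it does not reflect how rules are actually expressed in DataWorks.

```python
from dataclasses import dataclass
from datetime import datetime

# Actions that move data out of the platform in this hypothetical model.
EXPORT_ACTIONS = {"download", "export"}


@dataclass
class Operation:
    account: str
    action: str      # for example "select", "download", "export"
    rows: int        # number of rows touched
    at: datetime     # when the operation ran


def is_risky(op, max_rows=10_000, work_start=9, work_end=18):
    """Flag bulk or off-hours data exports; other actions are not flagged."""
    if op.action not in EXPORT_ACTIONS:
        return False
    bulk = op.rows > max_rows
    off_hours = not (work_start <= op.at.hour < work_end)
    return bulk or off_hours
```

With these defaults, a 50,000-row download during the day and a 10-row download at 23:00 are both flagged, while an ordinary query is not.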

Data Watermark Tracing

For big data engines, most data operations are completed in DataWorks, the overall big data development and governance platform. Whether data is downloaded, exported in some other way, or queried and then photographed, each of these situations can lead to a data breach. Whichever way the data is accessed, DataWorks Data Security Guard embeds a data watermark in the queried data and records the operation in an operation database. When data is leaked, the user submits the leaked data on the DataWorks Data Security Guard page to query the operation database. DataWorks Data Security Guard can then help trace who may have run which SQL statement at what time. This allows the source to be traced after a data breach.

The preceding are the main functions of DataWorks Data Security Guard and how it combines with the system and operation of an enterprise to form an enterprise's data security best practice.

An Autonomous and Controllable Login Authentication System

Some enterprises prefer to manage a single set of local accounts rather than also managing sub-accounts under a separate set of cloud accounts.

DataWorks supports this requirement. A user can log on to the Alibaba Cloud console with a local account by assuming a RAM role, and then use DataWorks as that role. A RAM role can be added to a DataWorks workspace as a member and can be assumed by one person or by several people. This way, enterprises can unify authentication management and keep the system autonomous and controllable.

Permission Control That Meets the Internal Compliance Requirements of the Enterprise (Process)

DataWorks allows you to define fine-grained data permission control processes and control processes for publishing data service APIs and exporting data synchronization tasks.

How enterprise security managers formulate data security policies is the top priority of security planning. Many questions need careful consideration: What are the high-risk behaviors? Who is likely to engage in them? How can they be avoided? Who can supervise the high-risk behaviors of the personnel involved?

DataWorks predefines control solutions for security managers for several typical high-risk scenarios. In addition to natively supporting fine-grained data permission control (column level, Download/Update/Drop/Alter/Select/Desc), it allows managers to define different approval processes for data at different security levels. You can also set different control processes for certain high-risk behaviors, such as data download, data export, and data service API publishing. Together, these methods enhance data security.

Case 1: Enterprise data is divided into sensitivity levels C1, C2, and C3, from low risk to high risk. When developers apply for access to C1 data, approval can be defined as table owner only. For C2 data, it can be defined as table owner and department head. For C3 data, it can be defined as table owner, department head, and CIO.
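
The level-to-approver mapping in Case 1 can be written down as a small lookup. This sketch only mirrors the example in the text. The role names and the `APPROVAL_CHAINS`, `required_approvers`, and `is_approved` identifiers are hypothetical, and this is not how approval policies are stored in DataWorks.

```python
# Approval chain per sensitivity level, following Case 1.
APPROVAL_CHAINS = {
    "C1": ["table_owner"],
    "C2": ["table_owner", "department_head"],
    "C3": ["table_owner", "department_head", "cio"],
}


def required_approvers(level):
    """Return the ordered list of approver roles for a sensitivity level."""
    if level not in APPROVAL_CHAINS:
        raise ValueError(f"unknown sensitivity level: {level}")
    return list(APPROVAL_CHAINS[level])


def is_approved(level, granted_by):
    """An access request passes only when every required role has approved."""
    return set(required_approvers(level)) <= set(granted_by)
```

For example, a request for C2 data that only the table owner has approved is still pending, because the department head has not yet signed off.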

Case 2: If an enterprise requires strict approval before data is synchronized out of the data warehouse, the administrator can customize a security policy on the source-to-target pair of a Data Integration task. For example, whenever a task matches the rule "MaxCompute data source to MySQL data source", it must pass a predefined approval process.
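
Case 2 amounts to matching a task's source and target types against a rule list. The sketch below is an assumption-laden illustration: the rule structure, the `export_review` flow name, and the `match_policy` function are invented for this example and are not DataWorks configuration.

```python
# Hypothetical source-to-target rules. Each rule names the approval flow
# that a matching synchronization task must pass before it can run.
SYNC_RULES = [
    {"source": "maxcompute", "target": "mysql", "approval_flow": "export_review"},
]


def match_policy(source_type, target_type, rules=SYNC_RULES):
    """Return the approval flow required for a sync task, or None if no rule hits."""
    for rule in rules:
        if (rule["source"] == source_type.lower()
                and rule["target"] == target_type.lower()):
            return rule["approval_flow"]
    return None
```

A MaxCompute-to-MySQL task hits the rule and is routed to the approval flow, while a task in the opposite direction matches nothing and proceeds without additional approval.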

Standardized Production Development Process

DataWorks turns the best practice of isolating production and development environments into a product capability for data production.

The figure below shows the development process in standard mode. First, one DataWorks workspace corresponds to two engine environments, one for development and one for production.

In the data modeling process, the administrator defines the data standards that may be used in modeling. The modeler then designs and submits the model. After the model is verified by the supervisor, O&M, or deployment personnel, it is published to the production environment.

In the data development and production process, developers write code, configure dependencies, and debug in the development environment, then submit a publishing request once smoke testing is done. At this point, an O&M, deployment, or administrator role should perform a code diff review. After confirming that the code is correct, they publish it to the production environment, so that standardized and safe code runs regularly in production to produce data.
