By Zhang Huarui, Product Manager of DataWorks
This article is a part of the One-stop Big Data Development and Governance DataWorks Use Collection.
What is data analysis? The figure below depicts the relationship between data, information, and intelligence. We collect data from an external environment, either a software environment, such as log data in a business system, or a hardware environment, such as collecting sensor data. The value density of the collected raw data is usually relatively low. More valuable information will be obtained after processing and extraction. Further analysis and production of the information will result in intelligence.
In general, data analysis is the process of exploring, cleaning, transforming, and modeling data to uncover valuable information and support decision-making with conclusions. Data analysis can make decision-making more scientific and help enterprises operate more effectively.
Data analysis includes the following steps:
1. Requirements Analysis: You should clarify the requirements first, such as what are the core issues of the business, what decisions need to be made, and what information needs to be obtained to support this decision.
2. Data Collection: Once the requirements are identified, data collection can be carried out. However, you may encounter problems during this process. The first problem is that the expected data may not be stored at the technical level, or it may be stored but scattered in different locations in the data warehouse. Data collection needs to solve these problems first.
3. Data Cleaning: The collected data is incomplete, repeated, incorrect, and null. Data cleaning is the process of preventing and correcting these errors.
4. Exploration and Analysis: Exploration and analysis refer to analyzing data with analysis tools. This process may also involve the application of machine learning algorithms.
5. Visualization Presentation: When data analysis is completed, the results are presented to the demand side in a visualization way.
Data analysis includes local data analysis and online data analysis.
The most common local data analysis method is to use Excel. However, since data analysts need to repeat their analysis weekly, monthly, or quarterly, the first problem of using Excel is its low efficiency. Moreover, poor performance is another issue when using Excel for data analysis. Performance becomes a bottleneck when there is a large amount of data to analyze. Excel is limited in the number of rows it can display and the performance of personal computers. Data isolation will occur when using Excel for data analysis. The value of analysis on a single data table is limited, and enterprises pay more attention to the results of analysis by pulling through data from various business systems. The last problem is security risks. The enterprise's data is placed on the personal computer of a local employee, which makes it difficult to control the rights in the sharing and downloading process.
Online data analysis can solve the preceding problems. Online data analysis tools can refresh data to avoid repeated operations. For example, there is a partition table of a user profile in the data analysis module of DataWorks. If you perform a pivot operation on this partition table in September and want to do this operation again in October, you can update the data by copying the configuration in September to the partition in October.
In addition to the great computing power of the computing engine, online data analysis can perform efficient data analysis on large amounts of data. At the same time, it can fetch analysis from databases of different business systems to break down barriers. The DataWorks data analysis module also allows you to export the analysis results into a MaxCompute table or directly share the analysis results with others, so data flow among different systems and people.
Using the online data analysis module allows you to carry out data analysis and share the data without downloading data locally, which is very important. In this process, the authority is controllable to ensure data safety.
DataWorks users often encounter the following questions:
These problems can be solved using the DataWorks analysis module.
The following figure shows how various DataWorks modules work together to fulfill data analysis requirements.
Firstly, data collection is carried out by searching the tables needed for data analysis on the data map. Then, you can view and understand the data by table details, data preview, and data lineage. After that, you need to apply for the query permission of this table or a field. Then, you can go to DataStudio for further processing or go to the data analysis module to use the function of query from data source in data analysis. You can take the numbers from the data source, put them into the table, and carry out data profiling and pivot analysis in the form of a table.
In general, data analysis is suitable for quick data insight analysis, online editing, and data visualization modules. The three functions of the data analysis module include spreadsheets, dimension tables, and reports.
The spreadsheet is the core function of the data analysis module, which provides personal space for data fetching, exploration, and sharing. It takes the spreadsheet as the main body and supports the common table functions, allowing you to get started quickly.
The spreadsheet has the following functions:
The function of query from data sources supports a wide range of data source types. When a data source is added, you can query the field structure, generate a data preview, or generate query statements by double-clicking or dragging. Also, you can click run to generate data query results. Its editor is intelligent and provides functions, such as completion, highlighting prompts, and intelligent error correction. The query code can be saved, and the running history can be traced back.
When you get the data, you can use the data profiling function to gain insight into the data. This function counts the distribution of the values of each field to help you know about the distribution of data. It also supports overview mode and verbose mode. In addition, it supports data screening and multi-level drill down, which can help you determine whether these data need to be cleaned and clarify and determine the cleaning purposes.
The Pivot function is a common function for data analysts. Spreadsheets support local pivot and data source pivot. You can pivot the data of the spreadsheets or pivot the full data of the data sources. The dimensions of the pivot functionality support customized sorting, numeric types can be grouped, and the pivot configuration supports copying.
The saving and sharing of spreadsheets support the template function. You can save the spreadsheet as a template first and create a new one directly from the saved one. You can also solidify the analysis results into a MaxCompute table, generate create table statements, and insert data into it. When you want to share a spreadsheet, you can assign a person or permission.
Dimension table is a simple and efficient table editing tool. In daily work, the operation department often needs to maintain a MaxCompute table online. Generally, they need to ask a research and development colleague to create a new MaxCompute production table. Every time the data is updated, it is necessary to ask research and development colleagues to repeat the process of creating a new production table. This process involves many people but is inefficient.
After using the dimension table, the operation department can create a MaxCompute production table by themselves visually. They can name and describe the fields and click OK to generate an online table. After the table is generated, you can import local data into the table or write data on it directly. You can write or modify data directly in the dimension table, shortening the entire operation process and improving work efficiency.
A report is a visualization tool that provides a variety of report components. You can drag the components to complete a report. After the build is finished, you can share the entire report with others.
An Introduction and Best Practice of DataWorks Data Security Module
1,076 posts | 263 followers
FollowAlibaba Cloud MaxCompute - February 19, 2019
Alibaba Cloud MaxCompute - March 2, 2020
Alibaba Cloud Community - March 7, 2022
Alibaba Cloud Community - March 29, 2022
Alibaba Cloud Community - March 4, 2022
Alibaba Clouder - February 11, 2021
1,076 posts | 263 followers
FollowAlibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.
Learn MoreConduct large-scale data warehousing with MaxCompute
Learn MoreAlibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.
Learn MoreSecure and easy solutions for moving you workloads to the cloud
Learn MoreMore Posts by Alibaba Cloud Community