Organizations need to invest in appropriate data models to draw insights from their data. This article gives an overview of data modeling methods and introduces Alibaba Cloud's Big Data modeling practices.
The explosive growth of the Internet, smart devices, and other forms of information technology in the DT era has seen data growing at an equally impressive rate. The challenge of the era, it seems, is how to classify, organize, and store all of this data.
In a library, we need to classify all books and arrange them on shelves to make sure we can easily access every book. Similarly, if we have massive amounts of data, we need a system or a method to keep everything in order. The process of sorting and storing data is called "data modeling".
A data model is a method by which we can organize and store data. Just as the Dewey Decimal System organizes the books in a library, a data model helps us arrange data according to service, access, and use. Linus Torvalds, the creator of Linux, alluded to the importance of data modeling when he wrote about what makes an excellent programmer: "Poor programmers care about code; good programmers care about the data structure and the relationships between data." Appropriate models and storage environments offer the following benefits to a big data system:
• Performance: Good data models can help us quickly query the required data and reduce I/O throughput.
• Cost: Good data models can significantly reduce unnecessary data redundancy, reuse computing results, and reduce the storage and computing costs for the big data system.
• Efficiency: Good data models can greatly improve user experience and increase the efficiency of data utilization.
• Quality: Good data models make data statistics more consistent and reduce the possibility of computing errors.
Therefore, it is without question that a big data system requires high-quality data modeling methods for organizing and storing data, allowing us to reach the optimal balance of performance, cost, efficiency, and quality.
E. F. Codd was the originator of relational databases: he first proposed the relational model of database systems and pioneered research into relational methodology and relational data theory. Almost every modern company uses relational databases to store and process data, thanks to the rise of an entire generation of database software such as Oracle, Informix, and DB2. Data warehouse systems are no exception: many of them leverage the strengths of relational databases to store and process data, and use data models built on the same relational theory.
Despite the recent rapid growth of storage and computing infrastructure for Big Data, as well as the growing popularity of NoSQL technology, Hadoop, Spark, and Alibaba Cloud's MaxCompute still use SQL for large-scale data processing. Data is stored in tables, and relational theory is used to describe the relationships between data. However, different relational data models are available depending on how you access the data.
The main data operation in an OLTP system is random read/write, so OLTP systems mainly employ entity-relationship models that satisfy 3NF to store data, in order to avoid data redundancy and inconsistency in transaction processing. The main data operation in an OLAP system is batch read/write. OLAP systems focus on data integration and on the performance of large, complex analytical queries rather than on transactional consistency, and therefore call for different data modeling methods.
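To make the contrast concrete, here is a minimal, hypothetical sketch (all table and column names are invented for illustration): the OLTP side keeps order data in 3NF across several narrow tables, while the OLAP side lands the same facts in one wide, denormalized table that is cheap to scan and aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP style: 3NF entity-relationship model; each fact is stored once,
# and random reads/writes touch small rows via keys.
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE orders   (order_id    INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       product_id  INTEGER REFERENCES product(product_id),
                       quantity    INTEGER,
                       order_date  TEXT);
""")

# OLAP style: one wide, denormalized fact table; attributes are copied in
# so that batch scans and aggregations need no joins.
cur.executescript("""
CREATE TABLE order_fact (order_id INTEGER, order_date TEXT,
                         customer_name TEXT, product_name TEXT,
                         price REAL, quantity INTEGER);
""")

# A typical OLAP question runs as a single scan over the fact table.
cur.execute("""
SELECT product_name, SUM(price * quantity) AS revenue
FROM order_fact
GROUP BY product_name
ORDER BY revenue DESC;
""")
print(cur.fetchall())
conn.close()
```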
This article discusses database design for audio/video/picture (pan-content) website data pivot analysis with ApsaraDB RDS for PostgreSQL and HybridDB for PostgreSQL.
Apart from social networking sites and e-commerce websites, people tend to visit popular audio, video, image, and text content websites the most. For web developers and publishers, content management is very important, and data pivoting is an important tool for content management. Video websites are now capable of playback on various devices, such as mobile phones, computers, TV boxes, TVs, and projectors. This means that organizations need to keep track of data including device attributes, member attributes, channel attributes, and so on.
The data to be stored includes IDs, multi-dimensional tags, and multi-valued column tags (for example, movies by certain directors/actors that a user has watched within the last 7 days or the last month).
Generally, a multi-valued column has tens of thousands of distinct values (for example, tens of thousands of actors or movies). A device/person typically has dozens or even hundreds of attribute columns, and a single multi-valued column for a device may hold dozens of values.
2.1. Query for the number of matching targets (the number of selected devices/members) based on any combination of tag conditions.
2.2. Select a group according to any combination of tag conditions, and compute the proportion of each category within certain columns of that group (count, group by, quantile, multidimensional pivoting). A schema and query sketch for these two requirements follows the constraints listed below.
The concurrency requirement is low.
Pagination query for IDs that satisfy any combination of tag conditions (to select devices or members that satisfy the conditions)
The concurrency requirement is low.
The concurrency requirement for point query is high, which may involve tens of thousands of requests per second.
For audio and video websites, there are typically less than a million pieces of content (however, after introducing short video clips or user-generated media, there may be billions of pieces of content).
There can't be more than 10 billion users and devices (based on the world's population). In addition, devices will age, and there won't be more than a billion active devices.
Depending on how finely attributes are abstracted, the number of tag columns may be in the hundreds. Multi-valued columns (such as favorite actors, movies, and directors) may make up a large proportion of them, perhaps around 50%.
The VALUE range of multivalued columns (such as actors, movies, and directors) is expected to be in the millions. (Favorite variety star tag of user A: Wang Han, Zhang Yu, Liu Wei)
There may be dozens of multi-valued column tags, among which "recently watched movies" is generally the most useful. Aside from professional content reviewers, hardly anyone watches movies all day long, so such lists stay relatively short.
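For requirements 2.1 and 2.2 above, a minimal sketch of one possible design in ApsaraDB RDS for PostgreSQL is shown below (the table name, column names, and connection string are hypothetical): multi-valued tags are stored as array columns and indexed with GIN, so arbitrary tag-combination filters remain index-driven.

```python
import psycopg2  # assumes a reachable PostgreSQL instance

conn = psycopg2.connect("dbname=content user=postgres")  # hypothetical DSN
cur = conn.cursor()

# Hypothetical device profile table: scalar tags as plain columns,
# multi-valued tags (recently watched movies, favorite actors) as arrays.
cur.execute("""
CREATE TABLE IF NOT EXISTS device_profile (
    device_id      bigint PRIMARY KEY,
    city           text,
    member_level   int,
    recent_movies  text[],   -- multi-valued tag
    fav_actors     text[]    -- multi-valued tag
);
CREATE INDEX IF NOT EXISTS idx_profile_tags
    ON device_profile USING gin (recent_movies, fav_actors);
""")

# 2.1: count devices matching an arbitrary tag combination
# (&& means "overlaps", @> means "contains" for array columns).
cur.execute("""
SELECT count(*)
FROM device_profile
WHERE city = %s
  AND recent_movies && %s::text[]
  AND fav_actors   @> %s::text[];
""", ("Hangzhou", ["Movie A", "Movie B"], ["Actor X"]))
print(cur.fetchone())

# 2.2: proportion of each member_level within the selected group.
cur.execute("""
SELECT member_level,
       count(*)::numeric / sum(count(*)) OVER () AS ratio
FROM device_profile
WHERE recent_movies && %s::text[]
GROUP BY member_level;
""", (["Movie A", "Movie B"],))
print(cur.fetchall())

conn.commit()
conn.close()
```

A GIN index over the array columns keeps the count and group-by pivot queries from scanning every row, which matters when the tag value space runs into the millions.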
Data is the new corporate currency, but many businesses are failing to effectively analyze and capitalize on petabytes of information because, quite frankly, they don't know where to start.
This is where Alibaba Cloud MaxCompute can help. It is an AI-enabled big data processing platform that helps enterprises unlock the immense value of their data.
The platform offers a combination of data intelligence services, mainly for batch structured data storage and processing. It is cheap to use and can process 100 PB of data in six hours. That's roughly the same amount of data as 100 million HD movies, or one-third of Facebook's entire data warehouse.
Let's look at an example. What do you do with all the data captured from your social media streams? With MaxCompute, you could upload every Facebook like or retweet in a matter of minutes and, using its machine learning tools, gain insights into how the market responds to your promotions and products.
You could break down this information by campaign or date or even mine user characteristics and spending habits to further optimize and personalize your social media streams.
MaxCompute is an incredibly low-cost service. Costing just USD 1.44 to sort 1 TB of data, the platform set a new low-price record in the CloudSort category of the 2016 Sort Benchmark competition.
You can create an ever-expanding ecosystem as project owners, data analysts and developers can work concurrently using MaxCompute. The platform also provides powerful security services and disaster recovery to protect your data.
A single MaxCompute cluster can scale up to 10,000 servers. And your data analysts do not need to adopt a distributed computing model to overcome the limited processing capacities of a single server for big data applications. That’s because MaxCompute uses a distributed model so you can analyze your data without worrying about the service requirements or the underlying model.
With usability and scalability on this scale, MaxCompute is bringing big data analysis to the masses.
Alibaba Cloud launched MaxCompute in Mainland China and Singapore at the start of 2017. In China, the platform has already been used to help ease traffic congestion, diagnose diseases using medical imagery and predict the winner of a singing talent competition.
The MaxCompute service is now available in Hong Kong, Europe, and Australia through the Internet, a classic network or VPC. If you’re not located in those regions you can still connect to the service over the Internet.
In the previous article, we discussed Spark for big data and showed you how to set it up on Alibaba Cloud.
In this blog series, we will walk you through the basics of Hive, including table creation and other underlying concepts for big data applications.
"Our ability to do great things with data will make a real difference in every aspect of our lives," Jennifer Pahlka
There are different ways of executing MapReduce operations. First is the traditional approach, where we write a Java MapReduce program for all types of data. The second is the scripting approach, where Pig is used to process structured and semi-structured data with MapReduce. Then comes the Hive Query Language (HiveQL or HQL), which Hive uses to process structured data with MapReduce.
As discussed in our previous article, Hadoop is a vast array of tools and technologies, and at this point it is more convenient to deploy Hive or Pig. Hive has its advantages over Pig, especially since it makes data reporting and analysis easier through warehousing.
Hive is built on top of Hadoop and used for querying and analysis of data that is stored in HDFS. It is a tool which helps programmers analyze large data sets and access the data easily with the help of a query language called HiveQL. This language internally converts the SQL-like queries into MapReduce jobs for deploying it on Hadoop.
Impala is also commonly mentioned alongside Hive, but if you look closely, Hive has its own established place in the market and therefore broader support. Impala is likewise a query engine built on top of Hadoop, and it makes use of the existing Hive infrastructure that many Hadoop users already have in place for batch-oriented jobs.
The main goal of Impala is to enable fast and efficient SQL operations. Integrating Hive with Impala lets users choose either one for processing or for creating tables. Impala uses a language called ImpalaQL, which is a subset of HiveQL. In this article, we will focus on Hive.
Features of Hive
Traditional relational databases follow "schema on write", and operations such as insertions, updates, and modifications can be performed on them. Borrowing the "write once, read many" (WORM) concept, Hive was instead designed around "schema on read". A typical Hive query runs across multiple Data Nodes, which made it hard to update and modify data across nodes, although this limitation has been addressed in recent versions of Hive.
Hive supports various file formats, such as flat (text) files, SequenceFile, RCFile and ORC files, Avro files, Parquet, and custom input/output formats. Text file is the default file format in Hive.
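As a minimal sketch (the HiveServer2 host, database, and table names are assumptions), the snippet below uses the PyHive client to create an ORC-backed table and run a simple HiveQL aggregation, which Hive compiles into distributed MapReduce (or Tez/Spark) jobs behind the scenes.

```python
from pyhive import hive  # assumes a reachable HiveServer2 endpoint

# Hypothetical connection details.
conn = hive.Connection(host="hiveserver2.example.com", port=10000,
                       database="default")
cur = conn.cursor()

# Schema-on-read: the table definition is applied when data is queried;
# ORC is one of the supported storage formats (text is the default).
cur.execute("""
CREATE TABLE IF NOT EXISTS page_views (
    user_id   STRING,
    url       STRING,
    view_time TIMESTAMP
)
PARTITIONED BY (view_date STRING)
STORED AS ORC
""")

# A simple HiveQL aggregation; Hive turns this into distributed jobs.
cur.execute("""
SELECT url, COUNT(*) AS views
FROM page_views
WHERE view_date = '2019-07-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10
""")
for row in cur.fetchall():
    print(row)

conn.close()
```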
The "Elasticsearch Distribution Consistency Principle Analysis" article series describes the implementation method, principles, and existing problems of consistency models based on Elasticsearch v6.
The previous two articles described the composition of the ES clusters, master election algorithm, master update meta process, and analyzed the consistency issues of the election and Meta update. This article analyzes the data flow in ES, including its write process, PacificA algorithm model, SequenceNumber, Checkpoint and compares the similarities and differences between ES implementation and the standard PacificA algorithm. We will be covering:
Anyone who has used ES knows that each ES index is divided into multiple shards. Shards are distributed across different nodes to enable distributed storage and queries and to support large-scale datasets. Each shard has multiple copies, one of which is the Primary and the others are Replicas. Data is written to the Primary first and then synchronized from the Primary to the Replicas. When reading data, both the Primary and the Replicas accept read requests in order to improve read capacity.
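For illustration only (the endpoint, index name, and settings values are arbitrary), the snippet below uses the official Python client to create an index with 3 primary shards and 1 replica each, matching the shard/copy layout described above; the exact keyword arguments differ slightly across client versions.

```python
from elasticsearch import Elasticsearch  # assumes a reachable ES 6.x cluster

es = Elasticsearch(["http://localhost:9200"])  # hypothetical endpoint

# 3 primary shards, each with 1 replica: writes go to a primary shard first
# and are then replicated; reads may be served by primaries or replicas.
es.indices.create(
    index="my_index",
    body={
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
        }
    },
)

# Index a document (written to the owning primary shard, then replicated)
# and read it back (the read may hit either a primary or a replica copy).
es.index(index="my_index", doc_type="_doc", id="1", body={"msg": "hello"})
print(es.get(index="my_index", doc_type="_doc", id="1"))
```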
With this model, we can see that ES has characteristics such as redundant copies of each shard and reads that scale across those copies.
Some questions naturally come to mind, for example: how is data synchronized from the Primary to the Replicas, and does a read on a Replica always return the latest data?
As you can see, although the general principles of ES data consistency are easy to understand, many details remain unclear. This article focuses on the ES write process, the consistency algorithm used, the SequenceId and Checkpoint design, and other aspects to describe how ES works and to address the questions above. It is important to note that the analysis in this article is based on ES version 6.2; much of the content does not apply to earlier ES versions, such as the 2.x versions.
The job market for architects, engineers, and analytics professionals with Big Data expertise continues to grow. The Academy's Big Data Career path focuses on the fundamental tools and techniques needed to pursue a career in Big Data. Work through our course material, learn different aspects of the Big Data field, and get certified as a Big Data Professional!
This course is associated with Delivery Business Data Analysis. You must purchase the certification package before you are able to complete all lessons for a certificate.
This course is associated with Python Structured Data Processing Quick Start. You must purchase the certification package before you are able to complete all lessons for a certificate.
Understand the basic data and business related to the current take-out industry, and analyze relevant industry data using the Alibaba Cloud Big Data Platform.
The objective of this course is to introduce the core services of Alibaba Cloud Analysis Architecture (E-MapReduce, MaxCompute, Table Store) and to show you some classic use cases.
Quickly understand processing of structured data using Python Pandas through hands-on practice.
The scenario where data is distributed across different regions is common in log analysis. In this scenario, you need to perform hierarchical analysis on user data based on both the logs and the data from databases. The results are written back to databases and can be queried through reporting systems. This requires association (join) queries between Logstores and databases.
MaxCompute, formerly known as the Open Data Processing Service (ODPS), stores and computes structured data in batches, providing solutions for massive data warehouses as well as big data analysis and modeling.
With the data integration service, RDS data can be imported into MaxCompute to achieve large-scale data computing.
Data Lake Analytics does not require any ETL tools. This service allows you to use standard SQL syntax and business intelligence (BI) tools to efficiently analyze your data stored in the cloud with extremely low costs.
AnalyticDB for MySQL is a real-time data warehousing service that can process petabytes of data with high concurrency and low latency. It is fully compatible with the MySQL protocol and SQL:2003 syntax and can perform instant multidimensional analysis and business exploration for huge amounts of data.
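Because AnalyticDB for MySQL speaks the MySQL protocol, a standard MySQL driver is enough to run analytical SQL against it. The sketch below (the endpoint, credentials, and fact table are hypothetical) uses PyMySQL for a simple multidimensional aggregation.

```python
import pymysql  # any MySQL-protocol driver can connect to AnalyticDB for MySQL

# Hypothetical AnalyticDB endpoint and credentials.
conn = pymysql.connect(host="am-example.ads.aliyuncs.com", port=3306,
                       user="analyst", password="***", database="demo")
try:
    with conn.cursor() as cur:
        # A typical multidimensional aggregation over a (hypothetical) fact table.
        cur.execute("""
            SELECT region, product_category, SUM(amount) AS revenue
            FROM orders
            WHERE order_date >= '2020-01-01'
            GROUP BY region, product_category
            ORDER BY revenue DESC
            LIMIT 20
        """)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```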