Organizations need to invest in appropriate data models to draw insights from their data. This article gives an overview of data modeling methods and introduces Alibaba Cloud's Big Data modeling practices.
The explosive growth of the Internet, smart devices, and other forms of information technology in the DT era has seen data growing at an equally impressive rate. The challenge of the era, it seems, is how to classify, organize, and store all of this data.
In a library, we need to classify all books and arrange them on shelves to make sure we can easily access every book. Similarly, if we have massive amounts of data, we need a system or a method to keep everything in order. The process of sorting and storing data is called "data modeling".
A data model is a method by which we can organize and store data. Just as the Dewey Decimal System organizes the books in a library, a data model helps us arrange data according to service, access, and use. Linus Torvalds, the creator of Linux, alluded to the importance of data modeling when he wrote about what makes an excellent programmer: "Poor programmers care about code; good programmers care about the data structure and the relationships between data." Appropriate models and storage environments offer the following benefits to a big data system:
• Performance: Good data models can help us quickly query the required data and reduce I/O throughput.
• Cost: Good data models can significantly reduce unnecessary data redundancy, reuse computing results, and reduce the storage and computing costs for the big data system.
• Efficiency: Good data models can greatly improve user experience and increase the efficiency of data utilization.
• Quality: Good data models make data statistics more consistent and reduce the possibility of computing errors.
Therefore, it is without question that a big data system requires high-quality data modeling methods for organizing and storing data, allowing us to reach the optimal balance of performance, cost, efficiency, and quality.
E. F. Codd was the originator of relational databases: he first proposed the relational model of database systems and pioneered research into relational methodology and relational data theory. Almost every modern company uses relational databases to store and process data, thanks to the rise of an entire generation of database software such as Oracle, Informix, and DB2. Data warehouse systems are no exception: many of them leverage the strengths of relational databases to store and process data, and use data models built on the same relational theory.
Despite the recent rapid growth of storage and computing infrastructure for Big Data, as well as the growing popularity of NoSQL technology, Hadoop, Spark, and Alibaba Cloud's MaxCompute still use SQL for large-scale data processing. Data is stored in tables, and relational theory is used to describe the relationships between data. However, different relational data models are available depending on how you access the data.
The main data operation in an OLTP system is random read/write, so OLTP systems mainly employ entity-relationship models that satisfy 3NF to store data, in order to avoid data redundancy and inconsistency in transaction processing. The main data operation in an OLAP system is batch read/write. OLAP systems focus on data integration and on the performance of large, complex analytical queries rather than on transactional consistency, and therefore call for different data modeling methods.
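To make the contrast concrete, here is a minimal, hypothetical sketch (all table and column names are invented for illustration): the OLTP side keeps order data in 3NF across several narrow tables, while the OLAP side lands the same facts in one wide, denormalized table that is cheap to scan and aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP style: 3NF entity-relationship model; each fact is stored once,
# and random reads/writes touch small rows via keys.
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE orders   (order_id    INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       product_id  INTEGER REFERENCES product(product_id),
                       quantity    INTEGER,
                       order_date  TEXT);
""")

# OLAP style: one wide, denormalized fact table; attributes are copied in
# so that batch scans and aggregations need no joins.
cur.executescript("""
CREATE TABLE order_fact (order_id INTEGER, order_date TEXT,
                         customer_name TEXT, product_name TEXT,
                         price REAL, quantity INTEGER);
""")

# A typical OLAP question runs as a single scan over the fact table.
cur.execute("""
SELECT product_name, SUM(price * quantity) AS revenue
FROM order_fact
GROUP BY product_name
ORDER BY revenue DESC;
""")
print(cur.fetchall())
conn.close()
```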
This article discusses database design for audio/video/picture (pan-content) website data pivot analysis with ApsaraDB RDS for PostgreSQL and HybridDB for PostgreSQL.
Apart from social networking sites and e-commerce websites, people tend to visit popular audio, video, image, and text content websites the most. For web developers and publishers, content management is very important, and data pivoting is an important tool for content management. Video websites are now capable of playback on various devices, such as mobile phones, computers, TV boxes, TVs, and projectors. This means that organizations need to keep track of data including device attributes, member attributes, channel attributes, and so on.
The data to be stored includes IDs, multi-dimensional tags, and multi-valued column tags (for example, movies by certain directors/actors that a user has watched within the last 7 days or the last month).
Generally, a multi-valued column has tens of thousands of distinct values (for example, tens of thousands of actors or movies). A device/person typically has dozens or even hundreds of attribute columns, and a single multi-valued column for a device may hold dozens of values.
2.1. Query for the number of matching targets (the number of selected devices/members) based on any combination of tag conditions.
2.2. Select a group according to any combination of tag conditions, and compute the proportion of each category within certain columns of that group (count, group by, quantile, multidimensional pivoting). A schema and query sketch for these two requirements follows the constraints listed below.
The concurrency requirement is low.
Pagination query for IDs that satisfy any combination of tag conditions (to select devices or members that satisfy the conditions)
The concurrency requirement is low.
The concurrency requirement for point query is high, which may involve tens of thousands of requests per second.
For audio and video websites, there are typically less than a million pieces of content (however, after introducing short video clips or user-generated media, there may be billions of pieces of content).
There can't be more than 10 billion users and devices (based on the world's population). In addition, devices will age, and there won't be more than a billion active devices.
Depending on how finely attributes are abstracted, the number of tag columns may be in the hundreds. Multi-valued columns (such as favorite actors, movies, and directors) may make up a large proportion of them, perhaps around 50%.
The VALUE range of multivalued columns (such as actors, movies, and directors) is expected to be in the millions. (Favorite variety star tag of user A: Wang Han, Zhang Yu, Liu Wei)
There may be dozens of multi-valued column tags, among which "recently watched movies" is generally the most useful. Aside from professional content reviewers, hardly anyone watches movies all day long, so such lists stay relatively short.
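For requirements 2.1 and 2.2 above, a minimal sketch of one possible design in ApsaraDB RDS for PostgreSQL is shown below (the table name, column names, and connection string are hypothetical): multi-valued tags are stored as array columns and indexed with GIN, so arbitrary tag-combination filters remain index-driven.

```python
import psycopg2  # assumes a reachable PostgreSQL instance

conn = psycopg2.connect("dbname=content user=postgres")  # hypothetical DSN
cur = conn.cursor()

# Hypothetical device profile table: scalar tags as plain columns,
# multi-valued tags (recently watched movies, favorite actors) as arrays.
cur.execute("""
CREATE TABLE IF NOT EXISTS device_profile (
    device_id      bigint PRIMARY KEY,
    city           text,
    member_level   int,
    recent_movies  text[],   -- multi-valued tag
    fav_actors     text[]    -- multi-valued tag
);
CREATE INDEX IF NOT EXISTS idx_profile_tags
    ON device_profile USING gin (recent_movies, fav_actors);
""")

# 2.1: count devices matching an arbitrary tag combination
# (&& means "overlaps", @> means "contains" for array columns).
cur.execute("""
SELECT count(*)
FROM device_profile
WHERE city = %s
  AND recent_movies && %s::text[]
  AND fav_actors   @> %s::text[];
""", ("Hangzhou", ["Movie A", "Movie B"], ["Actor X"]))
print(cur.fetchone())

# 2.2: proportion of each member_level within the selected group.
cur.execute("""
SELECT member_level,
       count(*)::numeric / sum(count(*)) OVER () AS ratio
FROM device_profile
WHERE recent_movies && %s::text[]
GROUP BY member_level;
""", (["Movie A", "Movie B"],))
print(cur.fetchall())

conn.commit()
conn.close()
```

A GIN index over the array columns keeps the count and group-by pivot queries from scanning every row, which matters when the tag value space runs into the millions.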
Data is the new corporate currency, but many businesses are failing to effectively analyze and capitalize on petabytes of information because, quite frankly, they don't know where to start.
This is where Alibaba Cloud MaxCompute can help. It is an AI-enabled big data processing platform that helps enterprises unlock the immense value of their data.
The platform offers a combination of data intelligence services, mainly for batch structured data storage and processing. It is cheap to use and can process 100 PB of data in six hours. That's roughly the same amount of data as 100 million HD movies, or one-third of Facebook's entire data warehouse.
Let's look at an example. What do you do with all the data captured from your social media streams? With MaxCompute, you could upload every Facebook like or retweet in a matter of minutes and, using its machine learning tools, gain insights into how the market responds to your promotions and products.
You could break down this information by campaign or date or even mine user characteristics and spending habits to further optimize and personalize your social media streams.
MaxCompute is an incredibly low-cost service. Costing just USD 1.44 to sort 1 TB of data, the platform set a new low-price record in the CloudSort category of the 2016 Sort Benchmark competition.
You can create an ever-expanding ecosystem as project owners, data analysts and developers can work concurrently using MaxCompute. The platform also provides powerful security services and disaster recovery to protect your data.
A single MaxCompute cluster can scale up to 10,000 servers. And your data analysts do not need to adopt a distributed computing model to overcome the limited processing capacities of a single server for big data applications. That’s because MaxCompute uses a distributed model so you can analyze your data without worrying about the service requirements or the underlying model.
With usability and scalability on this scale, MaxCompute is bringing big data analysis to the masses.
Alibaba Cloud launched MaxCompute in Mainland China and Singapore at the start of 2017. In China, the platform has already been used to help ease traffic congestion, diagnose diseases using medical imagery and predict the winner of a singing talent competition.
The MaxCompute service is now available in Hong Kong, Europe, and Australia through the Internet, a classic network or VPC. If you’re not located in those regions you can still connect to the service over the Internet.
In the previous article, we discussed Spark for big data and showed you how to set it up on Alibaba Cloud.
In this blog series, we will walk you through the basics of Hive, including table creation and other underlying concepts for big data applications.
"Our ability to do great things with data will make a real difference in every aspect of our lives," Jennifer Pahlka
There are different ways of executing MapReduce operations. First is the traditional approach, where we write a Java MapReduce program for all types of data. The second is the scripting approach, where Pig is used to process structured and semi-structured data with MapReduce. Then comes the Hive Query Language (HiveQL or HQL), which Hive uses to process structured data with MapReduce.
As discussed in our previous article, Hadoop is a vast array of tools and technologies, and at this point it is more convenient to deploy Hive or Pig. Hive has its advantages over Pig, especially since it makes data reporting and analysis easier through warehousing.
Hive is built on top of Hadoop and used for querying and analysis of data that is stored in HDFS. It is a tool which helps programmers analyze large data sets and access the data easily with the help of a query language called HiveQL. This language internally converts the SQL-like queries into MapReduce jobs for deploying it on Hadoop.
Impala is also commonly mentioned alongside Hive, but if you look closely, Hive has its own established place in the market and therefore broader support. Impala is likewise a query engine built on top of Hadoop, and it makes use of the existing Hive infrastructure that many Hadoop users already have in place for batch-oriented jobs.
The main goal of Impala is to enable fast and efficient SQL operations. Integrating Hive with Impala lets users choose either one for processing or for creating tables. Impala uses a language called ImpalaQL, which is a subset of HiveQL. In this article, we will focus on Hive.
Features of Hive
Traditional relational databases follow "schema on write", and operations such as insertions, updates, and modifications can be performed on them. Borrowing the "write once, read many" (WORM) concept, Hive was instead designed around "schema on read". A typical Hive query runs across multiple Data Nodes, which made it hard to update and modify data across nodes, although this limitation has been addressed in recent versions of Hive.
Hive supports various file formats, such as flat (text) files, SequenceFile, RCFile and ORC files, Avro files, Parquet, and custom input/output formats. Text file is the default file format in Hive.
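As a minimal sketch (the HiveServer2 host, database, and table names are assumptions), the snippet below uses the PyHive client to create an ORC-backed table and run a simple HiveQL aggregation, which Hive compiles into distributed MapReduce (or Tez/Spark) jobs behind the scenes.

```python
from pyhive import hive  # assumes a reachable HiveServer2 endpoint

# Hypothetical connection details.
conn = hive.Connection(host="hiveserver2.example.com", port=10000,
                       database="default")
cur = conn.cursor()

# Schema-on-read: the table definition is applied when data is queried;
# ORC is one of the supported storage formats (text is the default).
cur.execute("""
CREATE TABLE IF NOT EXISTS page_views (
    user_id   STRING,
    url       STRING,
    view_time TIMESTAMP
)
PARTITIONED BY (view_date STRING)
STORED AS ORC
""")

# A simple HiveQL aggregation; Hive turns this into distributed jobs.
cur.execute("""
SELECT url, COUNT(*) AS views
FROM page_views
WHERE view_date = '2019-07-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10
""")
for row in cur.fetchall():
    print(row)

conn.close()
```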
The "Elasticsearch Distribution Consistency Principle Analysis" article series describes the implementation method, principles, and existing problems of consistency models based on Elasticsearch v6.
The previous two articles described the composition of the ES clusters, master election algorithm, master update meta process, and analyzed the consistency issues of the election and Meta update. This article analyzes the data flow in ES, including its write process, PacificA algorithm model, SequenceNumber, Checkpoint and compares the similarities and differences between ES implementation and the standard PacificA algorithm. We will be covering:
Anyone who has used ES knows that each ES index is divided into multiple shards. Shards are distributed across different nodes to enable distributed storage and queries and to support large-scale datasets. Each shard has multiple copies, one of which is the Primary and the others are Replicas. Data is written to the Primary first and then synchronized from the Primary to the Replicas. When reading data, both the Primary and the Replicas accept read requests in order to improve read capacity.
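For illustration only (the endpoint, index name, and settings values are arbitrary), the snippet below uses the official Python client to create an index with 3 primary shards and 1 replica each, matching the shard/copy layout described above; the exact keyword arguments differ slightly across client versions.

```python
from elasticsearch import Elasticsearch  # assumes a reachable ES 6.x cluster

es = Elasticsearch(["http://localhost:9200"])  # hypothetical endpoint

# 3 primary shards, each with 1 replica: writes go to a primary shard first
# and are then replicated; reads may be served by primaries or replicas.
es.indices.create(
    index="my_index",
    body={
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1,
        }
    },
)

# Index a document (written to the owning primary shard, then replicated)
# and read it back (the read may hit either a primary or a replica copy).
es.index(index="my_index", doc_type="_doc", id="1", body={"msg": "hello"})
print(es.get(index="my_index", doc_type="_doc", id="1"))
```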
With this model, we can see that ES has characteristics such as redundant copies of each shard and reads that scale across those copies.
Some questions naturally come to mind, for example: how is data synchronized from the Primary to the Replicas, and does a read on a Replica always return the latest data?
As you can see, although the general principles of ES data consistency are easy to understand, many details remain unclear. This article focuses on the ES write process, the consistency algorithm used, the SequenceId and Checkpoint design, and other aspects to describe how ES works and to address the questions above. It is important to note that the analysis in this article is based on ES version 6.2; much of the content does not apply to earlier ES versions, such as the 2.x versions.
The job market for architects, engineers, and analytics professionals with Big Data expertise continues to grow. The Academy's Big Data Career path focuses on the fundamental tools and techniques needed to pursue a career in Big Data. Work through our course material, learn different aspects of the Big Data field, and get certified as a Big Data Professional!
This course is associated with Delivery Business Data Analysis. You must purchase the certification package before you are able to complete all lessons for a certificate.
This course is associated with Python Structured Data Processing Quick Start. You must purchase the certification package before you are able to complete all lessons for a certificate.
Understand the basic data and business related to the current take-out industry, and analyze relevant industry data using the Alibaba Cloud Big Data Platform.
The objective of this course is to introduce the core services of Alibaba Cloud Analysis Architecture (E-MapReduce, MaxCompute, Table Store) and to show you some classic use cases.
Quickly understand processing of structured data using Python Pandas through hands-on practice.
The scenario where data is distributed across different regions is common in log analysis. In this scenario, you need to perform hierarchical analysis on user data based on both the logs and the data from databases. The results are written back to databases and can be queried through reporting systems. This requires association (join) queries between Logstores and databases.
MaxCompute, formerly known as the Open Data Processing Service (ODPS), stores and computes structured data in batches, providing solutions for massive data warehouses as well as big data analysis and modeling.
With the data integration service, RDS data can be imported into MaxCompute to achieve large-scale data computing.
Data Lake Analytics does not require any ETL tools. This service allows you to use standard SQL syntax and business intelligence (BI) tools to efficiently analyze your data stored in the cloud with extremely low costs.
AnalyticDB for MySQL is a real-time data warehousing service that can process petabytes of data with high concurrency and low latency. It is fully compatible with the MySQL protocol and SQL:2003 syntax and can perform instant multidimensional analysis and business exploration for huge amounts of data.
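Because AnalyticDB for MySQL speaks the MySQL protocol, a standard MySQL driver is enough to run analytical SQL against it. The sketch below (the endpoint, credentials, and fact table are hypothetical) uses PyMySQL for a simple multidimensional aggregation.

```python
import pymysql  # any MySQL-protocol driver can connect to AnalyticDB for MySQL

# Hypothetical AnalyticDB endpoint and credentials.
conn = pymysql.connect(host="am-example.ads.aliyuncs.com", port=3306,
                       user="analyst", password="***", database="demo")
try:
    with conn.cursor() as cur:
        # A typical multidimensional aggregation over a (hypothetical) fact table.
        cur.execute("""
            SELECT region, product_category, SUM(amount) AS revenue
            FROM orders
            WHERE order_date >= '2020-01-01'
            GROUP BY region, product_category
            ORDER BY revenue DESC
            LIMIT 20
        """)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```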