×
Community Blog Exploring Application and Analysis of Big Data in The Cloud

Exploring Application and Analysis of Big Data in The Cloud

Learn the technology of big data, this tutorial will clear you how to application and analysis big data in the cloud.

Eight Things You Should Know about Big Data

As a senior technical expert at Alibaba Group, I will share my thoughts on what there is to say about big data, past, present, future.

First, before we can get anywhere, we will need to get clear what exactly "big data" is. The concept of big data has been around for more than 10 years, but there has never been an accurate definition, which is perhaps not required. Data engineers see big data from a technical and system perspective, whereas data analysts see big data from a product perspective. But, as far as I can see, big data cannot be summarized as a single technology or product, but rather it comprises a comprehensive, complex discipline surrounding data. I look at big data from mainly two aspects: the data pipeline (the horizontal axis in the following figure) and the technology stack (the vertical axis in the following figure).

big data figure

Social media has always focused on emphasizing, even sensationalizing the "big" in big data. I don't think this really shows big data actually is, and I don't really like the terming "big data." I'd prefer just to say "data" because the essence of big data essentially just lies in the concept and application of "data".

After figuring out what big data is, we have to look at where big data is on the maturity curve. Looking at the development of technologies in the past, each emerging technology has its own lifecycle that is typically characterized by the technology maturity curve seen below.

technology maturity curve

People tend to be optimistic about an emerging technology at its inception, holding high expectations that this technology will bring great changes to society. As a result, the technology gains popularity at a considerable speed but then reaches a peak at which time people begin to realize that this new technology is not as revolutionary as originally envisioned and their interests wane. After this, everyone's thoughts about this technology then falls into the trough of disillusionment. But, after a certain period of time, people start to understand how the technology can benefit them and start to apply the technology in a productive way, with the technology gaining mainstream adoption.

The Two Stages of Big Data in History

After going through the peak of inflated expectations and trough of disillusionment, big data is now at a stage of steady development. We can verify this from the curve of big data on Google Trends. Big data started to garner everyone's attention somewhere around 2009, reaching its peak at around 2015, and then attention slowly fell. Of course, this curve does not fit perfectly with the preceding technology maturity curve. But you can imagine that this has something to do with what this data represents. That is, one could imagine and suppose that a downward shift in the actual technology curve may lead to the increase in searches for this technology on Google and other search engines.

Related Blogs

MaxCompute: Serverless Big Data Service with High Reliability

This article mainly describes the architecture and features of Alibaba Cloud's general-purpose computing engine MaxCompute.

By Yongming Wu

On January 18, 2019, the Big Data Technology Session of Alibaba Cloud Yunqi Developer Salon, co-hosted by the Alibaba Cloud MaxCompute developer community and Alibaba Cloud Yunqi Community, was held at Beijing Union University. During this session, Wu Yongming, Senior Technical Expert at Alibaba, shared MaxCompute, a serverless big data service with high availability, and the secrets behind the low computing cost of MaxCompute.

This article is based on the video of the speech and the related presentation slides.

Big Data in Alibaba

First, let's see some background information about big data technologies at Alibaba. As shown in the following figure, Alibaba began to establish a network of big data technologies very early, and it was safe to say that Alibaba Cloud was founded to help Alibaba solve technical problems related to big data. Currently, almost all Alibaba business units are using big data technologies. Big data technologies are applied both widely and deeply in Alibaba. Additionally, the whole set of big data systems at Alibaba Group are integrated together.
big data systems

Overview of Alibaba Cloud's Computing Platform

The Alibaba Cloud Computing Platform business unit is responsible for the integration of Alibaba big data systems and R&D related to storage and computing across the whole Alibaba Group. The following figure shows the structure of the big data platform of Alibaba, where the underlying layer is the unified storage platform - Apsara Distributed File System that is responsible for storing big data. Storage is static, and computing is required for mining data value. Therefore, the Alibaba big data platform also provides a variety of computing resources, including CPU, GPU, FPGA, and ASIC. To make better use of these computing resources, we need unified resource abstraction and efficient management. The unified resource management system in Alibaba Cloud is called Job Scheduler. Based on this resource management and scheduling system, Alibaba Cloud has developed a variety of computing engines, such as the general-purpose computing engine MaxCompute, the stream computing engine Blink, the machine learning engine PAI, and the graph computing engine Flash. In addition to these computing engines, the big data platform also provides various development environments, on which the implementation of many services is based.

Big Data Application: Technical Architecture of a Big Data Platform

How should we build the architecture of a big data platform? This article studies the case of OpSmart Technology to elaborate on the business and data architecture of the IoT for enterprises.

Abstract: How should we design the architecture of a big data platform? Are there any good use cases for this architecture? This article studies the case of OpSmart Technology to elaborate on the business and data architecture of Internet of Things for enterprises, as well as considerations during the technology selection process.

How should we build the architecture of a big data platform? Are there any good use cases for this architecture? This article studies the case of OpSmart Technology to elaborate on the business and data architecture of the Internet of Things for enterprises, as well as considerations during the technology selection process.

Based on the "Internet + big data + airport" model, OpSmart Technology provides wireless network connectivity services on-the-go to 640 million users every year. As the business expanded, OpSmart technology faced the challenge of increasing amounts of data. To cope with this, OpSmart Technology took the lead to build an industry-leading big data platform in 2016 with Alibaba Cloud products.

Below are some tips shared by OpSmart Technology's big data platform architect:

Business architecture

Business architecture

OpSmart Technology's business architecture is shown in the figure above. Our primary business model is to use our own devices to explore value in the data, and then apply the data to our business.

On the collection layer, we founded the first official Wi-Fi brand for airports in China, "Airport-Free-WiFi", covering 25 hub airports and 39 hub high-speed rail stations nationwide and providing wireless network services on-the-go to 640 million people each year. We also have the nation's largest Wi-Fi network for driving schools and our driving school Wi-Fi network is expected to cover 1,500-plus driving schools by the end of 2017. We are also the Wi-Fi provider of China's four major auto shows (Beijing, Shanghai, Guangzhou, and Chengdu) to serve more than 1.2 million people. In addition, we are also running the Wi-Fi network for 2,000-plus gas stations and 600-plus automobile 4S (sales, spare parts, service, survey) stores across the country.

On the data application layer, we connected online and offline behavioral data for user profiling to provide more efficient and precise advertisement targeting including SSP, DSP, DMP and RTB. We also worked with the Ministry of Public Security to eliminate public network security threats.
OpSmart Technology's big data and advertising platforms also offer technical capabilities for enterprises to help them establish their own big data platforms and improve their operation management efficiency with a wealth of quantitative data.

Data architecture

Data architecture

We abstracted our data architecture, which contains a number of themes as shown in the figure. The subject in the figure can be understood as users, and the object can be understood as things. The subject and object are connected through various forms. Such connections are established in time and space and are completed through computer and telecommunication networks. The subject has its own reflection in the connection network, which can be understood as a virtual identity (Avatars). The object also has its own reflection in the connection network, such as the Wikipedia description of a topic, or a commercialized product or service. These reflections are then packaged by advertisements as an advertising image. All these are object mirrors. The interaction between the subject and the object is actually the interaction between the subject image and the object image, and such interactions leave traces in both time and space.

The individual and group characteristics of the subject and object, as well as the subject-object relationships, all constitute big data. Through in-depth mining and learning, this information will give birth to powerful insights and have immeasurable value to businesses.

Drilling into Big Data – A Gold Mine of Information (1)

This blog series is aimed at showing you how to make effective use of Big Data and Business Intelligence to decipher insights quickly from raw enterprise data.
By Priyankaa Arunachalam, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud's incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.

The volume of data generated every day is a mystery as it is increasing continually at a rapid rate. Although data is everywhere, the intelligence that we can glean from it matters more. These large volumes of data is what we call "Big Data". Organizations generate and gather huge volumes of data believing that this data might help them in advancing their products and improving their services. For example, a shop may have its customer information, stock details, purchase history, and website visits.

Often times, organizations store these data for regular business activities but fail to use it for further Analytics and Business Relationships. This data which is unanalyzed and left unused is what we call "Dark Data".

"Big Data is indeed a buzzword, but it is one that is frankly under-hyped," Ginni Rometty

The problem of untangling insights from data obtained from multiple sources has been around from the day when software applications were found. This is normally time consuming and becomes obsolete for any form of decision making with the data moving so fast. The main aim of this blog series is to make effective use of big data and extend the use of business intelligence to decipher insights quickly and accurately from raw enterprise data on Alibaba Cloud.

What Is Big Data?

In the simplest terms, when the data you have is too large to be stored and analyzed by traditional databases and processing tools, then it is "Big Data". If you have heard about the 3Vs of big data, then it is simple to understand the underlying definition of big data.

  1. Volume - Massive amount of data from various sources.
  2. Variety - Variety of non-standard formats in which the data is generated both by human and machines.
  3. Velocity - High Velocity at which the data is generated, stored, processed and retrieved

Big Data and Analytics

Every individual and organization has data in one form or another, which they tried managing using spreadsheets, Word documents, and databases. With emerging technologies, the size and variety of data is increasing day by day, and it is no longer possible to analyze the data through traditional means.

The most important aspect of big data analytics is understanding your data. A good way to do this is to ask yourself these questions:

  1. Where do I get this data?
  2. What sort of data it is?
  3. How do I bring this data into a big data environment?
  4. Where do I store this received data?
  5. How do I process and analyze the stored data?
  6. What insights can I bring out of it?
  7. How can these insights transform my Business?

Before exploring Alibaba Cloud's E-MapReduce, in this article we will target answering the above listed questions to get started with big data.

Data Sources and Types

Data is typically generated when a user interact with a physical device, software, or system. These interactions can be classified into three types:

  1. Transactional Data - The most important data to be considered. It is the data recorded by huge Retailers and B2B companies on daily-basis. It is collected based on every event that occurs, for example, number of products, products purchased, stock modified, customer information, distributors details and lots more.
  2. Social Data - Common data or public data which can provide remarkable insights to companies. For example, a customer may Tweet about a product, or like and comment about a purchase. This can help companies predict the consumer behavior, their purchasing patterns and sentiments, which is typically a kind of CRM data.
  3. Machinery Data - This is one major source of real-time data where we get data from electronic devices, such as sensors, machines, and even web logs.

For most enterprises, data can be categorized into the following types.

  1. Structured Data - When you are able to place data in a relational database with a schema enforced, then data is called "Structured". Analyzing becomes easier due to pre-defined structures and relations between data. A common type of structured data is a table.
  2. Unstructured Data - Though Big Data is a collection of variety of data, survey says that 90% of Big Data is unstructured. Data that has its own internal structure but does not clearly fits into a database is termed to be "Unstructured". This includes text documents, audio, video, image files, mails, presentations, web contents and streaming data.
  3. Semi structured Data - This type of data cannot be accommodated in a relational database but can be tagged, which makes analyzing easier. XML, JSON and NoSQL databases are considered to be semi structured.

Exploring Blockchain and Big Data with Alibaba Cloud Data Lake Analytics

Wang Liang, CEO of YouYue Beijing Tech Inc., shares his experience of using an Alibaba Cloud Data Lake Analytics for big data analysis on blockchain logs.

By Wang Liang, CEO of YouYue Beijing Tech Inc.

I started my own company in the business of big data for promotion. A major difficulty that I had to conquer was processing offline data and cold backup data.

We needed to collect hundreds of millions of search results from the Apple's App Store. To reduce costs, we only kept data from the past one month in the online database and backed up historical data in Alibaba Cloud Object Storage Service (OSS). When we needed the historical data, we had to import it to the database or Hadoop cluster. Similar methods are used for data like users' click logs by companies with limited budgets. This method is inconvenient as it requires a large amount of work for ETL import and export. Currently, frequent requirements such as historical data backtracking and analysis, due to increasingly complex data operations, are consuming a considerable part of R&D engineers' time and energy.

Alibaba Cloud's Data Lake Analytics was like a ray of hope. It allows us to query and analyze data in the OSS using simple database methods that supports MySQL. It simplifies big data analysis and delivers satisfying performance. To a certain extent, it can directly support some of the online services.

Here I'd like to share my experience, using an example of big data analysis for blockchain logs. I picked blockchain data because the data volume is huge, which can fully demonstrate the features of Data Lake Analytics. Another reason is that all the blockchain data is open.

Experimental Data Set

The data set used in this article contains all the data of Ethereum as of June 2018. More than 80% of today's DApps are deployed in Ethereum, so the data analysis and mining are of high values.

The data logic of Ethereum is chained data blocks, and each block contains detailed transaction data. The transaction data is similar to common access logs, which include the From (source user address), To (destination user address), and Data (sent data) fields. For example, ETH exchange between users and Ethereum program (smart contract) invoking by users are completed through transactions. Our experimental data set contains about 5 million blocks and 170 million transaction records.

We can see the data structure of Ethereum in Ethscan.io, as shown in the following figure:
data structure

Figure 1. Data structure of Ethereum (blocks on the right and transaction data on the right)

By the time when this article was completed (July 25, 2018), there were around 6 million blocks and over 200 million transaction records in Ethereum. We also took out the Token application data from the transaction data because Token (also known as virtual currency) is still a main application. Based on the data, we can analyze token transactions that are worth thousands of billions of US dollars.

Data Acquisition Process

The process of obtaining Ethereum data includes three steps. The entire process is completed in the CentOS 7 64-bit ECS server environment of Alibaba Cloud. We recommend preparing two 200 GB disks, one for original Ethereum data and the other for data exported from Ethereum, and ensure that the memory is not less than 4 GB.

Related Courses

Big Data Analysis Career Learning Path

The job market for architects, engineers, and analytics professionals with Big Data expertise continue to increase. The Academy’s Big Data Career path focuses on the fundamental tools and techniques needed to pursue a career in Big Data. Work through our course material, learn different aspects of the Big Data field, and get certified as a Big Data Professional!

ACA Big Data Certification

Alibaba Cloud's Big Data Certification Associate is designed for engineers who can use Alibaba Cloud Big Data products. It covers basic distributed system theory and Alibaba Cloud's core products like MaxCompute, DataWorks, E-MapReduce and ecosystem tools.

Introduction to Big Data Platform on the Cloud

Learn how to utilize data to make better business decisions. Optimize Alibaba Cloud's big data products to get the most value out of your data.

Alibaba Cloud Big Data Quickstart Series: Data Integration

This course is associated with Alibaba Cloud Big Data Quickstart Series: Data Integration. You must purchase the certification package before you are able to complete all lessons for a certificate.

Related Market Products

Introduction to Big Data Platform on the Cloud

Learn how to utilize data to make better business decisions. Optimize Alibaba Cloud's big data products to get the most value out of your data.

Alibaba Cloud Big Data Products Overview

This course briefly explains the basic knowledge of Alibaba Cloud big data product system and several products in large data applications, such as MaxCompute, DataWorks, RDS, DRDS, QuickBI, TS, Analytic DB, OSS, Data Integration, etc. Students can refer to the application scenarios explained, combine with the enterprise's own business and demand, apply what we have learned to practice.

How to Migrate Data to Big Data Platform on the Cloud

Quickly learn the basic concepts and architectures of MaxCompute and DataWorks, and how to perform data integration with multiple data sources and formats.

Related Documentation

Big data instance families

Description

Big data instance families are designed to deliver cloud computing and storage for a large amount of data to support big data-oriented business needs. These instances are applicable to scenarios that require offline computing and storage of a large amount of data, such as Hadoop distributed computing, extensive log processing, and large-scale data warehousing. The instance families are suitable for businesses that use a distributed network and require high-performance storage systems.

These instance families are suitable for customers in Internet, finance, and other industries that need to compute, store, and analyze big data. Big data instance families use local storage to ensure a large amount of storage space and high storage performance.

Common features of big data instances:

  1. Enterprise-level computing power ensures efficient and stable data processing.
  2. More bandwidth per instance and higher packet forwarding rates enhance network performance. Enhanced network performance ensures that the network can withstand the demands at peak times.
  3. When you create an instance for the first time, disks require time to warm up before they can achieve optimal performance. Big data instances provide a sequential read and write performance of 190 MB/s for a single disk and a maximum storage throughput of 5 GB/s for a single instance. This can shorten the time of reading data from and writing data to Hadoop Distributed File System (HDFS) files.
  4. The cost of local storage is 97% lower than that of standard SSDs, a huge reduction in costs for building Hadoop clusters.

When you use big data instances, take note of the following items:

  1. Instances with local SSDs do not support instance type, bandwidth, or billing method changes or automatic recovery upon host failures.
  2. The associations of instances to local disks are fixed. The number and capacity of local disks of an instance are based on the instance type. Instances with local SSDs are bound to their local disks. You cannot attach additional local disks to these instances or detach local disks from these instances and attach the disk to another instance.
    You cannot create snapshots for local disks. If you need to create an image for the system disk and data disks of an instance with local SSDs, we recommend that you create an image by using the snapshots of both the system disk and data disks (data disks must be non-local disks).
  3. You cannot create images that contain system disks and data disks based on instance IDs.
  4. You can attach a standard SSD to an instance with local SSDs. The capacity of the standard SSD is scalable.
  5. Local disks are attached to a single physical server, which increases the risk of single point of failure (SPOF). The reliability of data stored on local disks depends on the reliability of the physical server.

Big data analysis

The Open Data Processing Service (MaxCompute) is formerly known as the Big Data Computing Service (ODPS ). It stores and computes structured data in batches, providing solutions for massive data warehouses as well as big data analysis and modeling.

With the data integration service, RDS data can be imported into MaxCompute to achieve large-scale data computing.

Related Products

MaxCompute

MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.

Alibaba Cloud Elasticsearch

Alibaba Cloud Elasticsearch is based on the open-source Elasticsearch engine and provides commercial features. Designed for scenarios such as search and analytics, Alibaba Cloud Elasticsearch features enterprise-level access control, security monitoring, and automatic updates.

0 0 0
Share on

Alibaba Clouder

2,599 posts | 764 followers

You may also like

Comments

Alibaba Clouder

2,599 posts | 764 followers

Related Products

  • Dataphin

    As a unified PaaS platform for intelligent data creation and management at Exabyte Scale, Dataphin applies Alibaba Group’s unique and proven OneData, OneID & OneService technologies to help enterprises thrive in the new era of data intelligence.

    Learn More
  • Data Integration

    Data Integration is an all-in-one data synchronization platform. The platform supports online real-time and offline data exchange between all data sources, networks, and locations.

    Learn More