Data is crucial to the operation of any business. Businesses often collect large numbers of logs so that they can better understand their own services and the people who are using them.
As time goes by, the number and activity of users continuously increases, as does the speed at which data is generated, the accumulated volume of data, and the number of data dimensions and data types. As a result, we end up with more and more data islands.
We have reached a point where data islands are now a heavy burden for a company's IT department, increasing costs without increasing earnings.
Storing and processing large amounts of data poses serious problems to businesses, such as:
This article discusses some potential solutions to address issues when dealing with massive amounts of data for specific industries.
Packages generate a lot of tracking data from collection to shipment, transportation, transshipment, and distribution. Each time a package reaches a checkpoint, it will be scanned and its status will be recorded.
During the transportation process, the vehicle associated with the package is tracked, including the vehicle records, route, fuel consumption, vehicle status, and driver's status. During the distribution process, both the positional information of the courier and the distribution status of the package will be tracked, which also generates a large amount of data.
A transportation vehicle may generate tens of thousands of tracking records in a day, and so may a courier. Even a small package may generate hundreds of tracking records!
One potential solution for the logistics industry is dynamic path planning. With dynamic path planning, we can locate nearby couriers in real time from their positional information. This information can then be used to compute a least-cost path for quick delivery. If you are interested, you can refer to Technologies used in Double 11 shopping festival - logistics and dynamic path planning (article in Chinese).
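As a rough illustration of the first step in dynamic path planning, the sketch below ranks couriers by their distance to a pickup point. It is a minimal example under stated assumptions: the courier ids, coordinates, and function names are hypothetical, and straight-line (great-circle) distance stands in for a real road-network cost.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_couriers(pickup, couriers, k=3):
    """Return the k courier ids closest to the pickup point, nearest first."""
    ranked = sorted(couriers.items(), key=lambda item: haversine_km(*pickup, *item[1]))
    return [courier_id for courier_id, _ in ranked[:k]]

# Hypothetical latest reported positions (courier id -> (lat, lon)).
couriers = {
    "c1": (30.2741, 120.1551),
    "c2": (30.2800, 120.1600),
    "c3": (30.3500, 120.3000),
}
pickup = (30.2750, 120.1560)
print(nearest_couriers(pickup, couriers, k=2))
```

In a database, the same question would normally be answered with a spatial index rather than by sorting every courier, but the ranking idea is the same.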
The financial industry is another generator of massive data. Financial industry data mainly includes user transactions, enterprise transactions, and securities data. To make matters worse, most transactions in the financial industry require real-time computing and the use of a wide range of statistical analysis functions.
Additionally, data collected in the financial industry must be kept secure. I have shared an analysis on the needs of a securities trading simulation system. If you are interested, refer to Analysis and application of PostgreSQL in databases in the securities industry (article in Chinese).
Data generated by IoT is sequential, and therefore requires StreamCompute (e.g. an action triggered when a reading reaches a threshold) and post-event analysis. Because the data volume is huge, it also needs to be compressed.
Another feature of IoT is that data reported by sensors often contains numerical ranges (e.g. a temperature range), geographic positions, images, etc. How can we effectively store and search such varied data types? I have also analyzed the database features that IoT applications need in a few articles (articles in Chinese).
Rise of StreamCompute - PostgreSQL joins hands with PipelineDB to support IoT
A simple algorithm to help IoT and financial users cut data storage costs by 98%
PostgreSQL cutting-edge technology - range type and GiST indexes boost development in IoT
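To make the compression need mentioned above concrete, here is a minimal sketch of a dead-band filter, one simple way to thin out slowly changing sensor readings. This is an assumption chosen for illustration and is not necessarily the algorithm described in the articles listed above; the function name, tolerance, and sample readings are all hypothetical.

```python
def deadband_compress(samples, tolerance):
    """Keep a sample only when it differs from the last kept value by more than tolerance.

    samples: iterable of (timestamp, value) pairs in time order.
    """
    kept = []
    for ts, value in samples:
        if not kept or abs(value - kept[-1][1]) > tolerance:
            kept.append((ts, value))
    return kept

# Hypothetical temperature readings: (timestamp, degrees).
readings = [(0, 20.0), (1, 20.1), (2, 20.05), (3, 21.0), (4, 21.1), (5, 19.5)]
compressed = deadband_compress(readings, tolerance=0.5)
print(compressed)
```

The trade-off is lossy storage: values inside the tolerance band are dropped, so the tolerance has to match the precision the business actually needs.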
Monitoring involves overseeing business operations and the status of IT infrastructure, such as servers, networks, and storage. Monitoring is widely used in all kinds of businesses, but monitoring needs and data types vary across industries.
For example, some industries may need positional monitoring, such as tracking the location of a bus and sending alerts when it leaves a geofence or when the driver displays unusual driving behavior.
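The geofence check itself can be sketched with a standard ray-casting point-in-polygon test. This is a minimal example, assuming the fence is a simple polygon and that coordinates can be treated as planar (reasonable only for small areas); the fence, the bus track, and the function names are hypothetical.

```python
def inside_geofence(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order; points exactly on an
    edge may be classified either way.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def geofence_alerts(track, polygon):
    """Return the positions from the track that fall outside the fence."""
    return [p for p in track if not inside_geofence(p, polygon)]

fence = [(0, 0), (10, 0), (10, 10), (0, 10)]
bus_track = [(1, 1), (5, 5), (12, 5), (9, 9)]
print(geofence_alerts(bus_track, fence))
```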
Public security data comes from various fields, including communication records and travel records.
Public security data can be incredibly large. Typical use cases include risk control and capturing criminal suspects. These rely heavily on analyzing chronological and geographic positional information, as well as on image searches.
How can these needs be met?
The challenges of running businesses with massive amounts of data can be summarized with this list of questions:
We will cover three components:
1.RDS PostgreSQL
RDS PostgreSQL supports chronological data, block range indexes, inverted indexes, multi-core parallel computing, JSON, array storage, OSS_FDW external read/write, and other features.
RDS PostgreSQL addresses OLTP, GIS applications, complex queries, spatial data processing, multi-dimensional analysis, and cold/hot data separation.
2.HybridDB PostgreSQL
HybridDB PostgreSQL supports array storage, horizontal expansion, block compression, extensive data types, machine learning library, PLPYTHON, PLJAVA, and PLR programming, OSS_FDW external read and write, etc.
It addresses the problem of computing over massive amounts of data.
3.Object Storage Service (OSS)
Data can be shared among multiple RDS instances through OSS_FDW. OSS supports multi-copy storage and cross-region replication.
OSS solves the problems of data islands, storing massive data, cross-data-center disaster tolerance, massive data backup, etc.
Data is written in three ways:
Data can be shared among multiple RDS instances through OSS_FDW.
Let's assume that we have two enterprises, and we have used two RDS database instances, but there is a part of the data that needs to be shared between them. The traditional method requires the use of ETL, but now we can efficiently achieve data sharing among multiple instances by using OSS_FDW.
The RDS PostgreSQL OSS_FDW parallel read/write function (to enable multiple worker processes to read and write to the same table file) provides a highly efficient read/write process for shared data.
Parallelism appears at three levels: OSS parallel read/write, RDS PostgreSQL multi-core parallel computing, and multi-machine parallel computing with RDS PostgreSQL or HybridDB.
Local RDS PostgreSQL and HybridDB are used for real-time data storage. When stored data needs to be analyzed or shared, we store it in OSS.
When we compare the amount of storage with the computing resources used, the cost of OSS is much lower, which reduces the company's IT costs without negatively influencing flexibility.
OSS addresses enterprises' problems with data redundancy and high costs, and meets their requirements for data backup and disaster tolerance.
By using RDS PostgreSQL, HybridDB, and OSS, we can separate our computing resources and storage resources.
Since we are able to store less data at the computing nodes (most of the data is stored in OSS), scaling the computing nodes up or down, providing disaster tolerance for them, and backing them up are all much easier.
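The separation of hot and cold data described above can be sketched as a simple age-based split. This is a minimal illustration with hypothetical names and dates: the in-memory lists only stand in for a local table (hot) and data written out to OSS (cold), and no actual OSS_FDW call is shown.

```python
from datetime import datetime, timedelta

def split_hot_cold(rows, now, hot_window):
    """Partition rows into (hot, cold) by age.

    rows: list of dicts with a 'created_at' datetime.
    Rows newer than now - hot_window stay hot; the rest are cold.
    """
    cutoff = now - hot_window
    hot = [r for r in rows if r["created_at"] >= cutoff]
    cold = [r for r in rows if r["created_at"] < cutoff]
    return hot, cold

now = datetime(2018, 1, 31)
rows = [
    {"id": 1, "created_at": datetime(2018, 1, 30)},
    {"id": 2, "created_at": datetime(2017, 11, 1)},
    {"id": 3, "created_at": datetime(2018, 1, 5)},
]
hot, cold = split_hot_cold(rows, now, hot_window=timedelta(days=30))
print([r["id"] for r in hot], [r["id"] for r in cold])
```

Only the hot set has to live on the computing node, which is what keeps resizing and backup of that node cheap.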
Computing methods may be divided into the following types:
1.Stream computing
StreamCompute can be divided into two sub-types: real-time computing and real-time threshold alerts.
Both can be achieved through PipelineDB (based on PostgreSQL).
Benefits:
The standard SQL interface and a wide range of built-in functions support complex StreamCompute demands. The diversified data types (including GIS, JSON, etc.) support even more StreamCompute business scenarios. The asynchronous message-based mechanism supports the needs of the second type of StreamCompute.
PipelineDB is currently being reworked, and may be usable as an extension of PostgreSQL in the future.
Taking the monitoring industry as an example, the asynchronous message-based mechanism of StreamCompute can effectively avoid the wasted effort of traditional monitoring methods based on active polling.
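The difference between polling and a message-based threshold alert can be sketched as follows. This is a minimal, hypothetical example and not PipelineDB's API: plain synchronous callbacks stand in for the asynchronous notification mechanism, and the class name, threshold, and sensor ids are assumptions.

```python
class ThresholdMonitor:
    """Calls subscribers when an incoming reading crosses a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def ingest(self, sensor_id, value):
        # Evaluate the rule as the event arrives; nobody has to poll.
        if value > self.threshold:
            for callback in self.subscribers:
                callback(sensor_id, value)

alerts = []
monitor = ThresholdMonitor(threshold=80.0)
monitor.subscribe(lambda sensor_id, value: alerts.append((sensor_id, value)))

for sensor_id, value in [("cpu-1", 42.0), ("cpu-2", 91.5), ("cpu-1", 79.9), ("cpu-1", 85.0)]:
    monitor.ingest(sensor_id, value)

print(alerts)
```

The monitoring side does no work while readings stay below the threshold, whereas a polling monitor would run the same query repeatedly and usually find nothing.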
2.Real-time interaction
We can use RDS PostgreSQL to meet traditional OLTP demands.
Features of PostgreSQL include: GIS, JSON, array, cold/hot data separation, horizontal database partitioning, K-V type, multi-core parallel computing, BRIN indexes, GIN indexes, etc.
Scenarios supported by PostgreSQL include: StreamCompute, image searches, chronological data, path planning, fuzzy searches, full-text searches, similarity searches, auction sniping, genetics, finance, chemistry, GIS application, complex searches, BI, multi-dimensional analysis, spatial data searches, etc.
It covers a wide range of industries, including banking, insurance, securities, IoT, Internet, gaming, astronomy, publishing, e-commerce, traditional enterprises, etc.
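Of the features listed above, BRIN (block range) indexes are a natural fit for chronological data, because values that arrive in time order are stored roughly in order too. The sketch below shows the idea only, under stated assumptions: it keeps a min/max summary per block of a Python list, whereas a real BRIN index summarizes ranges of table pages on disk. All names here are hypothetical.

```python
def build_block_ranges(values, block_size):
    """Summarize each block of values as (start_index, min, max)."""
    summaries = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summaries.append((start, min(block), max(block)))
    return summaries

def range_search(values, summaries, block_size, low, high):
    """Return matching values and how many blocks had to be scanned."""
    matches, scanned = [], 0
    for start, block_min, block_max in summaries:
        # Skip blocks whose [min, max] cannot overlap [low, high].
        if block_max < low or block_min > high:
            continue
        scanned += 1
        matches.extend(v for v in values[start:start + block_size] if low <= v <= high)
    return matches, scanned

timestamps = list(range(1000))  # stand-in for an append-only time column
summaries = build_block_ranges(timestamps, block_size=100)
matches, scanned = range_search(timestamps, summaries, 100, 250, 260)
print(len(matches), scanned, len(summaries))
```

The index stays tiny (one summary per block), and a narrow time-range query touches only the blocks whose summaries overlap it.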
3.Quasi real-time analysis
Used in combination with OSS, both RDS PostgreSQL and HybridDB can achieve quasi real-time analysis.
The same OSS data can also be shared and concurrently accessed by multiple instances.
4.Offline data analysis and data mining
Used in combination with OSS, both RDS PostgreSQL and HybridDB can be used to analyze and mine offline data.
RDS PostgreSQL supports single-machine multi-core parallel computing, and HybridDB for PostgreSQL supports multi-machine multi-core parallel computing. Users can choose between them based on the computing capacity they need.
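Both forms of parallel computing share the same shape: split the data, aggregate each piece independently, then merge the partial results. The sketch below illustrates that shape for an average. It is a minimal example with hypothetical function names; threads stand in for worker processes or segment nodes, and in CPython they do not speed up CPU-bound work, so this shows the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_aggregate(chunk):
    """Per-worker step: return (sum, count) for one slice of the data."""
    return sum(chunk), len(chunk)

def parallel_average(values, workers=4):
    """Split values across workers, then merge the partial results."""
    if not values:
        raise ValueError("values must not be empty")
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, chunks))
    # Merge step: an average cannot be averaged again, so combine sums and counts.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(parallel_average(list(range(1, 101))))
```

Carrying (sum, count) pairs instead of per-chunk averages keeps the merged result correct even when the chunks have different sizes.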
Required computing capacity
The soul of computing is working with different data types.
The advantages of RDS PostgreSQL are mainly represented in the following aspects:
1.Performance
RDS PostgreSQL mainly handles online transactions and a small amount of quasi real-time analysis.
Performance of PostgreSQL OLTP is at the level of commercial databases. For details, refer to this article:
Performance evaluation in the database industry_tpc.org
Multi-core parallel computing, JIT, operator reuse, and other features of PostgreSQL allow it to leave other RDBMS databases far behind in terms of OLAP capacity. For more on the OLAP analytical capacity of PostgreSQL, refer to this article:
PostgreSQL 10 also has many enhancements in the aspect of HTAP.
2.Features
Features are a major strength of PostgreSQL. You can find the detailed description in the previous section "Required computing capacity".
3.Scalability
The performance of complex computing can be enhanced by adding additional CPUs.
The storage capacity of RDS PG can be extended beyond its local limit by using OSS storage through the OSS_FDW extension.
4.Cost
As most of the data that needs to be separated is stored in OSS, users do not need to worry about disaster tolerance and backups for this part of the data. Compared to storage in databases, storage costs are greatly reduced.
Both RDS PG and HybridDB PG support a wide variety of standard SQL interfaces, which are also used to access data stored in OSS (the Table interface). All of these features combined lead to significant reductions in development costs.
Maintenance cost: There is almost no O&M cost when using the cloud services.
5.Industries covered
It covers a wide range of industries, including banking, insurance, securities, IoT, Internet, gaming, astronomy, publishing, e-commerce, traditional enterprises, etc.
The advantages of HybridDB PostgreSQL are mainly represented in the following aspects:
1.Performance
HybridDB PostgreSQL uses an MPP architecture, and therefore has outstanding computing performance.
2.Features
You can find the detailed description in the previous section "Required computing capacity".
3.Scalability
The performance of complex computing can be enhanced by adding additional computing nodes.
The storage capacity of RDS PG can be extended beyond its local limit by using OSS storage through the OSS_FDW extension.
4.Cost
As most of the data that needs to be separated is stored in OSS, users do not need to worry about disaster tolerance and backups for this part of the data. Compared to storage in databases, storage costs are greatly reduced.
Both RDS PG and HybridDB PG support a wide variety of standard SQL interfaces, which are also used to access data stored in OSS (the Table interface). All of these features combined lead to significant reductions in development costs.
Maintenance cost: There is almost no O&M cost when using the cloud services.
5.Industries covered
It covers a wide range of industries, including banking, insurance, securities, IoT, Internet, gaming, astronomy, publishing, e-commerce, traditional enterprises, etc.