Best practice of real-time update based on EMR StarRocks
Huiliang Technology
As a global technology platform, Mobvista is committed to promoting global business growth in the digital era. The company was founded in Guangzhou, China, in 2013 and listed on the main board of the Hong Kong Stock Exchange in December 2018. Huiliang Technology focuses on providing advertising technology services and marketing technology services needed for the development of mobile internet ecology for global customers, especially Chinese overseas customers. Through the one-stop SaaS product matrix, mobile application developers can easily, quickly and efficiently realize full-link marketing activities and achieve growth in the global market.
EMR StarRocks real-time data change
Alibaba Cloud EMR launched StarRocks service at the beginning of the year( https://help.aliyun.com/document_detail/405463.html ), StarRocks is a new generation of extremely fast full-scene MPP (Massively Parallel Processing) data platform, dedicated to building an extremely fast and unified analysis experience. EMR StarRocks has the following characteristics:
• Compatible with MySQL protocol, and can use MySQL client and common BI tools to dock StarRocks to analyze data
• Distributed architecture:
• Divide data tables horizontally and store them in multiple copies
• The cluster size can be flexibly scaled to support 10 PB of data analysis
• Support MPP framework and accelerate parallel computing
• Supports multiple replicas with elastic fault tolerance
• Support vectorization engine and CBO
• Support elastic expansion and contraction
• Support the detail model, aggregation model, primary key model and update model. The primary key model can perform efficient Upsert or Delete operations according to the primary key. Through storage and index optimization, efficient query optimization can be achieved while concurrent updates.
Next, we will focus on how StarRocks can achieve efficient real-time data change.
StarRocks key model-driven real-time analysis
In various fields, analysis has become the center of the data platform. With the rapid development of technology in the field of data infrastructure, users' demand for real-time analysis is also increasing. Users are no longer satisfied with retrospective analysis, and begin to explore the use of real-time analysis to generate actionable insights that can affect the current business. Real-time analysis needs fresh data as the basis: users are looking for an OLAP solution that can refresh their real-time dashboard system, monitoring system and more applications in seconds, while expanding the data size to PB or more.
With the improvement of StarRocks users' demand for real-time thermal data analysis capability and the widespread demand for synchronizing transaction processing (TP) database tables to StarRocks, StarRocks needs to be able to support real-time data change (update/delete).
Before ushering in the primary key model, StarRocks users rely on the update model (Unique Key) to realize real-time data changes.
StarRocks update model
Before StarRocks supports the primary key model, real-time data change operations can be performed by updating the model (Unique Key). The update model is essentially a merge on read scheme: when data is imported, the update model will sort the data and directly write it to a new file without duplicate key checking. When reading, the key value is compared to merge multiple versions, and only the latest version of data is returned. The update mode of Merge on read writes data very fast. However, due to the need to merge multiple versions during reading, the read data performance is poor and unstable, and it cannot be further accelerated using indexes such as Bitmap.
StarRocks primary key model
Unlike the update model, the primary key model of StarRocks is implemented based on the delete and insert mode.
In order to solve the problem of reading and enlarging the updated model, the StarRocks primary key model introduces the primary key index. During data import, for each piece of data, the primary key model locates the location of the original record through the primary key index. After finding the record, you only need to mark the record with a Delete Bitmap, which means that the record is deleted, and then all other update records can be inserted into the new block as new data. The advantage of this is that all blocks can be loaded in parallel when reading, and then only the deleted records need to be filtered according to the Delete Bitmap tag.
Complete data update (delete+insert) during data import, and only keep the latest status of each piece of data. For the primary key model, it is not necessary to merge multiple versions when querying. In the case of import, the query performance can be basically unaffected.
Not only is the query performance superior, the primary key model has many other features:
• Full support for upsert, delete, and partial update operations
• Query performance is not affected during real-time update
• The query performance is not lower than the detailed model
• Low memory consumption, supporting primary key index drop
Use of primary key model in summary
Introduction to the material platform
Huiliang Technology has a rich product line of materials, among which XMP is the SaaS tool used by Huiliang Technology for unified advertising of multiple overseas media, CAS is the marketing creative analysis system of Huiliang Technology, ADS is the SaaS tool used by Huiliang Technology for unified advertising of multiple domestic media, Gatlin is the SaaS tool used by Huiliang Technology for mass production of advertising materials, and Topon is the global advertising aggregation platform of Huiliang Technology. These product lines of Huiliang Technology all have advertising materials. In order to achieve unified management, it is proposed to build a material platform, Creative Service, through which the materials are maintained, stored, retrieved and shared uniformly.
Create Service obtains data from five data sources (ADS, XMP, CAS, Gatlin and Topon). The data is divided into structured data and unstructured data: the audio and video images of advertising materials constitute unstructured data stored in the object storage. The metadata of these unstructured data forms structured data, which falls into the OLAP data warehouse to serve the downstream real-time operation business, Enable real-time data-driven advertising strategy adjustment to improve conversion rate.
The business characteristics of the material platform are as follows:
• The data source is complex, and the data comes from more than five data sources.
• Real-time operation requires second-level data freshness.
• Large amount of data, with more than 5 billion structured data in stock.
• There are many data dimensions and more than 80 data warehouse table fields.
OLAP selection
The requirements of the material platform for the OLAP engine are as follows:
• Query speed: the OLAP layer of the material platform is used by downstream relevant business personnel (such as product managers) to perform ad hoc queries, and the timeliness of query returns is high (second level/sub second level). For traditional OLAP engines, queries usually need at least minutes to return, which seriously reduces the efficiency of analysis.
• Unification: if the query system is not unified, offline and online selection query engines are composed and coexist, increasing the learning cost of business personnel. Another problem caused by the coexistence of multiple systems is data islands, and it is difficult to correlate data between systems.
• Real-time data update: Real-time operation (such as real-time advertising strategy adjustment) requires second-level data freshness, which determines the effectiveness of downstream advertising strategies. The storage engine of OLAP data warehouse is required to support efficient and frequent data import and change operations.
Based on these demands, StarRocks and ClickHouse engines are mainly compared. The comparison is mainly made from the following dimensions.
After investigation, StarRocks' native support in real-time analysis scenarios and more efficient query performance (especially multi-table) can better support the business scenarios of the material platform, and finally decided to use StarRocks to build an OLAP analysis platform for structured data of the material platform.
StarRocks real-time analysis landing practice
The overall structure is as follows:
• Core storage layer: core storage components, used to store Creative Service Metadata, and StarRocks data warehouse
• Kafka: access request for cache batch update, peak shaving and valley filling
• Java Thread: used to consume Kafka data and update the data to the core storage component StarRocks
• OSS: used to store material files
StarRocks primary key model enables real-time analysis
The material platform selects StarRocks' primary key model as the business data storage model, and compares it with the update model (Unique Key) during model selection:
The primary key model uses the delete+insert update mode of the primary key index to efficiently complete the multi-version merge operation during import. Compared with the update model (merge on read), the primary key model does not need to perform aggregation operations when querying, and supports predicate and index push down. It can provide efficient queries while supporting real-time and frequent updates and other scenarios.
The following figure shows the query performance comparison between the primary key model and the update model during and after import. From the test results, we can see that in the scenario of continuous data import, due to the difference of update mode, the primary key model has 3 - 15 times the advantage of the update model in query performance.
In the process of using the primary key model, in combination with business scenario requirements, focus on two primary key model features:
• Partial Update addresses the need for column updates in the business section.
• The primary key persistent index reduces the memory usage of the primary key model.
Real-time partial column update import
The StarRocks primary key model supports partial column updates during data import. When importing, only a few columns that need to be updated are provided on the premise of providing the primary key columns. The primary key model will automatically use the primary key index to efficiently fill the remaining columns and complete the update operation. There is no need to do data splicing operations in the upstream, and more flexible update scenarios are supported.
The data sources of the aggregation at the OLAP layer are complex and need to be obtained from five different data sources (ADS, XMP, CAS, Gatling and Topon). A typical analysis scenario requires data including the information of the advertising material itself, such as the size source, the effect of the advertising material, such as conversion, click, exposure, etc. These indicators are scattered in five data sources, and a method is needed to efficiently combine these five data sources for real-time data analysis.
Because five different data sources have the same unique primary key (the unique identification code of the advertising material), the amount of data is large, and the downstream ad hoc query has high requirements for the delay of the returned results. Finally, we decided to save the data into a wide table by pre-widening.
When updating, the OLAP layer can only get several columns of data from one data source at a time. The summary uses the partial column update feature of the primary key model to complete such update operations. When importing, only the existing columns and primary key columns need to be entered, and several columns of data can be updated into the table without unnecessary operations, which avoids complex and expensive operations such as upstream data back-checking or multi-stream merging.
At present, the column update feature of the primary key model supports the real-time import of several tables and hundreds of millions of records. The typical table has about 10 columns, 2 to 3 columns are updated each time, and the update frequency is about 10 seconds.
Primary key index drop
When the inventory team tested the StarRocks primary key model in the early stage, it first chose to maintain the primary key index in pure memory (persistent_index=off). When importing data, the primary key index maintained by pure memory needs to be pulled to build an index in memory, which consumes a lot of memory. In addition, the amount of data during the test is huge, and there has been a shortage of memory during the process.
The primary key index of StarRocks can be persisted through the disk, and the index file can be read directly from the disk during data import. According to the test results, compared with the primary key index of pure memory, while the import performance is not significantly lost, the memory is saved by more than 90%.
After the inventory team started the primary key model index inventory, it thoroughly solved the problem of insufficient memory caused by the primary key index in the test and production environment.
Future planning
In the future, the collection material platform will continue to further improve its service capability based on the new features of StarRocks. The specific plan is as follows:
• Improve the analysis function of the material platform and further empower advertising marketing
• Improve the system and data monitoring alarm
• After the primary key model supports Conditional Update, explore direct Routine Load for data import
• The primary key model supports continuous optimization of performance after specifying the sort key
As a global technology platform, Mobvista is committed to promoting global business growth in the digital era. The company was founded in Guangzhou, China, in 2013 and listed on the main board of the Hong Kong Stock Exchange in December 2018. Huiliang Technology focuses on providing advertising technology services and marketing technology services needed for the development of mobile internet ecology for global customers, especially Chinese overseas customers. Through the one-stop SaaS product matrix, mobile application developers can easily, quickly and efficiently realize full-link marketing activities and achieve growth in the global market.
EMR StarRocks real-time data change
Alibaba Cloud EMR launched StarRocks service at the beginning of the year( https://help.aliyun.com/document_detail/405463.html ), StarRocks is a new generation of extremely fast full-scene MPP (Massively Parallel Processing) data platform, dedicated to building an extremely fast and unified analysis experience. EMR StarRocks has the following characteristics:
• Compatible with MySQL protocol, and can use MySQL client and common BI tools to dock StarRocks to analyze data
• Distributed architecture:
• Divide data tables horizontally and store them in multiple copies
• The cluster size can be flexibly scaled to support 10 PB of data analysis
• Support MPP framework and accelerate parallel computing
• Supports multiple replicas with elastic fault tolerance
• Support vectorization engine and CBO
• Support elastic expansion and contraction
• Support the detail model, aggregation model, primary key model and update model. The primary key model can perform efficient Upsert or Delete operations according to the primary key. Through storage and index optimization, efficient query optimization can be achieved while concurrent updates.
Next, we will focus on how StarRocks can achieve efficient real-time data change.
StarRocks key model-driven real-time analysis
In various fields, analysis has become the center of the data platform. With the rapid development of technology in the field of data infrastructure, users' demand for real-time analysis is also increasing. Users are no longer satisfied with retrospective analysis, and begin to explore the use of real-time analysis to generate actionable insights that can affect the current business. Real-time analysis needs fresh data as the basis: users are looking for an OLAP solution that can refresh their real-time dashboard system, monitoring system and more applications in seconds, while expanding the data size to PB or more.
With the improvement of StarRocks users' demand for real-time thermal data analysis capability and the widespread demand for synchronizing transaction processing (TP) database tables to StarRocks, StarRocks needs to be able to support real-time data change (update/delete).
Before ushering in the primary key model, StarRocks users rely on the update model (Unique Key) to realize real-time data changes.
StarRocks update model
Before StarRocks supports the primary key model, real-time data change operations can be performed by updating the model (Unique Key). The update model is essentially a merge on read scheme: when data is imported, the update model will sort the data and directly write it to a new file without duplicate key checking. When reading, the key value is compared to merge multiple versions, and only the latest version of data is returned. The update mode of Merge on read writes data very fast. However, due to the need to merge multiple versions during reading, the read data performance is poor and unstable, and it cannot be further accelerated using indexes such as Bitmap.
StarRocks primary key model
Unlike the update model, the primary key model of StarRocks is implemented based on the delete and insert mode.
In order to solve the problem of reading and enlarging the updated model, the StarRocks primary key model introduces the primary key index. During data import, for each piece of data, the primary key model locates the location of the original record through the primary key index. After finding the record, you only need to mark the record with a Delete Bitmap, which means that the record is deleted, and then all other update records can be inserted into the new block as new data. The advantage of this is that all blocks can be loaded in parallel when reading, and then only the deleted records need to be filtered according to the Delete Bitmap tag.
Complete data update (delete+insert) during data import, and only keep the latest status of each piece of data. For the primary key model, it is not necessary to merge multiple versions when querying. In the case of import, the query performance can be basically unaffected.
Not only is the query performance superior, the primary key model has many other features:
• Full support for upsert, delete, and partial update operations
• Query performance is not affected during real-time update
• The query performance is not lower than the detailed model
• Low memory consumption, supporting primary key index drop
Use of primary key model in summary
Introduction to the material platform
Huiliang Technology has a rich product line of materials, among which XMP is the SaaS tool used by Huiliang Technology for unified advertising of multiple overseas media, CAS is the marketing creative analysis system of Huiliang Technology, ADS is the SaaS tool used by Huiliang Technology for unified advertising of multiple domestic media, Gatlin is the SaaS tool used by Huiliang Technology for mass production of advertising materials, and Topon is the global advertising aggregation platform of Huiliang Technology. These product lines of Huiliang Technology all have advertising materials. In order to achieve unified management, it is proposed to build a material platform, Creative Service, through which the materials are maintained, stored, retrieved and shared uniformly.
Create Service obtains data from five data sources (ADS, XMP, CAS, Gatlin and Topon). The data is divided into structured data and unstructured data: the audio and video images of advertising materials constitute unstructured data stored in the object storage. The metadata of these unstructured data forms structured data, which falls into the OLAP data warehouse to serve the downstream real-time operation business, Enable real-time data-driven advertising strategy adjustment to improve conversion rate.
The business characteristics of the material platform are as follows:
• The data source is complex, and the data comes from more than five data sources.
• Real-time operation requires second-level data freshness.
• Large amount of data, with more than 5 billion structured data in stock.
• There are many data dimensions and more than 80 data warehouse table fields.
OLAP selection
The requirements of the material platform for the OLAP engine are as follows:
• Query speed: the OLAP layer of the material platform is used by downstream relevant business personnel (such as product managers) to perform ad hoc queries, and the timeliness of query returns is high (second level/sub second level). For traditional OLAP engines, queries usually need at least minutes to return, which seriously reduces the efficiency of analysis.
• Unification: if the query system is not unified, offline and online selection query engines are composed and coexist, increasing the learning cost of business personnel. Another problem caused by the coexistence of multiple systems is data islands, and it is difficult to correlate data between systems.
• Real-time data update: Real-time operation (such as real-time advertising strategy adjustment) requires second-level data freshness, which determines the effectiveness of downstream advertising strategies. The storage engine of OLAP data warehouse is required to support efficient and frequent data import and change operations.
Based on these demands, StarRocks and ClickHouse engines are mainly compared. The comparison is mainly made from the following dimensions.
After investigation, StarRocks' native support in real-time analysis scenarios and more efficient query performance (especially multi-table) can better support the business scenarios of the material platform, and finally decided to use StarRocks to build an OLAP analysis platform for structured data of the material platform.
StarRocks real-time analysis landing practice
The overall structure is as follows:
• Core storage layer: core storage components, used to store Creative Service Metadata, and StarRocks data warehouse
• Kafka: access request for cache batch update, peak shaving and valley filling
• Java Thread: used to consume Kafka data and update the data to the core storage component StarRocks
• OSS: used to store material files
StarRocks primary key model enables real-time analysis
The material platform selects StarRocks' primary key model as the business data storage model, and compares it with the update model (Unique Key) during model selection:
The primary key model uses the delete+insert update mode of the primary key index to efficiently complete the multi-version merge operation during import. Compared with the update model (merge on read), the primary key model does not need to perform aggregation operations when querying, and supports predicate and index push down. It can provide efficient queries while supporting real-time and frequent updates and other scenarios.
The following figure shows the query performance comparison between the primary key model and the update model during and after import. From the test results, we can see that in the scenario of continuous data import, due to the difference of update mode, the primary key model has 3 - 15 times the advantage of the update model in query performance.
In the process of using the primary key model, in combination with business scenario requirements, focus on two primary key model features:
• Partial Update addresses the need for column updates in the business section.
• The primary key persistent index reduces the memory usage of the primary key model.
Real-time partial column update import
The StarRocks primary key model supports partial column updates during data import. When importing, only a few columns that need to be updated are provided on the premise of providing the primary key columns. The primary key model will automatically use the primary key index to efficiently fill the remaining columns and complete the update operation. There is no need to do data splicing operations in the upstream, and more flexible update scenarios are supported.
The data sources of the aggregation at the OLAP layer are complex and need to be obtained from five different data sources (ADS, XMP, CAS, Gatling and Topon). A typical analysis scenario requires data including the information of the advertising material itself, such as the size source, the effect of the advertising material, such as conversion, click, exposure, etc. These indicators are scattered in five data sources, and a method is needed to efficiently combine these five data sources for real-time data analysis.
Because five different data sources have the same unique primary key (the unique identification code of the advertising material), the amount of data is large, and the downstream ad hoc query has high requirements for the delay of the returned results. Finally, we decided to save the data into a wide table by pre-widening.
When updating, the OLAP layer can only get several columns of data from one data source at a time. The summary uses the partial column update feature of the primary key model to complete such update operations. When importing, only the existing columns and primary key columns need to be entered, and several columns of data can be updated into the table without unnecessary operations, which avoids complex and expensive operations such as upstream data back-checking or multi-stream merging.
At present, the column update feature of the primary key model supports the real-time import of several tables and hundreds of millions of records. The typical table has about 10 columns, 2 to 3 columns are updated each time, and the update frequency is about 10 seconds.
Primary key index drop
When the inventory team tested the StarRocks primary key model in the early stage, it first chose to maintain the primary key index in pure memory (persistent_index=off). When importing data, the primary key index maintained by pure memory needs to be pulled to build an index in memory, which consumes a lot of memory. In addition, the amount of data during the test is huge, and there has been a shortage of memory during the process.
The primary key index of StarRocks can be persisted through the disk, and the index file can be read directly from the disk during data import. According to the test results, compared with the primary key index of pure memory, while the import performance is not significantly lost, the memory is saved by more than 90%.
After the inventory team started the primary key model index inventory, it thoroughly solved the problem of insufficient memory caused by the primary key index in the test and production environment.
Future planning
In the future, the collection material platform will continue to further improve its service capability based on the new features of StarRocks. The specific plan is as follows:
• Improve the analysis function of the material platform and further empower advertising marketing
• Improve the system and data monitoring alarm
• After the primary key model supports Conditional Update, explore direct Routine Load for data import
• The primary key model supports continuous optimization of performance after specifying the sort key
Related Articles
-
A detailed explanation of Hadoop core architecture HDFS
Knowledge Base Team
-
What Does IOT Mean
Knowledge Base Team
-
6 Optional Technologies for Data Storage
Knowledge Base Team
-
What Is Blockchain Technology
Knowledge Base Team
Explore More Special Offers
-
Short Message Service(SMS) & Mail Service
50,000 email package starts as low as USD 1.99, 120 short messages start at only USD 1.00