Spark Relational Cache: interactive analysis with sub-second response
1. Project introduction
Background of the project
Alibaba Cloud EMR is an open source big data solution. EMR already integrates many open source components, and the number keeps growing. Underneath, EMR can access a variety of storage, such as OSS object storage, self-built HDFS inside the cluster, and streaming data. Users can use EMR to process and quickly analyze massive amounts of data, and also to run machine learning and data cleaning workloads. EMR aims to support very large volumes of business data and to keep analysis fast through cluster expansion as the amount of data continues to grow.
Pain Points of Adhoc Data Analysis on the Cloud
With ad hoc data analysis on the cloud, it is hard to keep query latency from growing significantly as the amount of data increases. New engines keep emerging, and some run very fast in certain scenarios, but query response inevitably slows down as data grows. We therefore want better performance on a more unified platform, and Alibaba Cloud also wants to offer a cloud-native solution. Spark is currently the most widely used computing engine in the industry, but it still has many shortcomings for ad hoc queries, so Alibaba Cloud has made many optimizations to Spark to speed them up. Caching is one of them. Spark has had a caching mechanism for a long time, but it falls short in ad hoc scenarios on the cloud: if the cached data is kept in memory, memory may not be large enough to hold all of it. Spark Relational Cache was created to address this.
Spark Relational Cache
After a user's SQL request arrives at Spark, processing it against the data source can take a long time. The underlying storage here includes cluster HDFS, remote JindoFS, and Alibaba Cloud OSS. With Spark Relational Cache, when a query arrives, Spark first checks whether cached data stored in Relational Cache can be used to answer it. If not, the query is forwarded along the original path. If so, the data is read from the cache and the result is returned to the user. Relational Cache is built on efficient storage, and data is converted into a Relational Cache through DDL issued by the user.
Spark Relational Cache Features
Spark Relational Cache aims for second-level or sub-second response, so that users see results soon after submitting SQL. It supports large amounts of data because the cache is kept on persistent storage, and its matching methods broaden the range of queries that can use the cache. The underlying storage uses efficient formats, such as the columnar formats commonly used for offline analysis, and has been heavily optimized for columnar storage. Relational Cache is also transparent to users: they do not need to know how the tables relate to one another when they write a query. Only a few administrators are needed to maintain the available caches for an organization. Spark Relational Cache supports automatic updates, so users do not have to worry about reading wrong data because newly inserted data made the cache stale; it provides configurable rules to control updates. Spark Relational Cache has also explored further directions in research and development, such as intelligent recommendation: based on a user's SQL history, it can recommend which relations to build a Relational Cache on.
2. Technical analysis
Alibaba Cloud EMR builds on several core technologies here, such as data precomputation, automatic query matching, and data pre-organization.
Data precomputation
Data often follows a model. The snowflake model is very common in traditional databases. Alibaba Cloud EMR has added support for Primary Key/Foreign Key, so users can declare the relationships between tables, which improves the match success rate. Precomputation takes full advantage of the enhanced computing power of EMR Spark. Multidimensional data analysis is supported through data cubes.
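As an illustration of what a data cube precomputes, the following is a minimal sketch in plain Python, not EMR Spark code. It sums one measure for every subset of the grouping dimensions, so that a later aggregate query becomes a lookup. The rows, column names, and the `build_cube` helper are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows; the column names are illustrative only.
ROWS = [
    {"region": "east", "year": 2019, "revenue": 10},
    {"region": "east", "year": 2020, "revenue": 20},
    {"region": "west", "year": 2019, "revenue": 5},
    {"region": "west", "year": 2020, "revenue": 15},
]


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group_by in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group_by)
                totals[key] += row[measure]
            cube[group_by] = dict(totals)
    return cube


cube = build_cube(ROWS, ("region", "year"), "revenue")

# "Total revenue per region" is now a dictionary lookup instead of a scan.
revenue_by_region = cube[("region",)]
grand_total = cube[()][()]
print(revenue_by_region, grand_total)
```

The cost of a cube grows with the number of dimension subsets, which is one reason the precomputed results are kept on persistent storage rather than in memory.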
Execution plan rewrite
This part first produces precomputed results and stores them on external storage, such as OSS, HDFS, or other third-party storage. It supports Spark DataSource formats, and support for popular data lake storage formats will also be added. Traditional databases have similar optimizations, such as materialized views, but using that method in Spark is not appropriate. The matching logic is placed inside the Catalyst logical optimizer, which rewrites the logical execution plan: it determines whether the query can be answered from a Relational Cache, and performs further joins or combinations on top of the cache. The simplified logical plan is then converted into a physical plan and executed on the physical engine. Combined with the other optimizations in EMR Spark, this gives very fast execution. Execution plan rewriting can be turned on or off through switches.
Automatic query matching
Here is a simple example: a query joins three tables and applies filter conditions to obtain the final result. When the query arrives, Spark first judges whether a Relational Cache can satisfy it. If so, it applies the filter to the precomputed result to get the final result.
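The matching step for such a three-table join can be sketched as follows. This is a toy model in plain Python, not the Catalyst rule used by EMR Spark: the `JoinQuery` and `CachedRelation` classes, the `try_rewrite` function, and the SSB-style table and column names are all assumptions made for illustration. A query matches when it joins the same tables on the same keys and only needs columns that the cache contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JoinQuery:
    """Toy description of a join query (not Spark's logical plan)."""
    tables: frozenset      # tables joined together
    join_keys: frozenset   # pairs of equated columns
    columns: frozenset     # columns the query outputs
    filters: tuple = ()    # (column, value) equality predicates


@dataclass(frozen=True)
class CachedRelation:
    name: str
    definition: JoinQuery


def try_rewrite(query, cache):
    """Return a plan over the cache, or None to fall back to the original path."""
    cached = cache.definition
    if cached.filters:
        return None  # this sketch only handles caches built without filters
    same_join = (query.tables == cached.tables
                 and query.join_keys == cached.join_keys)
    needed = query.columns | {col for col, _ in query.filters}
    if same_join and needed <= cached.columns:
        return {"scan": cache.name,
                "filters": query.filters,
                "project": sorted(query.columns)}
    return None


TABLES = frozenset({"lineorder", "customer", "dates"})
KEYS = frozenset({("lineorder.custkey", "customer.custkey"),
                  ("lineorder.orderdate", "dates.datekey")})

cache = CachedRelation("ssb_flat", JoinQuery(
    TABLES, KEYS,
    frozenset({"customer.region", "dates.year", "lineorder.revenue"})))

query = JoinQuery(TABLES, KEYS, frozenset({"lineorder.revenue"}),
                  filters=(("customer.region", "ASIA"), ("dates.year", 1997)))
plan = try_rewrite(query, cache)

# A query that needs a column missing from the cache is not rewritten.
miss = JoinQuery(TABLES, KEYS, frozenset({"customer.city"}))
print(plan, try_rewrite(miss, cache))
```

In the rewritten plan the join has disappeared: the query only scans the cached relation and applies its filters, which is where the saving comes from.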
Data pre-organization
If a relation holds dozens of terabytes of data, obtaining the final result from it takes a long time, because many tasks must be started and scheduling those tasks carries a lot of overhead. A file index compresses this overhead to the second level: it filters the set of files to be read at execution time, which greatly reduces the number of tasks and makes execution much faster. For the global index to be effective, the data should be sorted. An ordinary sort on structured data only optimizes well for the key that comes first in the sort order, and later keys benefit much less. ZOrder sorting is therefore introduced so that each listed column benefits equally. The data is also stored in a partitioned table, with GroupID as the partition column.
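The effect of the sort order on a file index can be seen in a small sketch. The code below is plain Python under simplifying assumptions (two integer columns `x` and `y` with 3 bits each, 16 rows per file, a min/max index per file) and does not reflect how EMR Spark stores its index. It compares an ordinary sort on `(x, y)` with a Z-order sort that interleaves the bits of both columns.

```python
def z_value(x, y, bits=3):
    """Interleave the bits of x and y into a single Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)
        z |= ((y >> i) & 1) << (2 * i)
    return z


def build_file_index(rows, rows_per_file):
    """Split sorted rows into files and record min/max of each column per file."""
    index = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        xs = [x for x, _ in chunk]
        ys = [y for _, y in chunk]
        index.append({"x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return index


def files_to_read(index, column, value):
    """Return the files whose min/max range may contain column == value."""
    return [i for i, stats in enumerate(index)
            if stats[column][0] <= value <= stats[column][1]]


# Hypothetical data set: every (x, y) pair on an 8 x 8 grid.
rows = [(x, y) for x in range(8) for y in range(8)]

lex_index = build_file_index(sorted(rows), rows_per_file=16)
z_index = build_file_index(sorted(rows, key=lambda r: z_value(*r)),
                           rows_per_file=16)

for name, index in (("plain sort", lex_index), ("z-order", z_index)):
    print(name,
          "x=5 reads", files_to_read(index, "x", 5),
          "y=5 reads", files_to_read(index, "y", 5))
```

With the plain sort, a filter on `x` reads one of the four files but a filter on `y` reads all four. With the Z-order sort, either filter reads two files, so both columns benefit from the index.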
3. How to use
DDL
For a simple query, the DDL lets you give the cache a name to make later management easier and set the automatic update switch. You can also specify the data layout, and finally describe the relation with a SQL statement. A WebUI-like interface is also provided so that users can manage Relational Cache conveniently.
Data Update
Relational Cache has two main data update strategies. One is On Commit: when the data it depends on is updated, the data that needs to be added is appended to the cache. The other, which is the default, is On Demand: the user triggers the update manually with the Refresh command. The strategy can be specified at creation time or adjusted manually afterwards. Incremental updates of Relational Cache are partition-based. In the future, we will consider integrating smarter storage formats to support row-level updates.
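The difference between the two strategies, and the idea of partition-based incremental updates, can be sketched in plain Python. The `PartitionedCache` class below is hypothetical and only models the behavior described above: it keeps one precomputed sum per partition, tracks which partitions changed, and rebuilds only those on refresh.

```python
from enum import Enum


class RefreshMode(Enum):
    ON_COMMIT = "on commit"
    ON_DEMAND = "on demand"


class PartitionedCache:
    """Toy cache holding one precomputed sum per partition of a base table."""

    def __init__(self, mode=RefreshMode.ON_DEMAND):  # On Demand is the default
        self.mode = mode
        self.base = {}      # partition -> list of values (the source table)
        self.cached = {}    # partition -> precomputed sum
        self.stale = set()  # partitions changed since the last refresh

    def insert(self, partition, value):
        self.base.setdefault(partition, []).append(value)
        self.stale.add(partition)
        if self.mode is RefreshMode.ON_COMMIT:
            self.refresh()  # update the cache as part of the write

    def refresh(self):
        """Recompute only the changed partitions and return their names."""
        rebuilt = sorted(self.stale)
        for partition in rebuilt:
            self.cached[partition] = sum(self.base[partition])
        self.stale.clear()
        return rebuilt


on_demand = PartitionedCache()
on_demand.insert("2020-01", 10)
on_demand.insert("2020-02", 5)
on_demand.refresh()            # first refresh builds both partitions
on_demand.insert("2020-02", 7)
rebuilt = on_demand.refresh()  # second refresh touches one partition only

on_commit = PartitionedCache(RefreshMode.ON_COMMIT)
on_commit.insert("2020-01", 3)  # cache is already up to date after the insert
print(rebuilt, on_demand.cached, on_commit.cached)
```

Because staleness is tracked per partition, the cost of a refresh depends on how many partitions changed rather than on the total size of the cache.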
4. Performance analysis
Cube build
Alibaba Cloud EMR Spark takes only 1 hour to build a cube over 1 TB of data.
Query performance
Query performance is measured as the average query time on SSB. Without a cache, query time grows in proportion to the data scale, while the cached cube consistently maintains sub-second response.