Spark is a general-purpose big data analytics engine that is known for high performance, ease of use, and wide adoption.
Architecture
Scenarios
Offline ETL
Offline ETL applies to data warehousing scenarios. It is the process of extracting, transforming, and loading large amounts of data. Because this process is time-consuming, it usually runs as a scheduled task.
OLAP
OLAP applies to business intelligence (BI) scenarios. An analyst submits an interactive query, and Spark quickly returns the results. Other common OLAP engines include Presto and Impala. EMR Spark 2.4 supports the main features of Spark 3.0. For more information about the features of Spark, see Spark SQL Guide.
Stream processing
Stream processing applies to real-time data processing scenarios, such as real-time dashboard updates, risk management, recommendation, monitoring, and alerting. Stream processing engines include Spark Streaming and Flink. Spark Streaming provides the DStream and Structured Streaming APIs. Structured Streaming is used in much the same way as the DataFrame API, which lowers the learning curve for developers. Flink is suitable for scenarios that require low latency, and Spark Streaming is suitable for scenarios that require high throughput. For more information, see Structured Streaming Programming Guide.
Machine learning
MLlib is a scalable machine learning library that contains classification, regression, clustering, and collaborative filtering algorithms. MLlib provides tools such as model selection, automatic parameter tuning, and cross-validation to improve productivity. MLlib focuses on non-deep-learning algorithms. For more information, see Machine Learning Library (MLlib) Guide.
Graph computing
GraphX is a graph computing library. It supports various graph computing operators, such as property operators, structural operators, join operators, and neighborhood aggregation operators. For more information, see GraphX Programming Guide.