StarRocks - E-MapReduce - Alibaba Cloud Documentation Center

This topic introduces StarRocks and describes its features and typical use scenarios.

What is StarRocks?

StarRocks is a next-generation, high-speed massively parallel processing (MPP) database that supports efficient and unified data analysis.
StarRocks is compatible with the MySQL protocol. You can use a MySQL client or a common business intelligence (BI) tool to access StarRocks for data analysis.
StarRocks uses a distributed architecture to provide the following capabilities:
- Horizontally splits tables and stores data in multiple replicas.
- Scales clusters flexibly to support analysis of up to 10 PB of data.
- Supports the MPP architecture to accelerate data computing.
- Supports multiple replicas to ensure fault tolerance.
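
The following minimal sketch illustrates the first and last capabilities in the preceding list: rows are split horizontally into shards by hashing a key, and each shard is placed on several nodes. The function name, the node names (be1 to be3), and the round-robin placement rule are assumptions made for illustration. They are not the actual distribution algorithm of StarRocks.

```python
import zlib

def place_rows(rows, key, num_shards, nodes, replicas):
    """Assign each row to a shard by hashing its key, then map every shard to `replicas` distinct nodes."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    shards = {i: [] for i in range(num_shards)}
    for row in rows:
        # crc32 is a stable hash, so the same key always lands in the same shard.
        shard_id = zlib.crc32(str(row[key]).encode()) % num_shards
        shards[shard_id].append(row)
    # Illustrative round-robin placement: shard i is stored on `replicas` consecutive nodes.
    placement = {
        i: [nodes[(i + r) % len(nodes)] for r in range(replicas)]
        for i in range(num_shards)
    }
    return shards, placement

rows = [{"user_id": uid, "amount": uid * 10} for uid in range(1, 9)]
shards, placement = place_rows(
    rows, "user_id", num_shards=4, nodes=["be1", "be2", "be3"], replicas=2
)
```

Because every shard is stored on two different nodes in this sketch, the loss of a single node still leaves one copy of each shard available.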

Note

Some information in this topic is from the "What is StarRocks" topic of the open source StarRocks documentation.

Features

StarRocks draws on the design principles of MPP databases and distributed systems. StarRocks has the following features:

Simplified architecture

StarRocks uses the MPP computing framework to execute SQL statements. The MPP computing framework leverages the computing capabilities of multiple nodes to concurrently perform queries. This helps improve user experience in interactive analytics.
StarRocks is easy to deploy, use, and maintain. You can use StarRocks clusters without external components. This reduces O&M costs and increases the reliability and scalability of the StarRocks system. Administrators only need to focus on the StarRocks system and do not need to learn or manage external systems.
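
The MPP idea described in this section can be sketched as follows: each node aggregates only its local slice of the data, and the partial results are then merged. This is a simplified illustration, not StarRocks code. Threads stand in for nodes, and the function names and sample data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def partial_sum(fragment):
    """Work done by one node: aggregate only its local slice of the data."""
    totals = Counter()
    for city, amount in fragment:
        totals[city] += amount
    return totals

def mpp_sum(fragments):
    """Run the partial aggregations concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(partial_sum, fragments))
    merged = Counter()
    for part in partials:
        merged.update(part)  # Counter.update adds the counts of matching keys
    return dict(merged)

# Each inner list represents the rows stored on one node.
fragments = [
    [("Hangzhou", 10), ("Beijing", 5)],
    [("Hangzhou", 7)],
    [("Beijing", 1), ("Shanghai", 4)],
]
result = mpp_sum(fragments)
```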

Vectorized engine

StarRocks adopts vectorization technologies at the computing layer to systematically optimize all operators, functions, scanning and filtering modules, and import and export modules. StarRocks makes full use of the parallel computing power of CPUs by using techniques such as a columnar memory layout and single instruction, multiple data (SIMD) instructions. This way, you can perform multi-dimensional analysis on data at sub-second speeds.
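
The following sketch contrasts row-at-a-time processing with column-at-a-time processing on the same data. It only illustrates the data layout and the batch-oriented evaluation style. Pure Python does not issue SIMD instructions, and the column names and values are hypothetical.

```python
from array import array

# Row layout: each record is a separate tuple (order_id, price, quantity).
rows = [(1, 120.0, 3), (2, 80.0, 1), (3, 200.0, 5)]

def revenue_row_at_a_time(rows, min_price):
    """Evaluate the filter and the arithmetic one row at a time."""
    total = 0.0
    for _order_id, price, quantity in rows:
        if price >= min_price:
            total += price * quantity
    return total

# Columnar layout: each column is one contiguous, typed buffer.
prices = array("d", [120.0, 80.0, 200.0])
quantities = array("q", [3, 1, 5])

def revenue_columnar(prices, quantities, min_price):
    """Evaluate the filter over the whole column first, then the arithmetic over the batch."""
    # Step 1: the filter reads only the price column and produces a selection mask.
    mask = [p >= min_price for p in prices]
    # Step 2: the arithmetic reads only the two columns it needs; order_id is never touched.
    return sum(p * q for p, q, keep in zip(prices, quantities, mask) if keep)

row_result = revenue_row_at_a_time(rows, 100.0)
col_result = revenue_columnar(prices, quantities, 100.0)
```

Both functions return the same value. In an engine written in a compiled language, the columnar form is the one that lends itself to SIMD execution, because each step is a tight loop over a contiguous buffer of one data type.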

Intelligent query optimization

StarRocks uses a cost-based optimizer (CBO) to perform automatic optimization for complex queries. The CBO uses statistical information to estimate execution costs and generate an execution plan without the need to perform manual operations. This significantly improves the efficiency of data analysis in ad hoc queries and extract, transform, and load (ETL) scenarios.
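
The following toy model shows the general principle of cost-based optimization: candidate plans are scored by using statistics, and the cheapest plan is chosen. The table names, row counts, distinct-value counts, and the cost formula are all assumptions made for illustration. The actual CBO of StarRocks is far more sophisticated.

```python
from itertools import permutations

# Hypothetical statistics collected ahead of time.
ROW_COUNTS = {"orders": 1_000_000, "customers": 50_000, "regions": 10}
# Hypothetical number of distinct join-key values for each joinable pair of tables.
JOIN_KEY_NDV = {
    frozenset({"orders", "customers"}): 50_000,
    frozenset({"customers", "regions"}): 10,
}

def estimate_cost(order):
    """Cost of a left-deep join order = total number of intermediate rows it produces."""
    joined = [order[0]]
    rows = ROW_COUNTS[order[0]]
    cost = 0
    for table in order[1:]:
        # Classic estimate: |R JOIN S| = |R| * |S| / NDV(join key).
        # A pair without a join predicate gets NDV 1, which models a cross join.
        ndv = max(JOIN_KEY_NDV.get(frozenset({table, t}), 1) for t in joined)
        rows = rows * ROW_COUNTS[table] // ndv
        cost += rows
        joined.append(table)
    return cost

def choose_plan(tables):
    """Enumerate all join orders and return the one with the lowest estimated cost."""
    return min(permutations(tables), key=estimate_cost)

best = choose_plan(["orders", "customers", "regions"])
```

With these statistics, the chosen order joins the two small tables first and the large orders table last, which keeps the intermediate results small.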

Federated queries

StarRocks supports federated queries by using external tables. Hive, MySQL, Elasticsearch, Iceberg, and Hudi external tables are supported. You can query data without the need to import data. This accelerates data queries.

Efficient updates

StarRocks provides models for detail queries, data aggregation, primary keys, and data updates. The model for primary keys helps you perform UPSERT operations or DELETE operations based on primary keys. StarRocks optimizes storage and indexing to ensure that queries during concurrent updates are efficient and can meet the requirements of real-time data warehouses.
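
The semantics of UPSERT and DELETE operations based on primary keys can be sketched with an in-memory table. The class and column names are hypothetical, and the sketch says nothing about how StarRocks stores or indexes the data.

```python
class PrimaryKeyTable:
    """A toy table in which each primary key value maps to exactly one row."""

    def __init__(self, key):
        self.key = key
        self.rows = {}  # primary key value -> latest version of the row

    def upsert(self, row):
        """Insert the row, or replace the existing row that has the same primary key."""
        self.rows[row[self.key]] = dict(row)

    def delete(self, key_value):
        """Delete by primary key. Deleting a key that does not exist is a no-op."""
        self.rows.pop(key_value, None)

    def scan(self):
        """Return all rows ordered by primary key."""
        return sorted(self.rows.values(), key=lambda r: r[self.key])

orders = PrimaryKeyTable(key="order_id")
orders.upsert({"order_id": 1, "status": "created"})
orders.upsert({"order_id": 2, "status": "created"})
orders.upsert({"order_id": 1, "status": "paid"})  # same key: the row is updated
orders.delete(2)
```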

Intelligent materialized views

StarRocks supports intelligent materialized views. You can create a materialized view and perform precalculation to generate a pre-aggregate table. This way, you can accelerate aggregation query requests.
Data is automatically aggregated into a materialized view when the data is imported. Therefore, the materialized view stays consistent with the source table.
When you query data, you do not need to specify a materialized view. StarRocks selects the optimal materialized view. This accelerates queries.
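
The three behaviors described in this section (precalculation, automatic maintenance during import, and automatic selection at query time) can be sketched as follows. This is a simplified model that supports only SUM over one hypothetical amount column. All class, method, and view names are assumptions.

```python
from collections import defaultdict

class Table:
    def __init__(self):
        self.rows = []   # base (detail) data
        self.views = {}  # view name -> (group-by column, {group value: running sum})

    def create_sum_view(self, name, group_by):
        """Precalculate SUM(amount) GROUP BY group_by over the rows that already exist."""
        totals = defaultdict(int)
        for row in self.rows:
            totals[row[group_by]] += row["amount"]
        self.views[name] = (group_by, totals)

    def load(self, new_rows):
        """Import rows and update every view in the same step so that the views stay consistent."""
        for row in new_rows:
            self.rows.append(row)
            for group_by, totals in self.views.values():
                totals[row[group_by]] += row["amount"]

    def sum_amount_by(self, column):
        """Answer SUM(amount) GROUP BY column from a matching view if one exists, else scan."""
        for name, (group_by, totals) in self.views.items():
            if group_by == column:
                return dict(totals), name
        totals = defaultdict(int)
        for row in self.rows:
            totals[row[column]] += row["amount"]
        return dict(totals), None

sales = Table()
sales.load([
    {"city": "Hangzhou", "day": "mon", "amount": 10},
    {"city": "Beijing", "day": "mon", "amount": 5},
])
sales.create_sum_view("mv_city", "city")
sales.load([{"city": "Hangzhou", "day": "tue", "amount": 7}])

by_city, used_view = sales.sum_amount_by("city")  # answered from mv_city
by_day, no_view = sales.sum_amount_by("day")      # no matching view: scans the base rows
```

The caller never names the view. The lookup inside sum_amount_by plays the role of the automatic selection described above.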

Standard SQL syntax

StarRocks supports standard SQL syntax, including syntax on aggregation, JOIN operations, data sorting, window functions, and custom functions.
StarRocks also supports all 22 queries of the TPC-H benchmark and all 99 queries of the TPC-DS benchmark.
StarRocks is compatible with the MySQL protocol. You can use a variety of clients and BI tools to access StarRocks and analyze data in StarRocks by dragging and dropping.

Unified batch and stream computing

StarRocks supports data import in real time or in batches.
You can import data from Kafka data sources, Hadoop Distributed File System (HDFS) data sources, and local files.
The data you import can be in formats such as ORC, Parquet, and CSV.
StarRocks allows you to import data from Kafka data sources in real time as the data is consumed. Data is neither duplicated nor lost.
StarRocks also allows you to import data in batches from local files or from remote storage such as HDFS.
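
One common way to make sure that streamed data is neither duplicated nor lost is to record the consumed offset in the same step as the write, so that a redelivered message is skipped and a missing message is detected. The following sketch shows that general technique. It is an assumption made for illustration and does not describe how StarRocks implements import from Kafka. The class name and the data are hypothetical.

```python
class ExactlyOnceSink:
    """Stores rows together with the next expected offset of a single partition."""

    def __init__(self):
        self.rows = []
        self.next_offset = 0

    def ingest(self, batch):
        """Apply a list of (offset, value) pairs in offset order. Returns the number of rows applied."""
        applied = 0
        for offset, value in batch:
            if offset < self.next_offset:
                continue  # already applied: skip the duplicate
            if offset > self.next_offset:
                # A gap means that a message would be lost, so the batch is rejected.
                raise ValueError(f"gap: expected offset {self.next_offset}, got {offset}")
            self.rows.append(value)
            self.next_offset = offset + 1  # advanced in the same step as the write
            applied += 1
        return applied

sink = ExactlyOnceSink()
first = sink.ingest([(0, "a"), (1, "b")])
second = sink.ingest([(1, "b"), (2, "c")])  # offset 1 is redelivered after a retry
```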

High availability and scalability

StarRocks stores metadata and data in multiple replicas and provides hot backup in multiple instances. This prevents single points of failure (SPOFs).
StarRocks has the elastic auto recovery capability. This ensures the stability of StarRocks clusters when node failures, disconnections, and exceptions occur.
StarRocks uses a distributed architecture to scale storage capacity and computing capabilities horizontally. Each StarRocks cluster can contain hundreds of nodes and manage up to 10 PB of data.
The service is not interrupted during scaling. You can query data as usual.
StarRocks supports hot changes to table schemas. You can execute an SQL statement to dynamically modify a table schema. For example, you can add or remove a column, or create a materialized view. While the schema of a table is being modified, you can import data to and query the table as usual.

Scenarios

StarRocks can meet diversified data analysis requirements of enterprise users in the following scenarios:

OLAP multi-dimensional analysis

- User behavior analysis
- User persona analysis, tag analysis, and target user identification
- High-dimensional business metric reporting
- Self-service reporting platform
- Business problem identification and analysis
- Cross-theme business analysis
- Financial reporting
- System monitoring analysis

Real-time data warehouses

- Data analysis for e-commerce promotion activities
- Result analysis for live streaming in the education industry
- Waybill analysis in logistics
- Performance analysis and metric calculation in the financial industry
- Advertising analysis
- Cockpit management
- Application performance management (APM)

High-concurrency queries

- Report analysis for advertisers
- Analysis of sales channel-related personnel in retailing
- Client-based reporting in the software as a service (SaaS) industry
- Multi-page analysis on dashboards

Unified analysis

In some scenarios, an all-in-one system is required to provide various features, such as multi-dimensional analysis, high-concurrency queries, pre-computing, real-time analysis, and ad hoc queries. The system is expected to reduce architecture complexity, requirements on technology stacks, and costs in development and O&M.