By Yunlei, from Alibaba Cloud Storage Team
The OLAP computing engine is a machine and the programming language that operates that machine. Users can use specific programming languages to tell the computing engine what data needs to be read and what computing need to be performed. There are many programming languages. Anyone can design a programming language and a corresponding compiler to parse it. SQL has been around for nearly half a century. What makes SQL continue to rejuvenate? This article introduces the theoretical basis and various high-end operations of SQL.
The OLAP computing engine is a machine and the programming language that operates that machine. Users can use specific programming languages to tell the computing engine what data needs to be read and what computing need to be performed. There are many programming languages. Anyone can design a programming language and a corresponding compiler to parse it. Programming languages can be classified into imperative ones and declarative ones.
Imperative programming language is the most common programming language. C, C++, and Java are all imperative programming languages. This kind of language tells the machine which instructions should be executed, leaving little room for compiler optimization.
Declarative programming language describes what results a program should achieve and does not care about the details of how to do it. SQL is a declarative programming language. For example, the SQL statement, select count(1) from department where kpi =3.25
, indicates that the number of people with kpi =3.25 should be computed, but it does not specify how to complete it. This leaves a lot of room for subsequent optimizers to operate. The optimizer can make various explorations based on SQL requirements and actual data to find the best executive mode.
A good analysis language should have the following characteristics:
SQL has a simple syntax and clear logic. After understanding the simplest query statement, we can nest multiple layers to express complex logic. SQL is based on relational algebra and has a theoretical foundation to ensure clear semantics without ambiguity. SQL has a long development history, so there are many learning materials for beginners. At the same time, the ecology around SQL is rich, and many tools use SQL for analysis.
In addition to SQL, some software launched custom languages. For example, Elasticsearch launched Lucene syntax, Prometheus launched custom PromQL, and Splunk launched SPL. Every new syntax has a certain learning threshold for new users. Therefore, they are not as widely used as SQL. SQL is the de facto standard for data analysis.
DataModel is used to describe how data is organized in a database. Common models include the relational model, key/value model, graph model, document model, and column-family model. A relational database uses a relational model. Redis uses a key/value model. A graph database uses a graph model. MongoDB uses a document model.
The relational in the relational model is translated into relationship in Chinese. In my opinion, relationship refers to an entity composed of some interrelated attributes. Since there are association relationships between column attributes, it is called a relationship, which refers to the correlation among attributes. This correlation is reflected in: belong to the same row, meet the constraints among columns, meet the constraints among rows, and meet the constraints among different relations. All data forms an organized existence through different constraints.
The database defines relational entities through the relational model to ensure the contents meet certain constraints and provides programming interfaces to read and write the database content. A database contains many correlations, and each correlation is a table with multiple rows and columns. The columns of each row are related and may also define some constraints. The correlation between rows may be constrained by defining the primary key and sorting methods. Correlations can be implemented using foreign keys.
This kind of constraint between columns and rows is more practical in OLTP scenarios because OLTP focuses on the data. Therefore, when it stores data, it pays more attention to the storage form of data. OLAP focuses on data analysis, so these constraints are weakened in data warehouses. Therefore, in data warehouses, we only need to focus on a table with multiple rows and columns. Constraint attributes (such as PK and sorting) are mainly used for data acceleration. The relational model is used as a rigorous theory to provide a theoretical basis for the optimization of executors. However, this name is too convoluted. Therefore, in subsequent articles, when relational model-related theories are involved, the word relational will be used. Otherwise, a table will be used to refer to it.
Relational models and relational algebra are the theoretical foundations of SQL. Algebra is more than the simple mathematical calculations of addition, subtraction, multiplication, and division that we are familiar with. In the computer industry, we have seen many kinds of algebra, such as linear algebra (commonly used in neural networks) and boolean algebra (used in circuits). Shannon brought boolean algebra into logic circuit design, providing a theoretical basis for computer binary calculation. There are numerous types of algebra.
Relational algebra, derived from set algebra, deals with the transformation between sets. It is a series of operations in relational algebra that accept one or two relations as input and produce a new relation as a result. Since the input and output are the same relations, we can concatenate multiple operators to form more complex operators. The operators contained in relational algebra are σ (select, filter out some rows from a relation to form a new relation), Π (projection, filter out some columns from a relation to form a new relation ), ∪ (Union, merge two relations), ∩ (Intersection, take the intersection parts of two relations), – (difference, take the difference set parts of two relations), × (Product, Cartesian product of two relations), ⋈ (Join, join two relations under certain conditions), ρ (Rename, rename the columns in relations), ← (Assignments, name a temporary query into a new relation), δ (Duplicate Eliminating), γ (Aggregation, aggregate some columns and the result forms a new relationship), τ (Sorting, the sorting result forms a new relationship). Commonly used relational operations are defined here, and their names indicate the meaning of their operations, so the details of each operation will not be described here. In the syntax parsing and optimizer phase, we will come across relational algebra again, and with the help of the theoretical basis of relational algebra, we will do some transformations on the syntax tree. Here, we only need to know about these operations on relational algebra and check how to express these operations in SQL syntax later.
The SQL syntax contains several categories of functions:
Data Manipulation Language (DML) is used to add, delete, modify, and query data.
Data Definition Language (DDL) is used to define the format of a table.
Data Control Language (DCL) is used to control permissions.
Although DML and DCL are the basic functions of the SQL system, the focus of this article is more on the technology of data processing and how to speed it up, so more attention is paid to DDL. In DDL, there are also functions like addition, deletion, modifications, and query. Among them, this article pays more attention to query, which means how to speed up data reading and computing. The optimization of data writing and storage also aims to speed up data computing.
SQL stands for Structured Query Language. SQL syntax is simple and easy to learn and use. Thus, it is the most common language in the data analysis field. SQL is an operational tool for data analysis. For users, SQL represents the operational semantics of users, but for programs, they only receive a string. If the program wants to understand the meaning of SQL, it should go through lexical analysis, syntax analysis, and semantic analysis to construct an abstract syntax tree. Lexical analysis and syntax analysis are basic operations. The compilation principle course of computers in university should include this part. The mode of lexical analysis and syntax analysis is fixed, so it is not helpful to improve the computing speed. However, as an essential first link in the OLAP engine, it is necessary to make a brief introduction to lexical analysis and syntax analysis, which helps understand the query plan and optimizer in the subsequent chapters. Since the focus of this article is on computing speed, the brief introduction will not take up too much space in this chapter.
Developers can develop custom analysis language. It can also be seen as a language as long as the language conforms to certain rules that embrace unambiguity and semantic integrity. However, it is difficult to develop a new language. Most new languages use procedural programming, where each phrase represents a simple operation, or use a declarative pipeline syntax, where each part represents input, computing, and output. It is difficult to define a syntax that can be infinitely expanded without ambiguity. It is not comparable to SQL in terms of semantic integrity. Whether you are developing a new language or adopting SQL, the process is similar to the following figure. The OLAP engine parses SQL, generates abstract syntax trees, and converts them into logical execution plans. After optimization, high-performance operator combinations are generated. This is the process of compilation and optimization.
Figure 2-1 – Program Compilation and SQL Compilation
Before we understand compilation, let's first understand the structure definition of SQL. SQL revolves around relations. It can define various operations on relation and operations of multiple relations.
The objects of SQL operations are structured data. The basic syntax and nested extensions of SQL all revolve around relation. A relation can be thought of as a table consisting of multiple rows and columns in a database. An SQL accepts input from one or more relations and outputs a relation. When you nest a query, the inner query outputs an intermediate relation and serves as the input relation to the outer query, similar to the pipeline syntax in Linux command lines. Hereinafter, table is used to indicate relationship.
In the next part, we will focus on various SQL Syntax.
One Way to Solve Development Environment Problems: A Guide to Remote Container Development
1,042 posts | 256 followers
FollowAlibaba Cloud Community - June 28, 2023
ApsaraDB - September 30, 2021
Alibaba Cloud Community - December 27, 2022
ApsaraDB - October 24, 2022
ApsaraDB - December 30, 2021
ApsaraDB - March 26, 2024
1,042 posts | 256 followers
FollowAn on-demand database hosting service for SQL Server with automated monitoring, backup and disaster recovery capabilities
Learn MoreBuild a Data Lake with Alibaba Cloud Object Storage Service (OSS) with 99.9999999999% (12 9s) availability, 99.995% SLA, and high scalability
Learn MoreTransform your business into a customer-centric brand while keeping marketing campaigns cost effective.
Learn MorePlan and optimize your storage budget with flexible storage services
Learn MoreMore Posts by Alibaba Cloud Community
Dikky Ryan Pratama July 1, 2023 at 3:52 pm
awesome!