Learning about Distributed Systems – Part 18: Run AND Write Fast

Cumbersome MapReduce

In the previous articles, we introduced some methods to improve the performance of distributed computing frameworks. This way, while scalability continues to bring computing power, we can use that computing power more efficiently.

This solves a key problem in big data application development: execution efficiency.

However, another kind of efficiency still bothers developers of big data applications: development efficiency.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The section above shows the code for the MapReduce version of WordCount from the Hadoop website. It is long and complicated for what it does.

And this is only WordCount, about the simplest job there is. Slightly more complicated logic will probably require even more complicated MapReduce programs.

This cost of development and debugging is unacceptable for enterprise production, so we must find a way out.

MapReduce is what it is, but identifying its problems can guide us toward new solutions.

The most fundamental reason the MapReduce version is so complicated lies in its programming paradigm.

There are many programming paradigms (the most common being procedural, object-oriented, and functional). The distinction that matters for MapReduce is the following:

Imperative programming focuses on the process and is more machine-friendly.
Declarative programming focuses on the goal and is more people-friendly.

Imperative programming works at a lower level and in more detail. You have full control, but you also have to bear the complexity that comes with it. Declarative programming gives up control over repetitive work in exchange for freeing up the developer's effort.

MapReduce code is complicated because it is written imperatively. Storm, in the stream processing field, is in the same boat.

However, it should be emphasized that declarative programming is essentially a way of encapsulating specific goals on top of imperative programming. This is because machines only execute instructions; they do not understand goals.

Since the set of goals that can be encapsulated is limited, imperative programming and declarative programming cannot replace each other. Instead, like programming languages, each has its own applicable scenarios.
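The idea that a declarative call is an encapsulated imperative routine can be shown with a minimal sketch. The class and method names below (ParadigmDemo, countImperative, countDeclarative) are hypothetical and only serve the illustration. Collections.frequency is a standard JDK method that still loops over the list internally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; the names are illustrative and not from the article.
final class ParadigmDemo {

    // Imperative: the caller spells out how to count (loop, compare, increment).
    static int countImperative(List<String> words, String target) {
        int count = 0;
        for (String w : words) {
            if (w.equals(target)) {
                count++;
            }
        }
        return count;
    }

    // Declarative at the call site: the caller only states the goal.
    // The loop has not disappeared; it is encapsulated inside the JDK method.
    static int countDeclarative(List<String> words, String target) {
        return Collections.frequency(words, target);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("run", "write", "run");
        System.out.println(countImperative(words, "run"));   // prints 2
        System.out.println(countDeclarative(words, "run"));  // prints 2
    }
}
```

The encapsulated version is shorter, but it only covers the one goal it was written for. Anything outside that goal, such as counting case-insensitively, sends us back to the imperative loop.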

It is clear that we need declarative programming in the scenario of big data application development.

The two most typical branches of declarative programming are DSLs (domain-specific languages) and functional programming. As we mentioned earlier in this series, MapReduce draws on map() and reduce() from functional programming, so why does it now end up as imperative programming?

The reason is that MapReduce programs are written in Java. In Java (at least before Java 8), functions are not first-class citizens, so logic cannot be expressed in a few lines as in a typical functional programming language. Every mapper and reducer has to be spelled out as a class with methods, line by line. There is no way around it, and Java is famous for its verbosity.

Java is also improving. With the introduction of lambda expressions in Java 8, Java has become more functional. In combination with the Stream API, MapReduce-style programming can also be very concise:

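import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

public class WordCount {

  public static void main(String[] args) throws IOException {
      Path path = Paths.get("src/main/resources/book.txt");
      // Split each line on whitespace, normalize each word, then group and count.
      Map<String, Long> wordCount = Files.lines(path).flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
              .map(word -> word.replaceAll("[^a-zA-Z]", "").toLowerCase().trim())
              .filter(word -> word.length() > 0)
              .map(word -> new SimpleEntry<>(word, 1))
              .collect(groupingBy(SimpleEntry::getKey, counting()));

      wordCount.forEach((k, v) -> System.out.println(String.format("%s ==>> %d", k, v)));

  }
}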

Helpful Hive

The idea of driving MapReduce with SQL led to the creation of Hive:

We can extract the following important points from the Hive architecture diagram above:

For servers, Hive consists of HiveServer2 and Hive MetaStore Server.
HiveServer2 receives requests from the client side and parses, plans, optimizes, and executes SQL statements.
Hive MetaStore Server is responsible for managing the metadata stored in MetaStore DB.
For clients, Hive provides support for standard JDBC and ODBC. It also ships Beeline, a JDBC-based command-line tool, as the default client.
HiveServer2 compiles SQL statements from the client into MapReduce tasks and submits them to YARN for execution, processing data stored in HDFS.

In general, Hive can be described as a distributed database:

Data Storage: Although data is stored in HDFS, the metadata (or schema) is defined and managed by Hive. We will talk about it in a future article.
Data Computing: Hive supports the parsing and execution of SQL statements, which is the focus of this article.

We can create the following table for the WordCount example (assuming each line of the files under the location holds a single word):

CREATE TABLE tmp.word_count (
  word string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '<some_hdfs_location>';

Then, the following SQL statement (called HQL in Hive) is all that is needed to get the desired result:

SELECT 
  word, count(*) 
FROM 
  tmp.word_count
GROUP BY 
  word;

There is no doubt that the process is much simpler.

Let's look at how SQL statements become MapReduce Tasks using the EXPLAIN command.

You can clearly see the Map and Reduce phases and the group and count operations. As mentioned earlier, complex logic requires multiple MapReduce tasks. In Hive, most complex logic can be expressed in a single SQL statement. Under the hood, however, that statement is compiled into many stages, and each stage may (but does not necessarily) correspond to one MapReduce task.
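To make the Map and Reduce phases of the GROUP BY query concrete, here is a minimal single-process sketch in plain Java. It is a conceptual model only: it uses no Hive or Hadoop APIs, the class and method names (GroupByAsMapReduce, mapPhase, shuffle, reducePhase) are hypothetical, and the plan Hive actually generates may differ, for example by pre-aggregating on the map side.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch; it does not use Hive or Hadoop APIs.
final class GroupByAsMapReduce {

    // Map phase: emit (word, 1) for every input row.
    static List<Map.Entry<String, Long>> mapPhase(List<String> rows) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : rows) {
            pairs.add(new SimpleEntry<>(word, 1L));
        }
        return pairs;
    }

    // Shuffle: bring all values that share a key (the GROUP BY column) together.
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: aggregate each group; summing the ones yields count(*).
    static Map<String, Long> reducePhase(Map<String, List<Long>> grouped) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            long sum = 0;
            for (long v : group.getValue()) {
                sum += v;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    // The whole "query": map, then shuffle, then reduce.
    static Map<String, Long> run(List<String> rows) {
        return reducePhase(shuffle(mapPhase(rows)));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hive", "spark", "hive")));  // prints {hive=2, spark=1}
    }
}
```

Everything in this sketch is what the one-line SQL statement spares us from writing by hand, which is exactly the gain in development efficiency.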

Besides making the code simpler, Hive provides an interactive environment that eliminates the cumbersome compile, debug, and deploy cycle, which further reduces the development cost.

Hive quickly became the default choice for SQL on Hadoop. It significantly reduced the cost of data processing in Hadoop and was widely adopted. At the same time, Hive attracted analysts who are not strong programmers but are proficient in SQL, which further expanded the popularity of Hadoop.

Spark Can Also Use SQL

In previous articles, we said that MapReduce is slow and therefore introduced Spark. Must we choose between the two, development efficiency and execution efficiency?

We can certainly improve both. As mentioned above, Hive can be divided into (meta) data storage and computing. Since the computing in MapReduce is slow, why don't we perform computing with Spark?

As shown in the figure above, changing the execution engine from MapReduce to Spark is what Shark (another famous product from AMPLab) does. The main developer of Shark is also the core developer of Spark.

Spark is an ambitious project, and the limitations of Hive increasingly hindered Shark's development. So, the Shark team stopped developing Shark and started afresh with Spark SQL, which is built entirely on Spark while retaining compatibility with Hive.

The Hive community did not stop evolving Hive either. Inspired by Shark, it began to promote the Hive on XX projects, in which different execution engines are supported at the bottom layer, including MapReduce, Spark, and Tez. However, Hive on Spark has not developed as well as Spark SQL.

The preceding figure shows the architecture of Spark SQL. The key points are listed below:

Spark SQL provides three access methods: Spark ThriftServer, Spark SQL CLI, and DataFrame API.
Spark ThriftServer is adapted from Hive's Thrift server and supports JDBC, ODBC, and Thrift connections.
Spark SQL CLI is a command-line tool provided by Spark; bin/spark-sql lets you query data interactively. Note: Unlike beeline, this CLI does not connect to Spark ThriftServer but runs as an independent driver. Beeline, by contrast, connects directly to Spark ThriftServer.
The DataFrame/Dataset API can be called by languages (such as Java, Scala, Python, and R) to execute SQL statements.
Both SQL and DataFrame will be parsed and optimized by Catalyst, and metadata is needed. Metadata can be read from Hive or from data files through Data Source API. For example, a Parquet file comes with schema information.
After optimization, Catalyst generates an RDD-based physical execution plan and sends it to Spark Core for execution. Data is read and written through the Data Source API.

It is worth mentioning that Spark SQL's SQL is equivalent to HQL, so you can write SQL statements directly to run interactive queries. The DataFrame/Dataset API, on the other hand, is roughly equivalent to the Lambda + Stream approach shown earlier:

df.groupBy('word').count()

You can also write it as SQL and submit it through the SparkSession (named spark here):

spark.sql("select word, count(*) from tmp.word_count group by word")

TL;DR

The performance optimization in the previous articles has improved execution efficiency, but the problems in development efficiency still plague developers.
The programming in MapReduce is wordy mainly because of the influence of the programming paradigm.
There are many categories of programming paradigms, such as imperative programming, which tells the computer how to do something, and declarative programming, which tells the computer what we want done.
However, MapReduce borrows map() and reduce() from functional programming, so it is declarative in origin; it is mainly dragged down by Java's verbosity.
With the introduction of Lambda, very concise MapReduce-style programs can be written in Java.
Even so, this style of declarative programming is not simple enough. We need a DSL, and SQL is the most representative one.
So, Hive was created. Hive receives SQL statements and translates them into MapReduce tasks, and it quickly became the default SQL-on-Hadoop solution.
The performance of Spark is better than MapReduce. Therefore, with Spark SQL, both development efficiency and execution efficiency are improved.

You may have noticed that SQL can only process data that has a schema, that is, structured data.

For unstructured data, Spark's RDD or Hadoop MapReduce is required for processing.

With Spark SQL for processing structured data and Spark Core (RDD) for processing unstructured data, Spark can replace Hive and MapReduce, respectively. This is also the choice we recommend.

Moreover, the schema information brought by structured data makes it easy for us to process data through SQL and provides the possibility for performance optimization.

In the next article, let's learn about SQL performance optimization.

This is a carefully conceived series of 20-30 articles. I hope to give everyone a core grasp of distributed systems in a storytelling way. Stay tuned for the next one!
