By Xin Xianyin (Xinyong)
When asked about the relationship or similarity between MapReduce and Spark, you may say that both are big data processing engines. What about Spark and TensorFlow? You may not have a clear answer to this question; after all, they focus on different fields. Then what about Spark and MPI? That may seem even harder to answer. Tedious as these questions may look, the systems do share a common characteristic, and that commonality is the broad topic we are going to discuss today: distributed computing frameworks.
MapReduce, Spark, and TensorFlow all use distributed computing capabilities to perform certain calculations and solve specific problems. From this perspective, each of them defines a distributed computing model, that is, a way of computing over large amounts of data across many machines. However, the models they put forward differ. MapReduce, as its name implies, is a basic map-reduce model. Spark defines the RDD model, which is essentially a DAG composed of map-style and reduce-style operations. The TensorFlow computing model is also a graph, but a more complex one than Spark's: you need to define each node and edge, and these definitions specify how TensorFlow computes the graph. Such specific definitions, which effectively describe a neural network, make TensorFlow well suited to one particular class of computations, just as Spark's RDD model makes it well suited to parallel processing of data without dependencies between partitions. Is it possible to implement a general-purpose, simple, and high-performance distributed computing model? In my opinion, it is very difficult. "General-purpose" usually means the performance cannot be optimized for specific circumstances, while a distributed framework written for specific tasks is neither general-purpose nor simple.
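To get a concrete feel for the RDD model mentioned above, here is a minimal PySpark sketch (my own illustration, assuming a local Spark installation; not code from any of the cited projects): the familiar word count is just a small DAG of map-style and reduce-style transformations.

from pyspark import SparkContext

# Build a tiny RDD pipeline; each transformation adds a node to the DAG,
# and nothing runs until an action (collect) is called.
sc = SparkContext("local[*]", "rdd-dag-sketch")
lines = sc.parallelize(["a b a", "b c"])
counts = (lines.flatMap(lambda line: line.split())   # map-style stage
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))     # reduce-style stage
print(counts.collect())                              # [('a', 2), ('b', 2), ('c', 1)] in some order
sc.stop()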
By the way, a distributed computing model also involves scheduling. Although scheduling does not receive as much attention, it is an essential part of a distributed computing engine. MapReduce is scheduled by YARN, Spark uses a built-in scheduler, and the same is true for TensorFlow. What about MPI? MPI has almost no scheduling: it assumes that cluster resources are already available and relies on SSH to launch all the tasks. In fact, scheduling can be divided into resource scheduling and task scheduling. A resource scheduler requests hardware resources from a resource manager, and a task scheduler then sends the tasks of the computation graph to those remote resources. This is what is called two-phase scheduling. In recent years, many projects, for example TensorFlowOnSpark, have been built on this idea: they use the resource scheduling of Spark and the computing model of TensorFlow.
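To illustrate the two-phase idea, here is a toy sketch (the classes and method names are made up for illustration; this is not the API of any of the systems above): one component acquires resources, the other places the tasks of a computation graph onto them.

# Toy illustration of two-phase scheduling; names are hypothetical.
class ResourceScheduler:
    def allocate(self, num_workers):
        # Phase 1: request containers/hosts from a resource manager (e.g. YARN).
        return ["worker-%d" % i for i in range(num_workers)]

class TaskScheduler:
    def dispatch(self, dag_tasks, workers):
        # Phase 2: place the tasks of the computation graph onto the acquired workers.
        return {task: workers[i % len(workers)] for i, task in enumerate(dag_tasks)}

workers = ResourceScheduler().allocate(3)
print(TaskScheduler().dispatch(["map-0", "map-1", "map-2", "reduce-0"], workers))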
After we write a standalone application and run into data volume issues, we naturally wonder whether we can run it in a distributed environment. It would be great if the standalone application could be made distributed with no or only a few modifications. However, the truth hurts. Generally, we need to write the distributed version of an ordinary application by hand, use frameworks like MPI to control data distribution and aggregation, and implement fault tolerance for failed tasks ourselves (usually there is none). If the target task is to batch process a collection of data, we can use the APIs predefined in MapReduce or Spark; for such tasks, the computing framework has already implemented the scaffolding that is irrelevant to the business logic for us. Similarly, if the task is to train a neural network, we can simply use frameworks such as TensorFlow or PyTorch. What this paragraph tries to convey is that if an existing framework can solve our problem, we should use it. But what if no existing framework can? Besides implementing a proper framework ourselves, do we have any other options?
Today, I noticed a project called Ray. The project claims that you only need to modify your code slightly to turn your standalone application into a distributed one. In fact, the project was released quite a while ago; I simply had not paid much attention to it. Of course, the code is limited to Python. For example:
**Basic Python**

import time

# Execute f serially.
def f():
    time.sleep(1)
    return 1

results = [f() for i in range(4)]

**Distributed with Ray**

import time
import ray

# Execute f in parallel.
@ray.remote
def f():
    time.sleep(1)
    return 1

ray.init()
results = ray.get([f.remote() for i in range(4)])
Is it really this simple? It reminds me of OpenMP (not OpenMPI). Consider the following example:
#include <cstdlib>
#include <iostream>
#include <omp.h>
using namespace std;

int main() {
    // Each iteration may be executed by a different thread.
    #pragma omp parallel for
    for (int i = 0; i < 10; ++i) {
        cout << "Test" << endl;
    }
    system("pause");
    return 0;
}
This code segment is executed in parallel simply by including the header file and adding one preprocessing directive. Of course, OpenMP is not distributed: the compiler turns the annotated code segment into multi-threaded code, and everything still runs inside a single process. Therefore, the degree of parallelism is limited by the number of hardware threads of the CPU. A dual-thread CPU only supports roughly 2× acceleration, while a server CPU with 32 hardware threads enables up to 32× acceleration (for the parallelized part). But that is not the point here. You may find that Ray's approach is somewhat similar to OpenMP's: to run code in parallel, you do not have to make many code modifications. This is particularly true for OpenMP, because a compiler that does not support it simply treats the pragma as a comment.
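For comparison, here is a local-only analogue using just the Python standard library (my own sketch, unrelated to Ray or OpenMP): the achievable speedup is capped by the core count of a single machine, which is exactly the limitation described above.

import os
import time
from multiprocessing import Pool

def f(_):
    time.sleep(1)
    return 1

if __name__ == "__main__":
    print("cores available:", os.cpu_count())
    with Pool() as pool:                 # defaults to os.cpu_count() workers
        results = pool.map(f, range(4))  # runs on this machine only
    print(results)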
How does Ray implement this? Essentially, Ray defines a few APIs that resemble the communication primitives in MPI. When you use Ray, these API calls are woven into your code, so the resulting program is a mix of user code and Ray framework calls, and the whole thing effectively describes a computation graph. You then simply wait for Ray to finish executing the computation graph and return the results. The Ray paper provides an example:
@ray.remote
def create_policy():
    # Initialize the policy randomly.
    return policy

@ray.remote(num_gpus=1)
class Simulator(object):
    def __init__(self):
        # Initialize the environment.
        self.env = Environment()

    def rollout(self, policy, num_steps):
        observations = []
        observation = self.env.current_state()
        for _ in range(num_steps):
            action = policy(observation)
            observation = self.env.step(action)
            observations.append(observation)
        return observations

@ray.remote(num_gpus=2)
def update_policy(policy, *rollouts):
    # Update the policy.
    return policy

@ray.remote
def train_policy():
    # Create a policy.
    policy_id = create_policy.remote()
    # Create 10 actors.
    simulators = [Simulator.remote() for _ in range(10)]
    # Do 100 steps of training.
    for _ in range(100):
        # Perform one rollout on each actor.
        rollout_ids = [s.rollout.remote(policy_id)
                       for s in simulators]
        # Update the policy with the rollouts.
        policy_id = update_policy.remote(policy_id, *rollout_ids)
    return ray.get(policy_id)
The generated computation graph is as follows:
All you need to do is add the proper Ray API calls to your code, and the code becomes a distributed computation graph. By contrast, let's see how a graph is defined in TensorFlow.
import tensorflow as tf

# Define the graph: y = W * x + b, where W and b are variables and x is a placeholder.
x = tf.placeholder(tf.float32)
W = tf.Variable(1.0)
b = tf.Variable(1.0)
y = W * x + b

with tf.Session() as sess:
    tf.global_variables_initializer().run()  # Operation.run
    fetch = y.eval(feed_dict={x: 3.0})       # Tensor.eval
    print(fetch)                             # fetch = 1.0 * 3.0 + 1.0

Output:
4.0
As we can see, TensorFlow explicitly defines the nodes, placeholder variables, and other elements of the graph (all of which are specific kinds of graph nodes), whereas in Ray the graph is defined implicitly. I think the latter is the more natural approach. From the developer's perspective, with TensorFlow it feels like you have to adapt your code to the framework.
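To make the contrast concrete, here is a small sketch of my own (not taken from the Ray paper or documentation) showing how the graph emerges implicitly in Ray: each .remote() call returns an object reference immediately and adds a node to the graph, and ray.get() blocks until the needed part of the graph has been executed.

import ray

ray.init()

@ray.remote
def square(x):
    return x * x

@ray.remote
def total(*values):
    return sum(values)

refs = [square.remote(i) for i in range(4)]  # four independent tasks (graph nodes)
result_ref = total.remote(*refs)             # a task that depends on all of them
print(ray.get(result_ref))                   # 0 + 1 + 4 + 9 = 14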
Is Ray the general-purpose, simple, and flexible distributed computing framework that we have been looking for? It is hard to say, because I do not have much experience using Ray. According to the official documentation, the handful of APIs is simple enough; whether these APIs alone can deliver generality and flexibility is another question. In a sense, the graph defined in TensorFlow is also general enough, yet TensorFlow is not a general-purpose computing framework. Some difficulties do not lie in the framework but in the problems themselves: making them distributed is inherently hard. Therefore, trying to solve a standalone problem by looking for a general-purpose distributed computing framework may be a false premise.
That is getting far off topic. What if Ray does allow us to easily run applications in a distributed manner? Recently, Databricks released Koalas, a new open-source project that tries to parallelize Pandas on top of Spark. Like Spark, Pandas targets data analysis scenarios, and the two share similar underlying storage structures and concepts, so making Pandas distributed on top of Spark is feasible. In my opinion, if Ray were easy to use and simple enough, adding some Ray API calls to Pandas would be less time-consuming and more cost-effective than developing Koalas. However, adding Ray API calls to Pandas would bind Pandas to Ray, even in standalone environments, because Ray is not like OpenMP, whose directives have no effect on code execution when they are not supported.
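As a rough illustration of why Koalas is attractive, here is a small sketch (assuming the databricks-koalas package and a Spark runtime are installed) of how closely it mirrors the Pandas API:

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})   # local, single machine
kdf = ks.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})   # backed by Spark

print(pdf["x"].sum())   # computed locally by Pandas
print(kdf["x"].sum())   # computed by Spark executors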
These paragraphs may seem verbose. What I want to say is that we should think about what a distributed computing framework is, how each framework is designed, what it can do, and what its advantages and disadvantages are. To end this article, I want to mention an expert's view on the topic. In the talk "A New Golden Age for Computer Architecture", David Patterson argued that general-purpose hardware is approaching its limits and that, to achieve higher efficiency, we need to design domain-specific architectures. In the current era, new computing architectures keep emerging, each developed to solve problems in a specific field and carrying special optimizations for those problems. For users, universality is not a way of solving problems; it is more like the wishful thinking of framework designers. Users always care about their own domain. From this perspective, domain-specific architectures point in the right direction.
Note: The content in this article is based on my personal opinion and may be inaccurate. Comments and feedback are appreciated.