Alibaba Cloud introduced the deep learning tool Arena to the open-source community in July 2018. Now, data scientists can run deep learning on the cloud without having to learn to manipulate low-level IT resources. They can start a deep learning task within a minute, and create a heterogeneous computing cluster within fifteen minutes.
Today, KubeFlow is the most popular deep learning solution within the Kubernetes community, so isn't Arena just reinventing the wheel? KubeFlow is a composable, portable, and scalable machine learning technology stack built on Kubernetes. It is an end-to-end solution that runs from development in JupyterHub and model training with TFJob to prediction with TF Serving and Seldon. However, KubeFlow requires a mastery of Kubernetes. For example, writing a YAML file to deploy a TFJob is quite challenging for data scientists, who are the primary users of a machine learning platform.
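To give a sense of what a hand-written TFJob involves, here is a minimal sketch that builds a small TFJob manifest as a Python dictionary and prints it as JSON, which Kubernetes also accepts in place of YAML. The `apiVersion` value, the image, and the training command are assumptions for illustration; the API version in particular depends on which KubeFlow release is installed in the cluster.

```python
import json


def tfjob_manifest(name, image, command, workers=2, ps=1):
    """Build an illustrative TFJob manifest as a plain dict (not exhaustive)."""

    def replica(count):
        # Each replica type wraps an ordinary Kubernetes pod template.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        {
                            # TFJob expects the training container to be named "tensorflow".
                            "name": "tensorflow",
                            "image": image,
                            "command": command,
                        }
                    ]
                }
            },
        }

    return {
        # Assumption: the API version varies with the installed KubeFlow release.
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "PS": replica(ps),
                "Worker": replica(workers),
            }
        },
    }


# Hypothetical job name, image, and script path, used only for illustration.
manifest = tfjob_manifest(
    name="mnist-dist",
    image="tensorflow/tensorflow:1.5.0-devel-gpu",
    command=["python", "/app/main.py"],
)
print(json.dumps(manifest, indent=2))
```

Even this stripped-down version omits GPU resource limits, volumes for training data, and TensorBoard, all of which add further Kubernetes-specific fields.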
Such tasks diverge from the expectations of data scientists, who care only about three things:
Data scientists are used to, and enjoy, writing a few simple scripts and running machine learning code on their desktops. However, limited disk space caps the amount of data they can process, and without distributed training their computing power is limited as well.
This is why we developed Arena. This command line tool shields data scientists from the complexities of low-level resources, environment administration, task scheduling, and GPU scheduling and assignment. Arena lets them submit training tasks and check training progress in the straightforward way they already know. When data scientists call Arena, they can specify the data source, the code to download, and whether to use TensorBoard to check training results.
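As a rough illustration of such a call, the sketch below assembles an `arena submit tf` command line in Python without running it. The flag names (`--name`, `--gpus`, `--image`, `--syncMode`, `--syncSource`, `--data`, `--tensorboard`, `--logdir`) are taken from early versions of the Arena README as best recalled and may differ in the installed release, so treat them as assumptions and confirm them with `arena submit tf --help`. The repository URL, volume name, and script path are hypothetical.

```python
import shlex


def arena_submit_tf(name, image, sync_source, entrypoint,
                    gpus=1, data=None, tensorboard=False, logdir=None):
    """Assemble (but do not run) an `arena submit tf` command as an argv list.

    Flag names are assumptions based on early Arena documentation.
    """
    argv = [
        "arena", "submit", "tf",
        f"--name={name}",
        f"--gpus={gpus}",
        f"--image={image}",
        "--syncMode=git",               # fetch the training code from a git repository
        f"--syncSource={sync_source}",
    ]
    if data:
        # Data source, written here as "<volume name>:<mount path>".
        argv.append(f"--data={data}")
    if tensorboard:
        argv.append("--tensorboard")
        if logdir:
            argv.append(f"--logdir={logdir}")
    # The training command itself is passed as the final argument.
    argv.append(entrypoint)
    return argv


# Hypothetical values, used only for illustration.
command = arena_submit_tf(
    name="tf-mnist",
    image="tensorflow/tensorflow:1.5.0-devel-gpu",
    sync_source="https://example.com/your-team/training-code.git",
    entrypoint="python code/training-code/main.py --max_steps 100",
    data="training-data:/data",
    tensorboard=True,
    logdir="/training_logs",
)
print(shlex.join(command))  # shlex.join requires Python 3.8+
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where Arena is installed and configured for the cluster.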
Arena currently supports standalone training and distributed training in the PS-Worker (parameter server and worker) mode. On the backend, it relies on the TFJob provided by KubeFlow. Support for MPIJob and PyTorchJob will be added soon.
It also supports real-time training operations and maintenance including:
In the future, we hope to provide a deep learning production line through Arena that covers the whole process: integrated training data management, experiment management, model development, continuous training, evaluation, and online prediction.
The goal of Arena is to allow data scientists to unleash the power of KubeFlow as easily as training on a desktop, while also giving them control over cluster-level scheduling and administration. We have published our source code on GitHub to better share and cooperate with the open-source community: https://github.com/AliyunContainerService/arena. Everybody is welcome to check it out and use it. If you like it, please star it. We also welcome your contributions to the code.
The open-source tool Arena grew out of Alibaba Cloud's Deep Learning Solution. That solution already supports many deep learning frameworks (such as TensorFlow, Caffe, Horovod, and PyTorch), and it covers the whole deep learning production line from start to finish (including integrated training data management, experiment management, model development, continuous training and evaluation, and online prediction).
This solution deeply integrates the resources and services of Alibaba Cloud. It makes efficient use of heterogeneous resources such as CPUs and GPUs, centralizes containerization, orchestration, and management, and provides monitoring, alerting, and a platform for operation and maintenance.
Zhang Kai, a senior technical solution architect at Alibaba Cloud, said:
"Deep learning has brought about a revolutionary leap in the development of artificial intelligence, yet it has also sharply increased our reliance on computing and data resources. Alibaba Cloud provides end-to-end support for large-scale training, and we are continuously polishing this deep learning solution to make it easier to use and give it more powerful features."