Deploying Kubeflow Pipelines on Alibaba Cloud

By Bi Ran

Overview

Machine learning projects are complex because they combine the usual difficulties of software development with a dependence on data. This creates further challenges, such as long workflows, inconsistent data versions, difficulty tracing experiments and reproducing their results, and high model iteration costs. To address these issues, many enterprises have built internal machine learning platforms to manage the machine learning lifecycle, such as Google's TensorFlow Extended, Facebook's FBLearner Flow, and Uber's Michelangelo.

However, these platforms depend on the internal infrastructure of the enterprises that built them, so they cannot be completely open-sourced. At their core, these platforms are built on a machine learning workflow framework. Such a framework lets data scientists define their own machine learning workflows flexibly and reuse existing data processing and model training capabilities to manage the machine learning lifecycle.

Google has substantial experience in building machine learning workflow platforms. TensorFlow Extended (TFX) is a production-scale machine learning platform that supports Google's core businesses, including search, translation, and video. More importantly, TFX improves the efficiency of creating machine learning projects.

The Kubeflow team at Google made Kubeflow Pipelines (KFP) completely open-source at the end of 2018. KFP adopts the design of Google TensorFlow Extended. The main difference between KFP and TFX is that KFP runs on Kubernetes and TFX runs on Borg.

What Is Kubeflow Pipelines

The Kubeflow Pipelines platform consists of the following components:

A console for running and tracing experiments.
The workflow engine Argo for scheduling multi-step machine learning workflows.
An SDK for defining workflows. Currently, the SDK only supports Python.

You can use Kubeflow Pipelines to achieve the following goals:

End-to-end orchestration: enables and simplifies the orchestration of machine learning pipelines. Pipelines can be triggered directly, at a scheduled time, by event, or even by data changes.
Easy experiment management: makes it easy for you to try numerous ideas and techniques and manage your experiments. Kubeflow Pipelines also makes the transition from experiments to production much easier.
Easy re-use: enables you to re-use components and pipelines to quickly create end-to-end solutions without the need to rebuild experiments each time.

Deploy Kubeflow Pipelines on Alibaba Cloud

You may want to get started with Kubeflow Pipelines after learning all of its features. To use Kubeflow Pipelines, you must overcome the following challenges:

1. Pipelines are deployed by using Kubeflow. However, Kubeflow has many built-in components and it is complex to use Ksonnet to deploy Kubeflow.

2. Pipelines depend on the Google cloud platform. They cannot run on other cloud platforms or bare metal instances.

The Alibaba Cloud Container Service team has provided a Kubeflow Pipelines deployment solution based on Kustomize for users in China. Unlike basic Kubeflow services, Kubeflow Pipelines depends on stateful services such as MySQL and Minio. Therefore, data persistence and backup are required. In this example, Alibaba Cloud SSD cloud disks are used to store MySQL and Minio data.

You can also deploy the latest version of Kubeflow Pipelines on Alibaba Cloud without deploying other services.

Prerequisites

You must install Kustomize.

If your operating system is Linux or macOS, run the following commands:

opsys=linux  # or darwin for macOS
curl -s https://api.github.com/repos/kubernetes-sigs/kustomize/releases/latest |\
  grep browser_download |\
  grep $opsys |\
  cut -d '"' -f 4 |\
  xargs curl -O -L
mv kustomize_*_${opsys}_amd64 /usr/bin/kustomize
chmod u+x /usr/bin/kustomize
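
Instead of setting opsys by hand, the value can be derived from the local kernel name. The following is a minimal sketch under that assumption; detect_opsys is a hypothetical helper name and is not part of Kustomize.

```shell
# Map the kernel name reported by `uname -s` to the suffix used in the
# download commands above (linux or darwin).
# detect_opsys is an illustrative name, not part of Kustomize.
detect_opsys() {
  # An optional argument allows the mapping to be tried for another
  # platform; by default the local kernel name is used.
  kernel="${1:-$(uname -s)}"
  case "$kernel" in
    Linux)  echo linux ;;
    Darwin) echo darwin ;;
    *)      echo "unsupported: $kernel" >&2; return 1 ;;
  esac
}
```

With this helper defined, the first line of the commands above could be written as opsys=$(detect_opsys).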

If your operating system is Windows, download kustomize_2.0.3_windows_amd64.exe.

You also need a Kubernetes cluster in Alibaba Cloud Container Service. For more information about creating one, see the Alibaba Cloud Container Service documentation.

Procedure

1. Connect to the Kubernetes cluster through SSH. For more information, see the Alibaba Cloud Container Service documentation.

2. Download the source code.

yum install -y git
git clone --recursive https://github.com/aliyunContainerService/kubeflow-aliyun

3. Configure security settings.

3.1 Configure a TLS certificate. If you do not have a TLS certificate, run the following commands to generate one:

yum install -y openssl
domain="pipelines.kubeflow.org"
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.key -out kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.crt -subj "/CN=$domain/O=$domain"

If you have a TLS certificate, upload the private key and certificate to kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.key and kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.crt, respectively.

3.2 Set a password for the admin account.

yum install -y httpd-tools
htpasswd -c kubeflow-aliyun/overlays/ack-auto-clouddisk/auth admin
New password:
Re-type new password:
Adding password for user admin

4. Use Kustomize to generate a YAML configuration file.

cd kubeflow-aliyun/
kustomize build overlays/ack-auto-clouddisk > /tmp/ack-auto-clouddisk.yaml
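
Before editing the generated file, it can help to confirm that Kustomize actually rendered the expected resources. The following is a minimal sketch, not part of the kubeflow-aliyun repository: summarize_kinds is a hypothetical helper that counts the top-level kind: entries in a rendered manifest.

```shell
# Count the resources in a rendered manifest, grouped by kind.
# summarize_kinds is an illustrative helper name.
summarize_kinds() {
  # Only "kind:" lines that start in the first column are counted;
  # indented (nested) occurrences are ignored.
  awk '/^kind:/ { count[$2]++ } END { for (k in count) print count[k], k }' "$1" |
    sort -k2
}
```

For example, summarize_kinds /tmp/ack-auto-clouddisk.yaml prints one line per resource kind with its count. An empty result suggests that the build step did not produce a usable file.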

5. Check the region and zone of the Kubernetes cluster, and replace the region ID and zone ID in the configuration file with those of the cluster. For example, if your cluster is in the cn-hangzhou-g zone, run the following commands:

sed -i.bak 's/regionid: cn-beijing/regionid: cn-hangzhou/g' \
    /tmp/ack-auto-clouddisk.yaml

sed -i.bak 's/zoneid: cn-beijing-e/zoneid: cn-hangzhou-g/g' \
    /tmp/ack-auto-clouddisk.yaml

We recommend that you check whether the /tmp/ack-auto-clouddisk.yaml configuration file is updated.
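
One way to perform this check is to search the file for the regionid and zoneid keys that the sed commands above edit. The following is a minimal sketch; check_region_zone is a hypothetical helper name, and it assumes that each key and its value are on the same line.

```shell
# Report whether every regionid/zoneid entry in the manifest carries
# the expected value. check_region_zone is an illustrative name.
check_region_zone() {
  file="$1"; region="$2"; zone="$3"
  # Collect lines that contain either key but not the expected value.
  bad=$(grep -nE 'regionid:|zoneid:' "$file" |
        grep -vE "regionid: *\"?$region\"?\$|zoneid: *\"?$zone\"?\$" || true)
  if [ -n "$bad" ]; then
    echo "unexpected region or zone entries:" >&2
    echo "$bad" >&2
    return 1
  fi
  echo "ok"
}
```

For the example cluster, the call would be check_region_zone /tmp/ack-auto-clouddisk.yaml cn-hangzhou cn-hangzhou-g. Note that the helper also prints ok when the file contains neither key, so it only detects leftover values.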

6. Change the container image address from gcr.io to registry.aliyuncs.com.

sed -i.bak 's/gcr\.io/registry.aliyuncs.com/g' \
    /tmp/ack-auto-clouddisk.yaml

We recommend that you check whether the /tmp/ack-auto-clouddisk.yaml configuration file is updated.
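
A simple way to check this step is to list any references to gcr.io that remain in the file. The following is a minimal sketch; find_gcr_images is a hypothetical helper name. In a sed or grep pattern an unescaped dot matches any character, so the dot is escaped here to match only the literal host name.

```shell
# Print any lines that still reference gcr.io, with their line numbers.
# find_gcr_images is an illustrative helper name.
find_gcr_images() {
  # grep exits with status 1 when nothing matches; that is the desired
  # outcome here, so it is not treated as an error.
  grep -n 'gcr\.io' "$1" || true
}
```

If find_gcr_images /tmp/ack-auto-clouddisk.yaml prints nothing, no image address in the file points at gcr.io any longer.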

7. Set the disk space, for example, to 200 GB.

sed -i.bak 's/storage: 100Gi/storage: 200Gi/g' \
    /tmp/ack-auto-clouddisk.yaml
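
Because sed -i.bak keeps the previous version of the file with a .bak suffix, the effect of the most recent edit can be reviewed with diff. The following is a minimal sketch; show_last_edit is a hypothetical helper name.

```shell
# Show what the most recent `sed -i.bak` run changed by comparing the
# backup it left behind with the edited file.
# show_last_edit is an illustrative helper name.
show_last_edit() {
  # diff exits with status 1 when the files differ, which is the
  # expected case here, so that status is not treated as an error.
  diff -u "$1.bak" "$1" || true
}
```

Each sed -i.bak run overwrites the backup, so show_last_edit /tmp/ack-auto-clouddisk.yaml shows only the change made by the last command, not the changes from earlier steps.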

8. Verify the Kubeflow Pipelines YAML configuration file.

# kubectl 1.18 and later use --dry-run=client; earlier versions use --dry-run=true
kubectl create --validate=true --dry-run=client -f /tmp/ack-auto-clouddisk.yaml

9. Use kubectl to deploy the Kubeflow Pipelines service.

kubectl create -f /tmp/ack-auto-clouddisk.yaml

10. Query the Ingress to obtain the connection information of the Kubeflow Pipelines service. In this example, the IP address of the Kubeflow Pipelines service is 112.124.193.271. The connection URL of the Kubeflow Pipelines console is https://112.124.193.271/pipeline/.

kubectl get ing -n kubeflow
NAME             HOSTS   ADDRESS           PORTS     AGE
ml-pipeline-ui   *       112.124.193.271   80, 443   11m
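
The console URL can also be assembled from this output in a script. The following is a minimal sketch; console_url is a hypothetical helper name. It reads the table on standard input, locates the ADDRESS column from the header row, and assumes that the address has already been assigned, because an empty ADDRESS field would shift the columns.

```shell
# Build the console URL from the tabular output of
# `kubectl get ing -n kubeflow`, read on standard input.
# console_url is an illustrative helper name.
console_url() {
  # The first awk rule finds the index of the ADDRESS column in the
  # header row; the second prints the URL for the ml-pipeline-ui row.
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "ADDRESS") col = i; next }
    $1 == "ml-pipeline-ui" && col { print "https://" $col "/pipeline/" }
  '
}
```

For example, kubectl get ing -n kubeflow | console_url prints the URL to open in a browser.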

11. Log on to the Kubeflow Pipelines console.

If a self-signed certificate is used, the browser warns that the connection is not private. Click Advanced, and then choose to proceed to the website.

Enter the username admin and the password that you set in step 3.2.

You can then manage and run training tasks in the Kubeflow Pipelines console.

FAQ

1. Why are Alibaba Cloud SSD cloud disks used in this example?

With Alibaba Cloud SSD cloud disks, you can periodically back up data to prevent Kubeflow Pipelines metadata loss.

2. How can I back up a cloud disk?

If you want to back up the data stored in a cloud disk, you can manually create snapshots for the cloud disk or apply an automatic snapshot policy to it to automatically create snapshots on schedule.

3. How can I remove the Kubeflow Pipelines deployment?

Complete the following tasks to remove the Kubeflow Pipelines deployment:

Delete the Kubeflow Pipelines components.

kubectl delete -f /tmp/ack-auto-clouddisk.yaml

Use the release cloud disks function to release the MySQL and Minio cloud disks.

4. How can I use an existing cloud disk to store data if I do not want the system to automatically create a cloud disk for me?

For more information, click here.

Summary

This document introduces the background of Kubeflow Pipelines, the major issues that Kubeflow Pipelines resolves, and the procedure of using Kustomize to deploy Kubeflow Pipelines for machine learning on Alibaba Cloud. To learn more, see the document about how to use Kubeflow Pipelines to develop a machine learning workflow.
