An Innovative Paper on Alibaba Cloud Serverless Scheduling from ACM SoCC

Recently, a paper on serverless scheduling written by the Alibaba Cloud Function Compute product team was accepted by the ACM Symposium on Cloud Computing (SoCC).

Last year, the Alibaba Cloud Function Compute team proposed a decentralized fast image distribution technology for the FaaS scenario. That paper was accepted by USENIX ATC'21, a top conference in the field of computer systems that is on the list of Class A international conferences recommended by the China Computer Federation (CCF). This year, the team's paper on a scheduling algorithm based on function profiles was accepted by ACM SoCC, the premier international conference on cloud computing. The algorithm maintains performance stability while improving the resource utilization of functions.

The ACM Symposium on Cloud Computing (SoCC) is an academic conference sponsored by the Association for Computing Machinery (ACM) and focused on cloud computing technology. It brings together researchers, developers, users, and practitioners interested in cloud computing. It is the only conference jointly sponsored by the Special Interest Group on Management of Data (SIGMOD) and the Special Interest Group on Operating Systems (SIGOPS). The conference has flourished in recent years and aims to bring scholars from the database and computer systems fields together to advance the research and development of cloud computing technology.

The accepted paper is titled Owl: Performance-Aware Scheduling for Resource-Efficient Function-as-a-Service Cloud.

This paper grew out of Function Compute, the Function-as-a-Service (FaaS) product in Alibaba Cloud's serverless portfolio. Function Compute (FC) is a fully managed, event-driven computing service that lets users focus on writing and uploading code without managing infrastructure such as servers. It prepares computing resources, runs code flexibly and reliably, and provides features such as log query, performance monitoring, and alerting. It currently covers business scenarios such as event-driven processing, audio and video processing, games, IoT, new retail, and AI, and it serves multiple businesses and projects, including Alibaba Cloud, Amap, Alipay, Taobao, and CBU.

The preceding figure shows a classic FaaS scheduling system architecture. The scheduler places function instances onto nodes in the cluster. Because FaaS workloads consist of a large number of functions with small granularity and short execution times, the resource utilization of nodes is low. Simply scheduling more instances onto the same node improves resource utilization to a certain extent, but it also causes resource contention and performance degradation.

To address this problem, the paper proposes a scheduling algorithm based on function profiles that achieves good performance stability while improving resource utilization.

For functions called at high frequency, the scheduler profiles how different function instances perform when they are co-located on the same node and uses these profiles to guide instance placement.
For functions called at low frequency, the scheduler measures the actual resource consumption during execution to guide instance placement. At the same time, it monitors the execution latency of the function and, when latency rises, mitigates the degradation by isolating the instance.
The scheduler also migrates idle instances from nodes with low utilization to those with high utilization to free up idle nodes.
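
The three behaviors above can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the `Node` class, the `INTERFERENCE` profile table, the memory figures, the `MAX_SLOWDOWN` limit, and the `HOT_THRESHOLD` cutoff between high-frequency and low-frequency functions are all assumptions invented for the example.

```python
from dataclasses import dataclass, field

# Hypothetical collocation profile: predicted slowdown factor when two
# functions share a node (1.0 means no interference). Values are made up.
INTERFERENCE = {
    frozenset(["resize", "transcode"]): 1.6,
    frozenset(["resize", "notify"]): 1.05,
    frozenset(["transcode", "notify"]): 1.1,
}
MAX_SLOWDOWN = 1.2   # assumed tolerated slowdown for high-frequency functions
HOT_THRESHOLD = 100  # assumed invocations per minute that count as "high frequency"


@dataclass
class Node:
    name: str
    capacity_mb: int
    # Each instance is a tuple: (function name, measured memory in MB, busy flag)
    instances: list = field(default_factory=list)

    def used_mb(self):
        return sum(mb for _, mb, _ in self.instances)

    def functions(self):
        return {fn for fn, _, _ in self.instances}


def predicted_slowdown(fn, node):
    """Worst pairwise slowdown between fn and the other functions on the node."""
    return max(
        (INTERFERENCE.get(frozenset([fn, other]), 1.0)
         for other in node.functions() if other != fn),
        default=1.0,
    )


def place(fn, rate_per_min, measured_mb, nodes):
    """Place one new instance; return the chosen node, or None if nothing fits."""
    # Both paths pack by measured usage rather than by the configured limit.
    candidates = [n for n in nodes if n.used_mb() + measured_mb <= n.capacity_mb]
    if rate_per_min >= HOT_THRESHOLD:
        # High-frequency path: also consult the collocation profile.
        candidates = [n for n in candidates
                      if predicted_slowdown(fn, n) <= MAX_SLOWDOWN]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: n.used_mb())  # prefer the fullest node
    best.instances.append((fn, measured_mb, True))
    return best


def consolidate(nodes):
    """Move idle instances from emptier nodes to fuller ones; return freed nodes."""
    for src in sorted(nodes, key=lambda n: n.used_mb()):
        for inst in [i for i in src.instances if not i[2]]:
            targets = [n for n in nodes
                       if n is not src
                       and n.used_mb() > src.used_mb()
                       and n.used_mb() + inst[1] <= n.capacity_mb]
            if targets:
                dst = max(targets, key=lambda n: n.used_mb())
                src.instances.remove(inst)
                dst.instances.append(inst)
    return [n for n in nodes if not n.instances]
```

In this sketch a high-frequency `resize` instance avoids a node that already runs `transcode` because the assumed profile predicts a 1.6x slowdown, while a low-frequency instance of the same function is packed onto the fullest node that fits. Latency monitoring and isolation for low-frequency functions are left out to keep the example short.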

The paper abstracts ten functions from typical production workloads to evaluate the effect of the algorithm. They cover different programming languages, resource consumption levels, execution durations, and external dependencies.

The experimental results show that the Owl scheduling algorithm saves 43.8% of resources at a scale of 100 nodes without a significant increase in function execution latency.

There is no significant increase in scheduling latency.

The function profiling capability of Owl has been applied to the Function Compute online environment with good results. Acceptance by ACM SoCC marks another innovation for Alibaba Cloud in the field of serverless scheduling.

Attached Paper Information

Title of the Paper:

Owl: Performance-Aware Scheduling for Resource-Efficient Function-as-a-Service Cloud

Authors: Tian Huangshi, Li Suyi, Wang Ao, Wang Wei, Wu Tianlong, Yang Haoran

Abstract: Function-as-a-Service (FaaS) is gaining increasing popularity in cloud computing. All major cloud providers have FaaS platforms. The work commences with our observation that memory and CPU are under-utilized in most FaaS sandboxes. A natural solution is to overcommit VM resources when allocating sandboxes, whereas the ensuing contention may cause performance degradation and compromise user experience. To complicate matters, the degradation in FaaS can arise from external factors, such as failed dependencies of user functions.

We design Owl to achieve both high utilization and performance stability. It introduces a customizable rule system for users to specify their toleration of degradation, and overcommits resources with a dual approach. (1) For less-invoked functions, it allocates resources to the sandboxes with a usage-based heuristic, keeps monitoring their performance, and remedies any detected degradation. It differentiates whether a degraded sandbox is affected externally by separating a contention-free environment and migrating the affected sandbox into there as a comparison baseline. (2) For frequently-invoked functions, Owl profiles the interference patterns among collocated sandboxes and places the sandboxes under the guidance of profiles. The collocation profiling is designed to tackle the constraints that profiling has to be conducted in production. Owl further consolidates idle sandboxes to reduce resource waste. We prototype Owl in our production system and implement a representative benchmark suite to evaluate it. The results demonstrate that the prototype could reduce VM cost by 43.80% and effectively mitigate latency degradation, with negligible overhead incurred.
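
The differentiation step described in the abstract, comparing a degraded sandbox against a contention-free baseline, might look roughly like the sketch below. The rule format (a single `tolerance` multiplier over the historical median latency) and the function names `is_degraded` and `diagnose` are assumptions for illustration; the abstract does not specify what the rule system looks like.

```python
from statistics import median


def is_degraded(baseline_ms, recent_ms, tolerance=1.3):
    """Assumed rule: degraded if the recent median latency exceeds
    tolerance times the function's historical median latency."""
    return median(recent_ms) > tolerance * median(baseline_ms)


def diagnose(baseline_ms, contended_ms, isolated_ms, tolerance=1.3):
    """Classify a sandbox using latencies observed on its shared node
    (contended_ms) and after migration to a contention-free node (isolated_ms)."""
    if not is_degraded(baseline_ms, contended_ms, tolerance):
        return "healthy"
    if is_degraded(baseline_ms, isolated_ms, tolerance):
        # Still slow without neighbors: the cause lies outside the node,
        # for example a failing dependency of the user function.
        return "external"
    # Recovers once isolated: collocation was the cause.
    return "contention"
```

A sandbox that stays slow after isolation is classified as externally affected, so the scheduler would not need to undo its placement decision for that node.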
