The Kuaishou container cloud platform is designed to provide large-scale, container-based infrastructure services for Kuaishou's growing, changing, and diversified businesses. To achieve this goal, Kuaishou engineers need to solve challenges such as elasticity, stability, efficiency, and serverless architecture. Among these challenges, the stability and efficiency of image distribution is one of the most difficult.
To ensure stable and efficient image distribution on the Kuaishou container cloud platform, the Kuaishou Container Cloud Technical Team worked with Alibaba Cloud and Ant Group to adapt and optimize implementation solutions in OpenAnolis. Dragonfly and its sub-project Nydus proved to be the most suitable solution. This solution provides excellent compatibility with existing systems, a smooth transition from existing capabilities, and significant efficiency gains in service delivery.
After Dragonfly was rolled out, the entire cluster formed a P2P distribution network in which all nodes help relieve network bandwidth pressure on the centralized Harbor registry. The network bandwidth pressure on Harbor is reduced by more than 70% on average, and the peak pressure is reduced by more than 80%. The image distribution system has become more stable, reliable, and efficient, and it can support a larger number of concurrent image pull requests. The image registry is no longer a bottleneck under highly concurrent pulls, especially in DaemonSet deployments and in updates of critical services with large numbers of instances.
Dragonfly: https://github.com/dragonflyoss/Dragonfly2
Nydus: https://github.com/dragonflyoss/image-service
containerd: https://github.com/containerd/containerd
Harbor: https://github.com/goharbor/harbor
Over 80% Peak Mitigation | Over 90% Time Savings in Image Pulling | 50% Time Savings in Pod Instance Service
"In Kuaishou, Dragonfly has effectively solved the problem of mass file distribution," said Wu Hongbin (Head of the Kuaishou Integrated Operation Platform).
Founded in 2011, Kuaishou is China's first short video platform. It provides services to 1 billion users worldwide every month, including more than 180 million users overseas. Its global footprint has rapidly expanded to Latin America, the Middle East, and Southeast Asia. On Kuaishou, any user can record and share their life and show their talents through short videos and livestreaming. Kuaishou works closely with content creators and enterprises. It is mainly engaged in the operation of content communities and social platforms, providing livestreaming services, online marketing services, e-commerce, entertainment, online knowledge sharing, and other value-added services. With the rapid growth of Kuaishou's business and tens of thousands of key services and middleware running on the Kuaishou container cloud platform, the stability and efficiency of the image distribution system have become increasingly important.
For the upgrade and transformation of Kuaishou's image distribution system, the biggest challenges are alleviating peak pressure on the image registry, accelerating image pulling, making service distribution seamless, and keeping the business as unaffected by system changes as possible. Kuaishou container cloud platform engineers found that Nydus is deeply integrated with the Dragonfly system and supports traditional OCI images. It provides fast, stable, secure, and convenient access to container images in a compatible manner. It adapts easily to the existing work of the container cloud platform and allows a smooth transition from the existing image usage mode to the new image format. The only thing the platform has to do is switch the container runtime engine from Docker to containerd, because containerd integrates better with Dragonfly. Thanks to the efforts of Kuaishou engineers, the container engine was switched smoothly across a large number of nodes, and both containerd and Dragonfly were quickly and fully adopted.
Dragonfly provides the perfect answer for stable and efficient image distribution. Many important services in Kuaishou need to be scaled up to tens of thousands of instances in a few minutes, for example to meet business expansion requirements during the 818 Shopping Festival or the Double 11 Global Shopping Festival. Such scaling requires thousands of gigabytes of bandwidth if every instance downloads directly from the image repository. In other scenarios, predictive models and search services need to regularly update model parameter files and index files to keep recommendation and retrieval effective, which technically means that hundreds of gigabytes of files must be distributed immediately to each relevant instance.
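To make the scale of this demand concrete, the following back-of-the-envelope sketch estimates the average egress a central registry would have to sustain, and how that changes when a share of the bytes is served by peers instead. The function name, the `peer_hit_ratio` parameter, and all the numbers in the usage lines are illustrative assumptions, not figures from Kuaishou.

```python
def registry_egress_gb_per_s(instances: int, image_gb: float,
                             window_s: float, peer_hit_ratio: float = 0.0) -> float:
    """Average registry egress (GB/s) when `instances` nodes each pull
    `image_gb` gigabytes within `window_s` seconds.

    `peer_hit_ratio` is the assumed fraction of bytes served by other
    peers rather than by the registry (0.0 means no P2P at all).
    """
    if instances < 0 or image_gb < 0 or window_s <= 0:
        raise ValueError("instances and image_gb must be >= 0, window_s > 0")
    if not 0.0 <= peer_hit_ratio <= 1.0:
        raise ValueError("peer_hit_ratio must be between 0 and 1")
    total_gb = instances * image_gb
    return total_gb * (1.0 - peer_hit_ratio) / window_s


# Hypothetical scenario: 20,000 instances, 2 GB image, 5-minute window.
direct = registry_egress_gb_per_s(20_000, 2.0, 300)
with_p2p = registry_egress_gb_per_s(20_000, 2.0, 300, peer_hit_ratio=0.8)
print(f"direct: {direct:.1f} GB/s, with P2P: {with_p2p:.1f} GB/s")
```

The model is deliberately simple: it ignores burstiness and layer deduplication, and only shows that registry load scales linearly with the share of traffic that peers cannot absorb.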
Kuaishou engineers deployed the Dragonfly components (Dfdaemon and Dfget) on all ECS instances to pull files using the P2P algorithm. At the same time, an independent super node cluster is deployed in each AZ, and a scheduling server selects an appropriate super node for each Dfget so that cross-AZ or cross-Region traffic is avoided. More importantly, engineers implemented P2P transmission of the data stream based on Dragonfly's unique slice management P2P algorithm, reducing disk load. Thanks to Dragonfly, tens of thousands of instances can pull images or download files at the same time without increasing time costs or disk load.
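The locality preference and the slice-based transfer described above can be sketched roughly as follows. This is a toy model, not Dragonfly's scheduler: the `SuperNode` type, the `pick_super_node` and `split_into_pieces` helpers, and the load-based tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SuperNode:
    name: str
    region: str
    az: str
    load: int  # hypothetical metric: number of peers currently served


def pick_super_node(peer_region: str, peer_az: str,
                    nodes: Sequence[SuperNode]) -> SuperNode:
    """Prefer a super node in the peer's AZ, then in its Region, then any.

    Ties within the same locality tier go to the least-loaded node,
    and finally to the lexicographically smallest name for determinism.
    """
    if not nodes:
        raise ValueError("no super nodes available")

    def rank(node: SuperNode) -> Tuple[int, int, str]:
        if node.region == peer_region and node.az == peer_az:
            locality = 0  # same AZ: no cross-AZ traffic
        elif node.region == peer_region:
            locality = 1  # same Region, different AZ
        else:
            locality = 2  # cross-Region, last resort
        return (locality, node.load, node.name)

    return min(nodes, key=rank)


def split_into_pieces(size: int, piece_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) ranges that cover a file of `size` bytes."""
    if size < 0 or piece_size <= 0:
        raise ValueError("size must be >= 0 and piece_size > 0")
    return [(off, min(piece_size, size - off))
            for off in range(0, size, piece_size)]
```

Splitting a file into independent pieces is what lets a peer start serving data it has already received while it is still downloading the rest, so the upload work spreads across the cluster instead of concentrating on the registry.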
"Advanced technology is the first productivity. After the Kuaishou container cloud platform cooperates with Dragonfly and Nydus, the application delivery efficiency has been greatly improved, bringing more possibilities to business innovation." -- Sun Yin (Head of Container Cloud at Kuaishou Technology)
Image pulling is one of the most time-consuming steps in the container lifecycle. Engineers went on to enable Nydus image lazy loading to help accelerate image distribution and service startup. Many of Kuaishou's services have thousands of Pod instances, and some of them have images of 20 GB or larger. When these services are upgraded or expanded, the huge images and long pull times severely slow down service startup. Some services even package their training models into images, which can be disastrous for startup time, so Kuaishou needed a solution that significantly speeds up service startup.
Engineers learned about the Nydus project early on, thanks to Kuaishou's adoption of the Dragonfly project. Nydus is a powerful open-source file system solution for building an efficient image distribution system for cloud-native workloads (such as container images).
Thanks to the new image design of Nydus, each Pod can be started within a few seconds, which can save significant startup time for service deployment instances and enable applications to provide services to users as soon as possible. The work of supporting Nydus is not complicated for each cluster node. It can be successfully completed through a lossless switchover of the container engine (no pod eviction is required) and configuration changes.
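The idea behind lazy loading can be illustrated with a small sketch: instead of downloading the whole image before the container starts, data is fetched chunk by chunk only when something actually reads it. The `LazyBlob` class below is a simplified model for illustration only; it is not Nydus's on-disk format or API, and the chunk size and fetch callback are assumptions.

```python
from typing import Callable, Dict


class LazyBlob:
    """Read-only blob whose chunks are fetched on first access and cached."""

    def __init__(self, size: int, chunk_size: int,
                 fetch_chunk: Callable[[int], bytes]) -> None:
        if size < 0 or chunk_size <= 0:
            raise ValueError("size must be >= 0 and chunk_size > 0")
        self.size = size
        self.chunk_size = chunk_size
        self._fetch_chunk = fetch_chunk  # hypothetical remote fetch by chunk index
        self._cache: Dict[int, bytes] = {}
        self.fetched_chunks = 0  # how many chunks were actually downloaded

    def read(self, offset: int, length: int) -> bytes:
        """Return up to `length` bytes at `offset`, fetching only missing chunks."""
        if offset < 0 or length < 0:
            raise ValueError("offset and length must be >= 0")
        end = min(offset + length, self.size)
        if offset >= end:
            return b""
        first = offset // self.chunk_size
        last = (end - 1) // self.chunk_size
        parts = []
        for idx in range(first, last + 1):
            if idx not in self._cache:
                self._cache[idx] = self._fetch_chunk(idx)
                self.fetched_chunks += 1
            parts.append(self._cache[idx])
        data = b"".join(parts)
        start = offset - first * self.chunk_size
        return data[start:start + (end - offset)]


# Usage: a 1 KiB "image" served from memory in 100-byte chunks.
image = bytes(range(256)) * 4
blob = LazyBlob(len(image), 100, lambda i: image[i * 100:(i + 1) * 100])
header = blob.read(250, 20)  # touches a single chunk, not the whole image
```

In this model a container that reads only a small part of a large image at startup pays for just the chunks it touches, which is the intuition behind second-level Pod startup even for very large images.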
In practice, Harbor still plays a very important role as the global image repository center of the Kuaishou container cloud platform. Specifically, we have done the following:
Of course, all the changes above remain compatible with the existing OCI image format and existing system functions.
In summary, Dragonfly and Nydus provide the best solution for the Kuaishou container cloud platform to handle image distribution issues. The deployment time of tens of thousands of Kuaishou services has been significantly reduced, and it is easier for line-of-business R&D engineers to update services.
Both Dragonfly and Nydus are excellent open-source projects from CNCF. Kuaishou will continue investing in these projects and working closely with the community to make them more powerful and sustainable. Cloud-native technology is a revolution in infrastructure, especially when it comes to elasticity and serverless architecture. We believe Dragonfly will play an important role in the cloud-native ecosystem.
Cloud-Native SIG Home Page:
https://openanolis.cn/sig/cloud-native