Editor's Note: From October 11 to 14, 2017, The Computing Conference will be held once again in Hangzhou's Yunqi township (get your tickets now!). As one of the world's most influential technology expos, this conference will include brilliant lectures by many Alibaba Group experts and industry leaders. Starting today, the Yunqi Community will interview a series of conference guests.
Today, we interviewed Alibaba Cloud's virtualization platform director Dr. Zhang Xiantao. During the Computing Conference in October, Dr. Zhang will deliver a speech on the field of heterogeneous computing and its future prospects.
In the world of IT, heterogeneous computing is not some new buzzword.
Over the past 10 years, the computing industry has moved from 32-bit to x86-64, then to multi-core, and on to general-purpose GPU (GPGPU) heterogeneous computing architectures, most notably the CPU-GPU heterogeneous design in 2010. In recent years, the emergence of computing-intensive fields, such as artificial intelligence, high-performance data analytics, and financial analysis, has put heterogeneous computing under the spotlight.
Since traditional general-purpose computing can no longer satisfy our computing power needs, heterogeneous computing is now recognized as a key technology that can play a pivotal role in boosting computing power. To cope with such demands, Alibaba Cloud has developed several heterogeneous computing products and solutions. At the helm of Alibaba Cloud's heterogeneous computing team is Dr. Zhang Xiantao.
Dr. Zhang Xiantao, nicknamed Xuqing, received his PhD in information security from Wuhan University. Before joining Alibaba, he worked at Intel's Asia-Pacific R&D Center and was a primary contributor to multiple open-source projects, such as Xen and KVM. He served as a maintainer for the Xen/IOMMU and KVM/IA64 projects. He was also one of the main authors of and contributors to the Intel HAXM accelerator, for which he received the Intel Achievement Award.
In 2014, Dr. Zhang officially joined Alibaba as a senior expert. Currently, his main responsibilities at Alibaba include virtualization technology, high-performance computing products, heterogeneous computing products, and the technology and R&D teams for several innovative products.
In this interview, Dr. Zhang spoke about the pain points for companies using heterogeneous computing solutions and gave a deep dive into Alibaba Cloud's work in the field of balanced heterogeneous computing resources.
Opportunities and Challenges for Heterogeneous Computing
Heterogeneous computing refers to computing systems that are composed of computing elements with different instruction sets and system architectures. At present, the "CPU+GPU" and "CPU+FPGA" architectures are the most popular heterogeneous computing platforms in the industry. The primary advantage of heterogeneous computing is higher efficiency and lower latency in comparison with parallel computing using traditional CPUs. Given the industry's demand for ever-greater computing performance, heterogeneous computing is becoming increasingly important. Heterogeneous computing is the result of the collective efforts of the entire computing industry ecosystem. Chip companies make substantial investments in hardware research, development standards for heterogeneous programming are gradually becoming more mature, and mainstream cloud service providers are vigorously making strategic moves in the field. In time, it looks as if heterogeneous computing will replace traditional homogeneous computing.
Dr. Zhang stated that heterogeneous computing can meet the computing needs of computing-intensive fields, such as artificial intelligence, high-performance data analytics, and financial analysis. Heterogeneous computing will gradually replace general-purpose computing in areas where the general-purpose approach has no advantage.
However, despite all the hype, the difficulty of procuring, deploying, and using heterogeneous computing still puts it out of the reach of the vast majority of companies. On this topic, Dr. Zhang discussed the following pain points suffered by users:
1. High procurement costs: Users looking to make small purchases have almost no bargaining power. Particularly for FPGA boards, the unit cost of a small-quantity purchase can be very high.
2. Long delivery cycle: Average users will generally spend several months moving through requirement decision, model selection, hardware architecture design, supplier selection, data center selection, and financial approval.
3. No elasticity: The quantity of GPUs/FPGAs purchased completely determines a system's capacity. Therefore, GPU/FPGA resources may sit idle during low traffic, but the same quantity may be insufficient during peak hours.
4. No hardware dividend: After purchasing a certain model, it cannot be changed. Therefore, if a new GPU/FPGA architecture is released, users will have to make another purchase, because the performance of the old GPU/FPGA architecture will not be able to keep up with new applications.
5. Data islands: Offline GPUs/FPGAs cannot communicate with online services.
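The "no elasticity" point can be made concrete with a little arithmetic. The sketch below is illustrative only: the hourly demand figures and the helper name `fixed_fleet` are hypothetical and do not come from Alibaba Cloud. It counts how many GPU-hours a purchased, fixed-size fleet leaves idle and how many it fails to serve, compared with capacity that simply tracks demand.

```python
# Hypothetical demand curve: GPUs needed in each of eight consecutive hours.
hourly_demand = [2, 2, 3, 8, 16, 16, 9, 4]


def fixed_fleet(demand, owned):
    """Return (idle GPU-hours, unmet GPU-hours) for a fleet of fixed size."""
    idle = sum(max(owned - d, 0) for d in demand)
    unmet = sum(max(d - owned, 0) for d in demand)
    return idle, unmet


# A purchased fleet is provisioned for every hour, whether it is used or not.
owned = 8
idle, unmet = fixed_fleet(hourly_demand, owned)
provisioned_fixed = owned * len(hourly_demand)

# Idealized elastic capacity follows demand hour by hour.
provisioned_elastic = sum(hourly_demand)

print(f"fixed fleet of {owned}: {provisioned_fixed} GPU-hours provisioned, "
      f"{idle} idle, {unmet} unmet")
print(f"elastic capacity: {provisioned_elastic} GPU-hours provisioned")
```

Buying for the peak (16 GPUs in this made-up curve) removes the unmet demand but increases the idle hours, which is the trade-off the list describes.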
In addition, he said that the greatest challenge for FPGA products is the weak overall FPGA ecosystem: few clients are capable of FPGA development, particularly for computing acceleration. Therefore, Dr. Zhang and his team plan to build a cloud-based IP development marketplace, introduce a series of FPGA IP partners, and promote the creation of development standards for cloud-based FPGAs. With these efforts, they aim to enrich the overall FPGA development ecosystem and entice more IP developers and partners to release their IPs on the IP development marketplace in order to serve their end users.
Within a short timeframe, Alibaba Cloud successively introduced elastic GPU and FPGA heterogeneous computing solutions. They aim to lower the barriers to heterogeneous computing resources, so that companies that require high-performance computing can easily buy and use resources as needed.
Yunqi Community learned that Alibaba Cloud's elastic GPU product is primarily designed for scenarios such as artificial intelligence, data analytics, scientific computing, film rendering, video and image processing, and video transcoding. It has already been applied successfully in behavior data analysis, targeted advertising, video recognition, image recognition, and object classification. Alibaba Cloud's elastic FPGA product is primarily oriented to artificial intelligence, semiconductor design, genetic computing, video image processing, data analytics decision making, and other such scenarios. It has already been applied successfully in deep learning inference, deep learning model pruning, non-regular data computing, video and image processing, and hardware semiconductor design.
Alibaba Cloud's Journey in Heterogeneous Computing
It is widely recognized that, compared with CPUs, GPUs and FPGAs have many advantages. GPUs offer better parallel performance, higher maximum computing power for each machine, and more efficient computing. The main advantages of FPGAs are their higher per-watt performance, better performance for non-regular data computing, improved hardware acceleration performance, and lower inter-device latency.
When used in cloud solutions, the advantages of these technologies are even more apparent. Dr. Zhang explained that Alibaba Cloud's GPU and FPGA heterogeneous computing solutions possess the following features:
1. GPU/FPGA resources are ready for use upon purchase and can be elastically scaled.
2. The ultra-large scale resource pool can meet the need for greater numbers of GPU/FPGA resources during business peaks.
3. Users enjoy heterogeneous computing hardware dividends, with performance growth in excess of Moore's Law. They can use GPU/FPGA instances with improved performance at the same price.
4. Alibaba Cloud offers the most comprehensive line of heterogeneous products, meeting the needs of scenarios such as artificial intelligence training and inference, image processing, and video processing.
5. Product integration: These products are deeply integrated into the overall Alibaba Cloud product system, allowing data to flow between products.
These features directly address the difficulties users encounter when using heterogeneous computing solutions. Dr. Zhang also revealed that, for model training on a single machine, most customers require several weeks to a month. To save the time spent on training, Alibaba Cloud plans to release an ultra-high performance heterogeneous cluster product.
"This product directly links GPUs and FPGAs using the RDMA protocol over 25/100 Gb RoCE (RDMA over Converged Ethernet). It can adopt a multi-host, multi-chip method to train a model using a cluster of a great number of GPU/FPGA devices. This approach significantly reduces the required training time from several weeks or a month to a day or even a few hours."
It is worth noting that Alibaba Cloud's heterogeneous computing solutions also provide a more friendly user experience for developers.
For GPU programming, Alibaba Cloud will release a distributed multi-host, multi-chip training framework and other GPU performance optimization services. These will significantly lower the barrier for clients to adopt a multi-host, multi-chip approach, and thereby reduce their cloud-based deep learning model training time.
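To give a feel for what multi-host, multi-chip training involves, here is a minimal sketch of synchronous data-parallel training in plain Python. It is not Alibaba Cloud's framework, whose design the article does not describe; the toy model (fitting y = w * x), the data, and the function names `shard` and `data_parallel_step` are all assumptions made for illustration. Each simulated worker computes a gradient on its own shard of the data, and the gradients are then averaged, which is the step a real cluster would perform over its interconnect.

```python
# Hypothetical toy problem: fit y = w * x on eight samples (true weight is 2.0).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]


def gradient(w, xs, ys):
    """Mean gradient of the squared error (w*x - y)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def shard(data, workers):
    """Split data into equal contiguous shards, one per worker."""
    size = len(data) // workers
    return [data[i * size:(i + 1) * size] for i in range(workers)]


def data_parallel_step(w, xs, ys, workers, lr):
    """One synchronous step: local gradients per worker, then an average."""
    local_grads = [
        gradient(w, shard_x, shard_y)
        for shard_x, shard_y in zip(shard(xs, workers), shard(ys, workers))
    ]
    averaged = sum(local_grads) / workers  # stands in for an all-reduce
    return w - lr * averaged


w = 0.0
for _ in range(50):
    w = data_parallel_step(w, xs, ys, workers=4, lr=0.01)
print(f"learned weight: {w:.6f}")
```

With equal-sized shards, the averaged gradient equals the gradient over the full batch, so the workers jointly take the same step a single machine would, while each one processes only a fraction of the data.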
As for FPGAs, Alibaba Cloud will establish an IP development marketplace and encourage a series of FPGA IP partners to release their proprietary IP product series. With this prosperous IP marketplace, Alibaba hopes to give more end users the opportunity to enjoy FPGA performance acceleration.
In addition, Alibaba Cloud has already released IaaS+ services, including an E-HPC product for heterogeneous cluster resource scheduling, account management, and elastic scaling. Using Alibaba Cloud's Container Service, users can enjoy one-click deployment, distributed training, and elastic scaling, and use XDL to analyze behavior data. Finally, users can take advantage of Alibaba Cloud's self-developed GPU assembler to optimize and enhance application performance, increase the utilization of heterogeneous computing equipment, and reduce resource procurement costs.
A Future Dominated by a Trio of GPU, FPGA, and ASIC
Artificial intelligence and other emerging application fields demand computing capacity that already exceeds the growth of CPU power as described by Moore's Law. Moreover, the rate of performance growth for heterogeneous computing can satisfy the needs of these emerging trends. It is likely that, in the future, heterogeneous computing will occupy an increasing share of data center resources.
At the macro level, the development of heterogeneous computing benefits from the driving force of national strategies. For instance, China has recently issued an artificial intelligence development plan, elevating AI to the level of national strategy. This is bound to stimulate demand for heterogeneous computing. Naturally, Dr. Zhang also admitted that, although the demand for heterogeneous computing is growing, there will always be a need for general-purpose computing as well. These two computing structures will coexist over the long term.
Without doubt, GPU processors from the heterogeneous computing field have already attained mainstream status. However, as regards future trends, Dr. Zhang stated that, "as the FPGA ecosystem takes shape and continues to develop, and ASIC chip technology gradually matures, the world of heterogeneous computing will present a tripartite division among the GPU, FPGA, and ASIC chip technologies. These technologies will build on their respective unique advantages and applications, as well as their very own customer base."
This trend is also of great interest to Dr. Zhang and his team. In the future, the team will release 8-card/16-card GPU products, next-gen Volta architecture GPU products, and a new generation of FPGA products. In addition, R&D is underway for cloud-based ASIC chip products.
At present, Dr. Zhang's team has two main goals. First, they are committed to transforming heterogeneous computing into a computing resource users can easily purchase and use. To this end, they seek to provide the most comprehensive range of heterogeneous computing products and solutions. Second, they are committed to giving users the means to make good use of heterogeneous resources. In this way, users can take full advantage of the processing capabilities of these resources to make their products more competitive. The team is promoting the transformation of heterogeneous computing into a universal computing capability.
Computing Conference Highlights
During the Hangzhou Computing Conference, there will be special forums on heterogeneous computing/high-performance computing and virtualization technology. At both of these forums, Dr. Zhang will deliver keynote speeches. Prior to the formal start of the conference, he revealed important news to the Yunqi Community: Alibaba Cloud will launch several important heterogeneous computing products. These products will involve heterogeneous computing, general-purpose computing, high-performance computing, and other fields. He explained that these products were designed to solve the difficulties users face when using Alibaba Cloud, including cluster management and scheduling problems, license problems when elastically using paid software on the cloud, instances that need both the flexibility of virtual hosts and the performance of physical hosts, and the use of multi-host, multi-card distributed training for shorter training times.
The following is the transcript of our interview with Dr. Zhang:
Yunqi: Heterogeneous computing can provide more efficient and lower latency computing performance compared with parallel computing using traditional CPUs. Does this mean that it will replace CPU computing? What do you think about the future trends of these two technologies?
Dr. Zhang: The demand for general-purpose computing will continue to exist alongside the demand for heterogeneous computing. General-purpose computing will not be completely replaced. However, as heterogeneous computing can better meet the computing needs of emerging computing-intensive fields, such as artificial intelligence, high-performance data analytics, and financial analysis, heterogeneous computing will gradually replace general-purpose computing in areas where the traditional approach does not have an edge. Going with the tide, Alibaba Cloud released elastic GPU and FPGA heterogeneous computing solutions, so as to better meet the increasing heterogeneous computing demands of artificial intelligence, data analytics, and business intelligence. These products allow customers to easily purchase and use the resources they need. In this way, heterogeneous computing will no longer be an extremely expensive resource, but become a universal basic computing resource. This will promote the development of artificial intelligence and other such industries.
Yunqi: In January of this year, Alibaba Cloud released elastic GPU and FPGA heterogeneous computing solutions. What application scenarios are these solutions designed for? At present, which fields have seen successful applications of these solutions?
Dr. Zhang: First, compared with CPUs, GPUs offer better parallel performance, higher maximum computing power for each machine, and more efficient computing. Alibaba Cloud's elastic GPU product is primarily designed for scenarios such as artificial intelligence, data analytics, scientific computing, film rendering, video and image processing, and video transcoding. It has already been applied successfully in behavior data analysis, targeted advertising, video recognition, image recognition, and object classification.
Second, the main advantages of FPGAs are their higher per-watt performance, better performance for non-regular data computing, higher hardware acceleration performance, and lower inter-device latency. Alibaba Cloud's elastic FPGA product is primarily oriented to artificial intelligence, semiconductor design, genetic computing, video image processing, data analytics decision making, and other such scenarios. It has already been applied successfully in deep learning inference, deep learning model pruning, non-regular data computing, video and image processing, and hardware semiconductor design.
In addition, for model training on a single machine, most clients require several weeks to a month. To help these customers, we plan to release an ultra-high performance heterogeneous cluster product. This product directly links GPUs and FPGAs using the RDMA protocol over 25/100 Gb RoCE (RDMA over Converged Ethernet). It can adopt a multi-host, multi-chip method to train a model using a cluster of a great number of GPU/FPGA devices. This approach significantly reduces the required training time from several weeks or a month to a day or even a few hours.
Yunqi: Heterogeneous computing solutions have distinct advantages, but they are still in the development stage. What are the greatest challenges heterogeneous computing modes face at present?
Dr. Zhang: Currently, the greatest pain points suffered by users who have already purchased heterogeneous computing products include:
(1) High procurement costs: Users looking to make small purchases have almost no bargaining power. Particularly for FPGA boards, the unit costs for purchase in small quantities can be very high.
(2) Long delivery cycle: Average users will generally spend several months to move from requirement decision, model selection, hardware architecture design, supplier selection, data center selection, to financial approval.
(3) No elasticity: The quantity of GPUs/FPGAs purchased completely determines a system's capacity. Therefore, when there are few computing tasks, GPU/FPGA resources are wasted, but the same quantity may be insufficient during peak hours.
(4) No hardware dividend: After purchasing a certain model, it cannot be changed. Therefore, if a new GPU/FPGA architecture is released, users will have to make another purchase because the performance of the old GPU/FPGA architecture will not be able to keep up with new applications.
(5) Data islands: Offline GPUs/FPGAs cannot communicate with online services.
To resolve these problems, Alibaba Cloud released elastic heterogeneous computing solutions that offer the following features:
(1) GPU/FPGA resources are ready for use upon purchase and can be elastically scaled.
(2) The ultra-large scale resource pool can meet the need for greater numbers of GPU/FPGA resources during business peaks.
(3) Users enjoy heterogeneous computing hardware dividends, with performance growth in excess of Moore's Law. They can use GPU/FPGA instances with improved performance at the same price.
(4) Alibaba Cloud offers the most comprehensive line of heterogeneous products, meeting the needs of scenarios such as artificial intelligence training and inference, image processing, and video processing.
(5) Product integration: These products are deeply integrated into the overall Alibaba Cloud product system, allowing data to flow between products.
In addition, the biggest challenge for FPGA products is the weak overall FPGA ecosystem: few customers are capable of FPGA development, particularly for computing acceleration. Therefore, we plan to build a cloud-based IP development marketplace, introduce a series of FPGA IP partners, and promote the creation of development standards for cloud-based FPGAs. With these efforts, we aim to enrich the overall FPGA development ecosystem and entice more IP developers and partners to release their IPs on the IP development marketplace in order to serve their end users.
Yunqi: For developers, heterogeneous computing programming is difficult and the development costs are high. How does Alibaba Cloud address this?
Dr. Zhang: Concerning GPU programming, Alibaba Cloud will release a distributed multi-host, multi-chip training framework and other GPU performance optimization services. These will significantly reduce the difficulty of adopting a multi-host, multi-chip approach, and thereby reduce customers' cloud-based deep learning model training time. As for FPGAs, Alibaba Cloud will establish an IP development marketplace and encourage a series of FPGA IP partners to release their proprietary IP product series. With this prosperous IP marketplace, Alibaba hopes to give more end users the opportunity to enjoy FPGA performance acceleration. In addition, Alibaba Cloud has already released IaaS+ services, including an E-HPC product for heterogeneous cluster resource scheduling, account management, and elastic scaling. Using Alibaba Cloud's Container Service, users can enjoy one-click deployment, distributed training, and elastic scaling, and use XDL to analyze behavior data. Finally, users can take advantage of Alibaba Cloud's self-developed GPU assembler to optimize and enhance application performance, increase the utilization of heterogeneous computing equipment, and reduce resource procurement costs.
Yunqi: Can you share some of your thoughts about heterogeneous computing? What valuable experience have you gained at work?
Dr. Zhang: Artificial intelligence and other emerging application fields demand computing capacity that already exceeds the growth of CPU power as described by Moore's Law. Moreover, the rate of performance growth for heterogeneous computing can satisfy the needs of these emerging trends. It is likely that, in the future, heterogeneous computing will occupy an increasing share of data center resources. China has recently issued an artificial intelligence development plan, elevating AI to the level of national strategy. In the future, this will promote comprehensive upgrades to national industries and social progress based on artificial intelligence. Naturally, such activities will involve heterogeneous computing, as this technology is essential to artificial intelligence. In our work, we have two main goals. First, we are committed to transforming heterogeneous computing into a computing resource users can easily purchase and use. To this end, we want to provide the most comprehensive range of heterogeneous computing products and solutions. Second, we are committed to giving users the means to make good use of heterogeneous resources. This way, users can take full advantage of the processing capabilities of these resources to make their products more competitive. We want to promote the transformation of heterogeneous computing into a universal computing capability. In this way, we will also spur the development of artificial intelligence, thereby driving industrial upgrades and social progress. In the end, this will change how we make things and live our lives.
Yunqi: In your opinion, what new changes will the heterogeneous computing field embrace in the future?
Dr. Zhang: GPU processors are the mainstream in the heterogeneous computing field. Going forward, as the FPGA ecosystem takes shape and continues to grow, and ASIC chip technology gradually matures, the world of heterogeneous computing will present a tripartite division among GPU, FPGA, and ASIC chip technologies. These technologies have their respective unique advantages and applications, as well as their very own customer base. In the future, Alibaba Cloud will release more products to expand the heterogeneous computing product family. These will include 8-card/16-card GPU products, next-gen Volta architecture GPU products, and a new generation of FPGA products. In addition, R&D is underway for cloud-based ASIC chip products.
Yunqi: What do you want to share with attendees during this Computing Conference? Can you give us a preview of the topics you will discuss and tell us why you chose them?
Dr. Zhang: During this Computing Conference, we will launch several important products, in the fields of heterogeneous computing, general-purpose computing, and high-performance computing, among others. These products are designed to give users a better experience by resolving some of their challenges, including cluster management and scheduling problems, license problems when elastically using paid software on the cloud, instances that need both the flexibility of virtual hosts and the performance of physical hosts, and the use of multi-host, multi-card distributed training for shorter training times. I hope those interested can pay close attention to the special forums on heterogeneous computing, virtualization technology, and elastic computing to be held during the Computing Conference.