By He Yiyu, nicknamed Busi at Alibaba. He Yiyu is a technical director for Yuque at Ant Financial. This article was compiled from He's presentation at the SEE Conference in 2019.
Originally developed as an internal collaboration tool, Yuque is a powerful professional-grade platform for the sharing, editing, and organization of files on the cloud. Roughly translated as "messenger sparrow," Yuque has already been used as the cloud knowledge base and virtual workspace of 100,000 employees at Alibaba Group.
As of this month, Yuque will go on the market as an enterprise-level work collaboration tool in China to be paired alongside Alibaba's mobile workspace application DingTalk. At the same time, this tool will also be opened up to charity organizations, startups, and public education institutions free of charge.
In this article, we're going to discuss how Yuque's technical architecture evolved over time, from an initial idea and prototype to an extensive internal collaboration service and, now, to being launched as an online service open to the public. We will also discuss the challenges faced by Yuque with the transition of its underlying systems to Serverless and full-stack JavaScript as well as the solutions that were proposed to tackle these challenges.
Yuque was first created in 2016, when Ant Financial needed a tool to host its documents. At that time, technical staff at Ant Financial used their spare time to build the documentation tool. In the early stage of the project, no dedicated personnel or resources were available. So, to verify the prototype quickly, the team chose the least costly technical solution. The underlying services were completely based on the BaaS services and container-hosting platform provided by the Technology Experience Department in Alibaba Group.
These services and platforms were built on Node.js and were dedicated to internal applications. Using them helped reduce overall research and development costs at Alibaba and gave engineers an environment more conducive to research and innovation. The application-layer server used the Node.js web framework Egg through Chair, Ant Financial's internal encapsulation of Egg. Egg was later made open source by the Technology Experience Department. The server was implemented as a single web application on this framework. The application-layer client used a React technology stack in combination with an internal version of antd. It also used CodeMirror to implement an online Markdown editor with superior functionality and an elegant experience.
This can be regarded as the "prototype stage" of Yuque. At that time, Yuque was merely a project created by engineers at Alibaba in their spare time. It relied on internal backend-as-a-service (BaaS) offerings and a series of open-source technical solutions dedicated to innovative applications. Later, the team verified the prototype of the online documentation tool.
As the team saw the potential of this online documentation tool continue to grow, the goal of Yuque evolved from simply providing a documentation tool for Ant Financial to building an internal solution that could replace competing products such as Confluence. From there, it went on to become an important knowledge management platform in Alibaba. Yuque is oriented towards technical innovators, team leaders, and knowledge-base creators. However, there were still hiccups. The major problem was that a Markdown editor alone was not enough for non-technical personnel to use Yuque efficiently.
Although many of us at Alibaba grew to love Markdown, we could not overlook the need for a rich text editor. Unlike Word and other rich text editors, we chose a more web-based route and added special functions such as formulas, text graphs, and mind maps to enhance it.
As our team continued to explore the growing and exciting world of knowledge management, a three-layer knowledge management structure, which consists of the team, its knowledge bases, and their documents, began to take shape in Yuque. On top of this, features such as collaboration, sharing, search, and message dynamics grew in complexity as Yuque grew. With this torrent of change, BaaS services were no longer a feasible backbone for Yuque. So, to cope with these challenges, we needed to make several adjustments.
Although BaaS services are easy to use and relatively cost-effective, their functions were insufficient to keep up with Yuque's rapid development. The underlying infrastructure also provided less than satisfactory stability. Acknowledging these faults, we replaced the BaaS architecture with Alibaba Cloud's internal Infrastructure as a Service (IaaS) services, including database, storage, caching, search, and other services.
Besides this, the web layer still used Node.js and the Egg framework. However, the business layer became a large-scale standalone application based on the practices of the Rails community. An ORM was introduced to build the data model layer and clarify the code hierarchy.
The frontend editor was migrated from CodeMirror to Slate. To better implement the functions of the Yuque editor, we forked Slate internally for more in-depth development. At the same time, we customized an independent content storage format to achieve more efficient data processing and better compatibility.
While providing service internally, Yuque evolved into a formal product, just like any other Ant Financial product. This product was polished and perfected during its use within Alibaba.
With the increasing internal influence of Yuque, some Alibaba alumni who had left the company began to ask Yubo: "Yuque is very useful. Have you ever considered releasing it as a product, so external companies can use it?" After less than 6 months of preparation and refactoring, Yuque became an official product of Alibaba in 2018.
When an application moves out of a single company and into a commercial environment, the technical challenges are soon magnified. With this transition, some of the core knowledge management functions became increasingly complex. And, with the addition of new formats such as tables and mind maps, the requirements of multi-person real-time collaboration posed an even greater challenge to our team. To better serve enterprise and individual users, the Yuque team also had to work hard to provide better enterprise and member services. As the business in China has continued to grow steadily, Yuque's commercial services have in turn come to require a higher level of quality, security, and stability.
To keep up with this rapid business development, the Yuque architecture had to evolve with it. To do this, we migrated all the underlying dependencies of Yuque to Alibaba Cloud from their original data center systems. As Alibaba's vendor of public cloud services, Alibaba Cloud not only provides several fundamental storage and computing capabilities, but also offers several richer and more advanced service offerings. At the same time, Alibaba Cloud's services come with a guarantee of an extremely high level of service availability and reliability.
Alibaba Cloud's wide range of basic cloud computing services has helped us ensure that Yuque's servers can select the storage, queue, search engine, and other basic services that are most suitable for day-to-day business operations. Moreover, the artificial intelligence and machine learning services of Alibaba Cloud have created several more possibilities for Yuque, including optical character recognition (OCR) for images and real-time translation, among other things. Ultimately, these functions became some of the unique features and assets that make Yuque different from its competitors.
At the application layer, Yuque servers still use large-scale Node.js web applications based on the Egg framework. However, as their functions increased, relatively independent services began to be decoupled from the primary service. These services can be divided into the following items:
As Yuque's rich text editor became increasingly complex, more and more problems occurred during Slate-based development. Finally, Yuque chose to develop different editor solutions in-house, each designed specifically for its purpose. Among them, we implemented a rich text editor based on the browser's contenteditable, a table editor based on canvas, and a mind map editor based on SVG.
In summary, the underlying services of Yuque are fully migrated to the cloud, and cloud services are leveraged to create unique Yuque features. Yuque also provides knowledge creation and management tools for enterprise users and individual knowledge workers.
On social media in China, a lot of people have a rather negative impression of full-stack JavaScript. They think of full-stack engineers as "jacks of all trades and masters of none". So, why did Yuque choose full-stack JavaScript?
At Yuque, we don't define developers who work in full-stack JavaScript as "full-stack engineers," but rather as versatile product engineers. They are the "technical partners" of products. Many of them feel a sense of ownership for the products, participate in product discussion and design with product managers, and provide technical suggestions for product design solutions. They independently complete full-stack development of product functions and track product performance after release.
They also are experts in a certain technical field. Many of them are server, testing, frontend development, or CSS experts. They can improve product R&D efficiency by optimizing their R&D toolchains through their own specialized knowledge.
At Yuque, the product R&D process conducted by product engineers is as follows. In the product design phase, product engineers participate in the discussion, which ends with a final design draft. Since product engineers participate in all preliminary discussions, there are no technical problems caused by a disconnect between the product design draft and the subsequent R&D process.
Next, we perform a documented system analysis and design process in Yuque. Asynchronous review is also initiated on Yuque. Then, experts from other fields review the major technical solutions to ensure that all technical difficulties are clearly identified and organized appropriately.
After a clear system design is formulated, the R&D phase begins. During this phase, automated testing must cover all code. A full-coverage unit test is required for all newly added code and the modified business logic, and an end-to-end test is also required for key functions. After all the code is written, automated testing is mandatory before code review.
Asynchronous code review starts after phased function development and testing are completed. Relevant business leaders and experts in certain fields are invited to review the code. In this stage, the code is reviewed for business logic correctness, security, and maintainability.
When publishing a product, we must ensure that phased release, emergency response, and monitoring are possible. This prevents the risk that function changes will cause problems for a large number of users.
With full-stack JavaScript, the team can complete product R&D more efficiently and with higher quality. In terms of code, a large amount of code can be reused. For example, the editor can be used on both web clients and the desktop. At the same time, many data processing capabilities can be used on the server.
Next, in terms of product R&D efficiency, full-stack R&D significantly reduces communication costs and is very efficient in Yuque's current stage. Full-stack JavaScript means that developers do not have to switch between languages or work out how lodash, moment, and other utility libraries used by the frontend would be replaced in another language. This significantly improves the efficiency of full-stack R&D.
Finally, full-stack R&D gives engineers the opportunity to fully participate in the entire product R&D process, allowing them to spontaneously come up with new optimization ideas and use technical means to improve the product. For example, the OCR image search function recently launched by Yuque was spontaneously developed by product engineers who did everything from preliminary technical research to product implementation.
When we talk about full-stack JavaScript, Node.js is a topic that cannot be avoided. As a server runtime that is highly integrated with the frontend, Node.js has become a driving force behind full-stack development. Yet is Node.js really suitable for large-scale commercial projects? Many people have their doubts.
However, with the development of the JavaScript language, many problems have already been solved. For example, the emergence of the async function allows developers to write asynchronous code in a synchronous style, which is more intuitive and easier to troubleshoot. At the same time, as the community continues to improve, a large number of high-quality tool modules and frameworks have emerged. Yuque servers are based on the Egg framework and have integrated a large number of modules and services required for web development. The programming model based on async functions is also simpler.
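To make the point about async functions concrete, here is a minimal sketch that expresses the same logic first as a promise chain and then as an async function. The helpers `loadDoc` and `loadAuthor` and their return values are hypothetical stand-ins for real I/O, not part of Yuque's code.

```javascript
// Hypothetical data-access helpers that simulate asynchronous I/O.
function loadDoc(id) {
  return Promise.resolve({ id, title: 'Architecture notes', authorId: 7 });
}
function loadAuthor(id) {
  return Promise.resolve({ id, name: 'busi' });
}

// Promise-chain style: the control flow is spread across nested callbacks.
function describeDocWithPromises(id) {
  return loadDoc(id).then((doc) =>
    loadAuthor(doc.authorId).then((author) => `${doc.title} by ${author.name}`)
  );
}

// Async function style: same behavior, but it reads top to bottom and
// errors can be handled with an ordinary try/catch.
async function describeDoc(id) {
  try {
    const doc = await loadDoc(id);
    const author = await loadAuthor(doc.authorId);
    return `${doc.title} by ${author.name}`;
  } catch (err) {
    throw new Error(`failed to describe doc ${id}: ${err.message}`);
  }
}
```

Both functions return a promise for the same string. The second version keeps each step on its own line, which is what makes stack traces and step-through debugging easier to follow.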
The emergence of TypeScript has also dispelled many people's doubts about large-scale JavaScript project development. Yuque, however, is a pure JavaScript project with zero TypeScript code, and it uses other measures to ensure code quality and maintainability. At the very beginning, Yuque determined the boundary between the core system and external systems. Through the hexagonal architecture, also known as the ports-and-adapters pattern, the pattern of interactions between the Yuque core system and external systems or users is fixed.
To be specific, input and output are determined by "ports". On the other hand, external systems use "adapters" to connect the system with the ports exposed by Yuque. As long as the implementation follows the definitions of ports, external systems can be replaced easily.
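The port and adapter relationship described above could be sketched roughly as follows. This is an illustration under stated assumptions rather than Yuque's actual code: the port name `SEARCH_PORT`, the `InMemorySearchAdapter` class, and the `DocService` class are all hypothetical.

```javascript
// A "port" is the contract the core depends on. Plain JavaScript has no
// interfaces, so this sketch checks the contract when the adapter is wired in.
const SEARCH_PORT = ['index', 'query'];

function assertImplements(adapter, methods) {
  for (const name of methods) {
    if (typeof adapter[name] !== 'function') {
      throw new TypeError(`adapter is missing port method "${name}"`);
    }
  }
  return adapter;
}

// Hypothetical in-memory adapter; a real one would wrap a search service.
class InMemorySearchAdapter {
  constructor() {
    this.docs = new Map();
  }
  index(doc) {
    this.docs.set(doc.id, doc);
  }
  query(keyword) {
    return [...this.docs.values()].filter((d) => d.title.includes(keyword));
  }
}

// Core logic only knows the port, never a concrete search product.
class DocService {
  constructor({ search }) {
    this.search = assertImplements(search, SEARCH_PORT);
  }
  publish(doc) {
    this.search.index(doc);
    return doc.id;
  }
  find(keyword) {
    return this.search.query(keyword).map((d) => d.id);
  }
}
```

Swapping the search backend then means writing another class with `index` and `query` and passing it to `DocService`; the core class itself does not change.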
In this model, the Controller is the HTTP adapter exposed to the user interface by Yuque. In the Controller, we verify and convert the format of user request parameters, check user permissions, and format the output.
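The three controller responsibilities could look something like the sketch below. It is deliberately framework-agnostic and does not use Egg's API: the request shape (`req.user`, `req.body`), the `docService` object, and the validation limits are assumptions made for illustration.

```javascript
// Hypothetical core service; in a real system this is the business layer.
const docService = {
  create(userId, { title, body }) {
    return { id: 101, title, body, ownerId: userId, internalFlag: 'x' };
  },
};

// Framework-agnostic controller: it only adapts HTTP-shaped input and output.
function createDocController(req) {
  // 1. Verify and convert request parameters.
  const title = typeof req.body.title === 'string' ? req.body.title.trim() : '';
  if (title.length === 0 || title.length > 200) {
    return { status: 422, body: { message: 'title must be 1-200 characters' } };
  }
  const body = String(req.body.body ?? '');

  // 2. Check permissions before touching the core.
  if (!req.user || !req.user.canWrite) {
    return { status: 403, body: { message: 'forbidden' } };
  }

  // 3. Call the core, then format the output (expose only public fields).
  const doc = docService.create(req.user.id, { title, body });
  return { status: 200, body: { data: { id: doc.id, title: doc.title } } };
}
```

Keeping validation, permission checks, and output formatting in the adapter means the core service can assume clean input and never needs to know about HTTP status codes.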
For each third-party platform or service that Yuque interacts with, we have defined a method, or more often a series of methods. Through adapters, different services in different environments are encapsulated behind a unified method so that each service can be called in one consistent way. During calls, call logs are also generated.
The data model layer provides a model for the data layer. For example, the metadata of the Doc model is stored in MySQL, whereas document body data is encrypted and stored in OSS. The core business logic of Yuque has no idea where the underlying storage is located. Furthermore, as long as Yuque uses standard SQL to interact with databases, the underlying data can be seamlessly migrated to databases that fully support SQL syntax, such as OceanBase. Any minor modifications that a migration requires can also be encapsulated at the model layer.
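One possible way to get uniform call logs from every adapter is to wrap the underlying client, as in the sketch below. The wrapper name `withCallLog`, the log record fields, and the `fakeWebhookClient` object are hypothetical; Yuque's real logging mechanism is not described in this article.

```javascript
// Wraps any third-party client so every call goes through one entry point
// and leaves a log record. `logger` is a hypothetical sink (an array here).
function withCallLog(serviceName, client, logger) {
  return new Proxy(client, {
    get(target, method) {
      const original = target[method];
      if (typeof original !== 'function') return original;
      return async (...args) => {
        const startedAt = Date.now();
        try {
          const result = await original.apply(target, args);
          logger.push({ serviceName, method, ok: true, ms: Date.now() - startedAt });
          return result;
        } catch (err) {
          logger.push({ serviceName, method, ok: false, error: err.message });
          throw err;
        }
      };
    },
  });
}

// Hypothetical client for an external notification service.
const fakeWebhookClient = {
  async send(message) {
    if (!message) throw new Error('empty message');
    return { delivered: true };
  },
};
```

Calling `withCallLog('webhook', fakeWebhookClient, log).send('doc published')` returns the client's result and appends one record to `log`, whether the call succeeds or fails.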
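A minimal sketch of a model that hides the storage split might look like this. Everything here is an assumption for illustration: two in-memory `Map` objects stand in for MySQL and OSS, the key is generated on the spot, and AES-256-GCM is simply one reasonable cipher choice, since the article does not say how Yuque encrypts document bodies.

```javascript
const { randomBytes, createCipheriv, createDecipheriv } = require('crypto');

// Hypothetical stand-ins: one Map for the relational table, one for object storage.
const metaTable = new Map();
const objectStore = new Map();
const KEY = randomBytes(32); // demo key only; real keys come from a key service

function encrypt(text) {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', KEY, iv);
  const data = Buffer.concat([cipher.update(text, 'utf8'), cipher.final()]);
  // Layout: 12-byte IV, 16-byte auth tag, then the ciphertext.
  return Buffer.concat([iv, cipher.getAuthTag(), data]);
}

function decrypt(blob) {
  const decipher = createDecipheriv('aes-256-gcm', KEY, blob.subarray(0, 12));
  decipher.setAuthTag(blob.subarray(12, 28));
  return Buffer.concat([decipher.update(blob.subarray(28)), decipher.final()]).toString('utf8');
}

// The model hides where each part of a document lives.
const Doc = {
  save({ id, title, body }) {
    metaTable.set(id, { id, title, bodyKey: `docs/${id}` });
    objectStore.set(`docs/${id}`, encrypt(body));
  },
  find(id) {
    const meta = metaTable.get(id);
    if (!meta) return null;
    return { id: meta.id, title: meta.title, body: decrypt(objectStore.get(meta.bodyKey)) };
  },
};
```

Callers only see `Doc.save` and `Doc.find`, so moving the metadata to another SQL database or the body to another object store would be a change inside the model, not in the business logic.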
Lastly, let's look at a document release example. When you call the HTTP interface to interact with Yuque, data is written to the storage, including MySQL and OSS, through the model layer, and the document cache is updated. Sending asynchronous messages to other systems triggers the DingTalk WebHook and synchronizes the data to the search engine. These interactions with external systems can be performed after the adapters are encapsulated, allowing each system to perform its functions such as parameter conversion, permission verification, and logging. This not only ensures that the core logic is concise, but also makes it easier to trace system call routes.
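The release flow above can be sketched as follows. All components are in-memory stand-ins chosen so the example runs on its own: `storage`, `cache`, and `bus` are hypothetical, and an `EventEmitter` plays the role of the asynchronous message channel.

```javascript
const { EventEmitter } = require('events');

// Hypothetical adapters, all in memory so the sketch runs on its own.
const storage = new Map();      // stands in for the model layer (MySQL + OSS)
const cache = new Map();        // stands in for the document cache
const bus = new EventEmitter(); // stands in for the asynchronous message channel
const delivered = [];           // records what downstream systems received

// Downstream consumers subscribe independently of the core publish logic.
bus.on('doc.published', (doc) => delivered.push(`webhook:${doc.id}`));
bus.on('doc.published', (doc) => delivered.push(`search:${doc.id}`));

async function publishDoc(doc) {
  storage.set(doc.id, doc); // 1. persist through the model layer
  cache.set(doc.id, doc);   // 2. refresh the document cache
  // 3. notify other systems after the request path has finished
  setImmediate(() => bus.emit('doc.published', doc));
  return { id: doc.id, status: 'published' };
}
```

The caller gets its response as soon as the write and cache update are done. The webhook and search-index consumers run afterwards, so a slow downstream system does not delay the publish request.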
When the system grows to a certain size, should we continue to add functions to a large standalone application or split it into microservices? The fact that both architectures remain in wide use shows that each has its own pros and cons. Your architecture selection should be determined based on your current business scale and team distribution. Following this logic, the technical architecture of Yuque became a hybrid architecture as our business evolved.
Yuque's primary service is a large Node.js service that integrates all the application business logic. In addition to the primary service, there are some other services in different formats.
Let's look at Mermaid rendering as an example. When a user enters Mermaid code in Yuque, Yuque calls a function deployed in Alibaba Cloud Function Compute, which runs Puppeteer to render the code into SVG and returns the result.
However, why are these tasks split out into Serverless functions? As mentioned earlier, Node.js is single-threaded and unsuitable for CPU-intensive tasks. With a serverless architecture, we can migrate tasks that carry security risks or consume a large amount of CPU resources to Function Compute.
In this way, such tasks run in a sandbox environment, so malicious code cannot put the primary service at risk. This approach also removes these CPU-intensive tasks from the primary service so that they do not block it during concurrent operations.
The pay-as-you-go billing method can significantly reduce costs because you do not have to deploy a resident service for low-frequency function scenarios. Therefore, we try our best to migrate such services to Serverless services, such as Alibaba Cloud Function Compute.
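From the primary service's point of view, offloading a rendering task might look like the sketch below. It does not use the Function Compute SDK: the injected `invoke(name, payload)` function, the function name `render-mermaid`, and the `{ svg }` response shape are all hypothetical placeholders.

```javascript
// `invoke` is an injected, hypothetical client for the function service.
function createRemoteRenderer(invoke, timeoutMs = 3000) {
  return async function render(source) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('render timed out')), timeoutMs);
    });
    try {
      // The primary service only awaits I/O; the heavy work runs elsewhere.
      const result = await Promise.race([invoke('render-mermaid', { source }), timeout]);
      return result.svg;
    } finally {
      clearTimeout(timer);
    }
  };
}

// Fake remote function used only to exercise the sketch.
const fakeInvoke = async (name, payload) => ({
  svg: `<svg data-src="${payload.source.length}"/>`,
});
```

Because the primary service only waits on a network call, its event loop stays free for other requests. The timeout is an added safeguard in this sketch so that a stuck function cannot hold a request open indefinitely.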
In addition to the programming language, other aspects need to be considered in any commercial system. Among them, the two most important aspects are security and stability.
Various security risks exist due to a system's dependencies on the frontend, servers, and underlying infrastructure:
There is no easy way to solve all these security problems. They can only be handled individually, but there are some basic principles that can be followed. Do not trust any user input:
Develop a standard coding paradigm to handle security risks, and pay special attention to the following during code review:
Yuque has been working with the security team since its commercialization, focusing on internal security awareness training, testing by the internal security team, internal red team versus blue team defense drills, and external white-hat penetration testing.
To ensure the stability of Yuque, we have done a lot of work on the frontend, servers, and cloud services. Just like security, stability is another long-term project that involves all aspects of the system. For Yuque, stability assurance is done in two main areas:
Then, how can we avoid unnecessary strong dependencies?
To give an example from Yuque, MySQL is a strong dependency that cannot be removed, which is not the case for the cache. However, at the beginning, Yuque sessions were stored in the cache. This meant that, if a Redis cluster failed, user data would not be obtained and users would not be able to log on. Therefore, the cache was a strong dependency.
To address this problem, we moved session storage to MySQL, so Redis became a weak dependency and the system could continue to function in the case of a Redis failure.
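The idea of treating the cache as a weak dependency could be sketched like this. The `db` and `cache` objects and their `get`/`set` methods are hypothetical interfaces, not real MySQL or Redis client APIs.

```javascript
// Hypothetical stores: `db` is the source of truth, `cache` may fail at any time.
function createSessionStore({ db, cache }) {
  return {
    async get(sessionId) {
      try {
        const hit = await cache.get(sessionId);
        if (hit) return hit;
      } catch (err) {
        // Cache outage: degrade to the database instead of failing the login.
      }
      const session = await db.get(sessionId);
      if (session) {
        // Best-effort backfill; a cache error must never surface to the user.
        Promise.resolve()
          .then(() => cache.set(sessionId, session))
          .catch(() => {});
      }
      return session ?? null;
    },
  };
}
```

With this shape, a cache failure costs an extra database read per request but does not stop users from logging on.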
Another example is the multi-person real-time collaborative editing feature recently launched by Yuque. Before this feature was launched, a document locking method was used to prevent multiple people from editing the same document at the same time. However, with the introduction of the multi-person real-time collaboration service, once the service fails, users cannot edit documents. This meant that this service was a strong dependency of the Yuque system. To solve this problem, when users fail to connect to the collaboration service, the system automatically fails over to the old lock mode. As such, the collaborative service becomes a weak dependency of Yuque.
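A rough sketch of the failover from real-time collaboration to lock mode is shown below. The `collab.connect` and `locks.acquire` interfaces and the in-memory lock table are assumptions for illustration only.

```javascript
// Hypothetical clients: `collab.connect` opens a real-time session,
// `locks.acquire` is the older exclusive-lock mechanism.
async function openEditor(docId, userId, { collab, locks }) {
  try {
    const session = await collab.connect(docId, userId);
    return { mode: 'realtime', session };
  } catch (err) {
    // Collaboration service unavailable: fall back to exclusive locking.
    const acquired = await locks.acquire(docId, userId);
    return { mode: 'locked', editable: acquired };
  }
}

// Minimal in-memory lock table for the fallback path.
function createLockTable() {
  const owners = new Map();
  return {
    async acquire(docId, userId) {
      if (!owners.has(docId)) owners.set(docId, userId);
      return owners.get(docId) === userId;
    },
  };
}
```

When the collaboration service is down, the first user to open a document can still edit it and later users see it as read-only, which matches the behavior before real-time collaboration was introduced.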
Over the past few years, the technology behind Yuque has been evolving, but it has always followed several principles. The technology stack must be selected to match the product development stage. Products have different technical requirements at different stages. Earlier stages have higher requirements for iteration efficiency. After reaching a commercial scale, products require better stability and performance. It is not necessary to use the most advanced technical solutions as soon as they are released, but we must consider them in combination with the product stage.
Next, the selection of the technology stack must also take the technical backgrounds of team members into account. Yuque chose full-stack JavaScript because most members of the Yuque team had JavaScript backgrounds. At the same time, Node.js is a preferred environment in Ant Financial, and comprehensive supporting facilities are available.
Finally, and most importantly, you must consider the security, stability, maintainability, and scalability of the platform. The technology you end up choosing is secondary to this. The language and services of your choice may change, but basic security awareness, stability awareness, and code maintainability are the fundamental factors that determine whether a project can survive over the long term.