This post describes how imgcook recognizes design components and expresses them with front-end components for intelligent code generation.
By Suchuan from F(x) Team of Taobao Technology Department, Alibaba Group
The automatic code generation of imgcook includes two steps: recognizing information from visuals and generating code based on the information.
Essentially, imgcook uses design plug-ins to extract JavaScript Object Notation (JSON) descriptions from designs. It then processes and converts this JSON information with intelligent restoration technologies, such as rule systems, computer vision, and machine learning. After that, the JSON that conforms to the code structure and semantics is converted into front-end code by a domain-specific language (DSL) converter. The DSL converter is a JS function: it takes the JSON information as input and outputs the required code.
For example, the React DSL outputs React code that complies with the React development specifications. The core of this process is the JSON-to-JSON conversion; for the JSON formats, see the imgcook schema.
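To make the idea concrete, the following is a minimal sketch of such a converter, not the actual imgcook DSL implementation: it assumes a simplified node shape of { componentName, props, children } and only emits JSX-like markup, whereas the real converter also emits imports, state, and event bindings.
// A minimal sketch of a DSL converter: a JS function that takes a JSON schema
// and returns a code string. The node shape and output format are simplified assumptions.
function renderNode(node, indent = '') {
  const props = Object.entries(node.props || {})
    .map(([key, value]) => ` ${key}=${JSON.stringify(value)}`)
    .join('');
  const children = (node.children || [])
    .map((child) => renderNode(child, indent + '  '))
    .join('\n');
  return children
    ? `${indent}<${node.componentName}${props}>\n${children}\n${indent}</${node.componentName}>`
    : `${indent}<${node.componentName}${props} />`;
}

function reactDsl(schema) {
  // Only the JSX body is generated in this sketch.
  return renderNode(schema);
}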
The design only contains meta information such as images and texts, and the location information consists of absolute coordinates. The code generated directly from the design is composed of element-level tags, such as div, img, and span, or View, Image, and Text. However, in actual development, elements in a UI are combined into components: basic components such as search boxes and buttons, components with business attributes such as timers, coupons, videos, and carousels, and UI blocks of greater granularity.
To generate code at component granularity, the components in the visual draft need to be recognized and converted into the corresponding componentized code. For example, the area that displays the rice cooker in the following visual contains a video, but only image information can be extracted from the visual. Therefore, the following code is generated.
The actually generated code needs to be expressed with a Rax component called rax-video, as shown below.
import { createElement, useState, useEffect, memo } from 'rax';
import View from 'rax-view';
import Picture from 'rax-picture';
import Text from 'rax-text';
import Video from 'rax-video';
<View className="side">
  ...
  <Video
    className="group"
    autoPlay={true}
    src="//cloud.video.taobao.com/play/u/2979107860/p/1/e/6/t/1/272458092675.mp4"
  />
  ...
</View>
To achieve this, two things need to be done: recognizing the components in the visual draft, and expressing the recognized components as front-end component code.
According to the levels of intelligent capability: at Level 1, component code can be generated based on manually labeled design protocols. At Level 2, it can be generated by using rule-based algorithms to analyze element styles and recognize components. At Level 3, an object detection model can be used to recognize components, but the object detection solution cannot avoid the low model accuracy caused by complex designs. After switching to an image classification solution, the model accuracy is relatively high even for complex designs in specific business domains. Currently, at Level 4, business access costs have been reduced by optimizing the algorithm engineering procedures.
Component Recognition Capability Model
Component names are labeled directly in the designs, and the manually set component information on each layer is obtained by parsing these labels when the imgcook plug-in exports the JSON description data.
(Componentized Code Generation Based on a Manually Set Component Protocol)
This solution requires manually labeling the visuals with component names and properties. A page may contain many components, so this solution requires a lot of additional work from developers. Therefore, it is desirable to automatically and intelligently recognize the parts of the UI in the visuals that should be componentized.
Some components with common style features can be detected automatically by rule-based algorithms. For example, a node with four rounded corners whose width is greater than its height can be considered a button. However, the generalization capability of rule-based judgment is very poor and cannot cope with complex and diverse visual presentations.
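As a minimal illustration of this kind of rule (the rect and style fields and the comparison itself are assumptions for the sketch, not imgcook's actual rule set):
// A minimal sketch of a rule-based check on a layout node.
function looksLikeButton(node) {
  const { width = 0, height = 0 } = node.rect || {};
  const radius = parseInt((node.style && node.style.borderRadius) || 0, 10);
  // Rounded corners and wider than tall: likely a button.
  return radius > 0 && width > height;
}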
Deep learning technology is well suited to solving several problems of this kind: finding the elements in a visual that need to be componentized, and determining which component an element belongs to and where it sits in the DOM tree or in the design. A deep learning model can take a large amount of sample data, learn and summarize the experience, and predict the category of similar samples. It has strong generalization capability.
The article How Do You Use Deep Learning to Identify UI Components? defines the problem above as an object detection problem: deep learning is used to detect, in the UI image, the category and bounding box of each region that can be componentized in the code. However, that article mainly introduces how deep learning solves problems, taking component recognition in the UI as an example, and does not consider the actual application. The following part of this article focuses on solutions for design-to-code (D2C) componentized coding and shares how the component recognition capability is applied in real projects.
It is difficult to collect samples from all users and provide a universal component recognition model with high accuracy. In addition, the component categories and styles used by different teams vary greatly. There may be samples of the same category with very different UIs, or samples of different categories with very similar UIs, resulting in poor recognition results. Therefore, users need to create a training set based on their own components and train a specific component recognition model. This section describes some component recognition schemes, taking components commonly used in Taobao marketing as examples.
The article How Do You Use Deep Learning to Identify UI Components? provides a detailed description of object detection. An object detection model is trained so that, given an image of the visual draft as input, it recognizes the components in the image.
Training and Prediction Process of Object Detection Model
As shown in the preceding figure, training an object detection model requires a large number of input samples. Each sample is an entire image of the visual draft, and the components the model should recognize must be labeled in the image. When a new design needs to be recognized, its image is input into the model and the recognition result is obtained.
However, object detection still has the following problems.
In the intelligent code generation scenario of imgcook, the recognition results need to be accurate at the DOM node level. With the object detection scheme, both the location and the category must be recognized correctly. The accuracy of the model in offline experiments is low, so the accuracy in online applications will also be relatively low, and the scheme cannot determine which DOM node a detected region corresponds to.
The JSON description of the image can be obtained from the design, and every text node and image node already carries location information. After intelligent restoration by imgcook, a relatively reasonable layout tree can be generated. Therefore, based on this layout tree, possible component nodes can be cropped at the container node level.
Training and Prediction Process of Image Classification Model
All the div and view nodes here can be cropped to obtain a small collection of images, and these images are then sent to an image classification model for prediction. In this way, the object detection problem becomes an image classification problem.
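As a rough sketch of the first step, assuming a generic { componentName, children } node shape rather than the exact imgcook schema, collecting the candidate container nodes is a simple tree walk; the cropping itself would then use each node's absolute coordinates.
// A minimal sketch: walk the restored layout tree and collect the container nodes
// (div / view) whose regions will be cropped and sent to the classifier.
function collectCandidateNodes(node, result = []) {
  const name = (node.componentName || '').toLowerCase();
  if (name === 'div' || name === 'view') {
    result.push(node); // later cropped by its absolute coordinates
  }
  (node.children || []).forEach((child) => collectCandidateNodes(child, result));
  return result;
}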
The model assigns each image a probability value for each category; the higher the probability value of a category, the more likely the image belongs to that category. For example, with a confidence threshold of 0.7, a prediction is accepted as the final classification result only when its probability value is greater than 0.7. In the preceding figure, the recognition results of only two images are trusted. If the classification accuracy requirement is higher, the confidence threshold can be raised accordingly.
Compared with object detection, image classification allows samples to be generated automatically by programs without manual labeling. The model only needs to recognize categories; if the category is correct, the location information is necessarily correct. Therefore, recognition accuracy is greatly improved by the image classification scheme based on the layout recognition results.
After the layout is generated by the layout algorithm, the JSON schema enters the component classification and recognition layer. The component recognition result is written back into the JSON schema and then passed to the next layer.
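A minimal sketch of applying such a threshold, assuming a hypothetical classifier response of { category, probability } scores per image:
// A minimal sketch: accept a prediction only when its top score passes the threshold.
const CONFIDENCE_THRESHOLD = 0.7;

function pickTrustedResult(predictions) {
  // predictions: e.g. [{ category: 'videobtn', probability: 0.92 }, ...]
  const top = predictions.reduce((a, b) => (a.probability > b.probability ? a : b));
  return top.probability > CONFIDENCE_THRESHOLD ? top : null;
}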
Position of Component Classification and Recognition in the Technology Layers
The recognition results are intuitively shown in the following figure; they are mounted on the smart property of the node.
Component Classification Results
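For reference, a recognized node might carry a smart field roughly like the fragment below. The values are illustrative, but the path smart.layerProtocol.component.type is the one read by the recognition function later in this article.
// Illustrative fragment of a node in the D2C schema after component recognition.
const recognizedNode = {
  componentName: 'Div',
  props: {},
  smart: {
    layerProtocol: {
      component: {
        type: 'videobtn' // predicted component category
      }
    }
  },
  children: []
};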
The figure Training and Prediction Process of Image Classification Model shows that, for images cropped according to the layout structure and recognized by component classification, multiple nodes may be recognized as the videobtn category.
Based on the component recognition results, several problems arise in finding the node that needs to be replaced with the video component.
Eventually, the component recognition results need to be applied in the engineering process and support users' personalized component requirements. To solve these problems, an open intelligent material system needs to be provided to support the configuration, recognition, intervention, rendering, and code generation of components.
The entire application process of component recognition is as follows. After configuring the component library, users configure the model service for component recognition. In the component recognition phase of visual restoration, the model service is called to recognize the components. When entering the business logic generation phase, the configured logical base is called to express the component recognition result (the smart field) as a component (componentName), and information about component properties available in the visuals is used to fill in the component properties. Finally, rendering is performed in the canvas, which requires pre-configured canvas resources that support component rendering.
Application Process of Component Recognition
The following section describes in detail how the business logical base takes over the application and expression of the component recognition results during the business logic generation phase, and how the canvas supports component rendering to visually present the recognition results.
One of the core features of the logical base is that users can define recognition functions and expression functions, which are called for each node during the business logic generation phase. A recognition function determines whether the current node is the desired node; if it is, the corresponding expression logic is executed.
For example, the component recognition result is placed on the smart field of the D2C schema protocol, so users can define a recognition function to determine whether the current node has been recognized as a component. The difficulty here is that multiple nodes may be recognized as components, so the nodes that should ultimately be expressed as components must be determined accurately: some nodes are recognized by mistake, and for nodes that are recognized correctly, the componentName is not modified on every such node directly. Instead, the right node has to be found.
Multiple recognition results exist for the videobtn category, which marks the video duration display. Based on these results, the node that should be replaced with the front-end component VideoBtn for displaying the video duration must be found, and its componentName must be changed to VideoBtn. The component name is associated with the component category videobtn and the label given to the component when it was entered into the material system; in other words, the component category must be provided when the component is configured for recognition.
Therefore, some filtering rules need to be added when defining the recognition function. For example, if multiple nodes with nested relationships are recognized as videobtn, only the innermost node is taken as the recognition result.
/*
 * allSchema: raw schema data
 * ctx: context
 * Execution time: executed once per node. If the return value is true, the recognition succeeds and the expression logic is executed.
 */
async function recognize(allSchema, ctx) {
  // ctx.curSchema - schema of the currently processed node
  // ctx.activeKey - key of the currently processed node

  // Determine whether a node is recognized as a videobtn node
  const isVideoBtnComp = (node) => {
    return _.get(node, 'smart.layerProtocol.component.type', '') === 'videobtn';
  };

  // Determine whether any descendant node is recognized as a videobtn node
  const isChildVideoBtnComp = (node) => {
    if (node.children) {
      for (var i = 0; i < node.children.length; i++) {
        if (isVideoBtnComp(node.children[i]) || isChildVideoBtnComp(node.children[i])) {
          return true;
        }
      }
    }
    return false;
  };

  // The current node is the videobtn node users need only if it is recognized as videobtn
  // while none of its descendants is; return true so it enters the expression function.
  const isMatchVideoBtn = isVideoBtnComp(ctx.curSchema) && !isChildVideoBtnComp(ctx.curSchema);
  return isMatchVideoBtn;
}
Then, customize an expression function. If the recognition function returns true for a node, the corresponding expression function is executed. In the following code, the custom expression function changes the componentName to VideoBtn and extracts the time text as the value of the VideoBtn data property.
/*
 * json: raw schema data
 * ctx: context
 */
async function logic(json, ctx) {
  // Extract the time text from the child Text node; fall back to "00:00"
  const getTime = (node) => {
    for (var i = 0; i < node.children.length; i++) {
      if (_.get(node.children[i], 'componentName', '') === 'Text') {
        return _.get(node.children[i], 'props.text', '');
      }
    }
    return "00:00";
  };
  // Set the node name to VideoBtn, the name of component @ali/pcom-imgcook-video-58096
  _.set(ctx.curSchema, 'componentName', 'VideoBtn');
  // Obtain the time as the component property value
  const time = getTime(ctx.curSchema);
  // Set the obtained time as the value of the data property of VideoBtn
  _.set(ctx.curSchema, 'props.data', { time: time });
  // Delete the child nodes of VideoBtn
  ctx.curSchema.children = [];
  return json;
}
After the component recognition results are expressed at the business logic layer, the componentized schema is obtained, and the final componentized code can be generated.
The above is an example of recognizing the position of videobtn in a visual based on the component classification model and generating code with the front-end component @ali/pcom-imgcook-video-58096.
Users may also want to replace the commodity image with a video after the videobtn category is recognized in the visual, for example, generating code with rax-video. They can add a custom expression function to find the image node at the same level as the videobtn node and replace it with rax-video.
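Based on the expression function above, the generated componentized code for this node would look roughly like the following sketch; whether VideoBtn is the default export of the package and the exact time value are illustrative assumptions.
import { createElement } from 'rax';
// The component package configured in the material system (export style assumed)
import VideoBtn from '@ali/pcom-imgcook-video-58096';

// The expression function set componentName to VideoBtn and props.data to { time }
<VideoBtn data={{ time: "00:15" }} />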
/*
 * json: raw schema data
 * ctx: context
 */
async function logic(json, ctx) {
  // Find the sibling image node (Picture) of the current videobtn node
  const getBrotherImageNode = (node) => {
    const pKey = node.__ctx.parentKey;
    const parentNode = ctx.schemaMap[pKey];
    for (var i = 0; i < parentNode.children.length; i++) {
      if (parentNode.children[i].componentName === 'Picture') {
        return parentNode.children[i];
      }
    }
  };
  const videoNode = getBrotherImageNode(ctx.curSchema);
  if (videoNode) {
    // Replace the Picture node with a Video node and reuse the original image as the poster
    _.set(videoNode, 'componentName', 'Video');
    _.set(videoNode, 'props.poster', _.get(videoNode, 'props.source.uri'));
    _.unset(videoNode, 'props.source');
  }
  return json;
}
Applying the component recognition results through the business logical base has a clear benefit: component recognition is decoupled from business logic. Users' components are uncertain, the name and properties of each component differ, and so does the application logic after recognition. The business logical base allows users to customize how recognized components are applied; otherwise, the component recognition results could not be used.
If the canvas in the editor does not support rendering these components, the component nodes will be rendered as empty nodes and cannot be presented in the canvas; in other words, the desired effect cannot be seen after the visual is restored. So, although not strictly required, it is important for the canvas to support component rendering.
Currently, components can be packaged into canvas resources as NPM packages. The open rendering engine SDK of iceluna allows imgcook users to customize the editor canvas: users select the components to package and obtain the canvas resources based on the configuration.
Architecture of Editor Canvas Creation
Currently, a specific component recognition model has been trained for the carousel and video components commonly used in Taobao. The whole online procedure of component configuration, recognition, rendering, intervention, and code generation is supported. The model has been applied in businesses such as Double 11 and Juhuasuan. Trained on domain-specific component samples, it reaches a recognition accuracy of 82%, which makes it more feasible for online applications.
(Demonstration of Component Recognition Application in the Whole Procedure)
Applying component recognition requires users to configure components, train recognition models, and create canvas resources for rendering. Component configuration and canvas creation are relatively simple, but a user-defined component library requires corresponding component sample images for model training. Currently, the samples used to train Taobao's component recognition model are collected manually or generated automatically by programs. It would be costly for users to collect samples or write programs for sample generation.
Some users hope to integrate the component recognition capability. However, this capability depends on the model's generalization, which in turn depends on the samples used for training, and a universal model that can recognize all components cannot be provided for now. Therefore, customized models and automatic sample generation capabilities need to be provided to minimize users' costs.
All-in-one Management Prototype for Sample Management, Model Training, and Model Service Applications
At present, the sample generator can automatically generate training samples from designs uploaded by users, and the algorithm model service is also available for online training. However, the whole process is not yet completed online. In the future, the F(x) team plans to put the whole process online and support automatic model iteration based on online users' data feedback.
Imgcook 3.0 Series: How Does Design-based Code Generation Recognize Icons?
Imgcook 3.0 Series: Layout Algorithm – Design-based Code Generation