Is Unlabeled Data Not So Useless After All? See What Alibaba's Doing with It

By Qi Xiang, nicknamed Yizhi at Alibaba.

Often unlabeled data cannot be effectively used. This is a major problem in the industry. Combatting this problem, at Alibaba we proposed a deep learning risk control algorithm named Auto Risk to address any of our business scenarios with insufficiently labeled data and a large amount of unlabeled data that cannot be effectively used. This algorithm was directed at behavior sequence data. As part of this initiative, we also proposed the use of agent tasks to learn general features from unlabeled data.

Our idea in many ways follows the same lines of pre-training models like the leading Bidirectional Encoder Representation from Transformers (BERT) model in the natural language processing field. However, behavior sequence data and business is quite distinct from the data typically seen with natural language processing. Therefore, the design and implementation of our model had to be different.

Our model has been implemented in real business scenarios, delivering some real world improvements. Experimental verification showed that the model's broad capabilities can be applicable to a variety of industry scenarios. And, compared with purely supervised learning, this model also delivers significant improvements in scenarios with a small number of samples.

For context, behavior sequence data like the browsing data collected from Taobao shoppers and risk control events in Alipay are common type of data at Alibaba Group. And, in fact, this kind of data serves as an important source-level input for intelligent services we offer like our product recommendation and risk control algorithms.

Consider the following examples. Say we are given the transaction sequence of a user and asked to predict what the user would buy next, or we are given the sequence of a risk control event and asked to predict whether a product is legal or illegal. The reality is that, for both these scenarios, we are require to characterize a list of behavior sequences into vectors and classify the sequences. This is where our algorithm comes into play.

A behavior sequence diagram

Traditionally, many features such as triggers and accumulations are designed based on experience and entered manually. Then, classifiers such as a Gradient Boosting Decision Tree (GBDT) are trained based on these features. In recent years, a relatively successful method has been to use neural networks like recurrent (RNN) and convolutional neural networks (CNN), as well as attention mechanisms. With these, behavior sequences are used directly as the input, and classification results or eigenvectors are the output. This method can be essentially summarized as an everything-to-vector idea. It has the advantage that it avoids the tedious work of manual characterization.

So, it was along these lines that our team proposed the Detail Risk framework for the algorithm. This framework converts the data for the behavior sequence of a user into classification vectors through multiple network layers, which involve discrete field embedding, text convolution, multi-field integration, event convolution, and attention. We have implemented this framework in multiple scenarios to much success. Overall, this framework has greatly reduced the manual work needed to process and use data and has also improved the performance of the model.

Detail Risk framework diagram

However, regardless of all of this success, most of these methods still use supervised learning and cannot completely avoid the problem of insufficient labeled samples. That is, a small number of samples cannot fully make use of the large capacity that is the advantage of a neural network model. But, if multi-task labels are introduced, the migration capabilities between tasks need to be carefully evaluated and these two factors balanced.

In addition, a massive amount of unlabeled data is constantly accumulated in our business. Therefore, if we can find a way to use this unlabeled data to train models and learn general upper-layer features, we will be able to leave limited labels to downstream scenarios to train a simple classifier. This will ultimately provide great improvements to our overall data utilization. Another point is that eigenvectors generated in unsupervised learning are different from manually designed features, but the integration of the two can also help to achieve better results.

Pre-training

Similar problems also exist in other parts of our business. Last year, a solution was developed based on natural language processing research, which specifically was solution using pre-training technology. Pre-training technology uses readily available agent tasks that embody knowledge, a large amount of unlabeled data, as well as deeper networks. All of this allows the model to learn effective upper-layer features without any manual assistance. Moreover, with such features, the entire system can achieve better results after being fine-tuned in downstream tasks.

Along these lines, in 2018, pre-training models such as Embeddings from Language Model (ELMo), Generative Pre-training (GPT), BERT, GPT2, and Enhanced Representation through Knowledge Integration (ERNIE) constantly redefined the state-of-the-art (SOTA) models for basic natural language processing problems and drove rapid development in the field. Among these new models, BERT set 11 new records at once, which caused algorithm engineers to sit up in their seats and take notice.

The "AND" training diagram

In the computer vision field, pre-training large networks with ImageNet can be traced back to as far as 2014 when deep learning was just starting to get adopted. In natural language processing, the word2vec or Global Vectors for Word Representation (Glove) algorithm are typically used to pre-train word vectors. However, it wasn't until recently that we were also able to pre-train a large model like BERT. This change meant that significant improvements in the performance of downstream tasks could be achieved even with unlabeled samples. We believe that this technology was made possible mainly because of the following conditions:

Accumulation of agent tasks: Agent tasks are not selected at random, but are selected by in ascending order of difficulty. For example, simple agent tasks such as Cbow and SkipGram are selected first, and then difficult tasks such as Masked-LM and next sentence prediction are left to the end. The more difficult the task, the more upper-layer abstract knowledge they can obtain. This is the key to pre-training.
Deep network: The deeper a large-capacity network, the more upper-layer features the network can obtain, and the more knowledge the network can store. The emergence of ResNet, various norm technologies, and ShortCut, BottleNeck, Branches, and BatchNorm (SBBB) construction modes have simplified the construction and training of deep networks.
Attention: It is necessary to separately introduce attention. Attention provides dedicated memory for neural networks and various functions based on memory access, including alignment, combination, remote dependency, and global vision. It can enrich the representation capabilities of a model. The advantage of attention is especially significant for the sequence data that we are concerned about.
The rise of CNN streams: In contrast to RNN, CNN was not originally used for sequence data. However, CNN provides intrinsic support for parallel structures, is well suited to industrial applications, and is easy to stack and suitable for building deep models. And, the global vision it lacks can be provided by attention. Essentially, a transformer is also a CNN stream, which is a convolution core with a width of 1 plus self-attention. In the past two years, various industrial frameworks using CNN streams, such as ConvS2S, ByteNet, WaveNet, SliceNet, and Transformer, have gradually taken the place of RNN.

The technology involved in BERT pre-training

Problem Analysis

So far, BERT has been used in some of Alibaba's internal NLP products, but not in other products. But, there are several problems hindering its wider adoption:

The data formats are different. Data in risk control and recommendation scenarios is not text, but behavior sequences. This means that the pre-training parameters provided originally by Google cannot be used, so our model must adapt to the following data features: multiple fields input at each time T, fields of different modes, and huge sequence lengths without natural separation by sentences.
The training overhead is huge. Models designed for NLP are heavy-weight: A transformer can have up to 16 million parameters at each layer and over a dozen layers and has at least 200 million parameters in total. Self-attention will cause out-of-memory (OOM) when the input sequence length at a single layer exceeds 1,000. Even if the memory is sufficient, the convergence speed is very slow. Even when eight of the latest graphics processing units (GPUs) are used, the training time will be several months. These models are worthwhile when it comes natural language parsing and composition, but are extravagant for behavior sequence data.

Therefore, in order to benefit from pre-training, we must design and implement our own pre-training model based on the characteristics of our data and business. This article presents the pre-training framework we designed and implemented for unsupervised behavior sequences and verifies its effectiveness in actual business scenarios. Our business scenario is risk control and therefore we called the framework Auto Risk model.

Model Design

Agent Tasks

The pre-training model does not need the labels of any actual tasks, but only requires readily available agent tasks to drive the training. The design of agent tasks determines the knowledge that the model can explore. In our study, we compared behavior sequence data with text, considering each point of time T as a word and each continuous sequence 1:T as a document, which is similar to BERT. We also designed the following two types of agent tasks:

Two types of agent tasks

Masked language model for word- and event-level agent tasks: We masked the value of an input sequence at time T and required that the model output word-level vectors that cover the masked value at T. This task drives the model to explore associations before and after a sequence and put behaviors in context for investigation.
Quick thought for sentence- and sequence-level agent tasks: We divided each sequence into two sub-sequences by sampling, used twin networks to encode each of the two sub-sequences into a vector, and then randomly combined the sub-sequences in a batch. Then, we used the model to predict whether the combined pairs of sub-sequences have the same source. This method is derived from Skip Gram's sentence-level generalization: Skip Thought. Then, the slow encoder and decoder in Skip Thought are replaced with fast twin networks, which result in Quick Thought. Quick Thought tasks drive a network to explore the symbolic characteristics of sequences.

Network Structure

Agent tasks provide readily available labels, and the core of the specific model structure is the encoder network. In previous section we already showed that it is inefficient to directly use heavy weight transformers, such as those in the BERT and GPT models. Therefore, acknowledging this, we have proposed a more efficient encoder structure based on convolutions and attention.

At the embedding layer, input fields are converted into vectors. For example, fields such as event type, time, amount, payment channel, and item name are embedded. Then, the fields are integrated through the Add or Concat function. Text fields are displayed in lists. Therefore, they must be embedded and then summarized into single vectors through convolution or the Average operation.
At the convolution layer, local context is captured, which we consider to be the primary feature of behavior sequences in risk control scenarios. Therefore, the local context must be accurately and efficiently captured.
At the attention layer, global context is captured, which we consider to be the secondary features of behavior sequences in risk control scenarios. More importantly, global context provides additional vision and capabilities.
One convolution layer and one attention layer form a block. Multiple blocks stacked in the ResNet format will form an encoder network, which is similar to a BERT model formed by multiple stacked transformers. The blocks can help us obtain abstract and merged information.

Auto Risk model schematic

Improvement at the Convolution Layer

When convolution is used for sequences, multiple convolution layers must be stacked to increase the vision. This causes two side effects: First, the gradients are diffused and optimization becomes difficult. And, second, parameters and computations increase significantly. To overcome these side effects, we replaced general convolutions with two special convolutions: Gated Conv and Depthwise Separable Conv. At first, a gate mechanism similar to Long Short-Term Memory (LSTM) is used to suppress the diffusion of gradients so that more layers can be stacked. Then, a convolution is divided into the depthwise and pointwise steps to reduce the amount of parameters and computations. For example, if the feature dimension D is 256 and the convolution core width K is 5, the number of parameters and computations will decrease by 80% from 320,000 to 60,000. If the convolution core width K is 31, the number will fall to only 3.6% of the original number. The improved convolution layer significantly improves the convergence speed and ultimate performance of the model.

Convolution improvement

Improvement at the Attention Layer

Attention provides excellent vision and capabilities, but requires a great deal of video memory to perform comparisons between sequences. In practice, if the length of a sequence exceeds 1,000, which is not long for a behavior sequence, self-attention at one layer will cause OOM. For practical reasons, we looked into replacing self-attention with fixed size attention or block attention. This optimizes the memory usage to O(2NK) and at the price of slight performance degrade. Finally, three attention layers can be stacked, allowing us to process a sequence with a length of 4,000 on a single GPU. This allowed us to meet our business requirements.

Attention improvement

Many tricks are needed to train such a large network. We will discuss these tricks in later articles.

Training

After the optimizations described above, we can use only one graphics card to:

Train a network stacked with three layers of encoders on sequences with a length of 4,000. In a BERT model based on transformers, only a one-layer network can be trained on sequences with lengths less than 1,000.
Achieve a batch training speed that is two or three times faster than the Transformer (the advantage is great for longer sequences) and has fewer convergence steps so that training on tens of millions of data records can be completed within one day.

The following figure compares the training processes using different encoder structures and shows that:

Using both convolution and attention encoders is better than only using either convolution or attention encoders.
Convolution contributes more to Loss and plays a more important role than attention.
The performance of the framework is better when more block layers are stacked.

Application Results

Business Benefits

First, let's evaluate the business benefits of the Auto Risk model. Our risk control sequence includes key risk control behaviors involved in such actions as logon, password change, and transactions. We select certain users at random as the training set, train a three-layer network with a hidden value of 128, and then deduce the vectors of other users. Finally, we add these vectors to a feature pool to compare the improvement of the area under the curve (AUC). We will compare the following approaches:

SOTA for manual features: SOTA manual features including asset capabilities and creditworthiness are well tested in scenarios such as consumer finance, credit and loan, and post-payment.
AutoRisk pre-training: Original Auto Risk vectors are added based on the SOTA of manual features.
AutoRisk finetune: Auto Risk vectors fine-tuned for specific scenarios are added based on the SOTA of manual features.

As you can probably see, the AUC improves from 3% to 6% after Auto Risk vectors are added, which shows that the unsupervised Auto Risk model can extract useful features from behavior sequences. If network parameters are fine-tuned for specific scenarios, better results will be achieved. This is the same as in models such as BERT. For ease of comparison, this figure only shows the effect of risk control events that are used as sequence data.

Multi-scenario Performance

The pre-training model does not use labels for any specific scenarios during training. Therefore, the knowledge learned by the model is relatively general in nature. We tested different scenarios, including irrelevant gender and age prediction scenarios, by using the simplest logistic regression (LR) classifier without adding any manual features or performing fine-tuning. After training and testing the scenarios, we achieved surprising results. In certain scenarios, the AUC reaches 0.9. One potential business benefit is that we can obtain general supplemental features for various businesses at an extremely low cost.

But, how can we achieve such results by only using LR? One way to think about it is that the Auto Risk model fully retains the information of behavior sequences in a good embedding space. This allows us to find appropriate linear classification interfaces for different tasks. The following figure shows the test set samples and the classification plane for cash-out in a consumer finance scenario. As Umap is used to reduce the 128-dimensional vectors to three-dimensional vectors, the classification performance is degraded by about 6% of AUC. However, we can see that:

There is a clear classification plane between cash-out and non-cash-out merchants.
There is an obvious stream structure in the space. Although we have not analyzed the meaning of each cluster, we are certain that the points in the clusters have similar behavior modes.

Small Sample Learning

The pre-training model also delivers benefits in small sample scenarios. Due to the large number of parameters in deep learning models, they do not perform well when the number of labeled samples is small in number. However, the pre-training model learns most of its knowledge through unsupervised agent tasks. Therefore, we can achieve better results for a scenario with a small number of labeled samples. The pre-training model is more suitable for businesses that require cold start or for which labeling is expensive. We conducted experiments on two types of behavior sequences in post-payment scenario B. The results show that the Auto Risk model can achieve better results than training a supervised learning neural network from scratch. For behavior log data, simply using the Auto Risk model and LR classifier can achieve better results than supervised learning, even if we do not perform fine-tuning. When the number of labeled samples in the training set reaches 40,000¡ªwith 20,000 positive samples and 20,000 negative samples¡ªsupervised learning still cannot catch up with the Auto Risk model and fine-tuning.

Sequence Analogy

Analogy is an interesting characteristic of word embedding. In the embedding space, King - Man = Queen - Woman and China - Beijing = France - Paris. Similar equations prove that the embedding space can indeed capture upper-layer semantics. Do the sequences in our Auto Risk space have similar characteristics? Similarly, we conducted an A - B = C - D experiment. We selected A, B, and D from a set that comprises 1 million samples, and recall C by using cosine similarity with the A - B + D vector. For ease of description, we show different fields separately even though they are trained at the same time.

Event type: A (Payment through Ant Credit Pay on Taobao) - B (Payment through Alibaba account balance) = C (Payment through Ant Credit Pay from outside the Alibaba website) - D (Payment through non-Alibaba account balance). The difference vector for the Taobao payment method is obtained by A - B. When the difference vector is added to D, the payment method is changed from balance to Ant Credit Pay. This indicates that certain directions in the embedding space learned by the model are specifically used to record payment methods.

A=[Create transaction - Taobao physical guarantee, Ant Credit Pay payment - Taobao physical guarantee, Create transaction - Taobao physical guarantee, Ant Credit Pay payment - Taobao physical guarantee, Create transaction - Taobao physical guarantee, Ant Credit Pay payment - Taobao physical guarantee, Create transaction - Taobao physical guarantee, Ant Credit Pay payment - Taobao physical guarantee, Create transaction - Taobao physical guarantee]
B=[Create transaction - Taobao physical guarantee, Balance payment - Taobao physical guarantee, Create transaction - Taobao physical guarantee Balance payment - Taobao physical guarantee, Create transaction - Taobao physical guarantee, Balance payment - Taobao physical guarantee, Create transaction - Taobao physical guarantee, Balance payment - Taobao physical guarantee, Create transaction - Taobao physical guarantee]
C=[Ant Credit Pay payment - Instant transfer from non-Alibaba account, Ant Credit Pay payment - Instant transfer from non-Alibaba account, Ant Credit Pay payment - Instant transfer from non-Alibaba account, PC side - Create transaction, Logon_app_other_, Ant Credit Pay payment - Instant transfer from non-Alibaba account, PC side - Create transaction]
D=[App side - Logon, Balance payment - Instant transfer from non-Alibaba account, Balance payment - Instant transfer from non-Alibaba account, Balance payment - Instant transfer from non-Alibaba account, Balance payment - Instant transfer from non-Alibaba account, Balance payment - Instant transfer from non-Alibaba account]

Amount: A (large-amount user) - B (small-amount user) = C (large-amount user) - D (small-amount user). The recalled values of C are 8000 and 10000, which are the same as those of A. This proves that the model has a good memory for numerical grading.

A=[\\N,\\N,10000.0,10000.0,10000.0,8000.0,8000.0,8000.0,\\N,8000.0,\\N,\\N,8000.0,\\N,8000.0]
B=[\\N,\\N,10.0,10.0,10.0,10.0,10.0,10.0,\\N,10.0,\\N,\\N,10.0]
C=[\\N,\\N,\\N,8000.0,\\N,\\N,8000.0,\\N,\\N,\\N,\\N,\\N,\\N,10000.0,\\N,10000.0]
D=[\\N,1.0,\\N,1.0,\\N,1.0,1.0,\\N,1.0,\\N,1.0,1.0,\\N,1.0,\\N,\\N]

Item name: A (user who frequently takes taxis) - B (user who frequently recharges with coins) = C (user who frequently takes taxis) - D (user who frequently recharges with coins). Note that the item name field is different from other fields and must be converted into vectors through a CNN or an average sub-network. In addition, features such as "bus ticket" and "travel time" are also recalled. These fields are similar but not identical to "DiDi Express". This proves that the network is able to generalize text descriptions.

A=["DiDi Express-Driver Zhou","DiDi Express-Driver Zhou",...,"DiDi Express-Driver Shao","DiDi Express-Driver Shao",...]
B=["Tencent QQ coins recharged by 100 RMB","Tencent QQ coins recharged by 100 RMB",...,"Tencent QQ coins recharged by 100 RMB","Tencent QQ coins recharged by 100 RMB",...]
C=["DiDi Express-Driver Feng","DiDi Express-Driver Feng","Bus ticket**Bus terminal (south district","","Bus ticket**Bus terminal (south district)","","Quick Unicom recharge of 10 RMB","","","DiDi Express-Driver Qi","DiDi Express-Driver Qi","","","0000****-Plate no. [00*****] Travel time: 2019-04-1914:53:02"]
D=["1000 Tencent QQ coins 1","Tencent QQ coins/QQ coin card/",...,"Tencent QQ coins/QQ coin card/","Tencent QQ coins/QQ coin card/",...]

Summary

In this article, we discussed our Auto Risk algorithm for deep learning of behavior sequences. This algorithm does not require specific labels for training, but is based on the concept of agent task pre-training similar to BERT. It explores context associations and symbolic characteristics in a large amount of unlabeled data to generate useful upper-layer features. This solves the problems of insufficient labeled samples and unlabeled samples that are hard to use.

We designed the model structure based on the data and business characteristics to facilitate fast training and deployment. We have implemented this algorithm in our actual business operations and achieved significant performance improvements. No labeling is required during training. Therefore, the model results can be applied in many other scenarios and can significantly improve performance in scenarios with a small number of samples. The sequence analogy experiment proves that upper-layer semantics can be captured in the Auto Risk vector space.

In the future, we will continue our current work, expand the model to suit more types of data sources and application scenarios, and verify other agent tasks, including agent tasks that use multi-scenario labels accumulated previously.

Community

Is Unlabeled Data Not So Useless After All? See What Alibaba's Doing with It

Pre-training

Problem Analysis

Model Design

Agent Tasks

Network Structure

Improvement at the Convolution Layer

Improvement at the Attention Layer

Training

Application Results

Business Benefits

Multi-scenario Performance

Small Sample Learning

Sequence Analogy

Summary

qixiang.qx

You may also like

Comments

qixiang.qx

Related Products

Intelligent Robot

Platform For AI

AI Acceleration Solution

Tongyi Qianwen (Qwen)