Exploring BERT Distillation for Spam Public Opinion Identification
Large-scale pre-trained models such as BERT have recently achieved remarkable results on many NLP subtasks, but their massive parameter counts make them hard to deploy online and unable to meet production latency requirements. The public opinion review business contains a large amount of spam public opinion, and screening it manually consumes considerable manpower. This article applies BERT distillation to the spam public opinion recognition task to improve the performance of a TextCNN classifier, and exploits the student's small size and fast inference to deploy it successfully in production.
[Figure: examples of risk samples]
I. Traditional distillation scheme
At present, there are four main families of techniques for model compression and acceleration:
Parameter pruning and sharing
Low-rank factorization
Transferred/compact convolutional filters
Knowledge distillation
Knowledge distillation transfers the knowledge of a teacher network to a student network, so that the student's performance approaches the teacher's. This article focuses on the application of knowledge distillation.
1. Soft labels
Knowledge distillation traces back to the model-compression work of Caruana and colleagues, and the soft-label form used here was proposed by Hinton et al. (2015). Soft labels produced by the teacher network (a complex network with strong accuracy but long prediction time) are introduced as part of the overall loss to guide the student network (a simpler network with weaker accuracy but fast prediction), thereby achieving knowledge transfer. It is a general, simple technique that can be applied across different model architectures.
The category predictions produced by a large neural network encode the similarity structure between classes.
A small network equipped with this prior can converge with very little data from a new scenario.
The softmax output becomes more evenly distributed (softer) as the temperature increases.
The loss formula is as follows:
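Assuming the standard Hinton-style temperature-scaled formulation, with teacher logits $z_t$, student logits $z_s$, ground-truth label $y$, temperature $T$, and mixing weight $\alpha$:

$$\mathcal{L} = \alpha \, T^2 \, \mathrm{KL}\!\left(\mathrm{softmax}(z_t / T) \,\Big\|\, \mathrm{softmax}(z_s / T)\right) + (1 - \alpha)\, \mathrm{CE}\!\left(y,\ \mathrm{softmax}(z_s)\right)$$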
From this we can see that distillation has the following advantages:
The student learns the feature-representation ability of the large model, and also learns inter-class information that is absent from the one-hot labels.
It improves robustness to noise: when a label is noisy, the gradient coming from the teacher model partially corrects the gradient the student receives from the hard label.
To some extent, the generalization ability of the model is strengthened.
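A minimal PyTorch-style sketch of this soft-label objective (the function name and default weights are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    """Temperature-scaled soft-label loss plus the usual hard-label cross entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits.detach() / T, dim=-1),
                    reduction="batchmean") * (T * T)   # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```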
2. Using hints
FitNets (Romero et al., ICLR 2015) uses not only the teacher network's final output logits but also its intermediate hidden-layer representations to train the student network, yielding student networks (FitNets) that are deep and thin.
The intermediate-layer (hint) learning loss is as follows:
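Following the notation of Romero et al., it is an L2 regression between the teacher's hint-layer output $u_h$ and a learned regressor $r$ applied to the student's guided-layer output $v_g$:

$$\mathcal{L}_{HT}(W_{Guided}, W_r) = \frac{1}{2}\,\big\| u_h(x; W_{Hint}) - r\big(v_g(x; W_{Guided}); W_r\big) \big\|^2$$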
By adding this intermediate-layer loss, the authors use the teacher network to constrain the student's solution space, pulling the student's optimal parameters closer to the teacher's, so that the student learns the teacher's higher-order representations while reducing redundant parameters.
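A minimal PyTorch-style sketch of such a hint loss, assuming student and teacher hidden states of dimensions `student_dim` and `teacher_dim` (names are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

class HintLoss(nn.Module):
    """FitNets-style hint loss: regress the student's guided layer onto the teacher's hint layer."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Regressor that maps the student's hidden size to the teacher's hidden size.
        self.regressor = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
        # L2 regression against the (frozen) teacher features.
        return 0.5 * (self.regressor(student_hidden) - teacher_hidden.detach()).pow(2).mean()
```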
3. Co-training
Route Constrained Optimization (RCO; Jin, Peng, et al., arXiv 2019) was inspired by curriculum learning: a large capability gap between student and teacher leads to poor distillation and cognitive bias. The authors propose route-constrained hint learning, which follows the teacher network's own training route: intermediate teacher checkpoints are presented to the student one after another, so the student learns from these intermediate models step by step, from easy to hard.
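A schematic sketch of this route-constrained schedule, assuming the teacher's intermediate checkpoints were saved as state dicts and that a `distill_one_epoch` helper (hypothetical) performs ordinary soft-label distillation:

```python
import torch
from typing import Callable, Iterable

def rco_distill(student: torch.nn.Module,
                teacher: torch.nn.Module,
                checkpoint_paths: Iterable[str],
                distill_one_epoch: Callable[[torch.nn.Module, torch.nn.Module], None],
                epochs_per_stage: int = 1) -> None:
    """Distill the student against successive teacher checkpoints along the teacher's training route."""
    for path in checkpoint_paths:
        # Load the next (progressively stronger) teacher checkpoint.
        teacher.load_state_dict(torch.load(path))
        teacher.eval()
        for _ in range(epochs_per_stage):
            distill_one_epoch(student, teacher)   # soft-label distillation as in Section 1
```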
II. Bert2TextCNN distillation schemes
To improve model accuracy while guaranteeing latency and coping with limited GPU resources, we set out to distill the BERT model into a TextCNN model.
Scheme 1: offline logit distillation into TextCNN
The classic soft-label distillation described in Section I is used: BERT logits are computed offline and the TextCNN student is trained against them together with the hard labels.
Scheme 2: joint BERT-TextCNN distillation training (parameters isolated)
Parameter isolation: in each round the teacher is trained once and its logits are passed to the student. The teacher's parameters are updated only by the hard-label loss, while the student's parameters are updated by both the soft-label loss against the teacher's logits and the hard-label loss against the ground truth.
Scheme 3: joint BERT-TextCNN distillation training (parameters not isolated)
No parameter isolation: similar to Scheme 2, except that the gradient of the student's soft-label loss from the previous iteration is also used to update the teacher's parameters.
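A minimal sketch of the difference between Schemes 2 and 3, collapsing the alternating teacher/student updates into one combined step for brevity; the `teacher`, `student`, and `optimizer` objects and the default weights are illustrative assumptions, and the only switch is whether the teacher's logits are detached before the student's soft loss:

```python
import torch
import torch.nn.functional as F

def joint_step(teacher, student, batch, labels, optimizer,
               isolate_teacher: bool, T: float = 2.0, alpha: float = 0.5) -> None:
    """One joint training step (sketch). isolate_teacher=True ~ Scheme 2, False ~ Scheme 3."""
    teacher_logits = teacher(batch)
    student_logits = student(batch)

    # Scheme 2: the student's soft loss must not flow back into the teacher.
    soft_target = teacher_logits.detach() if isolate_teacher else teacher_logits

    teacher_loss = F.cross_entropy(teacher_logits, labels)          # teacher: hard labels only
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         F.softmax(soft_target / T, dim=-1),
                         reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    student_loss = alpha * soft_loss + (1 - alpha) * hard_loss

    optimizer.zero_grad()                                            # optimizer holds both models' parameters
    (teacher_loss + student_loss).backward()                         # in Scheme 3 the soft loss also updates the teacher
    optimizer.step()
```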
Scheme 4: joint BERT-TextCNN training with summed losses
The teacher and student are trained at the same time in a multi-task fashion, with their losses added together.
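The overall objective in this scheme is plausibly just the sum of the teacher's supervised loss, the student's supervised loss, and the distillation term between them (the exact weighting is not specified in the text):

$$\mathcal{L}_{total} = \mathcal{L}_{CE}(y, z_t) + \mathcal{L}_{CE}(y, z_s) + \mathcal{L}_{KD}(z_t, z_s)$$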
Scheme 5: multiple teachers
When a model is updated, it usually needs to keep covering the samples caught by the online historical model. The online historical model is therefore used as an additional teacher, so that the new model learns the knowledge of the historical model and maintains high coverage of its predictions.
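A sketch of the resulting student objective with two teachers (the BERT model and the historical online TextCNN), assuming their logits are available for each batch; the function name and the weights `alpha`/`beta` are illustrative:

```python
import torch
import torch.nn.functional as F

def multi_teacher_loss(student_logits: torch.Tensor,
                       bert_logits: torch.Tensor,
                       historical_logits: torch.Tensor,
                       labels: torch.Tensor,
                       T: float = 2.0, alpha: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    """Student objective distilled from two teachers: BERT and the historical online TextCNN."""
    def kd(teacher_logits: torch.Tensor) -> torch.Tensor:
        # Temperature-scaled KL term against one teacher's (frozen) logits.
        return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits.detach() / T, dim=-1),
                        reduction="batchmean") * (T * T)

    hard = F.cross_entropy(student_logits, labels)
    return hard + alpha * kd(bert_logits) + beta * kd(historical_logits)
```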
[Table: experimental results of the five schemes]
From these experiments we can observe some interesting phenomena.
1) Schemes 2 and 3 both train the teacher first and then the student, but they differ in whether the back-propagated gradient updates are isolated, and Scheme 2 performs worse than Scheme 3. In Scheme 3, each round trains the teacher once and the student once, and the soft loss learned by the student is fed back to the teacher, so the teacher learns how to guide the student appropriately; this also improves the teacher's own performance.
2) Scheme 4 adopts joint updating with gradient feedback, yet the performance of TextCNN drops sharply. Although BERT's own performance does not decline, it is difficult for BERT to guide TextCNN correctly through the feedback at every single step.
3) Scheme 5 uses the logits of the historical TextCNN, mainly so that the new model can replace the online model while maintaining high coverage of the original model. Although recall decreases, overall coverage is 5% higher than that of a single TextCNN.
References
1. Hinton G, Vinyals O, Dean J. Distilling the Knowledge in a Neural Network[J]. 2015.
2. Romero A, Ballas N, Kahou S E, et al. FitNets: Hints for Thin Deep Nets[J]. ICLR 2015.
3. Jin X, Peng B, Wu Y, et al. Knowledge Distillation via Route Constrained Optimization[J]. 2019.
Welcome to join the Big Security Machine Intelligence team of Ant Group. We use big data and natural language understanding technology to mine financial and platform risks from massive public opinion, safeguarding user funds and improving the user experience across the Ant ecosystem.