How does Shenma Search improve the timeliness of search?
I. Definition of the problem
How should we understand timeliness? As the old saying goes, "The four directions, together with up and down, are called yu (space); the past and the present are called zhou (time)." Time runs through everything, and the only way we perceive time is through change. My understanding is that the harder it is for something to change, the slower its time passes: close to the speed of light, time slows down because change becomes difficult.
Understanding timeliness from the content side
In information scenarios, timeliness is likewise defined in terms of change.
The value of a piece of information changes over time. In general it declines until the information becomes invalid, much like radioactive decay in physics. Borrowing from physics, we define a timeliness half-life for information: the time required for the relative value of the information to fall by half. For an article, suppose its amount of information is 100 when it is first published; the time it takes for that amount to decay to 50 is the half-life of the article.
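The half-life definition above can be sketched in a few lines. This is a minimal illustration that assumes pure exponential decay, by analogy with radioactivity; the article does not state the exact decay curve, and the function name and the 7-day half-life are hypothetical.

```python
def remaining_value(initial_value: float, age: float, half_life: float) -> float:
    """Value left after `age` time units, assuming it halves once per `half_life`.

    Exponential decay is an assumption made for illustration only.
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return initial_value * 0.5 ** (age / half_life)


# An article worth 100 at publication, with a hypothetical 7-day half-life.
for day in (0, 7, 14):
    print(day, remaining_value(100.0, day, 7.0))
```

With these assumptions the article is worth 50 after one half-life and 25 after two.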
Understanding timeliness from the demand side
In search scenarios, timeliness is also defined in terms of change.
The best answer to a search changes over time, and the speed of that change differs from query to query. We define time sensitivity as the frequency with which the best answer to a need changes. The more time-sensitive the user's need, the more the user wants new content, and the stronger the correlation between content timeliness and overall satisfaction.
Universal timeliness
Based on how demand is distributed over time, timeliness can be roughly divided into three categories: sudden timeliness, periodic timeliness and universal timeliness.
The specific classification is shown in the following figure. This article focuses on universal timeliness and how to address it.
Unlike sudden and periodic timeliness, universal-timeliness queries follow basically the same time distribution as general search queries; from a time series point of view, the distribution is basically stationary.
Examples include: traffic restrictions around West Lake in Hangzhou; Alibaba's market value; the most convenient way to get from Xiaoshan Airport to Alibaba's Xixi campus; which styles of women's clothing are most popular this year; which shops in Hangzhou are good; and recommended homestays near Lingyin Temple.
The sections above introduced two quantitative timeliness indicators, one on the content side and one on the demand side: timeliness half-life and time sensitivity. For convenience, we grade the intensity of each of these two indicators into five levels:
II. Evaluation Criteria
Any optimization problem first needs a measurement standard, and timeliness is no exception. Before optimizing, a sound evaluation plan should be in place. First, it establishes the status quo and supports case classification and proportion analysis, so that optimization can be targeted. Second, after optimization, a before-and-after comparison shows whether the overall indicators have actually improved.
Before this work, search already had a comprehensive satisfaction score, but that score reflects timeliness only weakly: it deducts points only when timeliness obviously fails to meet the need. To expose timeliness problems better, we built a separate scoring standard for timeliness satisfaction, and evaluation is carried out against both standards at the same time.
As with the Shenma Search satisfaction evaluation, the top 3 results are evaluated against a full score of 3, and timeliness satisfaction is scored by deducting points for each way in which the results fail to meet the need.
Deduction criteria
Points will be deducted when the following rules are hit:
• The time has expired: the result is more than 8 years old, for example a news item from 2009 returned for "How do you know if someone else has operated my computer?".
• The information is invalid: the web page content has expired. For strongly time-sensitive pages such as news, recruitment and download pages, the limit is stricter, and results should be within 1-2 years.
• The time is too old: make a comprehensive judgment based on the time sensitivity of the query and the update frequency of the results. As a general dividing line, results more than 5 years old are treated as too old.
• The time inside the page and the time displayed for the result are inconsistent.
• The result does not reflect the latest state of the event.
Scoring process
• Measure the time sensitivity of the query and of the results.
• Check the rules in order: time expired, information invalid, time too old, internal and displayed times inconsistent, and event not up to date. Deduct 1 point for each rule that is hit.
• Dead links are deducted separately.
• Pure novel queries are not evaluated for now; download needs are evaluated normally.
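The deduction process above can be sketched as a small scoring function. The rule names are hypothetical labels for the five deduction criteria, and flooring the score at 0 is an assumption; the separate dead-link deduction is not modelled here.

```python
# Hypothetical identifiers for the five deduction rules described in the text.
DEDUCTION_RULES = (
    "time_expired",
    "information_invalid",
    "time_too_old",
    "time_inconsistent",
    "not_up_to_date",
)


def timeliness_satisfaction(hit_rules, full_score=3):
    """Start from the full score and deduct 1 point per distinct rule hit.

    Clamping at 0 is an assumption; the source only says 1 point per item.
    """
    hits = set(hit_rules)
    unknown = hits - set(DEDUCTION_RULES)
    if unknown:
        raise ValueError(f"unknown rules: {sorted(unknown)}")
    return max(0, full_score - len(hits))


print(timeliness_satisfaction([]))
print(timeliness_satisfaction(["time_too_old", "not_up_to_date"]))
```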
Notes
• Time-sensitive queries are judged according to their sensitivity. For example, "a coupon whose use-by date has passed" and "a recent question about computer troubleshooting that is answered with a method for the XP system" both count as invalid information.
• Video playback and download resources: if the page shows a time and an introduction, score according to the time, whether or not the video can be played. If no time is shown and the page matches its introduction, a video that cannot be played is an invalidity or low-quality problem rather than a timeliness problem; the result is treated as fully timely and no points are deducted.
III. Overall approach
Before the universal timeliness project, we worked on sudden timeliness, which went through four stages: rule optimization -> migration to a model -> feature abstraction -> model iteration. The lesson was that a project should first settle its overall optimization scheme and then build upward from the underlying features to the upper-level model. This scheme iterates steadily. Designing and optimizing features in the early stage is hard, but the gains in the later stage are significant and fast, the layers are clear and well organized, and problems are easy to locate and fix.
Based on this experience, we decided to start by optimizing the basic features, and then optimize the ranking and recall models using labeled data.
IV. Optimization of basic features
Web page timeliness characteristics
Time extraction
Time extraction is the most basic feature for timeliness ranking. Before extracting a time, we must first define what the page time is. Page times fall mainly into the following categories: page content time, page update time, page publishing time and page discovery time.
• Web content time: the time of the content described in the web page. For example, if a web page describes the end of World War II in 1945, the content time of the page is 1945. If it introduces the opening ceremony of the 2008 Beijing Olympic Games, the content time is August 8, 2008. Currently, the web content time is selected from all the times appearing in the content as the one that best represents the content, using a ranking model trained on annotated data.
• Web page publishing time: generally the time the web page was generated, that is, the time its link was created. Content pages such as news and second-hand listings usually state the publishing time on the page.
• Web page update time: generally the time the main content of the web page last changed. Ordinary news and article pages are not updated after they are generated, so their last update time is usually the publishing time.
• Web page discovery time: generally the time the link was found by the search engine crawler. Because link following by the crawler needs a certain time window, and because of content islands, the discovery time may lag far behind the publishing time.
• Time the web page first enters the index: a page may not be fetched and parsed promptly after the crawler discovers it, so the time it first enters the index may lag far behind the discovery time.
• Web page time: the time that best represents the value of a web page. It is generally chosen from the five times above. At present we use a rule-based method to select it.
The main rules are:
• For news and other content pages, if the publishing time can be clearly extracted through templates and rules, it is selected as the page time.
• When the candidate times are seriously inconsistent, the time with higher confidence is preferred, such as the discovery time or the time the page first entered the index.
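The two rules above might be sketched as follows for a content page. The `PageTimes` structure, the `tolerance` parameter and the definition of "seriously inconsistent" (an extracted publishing time later than the crawler-side time) are assumptions made for illustration, not the production rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PageTimes:
    publish_time: Optional[datetime]  # extracted by templates/rules; None if not extractable
    discovery_time: datetime          # when the crawler found the link
    first_index_time: datetime        # when the page first entered the index


def select_page_time(times: PageTimes,
                     tolerance: timedelta = timedelta(days=1)) -> datetime:
    """Pick the page time for a content page (hypothetical rule set)."""
    crawler_time = min(times.discovery_time, times.first_index_time)
    if times.publish_time is None:
        return crawler_time
    # Assumed inconsistency check: a page cannot be published after it was crawled.
    if times.publish_time > crawler_time + tolerance:
        return crawler_time
    return times.publish_time


page = PageTimes(datetime(2020, 3, 1), datetime(2020, 3, 5), datetime(2020, 3, 6))
print(select_page_time(page))
```

A publishing time far earlier than the discovery time is deliberately accepted here, since the text notes that discovery can lag well behind publishing.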
Time sensitivity (timeliness half-life)
The time sensitivity of a web page is the rate at which its information decays. We use a timeliness half-life to measure this rate, that is, the time required, counted from now, for the information in the page to decay to half of its current value.
It is difficult to label the half-life quantitatively as an exact duration. For simplicity, we discretize the half-life into qualitative grades and learn the half-life of web pages from labeled data.
Annotation guideline:
1) Time sensitivity annotation
For time sensitivity annotation, we wanted the annotators to understand the timeliness of the content as fully as possible, so we did not define very detailed rules and only supplemented the guideline with examples. The core idea is to let annotators perceive the time sensitivity of the content and think about it effectively, rather than learn to apply annotation rules mechanically, so as to obtain higher-quality data.
Although the coarse guideline improved data quality, it also brought some problems:
• Training the annotators is very costly. Training them and walking through cases took about one month in total.
• Annotator agreement is low. At the beginning of the project, the five-person adjacent-grade agreement rate (more than three of the five annotators choose adjacent grades) was below 50%. Even in the final stage of the project, the five-person same-grade agreement rate (more than three of the five annotators choose the same grade) was only about 60%, and the adjacent-grade agreement rate fluctuated between 70% and 80%.
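The two agreement measures described above can be computed as in the sketch below. Whether "more than three of five" means at least three or at least four votes is ambiguous in the text, so the threshold `min_votes` is left as an explicit parameter; the function names and sample labels are hypothetical.

```python
from collections import Counter


def same_grade_agreement(labels, min_votes):
    """True if at least `min_votes` annotators chose exactly the same grade."""
    return max(Counter(labels).values()) >= min_votes


def adjacent_grade_agreement(labels, min_votes):
    """True if at least `min_votes` labels fall within some pair of adjacent grades."""
    counts = Counter(labels)
    return any(counts.get(g, 0) + counts.get(g + 1, 0) >= min_votes
               for g in list(counts))


def agreement_rate(items, check, min_votes):
    """Share of annotated items for which `check` holds."""
    return sum(check(labels, min_votes) for labels in items) / len(items)


# Three items, each labeled by five annotators with grades 0..4 (made-up data).
items = [[2, 2, 2, 1, 0], [0, 1, 2, 3, 4], [1, 2, 1, 2, 4]]
print(agreement_rate(items, same_grade_agreement, min_votes=3))
print(agreement_rate(items, adjacent_grade_agreement, min_votes=3))
```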
2) Time sensitive model
At present we use both a pairwise model and a pointwise model.
• The pairwise model outputs continuous values with higher resolution, which are better suited as basic features for upper-level ranking.
• The pointwise model mainly outputs the grades 0, 1, 2 and above. These are mainly used for index selection and as pseudo-feedback label features for upper-level ranking. The time sensitivity of a query is inferred in reverse by counting the distribution of time-sensitive pages among its recall results, as described in detail later.
Page information failure
Even with page time sensitivity defined, the value of some time-sensitive pages naturally drops after a period of time. Information failure is hard to define for such pages in general, but for pages with clear time boundaries it can be defined precisely.
For example, information publishing pages such as second-hand listings, event announcements and real estate listings have clear time boundaries: once the transaction happens, the goods are taken off the shelf, or the event date passes, the page is clearly invalid. We call these information-invalid pages. Such a page can be considered to have a timeliness value of 0 and needs strong suppression, which is also covered in the ranking section later.
At present, such pages are identified through rules written for different sites and page types.
Demand timeliness characteristics
Time sensitivity of requirements
The time sensitivity of a query is the same concept as that of a web page.
Time sensitivity of a query: the time sensitivity of the pages the query needs. It can be inferred in reverse from the time sensitivity of the pages in the recall results.
However, query time sensitivity affects both the recall and the rough ranking of timely results, so at online retrieval time it cannot be obtained by analysing the recall results. It has to be obtained directly from analysis on the query side.
The query time sensitivity model has gone through three main versions. A brief introduction follows:
1) First version: ABCNN model with attention over time-sensitive words
Attention is applied over a set of time-sensitive patterns to determine whether the query can be combined with time-sensitive words. If the combination of the query with these words is reasonable and common in the search corpus, the query is more likely to be time sensitive. The commonly used words are mainly: latest, recent, this year, year, today, and so on.
2) Second version: distillation based on pseudo feedback
As mentioned above, we already have a time sensitivity model for web pages. Because a web page carries a large amount of information and many structural features, its time sensitivity is easier to predict accurately. With a reasonably accurate web page model, we can generate a large number of pseudo-labeled samples by analysing the distribution of time sensitivity in the recall results. These samples are used to train a CNN model on a large sample set, and the effect is clearly better than the first version.
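One way the pseudo labels in the second version could be derived from the recall results is sketched below. The aggregation rule (take the highest grade reached by at least a given share of the top results) and the parameters `top_n` and `min_share` are assumptions; the article does not specify how the distribution is analysed.

```python
def pseudo_query_sensitivity(page_grades, top_n=10, min_share=0.5):
    """Infer a query-level grade from page-level grades of the top recalled pages.

    Returns the highest grade g such that at least `min_share` of the top
    pages have grade >= g, or None when nothing was recalled.
    This aggregation rule is a hypothetical choice for illustration.
    """
    if not 0 < min_share <= 1:
        raise ValueError("min_share must be in (0, 1]")
    top = list(page_grades)[:top_n]
    if not top:
        return None
    for grade in sorted(set(top), reverse=True):
        share = sum(g >= grade for g in top) / len(top)
        if share >= min_share:
            return grade  # always reached: the lowest grade has share 1.0


# Page grades predicted by the web page time sensitivity model (made-up values).
print(pseudo_query_sensitivity([3, 3, 2, 0, 3, 1]))
print(pseudo_query_sensitivity([0, 0, 0, 3]))
```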
3) Third version: iterative sample annotation based on active learning
Upper-level timeliness ranking requires a TriggerModel, which determines whether a query needs a timeliness adjustment and how strong that adjustment should be.
The TriggerModel is trained on manually annotated data. It uses more features, including time-sensitive words in related searches (users sometimes modify the query and add time-limiting words to filter results), the distribution of web page time sensitivity, the distribution of query time sensitivity, user click behavior, and others. It also uses active learning to label borderline samples and increase the resolution of the model.
Once we have a query TriggerModel with high resolution and accuracy, we can use it to generate a large number of high-resolution samples and, combined with the powerful BERT language model, train a better time sensitivity model.
Because the TriggerModel uses the second-version query time sensitivity as a feature, it can be retrained whenever query sensitivity improves. The new TriggerModel can in turn guide the training of the query time sensitivity model, so the two improve together through iterative training.
Timeliness demand intensity
Timeliness demand intensity sits alongside time sensitivity. It is mainly used to judge whether the user explicitly expresses a time requirement for the results (such as "latest" or "2020").
This model is relatively simple. In the early days it was a rule-based model that checked whether the query contained a salient timeliness pattern. Later, we also generated pseudo-labeled samples for model training from recall results and user behavior, such as a follow-up query in which the user explicitly modifies the original one. For example, when a user searches for "Hangzhou traffic restriction rules" and the results are bad, the user may change the query to "2020 Hangzhou traffic restriction rules" or "the latest Hangzhou traffic restriction rules". The method is similar to the second version of the time sensitivity model.
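A rule-based detector of the early kind, plus pseudo labelling from query reformulation, might look like the sketch below. The English word list and the regular expression are illustrative stand-ins (the production rules target Chinese queries and are not given in the article), and both function names are hypothetical.

```python
import re

# Illustrative pattern: explicit time qualifiers or a four-digit year.
EXPLICIT_TIME_PATTERN = re.compile(
    r"\b(?:latest|recent(?:ly)?|newest|this year|today|(?:19|20)\d{2})\b",
    re.IGNORECASE,
)


def has_explicit_time_demand(query: str) -> bool:
    """True if the query contains an explicit time qualifier."""
    return EXPLICIT_TIME_PATTERN.search(query) is not None


def pseudo_label_from_reformulation(first_query: str, second_query: str) -> int:
    """1 if the user added a time qualifier in the follow-up query, else 0.

    A fuller version would also check that both queries share the same topic.
    """
    return int(not has_explicit_time_demand(first_query)
               and has_explicit_time_demand(second_query))


print(has_explicit_time_demand("the latest Hangzhou traffic restriction rules"))
print(pseudo_label_from_reformulation(
    "Hangzhou traffic restriction rules",
    "2020 Hangzhou traffic restriction rules",
))
```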
V. Data annotation
At present, the core of the upper-level ranking model of Shenma Search is an LTR model trained on labeled samples, so a reasonable way to optimize timeliness is to retrain the LTR model with labeled samples.
Training the LTR model requires labels for the timeliness learning objective. The labeling went through two main stages. In the first stage we tried to integrate the timeliness objective into the five-level AC labels (Perfect, Excellent, Good, Fair, Bad). Later, because this proved too difficult to label, we switched to a two-stage independent labeling method.
Integrating timeliness satisfaction into AC labeling
At present, the AC labels of Shenma Search have five levels (Perfect, Excellent, Good, Fair, Bad = 4, 3, 2, 1, 0). To add the timeliness objective to AC, we added three levels: 2.5, 1.5 and 0.5.
The specific scoring principles are:
• For results with poor timeliness, if satisfaction is affected, lower the label directly by one or two levels.
• If timeliness is not ideal but does not affect satisfaction, lower the label by 0.5; for results with excellent timeliness, raise the label by 0.5.
Because the original 5 levels became 8 levels and each AC label is agreed among 7 annotators, the labeling difficulty increased greatly. In addition, the AC labeling criteria and annotators of Shenma Search had been stable for a long time, and the annotators had formed a settled understanding of the task. Asking them to relearn a new scheme caused the agreement rate to drop sharply, to below 60%, and repeated training brought no significant improvement. We therefore gave up integrating timeliness into the AC labeling system of Shenma Search and adopted a new, independent labeling principle.
Independent timeliness satisfaction
The principle of independent timeliness labeling is to add a second, auxiliary label to samples that already have an AC label from Shenma Search. From the AC-labeled samples, we select time-sensitive queries and label the timeliness satisfaction of each query-URL result whose AC grade is non-zero.
Guideline for labeling timeliness satisfaction: each query corresponds to multiple URLs. Evaluators need to understand the meaning of the query, judge whether the page meets the user's need, and then judge the page's timeliness satisfaction.
• Understand the meaning of the query and infer the user's need.
• Starting from the user's need, judge how well the timeliness of the result meets it.
• Give a reasonable score according to the criteria below.
The grades 2/1/0 are assigned only when the result is relevant.
Any page with a time attribute is scored 2, 1 or 0. Unlike sensitivity labeling, pages whose content has changed are not discarded; they are judged by the timeliness of their main content.
• 2 - The timeliness of the page is very good; it is the latest or a high-value result.
• 1 - The timeliness of the page is acceptable; it is not the latest or most valuable result, but it has some reference value.
• 0 - The timeliness of the page is poor; it is very old or has no reference value.
• Irrelevant - The page content is completely irrelevant to the query.
• Dead link/spam - Page cheating, invalid content or a blank page; low quality.
• No timeliness need - The query obviously has no timeliness need (for example, the full text of the Analects of Confucius).
• Cannot be judged - The page content contains no timeliness factors and cannot be scored for timeliness (for example, encyclopedia pages, long-video pages, website home pages).
As with time sensitivity, we did not write a particularly detailed guideline for timeliness satisfaction. The core idea is to let annotators think for themselves and perceive how a loss of timeliness affects user satisfaction. Agreement was also low at first, below 60%. After a long period of training and case walkthroughs, the final agreement rate was about 75%-85%.
VI. Ranking model
The timeliness ranking model is mainly divided into four layers.
Timeliness rough ranking
For time-sensitive queries, the index recall level should recall as many recent results as possible. The timeliness rough ranking project was carried out early, when no labeled data was available, so the main approach was feature boosting to raise the probability that new results rank higher.
Adding timeliness features to the Shenma Search ranking model
Some parts of the AC standard already take timeliness into account.
• The first type: news. For many news events the people and places stay the same, but the core facts change. This affects basic satisfaction and is reflected in the AC standard.
• The second type: information failure. Information failure is clearly defined in the AC standard as worthless content, which directly affects satisfaction. Generally, the probability of information failure grows with the page's time sensitivity and with the time elapsed since the page time, so by learning from a certain number of information-failure samples the model also learns part of the timeliness objective.
• Other types: there are many others, such as "the latest military parade" and "May Day holiday arrangement".
Because the label of a time-sensitive query depends on when it was labeled, we must record the labeling time of AC samples in order to compute time features accurately, and we must restore the sample time to the labeling time when dumping features. For this reason, we changed the feature dump process of Shenma Search and added a time restore function to keep the timeliness features accurate.
Independent dual-label timeliness ranking model
Because the timeliness data carries two labels, we had to develop an independent multi-label ranking framework. To do this, we made some algorithm changes and extended the original LightGBM tool to support multi-label training.
The main idea is that LambdaMART considers both labels when computing the pairwise loss of swapping two documents:
• First version: when the primary labels of two documents are equal, the auxiliary label comes into play and its loss is computed. In practice this has a limitation: the auxiliary label only affects pairs with the same primary label, and pairs with different primary labels produce no auxiliary loss.
• Second version: to make up for this, we amplify the labels. The original AC labels are scaled to 8, 6, 4, 2, 0, and the timeliness targets 3, 2, 1, 0 are mapped to the offsets 1, 0, -1, -2. The timeliness offset is added to the scaled AC label to form a new target label. At the same time, the 2^Label term in LightGBM's swap loss is changed to 2*Label (following the practice of Shenma Search). The reason is that, after amplification, 2^Label makes the loss at the head especially large, which is inconsistent with the real online swap loss.
• Third version: later, when inspecting samples, we found that the effect of timeliness depends on the AC label of the sample itself. When the label is 1, users do not really care about timeliness. When the label is 3, timeliness plays only a small role and users do not care much either. Timeliness matters mainly for samples labeled 2. We therefore reduce the timeliness weight for levels 3 and 1 and increase it for level 2, which makes the timeliness target more discriminative.
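The label amplification of the second version can be written down directly from the numbers in the text. The sketch below only shows how the combined label and the two gain functions behave; it is not the modified LightGBM code, and the function names are hypothetical. Note that under this literal reading the lowest combination (AC 0, timeliness 0) yields a negative label, which a real training pipeline would need to handle.

```python
# AC labels 4..0 scaled to 8, 6, 4, 2, 0; timeliness 3..0 mapped to 1, 0, -1, -2.
AC_SCALED = {4: 8, 3: 6, 2: 4, 1: 2, 0: 0}
TIMELINESS_OFFSET = {3: 1, 2: 0, 1: -1, 0: -2}


def combined_label(ac: int, timeliness: int) -> int:
    """Scaled AC label plus the timeliness offset."""
    return AC_SCALED[ac] + TIMELINESS_OFFSET[timeliness]


def exponential_gain(label: int) -> int:
    """The 2^Label form mentioned in the text."""
    return 2 ** label


def linear_gain(label: int) -> int:
    """The 2*Label form that replaces it after amplification."""
    return 2 * label


top, mid = combined_label(4, 3), combined_label(2, 2)
print(top, mid)
print(exponential_gain(top) / exponential_gain(mid))  # head dominates heavily
print(linear_gain(top) / linear_gain(mid))            # much flatter ratio
```

The ratio between the two gains for the same pair of labels illustrates why the exponential form over-weights the head once labels are enlarged.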
Dynamically correcting the Shenma Search model with the independent timeliness model
The ranking score from the separate timeliness model cannot be used directly to rank results, because time sensitivity and timeliness demand intensity must be taken into account. For example:
• When time sensitivity is low, timeliness matters little.
• When the overall relevance level is not high, the core of ranking is relevance.
• The timeliness-labeled sample set is much smaller than the AC sample set of Shenma Search, so the timeliness model learns less well than the AC model.
Considering these three aspects, we designed a dynamic smoothing method that blends the timeliness model score into the Shenma Search ranking score:
RankScore = RankScoreAC * Lambda + RankScoreTimeliness * (1 - Lambda)
The core is how Lambda is calculated. We have explored three approaches:
• Early first version: TriggerModel combined with manual rules. The TriggerModel (briefly introduced in the query signal section above) computes a time sensitivity score, the score is mapped to a time sensitivity grade, and a Lambda value is specified manually for each grade.
• Second version, in the middle period: building on the first method, the relationship between the TriggerModel thresholds and the Lambda weight is smoothed. A simple harmonic averaging method ties the Trigger prediction value to the Lambda value, so that the adjustment changes more smoothly.
• Version being tried: a multi-objective fusion algorithm currently being tried in ranking. Using purely pairwise annotated samples, several multi-objective models (each of which has learned the AC target plus one separate sub-dimension) are fused, and the weights of the different models are learned from some global statistical features.
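The blending formula can be sketched as below. The mapping from the TriggerModel score to Lambda is a plain linear interpolation swapped in for illustration; it is not the harmonic averaging scheme mentioned in the text, whose details are not given. The names `lambda_from_trigger` and `min_lambda` are hypothetical.

```python
def blend_scores(ac_score: float, timeliness_score: float, lam: float) -> float:
    """RankScore = RankScoreAC * Lambda + RankScoreTimeliness * (1 - Lambda)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return ac_score * lam + timeliness_score * (1.0 - lam)


def lambda_from_trigger(trigger_score: float, min_lambda: float = 0.5) -> float:
    """Hypothetical smooth mapping from a trigger score in [0, 1] to Lambda.

    trigger 0 -> Lambda 1 (pure AC score); trigger 1 -> Lambda `min_lambda`.
    Linear interpolation is an assumption, not the method used in production.
    """
    t = min(max(trigger_score, 0.0), 1.0)
    return 1.0 - (1.0 - min_lambda) * t


# A query the trigger considers strongly time sensitive (made-up scores).
lam = lambda_from_trigger(1.0)
print(blend_scores(2.0, 4.0, lam))
```

Keeping `min_lambda` well above 0 reflects the point that the timeliness model is trained on far fewer samples than the AC model and should not dominate the final score.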
Exploring universal timeliness ranking with IRGAN
Because timeliness-labeled samples are expensive, some companies in the industry have used IRGAN to iterate their models, and our sister team working on sudden timeliness obtained gains with IRGAN in that scenario. We hoped to obtain gains for universal timeliness in the same way and carried out some exploration.
In the figure, red marks the labeled relevant documents, dark blue marks the relevant documents among the recalled documents that are not labeled, and light blue marks the irrelevant documents among the recalled documents that are not labeled. G is trained by first scoring the unlabeled documents and sending the higher-scoring ones to D. D judges whether each pair was selected by G (treated as fake) or actually labeled (treated as real), and returns its score to G for correction, so that G eventually selects documents close to the truly labeled samples.
Currently G and D share the same network structure and pre-training process, and they are almost identical models before the adversarial phase (apart from different random initialization before pre-training). During adversarial training, however, D always treats the labeled document as better than the document selected by G, even when G has in fact selected a better one (G's pre-trained ability matches D's and may be very good from the start), so G learns worse and worse. We can therefore consider using different network structures for D and G. D only needs to judge whether a pair is in the right or reversed order, so it can be simple. G needs to confuse D, so it can use a more complex network.
VII. Recall
The recall system for timeliness ranking has three main parts: query-side processing for general recall, time-limited queries, and recall from an independent timeliness index.
Query side processing
We found that users with timeliness needs often include words such as: this year, March, the latest, recently. In the original Shenma Search recall system these terms generally carry a large weight. Many literal-match recalls consider only term matching and ignore the relationship between the term and the page time. Because the time limit is not taken into account, results that match the terms but are very old are often recalled. To handle this, we modified the query analysis module of Shenma Search and added two functions, TimelinessTermWeightReadjust and TimelinessQueryRewrite, which optimize recall from the perspective of term weight and query rewriting respectively.
TimelinessTermWeightReadjust
After query analysis, the inverted index at the core of the Shenma Search engine designates AND terms: terms that a document must hit when the posting lists are merged. In a timeliness context, we do not actually want words such as "this year", "recently" and "the latest" to be required hits. These words express relative time: a page may have been "the latest" relative to its publishing time, but as time passes, if the content does not change, the value of that information drops sharply.
The core feature behind the AND logic of an index query is the term weight. Once a term's weight is lowered, the term will most likely be dropped and will not take part in posting list merging. We therefore mined a set of time qualifiers and lowered the weight of these words in time-sensitive scenarios to improve recall.
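The idea behind TimelinessTermWeightReadjust can be sketched as follows. The qualifier list, the scaling `factor` and the AND `threshold` are all hypothetical values chosen for illustration; the real weights and the way the engine selects AND terms are not described in the article.

```python
# Hypothetical list of mined time qualifiers.
TIME_QUALIFIERS = {"this year", "recently", "latest"}


def readjust_term_weights(term_weights, is_time_sensitive, factor=0.2):
    """Scale down time qualifiers when the query is time sensitive."""
    if not is_time_sensitive:
        return dict(term_weights)
    return {term: weight * factor if term in TIME_QUALIFIERS else weight
            for term, weight in term_weights.items()}


def required_terms(term_weights, threshold=0.5):
    """Terms heavy enough to be treated as AND terms (assumed selection rule)."""
    return [term for term, weight in term_weights.items() if weight >= threshold]


weights = {"hangzhou": 0.9, "traffic": 0.8, "latest": 0.7}
print(required_terms(readjust_term_weights(weights, is_time_sensitive=True)))
```

After the adjustment, "latest" falls below the threshold, so old pages that merely contain the word no longer gain an advantage at the posting list merge.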
TimelinessQueryRewrite
When a user writes "this year" or "March", the implied meaning is 2020 or March 2020. Through query rewriting, an additional independent query that specifies the absolute time is issued, and forced matching on the time constraint guarantees that the recalled results satisfy the time dimension.
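A rewrite of this kind might be sketched as below for English text. The regular expressions and the function name are illustrative assumptions; the production rewriter works on Chinese queries and is not described in detail. Ambiguous words such as "May" or "March" used as ordinary words are not handled.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# A month name that is not already followed by a four-digit year.
MONTH_WITHOUT_YEAR = re.compile(r"\b(" + "|".join(MONTHS) + r")\b(?!\s+\d{4})")


def rewrite_relative_time(query: str, today: date) -> str:
    """Build the absolute-time variant of a query, relative to `today`."""
    rewritten = re.sub(r"\bthis year\b", str(today.year), query,
                       flags=re.IGNORECASE)
    return MONTH_WITHOUT_YEAR.sub(lambda m: f"{m.group(1)} {today.year}",
                                  rewritten)


today = date(2020, 3, 15)
print(rewrite_relative_time("Hangzhou traffic restriction rules this year", today))
print(rewrite_relative_time("March holiday arrangement", today))
```

The rewritten string would be issued as an extra query next to the original one, rather than replacing it.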
Time-limited query
A time-limited query is easy to understand: it restricts the recall results according to the half-life that corresponds to the query's time sensitivity. As mentioned in the description of query time sensitivity above, this is the stage where that feature is mainly used.
When we find that a query is time sensitive, we issue a separate query that restricts the time of the recalled results through the Filter syntax. The time used is the web page time described above.
For example, if the time sensitivity is 3, that is, the half-life is 1 week, we use the Filter syntax of the query engine to specify that only results from the last week are recalled. Other sensitivity grades are handled the same way with their corresponding half-lives, to ensure that the recalled results are timely enough.
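Computing the cutoff for such a filter can be sketched as follows. Only the pair "grade 3 -> 1 week" comes from the text; the grade-to-half-life table is passed in as a parameter because the other grades are not given, and the function name is hypothetical. The actual Filter syntax of the engine is not shown.

```python
from datetime import datetime, timedelta
from typing import Mapping, Optional


def recall_time_cutoff(sensitivity_grade: int,
                       now: datetime,
                       half_life_by_grade: Mapping[int, timedelta]) -> Optional[datetime]:
    """Earliest page time allowed by the time-limited query.

    Returns None when the grade has no configured half-life, meaning that
    no time-limited query is issued for it.
    """
    half_life = half_life_by_grade.get(sensitivity_grade)
    if half_life is None:
        return None
    return now - half_life


# Only grade 3 -> 1 week is stated in the article; other grades are left out.
half_lives = {3: timedelta(weeks=1)}
print(recall_time_cutoff(3, datetime(2020, 3, 15), half_lives))
```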
Timeliness index recall
Timeliness index recall mainly addresses some pain points in the business logic. To balance performance against effect, we put sufficiently new content into a separate index and query this timeliness index on its own to increase recall.
Both the time-limited query and TimelinessQueryRewrite described above issue an additional query. If these queries went to the general index, then under the current trigger criteria for universal timeliness queries, the query volume on that index would increase by 50%. This is a heavy performance cost for the index, and the improvement is not necessarily proportionate.
With an independent index, the selection of timeliness data and the logic for putting it into effect can be more flexible, free from the various restrictions of the main Shenma Search index.
For news and other strongly time-sensitive scenarios, content has to be collected within a day, an hour, or even a minute, which cannot be achieved in the main Shenma Search index. A separate timeliness business index is needed to carry this content.
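The routing idea can be sketched as follows, under the assumption that the additional timeliness queries go only to the timeliness index and that results are merged by URL. The function names, the result format, and the toy data are all hypothetical; the real merge and ranking logic is not described in this article.

```python
def recall(query, extra_queries, search_general, search_timeliness):
    """Union general-index results with timeliness-index results, deduplicated by URL."""
    results = list(search_general(query))
    seen = {doc["url"] for doc in results}
    for extra in extra_queries:
        for doc in search_timeliness(extra):
            if doc["url"] not in seen:
                seen.add(doc["url"])
                results.append(doc)
    return results


# Toy stand-ins for the two indexes (hypothetical data).
general_calls = []


def search_general(q):
    general_calls.append(q)
    return [{"url": "a.example/old"}, {"url": "a.example/new"}]


def search_timeliness(q):
    return [{"url": "a.example/new"}, {"url": "b.example/fresh"}]


merged = recall("hangzhou traffic restriction latest",
                ["hangzhou traffic restriction 2020"],
                search_general, search_timeliness)
```

In this sketch the general index still receives exactly one query per request, while the extra load from the timeliness queries lands on the smaller, separate index.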
VIII. Collection
The timeliness collection system is in fact the most basic and essential part of ranking: if a link has not been collected, even the best ranking algorithm is of no use. The current collection system is mainly divided into a directed collection system for burst and strong timeliness, and the general tiered collection system of Shenma Search.
Directed collection system
News collection based on seed pages
Strongly time-sensitive content, especially news, is usually discovered by following links (Flowlink) from news seed list pages. Simple examples of seed pages are the Sina home page, the Sina NBA home page, the Sina Finance home page, the Zhihu hot topics page, and the Weibo hot topics page.
The seed pages are checked regularly for new links in order to find new content. In general, links are followed only one level deep from a seed page. In practice this involves a lot of work, including the discovery and labeling of seed pages, the scheduling algorithm for fetching seed pages, and a mechanism for periodically retiring seed pages.
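The core of the seed-page check, finding links that were not present on earlier visits, can be sketched with the Python standard library. Fetching, scheduling, and seed retirement are left out; the function and class names are assumptions for illustration, and the HTML is passed in as a string so the example stays self-contained.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute href targets of <a> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def find_new_links(seed_url, html, seen):
    """Return links on the seed page not seen before, and record them as seen."""
    parser = LinkExtractor(seed_url)
    parser.feed(html)
    new_links = []
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            new_links.append(link)
    return new_links


seen = set()
first = find_new_links("https://news.example/",
                       '<a href="/news/1.html">A</a>', seen)
second = find_new_links("https://news.example/",
                        '<a href="/news/1.html">A</a><a href="/news/2.html">B</a>', seen)
```

Only the links returned by find_new_links would be handed to the crawler, which matches the one-level link following described above: the new pages themselves are fetched but not expanded further.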
Directed collection based on timeliness requirements
Both the current state of the Internet and its foreseeable trend are closed: the data inside individual websites and apps is hard to reach from the open web. At the same time, with the arrival of the We Media era, every individual is a potential seed page. The original seed scheduling algorithm cannot schedule such a huge volume of content, and even if it could, timeliness and return would be hard to guarantee.
In this situation, collection is generally demand-driven. In short, each website and app has its own search interface. We take queries that carry timeliness demand, construct search requests from them, and fetch data from these websites and apps through their search interfaces.
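Constructing the requests can be sketched as below. The endpoints, parameter names, and function name are hypothetical placeholders; real sites expose different search URLs and parameters, and the article does not describe how the queries with timeliness demand are selected.

```python
from urllib.parse import urlencode

# Hypothetical search endpoints: (base URL, name of the query parameter).
SITE_SEARCH_ENDPOINTS = {
    "site-a": ("https://site-a.example/search", "q"),
    "site-b": ("https://site-b.example/api/find", "keyword"),
}


def build_collection_requests(timely_queries, endpoints=SITE_SEARCH_ENDPOINTS):
    """Build one search URL per (query, site) pair for demand-driven collection."""
    requests = []
    for query in timely_queries:
        for site, (base_url, param) in endpoints.items():
            requests.append((site, base_url + "?" + urlencode({param: query})))
    return requests


collection_requests = build_collection_requests(["hangzhou traffic 2020"])
```

Each generated URL would then be fetched and the result pages parsed for links to collect, so that collection effort follows actual user demand instead of an exhaustive seed list.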
How to understand timeliness? As the old saying goes, "The four sides are called Yu, and the past and present are called Zhou". Time runs through everything, and the only criterion for time perception is change. I understand that the more difficult the relative change is, the slower the time will be. When approaching the speed of light, the time will slow down because it is difficult to change it.
Understanding timeliness from the content side
The criteria for defining timeliness in information scenarios also change.
The value of information will change with time. Generally, the value of information will decline and become invalid. This is very similar to radioactivity in physics. From the point of view in physics, we define a concept of timeliness half-life for information: the time required for the relative value of information to decline by half. To define a time for an article, suppose that the amount of information is 100 when the article is first published, then the time required for the amount of information to decay to 50 is the half-life of the article.
Understanding timeliness from the demand side
The criteria for defining timeliness in search scenarios are also changed.
The best answer to the search will change over time. For different queries, the speed of change is different. We define a concept of time sensitivity for the frequency of changes in the best answer to requirements. The higher the user's sensitivity to time, the more they want to get new content, and the higher the correlation between the timeliness of content and overall satisfaction.
Universal timeliness
Timeliness can be roughly divided into three categories from the time distribution of demand emergence: sudden timeliness, periodic timeliness and universal timeliness.
The specific classification can refer to the following figure. The core of this article is to introduce and solve the universal timeliness.
Different from the burst and periodic timeliness, the time distribution of universal timeliness query and general search query is basically the same, and they are basically stationary from the time series analysis.
For example, the West Lake in Hangzhou is restricted, Alibaba's market value, how to get to Alibaba Xixi Park from Xiaoshan Airport is convenient, what style of women's clothing is the most popular this year, which shops in Hangzhou are better, and the recommendation of Lingyin Temple attached home stay.
Two quantitative indicators related to timeliness on the content side and demand side are introduced above: timeliness half-life and time sensitivity. For convenience, we classify the intensity of these two indicators into five grades:
II. Evaluation Criteria
The precondition for optimization of any problem is to know the measurement standard, and timeliness is no exception. Before optimization, a reasonable evaluation plan should be developed. First, it can be used to find out the status quo, and carry out case classification and proportion analysis for targeted optimization. Second, after optimization, it can be concluded that the optimization will improve the overall indicators through comparison.
Before the timeliness optimization evaluation, the search itself has a comprehensive satisfaction evaluation score, but the comprehensive satisfaction score does not reflect the timeliness very strongly. It only deducts the score when the timeliness obviously does not meet the needs, which is relatively weak. In order to better expose the problem of timeliness, we have separately built a scoring standard for timeliness satisfaction, and the evaluation is conducted according to the two standards at the same time.
Similar to Shenma Search Satisfaction, the full score of 3 is used to evaluate the results of Top3, and the timeliness satisfaction is scored according to the way that the results do not meet the needs.
Deduction criteria
Points will be deducted when the following rules are hit:
• The time has expired for more than 8 years, such as: "How do you know if someone else has operated my computer? The news in 2009.
• The information is invalid, and the webpage content has expired. If the page is news or recruitment, download and other pages with strong time sensitivity, the time should be more strict, and it should be guaranteed within 1-2 years.
• If the time is too old, make a comprehensive judgment according to the query time sensitivity and the update frequency of the results. Generally, whether the time is too old is divided based on the results of more than 5 years.
• Time content and presentation are inconsistent.
• Results are not up-to-date.
Scoring process
• Carry out query and result sensitivity measurement.
• First judge that the time is invalid, then judge that the information is invalid, the time is too old, the internal and external time is inconsistent, and the event is up to date; 1 point will be deducted for each item.
• The dead chain will be deducted separately.
• Simple novel query will not be evaluated temporarily, and download requirements will be evaluated normally.
be careful
• Time sensitive queries are judged according to their sensitivity, such as "the use of the coupon exceeds the use date", "the new question and answer computer troubleshooting - the answer is the method of the XP system" information is invalid.
• About video resource playing and downloading: there is time display and introduction, whether it can be played or not, score according to the time; If there is no time display, corresponding to the introduction, and the information cannot be played is invalid or of low quality, it will be calculated according to the latest timeliness, and no points will be deducted.
Three overall play
Before doing the universal timeliness project, we did the sudden timeliness, which went through four stages: rule optimization ->migration model ->abstract features ->model iteration. It is found that when we make a project, we determine the optimization scheme of the whole project, and then follow the underlying characteristics to the upper model. This scheme is steadily iterative. Although it is difficult to design and optimize the characteristics in the early stage, the effect in the later stage is improved significantly, and the speed is fast, the level is clear and organized, and the problem location and optimization are easy.
Because of the above experience summary, we think we should start from the optimization of basic features, and then carry out sorting and recall model optimization through tagging data.
IV. Optimization of basic features
Web page timeliness characteristics
Time extraction
Time extraction is the most basic feature of timeliness sorting. When extracting time, we should first define the page time. The page time is mainly divided into the following categories: page content time, page update time, page publishing time and page discovery time.
• Web content time: it is the time of the content described in the web page. For example, if a web page describes the end of World War II in 1945, the content time of this page is 1945. If the opening ceremony of the 2018 Beijing Olympic Games is introduced, it is August 8, 2018. The current web content time is the time that can best represent the content from all the time of the web content, It is implemented by a Ranking model based on annotated data.
• Web page publishing time: generally refers to the generation time of the web page, and generally refers to the creation time of the link of the web page. For some content pages, such as news, second-hand transactions, etc., the publishing time of the web page is generally specified on the web page.
• Web page update time: generally refers to the time when the main content of the Web page changed last. For general news and general article pages, they will not be updated after being generated, so the last update time is generally the time when the Web page was published.
• Web page discovery time: generally, it refers to the time when a web page link is found by a search engine crawler. Because the crawler Flowlink needs a certain time window and content islands, the web page discovery time may be seriously behind the web page publishing time.
• The time when the web page first enters the index: because the web page may not be captured and parsed in time after it is discovered by the crawler, the time when the web page first enters the index may seriously lag behind the time when the web page is discovered.
• Web time: It is the time that best represents the value of a web page. Generally, we choose from these five times. At present, we use a regular method to select time.
The main methods of rules are:
• For news and other content pages, if the publishing time of the page can be clearly extracted through the template and rules, the time will be selected as the page time.
• When there are serious inconsistencies in multiple times, the time with high confidence will be preferred, such as the time when the page was found or the time when the page first entered the index.
Time sensitivity (timeliness half-life)
The time sensitivity of a web page is a rate of web page information decay. We use an timeliness half-life to measure the rate of web page time decay, that is, the time required for web page information to decay to half of the current information from the current time.
As a clear time, it is difficult to label the half-life quantitatively. For simplicity, we discretize the half-life qualitatively and learn the half-life of web pages by labeling data.
GuideLine of dimension data:
1) Time sensitivity annotation
In the time sensitivity annotation, in order to enable the annotated students to know as much as possible about the timeliness of the content, we have not defined very clear detailed rules, but have some examples to supplement. The core is still to let the annotated students feel the time sensitivity of the content, to be able to think effectively, rather than to approach the learning of the annotation rules, so as to obtain more high-quality data.
Although the rough GuideLine annotation has improved the data quality, there are also some problems:
• The training cost of the marked students is very high, and it takes a lot of time to train the marked members. At the same time, the case explanation is carried out, which successively lasted about one month.
• The fitting rate of the marks is low. At the beginning of the project, the fitting rate of five people's cross gears (that is, more than three of the five people mark adjacent gears) is less than 50%. Even at the final stage of the project, the fitting rate of five people's single gear (that is, more than three of the five people mark the same gear) is also about 60%, and the cross gear fitting rate fluctuates between 70% and 80%.
2) Time sensitive model
At present, our model uses Pairwise and PointWise.
• Continuous values output by Pairwise model have higher resolution and are more suitable for the basic features of upper ranking.
• PointWise model mainly outputs 0, 1, 2 and above, which are mainly used for index selection and pseudo feedback mark features of upper level sorting. The time sensitivity of Query is inversely deduced by counting the distribution of time sensitive pages of recall results, which will be described in detail later.
Page information failure
Although the time sensitivity of a page is defined, the natural value of some time sensitive pages will become lower after a period of time. This kind of page definition information failure is difficult, but for some pages with clear time boundaries, information failure can be clearly defined.
For example, some information publishing pages, such as second-hand transactions, organization activities, and real estate information, have clear time boundaries. When transactions occur, goods are taken off the shelf, or the activity time expires, they can be clearly defined. We call this information information invalidation page. This page can be considered to have a timeliness value of 0, which requires some severe case suppression. This will also be introduced in the subsequent sorting module.
For the identification of this kind of page, it is currently through some rules to identify pages of different sites and types.
Demand timeliness characteristics
Time sensitivity of requirements
Time sensitivity of Query is the same concept as that of Web pages.
Time sensitivity of Query: the time sensitivity of the pages required by Query. You can reverse the time sensitivity of the pages in the recall results.
The time sensitivity of Query is related to the recall of timeliness results and rough sorting of timeliness results. Therefore, online retrieval cannot be performed through the analysis of recall results. You need to obtain the time sensitivity of Query directly through the analysis on the query side.
The query time sensitivity model has mainly experienced three versions of iteration. Here is a brief introduction:
1) First edition: ABCNN model based on the Attention mechanism of time sensitive words
Some time sensitive Patten do Attention to determine whether Query can be matched with some time sensitive words. If the matching between Query and these time sensitive words is reasonable and common in the search corpus, then the probability of this query being a time sensitive query will be higher. These commonly used matching words are mainly: latest, recent, this year, year, today, etc.
2) Second edition: Distillation technology based on pseudo feedback
As mentioned above, we already have a model of time sensitivity on the web page, because the web page itself has a large amount of information and many structural features, it is easier to do accurately. When we have a more accurate model of web page time sensitivity, we can generate a large number of pseudo labeled samples through the distribution analysis of recall results. These pseudo labeled samples can be used to train a CNN model with a large sample, and the effect is obviously improved compared with the first version.
3) Third edition: Iterative technology of sample annotation based on active learning
A TriggerModel is required when sorting the upper level of timeliness. TriggerModel is used to determine whether a query needs to be adjusted for timeliness and the strength of timeliness adjustment.
This TriggerModel is a model based on manually annotated data. This model uses more features, including the time sensitive words of related searches (because users sometimes modify Query and add time limit words to Query to filter results), the distribution of time sensitivity of web pages, the distribution of time sensitivity of Query, the user's click behavior, and other features. At the same time, it uses ActiveLearing to label critical samples, Increase the resolution of the model.
When we get a TriggerModel of Query with high resolution and accuracy, we can use this model to generate a large number of high-resolution samples. At the same time, we can combine the powerful Bert language model to train a better time sensitive model.
At the same time, because TriggerModel uses the time sensitivity characteristics of the second version of Query, TriggerModel can be retrained to improve the effect when the effect of query sensitivity is improved. At the same time, the new TriggerModel can guide the training of the time sensitivity model of Query, so that iterative training can be improved at the same time.
Timeliness demand intensity
Timeliness demand intensity is juxtaposed with time sensitivity, which is mainly used to judge whether users display the time dimension demand for the results (such as the latest, 2020).
This model is relatively simple. In the early days, it was a rule-based model to identify whether Query had a significant timeliness Pattern. Later, we also generate pseudo label samples for model training through recall results and user behaviors (such as the user's explicit query modified secondary query. When the user searches for "Hangzhou traffic restriction rules", if the results are bad, the user will modify the query to "2020 Hangzhou traffic restriction rules", "the latest Hangzhou traffic restriction rules", etc.). The method is similar to the second version of the time sensitivity model.
V. Data annotation
At present, the core of the upper ranking model of Shenma search is the LTR model based on labeled samples, so a more reasonable scheme for timeliness optimization is to retrain the LTR model by labeling samples.
To train the LTR model, it is necessary to mark the learning objectives of timeliness. There are two main stages in the iterative process. The first stage is to try to integrate the timeliness objectives into AC's five level marking (Perfect, Excellent, Good, Fair, Bad). Later, due to the difficulty of marking, a two-stage independent marking method is adopted.
Timeliness satisfaction is integrated into AC marking
At present, the AC tagging of Shenma Search is divided into five levels (Perfect, Good, Excellent, Fair, Bad=4,3,2,1,0). In order to add the timeliness target to the AC, we have added three levels, namely 2.5, 1.5 and 0.5.
The specific scoring principles are:
• For results with poor timeliness, if satisfaction is affected, the first or second gear will be directly reduced.
• If the timeliness is not particularly ideal, but does not affect the satisfaction, reduce the level by 0.5, and increase the level by 0.5 for the results with excellent timeliness.
Since the original level 5 has been upgraded to level 8, and the tagging of AC is a 7-person fitting, the tagging difficulty has greatly increased. At the same time, the tagging criteria and tagging personnel of Shenma Search AC have been stable for a long time, and the tagging personnel have formed a certain task perception. Let the annotation personnel re learn the new annotation, resulting in a serious decrease in the fitting rate of the annotation personnel, which is less than 60%. After many trainings, there is still no significant improvement. Therefore, we later abandoned the integration of timely annotation into the annotation system of Shenma Search AC, and started a new independent annotation principle.
Satisfaction with individual timeliness
The marking principle of independent timeliness is to mark the samples that have been marked with AC by Shenma Search for the second auxiliary Label. From the marked AC samples of Shenma Search, select a time sensitive query, and mark the timeliness satisfaction of the Q-U results of non-zero files of the query.
GuideLine marked with timeliness satisfaction: Each query will correspond to multiple urls. Our evaluators need to understand the meaning of query - judge whether the page meets the user's needs - and judge the satisfaction of page timeliness.
• Understand the meaning of query and infer the user's needs.
• From the user's needs, judge the extent to which the timeliness of the results meet the user's needs.
• Give a reasonable score according to the criteria mentioned later.
The grading standard 2/1/0 is judged when the results are relevant.
As long as there are time attribute pages, they are scored with 2, 1, and 0. The difference from sensitivity is that the changed pages are not discarded (judged according to the timeliness of the main content).
• 2 - The timeliness of page results is very good, which is the latest/high value results.
• 1 - The timeliness of page results is generally satisfied, not the latest/valuable results, but it has certain reference value.
• 0 -- The timeliness of page results is poor, which is very old, or has no reference value.
Unrelated - The page content is completely irrelevant to query.
• Dead link/spam -- page cheating/content invalidation/blank page. Low quality.
• Timeless demand - query is obviously timeless demand (for example, the full text of the Analects of Confucius).
• Impossible to judge - page content does not contain timeliness factors and cannot be scored according to timeliness (e.g. encyclopedia, long video page, website homepage).
The standard of timeliness satisfaction is the same as that of time sensitivity. We have not set a particularly detailed GuideLine. The core is to let the marked students think mainly and let the marked students perceive the impact of timeliness loss on user satisfaction. The fitting rate of the early annotation is also low, less than 60%. After a long time of training and case explanation, the final fitting accuracy is about 75%~85%.
Six order model
The model of timeliness sorting is mainly divided into four layers.
Timeliness rough sorting
For time sensitive queries, recall some results that are relatively new in time as much as possible at the index recall level. The timeliness rough sorting project was carried out earlier, and the data was not marked at that time. The main way was to use the feature enhancement method to improve the probability of new results ranking.
Shenma search sorting model adds timeliness characteristics
In some AC standards, timeliness is actually taken into account.
• The first category: for example, some news, because many news events, although the characters and locations have not changed, have changed the core things, which will affect the basic satisfaction, which is reflected in the AC standard.
• The second type: the other type is information failure. Information failure is clearly defined in the AC standard and belongs to worthless content, which can directly affect satisfaction. Generally speaking, the probability and time sensitivity of information failure are in direct proportion to the time from the web page to the present. A certain amount of information failure targets can learn some timeliness targets.
Other types: There are many other types, such as "the latest military parade", "51 holiday arrangement", etc.
Because the tagging results of time sensitive queries are related to the tagging time, we must record the tagging time of AC samples in order to accurately calculate the time characteristics. At the same time, we must restore the sample time to the tagging time when Dump features. For this reason, we have changed the process of the Shenma search feature Dump, adding the time restore function to ensure the accuracy of the timeliness feature.
Time Effective Independent Double Label Sorting Model
Because there are two labels for timely label data, we must develop an independent multi label sorting framework. For this reason, we have made some algorithm changes and upgraded the original LightGBM tool to support multi label training.
The main idea is that LambdaMart considers two Label when calculating PairwiseLoss of exchange Doc:
• Method of the first version: When the first label is the same, add the role of the auxiliary label and calculate the loss of the auxiliary label. Later, it was found that this method has some problems in the application. In this case, the auxiliary label can only work on the samples with the same label, and the samples with different labels cannot produce loss.
• Method of the second version: In order to make up for the shortcomings of the first method, we use the method of Label amplification to scale the original Label, change the AC standard to 8, 6, 4, 2, 0, and then change the timeliness target to 3, 2, 1, 0, (1, 0, - 1, - 2). In this way, the timeliness label is added to the AC label, forming a new label target. At the same time, the 2 ^ Label of LightGBM's exchange loss is changed into a 2 * Label (the practice of Shenma Search is referenced here). This is mainly because after the label is enlarged, the 2 ^ Label will make the head loss especially large, resulting in inconsistency with the real online exchange loss.
• Algorithm of the third version: Later, when we observed the samples, we found that the effect of timeliness is actually related to the label of the sample itself. When the label of the sample itself is 1, users don't really care about timeliness. When the label itself is 3, timeliness plays a small role, and users don't care about it either. Timeliness mainly plays a role in marking the second level of the sample. We reduce the timeliness of the third level and the first level and increase the timeliness of the second level to improve the differentiation of the timeliness feature target.
Timeliness independent model dynamically corrects the model of Shenma search
The sorting score calculated by the separate timeliness model cannot directly perform timeliness sorting on the results, because time sensitivity and timeliness demand intensity should be taken into account, such as:
• When the time sensitivity is low, the effect of timeliness is weak.
• When the overall relevance level is not high, the core of sorting is relevance.
• The sample size of timeliness label is much smaller than the AC sample of Shenma search, and the learning ability is weaker than the AC model of Shenma search.
By considering these three aspects, we designed a set of dynamic smoothing methods to smooth the scores of the timeliness model into the ranking scores of Shenma search:
RankScore = RankScoreAC*Lamda + RackScoreTimeliness*(1-Lambda)
The core is lambda calculation. We have explored and attempted three dimensions of lambda calculation:
Early first version: TriggerModel combines manual rules. TriggerModel will calculate the characteristics of time sensitivity (TriggerModel is briefly introduced in the Query signal above), and then feed back to the time sensitivity grading signal according to TriggerModel, and then manually specify the lambda value.
The second version in the middle term: based on the first method, the relationship between the TriggerModel threshold and the Lamba weight is smoothed, and a simple method of harmonic averaging is designed to correlate the Trigger prediction value with the Lamba value, making the adjusted dimension more smooth.
The version being tried: This is the multi-objective fusion algorithm being tried in the current sorting. Through pure Pair annotation samples, multiple multi-objective models (each multi-objective model has learned the target of AC and a separate sub dimension) are fused, and the weights of different multi-objective models are learned through some global statistical features.
Exploration of Universal Timeliness Sequencing Based on IRGAN
Due to the high cost of timeliness annotation samples, there were some companies in the industry that used IRGAN to iterate the model. At the same time, the same group of teams with sudden timeliness obtained benefits under the scenario of sudden timeliness through IRGAN. We also hope to obtain the benefits of timeliness through IRGAN, which has carried out some exploration and attempts.
Red is the marked relevant documents, dark blue is the relevant documents not marked in the recalled documents, and light blue is the irrelevant documents not marked in the recalled documents. The training of G is to grade the unlabeled documents first, send the documents with higher scores to D for judgment, and D judges a pair to determine whether it is selected by G (considered false) or actually marked (considered true), and then returns its score to G for correction, so that G can finally select the documents that are close to the real marked samples.
The current network structure and pre training process of G and D are the same, and they are almost identical models before confrontation (except for different random initialization values before pre training). However, during the training against G, D thinks that the marked doc is always better than the Doc selected by G, even though G has selected the actually better document (G's pre training ability is the same as D's ability, which may have been very good at the beginning), then G will learn worse and worse. Therefore, we can consider the different network structure of D and G sampling. All D needs to do is to judge the positive and reverse order of the pair, which can be simple. G needs to confuse D, and more complex networks can be used.
VII Recall
The recall system of timeliness sort is mainly processed from the query side of general recall, time limited query and independent timeliness index recall.
Query side processing
We found that under the timeliness query, users often search queries like this: this year, March, the latest, and the latest. The weight of such terms in the original Shenma search recall system is generally large. In many literal recalls, only term matching is considered, while the term and web page time are ignored. Because the time limit is not taken into account, some term matching results are often recalled, but the time is very old. In order to deal with this case, we have separately processed the query analysis module of Shenma Search, and added the functions of TimelinessTermWeightReadjust and TimelinessQueryRewrite, mainly to optimize the recall link from the perspective of TermWeight and Query rewriting.
TimelinessTermWeightReadjust
At present, the inverted index, the core engine of Shenma Search, will specify the AND word after query analysis, which is the hit word that must be included when merging the inverted index zippers. In the context of timeliness, we don't really hope that this year, recently, the latest and other words will hit the target. Because this time, the life of the word is relative time, which may be the latest result relative to the time when the web page is published. However, as time goes on, if the content of the web page remains unchanged, the value of this information will be greatly reduced.
The core feature of the AND logic of index query is TermWeight. As long as the weighting of Term decreases, the word will be most likely removed by Rank and will not participate in the merging of zippers. For this reason, we have excavated a number of time bound qualifiers, and lowered the weight of these words in the time bound scenario to improve the recall effect.
TimelinessQueryRewrite
When users type this year or March, what they actually mean is 2020 or March 2020. Through query rewriting, we add an independent query that specifies the absolute time, and forced matching on this time constraint guarantees that the recalled results satisfy the time dimension.
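A simple way to picture the rewrite is a function that turns relative time words into absolute ones given the current date. The sketch below is only illustrative: the function name, the English patterns, and the rule of appending the year to a bare month name are assumptions, and a production rewriter would need real query understanding (for example, "May" is ambiguous).

```python
import datetime
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
# A month name that is not already followed by a four-digit year.
BARE_MONTH = re.compile(r"\b(" + "|".join(MONTHS) + r")\b(?!\s+\d{4})")
THIS_YEAR = re.compile(r"\bthis year\b", flags=re.IGNORECASE)


def rewrite_relative_time(query, today):
    """Return the query with relative time words replaced by absolute time."""
    rewritten = THIS_YEAR.sub(str(today.year), query)
    rewritten = BARE_MONTH.sub(lambda m: f"{m.group(1)} {today.year}", rewritten)
    return rewritten


today = datetime.date(2020, 5, 1)
print(rewrite_relative_time("this year college entrance exam", today))
print(rewrite_relative_time("March box office", today))
```

The rewritten string would be issued as an additional query next to the original one, so results that carry the absolute time are recalled even when the relative word does not match.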
Time limited query
A time limited query is easy to understand: it restricts the recall results by the half-life that corresponds to the query's time sensitivity. The time sensitivity feature described earlier for queries is mainly used at this stage.
When we find that a query is time sensitive, we issue a separate query that restricts the time of the recalled results through the Filter syntax. The time used here is the web page time mentioned above.
If the time sensitivity is 3, that is, the half-life is 1 week, we use the Filter syntax of the query engine to specify that only results from the last week are recalled. Queries at other sensitivity levels are likewise issued with their corresponding half-life, so that the recalled results are timely enough.
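The mapping from sensitivity to a time filter can be sketched as below. Only the pair "sensitivity 3, half-life 1 week" comes from the text; the other table entries, the `page_time` field name, and the dictionary form of the filter are placeholders for illustration, since the actual Filter syntax of the engine is not shown here.

```python
import datetime

# Only level 3 -> 1 week is stated in the article; the other entries are hypothetical.
HALF_LIFE_BY_SENSITIVITY = {
    3: datetime.timedelta(weeks=1),
    2: datetime.timedelta(days=30),   # hypothetical
    1: datetime.timedelta(days=365),  # hypothetical
}


def build_time_filter(sensitivity, now):
    """Return a filter on web page time, or None if the query is not time sensitive."""
    half_life = HALF_LIFE_BY_SENSITIVITY.get(sensitivity)
    if half_life is None:
        return None
    return {"field": "page_time", "gte": now - half_life}


def apply_filter(docs, time_filter):
    """Keep only documents whose page time is inside the filter window."""
    if time_filter is None:
        return list(docs)
    return [d for d in docs if d[time_filter["field"]] >= time_filter["gte"]]


now = datetime.datetime(2020, 3, 15)
docs = [
    {"url": "a", "page_time": datetime.datetime(2020, 3, 10)},
    {"url": "b", "page_time": datetime.datetime(2020, 2, 1)},
]
recent = apply_filter(docs, build_time_filter(3, now))
print([d["url"] for d in recent])
```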
Timeliness index recall
The timeliness index recall mainly addresses pain points in some business logic. To balance performance and effect, we put sufficiently new content into a separate index, and at query time we query this timeliness index separately to increase recall.
The time limited query and TimelinessQueryRewrite described above each issue a separate query. If these queries go to the general index, then under the current triggering criteria for universal timeliness queries, the query volume of the index would increase by 50%. This is a huge performance cost for the index, and the improvement is not necessarily as large.
With an independent index, the selection of timeliness data and the logic by which it takes effect can be more flexible, free of the various restrictions of the Shenma Search index.
In news and strong timeliness scenarios, data must be collected at the level of days, hours, or even minutes, which cannot be achieved in the Shenma Search scenario. A separate timeliness business index is required to carry this.
VIII. Collection
The timeliness collection system is in fact the most basic and essential foundation of ranking. If a link has not been collected, even the best ranking algorithm is of no use. The current search collection system is divided into the directed inclusion system for sudden and strong timeliness and the general hierarchical collection system of Shenma Search.
Directed inclusion system
News scene collection based on seed page
Strongly time-sensitive content, especially in the news scene, is usually discovered through Flowlink from news seed list pages. Simple examples of seed pages are the Sina home page, the Sina NBA home page, the Sina Finance home page, the Zhihu hot topic page, and the Weibo hot topic page.
We regularly check whether new links have appeared on the seed pages in order to find new content. Generally, Flowlink goes only one layer deep from a seed page. This involves a lot of work, including the discovery and annotation of seed pages, the scheduling algorithm for fetching seed pages, and a mechanism for regularly eliminating seed pages.
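The "check the seed page for new links" step can be illustrated with a small sketch. It assumes the seed page HTML has already been fetched, keeps the set of known links in memory, and uses example URLs; scheduling, annotation, and elimination of seed pages are out of scope here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def find_new_links(seed_url, html, seen):
    """Return links on the seed page that were not seen before, and record them."""
    parser = LinkExtractor(seed_url)
    parser.feed(html)
    new_links = []
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            new_links.append(link)
    return new_links


seen_links = set()
seed = "https://news.example.com/"  # placeholder seed page
first_fetch = '<a href="/a.html">A</a><a href="/b.html">B</a>'
second_fetch = '<a href="/c.html">C</a><a href="/a.html">A</a>'
print(find_new_links(seed, first_fetch, seen_links))
print(find_new_links(seed, second_fetch, seen_links))
```

Only the links one layer below the seed page are extracted, and the second fetch reports just the link that was not there before.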
Directed inclusion based on timeliness requirements
Both today and as a future trend, the Internet remains closed: the data inside individual websites and apps is hard to reach from the open Internet. At the same time, with the advent of the We Media era, everyone is a potential seed page. The original seed scheduling algorithm cannot schedule such a huge volume of content, and even if it could, timeliness and return would be hard to guarantee.
In this case, collection is generally demand oriented. In short, each website and app has its own search interface. We construct query requests from queries with timeliness requirements and use them to capture data from these websites and apps.
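As a rough sketch of how such query requests might be constructed, the example below expands a list of timeliness queries into one search URL per site. The site names, endpoints, and parameter names are hypothetical placeholders, not real interfaces of any website or app.

```python
from urllib.parse import urlencode

# Hypothetical site search endpoints: (base URL, name of the query parameter).
SITE_SEARCH_ENDPOINTS = {
    "site_a": ("https://site-a.example.com/search", "q"),
    "site_b": ("https://site-b.example.com/api/search", "keyword"),
}


def build_collection_requests(timeliness_queries):
    """Return (site, url) pairs, one search request per query and site."""
    jobs = []
    for query in timeliness_queries:
        for site, (endpoint, param) in SITE_SEARCH_ENDPOINTS.items():
            jobs.append((site, f"{endpoint}?{urlencode({param: query})}"))
    return jobs


for site, url in build_collection_requests(["nba finals"]):
    print(site, url)
```

The results returned by each search interface would then be fed into the collection pipeline like any other discovered link.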