How a knowledge graph built for e-commerce responds to user needs
1. Background
Since its launch in June 2017, the e-commerce cognitive map has moved from practice to systematization through continuous exploration and has gradually formed a relatively complete cognitive system for e-commerce data.
As the group keeps expanding its business boundaries, the demand for data interconnection is growing stronger, because interconnected data is the basis for cross-domain search and discovery, shopping guidance, and interaction, and it is also a basic condition for users to truly "go shopping". Before addressing that demand, we need to analyze the current problems.
1.1 Problems
Data application scenarios have become more complex. We no longer face only traditional e-commerce, but also new retail, multilingual, and mixed online and offline shopping scenarios, and the data involved often goes beyond the text we handled before. This data tends to have the following characteristics:
Unstructured: A large amount of Internet data is scattered across various sources and is mostly expressed as unstructured text. The current category system reflects long-term work from the perspective of commodity management, but it still covers only the tip of the iceberg of this data, which is far from enough to understand real user needs.
Full of noise: Unlike traditional text analysis, most of the group's current data consists of queries, titles, comments, strategy guides, and so on. Because of user habits and business demands, this data has a grammatical structure very different from ordinary text. Profit-driven behavior also introduces a lot of noise and dirty data, which makes it much harder to discover user needs and structure them.
Multi-modal, multi-source: As the group's business expands, search and recommendation now cover not only the text information in products but also a large number of videos and pictures used as content. How to integrate data from various sources and how to associate multi-modal data are both difficult points in data construction.
Data is scattered and cannot be interconnected: In the current commodity system, rapid business development means each department often maintains its own CPV system, which later becomes a critical part of commodity management and search. However, the industry attributes of the application scenarios differ. For example, in Xianyu "bag accessories" is a category that needs to be subdivided because the business scenario is high-frequency, whereas in the Taobao department, where such transaction searches are low-frequency, "shoes, bags, accessories" is only a small category under second-hand and idle goods. As a result, each department laboriously maintains query and search on its own CPV system, and each time has to rebuild its category system, re-support storage and query, re-associate products, redo category prediction, and so on. Building a relatively general, application-oriented concept system that can provide query services according to business needs is therefore urgent.
Lack of in-depth cognition of data: In-depth cognition of data means recognizing the relationships between user needs, not just recognizing products. For example, when a user clicks on many barbecue seasonings and tools, the system should recognize that the user intends to hold an outdoor barbecue. This capability is currently lacking across the group.
1.2 Requirements Analysis
From the background above, we can see that building a globally unified knowledge representation and query framework requires the following key tasks.
Data structuring in complex scenarios: The first step is data cleaning, which removes dirty data through frequency filtering, rules, and statistical analysis. We then extract high-availability data through phrase mining, information extraction, and similar methods, and use it for data structuring and hierarchical division.
Unified representation framework for distributed data: To manage distributed data, we first need to define a global schema representation and storage method. Based on that schema, we then integrate conceptual data, mine and discover attributes, and associate data, using representation learning methods to accomplish this.
In-depth data cognition: In-depth cognition has two aspects. One is cognition of the data itself, and the other is cognition of the associations between data. From user behavior and the information in the product itself, we can recognize the user's intention in purchasing the product. From the input and summarization of external data, we obtain associations between common sense and user needs that lie outside the commodity system.
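The cleaning step in the first task can be illustrated with a minimal sketch. It assumes that "rules" means stripping promotional phrases with a regular expression and that "frequency filtering" means discarding normalized queries seen fewer than a threshold number of times. The pattern, the threshold `MIN_FREQ`, and the sample queries are all hypothetical and are not taken from the production system.

```python
import re
from collections import Counter

# Hypothetical threshold and noise rule; the article does not specify them.
MIN_FREQ = 2
NOISE_PATTERN = re.compile(r"(free shipping|hot sale|\d{2,}% off)", re.IGNORECASE)


def normalize(text: str) -> str:
    """Lowercase, strip rule-matched promotional noise, collapse whitespace."""
    text = NOISE_PATTERN.sub(" ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def clean_queries(queries: list[str], min_freq: int = MIN_FREQ) -> dict[str, int]:
    """Return normalized queries that occur at least min_freq times."""
    counts = Counter(q for q in map(normalize, queries) if q)
    return {q: n for q, n in counts.items() if n >= min_freq}


raw = [
    "Outdoor BBQ grill",
    "outdoor bbq grill free shipping",
    "outdoor  bbq grill HOT SALE",
    "asdfgh",
    "windproof scarf",
    "Windproof Scarf 50% off",
]
cleaned = clean_queries(raw)
print(cleaned)
```

Rare strings such as "asdfgh" fall below the threshold and are dropped, while variants that differ only in promotional wording collapse into one candidate phrase for later mining.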
1.3 E-commerce cognitive map
To solve the above problems, we proposed the E-commerce ConceptNet. Its goal is to establish a knowledge system for the e-commerce field and, through an in-depth understanding of user needs, to link people, goods, and marketplace in e-commerce scenarios, empowering business parties and industries.
1.3.1 Module division
Overall, the cognitive map is divided into four major tasks. Different types of concepts (user, scene, virtual category, and item) are built into a heterogeneous graph, which realizes the user-scene-commodity association:
User map construction: In addition to general user portrait information (age, gender, purchasing power), the user map also contains crowd data such as "elderly" and "children", as well as users' category and attribute preference data.
1.3.2 Scene graph construction
A scene can be seen as the conceptualization of a user need. The main work of the scene graph is to identify user needs from existing queries and titles, generalize them into a general scene (scene concept), and establish concepts such as "outdoor barbecue" and "vacation wear". By continuously refining scene requirements, we abstract concepts that represent a class of user needs spanning multiple categories into shopping scenes (sc).
Mining concepts is equivalent to obtaining the nodes of the graph. On top of concept mining, we establish relationships between concepts and categories, between categories and categories, and between concepts and concepts, which is equivalent to creating directed edges on the graph, and we calculate the strength of each edge. The specific process is as follows:
So far, we have produced more than 100,000 concepts and about ten times as many category associations.
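As a rough illustration of directed, weighted concept-to-category edges, the sketch below assumes that edge strength is the conditional frequency of a category given a concept, estimated from co-occurrence observations. The article does not state how strength is actually computed, so this formula, the `min_strength` cutoff, and the sample observations are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def build_edges(observations, min_strength=0.2):
    """Build directed concept -> category edges.

    observations: iterable of (concept, category) pairs, e.g. derived from
    clicks that follow a scene-like query (a hypothetical data source).
    Strength is count(concept, category) / count(concept).
    """
    pair_counts = Counter(observations)
    concept_totals = Counter()
    for (concept, _category), n in pair_counts.items():
        concept_totals[concept] += n

    edges = defaultdict(dict)
    for (concept, category), n in pair_counts.items():
        strength = n / concept_totals[concept]
        if strength >= min_strength:  # drop weak, likely noisy edges
            edges[concept][category] = round(strength, 3)
    return dict(edges)


observations = (
    [("outdoor barbecue", "grills")] * 5
    + [("outdoor barbecue", "seasonings")] * 3
    + [("outdoor barbecue", "charcoal")]
    + [("outdoor barbecue", "phone cases")]
)
edges = build_edges(observations)
print(edges)
```

Normalizing per concept keeps strengths comparable between frequent and rare concepts, and the cutoff removes incidental co-occurrences such as "phone cases".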
1.3.3 Category Refinement
Category refinement is needed because the current category system is sometimes too coarse and sometimes too fine. Its construction covers two levels:
Category aggregation: For example, "dress" is a single category at the cognitive level, but because different industries are managed separately, it appears under different categories such as "women's clothing", "men's clothing", and "children's clothing". It therefore sits under more than one first-level category, and a common-sense system is needed to maintain the cognition of what a "dress" really is.
Category splitting: Splitting is needed when the existing category system is not fine enough to aggregate one class of user needs. For example, for the scene "Tibet tourism" we need finer detail under the category "Scarves", so a virtual category called "windproof scarf" is required. This process also includes entity/concept extraction and relation classification. Currently, we mainly establish relationships between categories and categories.
So far, we have accumulated more than 689,000 pairs that integrate the CPV category tree, category-to-category associations, and external web data.
1.3.4 Commodity map construction
Phrase mining: On the commodity map side, the main need is better recognition of product attributes. A complete CPV system presupposes phrase recognition, so we have established a closed-loop CPV mining process under a bootstrap framework. The goal is to accumulate CPV data effectively over the long term and to expand the cognition of queries and products (this is also one of the data sources for product marking).
for example:
So far, we have completed the review of the PV top-70 categories and added more than 120,000 CPV pairs, and the proportion of queries whose terms can all be recognized has increased from 30% to 60%. Because mining currently uses medium-grained word segmentation, our earlier analysis indicates that 70% is already the limit; mining coverage will continue to expand once the phrase mining process is added. The data is already used for category prediction and as basic data for intelligent interaction, and it is produced daily.
Commodity marking: Commodity marking is the key technology for associating knowledge with products. The data generated by the three parts above is ultimately connected to items through marking. Once product marking is complete, we can close the entire semantic cognition loop from query to commodity.
We expect to deliver the first version of product marking by the end of March.
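The coverage figure quoted for phrase mining (the share of queries whose terms are all recognized) can be made concrete with a small sketch. It assumes queries have already been segmented into terms upstream and that "recognized" means the term appears in the mined vocabulary. The vocabulary and queries below are hypothetical.

```python
def recognition_rate(segmented_queries, vocabulary):
    """Fraction of queries in which every term is in the vocabulary."""
    if not segmented_queries:
        return 0.0
    hits = sum(
        1
        for terms in segmented_queries
        if terms and all(term in vocabulary for term in terms)
    )
    return hits / len(segmented_queries)


# Hypothetical mined vocabulary and pre-segmented queries.
vocabulary = {"red", "dress", "cotton", "scarf"}
queries = [
    ["red", "dress"],
    ["windproof", "scarf"],
    ["cotton", "scarf"],
    ["dress"],
    ["linen", "shirt"],
]

before = recognition_rate(queries, vocabulary)
after = recognition_rate(queries, vocabulary | {"windproof"})
print(before, after)
```

Each newly mined term can only raise this rate, which is why a closed-loop mining process steadily improves query coverage until the granularity of segmentation becomes the limiting factor.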
2. Knowledge system
During knowledge construction, we gradually discovered the need for a globally unified schema representation system. We therefore studied how WordNet and ConceptNet were built and gradually formed our own concept representation system, which is the core of the existing cognitive map (E-commerce ConceptNet). Its goal is to understand user needs in the e-commerce field at the semantic level, conceptualize them (conceptualization), and map them to a semantic ontology (ontology). Relationships between ontologies are gradually formalized through relationships at the lexical level, the hierarchy between concepts is represented by the hierarchy between ontologies, and entity categories and relationships are abstracted through the relationships between concepts.
From the data perspective, to describe an entity we first define it as an instance of a class (instance-of-class), and the class can usually be represented by a concept. Each concept has its own attributes (property), and the attribute set of a class of concepts can be called the schema of that concept. Concepts that share a type of schema are generally grouped into domains (domain), and each domain has its own semantic ontology (ontology). Through the hierarchy of the ontology (such as "Britain" is-part-of "UK"), we can formalize the hierarchy and representation of concepts. In this way, from fine to coarse, we define a set of representation methods for the e-commerce concept system, and by continuously refining ontologies, concepts, and the relationships between them, we associate users, products, and even external entities.
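One possible way to picture the concept, schema, domain, and ontology layers described above is the following sketch. The class names, fields, and the sample property are hypothetical. Only the "Britain" is-part-of "UK" relation comes from the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyNode:
    """A node in a domain ontology, linked upward by is-part-of."""
    name: str
    part_of: Optional["OntologyNode"] = None

    def ancestors(self) -> list[str]:
        """Names of all enclosing nodes, nearest first."""
        chain, node = [], self.part_of
        while node is not None:
            chain.append(node.name)
            node = node.part_of
        return chain


@dataclass
class Concept:
    """A concept with attributes; its attribute names form its schema."""
    name: str
    domain: str
    ontology: OntologyNode
    properties: dict[str, str] = field(default_factory=dict)

    def schema(self) -> set[str]:
        return set(self.properties)


uk = OntologyNode("UK")
britain = OntologyNode("Britain", part_of=uk)

# The "kind" property is an illustrative assumption.
concept = Concept(
    name="Britain",
    domain="location",
    ontology=britain,
    properties={"kind": "region"},
)
print(concept.schema(), concept.ontology.ancestors())
```

Keeping the hierarchy in the ontology rather than in each concept means two concepts in the same domain can be compared by walking their ontology nodes upward.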
3. Technical framework
3.1 Platform Module
Overall, a data service platform supports the graph engine, and knowledge is produced and consumed through the Qianmo data management platform and the Turing business integration platform.
3.2 Module Details
Qianmo: data labeling and display
As the base platform of the e-commerce knowledge graph, Qianmo currently integrates all knowledge labeling and review processes and provides data query and visualization. Later, algorithmic concept mining services and product marking services will also be provided through Qianmo.
Through continuous trial and error in data review, we have established a relatively complete process from initial review to final review. See the Qianmo review tool for details.
Visualization: In addition to the review platform, Qianmo provides a more concrete form of data visualization, with good interaction that makes knowledge easier to query (see Qianmo visualization).
3.3 Turing: Selection and Delivery for All Businesses
Since most of our knowledge is currently delivered in the form of cards, Turing provides a complete set of business service tools for exposure through Cloud Theme:
Concept selection:
Users can select their own themes for delivery to individual channels.
3.4 Graph Engine: Data Storage and Query
For storage, we use MySQL for flexible annotation, a graph database for full-scale queries, and ODPS for persistent data version management.
Before data is loaded into igraph and biggraph, it is split into a vertex table and an edge table for import, and online queries are performed through Gremlin.
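The split into a vertex table and an edge table can be sketched as follows. The row layouts, the relation names, and the integer id scheme are assumptions for illustration. The actual import format expected by igraph and biggraph is not described in this article.

```python
def split_graph(triples):
    """Split (source, relation, target, weight) records into two tables.

    Returns (vertex_rows, edge_rows), where
      vertex_rows: [(vertex_id, name), ...]
      edge_rows:   [(source_id, target_id, relation, weight), ...]
    """
    vertex_ids = {}
    edge_rows = []
    for source, relation, target, weight in triples:
        for name in (source, target):
            # Assign the next integer id the first time a name is seen.
            vertex_ids.setdefault(name, len(vertex_ids))
        edge_rows.append((vertex_ids[source], vertex_ids[target], relation, weight))
    vertex_rows = [(vid, name) for name, vid in vertex_ids.items()]
    return vertex_rows, edge_rows


# Hypothetical knowledge records.
triples = [
    ("outdoor barbecue", "related_category", "grills", 0.5),
    ("outdoor barbecue", "related_category", "seasonings", 0.3),
    ("grills", "has_item", "item_1001", 1.0),
]
vertex_rows, edge_rows = split_graph(triples)
print(vertex_rows)
print(edge_rows)
```

Storing each vertex once and referring to it by id in the edge table avoids repeating names on every edge and lets the two tables be imported independently.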
On top of the graph database, we encapsulate a graph engine module that provides multi-channel, multi-hop recall of products for scenarios with different triggers. User, item_list, and query recall are currently available and have been used in Miaoxiaomi. During joint debugging with search and discovery, the query interface can be used for querying and testing.
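Multi-hop recall from a trigger node can be approximated in memory as shown below. This is only a sketch under stated assumptions: a path is scored by the product of its edge weights, the best path per node wins, and item nodes are identified by an `item:` prefix. None of these conventions comes from the graph engine itself, which runs its queries through Gremlin against the graph database.

```python
from collections import defaultdict


def multi_hop_recall(edges, trigger, max_hops=2, top_k=5):
    """Recall item nodes reachable from trigger within max_hops.

    edges: iterable of (source, target, weight).
    Returns up to top_k (item, score) pairs, best score first.
    """
    adjacency = defaultdict(list)
    for source, target, weight in edges:
        adjacency[source].append((target, weight))

    best = {}                  # best score found for each reached node
    frontier = {trigger: 1.0}  # nodes to expand in the current hop
    for _ in range(max_hops):
        next_frontier = {}
        for node, score in frontier.items():
            for neighbor, weight in adjacency.get(node, ()):
                candidate = score * weight
                if neighbor != trigger and candidate > best.get(neighbor, 0.0):
                    best[neighbor] = candidate
                    next_frontier[neighbor] = candidate
        frontier = next_frontier

    items = [(n, s) for n, s in best.items() if n.startswith("item:")]
    return sorted(items, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical user -> scene -> item edges.
edges = [
    ("user:u1", "scene:outdoor barbecue", 0.8),
    ("scene:outdoor barbecue", "item:grill", 0.5),
    ("scene:outdoor barbecue", "item:charcoal", 0.25),
    ("user:u1", "item:tongs", 0.3),
]
recalled = multi_hop_recall(edges, "user:u1")
print(recalled)
```

The same function serves user, item_list, or query triggers by changing the start node, and `max_hops` controls how far recall may wander from the trigger.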
3.5 Technology Deployment
Cloud Theme (cognitive map): Nearly 10,000 scenes have been launched in Cloud Theme in the form of knowledge cards. Compared with the products shown in the first-guess feed, clicks and divergence have improved greatly, and further data divergence is currently being explored.