In marketing campaigns that invite new users to register accounts or place orders, scalper accounts can distort campaign results and harm the interests of genuine new users. To solve this issue, you can use OneID to identify and locate scalper accounts based on the relationship data in graphs. This way, you can formalize the rules of marketing campaigns and close the related loopholes.
This topic helps you get started with Graph Compute and describes how to build a graph computing application that detects scalper accounts among millions of data records.
What is Graph Compute?
Graph Compute is a distributed graph query and computing solution that delivers high performance and stability. You can use Graph Compute to process trillions of data records. In addition, Graph Compute is equipped with intelligent O&M and offline systems that integrate data between data lakes and data warehouses, and it supports fast iteration and management of data in multiple versions. Graph Compute also provides graph technology services for global enterprises and developers based on the knowledge that Alibaba has accumulated in industries such as e-commerce, data security, and social networking.
Background
In the era of the digital economy, data has become a key driver of innovation and development. Open data promotes collaboration and innovation across industries and organizations, which gives rise to new industrial forms and business models. However, as the value of data grows, security issues such as data attacks, theft, abuse, and hijacking also emerge. These illegal activities have become industrialized and technologically sophisticated. Internet enterprises therefore have a strong need to identify and track users: they must determine the status of accounts, integrate businesses, and protect data security based on the relationships between accounts and devices. OneID is designed for exactly this kind of user identification.
What is OneID?
OneID is a system that helps you identify and track natural persons across devices and regions. Each person on the Internet is assigned a fixed virtual ID, similar to an ID card number in real life. The virtual ID is allocated by using specific algorithms.
OneID can identify the following types of IDs owned by a natural person:
Account type: business account, mobile number, and email
Device type: International Mobile Equipment Identity (IMEI), International Mobile Subscriber Identity (IMSI), and Identifier for Advertising (IDFA)
Cookie type: Acookie
OneID can connect seemingly unrelated information based on the relationships between entities. For example, starting from one confirmed scalper account, OneID can detect multiple suspected scalper accounts, because these accounts often belong to the same owner as the confirmed account. The suspected accounts may belong to an individual scalper or a scalper team, and you can query information about that individual or team by using OneID.
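To make this idea concrete, the following minimal Python sketch (illustrative only; the account and device identifiers are made up, and this is not the OneID implementation) groups entities that are connected through shared devices or cookies and assigns each group one virtual ID by using a union-find structure:

from collections import defaultdict

# Hypothetical account-device relationships; in OneID these would come
# from logon logs, registration records, and cookie data.
edges = [
    ("account:alice", "device:imei-001"),
    ("account:alice2", "device:imei-001"),   # shares a device with alice
    ("account:bob", "device:imei-002"),
    ("account:alice2", "cookie:acookie-9"),
]

# Union-find: entities connected through shared media end up with the
# same root, which serves as the shared "virtual ID" of the group.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in edges:
    union(a, b)

groups = defaultdict(list)
for node in parent:
    groups[find(node)].append(node)

for virtual_id, members in groups.items():
    print(virtual_id, "->", sorted(members))

Running this sketch places alice, alice2, their shared device, and the cookie in one group, and bob with his device in another.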
Why are graphs recommended for risk detection?
A graph can display relationships and actions in a visualized manner. Using graph computing technology, the industry has summarized common graph schemas and relationship patterns. You can build a graph schema based on this knowledge and use the schema to prevent fraud. As the confrontation between fraudsters and risk control systems escalates, the data and behavioral characteristics generated by large-scale, industrialized fraud come ever closer to those of normal users. Traditional machine learning-based risk control relies on thresholds: if an unsafe behavior of an account exceeds a threshold, the account is identified as a scalper account. However, thresholds are easy to circumvent, so the traditional mechanism provides poor protection against fraud and cannot meet risk control needs.
In contrast, a risk control system based on graph computing is hard to circumvent and provides better protection. For example, because registering accounts and changing devices are costly, scalper accounts often interact frequently with each other to save costs. A graph-based system can easily detect such interactions and close the corresponding loopholes. In addition, a traditional risk control system focuses on outliers in a feature space and ignores the relational data of real life, where entities interact richly. A graph schema captures this relational information and provides additional signal for fraud prevention. In summary, we recommend that you use graphs to detect fraud.
Scenario
Business background
A technology enterprise sells digitalized AI products and solutions and issues coupons for promotion. The enterprise needs OneID to detect scalper accounts and prevent unnecessary losses.
Analysis of abnormal user behavior shows that scalper teams exploit device resources to the greatest extent possible to maximize their gains, and that some users register multiple accounts to abuse the promotion. To prevent these issues, the enterprise needs to examine users from multiple perspectives.
The following business logic is summarized based on the business characteristics:
1. The devices related to an account are first queried based on the account, and then more related accounts are queried based on those devices. This link involves two hops.
2. After the related accounts are obtained, weight rules need to be set for the devices and accounts to determine the suspected scalpers. The following figure shows the procedure.
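In addition to the figure, the following minimal Python sketch (with hypothetical data and weights, not the production rules) illustrates this two-hop expansion and weighting:

from collections import defaultdict

# Hypothetical relationship data: account -> devices used.
account_devices = {
    "acct_1": ["ip_a", "phone_x"],
    "acct_2": ["ip_a"],
    "acct_3": ["phone_x", "ip_b"],
}
# Hypothetical weight rules: rarer media carry more evidence.
device_weight = {"ip_a": 0.3, "ip_b": 0.3, "phone_x": 1.0}

# Reverse index: device -> accounts (the second hop).
device_accounts = defaultdict(set)
for acct, devices in account_devices.items():
    for dev in devices:
        device_accounts[dev].add(acct)

def suspect_scores(seed):
    """Two hops: seed account -> devices -> related accounts, scored by weight."""
    scores = defaultdict(float)
    for dev in account_devices.get(seed, []):
        for acct in device_accounts[dev]:
            if acct != seed:
                scores[acct] += device_weight.get(dev, 0.0)
    return dict(scores)

print(suspect_scores("acct_1"))  # {'acct_2': 0.3, 'acct_3': 1.0}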
Service analysis
In transaction fraud and spam registration scenarios, OneID can identify and locate scalper accounts based on graph data and help you determine rules for new user acquisition campaigns. This formalizes campaign rules and closes loopholes.
Graph Compute applies offline computing algorithms in stages: graph propagation first, then graph clustering, and finally graph representation. Based on these algorithms, Graph Compute can detect a wider range of deep risks. The graph propagation algorithm detects risky entities efficiently and with high accuracy. However, graph propagation is semi-supervised and can detect risky entities only based on known risky entities. The graph clustering algorithm is then used to detect more risky entities from a global perspective: not only graph schemas but also entity properties can be used for detection. Graph Compute can detect deeper risks based on violation and punishment information and the behavioral characteristics of accounts.
In the scalper detection scenario, you can use the first step in the following figure: the graph propagation algorithm. This algorithm detects risky vertices and assigns risk scores based on the known risky vertices.
Step 1: Sort log data
Collect the logs about user registration and logon and perform the following operations:
Data definition: confirms data sources and defines the data format and fields that are returned by each data source.
The following information is collected for Log Service:
User logon information
User registration information
Cookie information
Data normalization and mapping: defines the mapping relationships between data in the source and destination tables, and normalizes the data by using scripts (see the sketch after this step).
User information table:
Includes information such as registered accounts and registration time.
Device information table:
Includes information about devices, such as mobile numbers, IP addresses, and emails.
Incremental write triggered by an event: triggers an incremental write to the corresponding table by an event.
After a new user registers an account and logs on, the data is exported, and the user information table and device information table are updated.
Data persistence: introduces MaxCompute to store offline full data.
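The following minimal Python sketch illustrates the normalization, mapping, and event-triggered write steps; the raw field names and event format are hypothetical, not the actual Log Service schema:

from datetime import datetime, timezone

# A raw registration log record as it might arrive from Log Service
# (field names are hypothetical).
raw_event = {
    "event": "register",
    "uid": "889411487137524591",
    "phone": "13800000000",
    "ip": "192.168.1.1",
    "email": "user@example.com",
    "ts": 1700000000,
}

def to_user_row(event):
    """Map a raw event to a row of the user information table."""
    return {
        "user_id": int(event["uid"]),
        "register_time": datetime.fromtimestamp(event["ts"], tz=timezone.utc).isoformat(),
    }

def to_device_rows(event):
    """Map a raw event to rows of the device (medium) information table."""
    return [
        {"user_id": int(event["uid"]), "medium_type": mtype, "medium_value": event[key]}
        for key, mtype in (("phone", "phone"), ("ip", "ip"), ("email", "email"))
        if event.get(key)
    ]

if raw_event["event"] == "register":  # event-triggered incremental write
    print(to_user_row(raw_event))
    print(to_device_rows(raw_event))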
Step 2: Select an offline algorithm
Community search is a graph propagation method that uses the local information of known seed vertices in a network to detect the local community in which a given seed vertex is located. You can use a semi-supervised method to label unlabeled vertices based on existing labeled vertices and iteratively predict more unlabeled vertices.
In an anti-fraud scenario, only a small amount of labeled data is available, while a large amount of unlabeled data that contains many risky entities also requires processing. The key is to detect risky entities within this unlabeled data. A semi-supervised method is constructed based on the risky data provided by the business: the process detects high-risk entities around the known risky data and returns the suspected entities to the business for verification. The verified risky entities are then added as input to detect more risky data. Note that the semi-supervised method can detect only a limited number of samples around labeled entities and cannot identify groups with a specific structure.
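As an illustration of this semi-supervised propagation, the following Python sketch (a simplified hop-based expansion on a toy graph, using networkx; not the algorithm that Graph Compute ships) spreads risk labels outward from seed vertices:

import networkx as nx

# Toy account-medium graph; vertices prefixed with "u" are users.
G = nx.Graph()
G.add_edges_from([
    ("u1", "ip_a"), ("u2", "ip_a"),        # u1 and u2 share an IP address
    ("u2", "phone_x"), ("u3", "phone_x"),  # u2 and u3 share a phone
    ("u4", "ip_b"),                        # u4 is isolated from the seeds
])

seeds = {"u1"}  # known risky vertices provided by the business

def propagate(graph, seeds, hops=2):
    """Label every vertex reachable from a seed within `hops` hops."""
    risky = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for v in frontier for n in graph.neighbors(v)} - risky
        risky |= frontier
    return risky

print(sorted(propagate(G, seeds)))  # ['ip_a', 'u1', 'u2']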
Well-known industry methods for risky entity detection based on semi-supervised relationship network graphs include GraphRAD, proposed by Amazon in 2018, and Risk-alike, proposed by Ant Group in 2021. Both methods detect risks on graphs based on input risky vertices. You can also use a link analysis method, such as the PageRank (PR) algorithm, to calculate the importance of other vertices in the network based on existing labeled vertices.
Example:
Risk-alike
Input: risky vertices and all edges
Step 1: Create a graph based on the vertices within two hops from risky vertices.
Step 2: Use the Louvain algorithm to obtain the information about fraud teams and recall the vertices that can be reached from the fraud teams within two hops. Then, set filter conditions to keep the fraud teams in which the proportion of risky vertices is at least 40% and the total number of vertices is at most 200.
Step 3: Calculate the risk score of each vertex based on the PR algorithm and sort the vertices. The risk score of a vertex specifies the importance of the vertex to the relevant labeled risky vertex.
Step 4: Filter high-risk vertices based on the returned results of the Louvain and PR algorithms.
Output: high-risk vertices and the corresponding risk scores
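The following Python sketch approximates the Risk-alike flow described above by using networkx (version 3.x assumed). It is a rough reference under those assumptions, not Ant Group's implementation. Here G is the user-medium graph and risky is the collection of known risky vertex IDs:

import networkx as nx
from networkx.algorithms.community import louvain_communities

def risk_alike(G, risky, ratio=0.4, max_size=200, hops=2):
    # Step 1: build a subgraph from vertices within two hops of risky vertices.
    nodes = set(risky)
    for v in risky:
        nodes |= set(nx.single_source_shortest_path_length(G, v, cutoff=hops))
    sub = G.subgraph(nodes)

    # Step 2: Louvain communities, filtered by risky-vertex ratio and team size.
    teams = [
        c for c in louvain_communities(sub, seed=42)
        if len(c) <= max_size and len(c & set(risky)) / len(c) >= ratio
    ]

    # Step 3: personalized PageRank scores importance relative to the seeds.
    scores = nx.pagerank(sub, personalization={v: 1.0 for v in risky})

    # Step 4: keep high-risk vertices that appear in a suspicious team.
    team_nodes = set().union(*teams) if teams else set()
    return sorted(
        ((v, scores[v]) for v in team_nodes if v not in risky),
        key=lambda kv: -kv[1],
    )

For example, risk_alike(G, risky=["u1", "u7"]) returns candidate vertices sorted in descending order of personalized PageRank score.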
Alibaba Cloud Platform for AI (PAI) provides built-in anti-fraud algorithms, such as Risk-alike, Louvain, LPA, and PR. You can directly call these algorithms in PAI. The output contains the user information table and an isbad field that indicates whether an account is a scalper account. The user information table contains useful information such as registered accounts and registration time.
Step 3: Sort and summarize business schemas
Alibaba Cloud designs a variety of graph configuration schemas to build graphs based on the preceding business logic.
| Solution 1 | Solution 2 | Solution 3 |
| --- | --- | --- |
| Characteristic: relation heterogeneous table. The data schema of the graph is the closest to the original schema, but a large amount of data must be configured. | Characteristic: relation isomorphic table. The graph is easy to configure and needs only the user information table and the device information table. | Characteristic: devices as independent vertices. The graph is better suited to detecting popular devices. |
| Issue: The device types are fixed, and new device types cannot be added without adding a device relationship table. The graph lacks extensibility. | Issue: Before you add a user, you must perform one or more queries to find the relationships between the user and other users. | Issue: Query performance decreases because the system must query the relationships between devices and users. |
Step 4: Optimize business schemas
Solution 3 is selected based on the following business considerations:
1. Queries start from a specified medium type, and the graph filters and computes data by that type.
2. With Solution 3, you can add a user directly, whereas with Solution 1 or Solution 2 you must query the user's relationships before you can add the user. When a user is added by using Solution 1 or Solution 2, the system must write one vertex and multiple edges; with Solution 3, the system only needs to write one vertex and two edges. If a new user uses a popular IP address, Solution 1 and Solution 2 require the system to write tens of thousands of edges, and the write overhead becomes too large. For example, assume that you want to add User D, and the existing Users A, B, and C use the same IP address as User D.
Solution 1 and Solution 2: Add a vertex for User D, query the users who use the same IP address as User D, and then add edges. The graph originally contains 3 vertices and 6 edges; after the data is added, it contains 4 vertices and 12 edges.
Solution 3: Add a vertex for User D. If the IP address that User D uses does not exist in the graph, add a vertex for the IP address 192.168.1.1, and then add a forward edge and a reverse edge between the User D vertex and the IP address vertex. The graph originally contains 4 vertices and 6 edges; after the data is added, it contains 5 vertices and 8 edges.
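A quick back-of-the-envelope check of the write cost, following the User D example (a minimal Python sketch with hypothetical counts):

# Number of existing users that share a medium (IP address) with the new user.
for shared_users in (3, 10000):
    # Solutions 1 and 2: one forward/reverse edge pair per user sharing the medium.
    writes_s12 = 1 + 2 * shared_users  # 1 vertex + 2 edges per shared user
    # Solution 3: one edge pair to the medium vertex, regardless of popularity.
    writes_s3 = 1 + 2                  # 1 vertex + 1 forward and 1 reverse edge
    print("shared_users=%d: solution 1/2 -> %d writes, solution 3 -> %d writes"
          % (shared_users, writes_s12, writes_s3))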
The main disadvantage of Solution 3 is query performance, because the number of hops between users increases from one to two. However, Solution 3 simplifies the offline update logic, and the effect on query performance is limited.
Build a graph computing application by using Graph Compute
Step 1: Build a graph schema
Build a graph schema based on Solution 3, including a user vertex table, a medium table, and a relationship table that specifies the relationships between users and devices.
Step 2: Select a suitable data source type
You can use one of the following methods based on your business requirements:
Method 1: MaxCompute and API operations. This is the recommended method. If you use MaxCompute to host source data or plan to do so, and you want to update graph data in real time 24 hours a day, select Method 1.
Method 2: API operations. If the source data is hosted in Graph Compute and you want to update graph data in real time 24 hours a day by calling API operations, select Method 2.
Method 3: MaxCompute. If the source data is hosted in MaxCompute, a partitioned MaxCompute table is generated on a daily basis, and you do not need to update incremental graph data in real time, select Method 3.
Step 3: Create a data source
If you select Method 1 or 2, perform the following steps:
[Method 1: MaxCompute and API operations]
This method depends on the existing project and data of the source table in MaxCompute.
If you do not have existing data, you can select the demo provided by Alibaba Cloud that contains vertices and edges as a data source.
If your business data exists in CSV files or MaxCompute, you can select the data as the source data.
Source table:
In this example, the igraph_mock.anti_cheating_demo_user_vertex table is used. The vertex table contains 999,000 normal users and 1,000 risky users. If the value of the isbad field is TRUE for a user, the user is risky.
You can refer to the following statements to create a table in MaxCompute:
The isbad field is optional and used to label known scalper users.
You can add more fields based on your business requirements, such as fields related to the name, gender, and registration time.
CREATE TABLE IF NOT EXISTS anti_cheating_demo_user_vertex
(
    user_id BIGINT COMMENT 'User ID'
    , isbad BOOLEAN COMMENT 'Whether the user is risky'
)
COMMENT 'User vertex table'
PARTITIONED BY (ds STRING COMMENT 'Partitioning by date')
Medium table
In this example, the igraph_mock.anti_cheating_demo_medium_vertex table is used. The medium table contains 100,000 media, 0.3% of which are used by more than one user.
You can refer to the following statements to create a table in MaxCompute:
The medium_type field can represent multiple kinds of media, such as mobile numbers, email addresses, IP addresses, and devices. In this example, only mobile numbers and email addresses are involved.
The weight field specifies the importance of a medium. In most cases, media of the same type have the same weight value. For example, if multiple users use the same IP address, the weight of that IP address is the same for all of them.
CREATE TABLE IF NOT EXISTS anti_cheating_demo_medium_vertex
(
    medium_id BIGINT COMMENT 'Medium ID'
    , medium_type STRING COMMENT 'Medium type'
    , weight DOUBLE COMMENT 'Weight'
)
COMMENT 'Medium vertex table'
PARTITIONED BY (ds STRING COMMENT 'Partitioning by date')
Relationship table between users and media
In this example, the igraph_mock.anti_cheating_demo_user_medium_edge table is used. The relationship table stores the edges between users and the media that they use.
You can refer to the following statements to create a table in MaxCompute:
The score field specifies the frequency and importance of a medium for the relevant user. You can assign a value based on your business requirements. If you do not have special requirements, use the default value 1.
CREATE TABLE IF NOT EXISTS anti_cheating_demo_user_medium_edge
(
    user_id BIGINT COMMENT 'User ID'
    , medium_id BIGINT COMMENT 'Medium ID'
    , score DOUBLE COMMENT 'Weight'
)
COMMENT 'Relationship table between users and media'
PARTITIONED BY (ds STRING COMMENT 'Partitioning by date')
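If you maintain these tables with PyODPS (the Python SDK for MaxCompute), uploading a few demo edges might look like the following sketch. The credentials, endpoint, and row values are placeholders, and the table created above must already exist in the project:

from odps import ODPS

# Placeholder credentials and endpoint; replace with your own.
o = ODPS("<access_key_id>", "<access_key_secret>", "igraph_mock",
         endpoint="https://service.odps.aliyun.com/api")

table = o.get_table("anti_cheating_demo_user_medium_edge")
rows = [
    [1001, 2001, 1.0],  # user_id, medium_id, score
    [1002, 2001, 1.0],  # two users sharing one medium
]
# Write the rows into a daily partition, creating it if necessary.
with table.open_writer(partition="ds=20240101", create_partition=True) as writer:
    writer.write(rows)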
Method 2: API operations
You can use the SDK or the Graph Compute console to write and update data.
Graph Compute console: Log on to the Graph Compute console. In the left-side navigation pane, click Instances. On the Instance page, find the instance that you want to manage and click the instance ID. On the page that appears, click Graph List in the left-side navigation pane. On the page that appears, find the graph that you want to manage. Choose More > Graph O&M in the Actions column. Select a vertex or edge, right-click the vertex or edge, and then click Write Incremental Data.
Create a MaxCompute source table
To create a MaxCompute table, perform the following steps:
You must create MaxCompute tables in DataWorks. Each vertex and each edge must have its own table.
Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace that you want to manage and click DataStudio in the Actions column.
On the DataStudio page, move the pointer over Create in the top navigation bar. In the shortcut menu that appears, choose Create Table > Table. In the Create Table dialog box, specify a table name.
Configure the parameters of the table.
MaxCompute Engine Instance and Table Name:
the names of the MaxCompute engine instance and the table. You cannot modify these values, but keep them at hand because you must grant the permissions on the MaxCompute engine instance and table to your graph computing account.
General section: You can enter the display name and description of the table.
Physical Model section:
Partition Type: the partitioning type of the table. Select Partition Table.
TTL: the retention period of a partition table. Select TTL and configure the TTL (Days) parameter based on your business requirements.
Schema section: You can specify the field properties of the table. You must configure the required parameters, such as the field name, the data type, and whether the field is a primary key field. We recommend that you select the type of a field based on the characteristics of the field. For example, you can use the FLOAT type instead of the DOUBLE type if you do not require high precision. For a field that contains numeric IDs, you can use the INT type instead of the STRING type. This reduces the index size of the table and the time required to return data.
Warning: Graph Compute supports a limited set of data types. Only the data types of MaxCompute V1.0 and the basic data types of MaxCompute V2.0 are supported. The complex data types of MaxCompute V2.0, such as ARRAY, MAP, and STRUCT, are not supported. For more information about data types and valid value ranges, see MaxCompute V2.0 data type edition.
Step 4: Build a graph computing application
4.1 Configure the following parameters and purchase an instance on the buy page
Region: the region in which your instance resides. To reduce network latency, we recommend that you select a region that is close to your physical location.
Username and Password: the username and password that you use to log on.
Specifications: Configure the following parameters based on the data size and the complexity of the requests.
We recommend that you select Exclusive General-purpose as Specification.
If your business involves complex computing or requires high performance for returned fields, we recommend that you increase the number of shards.
Example:
The following figure shows the configurations for an instance that needs to process millions of data records.
If the data size is too large, you can increase the number of shards.
If the online query traffic increases, you can increase the number of replicas.
Wait 15 minutes after the purchase for the instance to be initialized.
Promotion: New enterprise users can use instances free of charge for the first month.
4.2 Create a business graph schema
After you perform the preceding step, you can create a graph schema. On the configuration page of the instance that you want to manage, click Graph List in the left-side navigation pane. On the page that appears, click Create. In the Create Graph dialog box, configure the Name and Description parameters.
4.3 Create a vertex table
Create a user vertex table
In the upper-left corner, click Add Node. In the Add Data Configuration (Node) panel, configure the required parameters.
Select MaxCompute Data Source and API Update as Data Source.
Configure the Source Table Project and Source Table Name parameters.
In this example, the Source Table Project parameter is set to igraph_mock, and the Source Table Name is set to anti_cheating_demo_user_vertex.
Click Import Fields. The names and data types of the fields in the MaxCompute table are automatically imported.
Select a field as the primary key and configure the Type parameter based on the data size of the field. Note that the data types of Graph Compute are different from those of MaxCompute. We recommend that you select an appropriate INT type, such as INT8, INT32, or INT64. An appropriate value range speeds up responses to requests.
For Index Type, we recommend that you select KV.
If you need to use inverted indexes, select Inverted INDEX as Index Type and click Add Index in the Index Field section. You can add indexes for global statistics and queries, such as in a scenario where you want to query all risky users.
If you need to add additional fields, click Add Fields in the Field Structure section.
If you need to return a partitioned table, turn on Scan DONE Partition. For more information, see Manage .done partitions.
Create a medium vertex table
In this example, the Source Table Project parameter is set to igraph_mock, and the Source Table Name is set to anti_cheating_demo_medium_vertex.
4.4 Create a relationship table
Create a forward edge from a user vertex to a medium vertex
Select a user vertex, right-click the user vertex, and then select Add Edge. In the Add Data Configuration (Edge) panel, specify the edge name.
Select MaxCompute Data Source and API Update as Data Source.
Configure the Source Table Project and Source Table Name parameters and click Import Fields.
In this example, the Source Table Project parameter is set to igraph_mock, and the Source Table Name is set to anti_cheating_demo_user_medium_edge.
In this example, the user_id field is selected for Source Field, and the medium_id field is selected for Destination Field. Configure the Type parameter based on the data size of the fields.
If you need to add additional fields, click Add Fields in the Field Structure section.
If you need to return a partitioned table, turn on Scan DONE Partition.
Click Submit.
Create a reverse edge from a medium vertex to a user vertex
Select MaxCompute Data Source and API Update as Data Source.
Configure the Source Table Project and Source Table Name parameters and click Import Fields.
In this example, the Source Table Project parameter is set to igraph_mock, and the Source Table Name is set to anti_cheating_demo_user_medium_edge.
In this example, the user_id field is selected for Destination Field, and the medium_id field is selected for Source Field. Configure the Type parameter based on the data size of the fields.
4.5 Publish the graph schema
After you perform the preceding steps, click Save and Publish.
4.6 Build an index
After you publish the graph schema, click Batch Backflow to build an index.
Configure the Select Partition parameter based on your business requirements.
The Incremental Data Timestamp parameter is used to set the start time when incremental data is retrieved after the data switching feature is enabled.
The time required to build an index depends on the data size. In most cases, an index for millions of data records can be built in 15 to 20 minutes. When the vertices and edges in the graph schema turn green, the index is built.
4.7 Determine the risk level of an account
You can determine the risk level of an account based on multiple factors, such as the size of the scalper team.
Query graph data in Graph Compute
After you perform the preceding steps, you can query and analyze graph data. In the left-side navigation pane, click Graph Exploration. You can use one of the following methods to query data: Exploratory Query and Query in the Console.
Example:
Direct query
// Query the information about a user to determine whether the user is risky.
g("anti_cheating").V("-889411487137524591").hasLabel("user")
Indirect query
// Query the number of risky users with which a specific medium is associated.
g("anti_cheating").V("-3161643561846490971").hasLabel("medium")
.outE().inV()
.filter(isbad=\"true\"").count()
// Query the number of risky users who use the same medium as a specific user.
g("anti_cheating").V("-189711352665847917").hasLabel("user").
.outE().outE().inV()
.filter("isbad=\"true\"").count()
Enable the weight feature
// Some media may have hundreds of thousands of users, such as an IP address of a university.
// In this case, we recommend that you use the sample(n).by("score") step to randomly sample one or more associated edges to check. The higher the score, the more likely an edge is to be selected.
g("anti_cheating").V("-189711352665847917").hasLabel("user").
.outE().sample(10).by("score").outE().inV()
.filter("isbad=\"true\"").count()
Calculate risk scores after weighting
g("anti_cheating").withSack(supplier(normal,"0.0"),Splitter.identity,Operator.sum).
V("2532895489060363835").hasLabel("user").outE().sack(Operator.assign).by("to_double(score)").
inV().sack(Operator.mult).by("to_double(weight)").outE().inV()
.filter("isbad=\"true\"").barrier().sack()
Why Graph Compute
Multiple graph computing services, each with unique advantages, are available on the market. To handle different scenarios in different industries, you need to understand the characteristics of each service before you select one. Graph Compute delivers strong real-time write performance and specializes in storing massive amounts of graph data and querying it quickly. Graph Compute provides highly stable graph queries by using the iGraph engine and optimized operator logic. Graph Compute provides a fault-tolerant data import feature based on data constraint logic and Alibaba Cloud-developed operators: if you write data that already exists in Graph Compute, the new data overwrites the previous data. The one-stop intelligent O&M capability of Graph Compute provides an engine for complex, distributed graph data processing and helps you reduce O&M costs.
Benefits of Graph Compute
1. High cost effectiveness
The proxy and search layers of Graph Compute increase the maximum workload of a cluster and improve resource utilization. You can save 50% of resources and double the maximum queries per second (QPS) that a cluster can handle.
2. High performance
Graph Compute supports vertex splitting and provides multiple index types, so you can classify data when you build indexes. Graph Compute also provides custom truncation logic to ensure query performance. The iGraph engine handles hot keys based on extensive practical experience and multi-level cache policies. Graph Compute also provides features such as dynamic scaling. Compared with open source solutions, query response time is improved by a factor of 1 to 5.
3. Data update for millions of records within seconds
For enterprises in the finance and Internet industries, risk control rules for natural persons are required: enterprises need to detect whether a single person has committed violations. This scenario requires updating large amounts of data and places high requirements on graph query performance, so online transaction processing (OLTP) capability is a core demand. Graph Compute implements eventual consistency, supports millions of update QPS, and makes updates to a single vertex visible within 1 to 2 seconds. Graph Compute helps you update risk control data within seconds and improves detection accuracy.
4. Connection to integrated data warehouses
You can use an offline processing platform to access data. Risk control and security business is analyzed by using a complete big data application provided by the algorithm and data teams. You can seamlessly connect to data sources based on MaxCompute data warehouses and rapidly iterate full data in the warehouses. The iteration cycle is shortened from days to hours.