This article introduces how to use Alibaba Cloud's PolarDB for PostgreSQL and its AGE extension for graph data analysis, especially for financial transaction fraud detection. PolarDB for PostgreSQL version combined with AGE extension supports efficient processing and querying of graph data, including the use of the Cypher query language. The article describes in detail the process from data preparation, graph structure creation to specific query examples, demonstrating how to discover the relationships of fraudulent transactions through graph queries, calculate Jaccard similarity between transactions, and thus provide fraud warnings.
Graph analysis is an important field in data science, focusing on representing data through a graph structure and performing various calculations and analysis tasks. The graph structure consists of nodes (or vertices) and edges, where nodes typically represent entities and edges represent relationships between entities. Graph computing is widely used in social network analysis, recommendation systems, knowledge graphs, path optimization, and other fields.
PolarDB for PostgreSQL (referred to as PolarDB hereafter) is a cloud-native relational database product independently developed by Alibaba Cloud, fully compatible with PostgreSQL, highly compatible with Oracle syntax (public cloud version supports Oracle syntax); it adopts a Shared-Storage-based storage-computing separation architecture, with features such as extreme elasticity, millisecond-level latency, HTAP capability, high reliability, high availability, and elastic scalability, and has enterprise-level database characteristics. At the same time, PolarDB has large-scale parallel computing capabilities to deal with mixed OLTP and OLAP workloads.
The graph analysis capabilities described in this article are based on the output of the Alibaba Cloud native relational database PolarDB For PostgreSQL.
In modern fraud and various financial crimes, fraudsters achieve the goal of evading risk control rules by changing their identities. By establishing a graph structure to track user behavior through a graph database, real-time analysis of discrete data on fraudulent behavior and identification of fraud rings can quickly prevent and resolve fraudulent behavior.
The IEEE-CIS Fraud Detection dataset is a public dataset in the finance transaction field. The data consists of transaction records provided by the e-commerce platform Vesta, including device, address, email, and other related information. This dataset can be used to identify fraudulent transactions and predict risks.
The original data is in CSV format, including transaction information (transaction number, address, email, etc.) and transaction identification information (device information, device type, etc.).
The dataset contains a large amount of information, and in this article, the data model is abstracted as the diagram below. The actual business scenario can be adjusted according to the actual situation.
Among them:
● Nodes:
● Edges:
The above model is centered around transactions and is associated through transaction ID (transactionid).
PolarDB's graph database engine AGE (A Graph Extension) is an extension designed for the PostgreSQL series database, aimed at enhancing its ability to handle graph data. AGE aims to combine the advantages of relational databases and graph databases, providing a high-performance, flexible, and easily scalable solution.
● Fully compatible with PostgreSQL
AGE is an extension of PolarDB's PostgreSQL version, which means it can be used in existing PolarDB databases without the need to rebuild the database. AGE inherits all the powerful features of PolarDB, including transactions, concurrency control, and various indexes and optimization techniques.
● Unified graph and relational queries
AGE allows for the simultaneous processing of relational and graph data, allowing for the mixed use of SQL and graph query languages in the same query. This makes handling complex data models easier and more efficient.
● Support for Cypher query language
AGE supports the use of the Cypher query language, which is a query language designed specifically for graph databases, with a simple and flexible syntax. It provides users with an intuitive way to query and manipulate graph data.
● High performance
By combining PolarDB's optimization techniques and indexes designed for graph data, AGE can efficiently handle large-scale graph data and complex graph queries.
As mentioned above, with the powerful capabilities of AGE, PolarDB can process various graph queries simply and efficiently.
age-viewer is the built-in graph data visualization tool for the AGE project, which can visualize the query results. The age-viewer mirror which is compatible with PolarDB and can be downloaded from here.
To achieve a good POC experience, it is recommended to use the following configuration:
Item | Recomended Configuration |
---|---|
PolarDB Version | Standard Edition Compatible with PostgreSQL 14 |
CPU | >16 Core |
Memory | >64 GB |
Disk | >100GB (AUTOPL) |
Version | >2.0.14.23.1 |
It is necessary to pre-create the AGE extension in the database.
create extension age;
At the beginning of each session, it is necessary to set the search_path and execute SQL to load the extension.
SET search_path = ag_catalog, "$user", public;
select * from get_cypher_keywords() limit 0;
If you do not want to set the search_path in each session, you can perform this operation on the database.
ALTER DATABASE <dbname>
SET search_path = ag_catalog, "$user", public;
ALTER DATABASE <dbname> SET session_preload_libraries TO 'age';
Using the function create_graph, a graph can be created by specifying the name of the graph (here, "fraud_graph" is used as an example).
SELECT create_graph('fraud_graph');
Since the downloaded data is in CSV format, it cannot be directly inserted into the graph database as vertex and edge structured data.
The appendix of this article provides a Python script to convert data into vertex and edge in PolarDB. The converted data is as follows:
● Partial vertex data
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:transaction {transactionid : 2990783, isfraud : 0 } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:product {productid : 158945 } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:addr1 {addr1 : '299.0' } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:addr2 {addr2 : '87.0' } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:emaildomain {emaildomain : 'gmail.com' } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:deviceinfo {deviceinfo : 'SM-G920V Build/NRD90M' } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:devicetype {devicetype : 'mobile' } ) RETURN v $$ ) as (n agtype);
...
● Partial edge data
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:product) WHERE a.transactionid = 2990783 AND b.productid = 158945 MERGE (a)-[e:transaction_product]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:addr1) WHERE a.transactionid = 2990783 AND b.addr1 = '299.0' MERGE (a)-[e:transaction_addr1]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:addr2) WHERE a.transactionid = 2990783 AND b.addr2 = '87.0' MERGE (a)-[e:transaction_addr2]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:emaildomain) WHERE a.transactionid = 2990783 AND b.emaildomain = 'gmail.com' MERGE (a)-[e:transaction_emaildomain_p]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:deviceinfo) WHERE a.transactionid = 2999403 AND b.deviceinfo = 'SM-G920V Build/NRD90M' MERGE (a)-[e:transaction_deviceinfo]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:devicetype) WHERE a.transactionid = 2999404 AND b.devicetype = 'mobile' MERGE (a)-[e:transaction_devicetype]->(b) RETURN e$$) as (e agtype);
...
Save the converted result as an SQL file, and use client tools such as psql to complete the data import.
Use examples include simple queries and data statistics:
● Number of nodes
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n)
RETURN count(*)
$$) as (number_of_vertex agtype);
number_of_vertex
----
1076004
● Number of transaction nodes
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)
RETURN count(*)
$$) as (number_of_transaction agtype);
number_of_transaction
----
545591
● Number of transactions identified as fraudulent
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)
WHERE n.isfraud = 1
RETURN count(*)
$$) as (number_of_fraud_transaction agtype);
number_of_fraud_transaction
----
18919
● Number of edges
SELECT *
FROM cypher('fraud_graph', $$
MATCH ()-[r]->()
RETURN count(*)
$$) as (number_of_edge agtype);
number
------
2131254
● Query transaction information with ID 2988706, show all related information.
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[r]->(v)
WHERE n.transactionid = 2988706
RETURN v
$$) as (r agtype);
---------
{"id": 2251799813685249, "label": "addr2", "properties": {"addr2": "87.0"}}::vertex
{"id": 2533274790395906, "label": "emaildomain", "properties": {"emaildomain": "gmail.com"}}::vertex
{"id": 2533274790395906, "label": "emaildomain", "properties": {"emaildomain": "gmail.com"}}::vertex
{"id": 1970324836974595, "label": "addr1", "properties": {"addr1": "325.0"}}::vertex
{"id": 1125899906844295, "label": "product", "properties": {"productid": 137934}}::vertex
Preview the results using the following SQL in age-viewer:
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[r]->(v)
WHERE n.transactionid = 2988706
RETURN [n,r,v]::path
$$) as (r agtype);
The k-nearest neighbors (KNN) method uses the similarity between data points to identify potential fraudulent behavior. KNN can help evaluate the similarity of data points. By finding k neighbors that are similar to the features of the data point, it can determine whether the point is normal. For example, if a transaction shows significant differences in amount, location, and time compared to most neighbors, it may be marked as suspicious.
● Query for transaction record 2988706 with the same address as other transaction records. Later can judge the suspicion of this transaction based on the relevant information from these neighbors.
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[:transaction_addr1]->(:addr1)<-[:transaction_addr1]-(t:transaction)
WHERE n.transactionid = 2988706
RETURN t
$$) as (r agtype);
-----
{"id": 844424930131972, "label": "transaction", "properties": {"isfraud": 0, "transactionid": 2987001}}::vertex
{"id": 844424930131978, "label": "transaction", "properties": {"isfraud": 0, "transactionid": 2987007}}::vertex
{"id": 844424930132041, "label": "transaction", "properties": {"isfraud": 0, "transactionid": 2987070}}::vertex
{"id": 844424930132053, "label": "transaction", "properties": {"isfraud": 0, "transactionid": 2987082}}::vertex
....
Preview the results using the following SQL in age-viewer:
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[r:transaction_addr1]->(a:addr1)<-[r2:transaction_addr1]-(t:transaction)
WHERE n.transactionid = 2988706
RETURN [n,r,a,r2,t]::path
LIMIT 50
$$) as (r agtype);
Note: Only 50 records are returned. Returning too much data will cause the frontend to freeze.
In the identification of fraudulent transactions, the concept of paths mainly involves using graph theory and network analysis methods to identify potential fraudulent behavior. In financial transactions and networks, transactions can often be viewed as a graph, with nodes representing accounts or customers and edges representing transaction activities. By analyzing the structure of this graph, it is possible to discover abnormal patterns and fraudulent behavior. By calculating the paths between transactions, it is possible to quickly identify the transaction chains from a suspicious account to other accounts, helping to reveal hidden connections between seemingly unrelated accounts.
● Retrieve the transaction records 2987000 and 2987172 to identify the related path based on a fraudulent transaction, and find all the related fraudulent transaction paths from two seemingly unrelated transactions.
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[r]->(v)<-[r1]-(t:transaction)-[r2]->(v2)<-[r3]-(k:transaction)
WHERE n.transactionid = 2987000
and k.transactionid=2987172
and t.isfraud = 1
RETURN t
$$) as (e agtype);
----
{"id": 844424930618281, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3473312}}::vertex
{"id": 844424930626886, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3481917}}::vertex
{"id": 844424930649640, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3504671}}::vertex
{"id": 844424930631805, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3486836}}::vertex
{"id": 844424930641980, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3497011}}::vertex
{"id": 844424930644942, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3499973}}::vertex
Preview the results using the following SQL in age-viewer:
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[r]->(v)<-[r1]-(t:transaction)-[r2]->(v2)<-[r3]-(k:transaction)
WHERE n.transactionid = 2987000
and k.transactionid=2987172
and t.isfraud = 1
RETURN [n,r,v,r1,t,r2,v2,r3,k]::path
LIMIT 50
$$) as (e agtype);
In fraud transaction identification, the Common Neighbor Judgment is a technique based on social networks or transaction network analysis used to identify possible fraudulent behavior. This method is primarily based on graph theory and involves analyzing the relationship network between transaction participants to identify potentially suspicious transaction patterns. If two transactions have multiple common neighbors, and these neighbors exhibit abnormal behavior in their transactions (e.g. high frequency transactions, abnormal amounts, etc.), then there may be a risk of fraud between these two transaction makers.
● Search for the common neighbors of transaction record 2987000 and transaction record 2987172 in order to find transaction records with similar attributes (such as address, device).
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[]->(v)<-[]-(t:transaction)
WHERE n.transactionid = 2987000 and t.transactionid=2987172
RETURN v
$$) as (r agtype);
----
{"id": 2251799813685249, "label": "addr2", "properties": {"addr2": "87.0"}}::vertex
{"id": 1970324836974594, "label": "addr1", "properties": {"addr1": "315.0"}}::vertex
Preview the results using the following SQL in age-viewer:
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[r]->(v)<-[r1]-(t:transaction)
WHERE n.transactionid = 2987000 and t.transactionid=2987172
RETURN [n,r,v,r1,t]::path
$$) as (r agtype);
Jaccard similarity is used to measure the similarity between two sets. The formula is:
Where:
● J(A,B) is the Jaccard similarity of sets A and B.
● ∣A∩B∣ is the intersection of sets A and B.
● ∣A∪B∣ is the union of sets A and B.
In fraud detection, Jaccard similarity can be used in various ways, such as:
In this example, a transaction can be calculated for its Jaccard similarity based on information such as managed addresses, email, and addresses. If the overlap of associated information is significant, it indicates a high similarity of the transaction.
Create the following function to obtain all related nodes of a specific transaction and return an array of id for all points.
CREATE OR REPLACE FUNCTION find_ids(transactionid integer)
RETURNS bigint[]
LANGUAGE plpgsql
AS $function$
DECLARE
sql VARCHAR;
ids bigint[];
BEGIN
sql := 'SELECT array_agg(cast(r as bigint)) FROM (
SELECT *
FROM cypher(''fraud_graph'', $$
MATCH (n:transaction)-[]->(v)
WHERE n.transactionid = ' || text($1) ||
'RETURN id(v)
$$) as (r agtype)) as t;';
EXECUTE sql INTO ids;
return ids;
END
$function$;
Create supportive function for performing Union and Intersection operations on arrays.
create or replace function array_union(anyarray, anyarray)
returns anyarray language sql immutable as $$
select array_agg(a order by a)
from (
select distinct unnest($1 || $2) as a
) s;
$$;
create or replace function array_intersection(anyarray, anyarray)
returns anyarray language sql immutable as $$
select array_agg(e)
from (
select unnest($1)
intersect
select unnest($2)
) as dt(e)
$$;
Finally, create a function to calculate the Jaccard similarity of two transactions.
CREATE OR REPLACE FUNCTION jaccardSimilarity(tid1 integer, tid2 integer)
RETURNS float8
LANGUAGE plpgsql
AS $function$
DECLARE
sql VARCHAR;
ids1 bigint[];
ids2 bigint[];
union_list bigint[];
intersection_list bigint[];
BEGIN
ids1 = find_ids($1);
ids2 = find_ids($2);
union_list = array_union(ids1, ids2); -- union
intersection_list = array_intersection(ids1, ids2); -- intersection
RETURN
CASE
WHEN array_length(union_list,1) = 0 THEN 0
ELSE array_length(intersection_list,1) * 1.0/ array_length(union_list,1)
END AS jaccardSimilarity;
END
$function$;
You can specify the IDs of the two transactions when using it.
SELECT jaccardSimilarity(2987000, 2987172);
jaccardsimilarity
----
0.4
If you need to compare the similarity of all transactions, you can take full advantage of the storage procedure capabilities of PolarDB to complete more complex similarity calculation tasks, thereby meeting the needs of detecting fraudulent transactions.
For example, you can use the following SQL to find all transactions that have the same address 1 and address 2, as well as the same email domain as transaction 2987002, and then sort them by Jaccard similarity to find the top 50 similar transactions:
WITH tmp AS
(SELECT cast(r as integer) as transactionid
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[:transaction_addr2]->(:addr2)<-[:transaction_addr2]-(t:transaction)
WHERE n.transactionid = 2987002
MATCH (n:transaction)-[:transaction_addr1]->(:addr1)<-[:transaction_addr1]-(t:transaction)
WHERE n.transactionid = 2987002
MATCH (n:transaction)-[:transaction_emaildomain_p]->(:emaildomain)<-[:transaction_emaildomain_p]-(t:transaction)
WHERE n.transactionid = 2987002
RETURN t.transactionid
$$) as (r agtype) )
SELECT transactionid, jaccardSimilarity(2987002, transactionid) as jaccardSimilarity
FROM tmp
ORDER by jaccardSimilarity DESC
LIMIT 50;
transactionid | jaccardsimilarity
---------------+-------------------
3323911 | 0.6
3328911 | 0.6
3009043 | 0.6
3039416 | 0.6
3039425 | 0.6
2993652 | 0.6
3045027 | 0.6
3037644 | 0.6
3045041 | 0.6
...
Preview the results using the following SQL in age-viewer (results have been pre-calculated, only the top 10 are taken, if needed, they can be put into a list):
SELECT *
FROM cypher('fraud_graph', $$
MATCH (n:transaction)-[r]->(v)<-[r2]-(t:transaction)
WHERE n.transactionid = 2987002
AND t.transactionid IN [3323911, 3328911,3009043,3039416,3039425,2993652,3045027,3037644,3045041,3045049,3045279]
RETURN [n,r,v,r2,t]::path
$$) as (r agtype);
This article introduces how to use the graph analysis capabilities of Alibaba Cloud's cloud-native relational database PolarDB for PostgreSQL edition for graph data analysis. PolarDB, combined with the AGE extension, provides graph data computation and analysis capabilities, including using Cypher query language to efficiently process graph data queries. Using a public dataset in the financial transaction field as an example, the article demonstrates executing graph queries to detect fraudulent transactions, such as querying for address or device information related to fraudulent transactions, identifying the relationships of fraudulent transactions, and calculating the similarity between transactions and fraudulent transactions for fraud detection. PolarDB, built on top of a relational database, provides graph analysis capabilities, offering strong support for unified enterprise data management and analysis.
If you need the data conversion script, please contact the PolarDB for PostgreSQL graphical database engine support team or submit a ticket to obtain it.
Alipay Technology - December 26, 2019
Alibaba Cloud Indonesia - November 27, 2023
Alipay Technology - February 20, 2020
ApsaraDB - April 19, 2019
Alibaba Clouder - July 30, 2019
Alibaba Clouder - September 28, 2020
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAlibaba Cloud PolarDB for Xscale (PolarDB-X) is a cloud-native high-performance distributed database service independently developed by Alibaba Cloud.
Learn MoreAlibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn MoreLeverage cloud-native database solutions dedicated for FinTech.
Learn MoreMore Posts by ApsaraDB