Graph Database with PolarDB: Practice of Graph Analysis in the Banking and Financial Sector

Abstract This article introduces how to use Alibaba Cloud's PolarDB for PostgreSQL and its AGE extension for graph data analysis, especially for financial transaction fraud detection.

Abstract

This article introduces how to use Alibaba Cloud's PolarDB for PostgreSQL and its AGE extension for graph data analysis, especially for financial transaction fraud detection. PolarDB for PostgreSQL version combined with AGE extension supports efficient processing and querying of graph data, including the use of the Cypher query language. The article describes in detail the process from data preparation, graph structure creation to specific query examples, demonstrating how to discover the relationships of fraudulent transactions through graph queries, calculate Jaccard similarity between transactions, and thus provide fraud warnings.

Introduction

Graph analysis is an important field in data science, focusing on representing data through a graph structure and performing various calculations and analysis tasks. The graph structure consists of nodes (or vertices) and edges, where nodes typically represent entities and edges represent relationships between entities. Graph computing is widely used in social network analysis, recommendation systems, knowledge graphs, path optimization, and other fields.

PolarDB for PostgreSQL (referred to as PolarDB hereafter) is a cloud-native relational database product independently developed by Alibaba Cloud, fully compatible with PostgreSQL, highly compatible with Oracle syntax (public cloud version supports Oracle syntax); it adopts a Shared-Storage-based storage-computing separation architecture, with features such as extreme elasticity, millisecond-level latency, HTAP capability, high reliability, high availability, and elastic scalability, and has enterprise-level database characteristics. At the same time, PolarDB has large-scale parallel computing capabilities to deal with mixed OLTP and OLAP workloads.

The graph analysis capabilities described in this article are based on the output of the Alibaba Cloud native relational database PolarDB For PostgreSQL.

Business Scenario

Background Information

In modern fraud and various financial crimes, fraudsters achieve the goal of evading risk control rules by changing their identities. By establishing a graph structure to track user behavior through a graph database, real-time analysis of discrete data on fraudulent behavior and identification of fraud rings can quickly prevent and resolve fraudulent behavior.

Data and Schema

The IEEE-CIS Fraud Detection dataset is a public dataset in the finance transaction field. The data consists of transaction records provided by the e-commerce platform Vesta, including device, address, email, and other related information. This dataset can be used to identify fraudulent transactions and predict risks.

The original data is in CSV format, including transaction information (transaction number, address, email, etc.) and transaction identification information (device information, device type, etc.).

The dataset contains a large amount of information, and in this article, the data model is abstracted as the diagram below. The actual business scenario can be adjusted according to the actual situation.

Among them:

● Nodes:

transaction (transaction)
product (transaction product)
addr1 (transaction address 1)
addr2 (transaction address 2)
emaildomain (email domain used in transaction)
deviceinfo (transaction device information)
devicetype (transaction device type)

● Edges:

transaction_product (relationship between transaction and product)
transaction_addr1 (relationship between transaction and address 1)
transaction_addr2 (relationship between transaction and address 2)
transaction_emaildomain (relationship between transaction and email domain)
transaction_deviceinfo (relationship between transaction and device information)
transaction_devicetype (relationship between transaction and device type)

The above model is centered around transactions and is associated through transaction ID (transactionid).

Best Practice

Technology Implementation

PolarDB

PolarDB's graph database engine AGE (A Graph Extension) is an extension designed for the PostgreSQL series database, aimed at enhancing its ability to handle graph data. AGE aims to combine the advantages of relational databases and graph databases, providing a high-performance, flexible, and easily scalable solution.

AGE mainly includes the following features:

● Fully compatible with PostgreSQL

AGE is an extension of PolarDB's PostgreSQL version, which means it can be used in existing PolarDB databases without the need to rebuild the database. AGE inherits all the powerful features of PolarDB, including transactions, concurrency control, and various indexes and optimization techniques.

● Unified graph and relational queries

AGE allows for the simultaneous processing of relational and graph data, allowing for the mixed use of SQL and graph query languages in the same query. This makes handling complex data models easier and more efficient.

● Support for Cypher query language

AGE supports the use of the Cypher query language, which is a query language designed specifically for graph databases, with a simple and flexible syntax. It provides users with an intuitive way to query and manipulate graph data.

● High performance

By combining PolarDB's optimization techniques and indexes designed for graph data, AGE can efficiently handle large-scale graph data and complex graph queries.

As mentioned above, with the powerful capabilities of AGE, PolarDB can process various graph queries simply and efficiently.

Age-viewer

age-viewer is the built-in graph data visualization tool for the AGE project, which can visualize the query results. The age-viewer mirror which is compatible with PolarDB and can be downloaded from here.

Recomended Configuration

To achieve a good POC experience, it is recommended to use the following configuration:

Item	Recomended Configuration
PolarDB Version	Standard Edition Compatible with PostgreSQL 14
CPU	>16 Core
Memory	>64 GB
Disk	>100GB (AUTOPL)
Version	>2.0.14.23.1

Database Preparation

It is necessary to pre-create the AGE extension in the database.

create extension age;

At the beginning of each session, it is necessary to set the search_path and execute SQL to load the extension.

SET search_path = ag_catalog, "$user", public;
select * from  get_cypher_keywords() limit 0;

If you do not want to set the search_path in each session, you can perform this operation on the database.

ALTER DATABASE <dbname>
SET search_path = ag_catalog, "$user", public;

ALTER DATABASE <dbname> SET session_preload_libraries TO 'age';

Insert Data

Create Graph

Using the function create_graph, a graph can be created by specifying the name of the graph (here, "fraud_graph" is used as an example).

SELECT create_graph('fraud_graph');

Insert Nodes and Edges

Since the downloaded data is in CSV format, it cannot be directly inserted into the graph database as vertex and edge structured data.

The appendix of this article provides a Python script to convert data into vertex and edge in PolarDB. The converted data is as follows:

● Partial vertex data

SELECT * FROM cypher('fraud_graph', $$ MERGE (v:transaction {transactionid : 2990783, isfraud : 0 } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:product {productid : 158945 } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:addr1 {addr1 : '299.0' } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:addr2 {addr2 : '87.0' } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:emaildomain {emaildomain : 'gmail.com' } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:deviceinfo {deviceinfo : 'SM-G920V Build/NRD90M' } ) RETURN v $$ ) as (n agtype);
SELECT * FROM cypher('fraud_graph', $$ MERGE (v:devicetype {devicetype : 'mobile' } ) RETURN v $$ ) as (n agtype);
...

● Partial edge data

SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:product) WHERE a.transactionid = 2990783 AND b.productid = 158945 MERGE (a)-[e:transaction_product]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:addr1) WHERE a.transactionid = 2990783 AND b.addr1 = '299.0' MERGE (a)-[e:transaction_addr1]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:addr2) WHERE a.transactionid = 2990783 AND b.addr2 = '87.0' MERGE (a)-[e:transaction_addr2]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:emaildomain) WHERE a.transactionid = 2990783 AND b.emaildomain = 'gmail.com' MERGE (a)-[e:transaction_emaildomain_p]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:deviceinfo) WHERE a.transactionid = 2999403 AND b.deviceinfo = 'SM-G920V Build/NRD90M' MERGE (a)-[e:transaction_deviceinfo]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('fraud_graph', $$ MATCH (a:transaction), (b:devicetype) WHERE a.transactionid = 2999404 AND b.devicetype = 'mobile' MERGE (a)-[e:transaction_devicetype]->(b) RETURN e$$) as (e agtype);
...

Save the converted result as an SQL file, and use client tools such as psql to complete the data import.

Use examples include simple queries and data statistics:

Examples

Simple Query

Statistical Data

● Number of nodes

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n)
    RETURN count(*)
$$) as (number_of_vertex agtype);

 number_of_vertex 
----
 1076004

● Number of transaction nodes

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)
    RETURN count(*)
$$) as (number_of_transaction agtype);

number_of_transaction 
----
545591

● Number of transactions identified as fraudulent

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)
    WHERE n.isfraud = 1
    RETURN count(*)
$$) as (number_of_fraud_transaction agtype);

number_of_fraud_transaction 
----
 18919

● Number of edges

SELECT *
FROM cypher('fraud_graph', $$
    MATCH ()-[r]->()
    RETURN count(*)
$$) as (number_of_edge agtype);

number
------
 2131254

Filter Query, Sort Query

● Query transaction information with ID 2988706, show all related information.

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[r]->(v)
    WHERE n.transactionid = 2988706
    RETURN v
$$) as (r agtype);

---------
 {"id": 2251799813685249, "label": "addr2", "properties": {"addr2": "87.0"}}::vertex
 {"id": 2533274790395906, "label": "emaildomain", "properties": {"emaildomain": "gmail.com"}}::vertex
 {"id": 2533274790395906, "label": "emaildomain", "properties": {"emaildomain": "gmail.com"}}::vertex
 {"id": 1970324836974595, "label": "addr1", "properties": {"addr1": "325.0"}}::vertex
 {"id": 1125899906844295, "label": "product", "properties": {"productid": 137934}}::vertex

Preview the results using the following SQL in age-viewer:

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[r]->(v)
    WHERE n.transactionid = 2988706
    RETURN [n,r,v]::path
$$) as (r agtype);

Common Scenario

K Nearest Neighbors

The k-nearest neighbors (KNN) method uses the similarity between data points to identify potential fraudulent behavior. KNN can help evaluate the similarity of data points. By finding k neighbors that are similar to the features of the data point, it can determine whether the point is normal. For example, if a transaction shows significant differences in amount, location, and time compared to most neighbors, it may be marked as suspicious.

● Query for transaction record 2988706 with the same address as other transaction records. Later can judge the suspicion of this transaction based on the relevant information from these neighbors.

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[:transaction_addr1]->(:addr1)<-[:transaction_addr1]-(t:transaction)
    WHERE n.transactionid = 2988706
    RETURN t
$$) as (r agtype);

-----
 {"id": 844424930131972, "label": "transaction", "properties": {"isfraud": 0, "transactionid": 2987001}}::vertex
 {"id": 844424930131978, "label": "transaction", "properties": {"isfraud": 0, "transactionid": 2987007}}::vertex
 {"id": 844424930132041, "label": "transaction", "properties": {"isfraud": 0, "transactionid": 2987070}}::vertex
 {"id": 844424930132053, "label": "transaction", "properties": {"isfraud": 0, "transactionid": 2987082}}::vertex
....

Preview the results using the following SQL in age-viewer:

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[r:transaction_addr1]->(a:addr1)<-[r2:transaction_addr1]-(t:transaction)
    WHERE n.transactionid = 2988706
    RETURN [n,r,a,r2,t]::path
    LIMIT 50
$$) as (r agtype);

Note: Only 50 records are returned. Returning too much data will cause the frontend to freeze.

Path Retrieval

In the identification of fraudulent transactions, the concept of paths mainly involves using graph theory and network analysis methods to identify potential fraudulent behavior. In financial transactions and networks, transactions can often be viewed as a graph, with nodes representing accounts or customers and edges representing transaction activities. By analyzing the structure of this graph, it is possible to discover abnormal patterns and fraudulent behavior. By calculating the paths between transactions, it is possible to quickly identify the transaction chains from a suspicious account to other accounts, helping to reveal hidden connections between seemingly unrelated accounts.

● Retrieve the transaction records 2987000 and 2987172 to identify the related path based on a fraudulent transaction, and find all the related fraudulent transaction paths from two seemingly unrelated transactions.

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[r]->(v)<-[r1]-(t:transaction)-[r2]->(v2)<-[r3]-(k:transaction)
    WHERE n.transactionid = 2987000 
        and k.transactionid=2987172
        and t.isfraud = 1
    RETURN t 
$$) as (e agtype);

----
{"id": 844424930618281, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3473312}}::vertex
 {"id": 844424930626886, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3481917}}::vertex
 {"id": 844424930649640, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3504671}}::vertex
 {"id": 844424930631805, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3486836}}::vertex
 {"id": 844424930641980, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3497011}}::vertex
 {"id": 844424930644942, "label": "transaction", "properties": {"isfraud": 1, "transactionid": 3499973}}::vertex

Preview the results using the following SQL in age-viewer:

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[r]->(v)<-[r1]-(t:transaction)-[r2]->(v2)<-[r3]-(k:transaction)
    WHERE n.transactionid = 2987000 
        and k.transactionid=2987172
        and t.isfraud = 1
    RETURN [n,r,v,r1,t,r2,v2,r3,k]::path
    LIMIT 50
$$) as (e agtype);

Common Neighbor

In fraud transaction identification, the Common Neighbor Judgment is a technique based on social networks or transaction network analysis used to identify possible fraudulent behavior. This method is primarily based on graph theory and involves analyzing the relationship network between transaction participants to identify potentially suspicious transaction patterns. If two transactions have multiple common neighbors, and these neighbors exhibit abnormal behavior in their transactions (e.g. high frequency transactions, abnormal amounts, etc.), then there may be a risk of fraud between these two transaction makers.

● Search for the common neighbors of transaction record 2987000 and transaction record 2987172 in order to find transaction records with similar attributes (such as address, device).

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[]->(v)<-[]-(t:transaction)
    WHERE n.transactionid = 2987000 and t.transactionid=2987172
    RETURN v
$$) as (r agtype);

----
 {"id": 2251799813685249, "label": "addr2", "properties": {"addr2": "87.0"}}::vertex
 {"id": 1970324836974594, "label": "addr1", "properties": {"addr1": "315.0"}}::vertex

Preview the results using the following SQL in age-viewer:

SELECT *
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[r]->(v)<-[r1]-(t:transaction)
    WHERE n.transactionid = 2987000 and t.transactionid=2987172
    RETURN [n,r,v,r1,t]::path
$$) as (r agtype);

Jaccard Similarity

Jaccard similarity is used to measure the similarity between two sets. The formula is:

FORMULA

Where:

● J(A,B) is the Jaccard similarity of sets A and B.

● ∣A∩B∣ is the intersection of sets A and B.

● ∣A∪B∣ is the union of sets A and B.

In fraud detection, Jaccard similarity can be used in various ways, such as:

Pattern recognition: Fraudulent behavior often exhibits common characteristics. By calculating the Jaccard similarity of user transactions or behavior, similar transaction patterns can be identified, helping to detect potential fraud activities.
Customer group analysis: When analyzing customer behavior, Jaccard similarity can be used to compare the similarity between different customers. Customers with higher similarity may have similar risk characteristics.

In this example, a transaction can be calculated for its Jaccard similarity based on information such as managed addresses, email, and addresses. If the overlap of associated information is significant, it indicates a high similarity of the transaction.

Create the following function to obtain all related nodes of a specific transaction and return an array of id for all points.

CREATE OR REPLACE FUNCTION find_ids(transactionid integer)
RETURNS bigint[]
LANGUAGE plpgsql
AS $function$
DECLARE 
    sql VARCHAR;
    ids bigint[];
BEGIN
  sql := 'SELECT array_agg(cast(r as bigint)) FROM (
SELECT *
FROM cypher(''fraud_graph'', $$
MATCH (n:transaction)-[]->(v)
    WHERE n.transactionid = ' || text($1) || 
'RETURN id(v) 
$$) as (r agtype)) as t;';
    EXECUTE sql INTO ids;
    return ids;
END
$function$;

Create supportive function for performing Union and Intersection operations on arrays.

create or replace function array_union(anyarray, anyarray)
returns anyarray language sql immutable as $$
select array_agg(a order by a)
from (
    select distinct unnest($1 || $2) as a
) s;
$$;

create or replace function array_intersection(anyarray, anyarray)
returns anyarray language sql immutable as $$
select array_agg(e)
from (
    select unnest($1)
    intersect
    select unnest($2)
) as dt(e)
$$;

Finally, create a function to calculate the Jaccard similarity of two transactions.

CREATE OR REPLACE FUNCTION jaccardSimilarity(tid1 integer, tid2 integer)
RETURNS float8
LANGUAGE plpgsql
AS $function$
DECLARE 
    sql VARCHAR;
    ids1 bigint[];
    ids2 bigint[];
    union_list bigint[];
    intersection_list bigint[];
BEGIN
  ids1 = find_ids($1);
  ids2 = find_ids($2);
  union_list = array_union(ids1, ids2); -- union
  intersection_list = array_intersection(ids1, ids2); -- intersection

  RETURN 
    CASE 
        WHEN array_length(union_list,1) = 0 THEN 0
        ELSE array_length(intersection_list,1) * 1.0/ array_length(union_list,1) 
    END AS jaccardSimilarity;
END
$function$;

You can specify the IDs of the two transactions when using it.

SELECT jaccardSimilarity(2987000, 2987172);

jaccardsimilarity 
----
 0.4

If you need to compare the similarity of all transactions, you can take full advantage of the storage procedure capabilities of PolarDB to complete more complex similarity calculation tasks, thereby meeting the needs of detecting fraudulent transactions.

For example, you can use the following SQL to find all transactions that have the same address 1 and address 2, as well as the same email domain as transaction 2987002, and then sort them by Jaccard similarity to find the top 50 similar transactions:

WITH tmp AS
(SELECT cast(r as integer) as transactionid
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[:transaction_addr2]->(:addr2)<-[:transaction_addr2]-(t:transaction)
    WHERE n.transactionid = 2987002
    MATCH (n:transaction)-[:transaction_addr1]->(:addr1)<-[:transaction_addr1]-(t:transaction)
    WHERE n.transactionid = 2987002 
    MATCH (n:transaction)-[:transaction_emaildomain_p]->(:emaildomain)<-[:transaction_emaildomain_p]-(t:transaction)
    WHERE n.transactionid = 2987002 
    RETURN t.transactionid
$$) as (r agtype) )
SELECT transactionid, jaccardSimilarity(2987002, transactionid) as jaccardSimilarity
FROM tmp
ORDER by jaccardSimilarity DESC
LIMIT 50;

 transactionid | jaccardsimilarity 
---------------+-------------------
       3323911 |               0.6
       3328911 |               0.6
       3009043 |               0.6
       3039416 |               0.6
       3039425 |               0.6
       2993652 |               0.6
       3045027 |               0.6
       3037644 |               0.6
       3045041 |               0.6
...

Preview the results using the following SQL in age-viewer (results have been pre-calculated, only the top 10 are taken, if needed, they can be put into a list):

SELECT * 
FROM cypher('fraud_graph', $$
    MATCH (n:transaction)-[r]->(v)<-[r2]-(t:transaction)
    WHERE n.transactionid = 2987002
        AND t.transactionid IN [3323911, 3328911,3009043,3039416,3039425,2993652,3045027,3037644,3045041,3045049,3045279]
    RETURN [n,r,v,r2,t]::path
$$) as (r agtype);

Summary

This article introduces how to use the graph analysis capabilities of Alibaba Cloud's cloud-native relational database PolarDB for PostgreSQL edition for graph data analysis. PolarDB, combined with the AGE extension, provides graph data computation and analysis capabilities, including using Cypher query language to efficiently process graph data queries. Using a public dataset in the financial transaction field as an example, the article demonstrates executing graph queries to detect fraudulent transactions, such as querying for address or device information related to fraudulent transactions, identifying the relationships of fraudulent transactions, and calculating the similarity between transactions and fraudulent transactions for fraud detection. PolarDB, built on top of a relational database, provides graph analysis capabilities, offering strong support for unified enterprise data management and analysis.

Appendix

If you need the data conversion script, please contact the PolarDB for PostgreSQL graphical database engine support team or submit a ticket to obtain it.

Community

Graph Database with PolarDB: Practice of Graph Analysis in the Banking and Financial Sector

Abstract

Introduction

Business Scenario

Background Information

Data and Schema

Best Practice

Technology Implementation

PolarDB

AGE mainly includes the following features:

Age-viewer

Recomended Configuration

Database Preparation

Insert Data

Create Graph

Insert Nodes and Edges

Examples

Simple Query

Statistical Data

Filter Query, Sort Query

Common Scenario

K Nearest Neighbors

Path Retrieval

Common Neighbor

Jaccard Similarity

Summary

Appendix

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

PolarDB for PostgreSQL

PolarDB for Xscale

PolarDB for MySQL

Database for FinTech Solution