Insurance Data Analysis Based on Graph Analysis Capabilities Provided by PolarDB

Abstract In this article, we use a publicly available insurance dataset to demonstrate how PolarDB can perform graph queries to identify abnormal cla...

Abstract

In this article, we use a publicly available insurance dataset to demonstrate how PolarDB can perform graph queries to identify abnormal claim records and fraudulent groups in insurance claim scenarios. For example, you can query policies involving the claimant, or analyze the social relationships of the policyholder for potential fraudulent behavior. PolarDB enhances relational databases with graph analysis capabilities, providing support for centralized data management, and analysis within enterprises.

About the Graph Database Engine

Graph analysis is a critical field of data science. It uses graph structures to represent data and perform various computing and analytic tasks. A graph structure consists of nodes (or vertices) and edges, with nodes representing entities and edges representing the relationships between the entities. Graph computing is commonly used in areas including social network analysis, recommendation systems, knowledge graphs, and path optimization.

PolarDB for PostgreSQL is highly compatible with Apache AGE, and supports storage and query of knowledge graphs. It allows you to use both the standard ANSI SQL and openCypher graph query language to query data within the same database cluster.

· Fully compatible with PolarDB for PostgreSQL

AGE is an extension for PolarDB for PostgreSQL. It can be used in existing PolarDB databases without the need for database reconstruction. AGE leverages all features provided by PolarDB, including transaction support, concurrency control, and various indexing and performance optimization features.

· Unified graph and relational queries

AGE enables the concurrent processing of relational and graph data. It allows you to use both SQL and the graph query language in a single query to simplify the handling of complex data models, which improves operation efficiency.

· Cypher query language supported

AGE supports the Cypher query language, which is tailored for graph databases. Its syntax is simple and flexible, providing an intuitive approach to querying and manipulating graph data.

· High performance

By combining the optimization features provided by PolarDB and indexes tailored for graph data, AGE can efficiently manage large-scale graph data and complex graph queries.

In conclusion, based on the capabilities provided by AGE, PolarDB can handle graph queries in a simple and efficient manner.

Scenarios

Description

Insurance claim fraud typically involves data on patients, diseases, and claims held by insurance providers. By analyzing the connections between claim applications, diseases, and other related entities, you can identify abnormal claim records and expose potentially fraudulent groups.

Data and Models

The data is based on an insurance dataset which is publicly available. You can find the dataset here. It contains fundamental elements of the insurance industry. The data model can be abstracted into the following graph:

· Vertices: policyholder, incharge, claim, patient, disease.

· Edges: has_disease, policyholder_of_claim, incharge_of_claim, insured_of_claim, similar_claim, policyholder_connection.

· Properties: name, high_risk, risk_score, disease_name, similarity_score, level, claim_date, charge, and more.

Best Practices

Prepare the Database

The graph engine extension is supported only by PolarDB for PostgreSQL14 whose revision version is 14.12.24.0 or later.

Create the extension.

create extension age;
SET search_path = "$user", public, ag_catalog;
ALTER DATABASE <dbname>
SET search_path = "$user", public, ag_catalog;
ALTER DATABASE <dbname> SET session_preload_libraries TO 'age';

Data Import

1. Create a graph by using the create_graph function in the ag_catalog namespace.

SELECT create_graph('graph');

2. Insert vertices and edges. The downloaded data file is in the CSV format and does not contain the necessary ID information. Convert the file before you import the data. A Python script is provided in Appendix to convert the data into vertices and edges that is supported by PolarDB. See the following conversion results:

● POLICYHOLDER

SELECT create_vlabel('graph','policyholder');
SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3068',fname:'ADAM',lname:'OCHSENBEIN',risk_score:'88',high_risk:'1'}) $$ ) as (n agtype);
SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3069',fname:'MALINDA',lname:'MEHSERLE',risk_score:'42',high_risk:'0'}) $$ ) as (n agtype);
SELECT * FROM cypher('graph', $$ CREATE (:policyholder {policyholder_id:'PH3070',fname:'SANDRA',lname:'KUHTA',risk_score:'20',high_risk:'0'}) $$ ) as (n agtype);
...

● CLAIM

- Create vlabel
SELECT create_vlabel('graph','claim');
SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3571',charge:'6517.53',claim_date:'2013-08-11 00:00:00',duration:'13',insured_id:'28523',diagnosis:'no exception',person_incharge_id:'PI23070',type:'services',policyholder_id:'PH9507'}) $$ ) as (n agtype);
SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3572',charge:'49273.65',claim_date:'2017-02-10 00:00:00',duration:'3',insured_id:'1220',diagnosis:'no exception',person_incharge_id:'PI21197',type:'services',policyholder_id:'PH406'}) $$ ) as (n agtype);
SELECT * FROM cypher('graph', $$ CREATE (:claim {claim_id:'C3573',charge:'52005.98',claim_date:'2014-06-29 00:00:00',duration:'27',insured_id:'23735',diagnosis:'no exception',person_incharge_id:'PI22361',type:'services',policyholder_id:'PH7911'}) $$ ) as (n agtype);
...

● Connections between policyholder and claim

SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1528' AND b.policyholder_id = 'PH2963' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1529' AND b.policyholder_id = 'PH1353' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1530' AND b.policyholder_id = 'PH1071' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1531' AND b.policyholder_id = 'PH8102' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
SELECT * FROM cypher('graph', $$ MATCH (a:claim), (b:policyholder) WHERE a.claim_id = 'C1532' AND b.policyholder_id = 'PH4768' CREATE (a)-[e:RELTYPE  ]->(b) RETURN e$$) as (e agtype);
...

Save the transformed results as an SQL file and use client tools such as the PostgreSQL client to import the data.

Examples

Simple Queries

Statistics

● Query the number of nodes of various types.

SELECT count(*) FROM cypher('graph', $$
MATCH (v)
RETURN v
$$) as (v agtype);
 count  
--------
 120567

   
-- claim
SELECT count(*) FROM cypher('graph', $$
MATCH (v:claim)
RETURN v
$$) as (v agtype);
 count  
--------
 100001

-- policyholder
SELECT count(*) FROM cypher('graph', $$
MATCH (v:policyholder)
RETURN v
$$) as (v agtype);
 count 
-------
 10006

-- incharge
SELECT count(*) FROM cypher('graph', $$
MATCH (v:incharge)
RETURN v
$$) as (v agtype);
 count 
-------
 10001

--disease
SELECT count(*) FROM cypher('graph', $$
MATCH (v:disease)
RETURN v
$$) as (v agtype);
count 
-------
   393
  
--patient
SELECT count(*) FROM cypher('graph', $$
MATCH (v:patient)
RETURN v
$$) as (v agtype);

 count 
-------
   166

Query the Number of Claims

● Query the policyholder, incharge, and patient status of claim C4377.

SELECT 'policyholder_id' as type, policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(policyholder:policyholder)
RETURN policyholder.policyholder_id
$$) as (policyholder_id agtype)
UNION
SELECT 'incharge_id', incharge_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(v:incharge)
RETURN v.incharge_id
$$) as (incharge_id agtype)
UNION
SELECT 'patient_id', patient_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(v:patient)
RETURN v.patient_id
$$) as (patient_id agtype);

      type       | policyholder_id 
-----------------+-----------------
 patient_id      | "11279"
 policyholder_id | "PH3759"
 incharge_id     | "PI26607"

General Scenarios

K-order Neighbors

● Query the claims with the same insured patient as fraudulent policy C4377 to determine whether the claimant is involved in insurance fraud.

SELECT 'claim_id', claim_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:patient)<-[]-(c:claim)
RETURN c.claim_id
$$) as (claim_id agtype);

 ?column? | claim_id 
----------+----------
 claim_id | "C28963"
 claim_id | "C3679"
 claim_id | "C96545"
 claim_id | "C26586"
 claim_id | "C26754"
 claim_id | "C87278"
 claim_id | "C87603"
 claim_id | "C69395"
 claim_id | "C67594"
 claim_id | "C96155"
 claim_id | "C10160"

● Query the social relationships of the policyholder of fraudulent policy C4377 to provide early warnings about their claims.

SELECT 'policyholder_id', policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(a:policyholder)-[r*1..3]->(p:policyholder)
RETURN p.policyholder_id
$$) as (policyholder_id agtype);

    ?column?     | policyholder_id 
-----------------+-----------------
 policyholder_id | "PH52532"
 policyholder_id | "PH11283"
 policyholder_id | "PH11328"
 policyholder_id | "PH1"
 policyholder_id | "PH5"
 policyholder_id | "PH512"
 policyholder_id | "PH1569"
 policyholder_id | "PH4722"
 policyholder_id | "PH4731"

Path Retrieval

● Query the path between policyholder PH3759 and policyholder PH4722 to check their connections.

SELECT *
FROM cypher('graph', $$                        
    MATCH path = (:policyholder {policyholder_id: 'PH3759'})-[r*1..3]->(:policyholder {policyholder_id: 'PH4722'})
    RETURN path
$$) AS (v agtype);

-------
 [{"id": 844424930136988, "label": "policyholder", "properties": {"fname": "KURTIS", "lname": "ALKEMA", "high_risk": "1", "risk_score": "78", "policyholder_id": "PH3759"}}::vertex, {"id": 2251799813685487, "label": "RELTYPE", "end_id": 844424930133473, "start_id": 844424930136988, "properties": {"level": "65"}}::edge, {"id": 844424930133473, "label": "policyholder", "properties": {"fname": "TERRA", "lname": "SWARB", "high_risk": "0", "risk_score": "25", "policyholder_id": "PH512"}}::vertex, {"id": 2251799813685546, "label": "RELTYPE", "end_id": 844424930138502, "start_id": 844424930133473, "properties": {"level": "62"}}::edge, {"id": 844424930138502, "label": "policyholder", "properties": {"fname": "VETA", "lname": "SEDLACK", "high_risk": "0", "risk_score": "31", "policyholder_id": "PH1569"}}::vertex, {"id": 2251799813685594, "label": "RELTYPE", "end_id": 844424930136281, "start_id": 844424930138502, "properties": {"level": "92"}}::edge, {"id": 844424930136281, "label": "policyholder", "properties": {"fname": "DEANNA", "lname": "BALSER", "high_risk": "0", "risk_score": "36", "policyholder_id": "PH4722"}}::vertex]::path

Common Neighbors

● Query common neighbors of policies C4377 and C67594 to identify shared policyholders.

SELECT 'policyholder_id', policyholder_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:policyholder)<-[]-(:claim  {claim_id: 'C67594'})
RETURN p.policyholder_id
$$) as (policyholder_id agtype);

    ?column?     | policyholder_id 
-----------------+-----------------
 policyholder_id | "PH3759"

Collaborative Recommendation

● Query policies that share the same policyholder as fraudulent policy C4377 to identify other potentially fraudulent policies.

SELECT 'claim_id', claim_id FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[]->(p:policyholder)<-[]-(c:claim)
RETURN c.claim_id
$$) as (claim_id agtype);

?column? | claim_id 
----------+----------
 claim_id | "C28963"
 claim_id | "C96545"
 claim_id | "C3679"
 claim_id | "C87603"
 claim_id | "C26754"
 claim_id | "C26586"
 claim_id | "C87278"
 claim_id | "C69395"
 claim_id | "C67594"
 claim_id | "C96155"
 claim_id | "C10160"

● Query the top 20 policies most similar to fraudulent policy C4377:

with t as (
SELECT claim_id, replace(trim(both '"' from to_jsonb(properties)::text), '\"', '"') as similarity FROM cypher('graph', $$
MATCH (:claim {claim_id: 'C4377'})-[e]->(c:claim)
RETURN properties(e), c.claim_id
$$) as (properties agtype, claim_id agtype)
)
SELECT claim_id, replace((similarity::jsonb->'similarity_score')::text, '"','')::integer as s 
from t
ORDER BY s DESC
LIMIT 20;

claim_id | s  
----------+----
 "C67594" | 13
 "C69395" | 13
 "C10160" | 13
 "C87603" | 13
 "C28963" | 13
 "C3679"  | 13
 "C26754" | 13
 "C96155" | 13
 "C26586" | 13
 "C87278" | 13
 "C96545" | 13
 "C20113" |  8
 "C70759" |  8
 "C28785" |  8
 "C12793" |  8
 "C59736" |  8
 "C38059" |  8
 "C34068" |  8
 "C71827" |  8
 "C15760" |  8

Conclusion

You can combine the graph analysis capabilities provided by PolarDB for PostgreSQL, and the AGE extension to perform graph data analysis. PolarDB provides graph data computing and analysis features, including Cypher query language support and efficient graph data query processing, to facilitate centralized data management and analysis within enterprises.

Free trial

Visit the Free Trials & Special Offers page, select the PolarDB option, and try the graph computing features of GanosBase provided by PolarDB for PostgreSQL.

Appendix

Data conversion script for converting CSV data.

import csv
import os
def convert_vertex_csv(file_path, graph):
    file_name = os.path.splitext(os.path.basename(file_path))[0].lower()
    
    # create vlabel
    print("------------------------------------------------")
    print("-- Create vlabel")
    print("SELECT create_vlabel('{}','{}');".format(graph, file_name))

    with open(file_path, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        header = next(reader)
        for row in reader:
            p = ""
            for h in header:
                if p != "":
                    p += ","
                else:
                    p += "{"
                p += "{}:'{}'".format(h.lower(), row[header.index(h)].strip())
            if p != "":
               p += "}"
            print("SELECT * FROM cypher('{}', $$ CREATE (:{} {}) $$ ) as (n agtype);".format(graph, file_name, p))
                

def convert_edge_csv(file_path, graph, from_type, to_type):
    file_name = os.path.splitext(os.path.basename(file_path))[0].lower()

    with open(file_path, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        header = next(reader)
        for row in reader:
            p = ""
            for h in header:
                if (h.endswith("ID")):
                    continue;
                
                if p != "":
                    p += ","
                else:
                    p += "{"
                p += "{}:'{}'".format(h.lower(), row[header.index(h)].strip())
            
            if p != "":
               p += "}"
            
            print("SELECT * FROM cypher('{0}', $$ MATCH (a:{1}), (b:{2}) WHERE a.{1}_id = '{3}' AND "
                  "b.{2}_id = '{4}' CREATE (a)-[e:RELTYPE {5} ]->(b) RETURN e$$) as (e agtype);".format(graph, from_type, to_type, row[0].strip(), row[1].strip(), p))

def generate_graph_csv(directory, graph):

    print("------------------------------------------------")
    print("-- Create graph")
    print("SELECT create_graph('{}');".format(graph))

    print("------------------------------------------------")
    print("-- Create vertex")
    convert_vertex_csv(directory + "/POLICYHOLDER.csv", graph)
    convert_vertex_csv(directory + "/INCHARGE.csv", graph)
    convert_vertex_csv(directory + "/PATIENT.csv", graph)
    convert_vertex_csv(directory + "/CLAIM.csv", graph)
    convert_vertex_csv(directory + "/DISEASE.csv", graph)

    print("------------------------------------------------")
    print("-- Create edge")
    convert_edge_csv(directory + "/POLICYHOLDER_CONNECTION.csv", graph, 'policyholder','policyholder')
    convert_edge_csv(directory + "/INCHARGE_OF_CLAIM.csv", graph,'claim', 'incharge')
    convert_edge_csv(directory + "/CLAIM_SIMILARITY.csv", graph, 'claim','claim')
    convert_edge_csv(directory + "/POLICYHOLDER_OF_CLAIM.csv", graph,'claim', 'policyholder')
    convert_edge_csv(directory + "/INSURED_OF_CLAIM.csv", graph, 'claim','patient')
    convert_edge_csv(directory + "/HAS_DISEASE.csv", graph, 'patient','disease')

generate_graph_csv("analyzing-insurance-claims-using-ibm-db2-graph-master/data", "graph")

Community

Insurance Data Analysis Based on Graph Analysis Capabilities Provided by PolarDB

Abstract

About the Graph Database Engine

Scenarios

Description

Data and Models

Best Practices

Prepare the Database

Data Import

Examples

Simple Queries

Statistics

Query the Number of Claims

General Scenarios

K-order Neighbors

Path Retrieval

Common Neighbors

Collaborative Recommendation

Conclusion

Free trial

Appendix

Read previous post:

Read next post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

Best Practices

PolarDB for PostgreSQL

PolarDB for Xscale

PolarDB for MySQL