Graph Analysis Based on PolarDB: Fast Import of Graph Data via Tables

Abstract This article introduces methods for importing a large amount of data into the graph database managed by a PolarDB-PostgreSQL database with the Apache AGE (A Graph Extension) plugin.

Abstract

This article introduces methods for importing a large amount of data into the graph database managed by a PolarDB-PostgreSQL database with the Apache AGE (A Graph Extension) plugin. It enables the rapid import of millions of nodes and edges into the graph database, avoiding performance bottlenecks caused by queries during edge insertion.

Background and Introduction

Alibaba Cloud's PolarDB for PostgreSQL can use Apache AGE (A Graph Extension) to manage and query graph data.

When importing data into graph databases, we often need to first convert node data and edge data into file formats such as CSV. Then, we import the vertices first, and subsequently build edges between the vertices.

However, if we use the Cypher MATCH...CREATE statement to create edges, the corresponding nodes on both sides need to be extracted from the vertex table, which reduces the import speed. When importing edges with data volumes exceeding tens of millions, performance often does not meet the requirements.

This article introduces a new method for importing graph data. By directly generating and managing unique IDs for the nodes on the graph, we can skip the MATCH step and directly write edge records. This approach can improve import efficiency by two orders of magnitude. The process includes the following steps:

Import the vertices of the graph as normal tables in the database. Typically, there is a column (or a combination of multiple columns) in the node table that serves as the unique identifier.
Import the edges of the graph as normal tables in the database. The edge table first contains the values of the unique identifier column for the left-side node, followed by the values of the unique identifier column for the right-side node, and finally the properties of the edge itself.
Add a unique identifier column to the node table, and through a join or calculation, add unique identifier columns for the starting point and endpoint to the edge table.
Import the data from the node table into the graph using the unique identifier columns.
Import the data from edge table into the graph using the unique identifiers.

As an example, we have constructed the following simple scenario:

● There are two types of nodes: v_user and v_product.

○ v_user is identified uniquely by two columns, type and uid. The uid may repeat under different types.

○ v_product uses a single product_id as a unique identifier, but it is a complex string that cannot be converted to a numeric ID and needs to be manually assigned a label.

● There is one type of edge, named e_own.

● The data for both nodes and edges has been imported into a table named xxx_raw.

● Both nodes and edges need to be imported into a graph named toys.

The following figure is an illustration of the procedure:

Data Declarations before Importing

● Create the extension

First create the age extension, and add it to the search path.

create extension age;
SET search_path = "$user", public, ag_catalog;
ALTER DATABASE <dbname>
SET search_path = "$user", public, ag_catalog;
ALTER DATABASE <dbname> SET session_preload_libraries TO 'age';

● Create the graph and its labels for vertices and edges.

SELECT create_graph('toys');
SELECT create_vlabel('toys','v_user');
SELECT create_vlabel('toys','v_product');
SELECT create_elabel('toys','e_own');

● Create the normal tables to be imported into the graph. The tables can be imported from csv files or DTS tasks in real scenarios.

CREATE TABLE public.v_user_raw(type text, uid text, name text, age integer);

INSERT INTO v_user_raw VALUES
('A','U1', 'Alice', '33'),
('B','U1', 'Bob', '21');

CREATE TABLE public.v_product_raw(product_id text, price double precision);

INSERT INTO v_product_raw VALUES
('INAKLIDAS', '50'),
('ENKUCLKSD', '80'),
('IIUIHAKLS', '320'),
('SDVDSUHEE', '340');

CREATE TABLE public.e_own_raw(user_type text, user_uid text, product_id text, buy_price text);

INSERT INTO e_own_raw VALUES
('A', 'U1', 'INAKLIDAS', '45'),
('B', 'U1', 'ENKUCLKSD', '70'),
('B', 'U1', 'INAKLIDAS', '50'),
('B', 'U1', 'SDVDSUHEE', '330');

Create the Unique Identifier

● We've introduced two different methods for the two vertex table. The first one is to design a function that convert the data columns into a unique identifier, the second is to generate an auto-increase id in the database. Their differences are:

○ The first method is faster.

○ The second method is slower, but has no requirement on the original data.

● For v_user vertices, assume it only has 10 types (A-J), then we can use a function to combine the type and uid into a unique identifier:

CREATE OR REPLACE FUNCTION v_user_idgen(type text, uid text) RETURNS bigint AS
$$
SELECT (ASCII(type) - ASCII('A')) + substring(uid, 2)::bigint * 10
$$
language SQL;

Then, we can use this function to add a column of unique identifiers on the vertices and edge tables start or end on those vertices.

ALTER TABLE v_user_raw ADD COLUMN _id bigint UNIQUE CHECK(_id < 281474976710656 AND _id>=0);
UPDATE v_user_raw SET _id = v_user_idgen(type, uid);

ALTER TABLE e_own_raw ADD COLUMN _fromid bigint CHECK(_fromid < 281474976710656 AND _fromid>=0);
UPDATE e_own_raw SET _fromid = v_user_idgen(user_type, user_uid);

● For v_product vertices, as its product id cannot be converted to a bigint value directly, we manually add a bigserial column as its unique identifier. After that, we use joins to apply these unique identifiers to the edge tables.

ALTER TABLE v_product_raw ADD COLUMN _id bigserial UNIQUE CHECK(_id < 281474976710656 AND _id>=0);

CREATE INDEX ON v_product_raw(product_id);

ALTER TABLE e_own_raw ADD COLUMN _toid bigint CHECK(_toid < 281474976710656 AND _toid>=0);
UPDATE e_own_raw SET _toid = v_product_raw._id
FROM v_product_raw 
WHERE v_product_raw.product_id = e_own_raw.product_id;

Fast Import

Create Support Functions

First we create these functions to help generating the unique identifier inside the graph database.

CREATE OR REPLACE FUNCTION age_name_to_idx_start(graph_name text, kind_name text, label_name text)
RETURNS bigint
AS 'SELECT id::bigint<<48 FROM ag_catalog.ag_label WHERE kind = kind_name and name = label_name and graph = (SELECT graphid FROM ag_catalog.ag_graph WHERE name = graph_name)'
language SQL IMMUTABLE STRICT PARALLEL SAFE;

CREATE OR REPLACE FUNCTION age_name_to_seq(graph_name text, kind_name text, label_name text)
RETURNS text
AS 'SELECT graph_name || ''.'' || seq_name::text FROM ag_catalog.ag_label WHERE kind = kind_name and name = label_name and graph = (SELECT graphid FROM ag_catalog.ag_graph WHERE name = graph_name)'
language SQL IMMUTABLE STRICT PARALLEL SAFE;

Importing Nodes

Before importing vertices, it is better to create a unique constraint to prevent the generation of duplicate IDs. However, this step can also be omitted. This index is beneficial for subsequent queries, and there is no need to delete it after use.

CREATE UNIQUE INDEX ON toys.v_user(id);
CREATE UNIQUE INDEX ON toys.v_product(id);

Then we import the data into the graph, we can select the prefered columns in the xxx_raw tables to form the properties of the vertices.

INSERT INTO toys."v_user"
SELECT (age_name_to_idx_start('toys', 'v', 'v_user') +  _id)::text::graphid,
row_to_json((select x FROM (select type, uid, name, age) x))::text::agtype FROM v_user_raw;

SELECT setval(age_name_to_seq('toys', 'v', 'v_user'), (SELECT max(_id) + 1 FROM v_user_raw));


INSERT INTO toys."v_product"
SELECT (age_name_to_idx_start('toys', 'v', 'v_product') + _id)::text::graphid,
row_to_json((select x FROM (select product_id, price) x))::text::agtype FROM v_product_raw;

SELECT setval(age_name_to_seq('toys', 'v', 'v_product'), (SELECT max(_id) + 1 FROM v_product_raw));

Importing Edges

We create the edges of the graph using the _fromid and _toid columns. We can choose to only include the columns of edges (e.g. buy_price), or we can also include the columns of the referenced vertexes.

INSERT INTO toys."e_own" 
SELECT
(age_name_to_idx_start('toys', 'e', 'e_own') + nextval(age_name_to_seq('toys', 'e', 'e_own')))::text::graphid,
(age_name_to_idx_start('toys', 'v', 'v_user') + _fromid)::text::graphid,
(age_name_to_idx_start('toys', 'v', 'v_product') + _toid)::text::graphid,
row_to_json((select x FROM (select buy_price) x))::text::agtype
FROM e_own_raw;

Result Verification

We first list all the graph nodes using Cypher statement.

SELECT * FROM cypher('toys', $$
MATCH (v:v_user)
RETURN v
$$) as (v agtype);
                                                             v                                                              
----------------------------------------------------------------------------------------------------------------------------
 {"id": 844424930131978, "label": "v_user", "properties": {"age": "33", "uid": "U1", "name": "Alice", "type": "A"}}::vertex
 {"id": 844424930131979, "label": "v_user", "properties": {"age": "21", "uid": "U1", "name": "Bob", "type": "B"}}::vertex
(2 rows)


SELECT * FROM cypher('toys', $$
MATCH (v:v_product)
RETURN v
$$) as (v agtype);
                                                         v                                                         
-------------------------------------------------------------------------------------------------------------------
 {"id": 1125899906842625, "label": "v_product", "properties": {"price": "50", "product_id": "INAKLIDAS"}}::vertex
 {"id": 1125899906842626, "label": "v_product", "properties": {"price": "80", "product_id": "ENKUCLKSD"}}::vertex
 {"id": 1125899906842627, "label": "v_product", "properties": {"price": "320", "product_id": "IIUIHAKLS"}}::vertex
 {"id": 1125899906842628, "label": "v_product", "properties": {"price": "340", "product_id": "SDVDSUHEE"}}::vertex
(4 rows)

Then we perform a query using Cypher statement on the edges to check the discount of each bill.

SELECT buy, sell, sell::int - buy::int as discount FROM cypher('toys', $$
MATCH (v1:v_user)-[e:e_own]-(v2:v_product)
WHERE v1.name = 'Bob'
RETURN e.buy_price, v2.price
$$) as (buy agtype, sell agtype);
  buy  | sell  | discount 
-------+-------+----------
 "70"  | "80"  |       10
 "50"  | "50"  |        0
 "330" | "340" |       10
(3 rows)

Importing Incremental Data

● After the import, the tables named as xxx_raw and the functions named as xxx_idegen can be dropped. It won't influence the graph.

● If you want to import more data using the same methods, you can retain those tables and functions. But you need to skip the records that have already imported into the graph.

○ To import a new vertex, refer those SQLs.

INSERT INTO v_user_raw VALUES
('A','U2', 'Alien', '99', v_user_idgen('A', 'U2'));

INSERT INTO toys.v_user
SELECT (age_name_to_idx_start('toys', 'v', 'v_user') + _id)::text::graphid,
row_to_json((select x FROM (select type, uid, name, age) x))::text::agtype FROM (SELECT (('A','U2', 'Alien', '97', v_user_idgen('A', 'U2'))::v_user_raw).*) raw;

SELECT setval(age_name_to_seq('toys', 'v', 'v_user'), (SELECT max(_id) + 1 FROM v_user_raw));

INSERT INTO v_product_raw(product_id, price) VALUES
('AIEEEEEEE', '999');

INSERT INTO toys."v_product"
SELECT (age_name_to_idx_start('toys', 'v', 'v_product') + _id)::text::graphid,
row_to_json((select x FROM (select product_id, price) x))::text::agtype FROM v_product_raw WHERE product_id = 'AIEEEEEEE';

SELECT setval(age_name_to_seq('toys', 'v', 'v_product'), (SELECT max(_id) + 1 FROM v_product_raw));

○ To import a new edge, refer to those SQLs. (This actually rebuild all the edges, as we do not know which edge is newly added.)

DELETE FROM e_own_raw;

INSERT INTO e_own_raw VALUES
('A', 'U2', 'AIEEEEEEE', '9999');

UPDATE e_own_raw SET _fromid = v_user_idgen(user_type, user_uid);

UPDATE e_own_raw SET _toid = v_product_raw._id
FROM v_product_raw 
WHERE v_product_raw.product_id = e_own_raw.product_id;

INSERT INTO toys."e_own" 
SELECT
(age_name_to_idx_start('toys', 'e', 'e_own') + nextval(age_name_to_seq('toys', 'e', 'e_own')))::text::graphid,
(age_name_to_idx_start('toys', 'v', 'v_user') + _fromid)::text::graphid,
(age_name_to_idx_start('toys', 'v', 'v_product') + _toid)::text::graphid,
row_to_json((select x FROM (select buy_price) x))::text::agtype
FROM e_own_raw;

● You cannot continue to use this import method if you have used the Cypher CREATE statement, because that statement generates identifiers in a different method from us, and may break the consistency of the unique ids.

Summary

This article introduces a method to quickly import nodes and edges from outside a graph database into a graph within a PolarDB for PostgreSQL database via the help of normal tables. Compared to the original MATCH ... CREATE ... approach, import performance is typically doubled.

Next Step

Start to build your own graph application with the PolarDB for PostgreSQL free program.

● PolarDB for PostgreSQL Always Free Program

Community

Graph Analysis Based on PolarDB: Fast Import of Graph Data via Tables

Abstract

Background and Introduction

Data Declarations before Importing

Create the Unique Identifier

Fast Import

Create Support Functions

Importing Nodes

Importing Edges

Result Verification

Importing Incremental Data

Summary

Next Step

Read previous post:

ApsaraDB

You may also like

Comments

ApsaraDB

Related Products

Best Practices

PolarDB for PostgreSQL

PolarDB for Xscale

PolarDB for MySQL