This article introduces methods for importing a large amount of data into the graph database managed by a PolarDB-PostgreSQL database with the Apache AGE (A Graph Extension) plugin. It enables the rapid import of millions of nodes and edges into the graph database, avoiding performance bottlenecks caused by queries during edge insertion.
Alibaba Cloud's PolarDB for PostgreSQL can use Apache AGE (A Graph Extension) to manage and query graph data.
When importing data into graph databases, we often need to first convert node data and edge data into file formats such as CSV. Then, we import the vertices first, and subsequently build edges between the vertices.
However, if we use the Cypher MATCH...CREATE
statement to create edges, the corresponding nodes on both sides need to be extracted from the vertex table, which reduces the import speed. When importing edges with data volumes exceeding tens of millions, performance often does not meet the requirements.
This article introduces a new method for importing graph data. By directly generating and managing unique IDs for the nodes on the graph, we can skip the MATCH
step and directly write edge records. This approach can improve import efficiency by two orders of magnitude. The process includes the following steps:
As an example, we have constructed the following simple scenario:
● There are two types of nodes: v_user
and v_product
.
○ v_user
is identified uniquely by two columns, type
and uid
. The uid may repeat under different types
.
○ v_product
uses a single product_id
as a unique identifier, but it is a complex string that cannot be converted to a numeric ID and needs to be manually assigned a label.
● There is one type of edge, named e_own
.
● The data for both nodes and edges has been imported into a table named xxx_raw
.
● Both nodes and edges need to be imported into a graph named toys
.
The following figure is an illustration of the procedure:
● Create the extension
First create the age
extension, and add it to the search path.
create extension age;
SET search_path = "$user", public, ag_catalog;
ALTER DATABASE <dbname>
SET search_path = "$user", public, ag_catalog;
ALTER DATABASE <dbname> SET session_preload_libraries TO 'age';
● Create the graph and its labels for vertices and edges.
SELECT create_graph('toys');
SELECT create_vlabel('toys','v_user');
SELECT create_vlabel('toys','v_product');
SELECT create_elabel('toys','e_own');
● Create the normal tables to be imported into the graph. The tables can be imported from csv files or DTS tasks in real scenarios.
CREATE TABLE public.v_user_raw(type text, uid text, name text, age integer);
INSERT INTO v_user_raw VALUES
('A','U1', 'Alice', '33'),
('B','U1', 'Bob', '21');
CREATE TABLE public.v_product_raw(product_id text, price double precision);
INSERT INTO v_product_raw VALUES
('INAKLIDAS', '50'),
('ENKUCLKSD', '80'),
('IIUIHAKLS', '320'),
('SDVDSUHEE', '340');
CREATE TABLE public.e_own_raw(user_type text, user_uid text, product_id text, buy_price text);
INSERT INTO e_own_raw VALUES
('A', 'U1', 'INAKLIDAS', '45'),
('B', 'U1', 'ENKUCLKSD', '70'),
('B', 'U1', 'INAKLIDAS', '50'),
('B', 'U1', 'SDVDSUHEE', '330');
● We've introduced two different methods for the two vertex table. The first one is to design a function that convert the data columns into a unique identifier, the second is to generate an auto-increase id in the database. Their differences are:
○ The first method is faster.
○ The second method is slower, but has no requirement on the original data.
● For v_user
vertices, assume it only has 10 types (A-J), then we can use a function to combine the type and uid into a unique identifier:
CREATE OR REPLACE FUNCTION v_user_idgen(type text, uid text) RETURNS bigint AS
$$
SELECT (ASCII(type) - ASCII('A')) + substring(uid, 2)::bigint * 10
$$
language SQL;
Then, we can use this function to add a column of unique identifiers on the vertices and edge tables start or end on those vertices.
ALTER TABLE v_user_raw ADD COLUMN _id bigint UNIQUE CHECK(_id < 281474976710656 AND _id>=0);
UPDATE v_user_raw SET _id = v_user_idgen(type, uid);
ALTER TABLE e_own_raw ADD COLUMN _fromid bigint CHECK(_fromid < 281474976710656 AND _fromid>=0);
UPDATE e_own_raw SET _fromid = v_user_idgen(user_type, user_uid);
● For v_product
vertices, as its product id cannot be converted to a bigint value directly, we manually add a bigserial
column as its unique identifier. After that, we use joins to apply these unique identifiers to the edge tables.
ALTER TABLE v_product_raw ADD COLUMN _id bigserial UNIQUE CHECK(_id < 281474976710656 AND _id>=0);
CREATE INDEX ON v_product_raw(product_id);
ALTER TABLE e_own_raw ADD COLUMN _toid bigint CHECK(_toid < 281474976710656 AND _toid>=0);
UPDATE e_own_raw SET _toid = v_product_raw._id
FROM v_product_raw
WHERE v_product_raw.product_id = e_own_raw.product_id;
First we create these functions to help generating the unique identifier inside the graph database.
CREATE OR REPLACE FUNCTION age_name_to_idx_start(graph_name text, kind_name text, label_name text)
RETURNS bigint
AS 'SELECT id::bigint<<48 FROM ag_catalog.ag_label WHERE kind = kind_name and name = label_name and graph = (SELECT graphid FROM ag_catalog.ag_graph WHERE name = graph_name)'
language SQL IMMUTABLE STRICT PARALLEL SAFE;
CREATE OR REPLACE FUNCTION age_name_to_seq(graph_name text, kind_name text, label_name text)
RETURNS text
AS 'SELECT graph_name || ''.'' || seq_name::text FROM ag_catalog.ag_label WHERE kind = kind_name and name = label_name and graph = (SELECT graphid FROM ag_catalog.ag_graph WHERE name = graph_name)'
language SQL IMMUTABLE STRICT PARALLEL SAFE;
Before importing vertices, it is better to create a unique constraint to prevent the generation of duplicate IDs. However, this step can also be omitted. This index is beneficial for subsequent queries, and there is no need to delete it after use.
CREATE UNIQUE INDEX ON toys.v_user(id);
CREATE UNIQUE INDEX ON toys.v_product(id);
Then we import the data into the graph, we can select the prefered columns in the xxx_raw
tables to form the properties of the vertices.
INSERT INTO toys."v_user"
SELECT (age_name_to_idx_start('toys', 'v', 'v_user') + _id)::text::graphid,
row_to_json((select x FROM (select type, uid, name, age) x))::text::agtype FROM v_user_raw;
SELECT setval(age_name_to_seq('toys', 'v', 'v_user'), (SELECT max(_id) + 1 FROM v_user_raw));
INSERT INTO toys."v_product"
SELECT (age_name_to_idx_start('toys', 'v', 'v_product') + _id)::text::graphid,
row_to_json((select x FROM (select product_id, price) x))::text::agtype FROM v_product_raw;
SELECT setval(age_name_to_seq('toys', 'v', 'v_product'), (SELECT max(_id) + 1 FROM v_product_raw));
We create the edges of the graph using the _fromid and _toid columns. We can choose to only include the columns of edges (e.g. buy_price), or we can also include the columns of the referenced vertexes.
INSERT INTO toys."e_own"
SELECT
(age_name_to_idx_start('toys', 'e', 'e_own') + nextval(age_name_to_seq('toys', 'e', 'e_own')))::text::graphid,
(age_name_to_idx_start('toys', 'v', 'v_user') + _fromid)::text::graphid,
(age_name_to_idx_start('toys', 'v', 'v_product') + _toid)::text::graphid,
row_to_json((select x FROM (select buy_price) x))::text::agtype
FROM e_own_raw;
We first list all the graph nodes using Cypher statement.
SELECT * FROM cypher('toys', $$
MATCH (v:v_user)
RETURN v
$$) as (v agtype);
v
----------------------------------------------------------------------------------------------------------------------------
{"id": 844424930131978, "label": "v_user", "properties": {"age": "33", "uid": "U1", "name": "Alice", "type": "A"}}::vertex
{"id": 844424930131979, "label": "v_user", "properties": {"age": "21", "uid": "U1", "name": "Bob", "type": "B"}}::vertex
(2 rows)
SELECT * FROM cypher('toys', $$
MATCH (v:v_product)
RETURN v
$$) as (v agtype);
v
-------------------------------------------------------------------------------------------------------------------
{"id": 1125899906842625, "label": "v_product", "properties": {"price": "50", "product_id": "INAKLIDAS"}}::vertex
{"id": 1125899906842626, "label": "v_product", "properties": {"price": "80", "product_id": "ENKUCLKSD"}}::vertex
{"id": 1125899906842627, "label": "v_product", "properties": {"price": "320", "product_id": "IIUIHAKLS"}}::vertex
{"id": 1125899906842628, "label": "v_product", "properties": {"price": "340", "product_id": "SDVDSUHEE"}}::vertex
(4 rows)
Then we perform a query using Cypher statement on the edges to check the discount of each bill.
SELECT buy, sell, sell::int - buy::int as discount FROM cypher('toys', $$
MATCH (v1:v_user)-[e:e_own]-(v2:v_product)
WHERE v1.name = 'Bob'
RETURN e.buy_price, v2.price
$$) as (buy agtype, sell agtype);
buy | sell | discount
-------+-------+----------
"70" | "80" | 10
"50" | "50" | 0
"330" | "340" | 10
(3 rows)
● After the import, the tables named as xxx_raw
and the functions named as xxx_idegen
can be dropped. It won't influence the graph.
● If you want to import more data using the same methods, you can retain those tables and functions. But you need to skip the records that have already imported into the graph.
○ To import a new vertex, refer those SQLs.
INSERT INTO v_user_raw VALUES
('A','U2', 'Alien', '99', v_user_idgen('A', 'U2'));
INSERT INTO toys.v_user
SELECT (age_name_to_idx_start('toys', 'v', 'v_user') + _id)::text::graphid,
row_to_json((select x FROM (select type, uid, name, age) x))::text::agtype FROM (SELECT (('A','U2', 'Alien', '97', v_user_idgen('A', 'U2'))::v_user_raw).*) raw;
SELECT setval(age_name_to_seq('toys', 'v', 'v_user'), (SELECT max(_id) + 1 FROM v_user_raw));
INSERT INTO v_product_raw(product_id, price) VALUES
('AIEEEEEEE', '999');
INSERT INTO toys."v_product"
SELECT (age_name_to_idx_start('toys', 'v', 'v_product') + _id)::text::graphid,
row_to_json((select x FROM (select product_id, price) x))::text::agtype FROM v_product_raw WHERE product_id = 'AIEEEEEEE';
SELECT setval(age_name_to_seq('toys', 'v', 'v_product'), (SELECT max(_id) + 1 FROM v_product_raw));
○ To import a new edge, refer to those SQLs. (This actually rebuild all the edges, as we do not know which edge is newly added.)
DELETE FROM e_own_raw;
INSERT INTO e_own_raw VALUES
('A', 'U2', 'AIEEEEEEE', '9999');
UPDATE e_own_raw SET _fromid = v_user_idgen(user_type, user_uid);
UPDATE e_own_raw SET _toid = v_product_raw._id
FROM v_product_raw
WHERE v_product_raw.product_id = e_own_raw.product_id;
INSERT INTO toys."e_own"
SELECT
(age_name_to_idx_start('toys', 'e', 'e_own') + nextval(age_name_to_seq('toys', 'e', 'e_own')))::text::graphid,
(age_name_to_idx_start('toys', 'v', 'v_user') + _fromid)::text::graphid,
(age_name_to_idx_start('toys', 'v', 'v_product') + _toid)::text::graphid,
row_to_json((select x FROM (select buy_price) x))::text::agtype
FROM e_own_raw;
● You cannot continue to use this import method if you have used the Cypher CREATE
statement, because that statement generates identifiers in a different method from us, and may break the consistency of the unique ids.
This article introduces a method to quickly import nodes and edges from outside a graph database into a graph within a PolarDB for PostgreSQL database via the help of normal tables. Compared to the original MATCH ... CREATE ...
approach, import performance is typically doubled.
Start to build your own graph application with the PolarDB for PostgreSQL free program.
Insurance Data Analysis Based on Graph Analysis Capabilities Provided by PolarDB
ApsaraDB - April 19, 2019
ApsaraDB - December 11, 2024
ApsaraDB - October 24, 2023
ApsaraDB - July 29, 2022
Morningking - September 27, 2023
Alibaba Clouder - July 30, 2019
Follow our step-by-step best practices guides to build your own business case.
Learn MoreAlibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAlibaba Cloud PolarDB for Xscale (PolarDB-X) is a cloud-native high-performance distributed database service independently developed by Alibaba Cloud.
Learn MoreAlibaba Cloud PolarDB for MySQL is a cloud-native relational database service 100% compatible with MySQL.
Learn MoreMore Posts by ApsaraDB