Generally, a content community website needs to record data such as articles, users, and tags.
The relations among the three must also be recorded: a tag belongs to an article, a user has read an article, a user has added an article to Favorites, a user follows another user, and a user is the author of an article.
The ultimate goal is to answer questions such as: What other articles are the readers of this article also reading? Which users may share my interests?
A community website of this kind typically has tens of millions of articles and close to 10 million users.
This requirement can be implemented easily with arrays and the smlar extension in PostgreSQL. The design and stress test follow below.
Arrays are used to store forward and reverse relations, tags, and so on.
Smlar is used to query similar arrays (find users with similar interests).
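As a quick illustration of these two building blocks (a sketch, assuming both extensions are installed; the sample arrays are made up), intarray provides the array arithmetic operators used later in this article, and smlar computes the overlap between two arrays:
create extension if not exists intarray;
create extension if not exists smlar;
select '{1,2,3,4}'::int[] - 3;                             -- intarray: remove an element, returns {1,2,4}
select '{1,2,3,4}'::int[] & '{3,4,5}'::int[];              -- intarray: intersection, returns {3,4}
select smlar('{1,2,3,4}'::int[], '{3,4,5}'::int[], 'N.i'); -- smlar: number of shared elements, returns 2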
1. User table
create table users(
uid int primary key, -- user ID
info text, -- additional information
crt_time timestamp -- time
);
2. Tag table
create table tags(
tagid int primary key, -- tag ID
info text, -- additional information
crt_time timestamp -- time
);
3. Article table
create table arts(
artid int primary key, -- article ID
info text, -- additional information and content
uids int[], -- user IDs (the author may be multiple users, so an array is used)
tags int[] -- tag IDs
);
1. Forward relation
1.1. Who has read the article?
create table art_uids_view (
artid int primary key,
uids int[]
);
1.2. Who has added the article to Favorites?
create table art_uids_like (
artid int primary key,
uids int[]
);
2. Reverse relation
2.1. What articles has the user read, and what tags are included in these articles?
create table uid_arts_view (
uid int primary key,
arts int[],
tags int[]
);
2.2. What articles has the user added to Favorites, and what tags are included in these articles?
create table uid_arts_like (
uid int primary key,
arts int[],
tags int[]
);
1. Other articles that users who have read this article are also reading (excluding the current article and the articles I have already read).
The logic is as follows; it is then written as a UDF:
create extension intarray ; -- create the intarray extension to simplify array addition and subtraction
select (uids - $current_uid) into v1 from art_uids_view where artid = $current_artid ; -- retrieve all users who have read the current article (except the current user)
select unnest(arts) as arts, count(*) as cnt into v2 from uid_arts_view where uid = any (v1) group by 1 ; -- obtain the other articles that those users have also read, with the number of overlapping readers
select arts into v3 from uid_arts_view where uid = $current_uid ; -- articles that the current user has read
result = v2.arts - v3 ; -- remove the articles the current user has already read, sort the remainder by overlap count in descending order, and recommend them to the user
The UDF is as follows. Every step can use an index (point queries followed by aggregation over a small result set), so performance is very good:
create or replace function rec_arts_view(
i1 int, -- article ID
i2 int, -- current user ID
i3 int -- limit
) returns setof int as
$$
declare
res int[]; -- result
v1 int[]; -- users who have read the article
begin
-- users who have read the article
select (uids - i2) into v1 from art_uids_view where artid = i1 ;
-- From the articles that other users, who have read this article, have also read, exclude the articles that the current user has read, then sort them by repetition rate, and finally return N articles.
-- If other users who have read the article have read many other articles, sorting may take a little time.
return query
select t1.arts from
(
select unnest(arts) arts, count(*) cnt from uid_arts_view where uid = any (v1) group by 1
) t1
left join
(
select unnest(arts) arts, 1 cnt from uid_arts_view where uid= i2
) t2
on (t1.arts=t2.arts)
where t2.* is null
order by t1.cnt desc
limit i3;
end;
$$
language plpgsql strict;
2. Finding users who share my interests, based on the articles they have read. The GIN index performs very well here.
create extension smlar;
set smlar.type='overlap';
set smlar.threshold=? ; -- set the overlap threshold
select arts into v1 from uid_arts_view where uid = ? ; -- articles I have read
select
*,
smlar( arts, v1, 'N.i' ) -- number of overlaps between articles that other users have read and articles that I have read
from
uid_arts_view
where
arts % v1 -- overlap (number of shared elements) >= smlar.threshold
;
3. Users who share my interests in terms of the tags of read articles
This is similar to 2 and is omitted here.
4. Users who share my interests in terms of favorited articles
This is similar to 2 and is omitted here.
5. Users who share my interests in terms of the tags of favorited articles
This is similar to 2; a sketch is given below.
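As an illustration of case 5, the query below is a sketch (the current user ID 1 and the threshold value are assumptions): it compares the tags of my favorited articles with those of every other user. The scalar subquery is evaluated once, so the % filter should be able to use the GIN index on uid_arts_like.tags.
set smlar.type='overlap';
set smlar.threshold=5 ; -- require at least 5 shared tags (an assumed value)
select
  t.*,
  smlar( t.tags, (select tags from uid_arts_like where uid = 1), 'N.i' ) -- number of shared tags
from
  uid_arts_like t
where
  t.tags % (select tags from uid_arts_like where uid = 1) -- overlap >= smlar.threshold
  and t.uid <> 1 -- exclude myself
;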
UDFs are used to reduce client-server interactions and to implement the following types of business logic. UDFs can be written in plpgsql, which is very simple; see the documentation: https://www.postgresql.org/docs/10/static/plpgsql.html
1. When a new article is created, tags are generated automatically and the tag table is updated or appended to.
insert into tags values ();
insert into arts values ();
2. When an article is read, the forward and reverse relations are updated (a sketch is given after this list).
The tag information of the article is obtained from arts.
insert into art_uids_view values ();
insert into uid_arts_view values ();
3. When an article is favorited, the forward and reverse relations are updated.
The tag information of the article is obtained from arts.
insert into art_uids_like values ();
insert into uid_arts_like values ();
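A minimal sketch of step 2, recording a read event (the function name record_view, the upserts with ON CONFLICT, and the use of the intarray union operator | are my assumptions; the original only outlines the inserts):
create or replace function record_view(i_artid int, i_uid int) returns void as
$$
declare
  v_tags int[];
begin
  -- obtain the article's tag information from arts
  select tags into v_tags from arts where artid = i_artid;
  -- forward relation: append the reader to the article's uids array
  insert into art_uids_view values (i_artid, array[i_uid])
  on conflict (artid) do update set uids = art_uids_view.uids | excluded.uids;
  -- reverse relation: append the article and its tags to the user's arrays
  insert into uid_arts_view values (i_uid, array[i_artid], v_tags)
  on conflict (uid) do update set
    arts = uid_arts_view.arts | excluded.arts,
    tags = uid_arts_view.tags | excluded.tags;
end;
$$
language plpgsql strict;
select record_view(1, 2); -- example: user 2 reads article 1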
-- smlar similarity query
create index idx_gin_1 on art_uids_view using gin ( uids _int4_sml_ops );
create index idx_gin_2 on art_uids_like using gin ( uids _int4_sml_ops );
create index idx_gin_3 on uid_arts_view using gin ( arts _int4_sml_ops );
create index idx_gin_4 on uid_arts_view using gin ( tags _int4_sml_ops );
create index idx_gin_5 on uid_arts_like using gin ( arts _int4_sml_ops );
create index idx_gin_6 on uid_arts_like using gin ( tags _int4_sml_ops );
Optional indexes
-- intarray indexes for array intersection and inclusion queries
create index idx_gin_01 on art_uids_view using gin ( uids gin__int_ops );
create index idx_gin_02 on art_uids_like using gin ( uids gin__int_ops );
create index idx_gin_03 on uid_arts_view using gin ( arts gin__int_ops );
create index idx_gin_04 on uid_arts_view using gin ( tags gin__int_ops );
create index idx_gin_05 on uid_arts_like using gin ( arts gin__int_ops );
create index idx_gin_06 on uid_arts_like using gin ( tags gin__int_ops );
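These intarray indexes support containment and overlap queries such as the following sketch (the IDs are made up):
select artid from art_uids_view where uids @> array[123] ; -- articles that user 123 has read (containment)
select uid from uid_arts_view where arts && array[1,2,3] ; -- users who have read any of articles 1, 2 or 3 (overlap)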
1. Generate 10 million users
insert into users select id, md5(id::text), now() from generate_series(1,10000000) t(id);
2. Generate 100,000 tags
insert into tags select id, md5(id::text), now() from generate_series(1,100000) t(id);
3. Generate 50 million articles
create or replace function gen_arr(int,int) returns int[] as
$$
-- generate an array of $2 random integers in the range 1..$1
select array(select ceil(random()*$1) from generate_series(1,$2))::int[];
$$
language sql strict;
insert into arts select id, md5(id::text),
gen_arr(10000000 ,3),
gen_arr(100000 ,10)
from generate_series(1,50000000) t(id);
4. Generate forward relations: each article is read by 500 users and favorited by 50 users on average.
insert into art_uids_view select id, gen_arr(10000000, 500) from generate_series(1,50000000) t(id);
insert into art_uids_like select id, gen_arr(10000000, 50) from generate_series(1,50000000) t(id);
5. Generate reverse relations. (In theory, the reverse relations should correspond one-to-one with the forward relations; for the convenience of testing, random data is generated here instead, which does not affect the test results.)
Each user reads 1,000 articles on average, involving 500 tags, and favorites 100 articles, involving 50 tags.
insert into uid_arts_view select id, gen_arr(50000000, 1000), gen_arr(100000, 500) from generate_series(1,10000000) t(id);
insert into uid_arts_like select id, gen_arr(50000000, 100), gen_arr(100000, 50) from generate_series(1,10000000) t(id);
1. Other articles that users who have read this article are also reading (excluding the current article and the articles I have read).
select rec_arts_view(1,2,10); -- the article ID is 1, the current user ID is 2, and 10 recommended articles are returned to the current user.
The other readers of this article have read about 500,000 other articles in total. Returning the top 10 takes about 200 milliseconds; returning the full set takes about 565 milliseconds.
postgres=# select count(*) from rec_arts_view(1,4,1000000);
count
--------
497524
(1 row)
Time: 565.524 ms
postgres=# select count(*) from rec_arts_view(1,4,10);
count
-------
10
(1 row)
Time: 198.368 ms
2. Users who share my interests in terms of read articles
set smlar.type='overlap';
set smlar.threshold=10; -- set the overlap threshold
select arts into v1 from uid_arts_view where uid = 1; -- articles I have read
select
*,
smlar( arts, v1, 'N.i' ) -- number of overlaps between articles that other users have read and articles that I have read
from
uid_arts_view
where
arts % v1 -- overlap (number of shared elements) >= smlar.threshold
;
Time: 2.4 milliseconds.
Bitmap Heap Scan on public.uid_arts_view (cost=933.50..29296.93 rows=10000 width=72) (actual time=1.955..2.351 rows=2 loops=1)
Output: uid, arts, tags, smlar(arts, '{25213311,49886221,45108456,27929256,20760231,35023889,17507871,43947072,48578113,41561690,39557908,26852171,29310889,5029778,24892381,12174141,9191797,41397570,25795440,27806324,28635176}'::integer[], 'N.i'::text)
Recheck Cond: (uid_arts_view.arts % '{25213311,49886221,45108456,27929256,20760231,35023889,17507871,43947072,48578113,41561690,39557908,26852171,29310889,5029778,24892381,12174141,9191797,41397570,25795440,27806324,28635176}'::integer[])
Heap Blocks: exact=2
Buffers: shared hit=107
-> Bitmap Index Scan on idx_gin_3 (cost=0.00..931.00 rows=10000 width=0) (actual time=1.506..1.506 rows=2 loops=1)
Index Cond: (uid_arts_view.arts % '{25213311,49886221,45108456,27929256,20760231,35023889,17507871,43947072,48578113,41561690,39557908,26852171,29310889,5029778,24892381,12174141,9191797,41397570,25795440,27806324,28635176}'::integer[])
Buffers: shared hit=85
Planning time: 0.110 ms
Execution time: 2.378 ms
(10 rows)
The results above (recommending articles and finding users with similar interests) reflect real-time query performance. In practice, these operations can also be pre-computed, because the number of new articles grows at a moderate rate and the set of users reading a given article does not change much. For example, if the results are refreshed once a day, then recommending users with similar interests and similar articles only requires a direct lookup of the pre-computed results, which improves response time to the 0.0N millisecond level. For new articles that have not been pre-computed yet, a real-time query can be run (and the pre-computed table updated accordingly), which still responds within milliseconds.
Pre-computing can also operate in another mode: when someone queries this article, decide whether to recompute and update the table based on the time of the last pre-computation. (That is, the real-time computing + cache + cache timeout mode.)
The logic is as follows:
select xxx from pre_view_tbl where xxx=xxx; -- query the cache and return it
-- write or update the cache
if not found then
  -- write synchronously
  insert into pre_view_tbl select xxxx returning *; -- compute in real time and return
elsif mod_time < (now() - timeout threshold) then
  -- refresh asynchronously
  delete from pre_view_tbl where xxx=xxx;
  insert into pre_view_tbl select xxxx; -- compute in real time
end if;
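A concrete sketch of this cache pattern (the pre_view_tbl columns, the wrapper function get_rec_arts, and the TTL parameter are all assumptions; only the simpler synchronous refresh is shown):
create table pre_view_tbl (
  artid int, -- article being viewed
  uid int, -- user the recommendation is for
  rec_arts int[], -- cached recommendation result
  mod_time timestamp, -- time of the last computation
  primary key (artid, uid)
);
create or replace function get_rec_arts(i_artid int, i_uid int, i_limit int, i_ttl interval) returns int[] as
$$
declare
  v_res int[];
  v_mod timestamp;
begin
  select rec_arts, mod_time into v_res, v_mod from pre_view_tbl where artid = i_artid and uid = i_uid;
  if not found or v_mod < (now() - i_ttl) then
    -- cache miss or stale entry: compute in real time and upsert the cache
    select array(select rec_arts_view(i_artid, i_uid, i_limit)) into v_res;
    insert into pre_view_tbl values (i_artid, i_uid, v_res, now())
    on conflict (artid, uid) do update
      set rec_arts = excluded.rec_arts, mod_time = excluded.mod_time;
  end if;
  return v_res;
end;
$$
language plpgsql strict;
select get_rec_arts(1, 2, 10, interval '1 day'); -- article 1, user 2, 10 recommendations, 1-day cache timeout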
As the saying goes: thirty percent development, seventy percent operation. Content websites are similar to social software, and operation is the priority. A key element of operation is the "circle", which gathers popularity. Circles form largely through recommendations, and recommendations come from user behavior: which content and which users to recommend to a target user all depend on behavior. This is the principle that birds of a feather flock together.
With PostgreSQL arrays and smlar, it is very easy to implement efficient classification queries and recommendations.
1. Arrays are used to store forward and reverse relations, tags, and so on.
2. Smlar is used to query similar arrays (to find users with similar interests).
It is very convenient and efficient in social operation and content operation scenarios.
They can also be used to recommend hot users and hot articles easily, which has been tested in other cases; see the references at the end of this article.
https://www.postgresql.org/docs/10/static/plpgsql.html
PostgreSQL Graph Search Practices - 10 Billion-Scale Graph with Millisecond Response
PostgreSQL Practices of Elective Selection Relation in Colleges