Generally, a content community website needs to record data such as articles, users, and tags.
The relations among the three also need to be recorded: which tags belong to an article, which users have read an article, which users have added an article to Favorites, which users follow other users, and which users authored an article.
The ultimate goal is to answer questions like these in real time: What other articles are the readers of this article also reading? Which users may share my interests?
At the same time, a community website typically has tens of millions of articles and close to 10 million users.
This requirement can be implemented easily with arrays and the smlar extension in PostgreSQL. The design and stress test follow.
Arrays are used to store forward and reverse relations, tags, and so on.
Smlar is used to query similar arrays (find users with similar interests).
1. User table
create table users(
uid int primary key, -- user ID
info text, -- additional information
crt_time timestamp -- time
);
2. Tag table
create table tags(
tagid int primary key, -- tag ID
info text, -- additional information
crt_time timestamp -- time
);
3. Article table
create table arts(
artid int primary key, -- article ID
info text, -- additional information and content
uids int[], -- author user IDs (an article may have multiple authors, so an array is used)
tags int[] -- tag IDs
);
1. Forward relation
1.1. Who has read the article?
create table art_uids_view (
artid int primary key,
uids int[]
);
1.2. Who has added the article to Favorites?
create table art_uids_like (
artid int primary key,
uids int[]
);
2. Reverse relation
2.1. What articles has the user read, and what tags do these articles include?
create table uid_arts_view (
uid int primary key,
arts int[],
tags int[]
);
2.2. What articles has the user added to Favorites, and what tags do these articles include?
create table uid_arts_like (
uid int primary key,
arts int[],
tags int[]
);
1. Other articles being read by users who have read this article (excluding the current article and articles the current user has already read).
The logic is as follows; it is then wrapped into a UDF:
create extension intarray ; -- install the intarray extension to simplify array addition and subtraction
select (uids - $current_uid) into v1 from art_uids_view where artid = $current_artid ; -- retrieve all users who have read the current article (excluding the current user)
select unnest(arts) as arts, count(*) as cnt into v2 from uid_arts_view where uid = any (v1) group by 1 ; -- obtain the other articles also read by the users who have read the same article
select arts into v3 from uid_arts_view where uid = $current_uid ; -- articles that the current user has read
result = v2.arts - v3 ; -- remove the articles the current user has already read, sort the rest by overlap count from large to small, and recommend them to the user
The UDF is as follows. Every step can use an index and works on point queries before aggregating, so performance is very good:
create or replace function rec_arts_view(
i1 int, -- article ID
i2 int, -- current user ID
i3 int -- limit
) returns setof int as
$$
declare
v1 int[]; -- users who have read the article (excluding the current user)
begin
-- users who have read the article
select (uids - i2) into v1 from art_uids_view where artid = i1 ;
-- From the articles also read by those users, exclude the articles the current user has read, sort by overlap count, and return the top N articles.
-- If other users who have read the article have read many other articles, sorting may take a little time.
return query
select t1.arts from
(
select unnest(arts) arts, count(*) cnt from uid_arts_view where uid = any (v1) group by 1
) t1
left join
(
select unnest(arts) arts, 1 cnt from uid_arts_view where uid= i2
) t2
on (t1.arts=t2.arts)
where t2.* is null
order by t1.cnt desc
limit i3;
end;
$$
language plpgsql strict;
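For reference, the function can be called as follows (the same call appears in the stress test below):
select rec_arts_view(1, 2, 10); -- article ID 1, current user ID 2, return 10 recommended articles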
2. Which users share my interests in terms of the articles they have read? The GIN index performs very well here.
create extension smlar;
set smlar.type='overlap';
set smlar.threshold=? ; -- set the overlap threshold
select arts into v1 from uid_arts_view where uid = ? ; -- articles I have read
select
*,
smlar( arts, v1, 'N.i' ) -- number of overlaps between articles that other users have read and articles that I have read
from
uid_arts_view
where
arts % v1 -- where the overlap >= smlar.threshold
;
3. Users who share my interests in terms of the tags in read articles
The query is similar to that of 2; a brief sketch of this tag-based variant follows this list.
4. Users who share my interests in terms of favorited articles
The query is similar to that of 2 and is omitted here.
5. Users who share my interests in terms of the tags in favorited articles
The query is similar to that of 2 and is omitted here.
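The sketch below shows the tag-based variant mentioned in 3, following the same pattern as 2. The threshold value is illustrative, and v1 here stands for the tags of the articles the current user has read (the tags column of uid_arts_view for the current uid):
set smlar.type='overlap';
set smlar.threshold=5 ; -- illustrative tag-overlap threshold
select
*,
smlar( tags, v1, 'N.i' ) -- number of overlapping tags between other users and me
from
uid_arts_view
where
tags % v1 -- where the overlap >= smlar.threshold, served by the GIN index on tags (idx_gin_4 below)
;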
Use UDFs to reduce round trips between the application and the database and to complete the following types of business logic. UDFs can be written in plpgsql, which is very simple; see the documentation: https://www.postgresql.org/docs/10/static/plpgsql.html
1. When a new article is created, its tags are generated automatically, and the tag table is updated or appended to.
insert into tags values ();
insert into arts values ();
2. When a user reads an article, the forward and reverse relations are updated (see the sketch after this list).
The article's tags are obtained from the arts table.
insert into art_uids_view values ();
insert into uid_arts_view values ();
3. When a user favorites an article, the forward and reverse relations are updated in the same way.
The article's tags are obtained from the arts table.
insert into art_uids_like values ();
insert into uid_arts_like values ();
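A minimal sketch of step 2 is shown below; the function and parameter names are assumptions, not from the original design, and step 3 (favorites) is analogous, writing to art_uids_like and uid_arts_like instead. It uses intarray's sort/uniq to keep the arrays deduplicated:
-- Hypothetical sketch of step 2; names are assumptions.
create or replace function record_view(
i_uid int, -- the reading user
i_artid int -- the article being read
) returns void as
$$
declare
v_tags int[]; -- tags of the article, taken from arts
begin
-- the tags of the article, obtained from arts
select tags into v_tags from arts where artid = i_artid;
-- forward relation: who has read the article
insert into art_uids_view (artid, uids) values (i_artid, array[i_uid])
on conflict (artid) do update
set uids = uniq(sort(art_uids_view.uids || excluded.uids));
-- reverse relation: what the user has read, and the tags involved
insert into uid_arts_view (uid, arts, tags) values (i_uid, array[i_artid], v_tags)
on conflict (uid) do update
set arts = uniq(sort(uid_arts_view.arts || excluded.arts)),
    tags = uniq(sort(uid_arts_view.tags || excluded.tags));
end;
$$
language plpgsql strict;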
-- smlar similarity query
create index idx_gin_1 on art_uids_view using gin ( uids _int4_sml_ops );
create index idx_gin_2 on art_uids_like using gin ( uids _int4_sml_ops );
create index idx_gin_3 on uid_arts_view using gin ( arts _int4_sml_ops );
create index idx_gin_4 on uid_arts_view using gin ( tags _int4_sml_ops );
create index idx_gin_5 on uid_arts_like using gin ( arts _int4_sml_ops );
create index idx_gin_6 on uid_arts_like using gin ( tags _int4_sml_ops );
Optional indexes
-- array intersection and containment queries
create index idx_gin_01 on art_uids_view using gin ( uids gin__int_ops );
create index idx_gin_02 on art_uids_like using gin ( uids gin__int_ops );
create index idx_gin_03 on uid_arts_view using gin ( arts gin__int_ops );
create index idx_gin_04 on uid_arts_view using gin ( tags gin__int_ops );
create index idx_gin_05 on uid_arts_like using gin ( arts gin__int_ops );
create index idx_gin_06 on uid_arts_like using gin ( tags gin__int_ops );
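For reference, the gin__int_ops indexes serve intarray's overlap and containment operators; for example (illustrative values):
select * from uid_arts_view where arts && array[1,2,3]; -- users who have read any of these articles
select * from uid_arts_view where arts @> array[1,2]; -- users who have read all of these articles
select * from art_uids_view where uids @> array[100]; -- articles read by user 100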
1. Generate 10 million users
insert into users select id, md5(id::text), now() from generate_series(1,10000000) t(id);
2. Generate 100,000 tags
insert into tags select id, md5(id::text), now() from generate_series(1,100000) t(id);
3. Generate 50 million articles
create or replace function gen_arr(int,int) returns int[] as
$$
select array(select ceil(random()*$1) from generate_series(1,$2))::int[];
$$
language sql strict;
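For reference, gen_arr(a, b) returns an array of b random integers in the range 1..a, for example:
select gen_arr(100, 5); -- returns something like {37,4,91,58,12} (random each time)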
insert into arts select id, md5(id::text),
gen_arr(10000000 ,3),
gen_arr(100000 ,10)
from generate_series(1,50000000) t(id);
4. Generate the forward relations: each article is read by 500 users and favorited by 50 users on average.
insert into art_uids_view select id, gen_arr(10000000, 500) from generate_series(1,50000000) t(id);
insert into art_uids_like select id, gen_arr(10000000, 50) from generate_series(1,50000000) t(id);
5. Generate the reverse relations. (In theory the reverse relations should correspond one-to-one with the forward relations; for testing convenience they are generated independently here, which does not affect the test results.)
Each user reads 1,000 articles on average, involving 500 tags, and adds 100 articles to Favorites, involving 50 tags.
insert into uid_arts_view select id, gen_arr(50000000, 1000), gen_arr(100000, 500) from generate_series(1,10000000) t(id);
insert into uid_arts_like select id, gen_arr(50000000, 100), gen_arr(100000, 50) from generate_series(1,10000000) t(id);
1. Other articles being read by users who have read this article (excluding the current article and articles the current user has already read).
select rec_arts_view(1,2,10); -- the article ID is 1, the current user ID is 2, and 10 recommended articles are returned to the current user.
The other readers of this article have read about 500,000 other articles in total; obtaining, sorting, and returning the top 10 takes about 200 milliseconds.
postgres=# select count(*) from rec_arts_view(1,4,1000000);
count
--------
497524
(1 row)
Time: 565.524 ms
postgres=# select count(*) from rec_arts_view(1,4,10);
count
-------
10
(1 row)
Time: 198.368 ms
2. Users who share my interests in terms of read articles
set smlar.type='overlap';
set smlar.threshold=10; -- set the overlap threshold
select arts into v1 from uid_arts_view where uid = 1; -- articles I have read
select
*,
smlar( arts, v1, 'N.i' ) -- number of overlaps between articles that other users have read and articles that I have read
from
uid_arts_view
where
arts % v1 -- where the overlap >= smlar.threshold
;
Time: 2.4 milliseconds.
Bitmap Heap Scan on public.uid_arts_view (cost=933.50..29296.93 rows=10000 width=72) (actual time=1.955..2.351 rows=2 loops=1)
Output: uid, arts, tags, smlar(arts, '{25213311,49886221,45108456,27929256,20760231,35023889,17507871,43947072,48578113,41561690,39557908,26852171,29310889,5029778,24892381,12174141,9191797,41397570,25795440,27806324,28635176}'::integer[], 'N.i'::text)
Recheck Cond: (uid_arts_view.arts % '{25213311,49886221,45108456,27929256,20760231,35023889,17507871,43947072,48578113,41561690,39557908,26852171,29310889,5029778,24892381,12174141,9191797,41397570,25795440,27806324,28635176}'::integer[])
Heap Blocks: exact=2
Buffers: shared hit=107
-> Bitmap Index Scan on idx_gin_3 (cost=0.00..931.00 rows=10000 width=0) (actual time=1.506..1.506 rows=2 loops=1)
Index Cond: (uid_arts_view.arts % '{25213311,49886221,45108456,27929256,20760231,35023889,17507871,43947072,48578113,41561690,39557908,26852171,29310889,5029778,24892381,12174141,9191797,41397570,25795440,27806324,28635176}'::integer[])
Buffers: shared hit=85
Planning time: 0.110 ms
Execution time: 2.378 ms
(10 rows)
The figures above (recommending articles and finding users with similar interests) reflect real-time query performance. In practice these operations can be pre-computed, because articles accumulate slowly and the set of users reading a given article does not change much. For example, if the results are refreshed once a day, recommending similar users and similar articles only requires reading the pre-computed table, improving the response to the hundredths-of-a-millisecond level. For new articles that have not yet been pre-computed, the query can run in real time (and the pre-computed table can be updated at the same time), which still responds within milliseconds.
Pre-computation can also work in another mode: when someone queries an article, decide based on the last pre-computation time whether to recompute and update the table (that is, real-time computation + cache + cache timeout).
The logic is as follows:
select xxx from pre_view_tbl where xxx=xxx; -- query the cache, and return
-- write or update the cache
if not found then
  -- write synchronously
  insert into pre_view_tbl select xxxx returning *; -- compute in real time, and return
elsif mod_time < (now() - timeout threshold) then
  -- refresh asynchronously
  delete from pre_view_tbl where xxx=xxx;
  insert into pre_view_tbl select xxxx; -- compute in real time
end if;
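A minimal sketch of this pattern for the article recommendation is shown below, assuming a hypothetical cache table pre_view_tbl, a hypothetical wrapper function get_rec_arts, and a 1-day timeout; for simplicity it refreshes the cache synchronously rather than asynchronously:
-- Hypothetical cache table and wrapper; names, structure, and timeout are assumptions.
create table pre_view_tbl (
artid int,
uid int,
rec_arts int[], -- pre-computed recommended articles
mod_time timestamp,
primary key (artid, uid)
);
create or replace function get_rec_arts(i_artid int, i_uid int, i_limit int) returns int[] as
$$
declare
v_arts int[];
v_time timestamp;
begin
-- query the cache
select rec_arts, mod_time into v_arts, v_time from pre_view_tbl where artid = i_artid and uid = i_uid;
if not found or v_time < now() - interval '1 day' then
  -- compute in real time and refresh the cache
  select array(select * from rec_arts_view(i_artid, i_uid, i_limit)) into v_arts;
  insert into pre_view_tbl values (i_artid, i_uid, v_arts, now())
  on conflict (artid, uid) do update
  set rec_arts = excluded.rec_arts, mod_time = excluded.mod_time;
end if;
return v_arts;
end;
$$
language plpgsql strict;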
Success is thirty percent development and seventy percent operation. Content websites are similar to social software: operation comes first. A key element of operation is the "circle", which gathers popularity. Circles usually form around recommendations, and recommendations are driven by behavior: which content and which users to recommend to a target user depends on observed behavior. Birds of a feather flock together.
With PostgreSQL arrays and smlar, it is very easy to implement efficient classification query and recommendations.
1. Arrays are used to store forward and reverse relations, tags, and so on.
2. Smlar is used to query similar arrays (to find users with similar interests).
This approach is convenient and efficient in social operation and content operation scenarios.
The same techniques can also be used to recommend hot users and hot articles, which has been tested in other cases; see the references at the end of this article.
https://www.postgresql.org/docs/10/static/plpgsql.html
PostgreSQL Graph Search Practices - 10 Billion-Scale Graph with Millisecond Response
PostgreSQL Practices of Elective Selection Relation in Colleges