By digoal
A recommendation system is a common requirement in internet applications for improving user stickiness and conversion rates.
In this article, a music website is used as an example to describe the design of a recommendation system database. The flow is as follows:
1) Songs have corresponding tags, and one song can have multiple tags. The songs that a user has listened to (or finished listening to) form a one-to-many mapping:
uid ->> tags ->> musics
2) Ranking tags by the number of distinct songs the user has listened to under each tag gives the tag hotness:
tag(count distinct music)
3) Take the top five tags and their weights.
4) From the song library under these tags, exclude the songs that the user has already listened to, then recommend new songs to the user in proportion to the recommendation weights of the remaining songs, for example, ordered by play count in descending order.
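Step 4 is not shown later in the article; a minimal sketch of such a recommendation query might look like the following, assuming a hypothetical song library table t_music(tagid, vid, play_cnt) alongside the t_like table defined below:

```sql
-- t_music(tagid, vid, play_cnt) is a hypothetical song library table
select m.vid
from t_music m
where m.tagid = :tagid                 -- one of the user's top tags
and not exists (                       -- exclude songs already listened to
  select 1 from t_like l
  where l.uid = :uid and l.tagid = m.tagid and l.vid = m.vid
)
order by m.play_cnt desc               -- recommend the most-played songs first
limit 10;
```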
The conventional design below applies to all types of databases.
create table t_like(
uid int, -- user ID
tagid int, -- song tag ID
vid int, -- song ID
mod_time timestamp, -- last update time; only updated when more than 1 day has passed since the previous update
primary key (uid,tagid,vid)
);
insert into t_like values (:uid, :tagid, :vid, :mod_time)
on conflict (uid,tagid,vid) do update
set mod_time=excluded.mod_time
where
excluded.mod_time - t_like.mod_time > interval '1 day'
;
-- Based on the songs under each tag, count the user's top 10 tags over the last day
select tagid, count(*) from t_like
where uid=:uid
and now()-mod_time < interval '1 day'
group by tagid
order by count(*) desc limit 10;
Perform stress testing:
vi test.sql
\set uid random(1,50000)
\set tagid random(1,5000)
\set vid random(1,10000000)
insert into t_like values (:uid, :tagid, :vid, now())
on conflict (uid,tagid,vid) do update
set mod_time=excluded.mod_time
where
excluded.mod_time - t_like.mod_time > interval '1 day';
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 240
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 240 s
number of transactions actually processed: 80975327
latency average = 0.095 ms
latency stddev = 0.340 ms
tps = 337396.279382 (including connections establishing)
tps = 337406.018908 (excluding connections establishing)
statement latencies in milliseconds:
0.000 \set uid random(1,50000)
0.000 \set tagid random(1,5000)
0.000 \set vid random(1,10000000)
0.094 insert into t_like values (:uid, :tagid, :vid, now())
db1=# select tagid, count(*) from t_like
where uid=1
and now()-mod_time < interval '1 day'
group by tagid
order by count(*) desc limit 10;
tagid | count
-------+-------
2519 | 4
3049 | 4
3648 | 4
1777 | 3
1352 | 3
1491 | 3
1064 | 3
572 | 3
692 | 3
301 | 3
(10 rows)
Time: 3.947 ms
Disadvantages of conventional database design
In the case of a large amount of data, the conventional design relies on aggregated queries, resulting in low efficiency.
Use HLL to store the VIDs that each UID has listened to, in the form (tag1 hll, ..., tagn hll), one HLL per tag. This design has significant advantages over the conventional design.
This design uses the PostgreSQL HLL plug-in. For more details on the HLL plug-in, see the references at the end of this article.
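As a quick illustration of the HLL primitives used throughout this section (hll_empty, hll_hash_integer, the || append operator, and hll_cardinality), assuming the hll extension is installed:

```sql
create extension if not exists hll;

select hll_cardinality(
  hll_empty() || hll_hash_integer(1)
              || hll_hash_integer(2)
              || hll_hash_integer(2)  -- duplicate; counted only once
);
-- result is approximately 2 (exact for sets this small)
```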
1) Each tag stores an HLL, which stores the VID hash value of songs that the users have listened to.
create table t_like (
uid int,
tagid int, -- tag
w1 hll, w1_mod_time timestamp, -- HLL of the hashed VIDs of songs finished on Monday, and Monday's last update time
w2 hll, w2_mod_time timestamp, -- Tuesday ...
w3 hll, w3_mod_time timestamp,
w4 hll, w4_mod_time timestamp,
w5 hll, w5_mod_time timestamp,
w6 hll, w6_mod_time timestamp,
w7 hll, w7_mod_time timestamp,
whole hll, -- entire history
primary key (uid,tagid)
);
If you are only concerned with one day's results, there is no need for so many fields; the w1 to w7 fields can vary according to your business requirements.
2) After a user finishes listening to a track, the recommendation system writes that song's information to the field corresponding to the current day. If that field already has a value and its last modified time is not today, overwrite it; otherwise, append the hash. This logic can be implemented with the INSERT ... ON CONFLICT syntax.
-- Record the listening-history hash
insert into t_like (
uid,
tagid,
w5,
w5_mod_time,
whole
)
values (
1, -- uid
200, -- tag ID
hll_hash_integer(12346)||hll_empty(), -- hashed VID listened to; append more with ||
now(),
hll_hash_integer(12346)||hll_empty() -- hashed VID listened to
)
on conflict (uid,tagid)
do update
set w5=
case
when date(t_like.w5_mod_time) <> current_date
then excluded.w5
else hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5)
end,
w5_mod_time = excluded.w5_mod_time,
whole = hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole)
where
hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5) <> coalesce(t_like.w5,hll_empty())
or
hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole) <> coalesce(t_like.whole,hll_empty())
;
Note: You can also merge updates in batches, or aggregate the updates for a single (uid, tag) pair of an individual user: union the HLL values first and then apply one update, which reduces the update rate.
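Such batching might be sketched as follows, assuming a hypothetical staging table t_like_stage(uid, tagid, vid) that buffers recent plays, and using the hll_add_agg aggregate from the HLL plug-in:

```sql
-- t_like_stage(uid, tagid, vid) is a hypothetical buffer of recent plays
insert into t_like (uid, tagid, w5, w5_mod_time, whole)
select uid, tagid,
       hll_add_agg(hll_hash_integer(vid)),  -- union all buffered VIDs into one HLL
       now(),
       hll_add_agg(hll_hash_integer(vid))
from t_like_stage
group by uid, tagid
on conflict (uid,tagid) do update
set w5 = hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5),
    w5_mod_time = excluded.w5_mod_time,
    whole = hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole);
```

This turns many single-row upserts for the same (uid, tagid) into one aggregated upsert per pair.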
3) Query the top 10 tags of UID 1 over the last two days.
Example:
select tagid,
hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ) as vids
from t_like
where uid = 1
order by 2 desc limit 10;
tagid | vids
-------+------
200 | 2
(1 row)
4) Support for indexes:
create index idx_t_like_1 on t_like (uid, hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ));
5) The query uses the index:
postgres=# explain select tagid,
hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ) as vids
from t_like
where uid = 1
order by 2 desc limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=0.11..0.15 rows=1 width=12)
-> Index Scan Backward using idx_t_like_1 on t_like (cost=0.11..0.15 rows=1 width=12)
Index Cond: (uid = 1)
(3 rows)
6) Write tens of millions of data records and perform stress testing to evaluate performance.
vi test.sql
\set uid random(1,50000)
\set tagid random(1,5000)
\set vid random(1,10000000)
insert into t_like (
uid,
tagid,
w5,
w5_mod_time,
whole
)
values (
:uid,
:tagid,
hll_hash_integer(:vid)||hll_empty(),
now(),
hll_hash_integer(:vid)||hll_empty()
)
on conflict (uid,tagid)
do update
set w5=
case
when date(t_like.w5_mod_time) <> current_date
then excluded.w5
else hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5)
end,
w5_mod_time = excluded.w5_mod_time,
whole = hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole)
where
hll_union(coalesce(t_like.w5,hll_empty()), excluded.w5) <> coalesce(t_like.w5,hll_empty())
or
hll_union(coalesce(t_like.whole,hll_empty()), excluded.whole) <> coalesce(t_like.whole,hll_empty())
;
pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 120 -f ./test.sql
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 24636321
latency average = 0.156 ms
latency stddev = 0.339 ms
tps = 205301.110313 (including connections establishing)
tps = 205354.851711 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set uid random(1,50000)
0.001 \set tagid random(1,5000)
0.000 \set vid random(1,10000000)
0.154 insert into t_like (
7) Execute multiple rounds.
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 32
number of threads: 32
duration: 120 s
number of transactions actually processed: 23988181
latency average = 0.160 ms
latency stddev = 0.335 ms
tps = 199900.214256 (including connections establishing)
tps = 199956.049571 (excluding connections establishing)
statement latencies in milliseconds:
0.001 \set uid random(1,50000)
0.000 \set tagid random(1,5000)
0.000 \set vid random(1,10000000)
0.158 insert into t_like (
8) The current number of records is 47.47 million.
postgres=# select count(*) from t_like ;
count
----------
47473788
(1 row)
9) Query the tag ranking result of a UID. The response time is 0.688 milliseconds.
postgres=# select tagid,
hll_cardinality( hll_union(coalesce(w4,hll_empty()), coalesce(w5,hll_empty())) ) as vids
from t_like
where uid = 1
order by 2 desc limit 10;
tagid | vids
-------+------
200 | 2
1413 | 1
1996 | 1
2642 | 1
3664 | 1
4340 | 1
(6 rows)
Time: 0.688 ms
Other requirements: determine whether a VID is already in the HLL, without requiring exact results, to filter out songs that the user has already listened to. If appending the hashed VID does not change the HLL, the VID is very likely already contained; if it does change the HLL, the VID is definitely not contained.
select whole || hll_hash_integer(:vid) = whole
from
t_like
where uid=:uid and tagid=:tagid;
For example:
postgres=# select whole || hll_hash_integer(1) = whole
from
t_like
where uid=1 and tagid=200; -- returns false, meaning vid 1 is not contained
?column?
----------
f
(1 row)
postgres=# select whole || hll_hash_integer(12345) = whole
from
t_like
where uid=1 and tagid=200; -- returns true, meaning vid 12345 is contained
?column?
----------
t
(1 row)
If you need exact results, separately store a copy of the precise values:
create table t_like_lossless (
uid int,
vid int,
primary key (uid,vid)
);
Queries against this table hit the primary key, so they are also very fast.
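A minimal exact-membership check against this table might look like the following (the parameter names are placeholders):

```sql
-- Exact check: has this user already listened to this song?
select exists (
  select 1 from t_like_lossless
  where uid = :uid and vid = :vid
) as listened;
```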
Alibaba Cloud will soon support the HLL plug-in, and you are welcome to try it out. Currently, you can purchase a PostgreSQL trial instance for only CNY 9.9.
PostgreSQL HLL implements approximate, hash-based set computation, enabling millisecond-level recommendation queries over tens of millions of relational records (47.47 million in this test). Compared with the conventional design, it saves storage space and reduces the response time in this test from 3.947 ms to 0.688 ms.
1) https://github.com/citusdata/postgresql-hll
2) PostgreSQL HLL (HyperLogLog) Extension for "State of the Art Cardinality Estimation Algorithm"-3
3) PostgreSQL HLL (HyperLogLog) Extension for "State of the Art Cardinality Estimation Algorithm"-2
4) PostgreSQL HLL (HyperLogLog) Extension for "State of the Art Cardinality Estimation Algorithm"-1
Professional Recommendation Database - RecDB
5) https://github.com/DataSystemsLab/recdb-postgresql