By Digoal
User profiling is already widely applied in marketing applications. A popular solution is tagging users and discovering target users based on tag combinations.
A profile system may use wide tables and distributed systems. The role of wide tables is to store tags, and each column represents a tag. In fact, this design is not necessarily the best or only one. This document will introduce and discuss the efficiency of an alternative design concept based on PostgreSQL.
Let's assume there is a 2B (to-business) real-time user recommendation system in which each APPID represents one business. The service data consists of APPIDs (business IDs), USERIDs (end-user IDs), and TAGs. The service does not exchange any data across APPIDs; it only provides user recommendations within a single APPID.
Queries are restricted to a selected APPID and a TAG combination, and the USERIDs that meet the search criteria are found and pushed to the business user. There are about 1 billion records in total, a single APPID has at most about 100 million users, and the total number of TAGs is designed to be 10,000.
Query requirements: within a selected APPID, find the USERIDs whose TAGs match a given combination, for example containing all of the specified TAGs, containing at least one of them, or containing none of them.
I'll introduce four possible solutions and discuss the advantages and disadvantages of each.
The width of a table is usually restricted. Taking PostgreSQL as an example, a record is not allowed to span pages (a variable-length field wider than a page is moved to TOAST storage, and only a pointer is kept in the page), which limits the table width.
For example, an 8 KB data block can hold nearly 2,000 columns, and PostgreSQL additionally caps a table at 1,600 columns.
If a column is designed for each TAG, a wide table with 10,000 columns is needed.
Similar restrictions seem to exist in other databases. A wide table with 10,000 columns will never meet the requirements unless the database kernel is modified.
The TAGs can be split across multiple tables that all use APPID+USERID as the PK, so that an unlimited number of TAGs can be stored. For example, with 1,000 TAG columns per table, only 10 tables are needed for 10,000 TAGs.
create table t_tags_1(appid int, userid int8, tag1 boolean, tag2 boolean, ...... tag1000 boolean);
.....
create table t_tags_10(appid int, userid int8, tag9001 boolean, tag9002 boolean, ...... tag10000 boolean);
To improve efficiency, indexes should be created for each tag field, resulting in 10,000 indexes.
If the TAG combination spans multiple tables, it will need a JOIN operation.
Advantage
No special optimization is required, and almost all databases support this solution.
Disadvantage
Performance may suffer, especially when a query is based on a combination of multiple conditions, such as (tag1 and tag2 and (tag4 or tag5) or not tag6).
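For illustration, a combined-condition query spanning two of the split tables defined above might look like the following sketch (the specific tags chosen are arbitrary and only for illustration):
-- Users of appid 1 matching a tag combination that spans t_tags_1 (tag1..tag1000)
-- and t_tags_10 (tag9001..tag10000), so the two tables must be joined on (appid, userid).
select t1.userid
from t_tags_1 t1
join t_tags_10 t10
  on t10.appid = t1.appid
 and t10.userid = t1.userid
where t1.appid = 1
  and t1.tag1
  and t1.tag2
  and (t1.tag4 or t1.tag5)
  and not t10.tag9001;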
An array is used to replace the TAG columns. This requires the database to support arrays and provide efficient retrieval on them, and PostgreSQL fits this requirement perfectly.
Data structure
APPID, USERID, TAG[] array
A single array field has a maximum size of 1 GB (enough for roughly 260 million 4-byte TAG IDs)
Partition by APPID; within an APPID, shard randomly
Query syntax
array1 @> array2 (array1 contains every element of array2: the user has all of the specified TAGs)
array1 && array2 (the arrays overlap: the user has at least one of the specified TAGs)
not array1 && array2 (no overlap: the user has none of the specified TAGs)
Example
create table t_arr(appid int, userid int8, tags int2[]) with(parallel_workers=128);
create index idx_t_array_tags on t_arr using gin (tags) with (fastupdate=on, gin_pending_list_limit= 1024000000);
create index idx_t_arr_uid on t_arr(userid);
gin_pending_list_limit is measured in KB; 819,200 KB can buffer about 10,000 records of 80 KB arrays before the pending list is merged into the index, and the value can be adjusted. Each USERID carries up to 10,000 TAGs.
insert into t_arr select 1, 2000000000*random(),(select array_agg(10000*random()) from generate_series(1,10000));
nohup pgbench -M prepared -n -r -f ./test.sql -P 1 -c 50 -j 50 -t 2000000 > ./arr.log 2>&1 &
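A tag-combination query against this table could then take the following form (a sketch only; the TAG IDs are arbitrary):
-- Users of appid 1 that have both tag 10 and tag 20, at least one of tags 30/40,
-- and none of tags 50/60. The GIN index accelerates @> and &&;
-- the negated condition is applied as a filter.
select userid
from t_arr
where appid = 1
  and tags @> array[10,20]::int2[]
  and tags && array[30,40]::int2[]
  and not (tags && array[50,60]::int2[]);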
Advantages
This solution is capable of storing more than enough TAGs, into the hundreds of millions (even 10,000 TAGs are considered a lot in the industry).
It supports indexed searches on arrays (via the GIN index).
Disadvantages
The data size is still large: a record with 10,000 TAGs occupies about 80 KB, so 100 million records take about 8 TB, and the indexes need roughly another 8 TB.
Not all databases support arrays.
BITs are used to store TAGs: each bit position corresponds to a TAG, and the value 1 or 0 indicates whether or not the user has that TAG.
Data structure
APPID, USERID, TAG bit stream
A single BIT field supports a BIT stream with a maximum length of 1 GB (supporting 8.5 billion TAGs)
Each BIT represents a TAG
Partition by APPID; within an APPID, shard randomly
Query syntax
bitand(bit1,bit2) = bit2 (every bit of bit2 is set in bit1: the user has all of the specified TAGs)
bitand(bit1,bit2) > 0 (at least one common bit is set: the user has at least one of the specified TAGs)
bitand(bit1,bit2) = zerobit(10000) (no common bits: the user has none of the specified TAGs)
Example
create table t_bit(appid int, userid int8, tags varbit) ;
create index idx_t_bit_uid on t_bit(userid);
Each USERID corresponds to 10,000 random bit values
date;for ((i=1;i<=50;i++)); do psql -c "insert into t_bit select 1, 2000000000*random(), \
(select (string_agg(mod((2*random())::int,2)::text,''))::varbit from generate_series(1,10000)) tags \
from generate_series(1,2000000)" ; done; date
The loaded table is about 127 GB.
Insert speed
(Batch) 245,000 rows per second, 326 MB/s
TAG update and delete speeds
create or replace function randbit(int) returns varbit as
$$
select (string_agg(mod((2*random())::int,2)::text,''))::varbit from generate_series(1,$1);
$$
language sql strict volatile;
create or replace function zerobit(int) returns varbit as
$$
select (string_agg('0',''))::varbit from generate_series(1,$1);
$$
language sql strict immutable;
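bitand(), used in the query syntax above and in the query test below, is not a PostgreSQL built-in. Assuming it is simply a thin wrapper over the built-in & operator on bit strings (which requires operands of equal length), a minimal sketch is:
create or replace function bitand(varbit, varbit) returns varbit as
$$
-- bitwise AND of two equal-length bit strings (relies on the built-in & operator)
select $1 & $2;
$$
language sql strict immutable;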
update t_bit set tags=randbit(10000) where userid=:id;
Update or delete 10,000 records per second with a response time of about 4 ms
Query speed
do language plpgsql
$$
declare
sql text;
bit1 varbit := randbit(10000);
bit2 varbit := randbit(10000);
bit3 varbit := randbit(10000);
zbit varbit := zerobit(10000);
begin
set max_parallel_workers_per_gather =27;
sql := 'select * from t_bit where bitand(tags,'''||bit1::text||''')='''||bit1::text||''' and bitand(tags,'''||bit2::text||''')>bit''0'' and bitand(tags,'''||bit3::text||''')='''||zbit::text||'''';
raise notice '%', sql;
-- execute sql;
end;
$$
;
The query takes about 17 seconds with a parallel degree of 27.
Advantages
It can handle far more TAGs than needed, up to about 8.5 billion (even 10,000 TAGs are considered a lot in the industry).
10,000 TAGs occupy 10,000 bits, about 1.25 KB per record.
100 million records take only about 120 GB, with no index required.
Disadvantages
It does not use indexes, so query performance can only be improved through parallel computing.
PostgreSQL 9.6 supports parallel query execution, which allows responses within about 20 seconds even with 100 million users. It may, however, consume significant CPU resources, so the degree of parallelism (and hence concurrent query throughput) is limited.
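For reference, parallelism of this degree presumes the table and instance are configured to allow it; a sketch of the assumed settings (the values are illustrative):
-- allow up to 27 workers for a parallel scan of t_bit
alter table t_bit set (parallel_workers = 27);
set max_parallel_workers_per_gather = 27;
-- instance-level limits must also be large enough, for example in postgresql.conf:
-- max_worker_processes = 128        (requires a restart)
-- max_parallel_workers = 128        (PostgreSQL 10+; 9.6 only has max_worker_processes)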
So is there a solution that both saves resources and offers the efficiency we need?
The answer is yes.
Because queries usually retrieve matching USERIDs based on combined TAG conditions, a reverse design can make queries more efficient: the TAG becomes the dimension and the USERIDs become the content of each record.
We need to maintain users under each tag, so there may be a significant amount of data that needs to be updated. A design that combines incremental merging and read time merging should be considered.
The data flow is quick and executed according to the following:
Data > detail list > incremental aggregation > appid, tagid, userid_bits
At read time, two parts of the data are merged: the aggregated per-tag result and the not-yet-merged detail records.
Of course, if the aggregation delay can be controlled to within 1 minute and the service can tolerate that delay, there is no need to merge at query time; results can be read directly from the aggregated data, which is much faster.
Query:
userids (bitand) userids (intersection: users that have all of the TAGs)
userids (bitor) userids (union: users that have at least one of the TAGs)
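A minimal sketch of this reverse design follows. The table, column, and query shapes are illustrative assumptions, not the author's exact implementation; a dictionary assigns each USERID a fixed bit position, and the per-tag bitmaps are assumed to be maintained at equal length:
-- USERID dictionary: maps each user of an APPID to a bit position (1-based here).
create table t_userid_dict (appid int, userid int8, pos int, primary key (appid, userid));
-- One bitmap of users per (appid, tagid): bit "pos" is 1 if that user has the tag.
create table t_tag_users (appid int, tagid int, userid_bits varbit, primary key (appid, tagid));
-- Users of appid 1 that have both tag 10 and tag 20 but not tag 30:
-- AND/NOT the per-tag bitmaps, then map the set bit positions back to USERIDs.
with target as (
  select (select userid_bits from t_tag_users where appid = 1 and tagid = 10)
       & (select userid_bits from t_tag_users where appid = 1 and tagid = 20)
       & ~(select userid_bits from t_tag_users where appid = 1 and tagid = 30) as bits
)
select d.userid
from target t
join t_userid_dict d
  on d.appid = 1
 and substring(t.bits from d.pos for 1) = B'1';
At read time, the USERIDs found in the not-yet-aggregated detail records would additionally be merged into this result, as described above.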
Advantage
Query efficiency is very high with this query-oriented design, because the storage dimension matches the query dimension.
Disadvantages
Because bits are used to represent USERIDs, relationships between positions and USERIDs must be mapped.
A USERID dictionary table needs to be maintained, and incremental merger measures are required to reduce data update frequency.
There will be a delay, which can generally be controlled to within 1 minute; if the service can tolerate that delay, this solution works very well.
Service USERIDs usually expire periodically (for example, inactive USERIDs may gradually expire over time), so the USERID dictionary requires periodic maintenance, and the USERID bit information needs to be updated accordingly.
Figure: overall structure of the reverse design (detail list > incremental aggregation > per-tag USERID bitmaps, merged at read time).