This article demonstrates a case of quickly identifying target groups based on their characteristics. Such cases are common in almost all industries, including the Internet, new retail, education, gaming, healthcare, and social networking. Let's take a look at some examples:
Generally, a database supports a maximum of more than 1,000 fields.
For example, if a user browsed mobile phones yesterday and placed an order later at night; but profiling data was not updated, then this user will be selected for mobile phone sellers, while actually the user is no longer in the target group.
Common database products, such as MySQL, have limited resources and cannot meet the requirement of real-time target group selection.
Table Structure: Consider the following table structure.
KEY: 用户ID
标签1:
标签2:
...
标签N:
Index: Consider the following index.
每个标签字段一个索引
Search Method: Consider the following search method.
and , or , not 组合
where 标签a and 标签b and ...
Defects: Note the following defects of this solution method:
Generally, a database supports a maximum of more than 1,000 fields.
Table Structure: Consider the following table structure.
KEY:用户ID
VALUES:标签数组
Index: Consider the following index.
标签数组字段: GIN倒排索引
Search Method: Consider the following search method.
与、或、非
where VALUES @> array[标签s] -- 与
where VALUES && array[标签s] -- 或
where not VALUES @> array[标签s] -- 非
Defects: Note the following defects of this solution method:
Table Structure: Consider the following table structure.
KEY:标签ID
VALUES: 用户bitmap
Index: Consider the following index.
标签ID字段: Btree索引
Search Method: Consider the following search method.
聚合bitmap: 与、或、非
and_agg(bitmaps) where KEY in (标签s) -- 与
or_agg(bitmaps) where KEY in (标签s) -- 或
except(bitmap1,bitmap2) -- 非
Defects: Note the following drawbacks of this solution method:
a) offset0_bitmap
, offset1gb_bitmap
, ...
Strengths: Note the following advantages of this solution method:
a) Only one B-tree index is required, and the number of index entries is small (the number of tags is equal to that of records. Generally, the number of tags is less than one million).
Prerequisites include the following operations:
1) Purchase ApsaraDB RDS for PostgreSQL 12.
2) Purchase ApsaraDB RDS for MySQL 8.0.
3) Set up a whitelist.
4) Create a user.
5) Create a database.
MySQL does not support the array type, inverted indexes, and the bitmap feature. Therefore, only Solution 1 is supported.
1) MySQL 8.0
2) PostgreSQL 12
Step 1) Create a group table, with each entry representing a group.
create table t_tag_dict (
tag int primary key, -- 标签(人群)id
info text, -- 人群描述
crt_time timestamp -- 时间
);
Step 2) Generate 100,000 groups (tags).
insert into t_tag_dict values (1, '男', now());
insert into t_tag_dict values (2, '女', now());
insert into t_tag_dict values (3, '大于24岁', now());
-- ...
insert into t_tag_dict
select generate_series(4,100000), md5(random()::text), clock_timestamp();
Step 3) Create a user profile table (N entries for each user and each record represents a tag applied to this user).
create table t_user_tag (
uid int8, -- 用户id
tag int, -- 用户对应标签(人群)
mod_time timestamp, -- 时间
primary key (tag,uid)
);
Step 4) Tag 20 million users. Each user has 64 random tags, and half of the users are men and the other half are women. The total number of entries is 1.28 billion.
create or replace function gen_rand_tag(int,int) returns setof int as
$$
select case when random() > 0.5 then 1::int else 2::int end as tag
union all
select ceil(random()*$1)::int as tag from generate_series(1,$2);
$$ language sql strict volatile;
insert into t_user_tag
select uid, gen_rand_tag(100000,63) as tag, clock_timestamp()
from generate_series(1,20000000) as uid on conflict (uid,tag) do nothing;
-- 或使用如下方法加速导入
create sequence seq;
vi test.sql
insert into t_user_tag
select uid, gen_rand_tag(100000,63) as tag, clock_timestamp()
from nextval('seq'::regclass) as uid
on conflict(tag,uid) do nothing;
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 50 -j 50 -t 400000
Step 5) Query groups that contain tags 1 and 3.
1、人群数量
select count(*) from
(
select uid from t_user_tag where tag=1
intersect
select uid from t_user_tag where tag=3
) t;
-- Time: 1494.789 ms (00:01.495)
2、提取人群ID
select uid from t_user_tag where tag=1
intersect
select uid from t_user_tag where tag=3;
-- Time: 3246.184 ms (00:03.246)
Step 6) Query groups that contain tag 1, 3, 10, or 200.
1、人群数量
select count(*) from
(
select uid from t_user_tag where tag=1
union
select uid from t_user_tag where tag=3
union
select uid from t_user_tag where tag=10
union
select uid from t_user_tag where tag=200
) t;
-- Time: 3577.714 ms (00:03.578)
2、提取人群ID
select uid from t_user_tag where tag=1
union
select uid from t_user_tag where tag=3
union
select uid from t_user_tag where tag=10
union
select uid from t_user_tag where tag=200;
-- Time: 5682.458 ms (00:05.682)
Step 7) Check the used space as shown below.
public | t_user_tag | table | postgres | 62 GB |
public | t_user_tag_pkey | index | postgres | t_user_tag | 61 GB |
1) PostgreSQL 12
Step 1) Create a group table, with each entry representing a group.
create table t_tag_dict (
tag int primary key, -- 标签(人群)id
info text, -- 人群描述
crt_time timestamp -- 时间
);
Step 2) Generate 100,000 groups (tags).
insert into t_tag_dict values (1, '男', now());
insert into t_tag_dict values (2, '女', now());
insert into t_tag_dict values (3, '大于24岁', now());
-- ...
insert into t_tag_dict
select generate_series(4,100000), md5(random()::text), clock_timestamp();
Step 3) Create a user profile table (one entry for each user and an array is used to indicate which tags a user has).
create table t_user_tags (
uid int8 primary key, -- 用户id
tags int[], -- 用户标签(人群)数组
mod_time timestamp -- 时间
);
Step 4) Create a function to generate a random tagging array.
create or replace function gen_rand_tags(int,int) returns int[] as $$
select array_agg(ceil(random()*$1)::int) from generate_series(1,$2);
$$ language sql strict;
Step 4.1) Randomly select 8 tags from the 100 thousand tags:
select gen_rand_tags(100000, 8);
gen_rand_tags
---------------------------------------------------
{43494,46038,74102,25308,99129,40893,33653,29690}
(1 row)
Step 5) Tag 20 million users. Each user has 64 random tags, and half of the users are men while the other half are women.
insert into t_user_tags
select generate_series(1,10000000),
array_append(gen_rand_tags(100000, 63),1), now();
insert into t_user_tags
select generate_series(10000001,20000000),
array_append(gen_rand_tags(100000, 63),2), now();
Step 6) Create an inverted index for the tag (group) field.
create index idx_t_user_tags_1 on t_user_tags using gin (tags);
Step 7) Query groups that contain tags 1 and 3 as shown below.
1、人群数量
select count(uid) from t_user_tags where tags @> array[1,3];
2、提取人群ID
select uid from t_user_tags where tags @> array[1,3];
Step 8) Query groups that contain tag 1, 3, 10, or 200 as shown below.
1、人群数量
select count(uid) from t_user_tags where tags && array[1,3,10,200];
2、提取人群ID
select uid from t_user_tags where tags && array[1,3,10,200];
1) PostgreSQL 12
ApsaraDB RDS for PostgreSQL 12 supports the bitmap feature. Instructions for using this feature are as follows.
安装插件 – create extension roaringbitmap;
bitmap输出格式 – set roaringbitmap.output_format='bytea|array';
bitmap取值范围 – 40亿(int4)
构造bitmap – rb_build(int4[])
bitmap转换为数组或多条记录 - rb_to_array(rb) – rb_iterate(rb)
bitmap内包含对象个数 – rb_cardinality(rb)
逻辑运算: 与、或、异或、差
SELECT roaringbitmap('{1,2,3}') | roaringbitmap('{3,4,5}');
SELECT roaringbitmap('{1,2,3}') & roaringbitmap('{3,4,5}');
SELECT roaringbitmap('{1,2,3}') # roaringbitmap('{3,4,5}');
SELECT roaringbitmap('{1,2,3}') - roaringbitmap('{3,4,5}');
聚合运算: build rb、与、或、异或
SELECT rb_build_agg(e) FROM generate_series(1,100) e;
SELECT rb_or_agg(bitmap) FROM t1;
SELECT rb_and_agg(bitmap) FROM t1;
SELECT rb_xor_agg(bitmap) FROM t1;
聚合并统计对象数(与、或、异或)
rb_or_cardinality_agg
rb_and_cardinality_agg
rb_xor_cardinality_agg
逻辑判断: 包含、相交、相等、不相等
Opperator Input Output Desc Example Result
@> roaringbitmap,roaringbitmap bool contains roaringbitmap('{1,2,3}') @> roaringbitmap('{3,4,5}') f
@> roaringbitmap,integer bool contains roaringbitmap('{1,2,3,4,5}') @> 3 t
<@ roaringbitmap,roaringbitmap bool is contained by roaringbitmap('{1,2,3}') f
<@ integer,roaringbitmap bool is contained by 3 t
&& roaringbitmap,roaringbitmap bool overlap (have elements in common) roaringbitmap('{1,2,3}') && roaringbitmap('{3,4,5}') t
= roaringbitmap,roaringbitmap bool equal roaringbitmap('{1,2,3}') = roaringbitmap('{3,4,5}') f
<> roaringbitmap,roaringbitmap bool not equal roaringbitmap('{1,2,3}') <> roaringbitmap('{3,4,5}') t
When UID exceeds int4 (4 billion), use offset for conversion. For more information, visit: https://pgxn.org/dist/pg_roaringbitmap/
Instructions for using the bitmap feature in ApsaraDB RDS for PostgreSQL to be supplemented.
Step 1) Install the plug-in.
create extension roaringbitmap;
Step 2) Create tags and the user bitmap table.
create table t_tag_users (
tagid int primary key, -- 用户标签(人群)id
uid_offset int, -- 由于userid是int8类型,roaringbitmap内部使用int4存储,需要转换一下。
userbits roaringbitmap, -- 用户id聚合的 bitmap
mod_time timestamp -- 时间
);
Step 3) Generate tags and the UID bitmap.
insert into t_tag_users
select tagid, uid_offset, rb_build_agg(uid::int) as userbits from
(
select
unnest(tags) as tagid,
(uid / (2^31)::int8) as uid_offset,
mod(uid, (2^31)::int8) as uid
from t_user_tags
) t
group by tagid, uid_offset;
Step 4) Query groups that contain tags 1 and 3 as shown below.
1、人群数量
select sum(ub) from
(
select uid_offset,rb_and_cardinality_agg(userbits) as ub
from t_tag_users
where tagid in (1,3)
group by uid_offset
) t;
2、提取人群ID
select uid_offset,rb_and_agg(userbits) as ub
from t_tag_users
where tagid in (1,3)
group by uid_offset;
Step 5) Query groups that contain tag 1, 3, 10, or 200 as shown below.
1、人群数量
select sum(ub) from
(
select uid_offset,rb_or_cardinality_agg(userbits) as ub
from t_tag_users
where tagid in (1,3,10,200)
group by uid_offset
) t;
2、提取人群ID
select uid_offset,rb_or_agg(userbits) as ub
from t_tag_users
where tagid in (1,3,10,200)
group by uid_offset;
The following table represents the case environment:
Databases | Computing specifications | Storage specifications |
---|---|---|
MySQL 8.0 | 8 cores and 32 GB memory | 1500 GB ESSD |
PostgreSQL 12 | 8 cores and 32 GB memory | 1500 GB ESSD |
The following table shows the performance comparison:
CASE (1.28 billion user/tags) (20 million, 64 tags/user) | Solution 1 (MySQL, PostgreSQL) Many-to-many: standard solution | Solution 2 (PostgreSQL) One-to-many: array and inverted index | Solution 3 (PostgreSQL) One-to-many: bitmap | Solution 3 vs. Solution 1 Percent increase |
---|---|---|---|---|
User selection speed of an AND query | 1.5 seconds | 42 milliseconds | 1.5 milliseconds | 99900% |
User selection speed of an OR query | 3.6 seconds | 3 seconds | 1.7 milliseconds | 211665% |
Used space (by tables) | 62 GB | 3,126 MB | 1,390 MB | 4467% |
Used space (by indexes) | 61 GB | 3,139 MB | 2 MB | 3123100% |
Indexing speed | - | 20 minutes | 0 seconds | - |
1) ApsaraDB RDS for PostgreSQL supports the RoaringBitmap function, which generates, compresses, and parses bitmap data very efficiently. It also supports the most common bitmap aggregation operations such as AND, OR, NOT, and XOR. Moreover, it extracts the IDs and selectivity of bitmaps and checks whether an ID exists.
2) ApsaraDB RDS for PostgreSQL is used to meet the needs of real-time precision marketing to hundreds of millions of users. It is also used for the quick tagging of users with massive data volumes including tens of millions of tags.
3) Unlike the MySQL solution, the ApsaraDB RDS for PostgreSQL solution is much more cost-effective and efficient.
Note that the following versions of ApsaraDB RDS for PostgreSQL support this feature:
Use Case Analysis - Image Recognition and Similar Feature Retrieval with PostgreSQL
ApsaraDB - October 13, 2021
digoal - May 9, 2020
Alibaba F(x) Team - June 22, 2021
ApsaraDB - February 20, 2021
Alibaba Clouder - June 12, 2018
ApsaraDB - July 4, 2022
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreA Digital and Remote Online Solution for eKYC
Learn MoreAn on-demand database hosting service for PostgreSQL with automated monitoring, backup and disaster recovery capabilities
Learn MoreMore Posts by digoal