Typically when talking about product combinations, you would think of products that are closely related in terms of form and function. For instance, in China, stuffed buns, soybean milk, and tea eggs are common breakfast items. So it wouldn't come as a surprise if these products are promoted together in a supermarket.
But then, you've probably heard of the famous beer-and-diapers correlation. Why do these seemingly unrelated items make such a good product combination?
We can find answers to such questions in accumulated order data. The question "how to use PostgreSQL to find the best product combination" was raised by a friend during the PostgreSQL community activity on April 8.
In fact, this problem can also be solved with a dedicated recommendation database, such as RecDB, that supports a variety of recommendation algorithms.
However, this article does not use RecDB to solve the problem. Instead, it reaches a conclusion using traditional statistics.
The statistical method covered in this article can only be used to calculate the best combinations of directly related items (items that appear in the same order).
To calculate combinations of indirectly associated items (for example, if user A bought items 1 and 2 and user B bought items 2 and 3, then items 1 and 3 are indirectly related), you need a recommendation algorithm such as those in RecDB, or a similar graph search method.
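As a rough illustration only, a one-hop indirect association can be approximated in plain SQL. This is a minimal sketch, not a substitute for a recommendation algorithm: it assumes the buy table created in the next section, and item ID 1 stands in for an arbitrary item.
with direct as (
  -- items that appear in the same order as item 1 (item 1 itself included)
  select distinct unnest(item_id) as item
  from buy
  where item_id @> array[1]
)
select distinct unnest(b.item_id) as indirect_item
from buy b
join direct d on b.item_id @> array[d.item]
where not (b.item_id @> array[1])  -- only orders that do not contain item 1
except
select item from direct;           -- drop item 1 and its direct associates
The rest of this article focuses on the direct case.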
Assume that there are 100,000 item IDs in total, and simulate a batch of order or shopping-cart records, each containing 5 to 15 items. Create a total of around 11 million such records.
Create a table
postgres=# create unlogged table buy (pay_id int8, item_id int[]);
CREATE TABLE
Create a function that inserts rows into the buy table (each row contains an array of 5 to 15 items)
create or replace function f() returns void as
$$
declare
begin
for i in 5..15 loop
insert into buy (item_id) select array_agg((100000*random())::int8) from generate_series(1,i);
end loop;
end;
$$
language plpgsql strict;
Use pgbench to generate 11 million records (1,000,000 calls to f(), each call inserting 11 rows)
vi test.sql
select f();
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 100 -j 100 -t 10000
transaction type: ./test.sql
scaling factor: 1
query mode: prepared
number of clients: 100
number of threads: 100
number of transactions per client: 10000
number of transactions actually processed: 1000000/1000000
latency average = 1.155 ms
latency stddev = 1.814 ms
tps = 85204.625725 (including connections establishing)
tps = 85411.351807 (excluding connections establishing)
script statistics:
- statement latencies in milliseconds:
1.158 select f();
Confirm that the data has been successfully written
postgres=# select count(*) from buy;
count
----------
11000000
(1 row)
postgres=# select * from buy limit 10;
pay_id | item_id
--------+--------------------------------------------------------------
| {6537,76804,33612,75580,8021}
| {72437,66015,2939,56128,7056}
| {40983,79581,15954,21039,6702,90279}
| {93626,8337,13416,69371,4366,75868}
| {84611,56893,25201,74038,59337,62045,59178}
| {97422,48801,69714,77056,17059,79714,21598}
| {42997,50834,57214,52866,83656,76342,5639,93416}
| {53543,24369,31552,28654,38516,63657,86564,11483}
| {58873,23162,23369,55091,32046,29907,31895,65658,5487}
| {39916,6641,85068,55870,27679,91770,46150,12290,48662,71350}
(10 rows)
postgres=# create index idx_buy_item on buy using gin(item_id);
The purpose of splitting is to break the item array of an order into smaller sets. For example, an order containing five items can be split into 10 (4+3+2+1) sets of 2 items.
{6537,76804,33612,75580,8021}
Split the order into the following sets
{6537,76804}
{6537,33612}
{6537,75580}
{6537,8021}
{76804,33612}
{76804,75580}
{76804,8021}
{33612,75580}
{33612,8021}
{75580,8021}
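As a cross-check, and purely as a minimal sketch using the sample order above, the same 10 two-item sets can also be produced with a plain self-join over unnest(), without recursion:
with o(item) as (
  select unnest('{6537,76804,33612,75580,8021}'::int[])
)
select array[a.item, b.item] as combo
from o a
join o b on a.item < b.item;  -- "<" keeps each unordered pair exactly once
The recursive function below generalizes this to sets of any size.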
Create a function to complete the split work
Use a recursive query for recombination
Example
WITH RECURSIVE
t(i) AS (
SELECT * FROM unnest('{A,B,C}'::char[])
),
cte AS (
SELECT i AS combo, i, 1 AS ct
FROM t
UNION ALL
SELECT cte.combo || t.i, t.i, ct + 1
FROM cte, t
WHERE ct <= 3 -- Allow up to 3 extension steps
AND position(t.i in cte.combo) = 0 -- The newly added character is not already in the combination
)
SELECT ARRAY(SELECT combo FROM cte ORDER BY ct, combo) AS result;
result
---------------------------------------------------
{A,B,C,AB,AC,BA,BC,CA,CB,ABC,ACB,BAC,BCA,CAB,CBA}
(1 row)
Assume that the array doesn't contain duplicate elements
create or replace function array_regroup(
i_arr int[], -- Input array
i_elems int -- Size of each returned set
) returns setof int[] as
$$
declare
v_arr_len int := array_length(i_arr, 1); -- Length of the input array
begin
-- Protection
if i_elems > v_arr_len then
raise notice 'requested group length % exceeds array length %', i_elems, v_arr_len;
return;
elsif i_elems = v_arr_len then
return next i_arr;
return;
elsif i_elems = 1 then
return query select array(select i) from unnest(i_arr) t(i);
return;
end if;
return query
WITH RECURSIVE
t(i) AS (
select array(select i) from unnest(i_arr) t(i)
),
cte AS (
SELECT i AS combo, i, 1 AS ct
FROM t
UNION ALL
SELECT array(select i from (select unnest(array_cat(cte.combo, t.i)) order by 1) t(i)), t.i, ct + 1
FROM cte, t
WHERE cte.ct <= i_elems-1 -- Stop extending once combinations reach i_elems elements
AND (not cte.combo @> t.i) -- The newly added value is not already in the combination
)
SELECT combo FROM cte where array_length(combo,1)=i_elems group by combo;
return;
end;
$$
language plpgsql strict;
postgres=# select array_regroup(array[1,2,3],2);
array_regroup
---------------
{2,3}
{1,2}
{1,3}
(3 rows)
create or replace function array_regroup(
i_arr int[] -- Input array
) returns setof int[] as
$$
declare
v_arr_len int := array_length(i_arr, 1); -- Length of the input array
begin
return query
WITH RECURSIVE
t(i) AS (
select array(select i) from unnest(i_arr) t(i)
),
cte AS (
SELECT i AS combo, i, 1 AS ct
FROM t
UNION ALL
SELECT array(select i from (select unnest(array_cat(cte.combo, t.i)) order by 1) t(i)), t.i, ct + 1
FROM cte, t
WHERE cte.ct <= v_arr_len-1 -- Extend until combinations reach the full array length
AND (not cte.combo @> t.i) -- The newly added value is not already in the combination
)
SELECT combo FROM cte group by combo;
return;
end;
$$
language plpgsql strict;
postgres=# select array_regroup(array[1,2,3]);
array_regroup
---------------
{2}
{2,3}
{1,2}
{1}
{1,2,3}
{3}
{1,3}
(7 rows)
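A quick sanity check, offered only as a sketch: a set of n distinct elements has 2^n - 1 non-empty subsets, so a 3-element array should yield 7 rows.
select count(*) from array_regroup(array[1,2,3]);  -- expected: 7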
create or replace function array_regroup(
i_arr int[], -- Input array
i_elems int, -- Size of each returned set
i_arr_contain int[] -- Only return sets that contain these item IDs
) returns setof int[] as
$$
declare
v_arr_len int := array_length(i_arr, 1); -- Length of the input array
begin
-- Protection
if i_elems > v_arr_len then
raise notice 'requested group length % exceeds array length %', i_elems, v_arr_len;
return;
elsif i_elems = v_arr_len then
return next i_arr;
return;
elsif i_elems = 1 then
return query select array(select i) from unnest(i_arr) t(i);
return;
end if;
return query
WITH RECURSIVE
t(i) AS (
select array(select i) from unnest(i_arr) t(i)
),
cte AS (
SELECT i AS combo, i, 1 AS ct
FROM t
UNION ALL
SELECT array(select i from (select unnest(array_cat(cte.combo, t.i)) order by 1) t(i)), t.i, ct + 1
FROM cte, t
WHERE cte.ct <= i_elems-1 -- Stop extending once combinations reach i_elems elements
AND (not cte.combo @> t.i) -- The newly added value is not already in the combination
AND (cte.combo @> i_arr_contain) -- Only extend combinations that already contain the required item IDs
)
SELECT combo FROM cte where array_length(combo,1)=i_elems group by combo;
return;
end;
$$
language plpgsql strict;
postgres=# select array_regroup(array[1,2,3,4,5],2,array[1]);
array_regroup
---------------
{1,2}
{1,3}
{1,4}
{1,5}
(4 rows)
Time: 1.150 ms
For example, find the best combination item for bread.
Assume that the item ID of bread is 6537.
postgres=# select item_id from buy where item_id @> array[6537];
......
{60573,17248,6537,77857,43349,66208,13656}
{97564,50031,79924,24255,6537,21174,39117}
{24026,78667,99115,87856,64782,8344,73169,41478,63091,29609,6537,71982,75382}
{53094,97465,26156,54181,6537}
(1101 rows)
Time: 5.791 ms
postgres=# explain select item_id from buy where item_id @> array[6537];
QUERY PLAN
---------------------------------------------------------------------------------
Bitmap Heap Scan on buy (cost=457.45..51909.51 rows=55000 width=60)
Recheck Cond: (item_id @> '{6537}'::integer[])
-> Bitmap Index Scan on idx_buy_item (cost=0.00..443.70 rows=55000 width=0)
Index Cond: (item_id @> '{6537}'::integer[])
(4 rows)
Split the order data into two-item sets containing this item ID, and find the sets that occur most often.
postgres=# select count(*), array_regroup(item_id,2,array[6537]) from buy where item_id @> array[6537] group by 2 order by 1 desc;
count | array_regroup
-------+---------------
3 | {6537,55286}
3 | {6537,48661}
3 | {6537,78337}
3 | {6537,72623}
3 | {6537,81442}
3 | {6537,66414}
3 | {6537,35346}
3 | {6537,79565}
3 | {3949,6537}
......
Time: 286.859 ms
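In practice you may want to discard pairs with very low support. A hedged variant of the query above (the minimum co-occurrence threshold of 3 is an assumption, not part of the original analysis) filters with HAVING:
select count(*) as cnt, array_regroup(item_id, 2, array[6537]) as combo
from buy
where item_id @> array[6537]
group by 2
having count(*) >= 3  -- assumed minimum co-occurrence threshold
order by 1 desc;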
For example, find the best two companion items for bread (three-item sets that contain bread).
postgres=# select count(*), array_regroup(item_id,3,array[6537]) from buy where item_id @> array[6537] group by 2 order by 1 desc;
count | array_regroup
-------+--------------------
1 | {32,999,6537}
1 | {6537,49957,91533}
1 | {6537,49957,88377}
1 | {6537,49957,57887}
1 | {6537,49957,55192}
1 | {6537,49952,95266}
1 | {6537,49952,56916}
1 | {6537,49945,60492}
1 | {6537,49940,92888}
......
Time: 1055.414 ms
To find the best two-item combinations across all orders, rather than those anchored on a single item, every order must be split. This may take a long time.
select count(*), array_regroup(item_id,2) from buy group by 2 order by 1 desc limit 10;
Likewise for combinations of n items, where n is a placeholder for the desired set size. This may also take a long time.
select count(*), array_regroup(item_id, n) from buy group by 2 order by 1 desc limit 10;
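Because splitting 11 million orders is the expensive part, one option, given here only as a sketch (the buy_pairs table name is made up, and materializing the split rows itself takes considerable time and space), is to materialize the split pairs once and aggregate from that table repeatedly:
create unlogged table buy_pairs as
select array_regroup(item_id, 2) as combo
from buy;
select combo, count(*)
from buy_pairs
group by combo
order by 2 desc
limit 10;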
1. This case doesn't require sophisticated methods; it simply splits each order's item array into sets of the desired size and counts how often each set occurs.
The following database features are used in this case:
1.1. Support for the array type
1.2. PL/pgSQL programming on the server side
1.3. Indexed retrieval of arrays that contain a specific element (GIN index)
1.4. An MPP distributed database architecture for better computing performance. For more information, refer to Alibaba Cloud HybridDB for PostgreSQL.
2. Note that the statistical method in this article has a limitation.
The statistical method covered in this article can only be used to calculate the best combinations of directly related items (items that appear in the same order).
To calculate combinations of indirectly associated items (for example, if user A bought items 1 and 2 and user B bought items 2 and 3, then items 1 and 3 are indirectly related), you need a recommendation algorithm such as those in RecDB, or a similar graph search method.
3. Alibaba Cloud HybridDB for PostgreSQL provides MPP-based horizontal scaling, which is well suited to OLAP scenarios. For example, the GROUP BY operation used heavily in this case benefits significantly.
4. CPU-based multi-core parallel query, added in PostgreSQL 9.6, can also significantly improve performance in OLAP scenarios, including the GROUP BY operation used heavily in this case.
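As a hedged sketch of what enabling parallelism involves (the setting value below is an assumption and depends on the hardware): PL/pgSQL functions are PARALLEL UNSAFE by default, so the array_regroup overloads must be marked safe before the planner will even consider a parallel plan for queries that call them. Whether a parallel plan is actually chosen still depends on the query shape, table size, and cost settings.
alter function array_regroup(int[], int) parallel safe;
alter function array_regroup(int[], int, int[]) parallel safe;
set max_parallel_workers_per_gather = 8;  -- assumed value; tune for the machine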
References
https://github.com/DataSystemsLab/recdb-postgresql
https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html