×
Community Blog Open Source PolarDB Uses postgresql_hll to Accelerate UV Sliding Analysis and Real-time Recommendations

Open Source PolarDB Uses postgresql_hll to Accelerate UV Sliding Analysis and Real-time Recommendations

This article describes how open source PolarDB uses postgresql_hll to to achieve efficient UV (Unique Visitor) sliding analysis and real-time recommendation for filtered read lists.

By digoal

Background

PolarDB's is a cloud-native database that adopts a computing-storage separation architecture. It offers affordable data storage, efficient scalability, high-speed multi-machine parallel computing capabilities, and fast data search and processing. By integrating computing algorithms, PolarDB drives the generation of business data value and transforms data into productivity.

This article describes how open source PolarDB uses postgresql_hll to to achieve efficient UV (Unique Visitor) sliding analysis and real-time recommendation for filtered read lists.

The test environment used in this article is macos + docker. For more information on deploying PolarDB, please see:

Introduction to postgresql_hll

postgresql_hll is a plugin that efficiently stores a collection of unique "hash values". It provides the following functionalities:

• Appending content to the "hash values"

• Determining the number of unique hash values

• Calculating the difference between two hash values

• Performing the union of two hash values

• Performing the union of multiple hash values

An hll can store hundreds of millions of unique hash values using just a few KB of storage.

Common scenarios for its usage include:

  1. UV (Unique Visitors)
  2. Sliding Window UVs (Unique Visitors)
  3. New UVs
  4. Year-on-year and month-on-month analysis
  5. In the short video recommendation service, only unread short videos are recommended. By utilizing postgresql_hll, the read list can be efficiently recorded and filtered.

hll is similar to the bloom filter in the following ways:

• When the judgment result indicates that "val" is in hll, it is possible that "val" may not actually be in hll. This is due to distorted storage, where the placeholder bitmask of multiple "vals" may overwrite the bitmask of other "vals".

• When the judgment result indicates that "val" is not in hll, it means that "val" is definitely not in hll.

postgresql_hll for PolarDB

1.  Install and deploy postgresql_hll for PolarDB

git clone --depth 1 https://github.com/citusdata/postgresql-hll  
  
export PGHOST=localhost  
[postgres@67e1eed1b4b6 ~]$ psql  
psql (11.9)  
Type "help" for help.  
  
postgres=# \q  
  
  
cd postgresql-hll/  
  
USE_PGXS=1 make  
  
USE_PGXS=1 make install  
  
USE_PGXS=1 make installcheck  

2.  Example

Create a table and write a large number of UID behavior data. Generate UV data by day and use hll to store UID hash.

create table t1 (id int, uid int, info text, crt_time timestamp);  
create table t1_hll (dt date, hllval hll);  
insert into t1 select id, random()*100000, random()::text, now() from generate_series(1,1000000) id;  
insert into t1 select id, random()*100000, random()::text, now()+interval '1 day' from generate_series(1,1000000) id;  
insert into t1_hll select date(crt_time), hll_add_agg(hll_hash_integer(uid)) from t1 group by 1;   

Determine whether the UID is within the hll hash and check the hll accuracy.

postgres=# select t1.uid, t2.hllval=hll_add(t2.hllval, hll_hash_integer(t1.uid)) from t1 , t1_hll t2 where t2.dt=date(now()) and t1.crt_time < date(now())+1 limit 10;  
  uid  | ?column?   
-------+----------  
 95912 | t  
 69657 | t  
 53722 | t  
 95821 | t  
  2836 | t  
 66298 | t  
 68466 | t  
 10122 | t  
 27861 | t  
  6824 | t  
(10 rows)  
  
  
select * from   
  (select t1.uid, t2.hllval=hll_add(t2.hllval, hll_hash_integer(t1.uid)) as yesorno from t1 , t1_hll t2 where t2.dt=date(now()) and t1.crt_time < date(now())+1) t   
where t.yesorno=false;  
  
 uid | yesorno   
-----+---------  
(0 rows)  
  
-- Exactly accurate.

Utilizing sliding window analysis, for instance, counting the number of UVs within any 7-day sliding window directly from the hll statistics table. Without hll, performing sliding window analysis would require accessing the base table for statistics, resulting in extremely poor performance. However, with hll, only 7 records need to be accessed for aggregation.

## What if you wanted to this week's uniques?

SELECT hll_cardinality(hll_union_agg(users)) FROM daily_uniques WHERE date >= '2012-01-02'::date AND date <= '2012-01-08'::date;

## Or the monthly uniques for this year?

SELECT EXTRACT(MONTH FROM date) AS month, hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE date >= '2012-01-01' AND
      date <  '2013-01-01'
GROUP BY 1;

## Or how about a sliding window of uniques over the past 6 days?

SELECT date, #hll_union_agg(users) OVER seven_days
FROM daily_uniques
WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);

## Or the number of uniques you saw yesterday that you didn't see today?

SELECT date, (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
FROM daily_uniques
WINDOW two_days AS (ORDER BY date ASC ROWS 1 PRECEDING);
0 1 0
Share on

digoal

282 posts | 24 followers

You may also like

Comments