By digoal
PolarDB is a cloud-native database that adopts a compute-storage separation architecture. It offers affordable data storage, efficient scalability, high-speed multi-machine parallel computing, and fast data search and processing. By integrating computing algorithms, PolarDB drives the generation of business data value and transforms data into productivity.
This article describes how open source PolarDB uses postgresql_hll to achieve efficient UV (unique visitor) sliding-window analysis and real-time recommendation with already-read filtering.
The test environment used in this article is macOS + Docker. For more information on deploying PolarDB, please see:
postgresql_hll is a plugin that efficiently stores a set of unique hash values in a compact sketch. It provides the following functionality:
• Adding hash values to a sketch
• Counting the number of unique hash values in a sketch
• Computing the difference between two sketches
• Computing the union of two sketches
• Computing the union of multiple sketches
An hll can store hundreds of millions of unique hash values using just a few KB of storage.
Common scenarios include UV counting and deduplicated statistics over sliding windows, as demonstrated below.
hll is similar to a Bloom filter in the following ways:
• When the check reports that a value is in the hll, the value may not actually be there. Storage is lossy: the register bits set by one value can be covered by the bits set by other values, so the check can produce false positives.
• When the check reports that a value is not in the hll, the value is definitely not there.
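The compactness claim is easy to see with a toy HyperLogLog in Python. This is only a sketch of the idea, not the extension's implementation; the register count, hash function, and estimator constants here are illustrative choices:

```python
import hashlib
import math

class ToyHLL:
    """A minimal HyperLogLog sketch, for illustration only."""

    def __init__(self, p=14):
        self.p = p                 # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def _hash(self, value):
        # 64-bit hash derived from SHA-1 (illustrative, not the
        # hash function the extension actually uses)
        digest = hashlib.sha1(str(value).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        x = self._hash(value)
        idx = x >> (64 - self.p)                      # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:        # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return estimate

sketch = ToyHLL()
for uid in range(100000):
    sketch.add(uid)

# 2^14 registers of at most 6 bits each pack into a few KB,
# yet the estimate stays close to the true count of 100000.
print(round(sketch.cardinality()))
```

Each register only remembers the longest run of leading zero bits it has seen, which is why adding a duplicate value can never change the sketch.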
1. Install and deploy postgresql_hll for PolarDB
git clone --depth 1 https://github.com/citusdata/postgresql-hll
export PGHOST=localhost
[postgres@67e1eed1b4b6 ~]$ psql
psql (11.9)
Type "help" for help.
postgres=# \q
cd postgresql-hll/
USE_PGXS=1 make
USE_PGXS=1 make install
USE_PGXS=1 make installcheck
2. Example
Create a table and write a large amount of UID behavior data, then generate daily UV data, using hll to store the UID hashes. (This assumes the extension has been created in the target database with create extension hll;.)
create table t1 (id int, uid int, info text, crt_time timestamp);
create table t1_hll (dt date, hllval hll);
insert into t1 select id, random()*100000, random()::text, now() from generate_series(1,1000000) id;
insert into t1 select id, random()*100000, random()::text, now()+interval '1 day' from generate_series(1,1000000) id;
insert into t1_hll select date(crt_time), hll_add_agg(hll_hash_integer(uid)) from t1 group by 1;
Check whether a given UID is in the hll, and verify the hll's accuracy. The trick below: a value is judged present when adding it again leaves the hll unchanged.
postgres=# select t1.uid, t2.hllval=hll_add(t2.hllval, hll_hash_integer(t1.uid)) from t1 , t1_hll t2 where t2.dt=date(now()) and t1.crt_time < date(now())+1 limit 10;
uid | ?column?
-------+----------
95912 | t
69657 | t
53722 | t
95821 | t
2836 | t
66298 | t
68466 | t
10122 | t
27861 | t
6824 | t
(10 rows)
select * from
(select t1.uid, t2.hllval=hll_add(t2.hllval, hll_hash_integer(t1.uid)) as yesorno from t1 , t1_hll t2 where t2.dt=date(now()) and t1.crt_time < date(now())+1) t
where t.yesorno=false;
uid | yesorno
-----+---------
(0 rows)
-- No false negatives: every UID actually written is judged as present.
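The equality trick above works because hll_add folds a value in with a max() on one register, so re-adding a value that is already represented can never change the sketch, while a genuinely new value usually (but not always) does. A toy Python version of the same check, with an illustrative register layout:

```python
import hashlib

P = 12                 # 2^12 registers, an illustrative size
M = 1 << P

def _hash(v):
    # 64-bit hash from SHA-1 (illustrative, not the extension's hash)
    return int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")

def hll_add(regs, v):
    """Return a new register list with v folded in (mirrors SQL hll_add)."""
    x = _hash(v)
    idx = x >> (64 - P)                                # register index
    rank = (64 - P) - (x & ((1 << (64 - P)) - 1)).bit_length() + 1
    out = list(regs)
    out[idx] = max(out[idx], rank)                     # max(): duplicates are no-ops
    return out

regs = [0] * M
for uid in range(1000):
    regs = hll_add(regs, uid)

# Every uid actually added passes the "unchanged after re-add" test:
# there are no false negatives.
assert all(hll_add(regs, uid) == regs for uid in range(1000))
# An unseen uid can also pass the test (a false positive) when its
# register already holds an equal or higher rank, so a "present"
# verdict is only probabilistic.
```

This is exactly the one-sided semantics described in the Bloom-filter comparison earlier: "not present" is certain, "present" is not.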
Sliding-window analysis is where hll shines. For example, the number of UVs within any 7-day sliding window can be counted directly from the hll statistics table. Without hll, each window would require going back to the base table for statistics, which performs extremely poorly; with hll, only 7 small records need to be read and aggregated per window. The queries below, taken from the postgresql-hll documentation, assume an analogous daily_uniques(date, users hll) table.
-- What if you wanted to know this week's uniques?
SELECT hll_cardinality(hll_union_agg(users)) FROM daily_uniques WHERE date >= '2012-01-02'::date AND date <= '2012-01-08'::date;
-- Or the monthly uniques for this year?
SELECT EXTRACT(MONTH FROM date) AS month, hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE date >= '2012-01-01' AND
date < '2013-01-01'
GROUP BY 1;
-- Or how about a sliding window of uniques over the past 7 days (the current day plus the 6 preceding)?
SELECT date, #hll_union_agg(users) OVER seven_days
FROM daily_uniques
WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);
-- Or the number of uniques you saw yesterday that you didn't see today?
SELECT date, (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
FROM daily_uniques
WINDOW two_days AS (ORDER BY date ASC ROWS 1 PRECEDING);
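In these windowed queries, each output row unions only a handful of small daily sketches instead of rescanning the base table, and a sketch union is simply an element-wise max of registers. The toy Python model below uses a synthetic ten-day workload and an illustrative register layout (not the extension's format):

```python
import hashlib
import math

P = 12                 # 2^12 registers, an illustrative size
M = 1 << P

def _hash(v):
    return int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")

def hll_add(regs, v):
    x = _hash(v)
    idx = x >> (64 - P)
    rank = (64 - P) - (x & ((1 << (64 - P)) - 1)).bit_length() + 1
    regs[idx] = max(regs[idx], rank)

def hll_union(a, b):
    # Union of two sketches = element-wise max (mirrors hll_union_agg)
    return [max(x, y) for x, y in zip(a, b)]

def estimate(regs):
    # Raw HyperLogLog estimator plus small-range (linear counting) correction
    alpha = 0.7213 / (1 + 1.079 / M)
    e = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if e <= 2.5 * M and zeros:
        e = M * math.log(M / zeros)
    return e

# Ten "days" of activity: day d sees uids [100*d, 100*d + 500),
# so consecutive days overlap by 400 uids.
days = []
for d in range(10):
    regs = [0] * M
    for uid in range(100 * d, 100 * d + 500):
        hll_add(regs, uid)
    days.append(regs)

# 7-day window ending on day 9: union 7 tiny sketches, never the base data.
# True uniques in days 3..9 are uids 300..1399, i.e. 1100.
window = days[3]
for regs in days[4:10]:
    window = hll_union(window, regs)

# "Lost uniques": seen on day 8 but not on day 9.
# True answer is uids 800..899, i.e. 100.
lost = estimate(hll_union(days[8], days[9])) - estimate(days[9])
print(round(estimate(window)), round(lost))
```

Both answers land close to the true values of 1100 and 100, and that is the whole point: the cost of window analysis is proportional to the number of days, not the number of events.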