By digoal
PolarDB's cloud-native computing-storage separation architecture offers cost-effective data storage, scalable operations, high-speed multi-machine parallel computing, and efficient data search and processing. Combined with computing algorithms, PolarDB enables businesses to derive value from their data and turn it into productivity.
This article explores how to leverage the open-source PolarDB Bloom Filter Index to filter data based on any combination of fields. The test environment used for this demonstration is macOS + Docker. For detailed instructions on PolarDB deployment, please refer to the following article: Simple Deployment of PolarDB
Filtering data by arbitrary combinations of fields is a common requirement in data analysis scenarios, for example in BI work, where analysts slice the data along different dimensions and iterate through trial and error. An efficient solution for filtering on any combination of fields is therefore needed.
Creating a composite index for every possible combination of fields would accelerate any such filter, but the number of combinations grows explosively. PolarDB for PostgreSQL also supports bitmap scans that combine multiple B-tree indexes, so one index per field is enough. However, some drawbacks still exist:
• The total space occupied by each column's indexes is relatively large.
• The presence of more indexes can increase the response time (RT) caused by data manipulation language (DML) operations, impacting the write throughput.
To overcome these challenges, the bloom filter was introduced. A bloom index supports data filtering based on any combination of fields (note: it supports only equality conditions). Here's how it works:
• The bloom filter summarizes each indexed row as an N-bit signature.
• Each value is hashed and mapped to a fixed number of bits in the signature. For example, with 2 bits per value, "hello" may map to bits 1 and 8, while "world" may map to bits 2 and 8.
To check whether a row may contain "hello," you only need to test whether the corresponding bits are set in its signature.
The bloom filter is lossy in one direction: if the value's bits are not all set, the value is definitely absent; but if they are all set, the value may still be absent, because other values can map to the same bits (bit collisions). Due to these false positives, a recheck against the actual row is required for verification.
Why do bit collisions occur?
For example, if an 80-bit signature stores values from 32 columns at 2 bits per value, different column values can easily map to the same bits. The more fields involved, the higher the probability of a collision.
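The mechanism above can be sketched in a few lines of Python. This is a conceptual illustration only, not PolarDB's implementation: the 80-bit signature and 2 bits per value mirror the example above, and the hash function is an arbitrary choice.

```python
import hashlib

NUM_BITS = 80        # signature length, as in the 80-bit example above
BITS_PER_VALUE = 2   # bits set per value, as in the 2-bit example above

def bits_for(value: str) -> set:
    """Map a value to BITS_PER_VALUE bit positions via seeded hashes."""
    positions = set()
    for seed in range(BITS_PER_VALUE):
        h = hashlib.sha256(f"{seed}:{value}".encode()).digest()
        positions.add(int.from_bytes(h[:4], "big") % NUM_BITS)
    return positions

def build_signature(values) -> int:
    """OR together the bits of every value into one signature."""
    sig = 0
    for v in values:
        for b in bits_for(v):
            sig |= 1 << b
    return sig

def might_contain(sig: int, value: str) -> bool:
    """True if all of the value's bits are set. This may be a false
    positive (other values set the same bits), so a recheck against
    the real row is still required; False is always definitive."""
    return all(sig & (1 << b) for b in bits_for(value))

sig = build_signature(["hello", "world"])
print(might_contain(sig, "hello"))   # True: its bits were set at build time
print(might_contain(sig, "xyzzy"))   # not inserted; True here would be a false positive
```

Note that a value that was inserted always tests positive; only the reverse direction is probabilistic, which is exactly why the index scan below shows a "Recheck Cond" step.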
In any case, the bloom filter provides a cost-effective and efficient method for filtering data based on any combination of fields, facilitating seamless queries.
Create a test table and write 10 million records
=# CREATE TABLE tbloom AS
SELECT
(random() * 1000000)::int as i1,
(random() * 1000000)::int as i2,
(random() * 1000000)::int as i3,
(random() * 1000000)::int as i4,
(random() * 1000000)::int as i5,
(random() * 1000000)::int as i6
FROM
generate_series(1,10000000);
SELECT 10000000
Full table scan performance
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Seq Scan on tbloom (cost=0.00..2137.14 rows=3 width=24) (actual time=16.971..16.971 rows=0 loops=1)
Filter: ((i2 = 898732) AND (i5 = 123451))
Rows Removed by Filter: 100000
Planning Time: 0.346 ms
Execution Time: 16.988 ms
(5 rows)
Create a single bloom index covering all six fields. It occupies only about 1.5 MB of storage and improves this query's performance by roughly 40 times.
=# CREATE EXTENSION bloom;
=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
CREATE INDEX
=# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
pg_size_pretty
----------------
1584 kB
(1 row)
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbloom (cost=1792.00..1799.69 rows=2 width=24) (actual time=0.388..0.388 rows=0 loops=1)
Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
Rows Removed by Index Recheck: 29
Heap Blocks: exact=28
-> Bitmap Index Scan on bloomidx (cost=0.00..1792.00 rows=2 width=0) (actual time=0.356..0.356 rows=29 loops=1)
Index Cond: ((i2 = 898732) AND (i5 = 123451))
Planning Time: 0.099 ms
Execution Time: 0.408 ms
(8 rows)
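The signature length and per-column bit counts are tunable through the bloom extension's storage parameters (the defaults, length=80 with 2 bits per column, match the 80-bit example earlier). As a sketch, a longer signature or more bits for a frequently filtered column lowers its false-positive rate at the cost of a larger index:

```sql
-- Hypothetical tuning: 128-bit signature, 4 bits for column i2.
DROP INDEX bloomidx;
CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6)
    WITH (length=128, col2=4);
```

Fewer rows would then survive the bitmap index scan, shrinking the "Rows Removed by Index Recheck" count seen above.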
With ordinary B-tree indexes, one index is created per field, six in total. Multi-field filters then use a BitmapAnd/BitmapOr scan, which is the most efficient approach here, but the indexes take up far more space and have a greater impact on write throughput.
=# CREATE INDEX btreeidx1 ON tbloom (i1);
CREATE INDEX
=# CREATE INDEX btreeidx2 ON tbloom (i2);
CREATE INDEX
=# CREATE INDEX btreeidx3 ON tbloom (i3);
CREATE INDEX
=# CREATE INDEX btreeidx4 ON tbloom (i4);
CREATE INDEX
=# CREATE INDEX btreeidx5 ON tbloom (i5);
CREATE INDEX
=# CREATE INDEX btreeidx6 ON tbloom (i6);
CREATE INDEX
=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on tbloom (cost=24.34..32.03 rows=2 width=24) (actual time=0.028..0.029 rows=0 loops=1)
Recheck Cond: ((i5 = 123451) AND (i2 = 898732))
-> BitmapAnd (cost=24.34..24.34 rows=2 width=0) (actual time=0.027..0.027 rows=0 loops=1)
-> Bitmap Index Scan on btreeidx5 (cost=0.00..12.04 rows=500 width=0) (actual time=0.026..0.026 rows=0 loops=1)
Index Cond: (i5 = 123451)
-> Bitmap Index Scan on btreeidx2 (cost=0.00..12.04 rows=500 width=0) (never executed)
Index Cond: (i2 = 898732)
Planning Time: 0.491 ms
Execution Time: 0.055 ms
(9 rows)
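To quantify the space trade-off, the combined footprint of the six B-tree indexes can be compared against the single bloom index. A sketch using standard PostgreSQL catalog queries (index names as created above):

```sql
-- Total size of the six per-column btree indexes vs. the bloom index.
SELECT pg_size_pretty(sum(pg_relation_size(indexrelid)))
FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname LIKE 'btreeidx%';

SELECT pg_size_pretty(pg_relation_size('bloomidx'));
```

The B-tree total is typically tens of times larger than the roughly 1.5 MB bloom index, which is the cost being traded for the faster BitmapAnd plan.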