Bloom is a space-efficient probabilistic data structure that consumes minimal memory to check whether an element exists among large amounts of data. TairBloom is an implementation of scalable Bloom filters (SBFs). It features dynamic scalability while maintaining a stable false positive rate during scaling.
Overview
You can use bitmaps on Redis data structures, such as hashes, sets, and strings, to implement features similar to those of TairBloom. However, these data structures may consume a large amount of memory or fail to maintain a stable false positive rate during dynamic scaling. For this reason, TairBloom is ideal for checking for large amounts of data when a specific false positive rate is allowed. You can use the built-in Bloom filters of TairBloom without further encapsulation or the need to create extra Bloom filters on your on-premises devices.
Features
Minimum consumption of memory.
Dynamic scaling.
Stable custom false positive rate during scaling.
Typical scenarios
TairBloom can be used for recommendation and crawler systems in industries such as live-streaming, music, and e-commerce.
Recommendation system: TairBloom records the IDs of articles that may have been recommended, queries these articles, determines duplicate articles, and then recommends new articles in which users might be interested.
Crawler system: TairBloom filters out the URLs that have been crawled to improve productivity.
Best practices
Build recommendation systems based on TairBloom
TairBloom records the IDs of articles that may have been recommended, queries these articles, determines duplicate articles, and then recommends new articles in which users might be interested. Pseudocode:
void recommendedSystem(userid) {
while (true) {
// Obtain a random article ID or a desired article ID.
docid = getDocByRandom()
if (bf.exists(userid, docid)) {
// If the article may have been recommended to a user, the next article is recommended.
continue;
} else {
// If the article has not been recommended to the user, the article is recommended.
sendRecommendMsg(docid);
// If the article has been recommended, the article ID is recorded in a Bloom filter.
bf.add(userid, docid);
break;
}
}
}
Optimize crawler systems based on TairBloom
TairBloom filters out the URLs that may have been crawled to improve productivity. Pseudocode:
bool crawlerSystem( ) {
while (true) {
// Obtain a URL that you want to crawl.
url = getURLFromQueue()
if (bf.exists(url_bloom, url)) {
// If the URL may have been crawled, the URL is skipped.
continue;
} else {
// Crawl this URL.
doDownload(url)
// Add this URL to a Bloom filter.
bf.add(url_bloom, url);
}
}
}
How it works
TairBloom is an implementation of SBFs. It features dynamic scalability while maintaining a stable false positive rate. SBFs are optimized Bloom filters. The following section describes the basic principles of Bloom filters and SBFs.
Bloom filters
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.
A new Bloom filter is a bit array of m bits, all of which are set to 0. The Bloom filter also includes a set of k different hash functions to generate a uniform random distribution. k is a constant that is less than m. When you add elements to the Bloom filter, the k hash functions map these elements to k bits in the bit array and set the k bits to a value of 1. In this case, a bit can be shared by multiple pieces of data. The following figure shows how to insert X1 and X2 into a Bloom filter that includes 3 hash functions.
When you query an element in a Bloom filter, you can use k hash functions to obtain k bits. If all of k bits have a value of 1, the element exists in the Bloom filter. Otherwise, the element does not exist. The following figure shows how to query Y1 and Y2 in the Bloom filter.
The preceding figure demonstrates that although Y2 was never inserted into the Bloom filter, Y2 is determined to be existent in the Bloom filter. This scenario shows that Bloom filters have a false positive rate. The preceding analysis shows that Bloom filters have the following features:
A bit can be shared by multiple pieces of data.
Bloom filters have a false positive rate and the rate gets higher as the number of elements in a Bloom filter increases. However, Bloom filters do not have a false negative rate. If an element exists in a Bloom filter, the element is always determined to be existent.
An element can be added to a Bloom filter but cannot be removed from the Bloom filter. The reason is that bits can be shared. If you remove an element from the Bloom filter, other elements in the Bloom filter are affected.
SBFs
When the number of elements increases, the false positive rate also increases. If you want to maintain a stable false positive rate, you must accordingly increase the size of the Bloom filter. However, the size cannot be increased due to structural limits. SBFs emerged in response to this issue. SBFs are a new type of Bloom filter that combines multiple Bloom filters into one.
The following figure shows the basic model of an SBF. This SBF has two layers: BF0 and BF1. At first, the SBF contains only the BF0 layer. If you insert elements a, b, and c into this SBF and the BF0 layer is not large enough to maintain a specified false positive rate, the BF1 layer is created to increase the SBF size. Then, elements d, e, and f are inserted into the BF1 layer. A new layer named BF2 is created if the BF1 layer also cannot maintain the specified positive rate.
SBFs insert data only into the last layer and query data from the last layer to the BF0 layer. For more information, see Scalable Bloom Filters.
Prerequisites
A Tair DRAM-based instance is created.
The latest minor version provides more features and higher stability. We recommend that you update the instance to the latest minor version. For more information, see Update the minor version of an instance. If your instance is a cluster or read/write splitting instance, we recommend that you update the proxy nodes in the instance to the latest minor version. This ensures that all commands can be run as expected.
Precautions
The TairBloom data that you want to manage is stored in a Tair instance.
The initial capacity and false positive rate that meet your requirements must be calculated in advance. To create a TairBloom key that has a capacity far more than 100 elements, run the
BF.RESERVE
command instead of theBF.ADD
command.The following section demonstrates the difference between the
BF.ADD
command and theBF.RESERVE
command:BF.ADD
(also known asBF.MADD
): If a TairBloom key does not exist when this command is run, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01. If you require a key capacity that is far more than 100, you can only scale up the key to support more elements.When a key is scaled up, more layers of Bloom filters are added to it. Every time a Bloom filter layer is added, the key capacity increases. In this case, if you perform a query on the key, multiple layers of Bloom filters need to be traversed and the performance of the key significantly deteriorates.
BF.RESERVE
(also known asBF.INSERT
): An initial capacity is specified when this command is run. This command specifies an initial capacity at the first layer of a key. If a key contains a smaller number of Bloom filter layers, the query speed of the key is faster.
NoteFor example, assume that you want to insert 10,000,000 elements into a key and allows a false positive rate of 0.01. The created key is 176 MB in size if you use the
BF.ADD
command, or 16 MB in size if you use theBF.RESERVE
command.Although keys can be scaled up, we recommend that you consider key scale-up only as a safety measure and do not frequently scale up keys. If the actual capacity of a key exceeds the provisioned capacity of the key, you can scale up the key to ensure that write operations can be performed on the key and prevent live accidents.
The following table lists a variety of key sizes supported by the
BF.RESERVE
command and their matched initial capacities and false positive rates.Capacity (number of elements)
false positive:0.01
false positive:0.001
false positive:0.0001
100,000
0.12 MB
0.25 MB
0.25 MB
1,000,000
2 MB
2 MB
4 MB
10,000,000
16 MB
32 MB
32 MB
100,000,000
128 MB
256 MB
256 MB
1,000,000,000
2 GB
2 GB
4 GB
The size of a key cannot be decreased because elements can only be added to but not removed from a key. To prevent capacity issues such as an out of memory (OOM) error, we recommend that you implement the following solutions:
Split business data. You can split business data to prevent large keys that affect query performance. If large keys exist, most requests are made to the instance that contains these keys. This may cause hotkeys or even skewed requests.
You can distribute the split data to multiple keys. If your business data is stored in a Tair cluster instance, you can distribute the split data to multiple nodes in the instance to ensure that large keys and hotkeys do not occur.
Regularly rebuild keys. If possible, you can run the
DEL
command to delete data from a key, and then insert data from backend databases into the key to manage the key size.You can also create multiple keys to interchangeably use them. This way, the size of a single key is kept appropriate. The benefit of this solution is that you only need to create keys once. However, multiple keys must be created and some memory may be wasted.
Supported commands
Table 1. TairBloom commands
Command | Syntax | Description |
| Creates an empty TairBloom key that has a specified capacity and false positive rate. The capacity parameter specifies the capacity of the key and the error_rate parameter specifies the false positive rate of the key. | |
| Adds an element to a TairBloom key. | |
| Adds multiple elements to a TairBloom key. | |
| Checks whether an element exists in a TairBloom key. | |
| Checks whether multiple elements exist in a TairBloom key. | |
| Adds multiple elements to a TairBloom key. If the key does not exist, you can specify whether to create the key. You can also specify the capacity and false positive rate of the key. | |
| Retrieves information about a TairBloom key. The information includes the number of layers, the number of items at each layer, and the false positive rate. | |
| Deletes one or more TairBloom keys. Note The elements that are already added to a key cannot be deleted. You can run the DEL command to delete all data from a key. |
Uppercase keyword
: indicates the command keyword.Italic text
: indicates variables.[options]
: indicates that the enclosed parameters are optional. Parameters that are not enclosed by brackets must be specified.A|B
: indicates that the parameters separated by the vertical bars (|) are mutually exclusive. Only one of the parameters can be specified....
: indicates that the parameter preceding this symbol can be repeatedly specified.
BF.RESERVE
Item | Description |
Syntax |
|
Time complexity | O(1) |
Command description | Creates an empty TairBloom key that has a specified capacity and false positive rate. The capacity parameter specifies the capacity of the key and the error_rate parameter specifies the false positive rate of the key. |
Parameter |
|
Output |
|
Example | Sample command:
Sample output:
|
BF.ADD
Item | Description |
Syntax |
|
Time complexity | O(log N), where N specifies the number of Bloom filter layers. |
Command description | Adds an element to a TairBloom key. Note If the key does not exist, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01. |
Parameter |
|
Output |
|
Example | Sample command:
Sample output:
|
BF.MADD
Item | Description |
Syntax |
|
Time complexity | O(log N), where N specifies the number of Bloom filter layers. |
Command description | Adds multiple elements to a TairBloom key. Note If the key does not exist, the key is automatically created. The default capacity of the key is 100, and the default false positive rate of the key is 0.01. |
Parameter |
|
Output |
|
Example | Sample command:
Sample output:
|
BF.EXISTS
Item | Description |
Syntax |
|
Time complexity | O(log N), where N specifies the number of Bloom filter layers. |
Command description | Checks whether an element exists in a TairBloom key. |
Parameter |
|
Output |
|
Example | Sample command:
Sample output:
|
BF.MEXISTS
Item | Description |
Syntax |
|
Time complexity | O(log N), where N specifies the number of Bloom filter layers. |
Command description | Checks whether multiple elements exist in a TairBloom key. |
Parameter |
|
Output |
|
Example | Sample command:
Sample output:
|
BF.INSERT
Item | Description |
Syntax |
|
Time complexity | O(log N), where N specifies the number of Bloom filter layers. |
Command description | Adds multiple elements to a TairBloom key. If the key does not exist, you can specify whether to create the key. You can also specify the capacity and false positive rate of the key. |
Parameter |
|
Output |
|
Example | Sample command:
Sample output:
|
BF.INFO
Item | Description |
Syntax |
|
Time complexity | O(log N), where N specifies the number of Bloom filter layers. |
Command description | Retrieves information about a TairBloom key. The information includes the number of layers, the number of items at each layer, and the false positive rate. |
Parameter |
|
Output |
|
Example | Sample command:
Sample output:
BF.INFO response parameters:
|