
How to Remove Duplicate and Similar Contents in Vector Databases

This article introduces the use of vector databases for similarity search and deduplication.

By digoal

Background

Vector databases are natural companions to large models. In RAG services, they store text, images, and videos together with their corresponding feature vectors. At query time, a similarity search over the vectors returns the original text, images, or videos to ground the model's reasoning.

Vector databases are also used for similarity deduplication: they can detect and remove similar content that accumulates through minor modifications, plagiarism, and repeated uploads of the same material.

How to Remove Duplicate and Similar Contents in Vector Databases

Procedures:

  1. Identify similar content
  2. Define retention rules
  3. Remove similar content

1.  Generate test data.

Create the vector extension.

create extension vector;   

Create a test table.

create table tbl (  
  id int primary key,   
  v vector(32) not null,   
  ts timestamp not null,   
  -- Original content field, omitted.    
  filesize float8  not null  
);   

Create a function that generates random vectors.

create or replace function gen_vec(int) returns vector as $$  
  select array_to_vector(array(select (random()*100)::real from generate_series(1,$1) t), $1, true);  
$$ language sql strict;   
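A quick sanity check of the function (the output varies from run to run, since the components are random):

```sql
-- Generate a 4-dimensional random vector; each component is in [0, 100).
select gen_vec(4);
```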

Insert 10,000 rows of test data.

insert into tbl select generate_series(1,10000), gen_vec(32), clock_timestamp(), random()*100;  

2.  Create a table to store the IDs of duplicate records.

create table tbl_duplicate (  
  id int  
);    

3.  Generate the Cartesian product to identify similar and duplicate IDs.

To speed up execution, you can segment by ID and process in parallel.

Define retention rules:

• Retain the one with the largest or smallest filesize?

• Retain the one with the newest or oldest ts?

In this example, it is assumed that the newest one is retained and the older ones are deleted.

First, disable parallelism within a single SQL statement. Why not use PostgreSQL's built-in single-SQL parallelism? See the end of this article.

set max_parallel_workers_per_gather=0;  

Use the ID for segmentation. If the IDs are not continuous, ctid (the physical row location) can be used for segmentation instead. For example, a table with 1,000 data blocks can be divided into blocks 0-100, 101-200, and so on.
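The ctid-based variant can be sketched as follows (the block boundaries here are illustrative; on PostgreSQL 14 and later this can use a TID range scan, while older versions fall back to a filtered sequential scan):

```sql
-- Scan only the rows stored in heap blocks 0-100 of tbl.
-- '(101,0)' means block 101, offset 0; the upper bound is exclusive.
select * from tbl
where ctid >= '(0,0)'::tid and ctid < '(101,0)'::tid;
```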

insert into tbl_duplicate     
select case when ts1 > ts2 then id2 else id1 end  -- In this example, it is assumed that the newest record is retained and the older ones are deleted. 
from (    
select a.id id1, b.id id2, a.ts ts1, b.ts ts2, a.filesize filesize1, b.filesize filesize2,   
  -- The distance operator is self-defined, and here we use <=> for testing. The threshold is self-defined, and here we use 0.1 for testing; 0 indicates duplication, and 1 indicates no duplication.     
   (case when (a.v <=> b.v) < 0.1 then 0 else 1 end) as ds    
from   
   (select * from tbl where id >= 1 and id < 1001) a,  -- Segmented parallel execution, this SQL segment is 1-1000.        
  tbl b    
where a.id <> b.id   
) t    
where ds=0   
;   

Modify the preceding SQL statements for other shards as needed.
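For instance, the worker handling the second shard would run the same statement with only the range predicate changed:

```sql
insert into tbl_duplicate
select case when ts1 > ts2 then id2 else id1 end  -- Retain the newest record; delete the older ones.
from (
  select a.id id1, b.id id2, a.ts ts1, b.ts ts2,
         (case when (a.v <=> b.v) < 0.1 then 0 else 1 end) as ds
  from
    (select * from tbl where id >= 1001 and id < 2001) a,  -- This shard covers IDs 1001-2000.
    tbl b
  where a.id <> b.id
) t
where ds = 0;
```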

Without sharding, the Cartesian product of 10,000 records takes about 6 seconds.

insert into tbl_duplicate     
select case when ts1 > ts2 then id2 else id1 end  -- In this example, it is assumed that the newest record is retained and the older ones are deleted. 
from (    
select a.id id1, b.id id2, a.ts ts1, b.ts ts2, a.filesize filesize1, b.filesize filesize2,   
  -- The distance operator is self-defined, and here we use <=> for testing. The threshold is self-defined, and here we use 0.1 for testing; 0 indicates duplication, and 1 indicates no duplication.     
   (case when (a.v <=> b.v) < 0.1 then 0 else 1 end) as ds    
from tbl a,  tbl b    
where a.id <> b.id   
) t    
where ds=0   
;   
  
INSERT 0 55996  
Time: 6024.504 ms (00:06.025)  
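Note that tbl_duplicate contains repeated IDs: every similar pair a record belongs to contributes one row. The number of distinct IDs is what the delete in the next step actually removes, which is why it is far smaller than the insert count. A quick check:

```sql
select count(*), count(distinct id) from tbl_duplicate;
```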

4.  Delete similar and duplicate IDs.

delete from tbl where exists (select 1 from tbl_duplicate t where tbl.id=t.id);   
  
DELETE 5549  
Time: 57.987 ms  
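Equivalently, the cleanup can be written with PostgreSQL's DELETE ... USING syntax; deduplicating the ID list first keeps the join tidy:

```sql
delete from tbl
using (select distinct id from tbl_duplicate) t
where tbl.id = t.id;
```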

Thinking

Why not use the built-in parallel feature of PG?

Because the planner places the Gather nodes below the join: only the table scans run in parallel, while the expensive distance computation in the join filter above each Gather is executed serially by the leader. What actually needs to be parallelized is the computation after the Gather, so the built-in parallelism does not help here.

The following is an example.

/*
-- Set forced parallelism.
set max_parallel_workers=32;  
set max_parallel_workers_per_gather=6;  
set min_parallel_table_scan_size =0;  
set min_parallel_index_scan_size =0;  
set parallel_leader_participation =false;  
set parallel_setup_cost =0;  
set parallel_tuple_cost =0;  
alter table tbl set (parallel_workers =6);  
  
explain   
select id1,id2,ds from (  
select a.id id1, b.id id2,   
   (case when (a.v <=> b.v) < 0.1 then 0 else 1 end) as ds -- The distance operator is self-defined, and here we use <=> for testing. The threshold is self-defined, and here we use 0.1 for testing; 0 indicates duplication, and 1 indicates no duplication.    
from   
  tbl a,  
  tbl b    
where a.id <> b.id   
) t  
where ds=0;    
  
                                                   QUERY PLAN  
-----------------------------------------------------------------------------------------------------------------  
 Nested Loop  (cost=0.00..2252984.08 rows=499950 width=12)  
   Join Filter: ((a.id <> b.id) AND (CASE WHEN ((a.v <=> b.v) < '0.1'::double precision) THEN 0 ELSE 1 END = 0))  
   ->  Gather  (cost=0.00..229.67 rows=10000 width=140)  
         Workers Planned: 6  
         ->  Parallel Seq Scan on tbl a  (cost=0.00..229.67 rows=1667 width=140)  
   ->  Materialize  (cost=0.00..279.67 rows=10000 width=140)  
         ->  Gather  (cost=0.00..229.67 rows=10000 width=140)  
               Workers Planned: 6  
               ->  Parallel Seq Scan on tbl b  (cost=0.00..229.67 rows=1667 width=140)
 JIT:  
   Functions: 4  
   Options: Inlining true, Optimization true, Expressions true, Deforming true  
(12 rows)  
*/  

Tip: If you are a product manager or kernel developer, can you spot opportunities here for performance or usability improvements?

Why not use a vector index?

  1. Building a vector index is time-consuming.
  2. Vector indexes trade precision for query speed. For exact deduplication, a brute-force Cartesian comparison is required.
  3. Segmented parallelism exists to make full use of the machine's CPU resources.
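That said, if approximate deduplication is acceptable, an index can replace the full Cartesian product with per-row nearest-neighbor lookups. A sketch using pgvector's HNSW index (available in pgvector 0.5.0 and later; the threshold and limit are illustrative):

```sql
-- Index the vectors with the cosine-distance operator class.
create index on tbl using hnsw (v vector_cosine_ops);

-- For one record (here ID 1), fetch its approximate nearest neighbors;
-- rows within the 0.1 threshold are candidate duplicates.
select b.id, b.v <=> (select v from tbl where id = 1) as dist
from tbl b
where b.id <> 1
order by dist
limit 10;
```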
