This article seeks to address the following list of questions raised in the Q&A section of the Alibaba Cloud Community.
How can I delete similar data from a table?
For example, similar strings such as the following:
Follow the steps below to resolve the issues listed above.
1) Use the pg_trgm
word breaker provided by PostgreSQL to divide strings into multiple non-repeated tokens and determine the content similarity between two strings. We recommend using the word breaking method to divide the content into terms.
2) After dividing the content into terms, aggregate the terms and remove duplicate data.
3) Generate a Cartesian product (matrix) through self-association to calculate the similarity between each record and each other record. To calculate the similarity, simply divide the number of overlapping tokens by the number of tokens in the set after deduplication.
4) Remove unnecessary data based on similarity.
While dealing with huge data volumes, use a professional analytical programming language, such as PL/R.
First, install the PostgreSQL word breaker (AliCloudDB for PostgreSQL contains this plug-in. For more information, see the official manual.)
git clone https://github.com/jaiminpan/pg_jieba.git
mv pg_jieba $PGSRC/contrib/
export PATH=/home/digoal/pgsql9.5/bin:$PATH
cd $PGSRC/contrib/pg_jieba
make clean;make;make install
git clone https://github.com/jaiminpan/pg_scws.git
mv pg_jieba $PGSRC/contrib/
export PATH=/home/digoal/pgsql9.5/bin:$PATH
cd $PGSRC/contrib/pg_scws
make clean;make;make install
Create a plug-in as shown below.
psql
# create extension pg_jieba;
# create extension pg_scws;
Next, create a test case.
create table tdup1 (id int primary key, info text);
create extension pg_trgm;
insert into tdup1 values (1, '银屑病怎么治?');
insert into tdup1 values (2, '银屑病怎么治疗?');
insert into tdup1 values (3, '银屑病怎么治疗好?');
insert into tdup1 values (4, '银屑病怎么能治疗好?');
Now, choose one of the word breakers as shown below.
postgres=# select to_tsvector('jiebacfg', info),* from tdup1 ;
to_tsvector | id | info
---------------------+----+----------------------
'治':3 '银屑病':1 | 1 | 银屑病怎么治?
'治疗':3 '银屑病':1 | 2 | 银屑病怎么治疗?
'治疗':3 '银屑病':1 | 3 | 银屑病怎么治疗好?
'治疗':4 '银屑病':1 | 4 | 银屑病怎么能治疗好?
(4 rows)
postgres=# select to_tsvector('scwscfg', info),* from tdup1 ;
to_tsvector | id | info
-----------------------------------+----+----------------------
'治':2 '银屑病':1 | 1 | 银屑病怎么治?
'治疗':2 '银屑病':1 | 2 | 银屑病怎么治疗?
'好':3 '治疗':2 '银屑病':1 | 3 | 银屑病怎么治疗好?
'好':4 '治疗':3 '能':2 '银屑病':1 | 4 | 银屑病怎么能治疗好?
(4 rows)
Create the following three functions:
postgres=# create or replace function array_union(text[], text[]) returns text[] as $$
select array_agg(c1) from (select c1 from unnest($1||$2) t(c1) group by c1) t;
$$ language sql strict;
CREATE FUNCTION
postgres=# create or replace function array_dist(text[]) returns text[] as $$
select array_agg(c1) from (select c1 from unnest($1) t(c1) group by c1) t;
$$ language sql strict;
CREATE FUNCTION
postgres=# create or replace function array_share(text[], text[]) returns text[] as $$
select array_agg(unnest) from (select unnest($1) intersect select unnest($2) group by 1) t;
$$ language sql strict;
CREATE FUNCTION
The Cartesian result is as follows:
regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')
used to convert info into arrays
postgres=# with t(c1,c2,c3) as
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)
simulate from t t1,t t2) t;
t1c1 | t2c1 | t1c2 | t2c2 | t1c3 | t2c3 | simulate
------+------+----------------------+----------------------+-------------------+-------------------+----------
1 | 1 | 银屑病怎么治? | 银屑病怎么治? | {'银屑病','治'} | {'银屑病','治'} | 1.00
1 | 2 | 银屑病怎么治? | 银屑病怎么治疗? | {'银屑病','治'} | {'银屑病','治疗'} | 0.33
1 | 3 | 银屑病怎么治? | 银屑病怎么治疗好? | {'银屑病','治'} | {'银屑病','治疗'} | 0.33
1 | 4 | 银屑病怎么治? | 银屑病怎么能治疗好? | {'银屑病','治'} | {'银屑病','治疗'} | 0.33
2 | 1 | 银屑病怎么治疗? | 银屑病怎么治? | {'银屑病','治疗'} | {'银屑病','治'} | 0.33
2 | 2 | 银屑病怎么治疗? | 银屑病怎么治疗? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
2 | 3 | 银屑病怎么治疗? | 银屑病怎么治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
2 | 4 | 银屑病怎么治疗? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
3 | 1 | 银屑病怎么治疗好? | 银屑病怎么治? | {'银屑病','治疗'} | {'银屑病','治'} | 0.33
3 | 2 | 银屑病怎么治疗好? | 银屑病怎么治疗? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
3 | 3 | 银屑病怎么治疗好? | 银屑病怎么治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
3 | 4 | 银屑病怎么治疗好? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
4 | 1 | 银屑病怎么能治疗好? | 银屑病怎么治? | {'银屑病','治疗'} | {'银屑病','治'} | 0.33
4 | 2 | 银屑病怎么能治疗好? | 银屑病怎么治疗? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
4 | 3 | 银屑病怎么能治疗好? | 银屑病怎么治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
4 | 4 | 银屑病怎么能治疗好? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
(16 rows)
The preceding result is actually a matrix, where simulate indicates the similarity to be calculated.
During deduplication, we only need the data in the upper or lower part of the diagonal of the matrix. Therefore, add one more condition to determine the similarity.
postgres=# with t(c1,c2,c3) as
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t;
t1c1 | t2c1 | t1c2 | t2c2 | t1c3 | t2c3 | simulate
------+------+--------------------+----------------------+-------------------+-------------------+----------
1 | 2 | 银屑病怎么治? | 银屑病怎么治疗? | {'银屑病','治'} | {'银屑病','治疗'} | 0.33
1 | 3 | 银屑病怎么治? | 银屑病怎么治疗好? | {'银屑病','治'} | {'银屑病','治疗'} | 0.33
1 | 4 | 银屑病怎么治? | 银屑病怎么能治疗好? | {'银屑病','治'} | {'银屑病','治疗'} | 0.33
2 | 3 | 银屑病怎么治疗? | 银屑病怎么治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
2 | 4 | 银屑病怎么治疗? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
3 | 4 | 银屑病怎么治疗好? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
(6 rows)
First, specify the simulate. For example, data with a similarity greater than 0.5 needs to be deduplicated.
postgres=# with t(c1,c2,c3) as
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5;
t1c1 | t2c1 | t1c2 | t2c2 | t1c3 | t2c3 | simulate
------+------+--------------------+----------------------+-------------------+-------------------+----------
2 | 3 | 银屑病怎么治疗? | 银屑病怎么治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
2 | 4 | 银屑病怎么治疗? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
3 | 4 | 银屑病怎么治疗好? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
(3 rows)
Second, delete the records corresponding to the IDs in column t2c1.
delete from tdup1 where id in (with t(c1,c2,c3) as
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)
select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);
例如 :
postgres=# insert into tdup1 values (11, '白血病怎么治?');
INSERT 0 1
postgres=# insert into tdup1 values (22, '白血病怎么治疗?');
INSERT 0 1
postgres=# insert into tdup1 values (13, '白血病怎么治疗好?');
INSERT 0 1
postgres=# insert into tdup1 values (24, '白血病怎么能治疗好?');
INSERT 0 1
postgres=#
postgres=# with t(c1,c2,c3) as
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)
select * from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5;
t1c1 | t2c1 | t1c2 | t2c2 | t1c3 | t2c3 | simulate
------+------+--------------------+----------------------+-------------------+-------------------+----------
2 | 3 | 银屑病怎么治疗? | 银屑病怎么治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
2 | 4 | 银屑病怎么治疗? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
3 | 4 | 银屑病怎么治疗好? | 银屑病怎么能治疗好? | {'银屑病','治疗'} | {'银屑病','治疗'} | 1.00
22 | 24 | 白血病怎么治疗? | 白血病怎么能治疗好? | {'治疗','白血病'} | {'治疗','白血病'} | 1.00
13 | 22 | 白血病怎么治疗好? | 白血病怎么治疗? | {'治疗','白血病'} | {'治疗','白血病'} | 1.00
13 | 24 | 白血病怎么治疗好? | 白血病怎么能治疗好? | {'治疗','白血病'} | {'治疗','白血病'} | 1.00
(6 rows)
postgres=# begin;
BEGIN
postgres=# delete from tdup1 where id in (with t(c1,c2,c3) as
postgres(# (select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)
postgres(# select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)
postgres(# simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);
DELETE 4
postgres=# select * from tdup1 ;
id | info
----+--------------------
1 | 银屑病怎么治?
2 | 银屑病怎么治疗?
11 | 白血病怎么治?
13 | 白血病怎么治疗好?
(4 rows)
Use a database to solve the problems that you may encounter because the JOIN filter is <> and <, while the hashjoin is not used.
This operation usually takes longer for large amounts of data.
postgres=# explain delete from tdup1 where id in (with t(c1,c2,c3) as
(select id,info,array_dist(regexp_split_to_array((regexp_replace(to_tsvector('jiebacfg',info)::text,'(:\d+)', '', 'g')),' ')) from tdup1)
select t2c1 from (select t1.c1 t1c1,t2.c1 t2c1,t1.c2 t1c2,t2.c2 t2c2,t1.c3 t1c3,t2.c3 t2c3,round(array_length(array_share(t1.c3,t2.c3),1)::numeric/array_length(array_union(t1.c3,t2.c3),1),2)
simulate from t t1,t t2 where t1.c1<>t2.c1 and t1.c1<t2.c1) t where simulate>0.5);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Delete on tdup1 (cost=10005260133.58..10005260215.84 rows=2555 width=34)
-> Hash Join (cost=10005260133.58..10005260215.84 rows=2555 width=34)
Hash Cond: (tdup1.id = "ANY_subquery".t2c1)
-> Seq Scan on tdup1 (cost=0.00..61.10 rows=5110 width=10)
-> Hash (cost=10005260131.08..10005260131.08 rows=200 width=32)
-> HashAggregate (cost=10005260129.08..10005260131.08 rows=200 width=32)
Group Key: "ANY_subquery".t2c1
-> Subquery Scan on "ANY_subquery" (cost=10000002667.20..10005252911.99 rows=2886838 width=32)
-> Subquery Scan on t (cost=10000002667.20..10005224043.61 rows=2886838 width=4)
Filter: (t.simulate > 0.5)
CTE t
-> Seq Scan on tdup1 tdup1_1 (cost=0.00..2667.20 rows=5110 width=36)
-> Nested Loop (cost=10000000000.00..10005113119.99 rows=8660513 width=68)
Join Filter: ((t1.c1 <> t2.c1) AND (t1.c1 < t2.c1))
-> CTE Scan on t t1 (cost=0.00..102.20 rows=5110 width=36)
-> CTE Scan on t t2 (cost=0.00..102.20 rows=5110 width=36)
(16 rows)
A better way is to use PL/R or R for matrix calculation and then filtering the result. You can also use an MPP database. For example, Greenplum processes a large amount of data with R and MADlib.
The article illustrates how to retrieve and deduplicate similar text in PostgreSQL by using examples. It applies Word Breaking functionality of PG to answer the questions regarding the deduplication of similar text. The other feature of PG, Window Query, is not used in this example. However, if your data does not have a primary key, use ctid and row_number to locate a unique record.
digoal - February 3, 2020
zhuodao - July 30, 2020
digoal - February 3, 2020
harold c - July 24, 2023
Data Driven Investor - October 11, 2019
digoal - July 4, 2019
Alibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreAn online MPP warehousing service based on the Greenplum Database open source program
Learn MoreReach global users more accurately and efficiently via IM Channel
Learn MoreTranscode multimedia data into media files in various resolutions, bitrates, and formats that are suitable for playback on PCs, TVs, and mobile devices.
Learn MoreMore Posts by digoal