Multiple image processing business scenarios, such as image search, video deduplication, image beautification, and image deduplication, encounter the common challenge of managing the humongous volume of duplicate images and videos stored on the server.
For instance, users may upload many videos where specific video has different versions with different resolutions, audio tracks, or compression ratios. This leads to a large number of duplicate videos saved on the server. Another example is the identification of porn videos or images. It is crucial to identify such content but the challenge is how to identify porn videos and images.
On the other hand, image search is the second most common search engine after text search. Image search engines, such as Alibaba Cloud Image Search, allow you to upload a particular picture to the interface and search out a bunch of similar pictures.
This article describes how to screen out duplicate videos and how to implement an efficient image search using PostgreSQL. With the omnipotent API of PostgreSQL, it is easy to extend its image search function.
The PostgreSQL image search plug-in uses the mainstream Haar wavelet technology to store images after conversion. Take a look at its Wikipedia page to learn more about this technology.
You can also refer to this document to know more about how to search similar images in PostgreSQL.
Following are some examples to illustrate the Haar wavelet technology application.
Follow the steps below to install the PostgreSQL Image Search Plug-in:
Step 1. Add dependency on gd.h.
yum install -y gd-devel
Step 2. Download and install imgsmlr.
$ git clone https://github.com/postgrespro/imgsmlr
$ cd imgsmlr
$ export PGHOME=/home/digoal/pgsql9.5
$ export PATH=$PGHOME/bin:$PATH:.
$ make USE_PGXS=1
$ make USE_PGXS=1 install
Step 3. Install the plug-in.
$ psql
psql (9.5.3)
Type "help" for help.
postgres=# create extension imgsmlr;
CREATE EXTENSION
The following are two newly added data types in imgsmlr.
Datatype | Storage length | Description |
---|---|---|
pattern | 16388 bytes | Result of Haar wavelet transform on the image |
signature | 64 bytes | Short representation of pattern for fast search using GiST indexes |
The KNN operator and gist index method that supports the pattern and signature types can be used for similar image search.
Operator | Left type | Right type | Return type | Description |
---|---|---|---|---|
<-> | pattern | pattern | float8 | Eucledian distance between two patterns |
<-> | signature | signature | float8 | Eucledian distance between two signatures |
The following table shows several newly added functions. Now, you can convert binary images to the pattern type and convert the data stored in the pattern into the signature type.
Function | Return type | Description |
---|---|---|
jpeg2pattern(bytea) | pattern | Convert jpeg image into pattern |
png2pattern(bytea) | pattern | Convert png image into pattern |
gif2pattern(bytea) | pattern | Convert gif image into pattern |
pattern2signature(pattern) | signature | Create signature from pattern |
shuffle_pattern(pattern) | pattern | Shuffle pattern for less sensitivity to image shift |
Follow the steps listed below to test the PostgreSQL Image Search Plug-in:
Step 1. Import some images as shown below (the more the better).
Step 2. Now, create the image table.
create table image (id serial, data bytea);
Step 3. Import the images to the database.
insert into image(data) select pg_read_binary_file('文件路径');
Step 4. Next, convert the images to the pattern and signature types.
CREATE TABLE pat AS (
SELECT
id,
shuffle_pattern(pattern) AS pattern,
pattern2signature(pattern) AS signature
FROM (
SELECT
id,
jpeg2pattern(data) AS pattern
FROM
image
) x
);
Step 5. Create indexes as shown below.
ALTER TABLE pat ADD PRIMARY KEY (id);
CREATE INDEX pat_signature_idx ON pat USING gist (signature);
Step 6. Now you can run an approximation query, such as querying images that are similar to id = : id images and retrieve the top 10 items on the similarity ranking list.
SELECT
id,
smlr
FROM
(
SELECT
id,
pattern <-> (SELECT pattern FROM pat WHERE id = :id) AS smlr
FROM pat
WHERE id <> :id
ORDER BY
signature <-> (SELECT signature FROM pat WHERE id = :id)
LIMIT 100
) x
ORDER BY x.smlr ASC
LIMIT 10
Here, you can use KNN indexing to quickly get output based on the similarity rankings.
Let's take a look at the following examples.
For video deduplication, extract keyframes in a video to generate the Cartesian product through self-correlation. Calculate the similarity between any two images of different videos, and when the similarity reaches a certain threshold value, the two videos are deemed the same.
Refer to the following example for better understanding.
创建图片表,并将所有视频的关键帧导入表中
create table image (id serial8 primary key, movie_id int, data bytea);
导入图片,假设为jpeg格式
... 略 ...
生成patten 和 signature
CREATE TABLE pat AS (
SELECT
id, movie_id,
shuffle_pattern(pattern) AS pattern,
pattern2signature(pattern) AS signature
FROM (
SELECT
id, movie_id,
jpeg2pattern(data) AS pattern
FROM
image
) x
);
计算不同视频的相似度
select t1.movie_id, t1.id, t1.signature<->t2.signature from
pat t1 join pat t2 on (t1.movie_id<>t2.movie_id)
order by t1.signature<->t2.signature desc
or
select t1.movie_id, t1.id, t1.signature<->t2.signature from
pat t1 join pat t2 on (t1.movie_id<>t2.movie_id)
where t1.signature<->t2.signature > 0.9
order by t1.signature<->t2.signature desc
In short, PostgreSQL is a very powerful database with highly customizable functions and does not require you to make any changes to the PostgreSQL kernel. Therefore, it is safe and reliable. The image search technology is an example of PostgreSQL feature extension and offers extremely high speed.
Build your own PostgreSQL solution on Alibaba Cloud with ApsaraDB for RDS PostgreSQL.
Keyword Analysis with PostgreSQL: Cosine and Linear Correlation Algorithms for Text Analysis
Alibaba Clouder - June 1, 2018
digoal - February 3, 2020
Apache Flink Community China - July 27, 2021
digoal - September 12, 2019
digoal - February 3, 2020
ApsaraDB - November 16, 2020
An intelligent image search service with product search and generic search features to help users resolve image search requests.
Learn MoreThis technology can assist realizing quantitative analysis, speeding up CT image analytics, avoiding errors caused by fatigue and adjusting treatment plans in time.
Learn MoreOffline SDKs for visual production, such as image segmentation, video segmentation, and character recognition, based on deep learning technologies developed by Alibaba Cloud.
Learn MoreAlibaba Cloud PolarDB for PostgreSQL is an in-house relational database service 100% compatible with PostgreSQL and highly compatible with the Oracle syntax.
Learn MoreMore Posts by digoal