Image processing technology, such as image search, has many real-world applications. For example, Internet users may upload multiple versions of the same video or image, each with different formatting, audio tracks, or compression ratios. This leaves a large number of near-duplicate files stored on the server side. Data de-duplication can solve this problem. But how is it normally done?
When you use a search engine to look for images, it processes both the image and the tags related to it. For example, a search for a "snowman" image may return the following result.
Pretty accurate, right? PostgreSQL can be used to implement this kind of image search, because its extension Application Programming Interface (API) lets a plug-in add image search functions to the database.
PostgreSQL’s image search plug-in adopts the mainstream Haar wavelet technology to convert and store an image. The following figures briefly describe the Haar wavelet technology. For additional details, refer to the following Wikipedia link: https://en.wikipedia.org/wiki/Haar_wavelet
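To make the idea concrete, here is a minimal Python sketch of a Haar transform. It uses the simple unnormalised variant (pairwise averages and half-differences) and is only an illustration of the technique, not the code the plug-in runs; the function names are made up for this example.

```python
def haar_step(values):
    """One level of the unnormalised 1-D Haar transform.

    Returns the pairwise averages followed by the pairwise
    half-differences. Assumes len(values) is even.
    """
    pairs = list(zip(values[0::2], values[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages + details


def haar_1d(values):
    """Full 1-D Haar transform; len(values) must be a power of two."""
    out = list(values)
    n = len(out)
    while n > 1:
        # Re-transform only the averages produced by the previous level.
        out[:n] = haar_step(out[:n])
        n //= 2
    return out


def haar_2d_level(image):
    """One level of the 2-D transform: every row first, then every column."""
    rows = [haar_step(row) for row in image]
    cols = [haar_step(list(col)) for col in zip(*rows)]
    # Transpose back so the result has the same orientation as the input.
    return [list(row) for row in zip(*cols)]


print(haar_1d([9, 7, 3, 5]))             # [6.0, 2.0, 1.0, -1.0]
print(haar_2d_level([[4, 2], [2, 0]]))   # [[2.0, 1.0], [1.0, 0.0]]
```

The first coefficient is the overall average of the input, and the remaining coefficients describe progressively finer differences. That is why a small set of coefficients can summarise how an image looks.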
Below are the steps to install the PostgreSQL image search plug-in:
# yum install -y gd-devel
$ git clone https://github.com/postgrespro/imgsmlr
$ cd imgsmlr
$ export PGHOME=/home/digoal/pgsql9.5
$ export PATH=$PGHOME/bin:$PATH:.
$ make USE_PGXS=1
$ make USE_PGXS=1 install
$ psql
psql (9.5.3)
Type "help" for help.
postgres=# create extension imgsmlr;
CREATE EXTENSION
The plug-in provides two data types:
Data Type | Storage Length | Description |
---|---|---|
pattern | 16388 bytes | Result of the Haar wavelet transform on the image |
signature | 64 bytes | Short representation of a pattern for fast search using GiST indexes |
Both data types support the <-> operator:
Operator | Left Type | Right Type | Return Type | Description |
---|---|---|---|---|
<-> | pattern | pattern | float8 | Euclidean distance between two patterns |
<-> | signature | signature | float8 | Euclidean distance between two signatures |
The plug-in also adds several functions:
Function | Return Type | Description |
---|---|---|
jpeg2pattern(bytea) | pattern | Convert jpeg image to pattern |
png2pattern(bytea) | pattern | Convert png image to pattern |
gif2pattern(bytea) | pattern | Convert gif image to pattern |
pattern2signature(pattern) | signature | Create signature from pattern |
shuffle_pattern(pattern) | pattern | Shuffle pattern for less sensitivity to image shift |
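The <-> operator listed above returns a Euclidean distance, so a smaller value means two images are more alike and 0 means their patterns or signatures are identical. As a rough illustration, the following Python sketch computes that distance for two plain lists of numbers. The helper name is hypothetical, and the real pattern and signature types are opaque binary values inside PostgreSQL.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(euclidean_distance([1, 2], [1, 2]))   # 0.0
```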
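Once the plug-in is installed, you can test it as follows. The example assumes an image table that stores each JPEG image as bytea in a data column. First, build a pat table that holds the pattern and signature of each image, then add a primary key and a GiST index on the signature: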
CREATE TABLE pat AS (
    SELECT
        id,
        shuffle_pattern(pattern) AS pattern,
        pattern2signature(pattern) AS signature
    FROM (
        SELECT
            id,
            jpeg2pattern(data) AS pattern
        FROM
            image
    ) x
);
ALTER TABLE pat ADD PRIMARY KEY (id);
CREATE INDEX pat_signature_idx ON pat USING gist (signature);
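Then search for the 10 images most similar to a given image, where :id is a placeholder for the ID of the query image:
SELECT
    id,
    smlr
FROM
(
    -- Shortlist the 100 nearest images by signature distance
    SELECT
        id,
        pattern <-> (SELECT pattern FROM pat WHERE id = :id) AS smlr
    FROM pat
    WHERE id <> :id
    ORDER BY
        signature <-> (SELECT signature FROM pat WHERE id = :id)
    LIMIT 100
) x
-- Re-rank the shortlist by the more precise pattern distance
ORDER BY x.smlr ASC
LIMIT 10;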
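The query above works in two stages: it shortlists the 100 nearest images by the compact signature, then re-ranks that shortlist by the larger pattern. The Python sketch below mirrors that logic on toy data. The dictionary layout, the function name, and the tiny vectors are assumptions made for illustration; in PostgreSQL the first stage is the part a GiST index on the signature column can speed up.

```python
import math


def find_similar(images, query_id, shortlist=100, top=10):
    """Two-stage nearest-neighbour search over a dict of image records.

    images maps an id to {"signature": [...], "pattern": [...]}.
    Returns up to `top` (id, pattern_distance) pairs, closest first.
    """
    query = images[query_id]
    others = [(i, img) for i, img in images.items() if i != query_id]

    # Stage 1: cheap filter on the short signature vectors.
    others.sort(key=lambda item: math.dist(item[1]["signature"], query["signature"]))
    candidates = others[:shortlist]

    # Stage 2: precise ranking of the shortlist on the full patterns.
    ranked = sorted(
        ((i, math.dist(img["pattern"], query["pattern"])) for i, img in candidates),
        key=lambda pair: pair[1],
    )
    return ranked[:top]


images = {
    1: {"signature": [0.0, 0.0], "pattern": [0.0, 0.0, 0.0, 0.0]},
    2: {"signature": [0.1, 0.0], "pattern": [0.0, 0.0, 0.0, 2.0]},
    3: {"signature": [0.2, 0.0], "pattern": [0.0, 0.0, 0.0, 1.0]},
    4: {"signature": [5.0, 5.0], "pattern": [9.0, 9.0, 9.0, 9.0]},
}
print(find_similar(images, 1, shortlist=2, top=2))   # [(3, 1.0), (2, 2.0)]
```

Image 2 has the closest signature, but image 3 has the closest pattern, so the second stage moves it to the top. Image 4 never reaches the second stage because its signature is too far away.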
For the most part, our search engine works as expected.
However, sometimes the image search does not work too well.
This is because a computer "sees" images differently from humans. It processes an image as a 2D matrix and transforms it into a signature that it can compare numerically.
For video de-duplication, you can extract the key frames of each video, store them as images, and join the frame table with itself to generate the Cartesian product of frames. Compare only frames that belong to different videos. When two frames are close enough, that is, when their distance falls within a chosen threshold, the service can treat the two videos as the same.
Example:
CREATE TABLE pat AS (
    SELECT
        id, movie_id,
        shuffle_pattern(pattern) AS pattern,
        pattern2signature(pattern) AS signature
    FROM (
        SELECT
            id, movie_id,
            jpeg2pattern(data) AS pattern
        FROM
            image
    ) x
);
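-- <-> returns a distance, so the most similar frame pairs come first in ascending order
SELECT t1.movie_id, t1.id, t2.movie_id, t2.id, t1.signature <-> t2.signature AS dist
FROM pat t1 JOIN pat t2 ON (t1.movie_id <> t2.movie_id)
ORDER BY dist ASC;
or
-- Keep only frame pairs within a distance threshold (0.9 is an example value; tune it for your data)
SELECT t1.movie_id, t1.id, t2.movie_id, t2.id, t1.signature <-> t2.signature AS dist
FROM pat t1 JOIN pat t2 ON (t1.movie_id <> t2.movie_id)
WHERE t1.signature <-> t2.signature < 0.9
ORDER BY dist ASC;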
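Because <-> measures distance, near-duplicate frames are the pairs with the smallest values. The following Python sketch shows the same thresholding idea on in-memory data: it flags two videos as duplicates as soon as any pair of their key frames is close enough. The tuple layout, the function name, and the threshold value are illustrative assumptions; a production rule might require several matching frames rather than one.

```python
import math
from itertools import combinations


def duplicate_movie_pairs(frames, max_distance):
    """Return sorted (movie_a, movie_b) pairs that share a near-identical frame.

    frames is a list of (movie_id, frame_id, signature) tuples.
    """
    pairs = set()
    for (m1, _, s1), (m2, _, s2) in combinations(frames, 2):
        # Only compare frames that come from different videos.
        if m1 != m2 and math.dist(s1, s2) <= max_distance:
            pairs.add((min(m1, m2), max(m1, m2)))
    return sorted(pairs)


frames = [
    ("a", 1, [0.0, 0.0]),
    ("a", 2, [1.0, 1.0]),
    ("b", 1, [0.0, 0.1]),
    ("c", 1, [9.0, 9.0]),
]
print(duplicate_movie_pairs(frames, max_distance=0.5))   # [('a', 'b')]
```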
PostgreSQL is a powerful database with customizable functions. Extended with the imgsmlr plug-in, it can store the Haar wavelet patterns and signatures of images and search them by similarity through a GiST index. The same approach supports image de-duplication and, by comparing key frames, video de-duplication. Because the functions run inside the database, they combine with ordinary SQL, and the plug-in takes only a few steps to install and try out.