All Products
Search
Document Center

OpenSearch:Introduction to vectors

Last Updated:Nov 28, 2024

This topic describes the vector models that are supported by OpenSearch Vector Search Edition.

Vector-based retrieval

In the information age, information may consist of multiple modalities, such as text, image, video, and audio. Multimodal information can present more information than text, such as color, shape, motion state, sound, and spatial relationship.

image.png

The modalities of information in various fields have greatly changed.

image.png

Multimodal information is classified into two categories: structured data and unstructured data.

Unstructured data is difficult for computers to understand. The traditional query method by analyzing text terms cannot meet the search requirements of various fields. To resolve the issue, OpenSearch Vector Search Edition provides the vector-based retrieval feature.

This topic describes vectors and how to perform vector-based retrieval.

Unstructured data generated in the physical world can be transformed into structured multidimensional vectors that identify relationships between entities.

Then, the distance between the vectors is calculated. In most cases, the closer the distance, the higher the vector similarity. The top N results with the highest vector similarity are retrieved. This implements vector-based retrieval.

image.png

Vector retrieval algorithms

Linear algorithm

The linear algorithm linearly processes all vector data.

Scenario: You can use this algorithm if you require a recall rate of 100%.

Disadvantage: This algorithm delivers low efficiency and requires high CPU utilization and memory usage when it processes large volumes of data.

Clustering algorithms

Quantized clustering

Overview

image.png

Quantized clustering is a vector retrieval algorithm developed by Alibaba Cloud based on K-means clustering. This algorithm seeks n centroids of the vector documents, and clusters each document to the centroid closest to the document. This way, n posting lists are built. The n centroids and the corresponding posting lists are the main content of quantized clustering indexing. During the retrieval, OpenSearch calculates the distance between several centroids and the request vector, and selects multiple centroids closest to the request vector. Then, OpenSearch calculates the vector distance for the documents within the posting lists that correspond to the selected centroids and returns the top K documents closest to the request vector.

Quantized clustering supports the FP16 and INT8 quantization feature, which can reduce the index size and improve the retrieval performance. However, this feature may compromise the retrieved results. You can enable or disable the FP16 and INT8 quantization feature based on your business requirements.

The quantized clustering algorithm is suitable for real-time data scenarios or scenarios in which you use GPU resources and require low latency.

Parameter tuning:

The following figure shows the performance of different vector algorithms to process different volumes of data.

image.png

HNSW

Overview

image.png

Hierarchical navigable small world (HNSW) is a vector retrieval algorithm based on proximity graphs. This algorithm requires you to create a proximity graph and build edges between vectors that are close to each other. The vectors and the edges are the main content of HNSW indexing. During the retrieval, OpenSearch starts the traversal from the entry vector, calculates the distance between the request vector and each adjacent vector of the entry vector, and selects the nearest adjacent vector as the next vector for traversal. OpenSearch traverses the vectors based on the preceding rule until the convergence appears. Convergence refers to that OpenSearch finds an adjacent vector of the current vector closer to the request vector than all other retrieved adjacent vectors.

To accelerate convergence, HNSW builds a multi-layer proximity graph structure based on the logic of skip list query. The retrieval proceeds from top to bottom. The retrieval logic of each layer is similar except for Layer 0. If OpenSearch traverses to a vector in Layer K when the convergence appears, the vector is considered the entry vector of Layer K-1, and OpenSearch continues to traverse Layer K-1. Compared to Layer K-1, Layer K contains sparser nodes between which the distance is longer. In this case, OpenSearch traverses vectors in Layer K with a larger step size and a higher iteration speed than in Layer K-1.

Parameter tuning:

QGraph

Overview

QGraph (Quantized Graph) is an improved algorithm based on HNSW developed by OpenSearch. During the index construction process, it automatically quantizes the user's original data and then constructs a graph index. Compared to the HNSW algorithm, the QGraph algorithm can effectively reduce index size and save memory overhead, with the potential to reduce the index to as low as 1/8 of its original size. Moreover, with optimizations in CPU instructions for integer calculations, QGraph's performance can be several times better than that of HNSW. However, due to the reduction in distinguishability of vectors after quantization, the recall rate of QGraph is lower compared to HNSW, but in real scenarios, this impact can be mitigated by increasing the number of recalls.

QGraph supports quantization of data at int4, int8, and int16 levels. The different levels of quantization have varying effects on memory consumption of vector indices, query performance, and recall rate. Generally, as the integer bit-width decreases, the memory occupied by the vector index becomes smaller, performance increases, and recall rate decreases. Notably, int16 quantization, due to issues with the underlying CPU instruction set, has performance and recall rates nearly identical to those of unquantized data.

Parameter tuning:

QGraph (Quantized Graph) configuration

Vector distance types

The process of vector-based retrieval is to calculate the similarity between vectors and return the top K vectors based on the similarity. The similarity between vectors is obtained by calculating the distance between vectors. In most cases, the smaller the score, the closer the vector distance. The larger the score, the further the vector distance.

image.png

In different vector spaces, different distance metrics are defined to calculate the distance between vectors. OpenSearch supports the following distance metrics: squared Euclidean distance and inner product.

Squared Euclidean distance

image.png

Squared Euclidean distance refers to the distance between two vectors on the plane. As one of the most common distance metrics, the squared Euclidean distance can be obtained by calculating the square root of the sum of the squares of the coordinate differences between two vectors. The smaller the squared Euclidean distance, the higher the vector similarity. The squared Euclidean distance is calculated based on the following formula:

image.png

Inner product

image.png

The inner product is the dot product or scalar product between two vectors. The larger the inner product, the higher the vector similarity. The inner product can be calculated by multiplying the corresponding elements of the two vectors and summing the products. This distance metric is commonly used in search recommendation scenarios. Generally, whether to calculate the inner product depends on whether the algorithm has an inner product model. The inner product is calculated based on the following formula:

image.png

Comparison among vector retrieval algorithms

Vector retrieval algorithm

Advantage

Disadvantage

Scenario

Quantized clustering

Low CPU utilization and memory usage.

  • Lower recall rate than HNSW.

  • Lower query speed than HNSW.

The quantized clustering algorithm is suitable for scenarios in which hundreds of millions of data records need to be processed and high data accuracy and low query latency are not required.

HNSW

High recall rate and high query speed.

High CPU utilization and memory usage.

The HNSW algorithm is suitable for scenarios in which tens of millions of data records need to be processed and high data accuracy and low query latency are required.

linear

Recall rate of 100%.

  • Low query speed.

  • Higher CPU utilization and memory usage than the quantized clustering and HNSW algorithms.

The linear algorithm is suitable for scenarios in which tens of thousands of data records need to be processed.

QGraph

Low CPU utilization and memory usage. High speed and performance.

Lower recall rate than HNSW.

The QGraph algorithm is suitable for scenarios in which billions of data records need to be processed. High query speed and performance are required and high accuracy is not required.