Proxima-based vector processing - Hologres - Alibaba Cloud Documentation Center

Hologres supports vector processing and allows you to use vector data to represent the characteristics of unstructured data. The high-performance vector search feature helps you quickly query unstructured data. This topic describes the characteristics and benefits of vector processing in Hologres.

Background information

Proxima is a high-performance software library that is developed by Alibaba DAMO Academy. You can use Proxima to search for the nearest neighbors of vectors. Proxima provides higher stability and performance than similar open source software such as Facebook AI Similarity Search (Faiss). Proxima also provides basic modules that have industry-leading performance to help you search for images, videos, or human faces.

Hologres is deeply integrated with Proxima to provide the following benefits:

Powerful vector processing capabilities
- Timeliness: Hologres supports vector data writes and updates in real time. Data can be queried immediately after the data is written.
- Powerful query capability: Hologres supports vector data queries based on complex filter conditions. You can use vector indexes together with other structured indexes.
- High performance: Hologres supports high-throughput real-time vector data writes, efficient index building, and vector searches with high queries per second (QPS) and low latency.
- Low cost: Hologres uses the FLOAT2 data type to compress vector indexes. This helps reduce the vector storage cost.
Real-time data warehousing capabilities combined with vector processing
- Ease of use: Hologres allows you to use standard SQL statements to install and use Proxima.
- Transactions: In Hologres, a single transaction can contain multiple data definition language (DDL) statements or multiple data manipulation language (DML) statements.
- Binary logs: Hologres supports binary logs. You can subscribe to vector data change events.
- Multiple scenarios: Hologres supports three storage formats: row-oriented storage, column-oriented storage, and row-column hybrid storage. You can perform high-performance online analytical processing (OLAP), point queries of key-value pairs, and vector searches on a vector table at the same time.
Enterprise-class high availability capabilities combined with vector processing
- Primary/Secondary instance architecture: You can deploy one primary instance and multiple secondary instances that have shared storage resources and isolated computing resources. This helps achieve high availability of the vector processing service with read/write splitting and read/read splitting. For more information, see Configure read/write splitting for primary and secondary instances (shared storage).
- Virtual warehouse instance architecture: You can deploy multiple virtual warehouses that share storage resources. This architecture supports write/write splitting, which is not supported by the primary/secondary instance architecture. For more information, see Architecture of virtual warehouse instances.
Product ecosystem combined with vector processing
- Hologres is seamlessly integrated with MaxCompute to support accelerated queries on vector data in MaxCompute by using foreign tables and high-performance batch writes of vector data from MaxCompute.
- Hologres is natively integrated with Flink to support real-time writes and updates of large amounts of vector data, multiple scenarios that involve source tables, result tables, or dimension tables, and complex operations such as combination of multiple vector data streams.
- Hologres is deeply integrated with DataWorks to support vector data integration from various data sources and enterprise-class capabilities, such as data assets, data lineage, and data services.
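
The low-cost point above rests on storing each vector component in fewer bytes. The following minimal sketch illustrates the trade-off in plain Python. It assumes that FLOAT2 denotes a 2-byte half-precision float, which this topic does not state explicitly, and it does not use any Hologres API. The sample vector is made up for illustration.

```python
import struct

# Hypothetical 4-dimensional vector used only for illustration.
vector = [0.12, -0.5, 0.875, 3.14159]

# "f" packs IEEE 754 single precision (4 bytes per component),
# "e" packs IEEE 754 half precision (2 bytes per component).
as_float4 = struct.pack(f"<{len(vector)}f", *vector)
as_float2 = struct.pack(f"<{len(vector)}e", *vector)

# Decoding the 2-byte form shows the rounding error that the smaller encoding introduces.
restored = struct.unpack(f"<{len(vector)}e", as_float2)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(len(as_float4), len(as_float2), max_error)
```

The 2-byte encoding halves the storage size at the cost of a small rounding error per component, which is often acceptable for similarity search.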

Introduction to Proxima

Terms

Characteristic vector: A vector is the algebraic representation of an entity or an application. Vectors abstract the relationship between entities into distances in the vector space, and the distance indicates the degree of similarity. For example, height, age, gender, and region can each be encoded as a dimension of the characteristic vector that represents a person.
Vector search: fast search and match on a characteristic vector dataset. In most cases, K-nearest neighbors (KNN) and radius nearest neighbors (RNN) searches are involved.
KNN: searches for the K points that are closest to a point.
RNN: searches for all points that lie within a specified radius of a specified center point. In two dimensions, this region is a circle.
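
The difference between the two search types can be sketched with an exhaustive (brute-force) scan. The following example is illustrative only: the function names `knn_search` and `rnn_search` and the sample points are assumptions, not part of Proxima or Hologres, and Euclidean distance is chosen as one possible distance measure.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_search(points, query, k):
    """Return the k points closest to the query, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: euclidean(p, query))


def rnn_search(points, query, radius):
    """Return every point whose distance to the query is less than the radius."""
    return [p for p in points if euclidean(p, query) < radius]


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
query = (0.5, 0.0)

nearest = knn_search(points, query, k=3)        # always returns exactly k points
within = rnn_search(points, query, radius=1.5)  # result size depends on the data

print(nearest)
print(within)
```

A KNN search fixes the number of results, whereas an RNN search fixes the distance threshold and returns however many points fall inside it.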

Basic model of Proxima

The basic model of Proxima is divided into two parts: index building and online search.

Index building: An index file is built from the original vector data and then passed to the online search module for loading and use. Index building supports brute force, k-dimensional (k-d) trees, product quantization, KNN graphs, and locality-sensitive hashing (LSH).
Online search: After an index file is loaded, you can perform vector data queries, such as KNN and RNN searches, on the dataset that the index covers. You can configure the parameters that are used for the searches.
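
The two-part model above can be sketched as an offline step that writes an index file and an online step that loads it and answers queries. This is a simplified illustration, not Proxima's implementation: the JSON file format, the `build_index` function, and the `Searcher` class are hypothetical, and the "index" is the brute-force kind that stores the vectors unchanged.

```python
import json
import math
import os
import tempfile


def build_index(vectors, path):
    """Index building: write an index file from the original vector data.

    Hypothetical format: a brute-force index only needs the vectors themselves.
    """
    index = {"type": "brute_force", "dim": len(vectors[0]), "vectors": vectors}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


class Searcher:
    """Online search: load an index file and answer queries against it."""

    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            index = json.load(f)
        self.dim = index["dim"]
        self.vectors = index["vectors"]

    def _distance(self, a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn(self, query, k):
        """Return (row_id, distance) pairs for the k nearest vectors."""
        scored = [(i, self._distance(v, query)) for i, v in enumerate(self.vectors)]
        return sorted(scored, key=lambda pair: pair[1])[:k]


vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "vectors.index.json")
    build_index(vectors, index_path)   # offline phase
    searcher = Searcher(index_path)    # online phase: load the index file
    result = searcher.knn([0.0, 0.0], k=2)

print(result)
```

Separating the two phases means that a more elaborate index type, such as a graph or product quantization, could replace the brute-force file without changing how queries are issued.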

Mappings between concepts in Proxima and Hologres

Concept in Proxima	Concept in Hologres
Characteristic vector	Fixed-length arrays.
Vector index	Indexes of a special type. Only graph-based indexes in KNN and RNN searches are supported.
Distance calculation	`proxima_distance()`: a user-defined function (UDF) that is used for distance calculation. Each type of distance calculation method corresponds to a UDF.
KNN search	`order by distance(x, [x1, x2]) asc limit k`
RNN search	`where distance(x, [x1, x2]) < r`. Note: RNN searches do not support Proxima indexes.
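
The two query shapes in the table can be tried out with any SQL engine that supports scalar functions. The following sketch uses Python's built-in `sqlite3` module as a stand-in, not Hologres. The `distance` function registered here, the `items` table, and the JSON text encoding of vectors are all assumptions made for illustration, and no vector index is involved.

```python
import json
import sqlite3


def squared_euclidean(vec_json, query_json):
    """Scalar SQL function: squared Euclidean distance between two JSON arrays."""
    a, b = json.loads(vec_json), json.loads(query_json)
    return sum((x - y) ** 2 for x, y in zip(a, b))


conn = sqlite3.connect(":memory:")
# Register the Python function under the hypothetical SQL name "distance".
conn.create_function("distance", 2, squared_euclidean)
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, x TEXT)")
conn.executemany(
    "INSERT INTO items (id, x) VALUES (?, ?)",
    [(1, "[0.0, 0.0]"), (2, "[1.0, 1.0]"), (3, "[4.0, 4.0]")],
)

query = json.dumps([0.9, 1.2])

# KNN search: order by distance ascending and keep the first k rows.
knn_ids = [row[0] for row in conn.execute(
    "SELECT id FROM items ORDER BY distance(x, ?) ASC LIMIT 2", (query,))]

# RNN search: keep every row whose distance is below the threshold r.
rnn_ids = [row[0] for row in conn.execute(
    "SELECT id FROM items WHERE distance(x, ?) < 3.0 ORDER BY id", (query,))]

print(knn_ids, rnn_ids)
conn.close()
```

Here both queries scan every row. In Hologres, the KNN form is the one that the table maps to a vector index.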

References

For more information about how to use vector processing in Hologres, see User guide on vector processing.
For more information about the required memory specifications for high-performance vector searches, see Recommended instance specifications for vector processing.