By Gongchang
Retrieval-augmented generation (RAG) integrates retrieval and generation techniques, using external knowledge bases to improve the accuracy and richness of the outputs of large language models (LLMs). The RAG framework relies on two primary components: an embedding model, which converts text into vector representations, and a reranker model, which refines the relevance ranking of retrieved documents. Alibaba's Tongyi Lab has introduced the GTE-Multilingual (mGTE) series, which offers high performance, long-context handling, multilingual support, and elastic embedding, significantly improving retrieval and ranking efficiency. The models have achieved strong results across multiple evaluation datasets and are suitable for a wide range of advanced RAG applications.
RAG is becoming a dominant paradigm in LLM applications. By combining the strengths of retrieval and text generation, this architecture allows LLMs to leverage external knowledge bases to produce more accurate and detailed answers. RAG mitigates common issues in LLMs, such as factual errors and privacy concerns, and improves real-time response capabilities.
The embedding model and the reranker model play vital roles in RAG implementations. Both models retrieve documents relevant to the user's query, but they achieve this in different ways. The embedding model computes a vector representation for each piece of text and uses similarity measures, such as cosine similarity, to estimate the relevance between vectors. The vector representations of all documents can be pre-computed offline. At runtime, only the query's vector needs to be computed, allowing relevant documents to be retrieved quickly with efficient vector search engines. The reranker model takes text pairs as input and performs a more detailed computation to produce relevance scores, resulting in more accurate ranking. However, due to its high computational cost, it is typically applied only to a small set of candidate documents.
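To make this division of labor concrete, the following is a minimal sketch of a retrieve-then-rerank pipeline. It uses NumPy only; the random document vectors and the rerank_score placeholder are illustrative assumptions, not calls to the actual mGTE models.

import numpy as np

# Document embeddings are pre-computed offline and stored (unit-normalized here).
doc_texts = ["document %d" % i for i in range(1000)]
doc_embeddings = np.random.rand(1000, 768).astype(np.float32)
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def rerank_score(query: str, doc: str) -> float:
    # Placeholder for a cross-encoder call such as gte-multilingual-reranker-base.
    return float(len(set(query.split()) & set(doc.split())))

def retrieve_top_k(query_embedding: np.ndarray, k: int = 100) -> np.ndarray:
    """Stage 1: fast vector search via cosine similarity (dot product of unit vectors)."""
    query_embedding = query_embedding / np.linalg.norm(query_embedding)
    scores = doc_embeddings @ query_embedding        # one similarity score per document
    return np.argsort(-scores)[:k]                   # indices of the k most similar documents

def rerank(query: str, candidate_ids: np.ndarray, top_n: int = 10):
    """Stage 2: the reranker scores each (query, document) pair and reorders the candidates."""
    scored = [(int(i), rerank_score(query, doc_texts[i])) for i in candidate_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

candidates = retrieve_top_k(np.random.rand(768).astype(np.float32))
top_results = rerank("what is the capital of China?", candidates)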
As RAG evolves, system capabilities continue to improve. Early RAG systems focused primarily on vector-based retrieval, but more advanced modules are now being integrated, enhancing overall performance. As applications expand and LLMs handle longer contexts better, the need for multilingual information retrieval, cross-lingual information search, and long-context processing has become increasingly urgent.
Previously, Tongyi Lab released the general text embedding (GTE) series, featuring models based on BERT and LLM embeddings from the Qwen series, such as gte-Qwen2-1.5B-instruct and gte-Qwen2-7B-instruct. Encoder-only models with bidirectional attention currently outperform decoder-only models of similar size in both retrieval and ranking. However, encoder-only models still face challenges from the BERT era, such as a 512-token context limit and insufficient pretraining data. To address these issues, the GTE team developed a new encoder-only model from scratch with extended context support and multilingual capabilities. This led to the launch of the mGTE series, featuring the following advantages:
• High performance: The mGTE models consistently outperform similarly sized open source models across multiple datasets.
• Long-context handling: Both the embedding and reranker models handle inputs of up to 8,000 tokens and can extend the context length further using methods such as NTK-aware RoPE scaling (NTK-RoPE).
• Multilingual support: The models support 75 languages, covering all major languages available in modern LLMs.
• Elastic embedding: The models offer adjustable vector dimensionality from 128 to 768, balancing performance against storage cost. A 128-dimensional embedding requires only one-sixth of the storage of a 768-dimensional embedding, with less than 2% loss in retrieval performance (see the short sketch after this list).
• Sparse embedding: The models generate sparse embeddings by assigning weights to individual words, enabling precise matching when needed.
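As a quick illustration of the elastic embedding feature, the snippet below truncates a full 768-dimensional vector to its first 128 dimensions and re-normalizes it for cosine similarity. This is a minimal sketch; the random vector stands in for a real mGTE embedding.

import numpy as np

full = np.random.rand(768).astype(np.float32)
full /= np.linalg.norm(full)                          # unit-normalized 768-dimensional embedding

truncated = full[:128] / np.linalg.norm(full[:128])   # keep the first 128 dimensions, re-normalize

# Storage per vector: 768 vs. 128 float32 values, i.e. one-sixth of the space.
print(full.nbytes, truncated.nbytes)                  # 3072 bytes vs. 512 bytes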
Figure 1: Architectures of the embedding and reranker models
The process of building the mGTE models is shown in Figure 2. First, a multilingual encoder-only base model, called GTE-base-multilingual, is trained to handle long-context inputs. On top of this base, two additional models are trained: gte-multilingual-base for embeddings and gte-multilingual-reranker-base for ranking.
Figure 2: Model training process
To enhance multilingual capabilities and performance with long texts, several improvements are applied to the original BERT architecture, using techniques that are commonly used in decoder-only LLMs. The architecture is illustrated in Figure 3, with key changes including:
• Rotary Position Encoding (RoPE): RoPE replaces BERT's absolute position embeddings to better support long-context training and allow context length extension.
• Gated Linear Units (GLU): GLU replaces the original feed-forward network (FFN) in BERT. GLU is a proven technique in LLMs to improve training stability and efficiency.
The mGTE models also adopt the XLM-RoBERTa vocabulary for effective multilingual and long-context processing.
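To make these architectural changes concrete, here is a simplified PyTorch sketch of a GLU-style feed-forward block (a GELU-gated variant) and of the rotary rotation that RoPE applies to query/key vectors. The dimensions and the specific GLU/RoPE variants are illustrative assumptions, not the exact mGTE implementation.

import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """GLU-style FFN: the hidden activation is gated by a second linear projection."""
    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden)
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.gate(x)) * self.up(x))

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs by a position-dependent angle (rotary position encoding)."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: rotate the query vectors of an 8-token sequence, then run the FFN block.
hidden = torch.randn(8, 768)
rotated_queries = apply_rope(hidden)
ffn_output = GLUFeedForward()(hidden)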
Figure 3: GTE base model architecture
The base model is trained on publicly available multilingual datasets, including C4, Skypile, mC4, CulturaX, Wikipedia, and books. After filtering, cleaning, and sampling, the final dataset contains 1,028B tokens (using the XLM-R tokenizer). The language distribution is shown in the figure below, with languages comprising less than 1% grouped under "Others".
The pre-training of mGTE models uses a Masked Language Modeling (MLM) loss, similar to standard encoder-only models, but with several optimizations:
1. Data sampling: To ensure each batch contains data from a single language, languages are sampled according to a smoothed distribution. The sampling probability for language i is p_i = n_i^α / Σ_j n_j^α, where n_i is the amount of data for that language and α < 1 is a smoothing exponent that up-weights low-resource languages. A short sketch of this computation follows the list.
2. Multi-stage pre-training: To increase efficiency, training first uses inputs truncated to 2,000 tokens and is then extended to 8,000 tokens, with the RoPE base parameter increased from 10,000 to 160,000.
3. Unpadding technique: The length of documents used for training varies. To avoid unnecessary computations on padding tokens, unpadding is applied. This technique improves training efficiency and is supported by common libraries such as flash-attention and xFormers.
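Here is a short sketch of the language-sampling computation in step 1. The token counts are made up and the smoothing exponent alpha = 0.3 is an assumption.

token_counts = {"en": 400e9, "zh": 150e9, "de": 60e9, "sw": 2e9}
alpha = 0.3   # assumed smoothing exponent; values below 1 up-weight low-resource languages

weights = {lang: count ** alpha for lang, count in token_counts.items()}
total = sum(weights.values())
probs = {lang: weight / total for lang, weight in weights.items()}

# Low-resource languages receive a larger share than their raw token share.
print({lang: round(p, 3) for lang, p in probs.items()})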
The mGTE models are trained using BF16 precision.
The effectiveness of the pre-trained base model, mGTE-mlm, is evaluated on two benchmarks: XTREME-R (a multilingual benchmark covering 50 languages) and GLUE (an English-language benchmark). Models trained with 2,000-token and 8,000-token contexts are compared with previous encoder-only multilingual models of similar size. The results show that mGTE-mlm outperforms earlier models on most tasks.
Table 1: XTREME-R multilingual evaluation results
Table 2: GLUE English evaluation results
Encoder-only embedding models typically follow a two-stage training process: weakly supervised pre-training and supervised fine-tuning, which improves generalization and performance.
In the weakly supervised pre-training phase, the mGTE embedding model uses large-scale text-pair datasets collected from the web, such as webpage titles paired with their content or forum questions paired with answers. These datasets require no manual annotation and are therefore relatively easy to obtain, although their quality may be low. After cleaning and preprocessing, 2.8 billion multilingual text pairs were collected, and the model is trained on them with mainstream contrastive learning methods to acquire basic embedding capabilities.
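Contrastive learning on text pairs is typically implemented with an InfoNCE-style loss over in-batch negatives, roughly as sketched below. This is a simplified PyTorch sketch; the temperature value and the in-batch-negatives-only setup are assumptions, not the exact mGTE recipe.

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: each query should score highest against its own paired document."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(query_emb.size(0))          # the diagonal entries are the positive pairs
    return F.cross_entropy(logits, labels)

# Random stand-in embeddings for a batch of 16 text pairs.
loss = info_nce_loss(torch.randn(16, 768), torch.randn(16, 768))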
In the supervised fine-tuning phase, the model is further trained with high-quality, manually-annotated text pairs to refine its embedding capabilities. Chinese data includes six datasets (such as Dureader, Simclue, and Multi-CPR) totaling 2 million annotated pairs. English data includes seven datasets (such as MS MARCO, NQ, NLI) with 1.4 million pairs. Three multilingual datasets, namely, MLDR, MIRACL, and Mr.TyDi, contribute 120,000 pairs. For more dataset information, see the appendix.
The supervised fine-tuning phase introduces two advanced embedding features in addition to contrastive learning loss for dense embeddings:
• Elastic embedding: The model can produce embeddings of different dimensions, enabling a balance between storage size and retrieval performance. Many models, both open source and closed source, support this feature through Matryoshka Representation Learning (MRL). During training, D denotes a list of candidate dimensions; for each k ∈ D, the first k dimensions of the model's final embedding are taken, normalized, and used to compute the contrastive learning loss. The final loss is the average of the losses over the individual dimensions.
• Sparse embedding: Unlike dense embeddings, sparse embeddings assign a weight to each word using the model, and text similarity is measured by summing the products of the weights of matching words. This method extends traditional lexical retrieval approaches such as BM25. Sparse embeddings are particularly effective for exact-match tasks (such as matching model numbers, brand names, or years) and long-context retrieval. During training, a linear layer with ReLU activation is applied to the final-layer token representations to produce the sparse weights, with a contrastive learning loss as the objective.
The final training loss is a weighted sum of the MRL-based loss and the sparse embedding loss.
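The sketch below shows how these pieces can fit together: the dense contrastive loss is averaged over several embedding-dimension prefixes (the MRL idea), a ReLU-gated linear head maps token states to non-negative word weights, and matching-word weights are multiplied and summed for sparse similarity. This is a simplified PyTorch sketch; the dimension list, loss weight, and head shape are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, d: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over normalized embeddings."""
    q, d = F.normalize(q, dim=-1), F.normalize(d, dim=-1)
    return F.cross_entropy(q @ d.T / temperature, torch.arange(q.size(0)))

def mrl_loss(q: torch.Tensor, d: torch.Tensor, dims=(128, 256, 512, 768)) -> torch.Tensor:
    """Average the contrastive loss over embedding prefixes of different lengths (elastic/MRL)."""
    return torch.stack([contrastive_loss(q[:, :k], d[:, :k]) for k in dims]).mean()

class SparseHead(nn.Module):
    """Linear layer + ReLU over token states: one non-negative weight per token."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        return F.relu(self.proj(token_states)).squeeze(-1)    # (batch, seq_len)

def sparse_similarity(weights_a: dict, weights_b: dict) -> float:
    """Sparse score: sum of weight products over words that appear in both texts."""
    return sum(w * weights_b[t] for t, w in weights_a.items() if t in weights_b)

# Stand-in tensors: CLS embeddings for 16 query/document pairs and token states for the sparse head.
q_dense, d_dense = torch.randn(16, 768), torch.randn(16, 768)
token_weights = SparseHead()(torch.randn(16, 64, 768))
lexical_score = sparse_similarity({"beijing": 1.2, "capital": 0.8}, {"beijing": 0.9})

dense_loss = mrl_loss(q_dense, d_dense)
sparse_loss = torch.tensor(0.0)                # placeholder for the sparse contrastive loss
total_loss = dense_loss + 0.1 * sparse_loss    # assumed weight for the sparse term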
To maximize performance and efficiency, mGTE uses additional training strategies: both the embedding and ranking models are trained with FP16 precision, and techniques such as DeepSpeed ZeRO-0 and gradient checkpointing are used to reduce memory consumption.
The mGTE model is tested on the following datasets to evaluate the retrieval performance of the embedding model, especially its multilingual and long-context processing abilities:
• MLDR: A multilingual long-context retrieval dataset, covering 13 languages.
• MIRACL: A multilingual retrieval dataset, covering 18 languages.
• MKQA: A cross-lingual retrieval dataset, covering 25 languages.
• BEIR: A retrieval dataset for English across multiple domains.
• LoCo: A dataset focused on long-context retrieval in English.
Table 3 compares mGTE's performance with other models of similar size across the five datasets. Here are the key findings:
• Long-context retrieval: mGTE outperforms comparable models due to its training on long-context inputs.
• Short-context retrieval: mGTE performs significantly better than models of the same size and nearly matches larger models.
• Sparse vector retrieval: In most scenarios, mGTE's sparse vector retrieval surpasses BM25, with a notable edge over existing dense retrieval models for long contexts.
Table 3: Comparison of the retrieval performance of mGTE and other models
The MTEB benchmark evaluates multitask embedding performance across languages such as English, French, and Polish. The mGTE model outperforms open source encoder-only models of similar size, although it still trails larger LLM-based models. Despite this gap, mGTE's smaller size and faster inference make it highly practical for real-world applications.
Table 4: Comparison of mGTE and other models in multitask and multilingual processing on the MTEB dataset
Elastic embeddings improve the efficiency and scalability of text processing and retrieval by reducing the cost of storing and searching high-dimensional vectors. In the MTEB English evaluation, mGTE's performance across different vector dimensions is similar to that of OpenAI's elastic models. Lowering the number of dimensions reduces performance, but as long as the dimensionality stays above 512, the impact remains acceptable.
Figure 4: Performance of mGTE and other models at different embedding dimensions in the MTEB English evaluation
A contrastive learning loss function is used to train the reranker model. The paper notes that weakly supervised pretraining has a minimal impact on improving the reranker model's performance. As a result, only supervised data is used to fine-tune the final reranker model. The reranker model requires text pairs as input to calculate relevance scores. In addition, the reranker model's hyperparameter settings are consistent with those used in the embedding model.
The mGTE reranker model was evaluated on multiple datasets, including MLDR, MIRACL, MKQA, and BEIR, and compared with other ranking models. All models re-ranked the top 100 results retrieved using vectors generated by the mGTE-TRM-base embedding model. The evaluation results are summarized below:
Figure 5: Comparison of ranking models
Key insights from Figure 5:
• All ranking models outperform vector-based retrieval alone, confirming the value of adding a reranking stage to the retrieval process.
• The mGTE-reranker-base model delivers results that are comparable to or better than models of the same and larger size. It performs particularly well in multilingual and long-context retrieval.
Instructions for using the models can be found on Hugging Face, along with example implementations. The following example shows how to generate embeddings with gte-multilingual-base:
# Requires transformers>=4.36.0
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "Introduction to Quick Sort Algorithm"
]
model_path = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
dimension = 768  # Output embedding dimension; any value in [128, 768] can be used
embeddings = outputs.last_hidden_state[:, 0][:, :dimension]
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

The following example shows how to use gte-multilingual-reranker-base to score text pairs:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Alibaba-NLP/gte-multilingual-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('Alibaba-NLP/gte-multilingual-reranker-base', trust_remote_code=True)
model.eval()
pairs = [["中国的首都在哪儿","北京"], ["what is the capital of China?", "Beijing"], ["how to implement quick sort in python?","Introduction of quick sort"]]
with torch.no_grad():
    # Tokenize the text pairs and compute a relevance score for each pair
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
print(scores)
This article introduces the latest GTE-Multilingual models from Alibaba's Tongyi Lab. These include base, embedding, and ranking models designed to support multilingual retrieval and long-context processing while keeping inference costs low. The GTE series offers industry-leading retrieval capabilities, particularly for RAG tasks. The GTE model family now includes GTE Chinese/English single-language models, GTE-Qwen-instruct models (which achieved multilingual state-of-the-art (SOTA) performance on the MTEB leaderboard), and the GTE-Multilingual models discussed in this article. All these models are available as open source releases on ModelScope and Hugging Face.
☝ That is the end of this article. If you are curious about RAG and want to know more, feel free to leave a comment below!
Open source model links:
• GTE Text Embedding-Chinese-General Domain (Large)
• GTE Text Embedding-Chinese-General Domain (Base)
• GTE Text Embedding-English-General Domain (Large)
• GTE Text Embedding-English-General Domain (Base)
• GTE Text Embedding-Qwen2-1.5B
• GTE Text Embedding-Multilingual (Base)
• GTE Ranking-Multilingual (Base)
• gte-multilingual-reranker-base
In addition to the open source models, the GTE series models are also offered as commercial APIs on Alibaba Cloud:
• Text Embedding Models: Three versions of the text embedding models are available: text-embedding-v1, v2, and v3. The latest release is v3, with detailed documentation provided here: 🔗 General Text Embedding Documentation
• Text Ranking Models: The gte-reranker model service is available and is continuously updated. The detailed documentation is provided here: 🔗 General Text Ranking Model Documentation (English Version coming soon)
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.