What is an inverted index?
An inverted index is also called a postings file or an inverted file. An inverted index is an indexing method that is used to store mappings from terms to the positions in a document or a set of documents. The inverted index allows fast full-text searches. An inverted index is the most commonly used data structure that is used in document retrieval systems. You can use inverted indexes to quickly find a list of documents that contain a term and the position of the term in the documents, and obtain other information such as term frequency for analysis.
Items stored in an inverted index
Item | Description |
ttf | Short for total term frequency. The total term frequency specifies the total number of times that a term appears in all documents. |
df | Short for document frequency. The document frequency specifies the total number of documents that contain a specific term. |
tf | Short for term frequency. The term frequency specifies the number of times that a term appears in a document. |
docid | Short for document ID. A document ID is the unique identifier of a document in the engine. You can use the document ID to obtain other information about the document that you query. |
fieldmap | A field map is used to record the field information of a term. |
section information | You can divide some documents into sections and then specify additional information for each section. The information can be retrieved for subsequent processing. |
position | The position is used to record the position of a term in a document. |
positionpayload | You can specify payload information for different positions in the document. The payload information can be retrieved for subsequent processing. |
docpayload | Short for document payload. You can specify additional information for some documents. The information can be retrieved for subsequent processing. |
termpayload | You can specify additional information for some terms. The information can be retrieved for subsequent processing. |
Basic structures of an inverted index
Structure name | Description |
dictionary | A dictionary that stores mappings between terms and posting lists. The engine can use the dictionary to find the position of a term in a document. |
doclist | A document list stores information about documents that contain a specific term. |
positionlist | A position list stores the position information of a term in each document. |
truncatelist | A truncate list is used to improve the performance of the engine. An inverted index can be separately created for high-quality documents based on the configuration that you specify to improve retrieval performance. |
bitmap | A bitmap is used to improve the performance of the engine. Some inverted structures can be stored in bitmaps based on the configuration that you specify. This helps reduce the storage space that is occupied by inverted indexes and improve retrieval performance. |
Basic procedure for using an inverted index for retrieval
When you query term M by using an inverted index, the engine queries the dictionary file to find the starting position of term M in the postings file. Then, the engine parses the posting list to obtain the following parts of term M information: TermMeta, DocList, and PositionList. TermMeta stores the basic description of the term, including the df, ttf, and termpayload of the term. DocList contains information about the documents that contain the term. The information about each document includes the document ID, the term frequency in the document, the document payload, and the fields that contain the term. PositionList contains the position information of the term in each document. The position information includes the specific position of the term in the documents and positionpayload information.