Indexing and Queries
Understand how Indexing and Queries work in Lucene
DynamicWeb 10 uses Lucene.NET, which is a C# port of the Java based Apache Lucene search engine library, to create high-performance and full-featured text indexing and search capabilities. In this article we will dive into how indexing and querying works in Lucene, but let's first get the terminology in place when being in a search engine context:
- Queries: a user's request for information, it typically consists of keywords or phrases that the user wants to find in a collection of documents
- Querying: the process of searching through the indexed documents to find those that match the user's query
- Indexing: the process of converting text data into a searchable format
In summary, querying is about finding relevant documents based on a user's input, while indexing is about organizing documents so that they can be searched quickly and effectively.
Indexing steps
Indexing can also be thought of as an index at the end of a book; it allows you to quickly find specific topics or keywords without having to read through the entire book. Similarly, a Lucene-index enables efficient searches by quickly pointing to documents that contain the search terms. In Apache Lucene, indexing involves the following steps:
- Document Collection: Gather documents that need to be searchable
- Document Analysis and Indexing: Break down the text in documents into searchable terms using analyzers
- Inverted Index Creation: Build an inverted index that maps terms to the documents in which they appear
Document Collection
The process begins when documents are collected for indexing. In Lucene, a document is a fundamental unit of indexing and searching. Each document is a collection of fields, where each field represents a specific piece of information about the document. These fields can store various types of information, such as text, numbers, dates, URLs, etc.
Example: A product document
Let’s consider an e-commerce scenario where we're indexing product information for a search engine on an online shopping platform. Each product can be represented as a document with a collection of fields in Lucene:
- Product Name: The name of the product
- Description: A brief description of the product
- Price: The price of the product
- Category: The category to which the product belongs
- SKU: The Stock Keeping Unit, a unique identifier for the product
- URL: The URL to the product page
A concrete example of a product we could index, could be:
- Product Name: "Wireless Bluetooth Headphones"
- Description: "High-quality wireless Bluetooth headphones with noise-cancellation and long battery life"
- Price: 59.99
- Category: "Electronics"
- SKU: "WH123456"
- URL: "http://www.example.com/products/wh123456"
Document Analysis and Indexing
The next step for each field in a document is to determine if we want to store, index, and/or analyze it, depending on what needs to be done with its content. Here’s how each type of field is used:
Stored fields
Stored fields are fields whose exact content is saved in the index. These fields are primarily used for retrieval and display purposes. When a document is indexed, the stored fields are kept in their original form so that they can be retrieved and shown in search results.
- Purpose: To retain the original value for display or later retrieval
- Example Use Cases:
- Article Title: Displaying the title of an article in search results
- Publication Date: Showing the publication date of a document
- URL: Providing a clickable link to the product page in search results
Indexed fields
Indexed Fields are fields whose content is processed and added to an inverted index, making them searchable. If a field is indexed but not analyzed, it is indexed as a single term, making it suitable for exact-match searches. If a field is indexed and analyzed, it is tokenized and processed, allowing for more flexible searches
- Purpose: To make the field content searchable
- Example Use Cases:
- Exact-Match Searches: Searching for a specific SKU or ID
- Full-Text searches: Searching within the content of an article or description
Analyzed fields
Analyzed fields fields are bit more complicated and we've made an in-depth guide to them here, but basically, analyzed fields have their content processed by an analyzer before being indexed. This processing involves breaking down the text into individual terms (tokens), normalizing these terms (e.g., converting them to lowercase), and applying various filters (e.g., removing stopwords or applying stemming). The goal is to make the text content more searchable and relevant for full-text search queries.
- Purpose: To process the field's text content, making it searchable in a flexible and meaningful way.
- Example Use Cases:
- Phrase Searches: Finding exact phrases within the text
- Full-Text searches: Searching within the content of an article, product description, or any large body of text.
- Relevance Ranking: Improving search result relevancy by using normalized and filtered tokens
Example: Product document fields
Let's see what we might want to do with our example product:
- Products Name: This field should be stored, indexed, and analyzed so that it can be displayed in search results and used for searching
- Description: This field should be stored, indexed, and analyzed so that users can search for terms within the product descriptions
- Price: This field can be stored (for display purposes) and indexed if we want to perform range queries (e.g., searching for products within a certain price range)
- Category: This field should be stored and indexed to allow filtering and searching by category
- SKU: This field should be stored and indexed for exact-match searches
- URL: This field should be stored to provide a link to the product page in the search results
Inverted Index Creation
The inverted index is the backbone of Lucene's search functionality. It allows for fast and efficient searching by mapping terms to the documents that contain them. This structure is similar to the index at the end of a book, but inverted. Instead of pointing from a topic to a page number, it points from a term to the documents that contain that term. This sounds a bit confusing, but bear with me. It'll become clear as we move on.
The fields of a document and the inverted index function as follows:
- Analyzed and indexed fields: These fields are processed (split into tokens) and added to the inverted index. This enhances search functionality by enabling full-text searches. For example, the phrase "Wireless Bluetooth Headphones" might be split into tokens like "wireless", "bluetooth", and "headphones", all of which are stored in the inverted index.
- Non-analyzed indexed fields: These fields are added to the inverted index as complete terms without being split. This is useful for exact-match searches. For example, the SKU "WH123456" is stored in the inverted index as a single term.
- Stored fields: These fields are not included in the inverted index. Since stored fields are not meant for searching, they do not need to be part of the inverted index. For example, a URL stored in the field "http://www.example.com/products/wh123456" is saved exactly as it is for display in search results but is not searchable.
In summary, the inverted index is specifically designed for fields that need to be searchable, while stored fields are kept separately for retrieval without affecting the search index.
How the inverted index works
To understand how the inverted index works, let's break it down into steps for a field that is Indexed and Analyzed:
- Tokenization and analysis:
- When a document is indexed and analyzed, the text is broken down into individual terms (tokens), normalized, and processed through various filters (e.g., lowercasing, removing stopwords)
- Building the inverted index:
- For each term, the inverted index keeps a list of documents (and positions within those documents) where the term appears. This list is known as the postings list
- Segment management:
- New documents are initially added to an in-memory segment. Periodically, these segments are flushed to disk and merged with other segments to optimize performance
Example: Inverted indexing of a product document
Let's continue with our e-commerce example and see how the inverted index is created for a product document:
- Product Name: "Wireless Bluetooth Headphones" (Stored, Indexed, and Analyzed)
- Description: "High-quality wireless Bluetooth headphones with noise-cancellation and long battery life" (Stored, Indexed, and Analyzed)
- Price: 59.99 (Stored and Indexed)
- Category: "Electronics" (Stored and Indexed)
- SKU: "WH123456" (Stored and Indexed)
- URL: "http://www.example.com/products/wh123456" (Stored)
The first step is for the fields who are analyzed (in our example, the Product Name and Description) to be broken down into tokens:
- Product Name: Tokens: ["wireless", "bluetooth", "headphones"]
- Description: Tokens: ["high", "quality", "wireless", "bluetooth", "headphones", "noise", "cancellation", "long", "battery, "life"]
The second step is the inverted index, and is built by mapping each term to its postings list, which includes document IDs and positions within those documents. Let's assume the document ID for this product is 1 and this is the only product in our inverted matrix.
The inverted matrix would then look like this:
Term | Frequency | Document IDs |
---|---|---|
bluetooth | 2 | [1] |
battery | 1 | [1] |
cancellation | 1 | [1] |
headphones | 2 | [1] |
high | 1 | [1] |
life | 1 | [1] |
long | 1 | [1] |
noise | 1 | [1] |
quality | 1 | [1] |
wireless | 2 | [1] |
Explanation:
- Term "wireless": Appears twice in Document ID 1 (once in the Product Name and once in the Description)
- Term "bluetooth": Appears twice in Document ID 1 (once in the Product Name and once in the Description)
- Etc.
The third step is when new documents are added, so let's add a new product to see how the inverted index gets updated.
Here's the new product information:
- Product Name: "Noise Cancelling Over-Ear Headphones"
- Description: "Experience high-fidelity sound with noise-cancelling technology and comfortable over-ear design"
- Price: 89.99 (Stored and Indexed)
- Category: "Electronics"
- SKU: "NC987654"
- URL: "http://www.example.com/products/nc987654"
Tokenization and analysis of the new product:
- Product Name: Tokens: ["noise", "cancelling", "over", "ear", "headphones"]
- Description.Tokens: ["experience", "high", "fidelity", "sound", "with", "noise", "cancelling", "technology", "and", "comfortable", "over", "ear", "design"]
Assume that the ID of the new products document is 2, the updated inverted index would look like this:
Term | Frequency | Document IDs |
---|---|---|
bluetooth | 2 | [1] |
battery | 1 | [1] |
cancellation | 1 | [1] |
cancelling | 1 | [2] |
headphones | 3 | [1, 2] |
high | 1 | [1] |
life | 1 | [1] |
long | 1 | [1] |
noise | 3 | [1, 2] |
over | 2 | [2] |
ear | 2 | [2] |
quality | 1 | [1] |
wireless | 2 | [1] |
experience | 1 | [2] |
high | 1 | [2] |
fidelity | 1 | [2] |
sound | 1 | [2] |
technology | 1 | [2] |
comfortable | 1 | [2] |
design | 1 | [2] |
In some cases, the positions of terms within the documents are also included in the inverted index. Here's how the inverted index would look with the positions included:
Term | Frequency | Document IDs (Positions) |
---|---|---|
bluetooth | 2 | [1 (2), 1 (4)] |
battery | 1 | [1 (9)] |
cancellation | 1 | [1 (7)] |
cancelling | 1 | [2 (2)] |
headphones | 3 | [1 (3, 5), 2 (4)] |
high | 1 | [1 (1)] |
life | 1 | [1 (10)] |
long | 1 | [1 (8)] |
noise | 3 | [1 (6), 2 (1)] |
over | 2 | [2 (3), 2 (11)] |
ear | 2 | [2 (4), 2 (12)] |
quality | 1 | [1 (2)] |
wireless | 2 | [1 (1), 1 (3)] |
experience | 1 | [2 (1)] |
high | 1 | [2 (2)] |
fidelity | 1 | [2 (3)] |
sound | 1 | [2 (3)] |
technology | 1 | [2 (6)] |
comfortable | 1 | [2 (8)] |
design | 1 | [2 (10)] |
Explanation:
- Term "bluetooth": Appears twice in Document ID 1. In the second position in the Product Name and in the fourth position in the Description
- Term "headphones": Appears twice in Document ID 1 and once in Document ID 2. In the third and fifth position in Document ID 1 and in the fourth position in Document ID 2.
- Etc.
Search Queries
Once the inverted index has been created, you can start applying search queries to it. Lucene uses a query parser to translate user input into search queries that the search engine can understand. Here's a step-by-step explanation of how this works:
- Query Parsing: Translates the user's search input into a query object that Lucene can process. This object can represent simple term queries, phrase queries, Boolean queries, and more
- Analysing the Query: Query terms are analysed in a similar way to how the documents were analyzed during indexing. This involves tokenizing the input and applying the same analyzers to ensure consistency
- Searching the Inverted Index: Searches for documents that contain the analyzed query terms. It looks up each term in the index and retrieves the corresponding postings lists, which contain the document IDs and positions where the terms appear
- Scoring and Ranking: Calculates a relevance score for each document based on factors such as term frequency (how often a term appears in the document) and inverse document frequency (how rare the term is across all documents)
- Returning Results: Returns the documents sorted by their relevance scores. If the fields are stored, the original content can be retrieved and displayed in the search results
Example: Search query
Let's continue with the index product document example, and apply a search query to it. We start with an example of a user's search input made into a query object and analysed:
- Query: "wireless headphones"
- Analyzed Query Terms: ["wireless", "headphones"]
Here’s the inverted index with positions from our example:
Term | Frequency | Document IDs (Positions) |
---|---|---|
bluetooth | 2 | [1 (2), 1 (4)] |
battery | 1 | [1 (9)] |
cancellation | 1 | [1 (7)] |
cancelling | 1 | [2 (2)] |
headphones | 3 | [1 (3, 5), 2 (4)] |
high | 1 | [1 (1)] |
life | 1 | [1 (10)] |
long | 1 | [1 (8)] |
noise | 3 | [1 (6), 2 (1)] |
over | 2 | [2 (3), 2 (11)] |
ear | 2 | [2 (4), 2 (12)] |
quality | 1 | [1 (2)] |
wireless | 2 | [1 (1), 1 (3)] |
experience | 1 | [2 (1)] |
high | 1 | [2 (2)] |
fidelity | 1 | [2 (3)] |
sound | 1 | [2 (3)] |
technology | 1 | [2 (6)] |
comfortable | 1 | [2 (8)] |
design | 1 | [2 (10)] |
The scoring for the query "wireless headphones":
- Document 1 contains both terms "wireless" (positions 1, 3) and "headphones" (positions 3, 5)
- Document 2 contains the term "headphones" (position 4) but not "wireless"
Document 1 is more relevant and will receive a higher score.
The returning results of our example:
For the query "wireless headphones":
- Document 1: "Wireless Bluetooth Headphones" (High-quality wireless Bluetooth headphones with noise-cancellation and long battery life.)
- Document 2: "Noise Cancelling Over-Ear Headphones" (Experience high-fidelity sound with noise-cancelling technology and comfortable over-ear design.)
In conclusion, document 1 is more relevant to the query and will appear first in the results.