How Lucene.NET works
Understand how Indexing and Queries work in Lucene
DynamicWeb 10 uses Lucene.NET, a C# port of the Java-based Apache Lucene search engine library, to provide high-performance, full-featured text indexing and search. In this article we will dive into how indexing and querying work in Lucene, but let's first get the search engine terminology in place:
- Indexing: The process of converting text data into a searchable format
- Queries: A request for information from an index; it typically consists of keywords or phrases that the user wants to find in a collection of documents
So, basically, you index a collection of documents (e.g. pages, products, or users) and then use queries to quickly find the documents that contain the terms defined in the query.
How documents are indexed
Indexing in Lucene is the process of converting text data into a searchable format. Think of the index at the end of a book; it allows you to quickly find specific topics or keywords without having to read through the entire book. Similarly, a Lucene index enables efficient searches by quickly pointing to the documents that contain the search terms.
In Lucene, indexing involves the following steps:
- Document collection: Gather documents that need to be searchable
- Document analysis and indexing: Break down the text in documents into searchable terms using analyzers
- Inverted Index Creation: Build an inverted index that maps terms to the documents in which they appear
Step 1: Documents are collected
The process begins when documents are collected for indexing. In Lucene, a document is a fundamental unit of indexing and searching. Each document is a collection of fields, where each field represents a specific piece of information about the document. These fields can store various types of information, such as text, numbers, dates, URLs, etc.
Let’s consider an e-commerce scenario where we're indexing product information for a search engine on an online shopping platform. Each product can be represented as a document with a collection of fields in Lucene:
- Product name: The name of the product
- Description: A brief description of the product
- Price: The price of the product
- Category: The category to which the product belongs
- SKU: The Stock Keeping Unit, a unique identifier for the product
- URL: The URL to the product page
A concrete example of a product we could index might be:
- Product name: "Wireless Bluetooth Headphones"
- Description: "High-quality wireless Bluetooth headphones with noise-cancellation and long battery life"
- Price: 59.99
- Category: "Electronics"
- SKU: "WH123456"
- URL: http://www.example.com/products/wh123456
Step 2: Documents are analyzed and indexed
The next step is to determine, for each field in a document, whether we want to store, index, and/or analyze it, depending on what needs to be done with its content. Here’s how each type of field is used:
Stored fields are fields whose exact content is saved in the index. These fields are primarily used for retrieval and display purposes. When a document is indexed, the stored fields are kept in their original form so that they can be retrieved and shown in search results.
- Purpose: To retain the original value for display or later retrieval
- Example Use Cases:
- Article Title: Displaying the title of an article in search results
- Publication Date: Showing the publication date of a document
- URL: Providing a clickable link to the product page in search results
For our example product we might want to do the following, as illustrated in the code sketch after this list:
- Product name: This field should be stored, indexed, and analyzed so that it can be displayed in search results and used for searching
- Description: This field should be stored, indexed, and analyzed so that users can search for terms within the product descriptions
- Price: This field can be stored (for display purposes) and indexed if we want to perform range queries (e.g., searching for products within a certain price range)
- Category: This field should be stored and indexed to allow filtering and searching by category
- SKU: This field should be stored and indexed for exact-match searches
- URL: This field should be stored to provide a link to the product page in the search results
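These choices map directly onto Lucene.NET's field types (TextField, StringField, DoubleField, StoredField). Below is a minimal sketch of indexing the example product with Lucene.NET 4.8; the directory path and field names are illustrative assumptions, not something DynamicWeb prescribes.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;

// Open (or create) the index directory and an IndexWriter that uses the StandardAnalyzer.
using var dir = FSDirectory.Open("product-index");
using var writer = new IndexWriter(dir, new IndexWriterConfig(version, new StandardAnalyzer(version)));

var doc = new Document();

// Stored, indexed, and analyzed: full-text searchable and displayable in results.
doc.Add(new TextField("name", "Wireless Bluetooth Headphones", Field.Store.YES));
doc.Add(new TextField("description",
    "High-quality wireless Bluetooth headphones with noise-cancellation and long battery life",
    Field.Store.YES));

// Stored and indexed as a numeric value, which enables range queries on price.
doc.Add(new DoubleField("price", 59.99, Field.Store.YES));

// Stored and indexed, but not analyzed: exact-match filtering on the whole value.
doc.Add(new StringField("category", "Electronics", Field.Store.YES));
doc.Add(new StringField("sku", "WH123456", Field.Store.YES));

// Stored only: kept for display in search results, never searched.
doc.Add(new StoredField("url", "http://www.example.com/products/wh123456"));

writer.AddDocument(doc);
writer.Commit();
```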
Step 3: The inverted index is created
The inverted index is the backbone of Lucene's search functionality. It allows for fast and efficient searching by mapping terms to the documents that contain them. This structure is similar to the index at the end of a book, but inverted. Instead of pointing from a topic to a page number, it points from a term to the documents that contain that term. This sounds a bit confusing, but bear with me. It'll become clear as we move on.
The fields of a document and the inverted index function as follows:
- Analyzed and indexed fields: These fields are processed (split into tokens) and added to the inverted index. This enhances search functionality by enabling full-text searches. For example, the phrase "Wireless Bluetooth Headphones" might be split into tokens like "wireless", "bluetooth", and "headphones", all of which are stored in the inverted index.
- Non-analyzed indexed fields: These fields are added to the inverted index as complete terms without being split. This is useful for exact-match searches. For example, the SKU "WH123456" is stored in the inverted index as a single term.
- Stored fields: These fields are not included in the inverted index. Since stored fields are not meant for searching, they do not need to be part of the inverted index. For example, a URL like http://www.example.com/products/wh123456 stored in a stored-only field is saved exactly as it is for display in search results but is not searchable.
In summary, the inverted index is specifically designed for fields that need to be searchable, while stored fields are kept separately for retrieval without affecting the search index.
To understand how the inverted index works, let's break it down into steps for a field that is Indexed and Analyzed:
- Tokenization and analysis: When a document is indexed and analyzed, the text is broken down into individual terms (tokens), normalized, and processed through various filters (e.g., lowercasing, removing stop words)
- Building the inverted index: For each term, the inverted index keeps a list of documents (and positions within those documents) where the term appears. This list is known as the postings list
- Segment management: New documents are initially added to an in-memory segment. Periodically, these segments are flushed to disk and merged with other segments to optimize performance (see the sketch below)
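In code, most of this is handled by the IndexWriter. A small, hedged sketch, continuing with the writer from the example above (productDoc1 and productDoc2 are hypothetical documents):

```csharp
// Documents added with AddDocument first go into the writer's in-memory buffer.
writer.AddDocument(productDoc1);
writer.AddDocument(productDoc2);

// Commit flushes the buffered documents to a new segment on disk and makes
// them visible to readers opened after this point.
writer.Commit();

// Lucene merges small segments in the background as the index grows;
// ForceMerge can compact the index down to a given number of segments.
writer.ForceMerge(1);
```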
Let's continue with our e-commerce example and see how the inverted index is created for a product document:
- Product name: "Wireless Bluetooth Headphones" (Stored, Indexed, and Analyzed)
- Description: "High-quality wireless Bluetooth headphones with noise-cancellation and long battery life" (Stored, Indexed, and Analyzed)
- Price: 59.99 (Stored and Indexed)
- Category: "Electronics" (Stored and Indexed)
- SKU: "WH123456" (Stored and Indexed)
- URL: "http://www.example.com/products/wh123456" (Stored)
Tokenization and analysis
The first step is for the fields that are analyzed (in our example, Product name and Description) to be broken down into tokens:
- Product name: ["wireless", "bluetooth", "headphones"]
- Description: ["high", "quality", "wireless", "bluetooth", "headphones", "noise", "cancellation", "long", "battery, "life"]
Building the inverted index
The second step is building the inverted index by mapping each term to its postings list, which includes document IDs and positions within those documents. Let's assume the document ID for this product is 1 and that this is the only product in our inverted index.
The inverted index would then look like this:
| Term | Frequency | Document IDs |
|---|---|---|
| bluetooth | 2 | [1] |
| battery | 1 | [1] |
| cancellation | 1 | [1] |
| headphones | 2 | [1] |
| high | 1 | [1] |
| life | 1 | [1] |
| long | 1 | [1] |
| noise | 1 | [1] |
| quality | 1 | [1] |
| wireless | 2 | [1] |
Two terms have a frequency of 2, wireless and bluetooth; they appear twice in the document (once in Product name and once in Description). The other terms occur only once and therefore have a frequency of 1.
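You can read these statistics back from the index itself. A minimal sketch, assuming the index directory and field names from the earlier indexing example (note that these statistics are kept per field):

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Store;

using var dir = FSDirectory.Open("product-index");
using var reader = DirectoryReader.Open(dir);

// DocFreq: how many documents contain the term in the given field.
int docs = reader.DocFreq(new Term("description", "bluetooth"));

// TotalTermFreq: how many times the term occurs in that field across all documents.
long occurrences = reader.TotalTermFreq(new Term("description", "bluetooth"));

Console.WriteLine($"'bluetooth' occurs {occurrences} time(s) in {docs} document(s).");
```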
Adding new documents
The third step happens when new documents are added, so let's add a new product and see how the inverted index gets updated.
Here's the new product information:
- Product name: "Noise Cancelling Over-Ear Headphones"
- Description: "Experience high-fidelity sound with noise-cancelling technology and comfortable over-ear design"
- Price: 89.99 (Stored and Indexed)
- Category: "Electronics"
- SKU: "NC987654"
- URL: http://www.example.com/products/nc987654
Tokenization and analysis of the new product (with stop words like "with" and "and" removed, as before):
- Product name: ["noise", "cancelling", "over", "ear", "headphones"]
- Description: ["experience", "high", "fidelity", "sound", "noise", "cancelling", "technology", "comfortable", "over", "ear", "design"]
Assuming that the new product's document ID is 2, the updated inverted index would look like this:
| Term | Frequency | Document IDs |
|---|---|---|
| bluetooth | 2 | [1] |
| battery | 1 | [1] |
| cancellation | 1 | [1] |
| cancelling | 2 | [2] |
| headphones | 3 | [1, 2] |
| high | 2 | [1, 2] |
| life | 1 | [1] |
| long | 1 | [1] |
| noise | 3 | [1, 2] |
| over | 2 | [2] |
| ear | 2 | [2] |
| quality | 1 | [1] |
| wireless | 2 | [1] |
| experience | 1 | [2] |
| fidelity | 1 | [2] |
| sound | 1 | [2] |
| technology | 1 | [2] |
| comfortable | 1 | [2] |
| design | 1 | [2] |
In some cases, the positions of terms within the documents are also included in the inverted index. Here's how the inverted index would look with the positions included (positions are counted per field, after stop words are removed):
| Term | Frequency | Document IDs (Positions) |
|---|---|---|
| bluetooth | 2 | [1 (2), 1 (4)] |
| battery | 1 | [1 (9)] |
| cancellation | 1 | [1 (7)] |
| cancelling | 2 | [2 (2), 2 (6)] |
| headphones | 3 | [1 (3, 5), 2 (5)] |
| high | 2 | [1 (1), 2 (2)] |
| life | 1 | [1 (10)] |
| long | 1 | [1 (8)] |
| noise | 3 | [1 (6), 2 (1, 5)] |
| over | 2 | [2 (3), 2 (9)] |
| ear | 2 | [2 (4), 2 (10)] |
| quality | 1 | [1 (2)] |
| wireless | 2 | [1 (1), 1 (3)] |
| experience | 1 | [2 (1)] |
| fidelity | 1 | [2 (3)] |
| sound | 1 | [2 (4)] |
| technology | 1 | [2 (7)] |
| comfortable | 1 | [2 (8)] |
| design | 1 | [2 (11)] |
As you can see, the term "bluetooth" appears twice in Document ID 1 - in the second position in Product name and in the fourth position in Description. The term "headphones" appears twice in Document ID 1 (third position in Product name and fifth position in Description) and once in Document ID 2 (fifth position in Product name). And so on and so forth.
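Positions are what make phrase queries possible: Lucene can verify not only that two terms occur in the same document, but that they occur right next to each other. A small sketch, assuming the index and field names from the earlier example:

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

using var dir = FSDirectory.Open("product-index");
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);

// Matches documents where "wireless" is immediately followed by "bluetooth"
// in the name field - something only the stored positions make possible.
var phrase = new PhraseQuery();
phrase.Add(new Term("name", "wireless"));
phrase.Add(new Term("name", "bluetooth"));

TopDocs hits = searcher.Search(phrase, 10);
Console.WriteLine($"Matches: {hits.TotalHits}");
```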
How queries work
Once the inverted index has been created, you can start applying search queries to it. Lucene uses a query parser to translate user input into search queries that the search engine can understand. Here's a step-by-step explanation of how this works (a code sketch follows the list):
- Query parsing: Translates the user's search input into a query object that Lucene can process. This object can represent simple term queries, phrase queries, Boolean queries, and more
- Analyzing the query: Query terms are analyzed in a similar way to how the documents were analyzed during indexing. This involves tokenizing the input and applying the same analyzers to ensure consistency
- Searching the inverted index: Searches for documents that contain the analyzed query terms. It looks up each term in the index and retrieves the corresponding postings lists, which contain the document IDs and positions where the terms appear
- Scoring and ranking: Calculates a relevance score for each document based on factors such as term frequency (how often a term appears in the document) and inverse document frequency (how rare the term is across all documents)
- Returning results: Returns the documents sorted by their relevance scores. If the fields are stored, the original content can be retrieved and displayed in the search results
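Here is roughly how these steps look in Lucene.NET using the classic QueryParser. This is a minimal sketch; it assumes the index directory and field names from the indexing example earlier.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;

using var dir = FSDirectory.Open("product-index");
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);

// 1-2. Parse and analyze the user's input with the same analyzer used at index time.
var parser = new QueryParser(version, "description", new StandardAnalyzer(version));
Query query = parser.Parse("wireless headphones");

// 3-4. Look the terms up in the inverted index and score the matching documents.
TopDocs results = searcher.Search(query, 10);

// 5. Retrieve stored fields for display.
foreach (ScoreDoc hit in results.ScoreDocs)
{
    var doc = searcher.Doc(hit.Doc);
    Console.WriteLine($"{doc.Get("name")} (score {hit.Score})");
}
```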
Let's continue with the indexed product documents and apply a search query to them. We start with an example of a user's search input turned into a query object and analyzed:
- Query: "wireless headphones"
- Analyzed query terms: ["wireless", "headphones"]
Here’s the inverted index with positions from our example:
| Term | Frequency | Document IDs (Positions) |
|---|---|---|
| bluetooth | 2 | [1 (2), 1 (4)] |
| battery | 1 | [1 (9)] |
| cancellation | 1 | [1 (7)] |
| cancelling | 2 | [2 (2), 2 (6)] |
| headphones | 3 | [1 (3, 5), 2 (5)] |
| high | 2 | [1 (1), 2 (2)] |
| life | 1 | [1 (10)] |
| long | 1 | [1 (8)] |
| noise | 3 | [1 (6), 2 (1, 5)] |
| over | 2 | [2 (3), 2 (9)] |
| ear | 2 | [2 (4), 2 (10)] |
| quality | 1 | [1 (2)] |
| wireless | 2 | [1 (1), 1 (3)] |
| experience | 1 | [2 (1)] |
| fidelity | 1 | [2 (3)] |
| sound | 1 | [2 (4)] |
| technology | 1 | [2 (7)] |
| comfortable | 1 | [2 (8)] |
| design | 1 | [2 (11)] |
The scoring for the query "wireless headphones":
- Document 1 contains both terms "wireless" (positions 1, 3) and "headphones" (positions 3, 5)
- Document 2 contains the term "headphones" (position 5) but not "wireless"
Document 1 is more relevant and will receive a higher score.
The returned results for the query "wireless headphones":
- Document 1: "Wireless Bluetooth Headphones" (High-quality wireless Bluetooth headphones with noise-cancellation and long battery life.)
- Document 2: "Noise Cancelling Over-Ear Headphones" (Experience high-fidelity sound with noise-cancelling technology and comfortable over-ear design.)
In conclusion, document 1 is more relevant to the query and will appear first in the results.
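If you want to see why a document received its score, Lucene can explain the calculation. A small sketch, reusing the searcher, query, and results from the sketch above:

```csharp
// Explain breaks a document's score down into term frequency, inverse document
// frequency, field norms, and so on.
Explanation explanation = searcher.Explain(query, results.ScoreDocs[0].Doc);
Console.WriteLine(explanation.ToString());
```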
How analyzers work
Analyzers in Lucene are responsible for converting text into a form that is suitable for indexing. They perform a series of transformations on the input text to break it down into searchable terms - tokens - in a process known as tokenization. The tokens are then added to the index and are what you search for when you perform a search query.
An analyzer works like this:
- A tokenizer breaks the text into tokens
- A series of TokenFilters process the tokens further to e.g. remove stop words or convert all tokens to lowercase
A tokenizer is therefore a part of the analyzer and splits the text into tokens based on specific rules when documents are added to the index. The rules vary depending on the type of tokenizer used.
Tokenizers
Lucene offers several built-in tokenizers, each designed for different use cases:
- StandardTokenizer: Splits text into words, removing most symbols and punctuation. Good for general text.
- WhitespaceTokenizer: Splits text at whitespace. Good for tokenizing text with a known structure.
- LetterTokenizer: Splits text into tokens whenever a non-letter character is encountered. Useful for basic linguistic analysis and applications where only words are relevant
- KeywordTokenizer: Treats the entire text as a single token. Useful for certain types of data like IDs or URLs.
Let's see how a tokenizer handles text, using the StandardTokenizer on this example:
"Lucene in Action, Second Edition (2010)"
The StandardTokenizer splits the text into words and removes most punctuation, resulting in the following tokens:
Lucene
in
Action
Second
Edition
2010
Note that the comma and parentheses have been removed from the text. This makes it easier for the search engine to process and match the terms accurately during queries.
TokenFilters
After tokenization, Lucene can apply several optional TokenFilters to modify the tokens further.
Types of TokenFilter:
- LowerCaseFilter: Converts all tokens to lowercase.
- StopFilter: Removes common "stop words" (like "the", "and", etc.).
- SynonymFilter: Adds synonyms of words to increase search hits.
For the text "Lucene in Action, Second Edition (2010)", using the StandardTokenizer followed by a LowerCaseFilter and StopFilter, the tokens would be:
lucene
action
second
edition
2010
As you can see, all the tokens have been converted to lowercase and the word 'in' has been removed since it's a stop word.
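You can inspect exactly which tokens an analyzer produces by running a token stream over the text. A minimal sketch using the StandardAnalyzer; the field name here is only a label and is an assumption:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

using TokenStream stream = analyzer.GetTokenStream(
    "title", new StringReader("Lucene in Action, Second Edition (2010)"));
var term = stream.AddAttribute<ICharTermAttribute>();

stream.Reset();
while (stream.IncrementToken())
{
    // Prints: lucene, action, second, edition, 2010
    Console.WriteLine(term.ToString());
}
stream.End();
```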
Standard analyzers
Here are a few common types of analyzers in Lucene:
- StandardAnalyzer: Combines the StandardTokenizer with the LowerCaseFilter and StopFilter. It's good for general text processing
- SimpleAnalyzer: Uses a LowerCaseTokenizer, which divides text at non-letter characters and converts the tokens to lowercase
- WhitespaceAnalyzer: Uses a WhitespaceTokenizer to divide text at whitespace. It does not convert characters to lowercase or remove stop words
- KeywordAnalyzer: Does not tokenize the text but treats the entire string as a single token
- CustomAnalyzer: Allows for the construction of an analyzer with custom components (tokenizer and filters)
Analyzers and queries
When you execute a query, the same Analyzer that was used for indexing should also be used for the query - here's why:
- Tokenization: Different analyzers tokenize text differently. If the search query is tokenized differently from the indexed text, it might not match as expected
- Normalization: Analyzers may apply normalization techniques such as lowercasing, stemming, or removing stop words. Inconsistent normalization between indexing and searching can lead to mismatches
- Filters: Some analyzers apply filters to remove or alter tokens. Using different filters for indexing and searching can result in unexpected behaviour
In short, using the same analyzer ensures that the search terms are processed in the same way as the indexed text.
Using different analyzers
To illustrate, here's an example where two different analyzers are used:
First we index documents using the WhitespaceAnalyzer:
- Documents:
- Document 1: "Lucene in Action"
- Document 2: "Introduction to Lucene for .Net"
- Analyzer:
- WhitespaceAnalyzer
- Indexed Tokens:
- Document 1: ["Lucene", "in", "Action"]
- Document 2: ["Introduction", "to", "Lucene", "for", ".NET"]
Then we search the content using the StandardAnalyzer:
- Query: "Lucene"
- Analyzer:
- StandardAnalyzer which lowercases the text and removes stopwords
- Query Tokens:
- Lucene becomes ["lucene"]
- Search Process:
- The query token "lucene" (lowercased) is compared against the indexed tokens
- Document 1: Contains "Lucene", not "lucene" (case-sensitive mismatch)
- Document 2: Contains "Lucene", not "lucene" (case-sensitive mismatch)
- The query token "lucene" (lowercased) is compared against the indexed tokens
- Search Results:
- No matches found, because the token "lucene" does not match "Lucene" due to case differences.
Had we used the WhitespaceAnalyzer for the search, both documents would have been a match.
Using the same analyzer
In contrast, here is how the same query works when we use the same analyzer (the StandardAnalyzer) for both indexing and queries (a code sketch follows the list):
- Indexing:
- Document 1: ["lucene", "in", "action"]
- Document 2: ["introduction", "to", "lucene", "for", "net"]
- Query analysis:
- "Lucene" becomes ["lucene"]
- Search results:
- Both documents match because "lucene" is a term in both indexed documents
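Here is a compact sketch of that happy path, indexing and querying with the same StandardAnalyzer (the in-memory directory and field name are illustrative assumptions):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;
var analyzer = new StandardAnalyzer(version);

using var dir = new RAMDirectory();
using (var writer = new IndexWriter(dir, new IndexWriterConfig(version, analyzer)))
{
    var doc1 = new Document();
    doc1.Add(new TextField("title", "Lucene in Action", Field.Store.YES));
    writer.AddDocument(doc1);

    var doc2 = new Document();
    doc2.Add(new TextField("title", "Introduction to Lucene for .NET", Field.Store.YES));
    writer.AddDocument(doc2);
}

using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);

// The query is analyzed with the same analyzer, so "Lucene" becomes "lucene"
// and matches the lowercased tokens in the index.
Query query = new QueryParser(version, "title", analyzer).Parse("Lucene");
Console.WriteLine(searcher.Search(query, 10).TotalHits); // 2
```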
How facets work
Faceted search is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters. In Lucene, facets are represented as a hierarchy of categories, and each document can be assigned to one or more categories.
Facets require a separate faceted index structure alongside the main Lucene index. The basic process involves defining which fields should be faceted when the documents are indexed. These fields can be either analyzed or not analyzed, affecting how faceting results are displayed and utilized.
When a field is analyzed during indexing, it is broken down into separate terms. If you use an analyzed field as a facet, the facet counts could potentially become misleading. This is because each individual term in the field would be treated as a separate facet.
Consider a field containing the category "Science Fiction".
If this field is analyzed, it would be broken down into two separate terms: "Science" and "Fiction":
- Indexed Field: "Science Fiction"
- Analyzer: Breaks down the field into tokens: ["Science", "Fiction"]
- Faceted Results:
- Science: 1
- Fiction: 1
In this case, the facet counts are not useful because they treat "Science" and "Fiction" as separate facets, which does not reflect the actual category "Science Fiction".
On the other hand, if a field is unanalyzed, then the entire content of the field is treated as a single term. This can be useful for certain types of data, such as categories or tags, where you want the entire field to be treated as a single facet. For example, if you have a field containing the category “Science Fiction”, and this field is not analyzed, then “Science Fiction” would be treated as a single facet.
Suppose you have a website where you sell books, and you want to provide faceting on the book genre and price.
- Indexed Data:
- Document 1: { Title: "A Brief History of Time", Genre: "Science", Price: 40 }
- Document 2: { Title: "The Selfish Gene", Genre: "Science", Price: 50 }
- Document 3: { Title: "Fahrenheit 451", Genre: "Science Fiction", Price: 15 }
- Fields:
- Genre: A good candidate for a non-analyzed field because genre names don’t need to be split into smaller terms.
- Price: Typically faceted as a range (e.g., $0-$20, $21-$40, etc.), and therefore also non-analyzed but treated differently as numeric range faceting.
When you facet on the "Genre" field, Lucene will aggregate data based on the exact terms stored in the index. Since "Genre" is unanalyzed, the terms are stored as they are, such as "Science" and "Science Fiction".
- Query: Searching for books related to "Science"
- Faceted Results:
- Science: 2
- Science Fiction: 1
Here, the facets correctly group the books under their exact genres, "Science" and "Science Fiction", rather than splitting "Science Fiction" into separate terms.
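Lucene.NET ships faceting in the Lucene.Net.Facet package, where facet values live in a small taxonomy index alongside the main index. Here is a hedged sketch of counting genres for the book example; the field names and the use of in-memory directories are assumptions made for illustration.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Facet;
using Lucene.Net.Facet.Taxonomy;
using Lucene.Net.Facet.Taxonomy.Directory;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;
var config = new FacetsConfig();

using var indexDir = new RAMDirectory();
using var taxoDir = new RAMDirectory();

using (var writer = new IndexWriter(indexDir, new IndexWriterConfig(version, new StandardAnalyzer(version))))
using (var taxoWriter = new DirectoryTaxonomyWriter(taxoDir))
{
    void AddBook(string title, string genre, double price)
    {
        var doc = new Document();
        doc.Add(new TextField("title", title, Field.Store.YES));
        doc.Add(new DoubleField("price", price, Field.Store.YES));
        // The whole genre is a single facet value, so "Science Fiction" stays intact.
        doc.Add(new FacetField("Genre", genre));
        writer.AddDocument(config.Build(taxoWriter, doc));
    }

    AddBook("A Brief History of Time", "Science", 40);
    AddBook("The Selfish Gene", "Science", 50);
    AddBook("Fahrenheit 451", "Science Fiction", 15);
}

using var reader = DirectoryReader.Open(indexDir);
using var taxoReader = new DirectoryTaxonomyReader(taxoDir);
var searcher = new IndexSearcher(reader);

// Collect facet counts for all documents matching the query.
var collector = new FacetsCollector();
FacetsCollector.Search(searcher, new MatchAllDocsQuery(), 10, collector);

Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, collector);
// Prints counts along the lines of: Science (2), Science Fiction (1)
Console.WriteLine(facets.GetTopChildren(10, "Genre"));
```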