How Lucene.NET works
Understand how Indexing and Queries work in Lucene
DynamicWeb 10 uses Lucene.NET, a C# port of the Java-based Apache Lucene search engine library, to provide high-performance, full-featured text indexing and search. In this article we will dive into how indexing and querying work in Lucene, but let's first get the search engine terminology in place:
- Indexing: The process of converting text data into a searchable format
- Queries: A request for information from an index; it typically consists of keywords or phrases that the user wants to find in a collection of documents
So, basically, you index a collection of documents (e.g. pages, products, or users) and then use queries to quickly find the documents that contain the terms defined in the query.
How documents are indexed
Indexing in Lucene is the process of converting text data into a searchable format. Think of the index at the end of a book; it allows you to quickly find specific topics or keywords without having to read through the entire book. Similarly, a Lucene index enables efficient searches by quickly pointing to the documents that contain the search terms.
In Lucene, indexing involves the following steps:
- Document collection: Gather documents that need to be searchable
- Document analysis and indexing: Break down the text in documents into searchable terms using analyzers
- Inverted Index Creation: Build an inverted index that maps terms to the documents in which they appear
Step 1: Documents are collected
The process begins when documents are collected for indexing. In Lucene, a document is a fundamental unit of indexing and searching. Each document is a collection of fields, where each field represents a specific piece of information about the document. These fields can store various types of information, such as text, numbers, dates, URLs, etc.
Let’s consider an e-commerce scenario where we're indexing product information for a search engine on an online shopping platform. Each product can be represented as a document with a collection of fields in Lucene:
- Product name: The name of the product
- Description: A brief description of the product
- Price: The price of the product
- Category: The category to which the product belongs
- SKU: The Stock Keeping Unit, a unique identifier for the product
- URL: The URL to the product page
A concrete example of a product we could index might be:
- Product name: "Wireless Bluetooth Headphones"
- Description: "High-quality wireless Bluetooth headphones with noise-cancellation and long battery life"
- Price: 59.99
- Category: "Electronics"
- SKU: "WH123456"
- URL: http://www.example.com/products/wh123456
Step 2: Documents are analyzed and indexed
The next step is to determine, for each field in a document, whether we want to store, index, and/or analyze it, depending on what needs to be done with its content. Here’s how each type of field is used:
Stored fields are fields whose exact content is saved in the index. These fields are primarily used for retrieval and display purposes. When a document is indexed, the stored fields are kept in their original form so that they can be retrieved and shown in search results.
- Purpose: To retain the original value for display or later retrieval
- Example Use Cases:
- Article Title: Displaying the title of an article in search results
- Publication Date: Showing the publication date of a document
- URL: Providing a clickable link to the product page in search results
For our example product we might want to do the following, as illustrated in the code sketch after this list:
- Product name: This field should be stored, indexed, and analyzed so that it can be displayed in search results and used for searching
- Description: This field should be stored, indexed, and analyzed so that users can search for terms within the product descriptions
- Price: This field can be stored (for display purposes) and indexed if we want to perform range queries (e.g., searching for products within a certain price range)
- Category: This field should be stored and indexed to allow filtering and searching by category
- SKU: This field should be stored and indexed for exact-match searches
- URL: This field should be stored to provide a link to the product page in the search results
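These choices map directly onto Lucene.NET's field types (TextField, StringField, DoubleField, StoredField). Below is a minimal sketch of indexing the example product with Lucene.NET 4.8; the directory path and field names are illustrative assumptions, not something DynamicWeb prescribes.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;

// Open (or create) the index directory and an IndexWriter that uses the StandardAnalyzer.
using var dir = FSDirectory.Open("product-index");
using var writer = new IndexWriter(dir, new IndexWriterConfig(version, new StandardAnalyzer(version)));

var doc = new Document();

// Stored, indexed, and analyzed: full-text searchable and displayable in results.
doc.Add(new TextField("name", "Wireless Bluetooth Headphones", Field.Store.YES));
doc.Add(new TextField("description",
    "High-quality wireless Bluetooth headphones with noise-cancellation and long battery life",
    Field.Store.YES));

// Stored and indexed as a numeric value, which enables range queries on price.
doc.Add(new DoubleField("price", 59.99, Field.Store.YES));

// Stored and indexed, but not analyzed: exact-match filtering on the whole value.
doc.Add(new StringField("category", "Electronics", Field.Store.YES));
doc.Add(new StringField("sku", "WH123456", Field.Store.YES));

// Stored only: kept for display in search results, never searched.
doc.Add(new StoredField("url", "http://www.example.com/products/wh123456"));

writer.AddDocument(doc);
writer.Commit();
```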
Step 3: The inverted index is created
The inverted index is the backbone of Lucene's search functionality. It allows for fast and efficient searching by mapping terms to the documents that contain them. This structure is similar to the index at the end of a book, but inverted. Instead of pointing from a topic to a page number, it points from a term to the documents that contain that term. This sounds a bit confusing, but bear with me. It'll become clear as we move on.
The fields of a document and the inverted index function as follows:
- Analyzed and indexed fields: These fields are processed (split into tokens) and added to the inverted index. This enhances search functionality by enabling full-text searches. For example, the phrase "Wireless Bluetooth Headphones" might be split into tokens like "wireless", "bluetooth", and "headphones", all of which are stored in the inverted index.
- Non-analyzed indexed fields: These fields are added to the inverted index as complete terms without being split. This is useful for exact-match searches. For example, the SKU "WH123456" is stored in the inverted index as a single term.
- Stored fields: These fields are not included in the inverted index. Since stored fields are not meant for searching, they do not need to be part of the inverted index. For example, a URL like http://www.example.com/products/wh123456 stored in a stored-only field is saved exactly as it is for display in search results but is not searchable.
In summary, the inverted index is specifically designed for fields that need to be searchable, while stored fields are kept separately for retrieval without affecting the search index.
To understand how the inverted index works, let's break it down into steps for a field that is Indexed and Analyzed:
- Tokenization and analysis: When a document is indexed and analyzed, the text is broken down into individual terms (tokens), normalized, and processed through various filters (e.g., lowercasing, removing stop words)
- Building the inverted index: For each term, the inverted index keeps a list of documents (and positions within those documents) where the term appears. This list is known as the postings list
- Segment management: New documents are initially added to an in-memory segment. Periodically, these segments are flushed to disk and merged with other segments to optimize performance (see the sketch below)
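In code, most of this is handled by the IndexWriter. A small, hedged sketch, continuing with the writer from the example above (productDoc1 and productDoc2 are hypothetical documents):

```csharp
// Documents added with AddDocument first go into the writer's in-memory buffer.
writer.AddDocument(productDoc1);
writer.AddDocument(productDoc2);

// Commit flushes the buffered documents to a new segment on disk and makes
// them visible to readers opened after this point.
writer.Commit();

// Lucene merges small segments in the background as the index grows;
// ForceMerge can compact the index down to a given number of segments.
writer.ForceMerge(1);
```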
Let's continue with our e-commerce example and see how the inverted index is created for a product document:
- Product name: "Wireless Bluetooth Headphones" (Stored, Indexed, and Analyzed)
- Description: "High-quality wireless Bluetooth headphones with noise-cancellation and long battery life" (Stored, Indexed, and Analyzed)
- Price: 59.99 (Stored and Indexed)
- Category: "Electronics" (Stored and Indexed)
- SKU: "WH123456" (Stored and Indexed)
- URL: "http://www.example.com/products/wh123456" (Stored)
Tokenization and analysis
The first step is for the fields that are analyzed (in our example, Product name and Description) to be broken down into tokens:
- Product name: ["wireless", "bluetooth", "headphones"]
- Description: ["high", "quality", "wireless", "bluetooth", "headphones", "noise", "cancellation", "long", "battery, "life"]
Building the inverted index
The second step is building the inverted index by mapping each term to its postings list, which includes document IDs and positions within those documents. Let's assume the document ID for this product is 1 and that this is the only product in our inverted index.
The inverted index would then look like this:
| Term | Frequency | Document IDs |
|---|---|---|
| bluetooth | 2 | [1] |
| battery | 1 | [1] |
| cancellation | 1 | [1] |
| headphones | 2 | [1] |
| high | 1 | [1] |
| life | 1 | [1] |
| long | 1 | [1] |
| noise | 1 | [1] |
| quality | 1 | [1] |
| wireless | 2 | [1] |
Two terms have a frequency of 2, wireless and bluetooth; they appear twice in the document (once in Product name and once in Description). The other terms occur only once and therefore have a frequency of 1.
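You can read these statistics back from the index itself. A minimal sketch, assuming the index directory and field names from the earlier indexing example (note that these statistics are kept per field):

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Store;

using var dir = FSDirectory.Open("product-index");
using var reader = DirectoryReader.Open(dir);

// DocFreq: how many documents contain the term in the given field.
int docs = reader.DocFreq(new Term("description", "bluetooth"));

// TotalTermFreq: how many times the term occurs in that field across all documents.
long occurrences = reader.TotalTermFreq(new Term("description", "bluetooth"));

Console.WriteLine($"'bluetooth' occurs {occurrences} time(s) in {docs} document(s).");
```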
Adding new documents
The third step happens when new documents are added, so let's add a new product and see how the inverted index gets updated.
Here's the new product information:
- Product name: "Noise Cancelling Over-Ear Headphones"
- Description: "Experience high-fidelity sound with noise-cancelling technology and comfortable over-ear design"
- Price: 89.99 (Stored and Indexed)
- Category: "Electronics"
- SKU: "NC987654"
- URL: http://www.example.com/products/nc987654
Tokenization and analysis of the new product (with stop words like "with" and "and" removed, as before):
- Product name: ["noise", "cancelling", "over", "ear", "headphones"]
- Description: ["experience", "high", "fidelity", "sound", "noise", "cancelling", "technology", "comfortable", "over", "ear", "design"]
Assuming that the new product's document ID is 2, the updated inverted index would look like this:
| Term | Frequency | Document IDs |
|---|---|---|
| bluetooth | 2 | [1] |
| battery | 1 | [1] |
| cancellation | 1 | [1] |
| cancelling | 2 | [2] |
| headphones | 3 | [1, 2] |
| high | 2 | [1, 2] |
| life | 1 | [1] |
| long | 1 | [1] |
| noise | 3 | [1, 2] |
| over | 2 | [2] |
| ear | 2 | [2] |
| quality | 1 | [1] |
| wireless | 2 | [1] |
| experience | 1 | [2] |
| fidelity | 1 | [2] |
| sound | 1 | [2] |
| technology | 1 | [2] |
| comfortable | 1 | [2] |
| design | 1 | [2] |
In some cases, the positions of terms within the documents are also included in the inverted index. Here's how the inverted index would look with the positions included (positions are counted per field, after stop words are removed):
| Term | Frequency | Document IDs (Positions) |
|---|---|---|
| bluetooth | 2 | [1 (2), 1 (4)] |
| battery | 1 | [1 (9)] |
| cancellation | 1 | [1 (7)] |
| cancelling | 2 | [2 (2), 2 (6)] |
| headphones | 3 | [1 (3, 5), 2 (5)] |
| high | 2 | [1 (1), 2 (2)] |
| life | 1 | [1 (10)] |
| long | 1 | [1 (8)] |
| noise | 3 | [1 (6), 2 (1, 5)] |
| over | 2 | [2 (3), 2 (9)] |
| ear | 2 | [2 (4), 2 (10)] |
| quality | 1 | [1 (2)] |
| wireless | 2 | [1 (1), 1 (3)] |
| experience | 1 | [2 (1)] |
| fidelity | 1 | [2 (3)] |
| sound | 1 | [2 (4)] |
| technology | 1 | [2 (7)] |
| comfortable | 1 | [2 (8)] |
| design | 1 | [2 (11)] |
As you can see, the term "bluetooth" appears twice in Document ID 1 - in the second position in Product name and in the fourth position in Description. The term "headphones" appears twice in Document ID 1 (third position in Product name and fifth position in Description) and once in Document ID 2 (fifth position in Product name). And so on and so forth.
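Positions are what make phrase queries possible: Lucene can verify not only that two terms occur in the same document, but that they occur right next to each other. A small sketch, assuming the index and field names from the earlier example:

```csharp
using System;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

using var dir = FSDirectory.Open("product-index");
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);

// Matches documents where "wireless" is immediately followed by "bluetooth"
// in the name field - something only the stored positions make possible.
var phrase = new PhraseQuery();
phrase.Add(new Term("name", "wireless"));
phrase.Add(new Term("name", "bluetooth"));

TopDocs hits = searcher.Search(phrase, 10);
Console.WriteLine($"Matches: {hits.TotalHits}");
```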
How queries work
Once the inverted index has been created, you can start applying search queries to it. Lucene uses a query parser to translate user input into search queries that the search engine can understand. Here's a step-by-step explanation of how this works (a code sketch follows the list):
- Query parsing: Translates the user's search input into a query object that Lucene can process. This object can represent simple term queries, phrase queries, Boolean queries, and more
- Analyzing the query: Query terms are analyzed in a similar way to how the documents were analyzed during indexing. This involves tokenizing the input and applying the same analyzers to ensure consistency
- Searching the inverted index: Searches for documents that contain the analyzed query terms. It looks up each term in the index and retrieves the corresponding postings lists, which contain the document IDs and positions where the terms appear
- Scoring and ranking: Calculates a relevance score for each document based on factors such as term frequency (how often a term appears in the document) and inverse document frequency (how rare the term is across all documents)
- Returning results: Returns the documents sorted by their relevance scores. If the fields are stored, the original content can be retrieved and displayed in the search results
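Here is roughly how these steps look in Lucene.NET using the classic QueryParser. This is a minimal sketch; it assumes the index directory and field names from the indexing example earlier.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;

using var dir = FSDirectory.Open("product-index");
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);

// 1-2. Parse and analyze the user's input with the same analyzer used at index time.
var parser = new QueryParser(version, "description", new StandardAnalyzer(version));
Query query = parser.Parse("wireless headphones");

// 3-4. Look the terms up in the inverted index and score the matching documents.
TopDocs results = searcher.Search(query, 10);

// 5. Retrieve stored fields for display.
foreach (ScoreDoc hit in results.ScoreDocs)
{
    var doc = searcher.Doc(hit.Doc);
    Console.WriteLine($"{doc.Get("name")} (score {hit.Score})");
}
```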
Let's continue with the indexed product documents and apply a search query to them. We start with an example of a user's search input turned into a query object and analyzed:
- Query: "wireless headphones"
- Analyzed query terms: ["wireless", "headphones"]
Here’s the inverted index with positions from our example:
| Term | Frequency | Document IDs (Positions) |
|---|---|---|
| bluetooth | 2 | [1 (2), 1 (4)] |
| battery | 1 | [1 (9)] |
| cancellation | 1 | [1 (7)] |
| cancelling | 2 | [2 (2), 2 (6)] |
| headphones | 3 | [1 (3, 5), 2 (5)] |
| high | 2 | [1 (1), 2 (2)] |
| life | 1 | [1 (10)] |
| long | 1 | [1 (8)] |
| noise | 3 | [1 (6), 2 (1, 5)] |
| over | 2 | [2 (3), 2 (9)] |
| ear | 2 | [2 (4), 2 (10)] |
| quality | 1 | [1 (2)] |
| wireless | 2 | [1 (1), 1 (3)] |
| experience | 1 | [2 (1)] |
| fidelity | 1 | [2 (3)] |
| sound | 1 | [2 (4)] |
| technology | 1 | [2 (7)] |
| comfortable | 1 | [2 (8)] |
| design | 1 | [2 (11)] |
The scoring for the query "wireless headphones":
- Document 1 contains both terms "wireless" (positions 1, 3) and "headphones" (positions 3, 5)
- Document 2 contains the term "headphones" (position 5) but not "wireless"
Document 1 is more relevant and will receive a higher score.
The returned results for the query "wireless headphones":
- Document 1: "Wireless Bluetooth Headphones" (High-quality wireless Bluetooth headphones with noise-cancellation and long battery life.)
- Document 2: "Noise Cancelling Over-Ear Headphones" (Experience high-fidelity sound with noise-cancelling technology and comfortable over-ear design.)
In conclusion, document 1 is more relevant to the query and will appear first in the results.
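If you want to see why a document received its score, Lucene can explain the calculation. A small sketch, reusing the searcher, query, and results from the sketch above:

```csharp
// Explain breaks a document's score down into term frequency, inverse document
// frequency, field norms, and so on.
Explanation explanation = searcher.Explain(query, results.ScoreDocs[0].Doc);
Console.WriteLine(explanation.ToString());
```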
How analyzers work
Analyzers in Lucene are responsible for converting text into a form that is suitable for indexing. They perform a series of transformations on the input text to break it down into searchable terms - tokens - in a process known as tokenization. The tokens are then added to the index and are what you search for when you perform a search query.
An analyzer works like this:
- A tokenizer breaks the text into tokens
- A series of TokenFilters process the tokens further to e.g. remove stop words or convert all tokens to lowercase
A tokenizer is therefore a part of the analyzer and splits the text into tokens based on specific rules when documents are added to the index. The rules vary depending on the type of tokenizer used.
Tokenizers
Lucene offers several built-in tokenizers, each designed for different use cases:
- StandardTokenizer: Splits text into words, removing most symbols and punctuation. Good for general text.
- WhitespaceTokenizer: Splits text at whitespace. Good for tokenizing text with a known structure.
- LetterTokenizer: Splits text into tokens whenever a non-letter character is encountered. Useful for basic linguistic analysis and applications where only words are relevant
- KeywordTokenizer: Treats the entire text as a single token. Useful for certain types of data like IDs or URLs.
Let's see how a tokenizer handles text, using the StandardTokenizer on this example:
"Lucene in Action, Second Edition (2010)"
The StandardTokenizer splits the text into words and removes most punctuation, resulting in the following tokens:
Lucene
in
Action
Second
Edition
2010
Note that the comma and parentheses have been removed from the text. This makes it easier for the search engine to process and match the terms accurately during queries.
TokenFilters
After tokenization, Lucene can apply several optional TokenFilters to modify the tokens further.
Types of TokenFilter:
- LowerCaseFilter: Converts all tokens to lowercase.
- StopFilter: Removes common "stop words" (like "the", "and", etc.).
- SynonymFilter: Adds synonyms of words to increase search hits.
For the text "Lucene in Action, Second Edition (2010)", using the StandardTokenizer followed by a LowerCaseFilter and StopFilter, the tokens would be:
lucene
action
second
edition
2010
As you can see, all the tokens have been converted to lowercase and the word 'in' has been removed since it's a stop word.
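You can inspect exactly which tokens an analyzer produces by running a token stream over the text. A minimal sketch using the StandardAnalyzer; the field name here is only a label and is an assumption:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

using TokenStream stream = analyzer.GetTokenStream(
    "title", new StringReader("Lucene in Action, Second Edition (2010)"));
var term = stream.AddAttribute<ICharTermAttribute>();

stream.Reset();
while (stream.IncrementToken())
{
    // Prints: lucene, action, second, edition, 2010
    Console.WriteLine(term.ToString());
}
stream.End();
```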
Standard analyzers
Here are a few common types of analyzers in Lucene:
- StandardAnalyzer: Combines the StandardTokenizer with the LowerCaseFilter and StopFilter. It's good for general text processing
- SimpleAnalyzer: Uses a LowerCaseTokenizer, which divides text at non-letter characters and converts the tokens to lowercase
- WhitespaceAnalyzer: Uses a WhitespaceTokenizer to divide text at whitespace. It does not convert characters to lowercase or remove stop words
- KeywordAnalyzer: Does not tokenize the text but treats the entire string as a single token
- CustomAnalyzer: Allows for the construction of an analyzer with custom components (tokenizer and filters)
Analyzers and queries
When you execute a query, the same Analyzer that was used for indexing should also be used for the query - here's why:
- Tokenization: Different analyzers tokenize text differently. If the search query is tokenized differently from the indexed text, it might not match as expected
- Normalization: Analyzers may apply normalization techniques such as lowercasing, stemming, or removing stop words. Inconsistent normalization between indexing and searching can lead to mismatches
- Filters: Some analyzers apply filters to remove or alter tokens. Using different filters for indexing and searching can result in unexpected behaviour
In short, using the same analyzer ensures that the search terms are processed in the same way as the indexed text.
Using different analyzers
To illustrate, here's an example where two different analyzers are used:
First we index documents using the WhitespaceAnalyzer:
- Documents:
- Document 1: "Lucene in Action"
- Document 2: "Introduction to Lucene for .Net"
- Analyzer:
- WhitespaceAnalyzer
- Indexed Tokens:
- Document 1: ["Lucene", "in", "Action"]
- Document 2: ["Introduction", "to", "Lucene", "for", ".NET"]
Then we search the content using the StandardAnalyzer:
- Query: "Lucene"
- Analyzer:
- StandardAnalyzer which lowercases the text and removes stopwords
- Query Tokens:
- Lucene becomes ["lucene"]
- Search Process:
- The query token "lucene" (lowercased) is compared against the indexed tokens
- Document 1: Contains "Lucene", not "lucene" (case-sensitive mismatch)
- Document 2: Contains "Lucene", not "lucene" (case-sensitive mismatch)
- The query token "lucene" (lowercased) is compared against the indexed tokens
- Search Results:
- No matches found, because the token "lucene" does not match "Lucene" due to case differences.
Had we used the WhitespaceAnalyzer for the search, both documents would have been a match.
Using the same analyzer
In contrast, here is how the same query works when we use the same analyzer (the StandardAnalyzer) for both indexing and queries (a code sketch follows the list):
- Indexing:
- Document 1: ["lucene", "in", "action"]
- Document 2: ["introduction", "to", "lucene", "for", "net"]
- Query analysis:
- "Lucene" becomes ["lucene"]
- Search results:
- Both documents match because "lucene" is a term in both indexed documents
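Here is a compact sketch of that happy path, indexing and querying with the same StandardAnalyzer (the in-memory directory and field name are illustrative assumptions):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;
var analyzer = new StandardAnalyzer(version);

using var dir = new RAMDirectory();
using (var writer = new IndexWriter(dir, new IndexWriterConfig(version, analyzer)))
{
    var doc1 = new Document();
    doc1.Add(new TextField("title", "Lucene in Action", Field.Store.YES));
    writer.AddDocument(doc1);

    var doc2 = new Document();
    doc2.Add(new TextField("title", "Introduction to Lucene for .NET", Field.Store.YES));
    writer.AddDocument(doc2);
}

using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);

// The query is analyzed with the same analyzer, so "Lucene" becomes "lucene"
// and matches the lowercased tokens in the index.
Query query = new QueryParser(version, "title", analyzer).Parse("Lucene");
Console.WriteLine(searcher.Search(query, 10).TotalHits); // 2
```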
How facets work
Faceted search is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters. In Lucene, facets are represented as a hierarchy of categories, and each document can be assigned to one or more categories.
Facets require a separate faceted index structure alongside the main Lucene index. The basic process involves defining which fields should be faceted when the documents are indexed. These fields can be either analyzed or not analyzed, affecting how faceting results are displayed and utilized.
When a field is analyzed during indexing, it is broken down into separate terms. If you use an analyzed field as a facet, the facet counts could potentially become misleading. This is because each individual term in the field would be treated as a separate facet.
Consider a field containing the category "Science Fiction".
If this field is analyzed, it would be broken down into two separate terms: "Science" and "Fiction":
- Indexed Field: "Science Fiction"
- Analyzer: Breaks down the field into tokens: ["Science", "Fiction"]
- Faceted Results:
- Science: 1
- Fiction: 1
In this case, the facet counts are not useful because they treat "Science" and "Fiction" as separate facets, which does not reflect the actual category "Science Fiction".
On the other hand, if a field is unanalyzed, then the entire content of the field is treated as a single term. This can be useful for certain types of data, such as categories or tags, where you want the entire field to be treated as a single facet. For example, if you have a field containing the category “Science Fiction”, and this field is not analyzed, then “Science Fiction” would be treated as a single facet.
Suppose you have a website where you sell books, and you want to provide faceting on the book genre and price.
- Indexed Data:
- Document 1: { Title: "A Brief History of Time", Genre: "Science", Price: 40 }
- Document 2: { Title: "The Selfish Gene", Genre: "Science", Price: 50 }
- Document 3: { Title: "Fahrenheit 451", Genre: "Science Fiction", Price: 15 }
- Fields:
- Genre: A good candidate for a non-analyzed field because genre names don’t need to be split into smaller terms.
- Price: Typically faceted as a range (e.g., $0-$20, $21-$40, etc.), and therefore also non-analyzed but treated differently as numeric range faceting.
When you facet on the "Genre" field, Lucene will aggregate data based on the exact terms stored in the index. Since "Genre" is unanalyzed, the terms are stored as they are, such as "Science" and "Science Fiction".
- Query: Searching for books related to "Science"
- Faceted Results:
- Science: 2
- Science Fiction: 1
Here, the facets correctly group the books under their exact genres, "Science" and "Science Fiction", rather than splitting "Science Fiction" into separate terms.
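Lucene.NET ships faceting in the Lucene.Net.Facet package, where facet values live in a small taxonomy index alongside the main index. Here is a hedged sketch of counting genres for the book example; the field names and the use of in-memory directories are assumptions made for illustration.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Facet;
using Lucene.Net.Facet.Taxonomy;
using Lucene.Net.Facet.Taxonomy.Directory;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;
var config = new FacetsConfig();

using var indexDir = new RAMDirectory();
using var taxoDir = new RAMDirectory();

using (var writer = new IndexWriter(indexDir, new IndexWriterConfig(version, new StandardAnalyzer(version))))
using (var taxoWriter = new DirectoryTaxonomyWriter(taxoDir))
{
    void AddBook(string title, string genre, double price)
    {
        var doc = new Document();
        doc.Add(new TextField("title", title, Field.Store.YES));
        doc.Add(new DoubleField("price", price, Field.Store.YES));
        // The whole genre is a single facet value, so "Science Fiction" stays intact.
        doc.Add(new FacetField("Genre", genre));
        writer.AddDocument(config.Build(taxoWriter, doc));
    }

    AddBook("A Brief History of Time", "Science", 40);
    AddBook("The Selfish Gene", "Science", 50);
    AddBook("Fahrenheit 451", "Science Fiction", 15);
}

using var reader = DirectoryReader.Open(indexDir);
using var taxoReader = new DirectoryTaxonomyReader(taxoDir);
var searcher = new IndexSearcher(reader);

// Collect facet counts for all documents matching the query.
var collector = new FacetsCollector();
FacetsCollector.Search(searcher, new MatchAllDocsQuery(), 10, collector);

Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, collector);
// Prints counts along the lines of: Science (2), Science Fiction (1)
Console.WriteLine(facets.GetTopChildren(10, "Genre"));
```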