Understanding Lucene Analyzers

DynamicWeb uses Apache Lucene, a powerful, high-performance, full-featured text search engine library.

One of Lucene's core processes is indexing: converting text data into a searchable format. Think of the index at the end of a book; it allows you to quickly find specific topics or keywords without having to read through the entire book. Similarly, a Lucene index enables efficient searches by quickly pointing to the documents that contain the search terms.

During indexing, documents are processed and their contents are transformed into a structured form that enables fast and efficient search queries. This is done through analyzers. In this article, we will delve into the inner workings of Lucene 4’s indexing process, with a particular focus on analyzers.

What are analyzers?

Analyzers in Lucene are responsible for converting text into a form that is suitable for indexing. They perform a series of transformations on the input text to break it down into searchable terms - tokens - in a process known as tokenization. The tokens are then added to the index and are what you search for when you perform a search query.

An analyzer works like this:

  • A Tokenizer breaks the text into tokens
  • A series of TokenFilters process the tokens further, e.g. removing stop words or converting all tokens to lowercase

A tokenizer is therefore a part of the analyzer and splits the text into tokens based on specific rules when documents are added to the index. The rules vary depending on the type of tokenizer used.
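To make this pipeline concrete, here is a minimal sketch of a custom analyzer, assuming Lucene.NET 4.8 (the .NET port of the Lucene 4 API; class and namespace names may differ in other versions or ports). The class name MyLowercaseAnalyzer is purely illustrative:

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;       // LowerCaseFilter
using Lucene.Net.Analysis.Standard;   // StandardTokenizer
using Lucene.Net.Util;                // LuceneVersion

// Illustrative analyzer: a StandardTokenizer followed by a LowerCaseFilter.
public sealed class MyLowercaseAnalyzer : Analyzer
{
    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Step 1: the tokenizer breaks the text into tokens.
        Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);

        // Step 2: token filters process the tokens further (here: lowercasing).
        TokenStream filters = new LowerCaseFilter(LuceneVersion.LUCENE_48, tokenizer);

        return new TokenStreamComponents(tokenizer, filters);
    }
}
```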

Tokenizers

Lucene offers several built-in tokenizers, each designed for different use cases:

  • StandardTokenizer: Splits text into words, removing most symbols and punctuation. Good for general text.
  • WhitespaceTokenizer: Splits text at whitespace. Good for tokenizing text with a known structure.
  • LetterTokenizer: Splits text into tokens whenever a non-letter character is encountered. Useful for basic linguistic analysis and applications where only words are relevant.
  • KeywordTokenizer: Treats the entire text as a single token. Useful for certain types of data like IDs or URLs.

Let's see how each of these tokenizers handles this text:

"Lucene in Action, Second Edition (2010)"

StandardTokenizer

Using the StandardTokenizer:

This tokenizer splits the text into words and removes most punctuation, resulting in the following tokens:

  • Lucene
  • in
  • Action
  • Second
  • Edition
  • 2010

Note that the comma and parentheses have been removed from the text. This makes it easier for the search engine to process and match the terms accurately during queries.

WhitespaceTokenizer

Using the WhitespaceTokenizer:

As the name indicates, this tokenizer splits the text at whitespace, resulting in the following tokens:

  • Lucene
  • in
  • Action,
  • Second
  • Edition
  • (2010)

Note that the comma and parentheses have not been removed, since the tokenizer splits at whitespace and does not remove punctuation. This can be useful in some situations, such as URLs, where punctuation has a specific significance.

LetterTokenizer

Using the LetterTokenizer:

This tokenizer splits the text into tokens at any non-letter character:

  • Lucene
  • in
  • Action
  • Second
  • Edition

Note that the comma and the entire '(2010)' have been removed, since this tokenizer only retains letter characters, splitting the text at any non-letter character. This can be useful in situations where you want to focus purely on alphabetic content and ignore numbers and punctuation, such as in spell-checking or linguistic analysis.

KeywordTokenizer

Using the KeywordTokenizer:

This tokenizer treats the entire text as a single token:

"Lucene in Action, Second Edition (2010)"

Note that no splitting or removal of characters occurs, since the entire text is treated as one token. This can be useful in situations where you need to preserve the exact format of the text, such as for IDs, URLs, or exact phrases.

TokenFilters

After tokenization, Lucene can apply several optional TokenFilters to modify the tokens further.

Types of TokenFilter:

  • LowerCaseFilter: Converts all tokens to lowercase.
  • StopFilter: Removes common "stop words" (like "the", "and", etc.).
  • SynonymFilter: Adds synonyms of words to increase search hits.

For the text "Lucene in Action, Second Edition (2010)", using the StandardTokenizer followed by a LowerCaseFilter and StopFilter, the tokens would be:

  • lucene
  • action
  • second
  • edition
  • 2010

As you can see, all the tokens have been converted to lowercase and the word 'in' has been removed since it's a stop word.
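Assuming Lucene.NET 4.8, this token list can be reproduced by chaining the filters onto the tokenizer by hand, reusing the illustrative PrintTokens helper sketched earlier:

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;      // LowerCaseFilter, StopFilter
using Lucene.Net.Analysis.Standard;  // StandardTokenizer, StandardAnalyzer
using Lucene.Net.Util;

const string text = "Lucene in Action, Second Edition (2010)";

// The tokenizer comes first...
TokenStream stream = new StandardTokenizer(LuceneVersion.LUCENE_48, new StringReader(text));
// ...and each TokenFilter wraps the previous stream.
stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, stream);
stream = new StopFilter(LuceneVersion.LUCENE_48, stream, StandardAnalyzer.STOP_WORDS_SET);

PrintTokens(stream);   // lucene, action, second, edition, 2010
```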

Standard Lucene Analyzers

Here are a few common types of analyzers in Lucene:

  • StandardAnalyzer: Combines the StandardTokenizer with the LowerCaseFilter and StopFilter. It's good for general text processing
  • SimpleAnalyzer: Uses a LowerCaseTokenizer, which divides text at non-letter characters and converts the tokens to lowercase
  • WhitespaceAnalyzer: Uses a WhitespaceTokenizer to divide text at whitespace. It does not convert tokens to lowercase or remove stop words
  • KeywordAnalyzer: Does not tokenize the text but treats the entire string as a single token
  • CustomAnalyzer: Allows for the construction of an analyzer from custom components (a tokenizer and filters)
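As a rough sketch, assuming Lucene.NET 4.8, these analyzers can be created like this (a custom analyzer is built by subclassing Analyzer and overriding CreateComponents, as in the earlier sketch):

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;      // SimpleAnalyzer, WhitespaceAnalyzer, KeywordAnalyzer
using Lucene.Net.Analysis.Standard;  // StandardAnalyzer
using Lucene.Net.Util;

Analyzer standard   = new StandardAnalyzer(LuceneVersion.LUCENE_48);    // StandardTokenizer + LowerCaseFilter + StopFilter
Analyzer simple     = new SimpleAnalyzer(LuceneVersion.LUCENE_48);      // splits at non-letters, lowercases
Analyzer whitespace = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);  // splits at whitespace only
Analyzer keyword    = new KeywordAnalyzer();                            // whole input as a single token
```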

Searching Indexed Fields

Once the text has been broken down into terms and added to the index, you can perform search queries on it.

First, create a QueryParser with the same Analyzer used for indexing. Then, use the parse method to convert the search query into a Query object. Finally, pass this Query object to the search method of the IndexSearcher.
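A minimal end-to-end sketch of this flow, assuming Lucene.NET 4.8 with an in-memory index (the directory, the field name "title", and the sample document are illustrative):

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
var directory = new RAMDirectory();   // in-memory index for the sketch

// Index a document with the analyzer.
using (var writer = new IndexWriter(directory, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
{
    var doc = new Document();
    doc.Add(new TextField("title", "Lucene in Action, Second Edition (2010)", Field.Store.YES));
    writer.AddDocument(doc);
}

// Search using a QueryParser built with the SAME analyzer.
using var reader = DirectoryReader.Open(directory);
var searcher = new IndexSearcher(reader);
var parser = new QueryParser(LuceneVersion.LUCENE_48, "title", analyzer);
Query query = parser.Parse("Lucene");

TopDocs hits = searcher.Search(query, 10);
Console.WriteLine(hits.TotalHits);   // 1
```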

When you perform a search, the same Analyzer that was used for indexing should also be used for searching - here's why:

  • Tokenization: Different analyzers tokenize text differently. If the search query is tokenized differently from the indexed text, it might not match as expected
  • Normalization: Analyzers may apply normalization techniques such as lowercasing, stemming, or removing stop words. Inconsistent normalization between indexing and searching can lead to mismatches
  • Filters: Some analyzers apply filters to remove or alter tokens. Using different filters for indexing and searching can result in unexpected behaviour

In short, using the same analyzer ensures that the search terms are processed in the same way as the indexed text.

Example: Different analyzers

To illustrate, here's an example where two different analyzers are used:

First, we index the documents using the WhitespaceAnalyzer:

  1. Documents:
    • Document 1: "Lucene in Action"
    • Document 2: "Introduction to Lucene for .Net"
  2. Analyzer:
    • WhitespaceAnalyzer
  3. Indexed Tokens:
    • Document 1: ["Lucene", "in", "Action"]
    • Document 2: ["Introduction", "to", "Lucene", "for", ".NET"]

Then we search the content using the StandardAnalyzer:

  1. Query: "Lucene"
  2. Analyzer
    • StandardAnalyzer, which lowercases the text and removes stop words
  3. Query Tokens:
    • "Lucene" becomes ["lucene"]
  4. Search Process:
    1. The query token "lucene" (lowercased) is compared against the indexed tokens
      • Document 1: Contains "Lucene", not "lucene" (case-sensitive mismatch)
      • Document 2: Contains "Lucene", not "lucene" (case-sensitive mismatch)
  5. Search Results:
    • No matches found, because the token "lucene" does not match "Lucene" due to case differences.

Had we used the WhitespaceAnalyzer for the search, both documents would have matched.
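As a sketch of this mismatch (same Lucene.NET 4.8 assumptions and using directives as the previous example, plus Lucene.Net.Analysis.Core for WhitespaceAnalyzer):

```csharp
var directory = new RAMDirectory();

// Index with WhitespaceAnalyzer: tokens keep their original casing, e.g. "Lucene".
var indexAnalyzer = new WhitespaceAnalyzer(LuceneVersion.LUCENE_48);
using (var writer = new IndexWriter(directory, new IndexWriterConfig(LuceneVersion.LUCENE_48, indexAnalyzer)))
{
    var doc = new Document();
    doc.Add(new TextField("title", "Lucene in Action", Field.Store.YES));
    writer.AddDocument(doc);
}

// Search with StandardAnalyzer: the query "Lucene" is lowercased to "lucene".
var searchAnalyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
var parser = new QueryParser(LuceneVersion.LUCENE_48, "title", searchAnalyzer);

using var reader = DirectoryReader.Open(directory);
var searcher = new IndexSearcher(reader);
TopDocs hits = searcher.Search(parser.Parse("Lucene"), 10);

Console.WriteLine(hits.TotalHits);   // 0, since "lucene" does not match the indexed "Lucene"
```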

Example: Analyzers and Query Expressions

To illustrate how analyzers work in practice with query expressions, consider the following example, where we use the same analyzer for indexing and searching:

The texts to be indexed are:

  • Document 1: "Lucene in Action"
  • Document 2: "Introduction to Lucene for .NET"

The query text is "Lucene".

Using the StandardAnalyzer:

  1. Indexing:
    • Document 1: ["lucene", "in", "action"]
    • Document 2: ["introduction", "to", "lucene", "for", "net"]
  2. Query analysis:
    • "Lucene" becomes ["lucene"]
  3. Search results:
    • Both documents match because "lucene" is a term in both indexed documents

Using the KeywordAnalyzer:

  1. Indexing:
    • Document 1: ["Lucene in Action"]
    • Document 2: ["Introduction to Lucene for .NET"]
  2. Query analysis:
    • "Lucene" becomes ["Lucene"]
  3. Search results:
    • No matches, as neither document contains the exact keyword "Lucene"

Using the WhitespaceAnalyzer:

  1. Indexing:
    • Document 1: ["Lucene", "in", "Action"]
    • Document 2: ["Introduction", "to", "Lucene", "for" ".NET"]
  2. Query analysis:
    • "Lucene" becomes ["Lucene"]
  3. Search results:
    • Both documents match because "Lucene" is a term in both indexed documents

Analyzers and Facets

Faceted search is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters. In Lucene, facets are represented as a hierarchy of categories, and each document can be assigned to one or more categories.

Facets require a separate faceted index structure alongside the main Lucene index. The basic process involves defining which fields should be faceted when the documents are indexed. These fields can be either analyzed or not analyzed, affecting how faceting results are displayed and utilized.

When a field is analyzed during indexing, it is broken down into separate terms. If you use an analyzed field as a facet, the facet counts could potentially become misleading. This is because each individual term in the field would be treated as a separate facet.

Consider a field containing the category "Science Fiction".

If this field is analyzed, it would be broken down into two separate terms: "Science" and "Fiction":

  1. Indexed Field: "Science Fiction"
  2. Analyzer: Breaks down the field into tokens: ["Science", "Fiction"]
  3. Faceted Results:
    • Science: 1
    • Fiction: 1

In this case, the facet counts are not useful because they treat "Science" and "Fiction" as separate facets, which does not reflect the actual category "Science Fiction".

On the other hand, if a field is not analyzed, then the entire content of the field is treated as a single term. This can be useful for certain types of data, such as categories or tags, where you want the entire field to be treated as a single facet. For example, if you have a field containing the category “Science Fiction”, and this field is not analyzed, then “Science Fiction” would be treated as a single facet.

Example: Book genres & price

Suppose you have a website where you sell books, and you want to provide faceting on the book genre and price.

  1. Indexed Data:
    • Document 1: { Title: "A Brief History of Time", Genre: "Science", Price: 40 }
    • Document 2: { Title: "The Selfish Gene", Genre: "Science", Price: 50 }
    • Document 3: { Title: "Fahrenheit 451", Genre: "Science Fiction", Price: 15 }
  2. Fields:
    • Genre: A good candidate for a non-analyzed field because genre names don’t need to be split into smaller terms.
    • Price: Typically faceted as ranges (e.g., $0-$20, $21-$40, etc.) and therefore also non-analyzed, but handled with numeric range faceting rather than as text terms, as sketched below.
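A sketch of how such an index could be built with Lucene's taxonomy-based facet module, assuming Lucene.NET 4.8 and its Lucene.Net.Facet package (directory and field names are illustrative):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Facet;
using Lucene.Net.Facet.Taxonomy.Directory;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

var indexDir = new RAMDirectory();
var taxoDir = new RAMDirectory();        // facets live in a separate taxonomy index
var config = new FacetsConfig();

using var writer = new IndexWriter(indexDir,
    new IndexWriterConfig(LuceneVersion.LUCENE_48, new StandardAnalyzer(LuceneVersion.LUCENE_48)));
using var taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

var books = new (string Title, string Genre, long Price)[]
{
    ("A Brief History of Time", "Science", 40),
    ("The Selfish Gene", "Science", 50),
    ("Fahrenheit 451", "Science Fiction", 15),
};

foreach (var (title, genre, price) in books)
{
    var doc = new Document();
    doc.Add(new TextField("Title", title, Field.Store.YES));
    doc.Add(new FacetField("Genre", genre));            // genre stored as ONE facet value, not analyzed
    doc.Add(new NumericDocValuesField("Price", price)); // numeric field used later for range faceting
    writer.AddDocument(config.Build(taxoWriter, doc));  // translates FacetFields into index + taxonomy entries
}

writer.Commit();
taxoWriter.Commit();
```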

Faceting on Non-Analyzed Field: Genre

When you facet on the "Genre" field, Lucene will aggregate data based on the exact terms stored in the index. Since "Genre" is non-analyzed, the terms are stored as they are, such as "Science" and "Science Fiction".

  1. Query: Searching for books related to "Science"
  2. Faceted Results:
    • Science: 2
    • Science Fiction: 1

Here, the facets correctly group the books: two under "Science" and one under "Science Fiction".
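Continuing the indexing sketch above, the genre counts can be computed with a FacetsCollector (again assuming Lucene.NET 4.8; a match-all query is used here for simplicity):

```csharp
using System;
using Lucene.Net.Facet;
using Lucene.Net.Facet.Taxonomy;
using Lucene.Net.Facet.Taxonomy.Directory;
using Lucene.Net.Index;
using Lucene.Net.Search;

using var reader = DirectoryReader.Open(indexDir);
using var taxoReader = new DirectoryTaxonomyReader(taxoDir);
var searcher = new IndexSearcher(reader);

// Run the query and record which documents matched, so facets can be counted for them.
var collector = new FacetsCollector();
FacetsCollector.Search(searcher, new MatchAllDocsQuery(), 10, collector);

// Count facet values for the "Genre" dimension.
Facets genreFacets = new FastTaxonomyFacetCounts(taxoReader, config, collector);
FacetResult genres = genreFacets.GetTopChildren(10, "Genre");

foreach (LabelAndValue lv in genres.LabelValues)
    Console.WriteLine($"{lv.Label}: {lv.Value}");   // Science: 2, Science Fiction: 1
```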

Faceting on Analyzed Field: Genre

Suppose, alternatively, you decide to analyze the "Genre" field for some reason, using a simple analyzer that breaks text into words based on spaces.

  1. Indexed Values:
    • "A Brief History of Time": ["Science"]
    • "The Selfish Gene": ["Science"]
    • "Fahrenheit 451": ["Science", "Fiction"]
  2. Faceted Results for the same "Science" query:
    • Science: 3
    • Fiction: 1

Here, the simple analyzer breaks down the text based on spaces, so genres like "Science Fiction" are split into the separate terms "Science" and "Fiction". This demonstrates a limitation of using an analyzed field for genre names, as it can incorrectly categorize "Fahrenheit 451" under "Science" instead of "Science Fiction".

Faceting on a Numeric Field: Price

For the "Price" field, you can use range faceting to allow users to search for books within specific price ranges, e.g. $10 - $20. Numeric fields used in faceting are typically indexed using specific data structures that optimize range queries and are not analyzed in the traditional sense.

  1. Faceted Ranges:
    • $0 - $20: 1 (Fahrenheit 451)
    • $21 - $40: 1 (A Brief History of Time)
    • $41 - $60: 1 (The Selfish Gene)

Query: Show price distributions of available books.
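Continuing the same sketch, the price ranges can be counted with Lucene's numeric range facet counts. Note that the type names below follow Lucene.NET's Int64 naming convention and are an assumption; in the Java Lucene 4 API the corresponding classes are LongRange and LongRangeFacetCounts, so verify the names against the exact Lucene version in use:

```csharp
using Lucene.Net.Facet.Range;   // Int64Range, Int64RangeFacetCounts (assumed names; LongRange / LongRangeFacetCounts in Java)

// Count how many matching books fall into each price bucket, based on the
// "Price" NumericDocValuesField indexed earlier.
Facets priceFacets = new Int64RangeFacetCounts("Price", collector,
    new Int64Range("$0 - $20", 0, true, 20, true),
    new Int64Range("$21 - $40", 21, true, 40, true),
    new Int64Range("$41 - $60", 41, true, 60, true));

FacetResult priceResult = priceFacets.GetTopChildren(10, "Price");
foreach (LabelAndValue lv in priceResult.LabelValues)
    Console.WriteLine($"{lv.Label}: {lv.Value}");   // $0 - $20: 1, $21 - $40: 1, $41 - $60: 1
```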
