Operating indexes in production

Indexes are usually critical to a commerce solution: storefronts, product catalogs, search experiences, filtering, user lookups, and integrations all depend on them. When indexes are healthy, search is fast and predictable. When they are stale, incomplete, or corrupted, users can see outdated results, missing content, poor relevance, or failures in dependent features.

In production, the goal is usually not just to build indexes quickly. The goal is to keep at least one good index available at all times while updates are applied safely in the background. This article is for the person responsible for keeping a site fast, searchable, and up to date in a production environment.

It complements the other parts of the implementer documentation by focusing on operations:

  • Which balancing strategy to use
  • When and how often to rebuild indexes
  • What load index builds place on a solution
  • How to recognize when something has gone wrong
  • How to recover safely without taking a good index instance offline

Balancing strategies

An index can have more than one instance. This allows one instance to serve queries while another instance is being rebuilt. How exactly this happens is controlled by the index balancing strategy - there are two possibilities:

  • ActivePassive: One instance is considered online, and another instance is rebuilt in the background. This is often the easiest strategy to reason about operationally because it gives a clear "current serving instance" and a clear standby instance.
  • LastUpdated: The newest healthy instance becomes the preferred instance. This can be useful when you want the most recently completed build to become active automatically.

The balancing strategy is selected when you create an index - afterwards, it can be changed via the context menu for the index in the settings tree.
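The difference between the two strategies can be sketched as follows. This is an illustrative model only - the field names (healthy, online, last_built) and the function are assumptions for the sketch, not a documented API:

```python
from datetime import datetime
from typing import Optional

def preferred_instance(strategy: str, instances: list) -> Optional[dict]:
    """Pick the instance that should serve queries.

    'instances' is a list of hypothetical snapshot dicts; the keys
    'healthy', 'online', and 'last_built' are assumptions made for
    this sketch, not part of any documented schema.
    """
    healthy = [i for i in instances if i.get("healthy")]
    if not healthy:
        return None
    if strategy == "ActivePassive":
        # Keep serving from the instance already marked online;
        # the other instance is the standby being rebuilt.
        online = [i for i in healthy if i.get("online")]
        return online[0] if online else healthy[0]
    if strategy == "LastUpdated":
        # The most recently completed healthy build wins automatically.
        return max(healthy, key=lambda i: i["last_built"])
    raise ValueError(f"unknown strategy: {strategy}")
```

With ActivePassive the serving instance only changes when you switch it; with LastUpdated a newly finished build can take over on its own.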

Rebuild frequencies

How often you rebuild an index depends on how quickly data changes and how expensive a rebuild is.

Small indexes are typically fast to rebuild and have limited effect on CPU, disk, and memory.

These are often suitable for:

  • frequent scheduled rebuilds
  • rebuilds after content changes
  • daytime rebuilds if traffic is modest

Examples:

  • small content indexes
  • limited product assortments
  • user indexes with moderate volume

The important thing is to choose a rebuild cadence that matches the freshness requirements for the index. Let the business requirements guide the choice:

  • Search must reflect changes almost immediately - rebuild often or trigger targeted updates where possible
  • Search can lag by minutes - rebuild on a short schedule
  • Search can lag by hours - rebuild during maintenance windows
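The decision rule above can be expressed as a small lookup. The intervals here are illustrative defaults chosen for the sketch, not product recommendations:

```python
from enum import Enum

class Freshness(Enum):
    IMMEDIATE = "immediate"  # search must reflect changes almost immediately
    MINUTES = "minutes"      # search can lag by minutes
    HOURS = "hours"          # search can lag by hours

def rebuild_interval_minutes(freshness: Freshness) -> int:
    """Map a freshness requirement to a rebuild interval in minutes.

    The concrete numbers are assumptions for illustration; tune them
    to the cost of a rebuild and the traffic profile of the solution.
    """
    return {
        Freshness.IMMEDIATE: 5,    # rebuild often; prefer targeted updates if available
        Freshness.MINUTES: 15,     # short schedule
        Freshness.HOURS: 24 * 60,  # once a day, during a maintenance window
    }[freshness]
```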

Understand the operational impact of a rebuild

The right rebuild schedule is usually a balance between freshness and stability. Rebuilding too often can create unnecessary pressure on the solution without improving the user experience in a meaningful way.

An index rebuild can affect:

  • application CPU usage
  • database load
  • disk throughput
  • memory pressure
  • background task concurrency
  • query latency if the same machine is already under load

Watch for these symptoms:

  • slow front-end responses during build windows
  • increased SQL response time
  • longer-than-normal build duration
  • multiple heavy background jobs competing for resources
  • queues of scheduled tasks falling behind

If builds are expensive, avoid stacking them with:

  • imports
  • synchronization jobs
  • feed generation
  • batch updates
  • deployment operations

When to rebuild manually

You can rebuild an index automatically on a schedule or manually - in a production scenario, manual rebuilds are only appropriate when:

  • an index schema changed
  • relevance behavior changed significantly
  • index data was deleted
  • the UI shows an interrupted or failed instance that has not recovered automatically
  • the build state is inconsistent after maintenance or infrastructure incidents

You should avoid manual rebuilds when a build is already running, the underlying problem is still present, or you are about to restart the environment again.

A healthy production setup

A healthy setup should aim for all of the following:

  • at least one good index instance is always available for queries
  • unfinished or interrupted builds are visible
  • a failed or incomplete instance is repaired before other standby instances are rebuilt
  • a solution restart does not leave an index permanently stuck in an ambiguous state
  • operators can tell the difference between the last attempted build and the last successful build

To help you achieve this, you may find the following operations checklist useful.

Daily operations checklist

Check these regularly on busy solutions:

  1. When was the last successful build for each critical index?
  2. Is any index instance stuck in Starting or Running for too long?
  3. Is there an Interrupted or Failed instance waiting for repair?
  4. Are builds finishing within the expected time window?
  5. Has index size or duration changed significantly after imports or schema changes?
  6. Does one healthy instance remain available while another is rebuilding?

Operational dashboards and alerts

Good operational dashboards usually include:

  • Current lifecycle state
  • Current online instance
  • Last attempted build
  • Last successful build
  • Last heartbeat
  • Build duration trend
  • Failure reason or interruption reason

For business-critical indexes, you should also monitor and alert on:

  • No successful build within expected time window
  • An instance stuck in Running beyond expected duration
  • Any Failed or Interrupted instance
  • Missing healthy standby instance
  • Repeated long build durations

Useful metrics include:

  • Build duration
  • Time since last successful build
  • Active instance count
  • Failure count
  • Interruption count
  • Index size growth over time
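The alert conditions above can be combined into a single evaluation per index. This is a minimal sketch assuming a snapshot dict with hypothetical field names (state, last_successful_build, build_started, healthy_standby_count) - they are not a documented schema:

```python
from datetime import datetime, timedelta, timezone

def evaluate_index_alerts(status: dict, now: datetime,
                          max_build_age: timedelta,
                          max_running: timedelta) -> list:
    """Return alert messages for one index based on a status snapshot.

    'status' is a hypothetical dict; the key names are assumptions
    made for this sketch.
    """
    alerts = []
    last_ok = status.get("last_successful_build")
    if last_ok is None or now - last_ok > max_build_age:
        alerts.append("no successful build within expected window")
    if status.get("state") == "Running":
        started = status.get("build_started")
        if started and now - started > max_running:
            alerts.append("instance stuck in Running beyond expected duration")
    if status.get("state") in ("Failed", "Interrupted"):
        alerts.append(f"instance in {status['state']} state")
    if status.get("healthy_standby_count", 0) == 0:
        alerts.append("no healthy standby instance")
    return alerts
```

Running this on a schedule for each business-critical index gives you the "no successful build", "stuck in Running", and "missing standby" alerts in one place.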

Signs that your indexing strategy needs improvement

Consider revisiting design or scheduling if you see any of these patterns:

  • Builds regularly overlap with peak traffic
  • Indexes are large enough that rebuild windows are difficult to complete
  • One instance stays stale for a long time while others keep rebuilding
  • Operators cannot tell whether the latest build was successful
  • Troubleshooting depends on guessing rather than clear state and logs

Possible improvements include:

  • Adding or using multiple index instances
  • Moving rebuilds to quieter periods
  • Reducing rebuild frequency where freshness requirements allow it
  • Using better monitoring around duration, failures, and interruption events
  • Reviewing whether ActivePassive or LastUpdated better fits the solution

Common failure modes

Index builds can fail in several ways:

  • The application pool or process restarts during a build
  • The server runs out of disk space
  • The build process throws an exception
  • The source data is temporarily unavailable
  • A long-running build is interrupted by deployment or infrastructure maintenance
  • A machine is recycled or restarted before the build reaches completion
  • Diagnostics or state files become unreadable

The most important operational distinction is this:

  • Failed means the build ended with an explicit failure
  • Interrupted means the build did not reach a clean end, typically because the process stopped unexpectedly

These conditions should not be treated as normal completion.

What happens when a build fails or is interrupted

The exact behavior depends on configuration and version, but operationally you should expect the following principles:

  • The last good instance should remain the safe instance for queries
  • An incomplete instance should not be treated as healthy
  • The next recovery attempt should target the broken or unfinished instance first
  • After restart, stale running builds should be detected and moved into a recoverable state

Important

If one instance is incomplete, do not continue rotating rebuilds across the other instances as if nothing happened. Repair the incomplete instance first so you keep a clear good instance and a clear repair target.

Debugging a bad index state

When search results look stale or a build appears stuck, start with these questions:

  1. Is there at least one healthy instance available?
  2. Which instance was last built successfully?
  3. Which instance is currently serving traffic?
  4. Is another instance in Running, Failed, or Interrupted state?
  5. Was there a recent restart, deployment, or infrastructure event?

Then inspect:

  • Index status in the UI
  • Diagnostics logs and tracker history
  • The durable build-state file, if available
  • Recent application restarts
  • Disk space and machine health

Safe recovery approach

When an index is in a bad state, use a cautious recovery sequence:

  1. Confirm that one healthy instance is still available for queries.
  2. Identify the broken or incomplete instance.
  3. Review the latest failure or interruption details.
  4. Check for environmental causes such as disk space, restarts, or data-source issues.
  5. Rebuild the affected instance first.
  6. Verify successful completion before rotating or rebuilding other instances.

If the solution was restarted during a build, a good operational model is:

  • Detect stale running state
  • Mark the build as interrupted
  • Restart the repair on that same instance
  • Keep the healthy instance serving traffic until repair is complete
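The first two steps of that model hinge on the heartbeat: a build that was genuinely running keeps its heartbeat fresh, while a build killed by a restart does not. A minimal sketch, assuming a state dict with illustrative field names and a 10-minute staleness threshold:

```python
from datetime import datetime, timedelta, timezone

STALE_HEARTBEAT = timedelta(minutes=10)  # assumed threshold for this sketch

def recover_after_restart(state: dict, now: datetime) -> dict:
    """If a build was 'Running' when the process stopped, its heartbeat
    stops advancing. After a restart, detect the stale heartbeat and
    mark the build Interrupted so it becomes an explicit repair target.

    Field names mirror the durable state file conceptually but are
    assumptions, not a documented schema.
    """
    if state.get("state") == "Running":
        heartbeat = state.get("last_heartbeat")
        if heartbeat is None or now - heartbeat > STALE_HEARTBEAT:
            return {**state,
                    "state": "Interrupted",
                    "error_summary": "Build did not reach a clean end "
                                     "(stale heartbeat after restart)"}
    return state
```

The healthy instance is never touched by this step; only the instance with the stale running build is rewritten, which keeps a clear serving instance and a clear repair target.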

Files and folders to know

The default production locations are:

  • Index data: /Files/System/Indexes
  • Diagnostics and tracker history: /Files/System/Diagnostics
  • Durable build state: /Files/System/Diagnostics/IndexBuildState/{repository}/{index}/{instance}/state.json

Tracker history for an instance is stored under a repository, index, and instance path below the diagnostics folder, typically with dated subfolders for each run.

Operationally:

  • The index folder contains the actual built index data used by the solution
  • Tracker history contains build history, progress, and failure details for previous runs
  • state.json contains the latest durable lifecycle state for an index instance

The durable state.json file typically contains information such as:

  • repository
  • index name
  • instance name
  • current lifecycle state
  • build name
  • operation id
  • start time
  • last heartbeat
  • finish time
  • last successful build time
  • resume cursor, if supported
  • server name
  • error summary and details

This file is intended to answer operational questions like:

  • Is this instance still building?
  • Did it fail explicitly or get interrupted?
  • When was it last known healthy?
  • Should this instance be repaired before another one is rebuilt?
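A small script can answer those questions directly from the durable state file. This sketch uses the default path layout described above; the key name "state" inside the file is an assumption about its schema:

```python
import json
from pathlib import Path

def load_instance_state(root: Path, repository: str,
                        index: str, instance: str) -> dict:
    """Read the durable state file for one index instance from the
    default location: /Files/System/Diagnostics/IndexBuildState/
    {repository}/{index}/{instance}/state.json."""
    path = (root / "Files" / "System" / "Diagnostics" / "IndexBuildState"
            / repository / index / instance / "state.json")
    with path.open(encoding="utf-8") as f:
        return json.load(f)

def needs_repair(state: dict) -> bool:
    """An instance in Failed or Interrupted state should be repaired
    before other instances are rebuilt. The key name 'state' is an
    assumption about the file's schema."""
    return state.get("state") in ("Failed", "Interrupted")
```

Checking needs_repair before triggering any rotation enforces the repair-first rule from earlier in this article.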

Can these files be deleted?

Yes, but not casually.

Deleting tracker history removes build history and log context. It does not remove the index data itself.

This can be acceptable when:

  • you want to clean up old history
  • you are sure no build is currently running
  • you do not need the old diagnostics for troubleshooting

This is risky when:

  • a build is in progress
  • you still need failure details
  • you expect the system to infer recent behavior from tracker history