Operating indexes in production

Indexes are usually critical to a commerce solution: storefronts, product catalogs, search experiences, filtering, user lookups, and integrations all depend on them. When indexes are healthy, search is fast and predictable. When they are stale, incomplete, or corrupted, users can see outdated results, missing content, poor relevance, or failures in dependent features.

In production, the goal is usually not just to build indexes quickly. The goal is to keep at least one good index available at all times while updates are applied safely in the background. This article is for the person responsible for keeping a site fast, searchable, and up to date in a production environment.

It complements the other parts of the implementer documentation by focusing on operations:

  • Which balancing strategy to use
  • When and how often to rebuild indexes
  • What load index builds place on a solution
  • How to recognize when something has gone wrong
  • How to recover safely without taking a good index instance offline

Balancing strategies

An index can have more than one instance. This allows one instance to serve queries while another instance is being rebuilt. How exactly this happens is controlled by the index balancing strategy - there are two possibilities:

  • ActivePassive: One instance is considered online, and another instance is rebuilt in the background. This is often the easiest strategy to reason about operationally because it gives a clear "current serving instance" and a clear standby instance.
  • LastUpdated: The newest healthy instance becomes the preferred instance. This can be useful when you want the most recently completed build to become active automatically.

The balancing strategy is selected when you create an index - afterwards, it can be changed via the context menu for the index in the settings tree.
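The difference between the two strategies can be sketched as follows. This is an illustrative model only - the field names (healthy, online, last_built) and the function are assumptions for the sketch, not a documented API:

```python
from datetime import datetime
from typing import Optional

def preferred_instance(strategy: str, instances: list) -> Optional[dict]:
    """Pick the instance that should serve queries.

    'instances' is a list of hypothetical snapshot dicts; the keys
    'healthy', 'online', and 'last_built' are assumptions made for
    this sketch, not part of any documented schema.
    """
    healthy = [i for i in instances if i.get("healthy")]
    if not healthy:
        return None
    if strategy == "ActivePassive":
        # Keep serving from the instance already marked online;
        # the other instance is the standby being rebuilt.
        online = [i for i in healthy if i.get("online")]
        return online[0] if online else healthy[0]
    if strategy == "LastUpdated":
        # The most recently completed healthy build wins automatically.
        return max(healthy, key=lambda i: i["last_built"])
    raise ValueError(f"unknown strategy: {strategy}")
```

With ActivePassive the serving instance only changes when you switch it; with LastUpdated a newly finished build can take over on its own.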

Rebuild frequencies

How often you rebuild an index depends on how quickly data changes and how expensive a rebuild is.

Small indexes are typically fast to rebuild and have limited effect on CPU, disk, and memory.

These are often suitable for:

  • frequent scheduled rebuilds
  • rebuilds after content changes
  • daytime rebuilds if traffic is modest

Examples:

  • small content indexes
  • limited product assortments
  • user indexes with moderate volume

The important thing is to choose a rebuild cadence that matches the freshness requirements for the index. Let the business requirements guide the choice:

  • Search must reflect changes almost immediately - rebuild often or trigger targeted updates where possible
  • Search can lag by minutes - rebuild on a short schedule
  • Search can lag by hours - rebuild during maintenance windows
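The decision rule above can be expressed as a small lookup. The intervals here are illustrative defaults chosen for the sketch, not product recommendations:

```python
from enum import Enum

class Freshness(Enum):
    IMMEDIATE = "immediate"  # search must reflect changes almost immediately
    MINUTES = "minutes"      # search can lag by minutes
    HOURS = "hours"          # search can lag by hours

def rebuild_interval_minutes(freshness: Freshness) -> int:
    """Map a freshness requirement to a rebuild interval in minutes.

    The concrete numbers are assumptions for illustration; tune them
    to the cost of a rebuild and the traffic profile of the solution.
    """
    return {
        Freshness.IMMEDIATE: 5,    # rebuild often; prefer targeted updates if available
        Freshness.MINUTES: 15,     # short schedule
        Freshness.HOURS: 24 * 60,  # once a day, during a maintenance window
    }[freshness]
```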

Understand the operational impact of a rebuild

The right rebuild schedule is usually a balance between freshness and stability. Rebuilding too often can create unnecessary pressure on the solution without improving the user experience in a meaningful way.

An index rebuild can affect:

  • application CPU usage
  • database load
  • disk throughput
  • memory pressure
  • background task concurrency
  • query latency if the same machine is already under load

Watch for these symptoms:

  • slow front-end responses during build windows
  • increased SQL response time
  • longer-than-normal build duration
  • multiple heavy background jobs competing for resources
  • queues of scheduled tasks falling behind

If builds are expensive, avoid stacking them with:

  • imports
  • synchronization jobs
  • feed generation
  • batch updates
  • deployment operations

When to rebuild manually

You can rebuild an index automatically on a schedule or manually - in a production scenario, manual rebuilds are only appropriate when:

  • an index schema changed
  • relevance behavior changed significantly
  • index data was deleted
  • the UI shows an interrupted or failed instance that has not recovered automatically
  • the build state is inconsistent after maintenance or infrastructure incidents

You should avoid manual rebuilds when a build is already running, the underlying problem is still present, or you are about to restart the environment again.

A healthy production setup

A healthy setup should aim for all of the following:

  • at least one good index instance is always available for queries
  • unfinished or interrupted builds are visible
  • a failed or incomplete instance is repaired before other standby instances are rebuilt
  • a solution restart does not leave an index permanently stuck in an ambiguous state
  • operators can tell the difference between the last attempted build and the last successful build

To help you achieve this, you may find the following operations checklist useful.

Daily operations checklist

Check these regularly on busy solutions:

  1. When was the last successful build for each critical index?
  2. Is any index instance stuck in Starting or Running for too long?
  3. Is there an Interrupted or Failed instance waiting for repair?
  4. Are builds finishing within the expected time window?
  5. Has index size or duration changed significantly after imports or schema changes?
  6. Does one healthy instance remain available while another is rebuilding?

Operational dashboards and alerts

Good operational dashboards usually include:

  • Current lifecycle state
  • Current online instance
  • Last attempted build
  • Last successful build
  • Last heartbeat
  • Build duration trend
  • Failure reason or interruption reason

For business-critical indexes, you should also monitor and alert on:

  • No successful build within expected time window
  • An instance stuck in Running beyond expected duration
  • Any Failed or Interrupted instance
  • Missing healthy standby instance
  • Repeated long build durations

Useful metrics include:

  • Build duration
  • Time since last successful build
  • Active instance count
  • Failure count
  • Interruption count
  • Index size growth over time
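The alert conditions above can be combined into a single evaluation per index. This is a minimal sketch assuming a snapshot dict with hypothetical field names (state, last_successful_build, build_started, healthy_standby_count) - they are not a documented schema:

```python
from datetime import datetime, timedelta, timezone

def evaluate_index_alerts(status: dict, now: datetime,
                          max_build_age: timedelta,
                          max_running: timedelta) -> list:
    """Return alert messages for one index based on a status snapshot.

    'status' is a hypothetical dict; the key names are assumptions
    made for this sketch.
    """
    alerts = []
    last_ok = status.get("last_successful_build")
    if last_ok is None or now - last_ok > max_build_age:
        alerts.append("no successful build within expected window")
    if status.get("state") == "Running":
        started = status.get("build_started")
        if started and now - started > max_running:
            alerts.append("instance stuck in Running beyond expected duration")
    if status.get("state") in ("Failed", "Interrupted"):
        alerts.append(f"instance in {status['state']} state")
    if status.get("healthy_standby_count", 0) == 0:
        alerts.append("no healthy standby instance")
    return alerts
```

Running this on a schedule for each business-critical index gives you the "no successful build", "stuck in Running", and "missing standby" alerts in one place.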

Signs that your indexing strategy needs improvement

Consider revisiting design or scheduling if you see any of these patterns:

  • Builds regularly overlap with peak traffic
  • Indexes are large enough that rebuild windows are difficult to complete
  • One instance stays stale for a long time while others keep rebuilding
  • Operators cannot tell whether the latest build was successful
  • Troubleshooting depends on guessing rather than clear state and logs

Possible improvements include:

  • Adding or using multiple index instances
  • Moving rebuilds to quieter periods
  • Reducing rebuild frequency where freshness requirements allow it
  • Using better monitoring around duration, failures, and interruption events
  • Reviewing whether ActivePassive or LastUpdated better fits the solution

Common failure modes

Index builds can fail in several ways:

  • The application pool or process restarts during a build
  • The server runs out of disk space
  • The build process throws an exception
  • The source data is temporarily unavailable
  • A long-running build is interrupted by deployment or infrastructure maintenance
  • A machine is recycled or restarted before the build reaches completion
  • Diagnostics or state files become unreadable

The most important operational distinction is this:

  • Failed means the build ended with an explicit failure
  • Interrupted means the build did not reach a clean end, typically because the process stopped unexpectedly

These conditions should not be treated as normal completion.

What happens when a build fails or is interrupted

The exact behavior depends on configuration and version, but operationally you should expect the following principles:

  • The last good instance should remain the safe instance for queries
  • An incomplete instance should not be treated as healthy
  • The next recovery attempt should target the broken or unfinished instance first
  • After restart, stale running builds should be detected and moved into a recoverable state

Important

If one instance is incomplete, do not continue rotating rebuilds across the other instances as if nothing happened. Repair the incomplete instance first so you keep a clear good instance and a clear repair target.

Debugging a bad index state

When search results look stale or a build appears stuck, start with these questions:

  1. Is there at least one healthy instance available?
  2. Which instance was last built successfully?
  3. Which instance is currently serving traffic?
  4. Is another instance in Running, Failed, or Interrupted state?
  5. Was there a recent restart, deployment, or infrastructure event?

Then inspect:

  • Index status in the UI
  • Diagnostics logs and tracker history
  • The durable build-state file, if available
  • Recent application restarts
  • Disk space and machine health

Safe recovery approach

When an index is in a bad state, use a cautious recovery sequence:

  1. Confirm that one healthy instance is still available for queries.
  2. Identify the broken or incomplete instance.
  3. Review the latest failure or interruption details.
  4. Check for environmental causes such as disk space, restarts, or data-source issues.
  5. Rebuild the affected instance first.
  6. Verify successful completion before rotating or rebuilding other instances.

If the solution was restarted during a build, a good operational model is:

  • Detect stale running state
  • Mark the build as interrupted
  • Restart the repair on that same instance
  • Keep the healthy instance serving traffic until repair is complete
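The first two steps of that model hinge on the heartbeat: a build that was genuinely running keeps its heartbeat fresh, while a build killed by a restart does not. A minimal sketch, assuming a state dict with illustrative field names and a 10-minute staleness threshold:

```python
from datetime import datetime, timedelta, timezone

STALE_HEARTBEAT = timedelta(minutes=10)  # assumed threshold for this sketch

def recover_after_restart(state: dict, now: datetime) -> dict:
    """If a build was 'Running' when the process stopped, its heartbeat
    stops advancing. After a restart, detect the stale heartbeat and
    mark the build Interrupted so it becomes an explicit repair target.

    Field names mirror the durable state file conceptually but are
    assumptions, not a documented schema.
    """
    if state.get("state") == "Running":
        heartbeat = state.get("last_heartbeat")
        if heartbeat is None or now - heartbeat > STALE_HEARTBEAT:
            return {**state,
                    "state": "Interrupted",
                    "error_summary": "Build did not reach a clean end "
                                     "(stale heartbeat after restart)"}
    return state
```

The healthy instance is never touched by this step; only the instance with the stale running build is rewritten, which keeps a clear serving instance and a clear repair target.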

Files and folders to know

The default production locations are:

  • Index data: /Files/System/Indexes
  • Diagnostics and tracker history: /Files/System/Diagnostics
  • Durable build state: /Files/System/Diagnostics/IndexBuildState/{repository}/{index}/{instance}/state.json

Tracker history for an instance is stored under a repository, index, and instance path below the diagnostics folder, typically with dated subfolders for each run.

Operationally:

  • The index folder contains the actual built index data used by the solution
  • Tracker history contains build history, progress, and failure details for previous runs
  • state.json contains the latest durable lifecycle state for an index instance

The durable state.json file typically contains information such as:

  • repository
  • index name
  • instance name
  • current lifecycle state
  • build name
  • operation id
  • start time
  • last heartbeat
  • finish time
  • last successful build time
  • resume cursor, if supported
  • server name
  • error summary and details

This file is intended to answer operational questions like:

  • Is this instance still building?
  • Did it fail explicitly or get interrupted?
  • When was it last known healthy?
  • Should this instance be repaired before another one is rebuilt?
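A small script can answer those questions directly from the durable state file. This sketch uses the default path layout described above; the key name "state" inside the file is an assumption about its schema:

```python
import json
from pathlib import Path

def load_instance_state(root: Path, repository: str,
                        index: str, instance: str) -> dict:
    """Read the durable state file for one index instance from the
    default location: /Files/System/Diagnostics/IndexBuildState/
    {repository}/{index}/{instance}/state.json."""
    path = (root / "Files" / "System" / "Diagnostics" / "IndexBuildState"
            / repository / index / instance / "state.json")
    with path.open(encoding="utf-8") as f:
        return json.load(f)

def needs_repair(state: dict) -> bool:
    """An instance in Failed or Interrupted state should be repaired
    before other instances are rebuilt. The key name 'state' is an
    assumption about the file's schema."""
    return state.get("state") in ("Failed", "Interrupted")
```

Checking needs_repair before triggering any rotation enforces the repair-first rule from earlier in this article.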

Can these files be deleted?

Yes, but not casually.

Deleting tracker history removes build history and log context. It does not remove the index data itself.

This can be acceptable when:

  • you want to clean up old history
  • you are sure no build is currently running
  • you do not need the old diagnostics for troubleshooting

This is risky when:

  • a build is in progress
  • you still need failure details
  • you expect the system to infer recent behavior from tracker history