Metrics reference

This page describes metrics for monitoring EraSearch's health and status. The content is intended for self-hosted EraSearch users. If you're looking for the fastest way to set up and use EraSearch, get started with EraSearch on EraCloud.

All EraSearch services include Prometheus-style metrics endpoints that can be used for scraping application metrics. The EraSearch Helm charts come with metric pod annotations that should be picked up by any pre-existing Kubernetes-based monitoring tools.

Note

Era Software recommends a scrape interval of 10 seconds so that the database metrics have high resolution. A short interval also keeps alerts accurate and lets you act quickly when problems occur.

Metrics to alert on

The following EraSearch metrics are critical to the health of the database. If any of these metrics has a sustained positive rate over time, you may be losing database writes and need to take action.

  • quarry_bulk_cache_failure_total - This metric indicates a failure writing to the Cache service/layer, signifying a communication failure between the API and Cache layers.

      • For troubleshooting, start with the Cache service logs. The failure could also be caused by upstream components: the Storage service, the object storage provider, or the network itself.

  • quarry_maxwell_bulk_upload_failure_total - This metric indicates a failure writing to object storage via the Storage service.

      • For troubleshooting, start with the Storage service logs.

  • quarry_bulk_payloads_precommit_failure_total - This metric indicates a failure attempting to "pre-commit" data to the Cache service.

      • For troubleshooting, start with the Cache service logs.

  • quarry_seqnum_failure_total - This metric indicates a failure retrieving sequence numbers from the Coordinator service.

      • For troubleshooting, start with the Coordinator service logs.

  • quarry_rootset_save_error_count - This metric indicates failures by the Cache service to back up "rootsets" to object storage.

      • For troubleshooting, start with the Cache service logs. The failure could also be caused by upstream components: the Storage service, the object storage provider, or the network itself.

  • quarry_compaction_io_failure_total and quarry_compaction_upload_failure_total - These metrics indicate failures by the Cache service to properly store the output of compactions.

      • For troubleshooting, start with the Cache service logs.

  • alexandria_redis_connection_failures_total (available in v1.22) - This metric indicates failures connecting to the Coordinator service's Redis backend.

      • For troubleshooting, start with the Coordinator service logs. The issue may also lie with Redis or the network.
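As a starting point for alerting, the "sustained positive rate" condition can be written as a PromQL expression. The sketch below shows one possible form for a single metric. The 5m window is an assumed value, not an Era Software recommendation, and the pod label is assumed to be present as it is in the queries elsewhere on this page.

```promql
# Returns one series per pod whose Cache-layer write failures
# increased during the last 5 minutes (assumed window).
sum by (pod) (rate(quarry_bulk_cache_failure_total[5m])) > 0
```

The same pattern can be applied to the other counters in this list. Pairing the expression with a hold period in your alerting tool helps separate a sustained rate from a single transient failure.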

Metrics to watch

Include the following metrics in any monitoring dashboards. These metrics provide a high-level view of the database status and health.

Note

Era Software has a sample dashboard that you can use as a starting point for monitoring EraSearch using Grafana.

Ingest

Use the metrics below to observe the health and throughput of the ingest/write path into the database.

CPU-bound task latency

Use the quarry_cpu_queue_duration_ns_sum and quarry_cpu_queue_duration_ns_count metrics (or alternatively the quarry_cpu_queue_duration_ns summary metric) to measure how long CPU-bound tasks are queued. To visualize per pod, use:

sum by (pod) (
  rate(quarry_cpu_queue_duration_ns_sum[$interval])
  /
  rate(quarry_cpu_queue_duration_ns_count[$interval])
)

An average sustained latency greater than 1s typically means that the pod does not have adequate CPU resources and requires administrative action.
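Because the metric is reported in nanoseconds, it can help to convert the result to seconds before comparing it with the 1s guideline. The following is a minimal sketch. It also aggregates the numerator and denominator separately so that the result remains a true mean if a pod exposes more than one series, which is an assumption about the label layout.

```promql
# Mean CPU queue time per pod, converted from nanoseconds to seconds.
sum by (pod) (rate(quarry_cpu_queue_duration_ns_sum[$interval]))
/
sum by (pod) (rate(quarry_cpu_queue_duration_ns_count[$interval]))
/
1e9
```

With this conversion, sustained values above 1 correspond to the 1s guideline.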

Disk-bound task latency

Use the quarry_blocking_queue_duration_ns_sum and quarry_blocking_queue_duration_ns_count metrics (or alternatively the quarry_blocking_queue_duration_ns summary metric) to measure how long disk-bound tasks are queued. To visualize per pod, use:

sum by (pod) (
  rate(quarry_blocking_queue_duration_ns_sum[$interval])
  /
  rate(quarry_blocking_queue_duration_ns_count[$interval])
)

An average sustained latency greater than 5ms typically means that the pod does not have adequate disk resources and requires administrative action.
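The 5ms guideline can also be expressed as a filter on the same ratio. Since the metric is in nanoseconds, 5ms corresponds to 5e6. The sketch below uses a fixed 5m window in place of the $interval variable, which may not be available outside a dashboard. The window length is an illustrative choice.

```promql
# Pods whose mean disk-bound queue time over 5 minutes exceeds 5 ms (5e6 ns).
(
  sum by (pod) (rate(quarry_blocking_queue_duration_ns_sum[5m]))
  /
  sum by (pod) (rate(quarry_blocking_queue_duration_ns_count[5m]))
) > 5e6
```

The expression returns no series while every pod is under the threshold, so it can serve as an alert condition or as a panel that only shows pods needing attention.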

Bytes indexed

The quarry_bulk_request_indexed_bytes metric provides the number of bytes indexed, broken out by EraSearch namespace (the ns label). This is particularly helpful for measuring write throughput across all indexes or for a particular index. To aggregate by index, use:

sum by (ns) (rate(quarry_bulk_request_indexed_bytes[$interval]) > 0)

To see bytes indexed per pod, use:

sum by (pod) (rate(quarry_bulk_request_indexed_bytes[$interval]) > 0)

Request duration

To measure the average ingest response time by pod, use:

sum by (pod) (
  rate(quarry_bulk_request_duration_ns_sum[$interval])
  /
  rate(quarry_bulk_request_duration_ns_count[$interval])
)

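A summary metric is also available for measuring quantiles. Because each quantile series is already a computed value rather than a counter, query it directly instead of wrapping it in rate(). For example, to view the highest 99th-percentile request duration across all series:

max(quarry_bulk_request_duration_ns{quantile="0.99"})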

Request count

The quarry_bulk_request_total metric provides the total number of bulk write requests received by the API service. To measure the request rate by pod, use:

sum by (pod) (rate(quarry_bulk_request_total[$interval]))

Documents indexed

The quarry_bulk_docs_indexed_total metric provides the total number of documents indexed by the API service. To measure the index rate by pod, use:

sum by (pod) (rate(quarry_bulk_docs_indexed_total[$interval]))

Bulk request size

To measure the average bulk request size, use:

sum by (pod) (
  rate(quarry_bulk_request_bytes[$interval]) 
  / 
  rate(quarry_bulk_request_total[$interval])
)

Object storage

Object storage is a core component of the EraSearch architecture. The metrics below provide insight into how the Storage service interacts with the configured object storage provider.

Bytes written

The maxwell_object_write_bytes metric provides the number of bytes written to object storage per Storage service pod. To measure the write rate in bytes by pod, use:

sum by (pod) (rate(maxwell_object_write_bytes[$interval]))

Bytes read

The maxwell_object_read_bytes metric provides the number of bytes read from object storage per Storage service pod. To measure the read rate in bytes by pod, use:

sum by (pod) (rate(maxwell_object_read_bytes[$interval]))

Total writes

The maxwell_object_writes_total metric provides the total number of write requests issued to the object storage provider per Storage service pod. To measure the write request rate by pod, use:

sum by (pod) (rate(maxwell_object_writes_total[$interval]))
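Dividing the byte rate by the request rate gives an approximate mean object size per write, in the same way the bulk request size is derived earlier on this page. This is a sketch that assumes both metrics carry matching pod labels.

```promql
# Approximate mean bytes per object storage write, per pod.
sum by (pod) (rate(maxwell_object_write_bytes[$interval]))
/
sum by (pod) (rate(maxwell_object_writes_total[$interval]))
```

A pod with no writes during the interval yields no value (a division by zero produces NaN), which typically appears as a gap in the graph.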

Total reads

The maxwell_object_reads_total metric provides the total number of read requests issued to the object storage provider per Storage service pod. To measure the read request rate by pod, use:

sum by (pod) (rate(maxwell_object_reads_total[$interval]))

Average write time

The maxwell_azure_blob_upload_duration_ns_sum / maxwell_azure_blob_upload_duration_ns_count or maxwell_s3_upload_duration_ns_sum / maxwell_s3_upload_duration_ns_count calculation provides the mean time taken for each write call to the respective object storage provider.

For Azure:

sum by (pod) (
  rate(maxwell_azure_blob_upload_duration_ns_sum[$interval])
  / 
  rate(maxwell_azure_blob_upload_duration_ns_count[$interval])
)

For AWS S3:

sum by (pod) (
  rate(maxwell_s3_upload_duration_ns_sum[$interval])
  / 
  rate(maxwell_s3_upload_duration_ns_count[$interval])
)

Average read time

The maxwell_s3_download_duration_ns_sum / maxwell_s3_download_duration_ns_count or maxwell_azure_blob_download_duration_ns_sum / maxwell_azure_blob_download_duration_ns_count calculation provides the mean time taken for each read call to the respective object storage provider.

For Azure:

sum by (pod) (
  rate(maxwell_azure_blob_download_duration_ns_sum[$interval])
  /
  rate(maxwell_azure_blob_download_duration_ns_count[$interval])
)

For AWS S3:

sum by (pod) (
  rate(maxwell_s3_download_duration_ns_sum[$interval])
  /
  rate(maxwell_s3_download_duration_ns_count[$interval])
)

Compactions

The Cache service periodically compacts the data on disk to optimize performance. Use the metrics below to understand when compactions occur, whether they were successful, and how long they typically take.

Average duration

The quarry_compaction_duration_ns_sum / quarry_compaction_duration_ns_count calculation provides the mean time taken per compaction. It can also be broken out per compaction level. To view the average compaction duration by level and pod, use:

sum by (pod, level) (
  rate(quarry_compaction_duration_ns_sum[$interval]) 
  / 
  rate(quarry_compaction_duration_ns_count[$interval])
)

Failures

The quarry_compaction_io_failure_total metric provides the number of IO failures that occurred when attempting to store a newly-compacted root.

sum by (pod) (rate(quarry_compaction_io_failure_total[$interval]))

The quarry_compaction_upload_failure_total metric provides the number of upload failures that occurred when attempting to upload a newly-compacted root.

sum by (pod) (rate(quarry_compaction_upload_failure_total[$interval]))

Adding the two together can provide an overall error rate for compactions.
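One way to combine the two failure rates is shown below. This is a sketch rather than an official query. It assumes both metrics are exposed with a pod label on every Cache service pod.

```promql
# Combined compaction failure rate per pod (I/O failures plus upload failures).
sum by (pod) (rate(quarry_compaction_io_failure_total[$interval]))
+
sum by (pod) (rate(quarry_compaction_upload_failure_total[$interval]))
```

PromQL only returns a result for pods that appear on both sides of the +, so a pod that reports only one of the two metrics is omitted from the output.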

Eviction

The Cache service periodically evicts (or removes) data from the local hot cache to prevent disk utilization from climbing to an unhealthy level. The metrics below can be used to understand when eviction occurs and how long it takes.

Files evicted

The quarry_eviction_file_total_count metric provides the number of files evicted from any given Cache service pod.

sum by (pod) (rate(quarry_eviction_file_total_count[$interval]))

Estimated bytes to evict

The quarry_eviction_estimate_in_bytes metric provides the estimated size in bytes that can be evicted from any given Cache service pod.

sum by (pod) (rate(quarry_eviction_estimate_in_bytes[$interval]))

Time taken

The quarry_eviction_time_total_ns metric provides the amount of time taken to perform a cache eviction.

sum by (pod) (rate(quarry_eviction_time_total_ns[$interval]))

High and low watermarks

The quarry_eviction_low_watermark_in_bytes metric exposes the configured "low watermark" setting in bytes. The Cache service uses this threshold to decide when to stop evicting data from the local hot cache.

max(quarry_eviction_low_watermark_in_bytes)

The quarry_eviction_high_watermark_in_bytes metric exposes the configured "high watermark" setting in bytes. The Cache service uses this threshold to decide when to start evicting data from the local hot cache.

max(quarry_eviction_high_watermark_in_bytes)
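To see how far apart the two thresholds are, subtract one from the other. The sketch below assumes that both metrics are exposed and that max() is an acceptable way to collapse each to a single value, as in the queries above.

```promql
# Bytes between the high and low watermarks.
max(quarry_eviction_high_watermark_in_bytes)
-
max(quarry_eviction_low_watermark_in_bytes)
```

Since eviction starts at the high watermark and stops at the low watermark, this difference gives a rough sense of how much data a single eviction pass may remove.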

Queries

Measuring query performance is critical in understanding the health of the database. The metrics below provide insight into how many queries are being run, how long they take, and how many results they are returning.

Query count

The quarry_search_query_count metric provides the number of queries issued to the database. This is particularly helpful in measuring total system read throughput.

sum by (endpoint, pod) ( rate(quarry_search_query_count[$interval]) )

Document count

The quarry_search_result_doc_count metric provides the number of documents returned from reads to the system.

sum by (endpoint, pod) ( rate(quarry_search_result_doc_count[$interval]) )

Search duration

The quarry_search_duration_ns_sum / quarry_search_duration_ns_count calculation provides the mean query duration. It can be broken out per endpoint.

sum by (pod, endpoint) (
  rate(quarry_search_duration_ns_sum[$interval]) 
  / 
  rate(quarry_search_duration_ns_count[$interval])
)

Blocking time

The quarry_search_blocking_task_duration_ns metric provides the amount of time spent in blocking tasks while serving a read request.

sum by (pod, endpoint) (rate(quarry_search_blocking_task_duration_ns[$interval]))

Cache hit ratio

The quarry_aggregation_cache_hit_total / (quarry_aggregation_cache_miss_total + quarry_aggregation_cache_hit_total) calculation provides the query cache hit ratio, when query aggregation caching is enabled.

sum by (pod) (
    rate(quarry_aggregation_cache_hit_total[$interval])
    / 
    (
        rate(quarry_aggregation_cache_miss_total[$interval])
        + 
        rate(quarry_aggregation_cache_hit_total[$interval])
    )
)

Rehydration

Rehydration is the process of automatically reading data from object storage when queried. The metrics below provide insight into when rehydration occurs, and how long it takes.

Average rehydration time

The quarry_ensure_roots_duration_ns_sum / quarry_ensure_roots_duration_ns_count calculation provides the mean time taken to rehydrate roots from object storage.

sum by (pod) (
  rate(quarry_ensure_roots_duration_ns_sum[$interval])
  /
  rate(quarry_ensure_roots_duration_ns_count[$interval])
)

Number of roots rehydrated

The quarry_ensure_roots_download_count metric provides the number of roots rehydrated from object storage.

sum by (pod) (rate(quarry_ensure_roots_download_count[$interval]))

Last update: September 27, 2022