Metrics reference
Acquisition notice
In October 2022, ServiceNow acquired Era Software. The documentation on this site is no longer maintained and is intended for existing Era Software users only.
To get the latest information about ServiceNow's observability solutions, visit their website and documentation.
This page describes metrics for monitoring EraSearch's health and status. The content is intended for self-hosted EraSearch users. If you're looking for the fastest way to set up and use EraSearch, get started with EraSearch on EraCloud.
All EraSearch services expose Prometheus-style endpoints for scraping application metrics. The EraSearch Helm charts include metric pod annotations that should be picked up by any pre-existing Kubernetes-based monitoring tools.
Note
Era Software recommends using a scrape interval of 10 seconds to ensure high-fidelity resolution for the database metrics. This also keeps alerts accurate so that action can be taken swiftly when problems occur.
Metrics to alert on¶
The following EraSearch metrics are critical to the health of the database. If any of these metrics have a sustained positive rate over time, you may be losing database writes and need to take action; an example alert expression follows the list below.
- quarry_bulk_cache_failure_total
  - This metric indicates a failure writing to the Cache service/layer, signifying a communication failure between the API and Cache layers.
  - For troubleshooting, the Cache service logs will have more information; however, the failure could also be caused by upstream components: the Storage service, the object storage provider, or the network itself.
- quarry_maxwell_bulk_upload_failure_total
  - This metric indicates a failure writing to object storage via the Storage service.
  - For troubleshooting, the Storage service logs will have more information.
- quarry_bulk_payloads_precommit_failure_total
  - This metric indicates a failure attempting to "pre-commit" data to the Cache service.
  - For troubleshooting, the Cache service logs will have more information.
- quarry_seqnum_failure_total
  - This metric indicates a failure retrieving sequence numbers from the Coordinator service.
  - For troubleshooting, the Coordinator service logs will have more information.
- quarry_rootset_save_error_count
  - This metric indicates failures by the Cache service backing up "rootsets" to object storage.
  - For troubleshooting, the Cache service logs will have more information; however, the failure could also be caused by upstream components: the Storage service, the object storage provider, or the network itself.
- quarry_compaction_io_failure_total and quarry_compaction_upload_failure_total
  - These metrics indicate failures by the Cache service to properly store the output of compactions.
  - For troubleshooting, the Cache service logs will have more information.
- alexandria_redis_connection_failures_total (available in v1.22)
  - This metric indicates failures connecting to the Coordinator service's Redis backend.
  - For troubleshooting, the Coordinator service logs are a good first step; however, the issue may lie with Redis or within the network.
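As a starting point, an alert expression along these lines should work for any of the metrics above (a sketch; the 5m window is illustrative and should be tuned to your scrape interval):

# Sketch: fire when bulk writes to the Cache layer show a sustained failure rate.
# The 5m range is an assumed example, not a tested recommendation.
sum by (pod) (
rate(quarry_bulk_cache_failure_total[5m])
) > 0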
Metrics to watch¶
Include the following metrics in any monitoring dashboards. These metrics provide a high-level view of the database status and health.
Note
Era Software has a sample dashboard that you can use as a starting point for monitoring EraSearch using Grafana.
Ingest¶
Use the metrics below to observe the health and throughput of the ingest/write path into the database.
CPU-bound task latency¶
Use the quarry_cpu_queue_duration_ns_sum and quarry_cpu_queue_duration_ns_count metrics (or alternatively the quarry_cpu_queue_duration_ns summary metric) to measure how long CPU-bound tasks are queued. To visualize per pod, use:
sum by (pod) (
rate(quarry_cpu_queue_duration_ns_sum[$interval])
/
rate(quarry_cpu_queue_duration_ns_count[$interval])
)
An average sustained latency greater than 1s typically means that the pod does not have adequate CPU resources and requires administrative action.
Disk-bound task latency¶
Use the quarry_blocking_queue_duration_ns_sum and quarry_blocking_queue_duration_ns_count metrics (or alternatively the quarry_blocking_queue_duration_ns summary metric) to measure how long disk-bound tasks are queued. To visualize per pod, use:
sum by (pod) (
rate(quarry_blocking_queue_duration_ns_sum[$interval])
/
rate(quarry_blocking_queue_duration_ns_count[$interval])
)
An average sustained latency greater than 5ms typically means that the pod does not have adequate disk resources and requires administrative action.
Bytes indexed¶
The quarry_bulk_request_indexed_bytes metric provides the number of bytes indexed by EraSearch namespace. This is particularly helpful in measuring total system write throughput across all indexes or for a particular index. To aggregate by index, use:
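# Sketch: per-index indexed-byte throughput. The index label name is an
# assumption; substitute the label your deployment exposes (for example,
# a namespace label).
sum by (index) (
rate(quarry_bulk_request_indexed_bytes[$interval])
)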
To see bytes indexed per pod, use:
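# Sketch: per-pod indexed-byte throughput, following the per-pod rate
# pattern used elsewhere on this page.
sum by (pod) (
rate(quarry_bulk_request_indexed_bytes[$interval])
)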
Request duration¶
To measure the average ingest response time by pod, use:
sum by (pod) (
rate(quarry_bulk_request_duration_ns_sum[$interval])
/
rate(quarry_bulk_request_duration_ns_count[$interval])
)
Note that there is also a summary metric, quarry_bulk_request_duration_ns, that can be used to measure quantiles, for example to view the maximum 99th-percentile request duration:
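# Sketch: per-pod 99th-percentile request duration, assuming the summary
# exports a quantile label with a 0.99 quantile.
max by (pod) (
quarry_bulk_request_duration_ns{quantile="0.99"}
)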
Request count¶
The quarry_bulk_request_total metric provides the total number of bulk write requests received by the API service. To measure the request rate by pod, use:
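# Sketch: per-pod bulk write request rate.
sum by (pod) (
rate(quarry_bulk_request_total[$interval])
)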
Documents indexed¶
The quarry_bulk_docs_indexed_total metric provides the total number of documents indexed by the API service. To measure the index rate by pod, use:
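# Sketch: per-pod document index rate.
sum by (pod) (
rate(quarry_bulk_docs_indexed_total[$interval])
)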
Bulk request size¶
To measure the average bulk request size, use:
sum by (pod) (
rate(quarry_bulk_request_bytes[$interval])
/
rate(quarry_bulk_request_total[$interval])
)
Object storage¶
Object storage is a core component of the EraSearch architecture. The metrics below provide insight into how the Storage service interacts with the configured object storage provider.
Bytes written¶
The maxwell_object_write_bytes metric provides the number of bytes written to object storage per Storage service pod. To measure the bytes written to object storage by pod, use:
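# Sketch: per-pod bytes written to object storage per second.
sum by (pod) (
rate(maxwell_object_write_bytes[$interval])
)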
Bytes read¶
The maxwell_object_read_bytes metric provides the number of bytes read from object storage per Storage service pod. To measure the bytes read from object storage by pod, use:
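# Sketch: per-pod bytes read from object storage per second.
sum by (pod) (
rate(maxwell_object_read_bytes[$interval])
)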
Total writes¶
The maxwell_object_writes_total metric provides the total number of write requests issued to the object storage provider per Storage service pod. To measure the total writes to object storage by pod, use:
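# Sketch: per-pod object storage write request rate.
sum by (pod) (
rate(maxwell_object_writes_total[$interval])
)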
Total reads¶
The maxwell_object_reads_total metric provides the total number of read requests issued to the object storage provider per Storage service pod. To measure the total reads from object storage by pod, use:
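# Sketch: per-pod object storage read request rate.
sum by (pod) (
rate(maxwell_object_reads_total[$interval])
)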
Average write time¶
The maxwell_azure_blob_upload_duration_ns_sum / maxwell_azure_blob_upload_duration_ns_count or maxwell_s3_upload_duration_ns_sum / maxwell_s3_upload_duration_ns_count calculation provides the mean time taken for each write call to the respective object storage provider.
For Azure:
sum by (pod) (
rate(maxwell_azure_blob_upload_duration_ns_sum[$interval])
/
rate(maxwell_azure_blob_upload_duration_ns_count[$interval])
)
For AWS S3:
sum by (pod) (
rate(maxwell_s3_upload_duration_ns_sum[$interval])
/
rate(maxwell_s3_upload_duration_ns_count[$interval])
)
Average read time¶
The maxwell_s3_download_duration_ns_sum / maxwell_s3_download_duration_ns_count or maxwell_azure_blob_download_duration_ns_sum / maxwell_azure_blob_download_duration_ns_count calculation provides the mean time taken for each read call to the respective object storage provider.
For Azure:
sum by (pod) (
rate(maxwell_azure_blob_download_duration_ns_sum[$interval])
/
rate(maxwell_azure_blob_download_duration_ns_count[$interval])
)
For AWS S3:
sum by (pod) (
rate(maxwell_s3_download_duration_ns_sum[$interval])
/
rate(maxwell_s3_download_duration_ns_count[$interval])
)
Compactions¶
The Cache service periodically compacts the data on disk to optimize performance. Use the metrics below to understand when compactions occur, whether they were successful, and how long they typically take.
Average duration¶
The quarry_compaction_duration_ns_sum / quarry_compaction_duration_ns_count calculation provides the mean time taken per compaction, and it can be broken out per compaction level. To view the average compaction duration by level and pod, use:
sum by (pod, level) (
rate(quarry_compaction_duration_ns_sum[$interval])
/
rate(quarry_compaction_duration_ns_count[$interval])
)
Failures¶
The quarry_compaction_io_failure_total metric provides the number of IO failures that occurred when attempting to store a newly compacted root.

The quarry_compaction_upload_failure_total metric provides the number of upload failures that occurred when attempting to upload a newly compacted root.

Adding the two together provides an overall error rate for compactions, for example:
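# Sketch: overall per-pod compaction failure rate, combining both counters.
sum by (pod) (
rate(quarry_compaction_io_failure_total[$interval])
+
rate(quarry_compaction_upload_failure_total[$interval])
)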
Eviction¶
The Cache service periodically evicts (or removes) data from the local hot cache to prevent disk utilization from climbing to an unhealthy level. The metrics below can be used to understand when eviction occurs and how long it took.
Files evicted¶
The quarry_eviction_file_total_count metric provides the number of files evicted from any given Cache service pod.
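To chart the per-pod eviction rate, a query along these lines should work (a sketch, assuming the metric is a monotonic counter):

sum by (pod) (
rate(quarry_eviction_file_total_count[$interval])
)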
Estimated bytes to evict¶
The quarry_eviction_estimate_in_bytes metric provides the estimated size in bytes that can be evicted from any given Cache service pod.
Time taken¶
The quarry_eviction_time_total_ns metric provides the amount of time taken to perform a cache eviction.
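To estimate the fraction of each second a pod spends evicting, one option is the following (a sketch, assuming the metric is a counter of nanoseconds):

# Nanoseconds of eviction work per second, normalized to a 0-1 duty cycle.
sum by (pod) (
rate(quarry_eviction_time_total_ns[$interval])
) / 1e9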
High and low watermarks¶
The quarry_eviction_low_watermark_in_bytes metric exposes the configured "low watermark" setting in bytes. This threshold tells the Cache service when to stop evicting data from the local hot cache.

The quarry_eviction_high_watermark_in_bytes metric exposes the configured "high watermark" setting in bytes. This threshold tells the Cache service when to start evicting data from the local hot cache.
Queries¶
Measuring query performance is critical in understanding the health of the database. The metrics below provide insight into how many queries are being run, how long they take, and how many results they are returning.
Query count¶
The quarry_search_query_count metric provides the number of queries issued to the database. This is particularly helpful in measuring total system read throughput.
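For example, to measure the per-pod query rate (a sketch, assuming the metric is a monotonic counter):

sum by (pod) (
rate(quarry_search_query_count[$interval])
)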
Document count¶
The quarry_search_result_doc_count metric provides the number of documents returned from reads to the system.
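For example, to measure the per-pod rate of documents returned (a sketch, assuming the metric is a monotonic counter):

sum by (pod) (
rate(quarry_search_result_doc_count[$interval])
)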
Search duration¶
The quarry_search_duration_ns_sum / quarry_search_duration_ns_count calculation provides the mean search duration for queries. It can be broken out per endpoint:
sum by (pod, endpoint) (
rate(quarry_search_duration_ns_sum[$interval])
/
rate(quarry_search_duration_ns_count[$interval])
)
Blocking time¶
The quarry_search_blocking_task_duration_ns metric provides the amount of time taken while blocking to serve a read request.
Cache hit ratio¶
The quarry_aggregation_cache_hit_total / (quarry_aggregation_cache_miss_total + quarry_aggregation_cache_hit_total) calculation provides the query cache hit ratio when query aggregation caching is enabled:
sum by (pod) (
rate(quarry_aggregation_cache_hit_total[$interval])
/
(
rate(quarry_aggregation_cache_miss_total[$interval])
+
rate(quarry_aggregation_cache_hit_total[$interval])
)
)
Rehydration¶
Rehydration is the process of automatically reading data from object storage when queried. The metrics below provide insight into when rehydration occurs, and how long it takes.
Average rehydration time¶
The quarry_ensure_roots_duration_ns_sum / quarry_ensure_roots_duration_ns_count calculation provides the mean time taken to rehydrate roots from object storage:
sum by (pod) (
rate(quarry_ensure_roots_duration_ns_sum[$interval])
/
rate(quarry_ensure_roots_duration_ns_count[$interval])
)
Number of roots rehydrated¶
The quarry_ensure_roots_download_count metric provides the number of roots rehydrated from object storage.
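For example, to measure the per-pod rehydration rate (a sketch, assuming the metric is a monotonic counter):

sum by (pod) (
rate(quarry_ensure_roots_download_count[$interval])
)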