Era Software

Metrics reference

Estimated reading time: 8 minutes
  • reference
  • self-hosted
  • monitoring
  • alerting

This page describes metrics for monitoring EraSearch's health and status. The content is intended for self-hosted EraSearch users. If you're looking for the fastest way to set up and use EraSearch, get started with EraSearch on EraCloud.

All EraSearch services include Prometheus-style metrics endpoints that can be used for scraping application metrics. The EraSearch Helm charts come with metric pod annotations that should be picked up by any pre-existing Kubernetes-based monitoring tools.

Era Software recommends a scrape interval of 10 seconds so that the database metrics are captured at high resolution. A short interval also keeps alerts accurate and lets you act quickly when problems occur.
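
A range selector needs at least two samples to compute a rate, so the window passed to `rate()` has to be longer than the scrape interval. As a rough sketch, assuming the recommended 10-second scrape interval and a hypothetical fixed window of one minute (a value chosen only for illustration):

```promql
# At a 10s scrape interval, a 1m window holds about six samples per series.
sum by (pod) (rate(quarry_bulk_request_total[1m]))
```

Shorter windows react faster but produce noisier graphs; longer windows smooth out spikes at the cost of responsiveness.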

Metrics to alert on

The following EraSearch metrics are critical to the health of the database. If any of these metrics has a sustained positive rate over time, you may be losing database writes and need to take action.

  • quarry_bulk_cache_failure_total - This metric indicates failure writing to the Cache service/layer, signifying a communication failure between the API and Cache layers.

    • For troubleshooting, see the Cache service logs for more information. The failure can also originate in upstream components: the Storage service, the object storage provider, or the network itself.
  • quarry_maxwell_bulk_upload_failure_total - This metric indicates failure writing to object storage via the Storage service.

    • For troubleshooting, the Storage service logs will have more information.
  • quarry_bulk_payloads_precommit_failure_total - This metric indicates a failure attempting to "pre-commit" data to the Cache service.

    • For troubleshooting, the Cache service logs will have more information.
  • quarry_seqnum_failure_total - This metric indicates a failure retrieving sequence numbers from the Coordinator service.

    • For troubleshooting, the Coordinator service logs will have more information.
  • quarry_rootset_save_error_count - This metric indicates failures by the Cache service backing up "rootsets" to object storage.

    • For troubleshooting, see the Cache service logs for more information. The failure can also originate in upstream components: the Storage service, the object storage provider, or the network itself.
  • quarry_compaction_io_failure_total and quarry_compaction_upload_failure_total - These metrics indicate failures by the Cache service to store the output of compactions.

    • For troubleshooting, the Cache service logs will have more information.
  • alexandria_redis_connection_failures_total (available in v1.22) - This metric indicates failures connecting to the Coordinator service Redis backend.

    • For troubleshooting, start with the Coordinator service logs. The issue may also lie with Redis or with the network.
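
The condition described above, a sustained positive rate, can be sketched as a PromQL expression. The 5-minute window and the grouping by the `pod` label are assumptions for illustration; adjust both to your environment, and substitute any of the counters from the list:

```promql
# Returns one series per pod whose failure counter grew during the last 5 minutes.
sum by (pod) (rate(quarry_bulk_cache_failure_total[5m])) > 0
```

In an alerting rule, pairing an expression like this with a hold duration (the `for` field in Prometheus alerting rules) helps distinguish a sustained failure rate from a single transient error.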

Metrics to watch

Include the following metrics in any monitoring dashboards. These metrics provide a high-level view of the database status and health.

Note: Era Software has a sample dashboard that you can use as a starting point for monitoring EraSearch using Grafana.

Ingest

Use the metrics below to observe the health and throughput of the ingest/write path into the database.

CPU-bound task latency

Use the quarry_cpu_queue_duration_ns_sum and quarry_cpu_queue_duration_ns_count metrics (or alternatively the quarry_cpu_queue_duration_ns summary metric) to measure how long CPU-bound tasks are queued. To visualize per pod, use:

sum by (pod) (
  rate(quarry_cpu_queue_duration_ns_sum[$interval])
  /
  rate(quarry_cpu_queue_duration_ns_count[$interval])
)

An average sustained latency greater than 1s typically means that the pod does not have adequate CPU resources and requires administrative action.
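
The 1-second guideline can be written directly as a comparison, keeping in mind that the metric is recorded in nanoseconds (1s = 1e9 ns). This is a sketch that assumes a fixed 5-minute window; it aggregates the numerator and denominator separately so that each pod yields a single mean even if it exports several series:

```promql
# Pods whose mean CPU queue time over the last 5 minutes exceeds 1 second.
  sum by (pod) (rate(quarry_cpu_queue_duration_ns_sum[5m]))
/
  sum by (pod) (rate(quarry_cpu_queue_duration_ns_count[5m]))
> 1e9
```

The expression returns no series while every pod is under the threshold, which makes it usable as an alert condition as well as a dashboard panel.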

Disk-bound task latency

Use the quarry_blocking_queue_duration_ns_sum and quarry_blocking_queue_duration_ns_count metrics (or alternatively the quarry_blocking_queue_duration_ns summary metric) to measure how long disk-bound tasks are queued. To visualize per pod, use:

sum by (pod) (
  rate(quarry_blocking_queue_duration_ns_sum[$interval])
  /
  rate(quarry_blocking_queue_duration_ns_count[$interval])
)

An average sustained latency greater than 5ms typically means that the pod does not have adequate disk resources and requires administrative action.
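
Because the metric is recorded in nanoseconds, dividing by 1e6 converts the result to milliseconds, which makes the 5ms guideline easier to read on a dashboard. A sketch of that conversion, assuming the same `$interval` dashboard variable as the query above:

```promql
# Mean disk-bound queue time per pod, in milliseconds.
  sum by (pod) (rate(quarry_blocking_queue_duration_ns_sum[$interval]))
/
  sum by (pod) (rate(quarry_blocking_queue_duration_ns_count[$interval]))
/ 1e6
```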

Bytes indexed

The quarry_bulk_request_indexed_bytes metric provides the number of bytes indexed, broken down by EraSearch namespace (the ns label). It is particularly helpful for measuring total write throughput, either across all indexes or for a particular index. To aggregate by index, use:

sum by (ns) (rate(quarry_bulk_request_indexed_bytes[$interval]) > 0)

To see bytes indexed per pod, use:

sum by (pod) (rate(quarry_bulk_request_indexed_bytes[$interval]) > 0)

Request duration

To measure the average ingest response time by pod, use:

sum by (pod) (
  rate(quarry_bulk_request_duration_ns_sum[$interval])
  /
  rate(quarry_bulk_request_duration_ns_count[$interval])
)

A summary metric is also available for measuring quantiles. For example, to view the highest 99th-percentile request duration across all series:

max(max_over_time(quarry_bulk_request_duration_ns{quantile="0.99"}[$interval]))

Request count

The quarry_bulk_request_total metric provides the total number of bulk write requests received by the API service. To measure the request rate by pod, use:

sum by (pod) (rate(quarry_bulk_request_total[$interval]))

Documents indexed

The quarry_bulk_docs_indexed_total metric provides the total number of documents indexed by the API service. To measure the index rate by pod, use:

sum by (pod) (rate(quarry_bulk_docs_indexed_total[$interval]))

Bulk request size

To measure the average bulk request size, use:

sum by (pod) (
  rate(quarry_bulk_request_bytes[$interval]) 
  / 
  rate(quarry_bulk_request_total[$interval])
)

Object storage

Object storage is a core component of the EraSearch architecture. The metrics below provide insight into how the Storage service interacts with the configured object storage provider.

Bytes written

The maxwell_object_write_bytes metric provides the number of bytes written to object storage per Storage service pod. To measure bytes written per second by pod, use:

sum by (pod) (rate(maxwell_object_write_bytes[$interval]))

Bytes read

The maxwell_object_read_bytes metric provides the number of bytes read from object storage per Storage service pod. To measure bytes read per second by pod, use:

sum by (pod) (rate(maxwell_object_read_bytes[$interval]))

Total writes

The maxwell_object_writes_total metric provides the total number of write requests issued to the object storage provider per Storage service pod. To measure the write request rate by pod, use:

sum by (pod) (rate(maxwell_object_writes_total[$interval]))

Total reads

The maxwell_object_reads_total metric provides the total number of read requests issued to the object storage provider per Storage service pod. To measure the read request rate by pod, use:

sum by (pod) (rate(maxwell_object_reads_total[$interval]))

Average write time

The maxwell_azure_blob_upload_duration_ns_sum / maxwell_azure_blob_upload_duration_ns_count or maxwell_s3_upload_duration_ns_sum / maxwell_s3_upload_duration_ns_count calculation provides the mean time taken for each write call to the respective object storage provider.

For Azure:

sum by (pod) (
  rate(maxwell_azure_blob_upload_duration_ns_sum[$interval])
  / 
  rate(maxwell_azure_blob_upload_duration_ns_count[$interval])
)

For AWS S3:

sum by (pod) (
  rate(maxwell_s3_upload_duration_ns_sum[$interval])
  / 
  rate(maxwell_s3_upload_duration_ns_count[$interval])
)

Average read time

The maxwell_s3_download_duration_ns_sum / maxwell_s3_download_duration_ns_count or maxwell_azure_blob_download_duration_ns_sum / maxwell_azure_blob_download_duration_ns_count calculation provides the mean time taken for each read call to the respective object storage provider.

For Azure:

sum by (pod) (
    rate(maxwell_azure_blob_download_duration_ns_sum[$interval])
    / 
    rate(maxwell_azure_blob_download_duration_ns_count[$interval])
)

For AWS S3:

sum by (pod) (
    rate(maxwell_s3_download_duration_ns_sum[$interval])
    / 
    rate(maxwell_s3_download_duration_ns_count[$interval])
)

Compactions

The Cache service periodically compacts the data on disk to optimize performance. Use the metrics below to understand when compactions occur, whether they were successful, and how long they typically take.

Average duration

The quarry_compaction_duration_ns_sum / quarry_compaction_duration_ns_count calculation provides the mean time taken per compaction. It can also be broken out per compaction level. To view the average compaction duration by level and pod, use:

sum by (pod, level) (
  rate(quarry_compaction_duration_ns_sum[$interval]) 
  / 
  rate(quarry_compaction_duration_ns_count[$interval])
)

Failures

The quarry_compaction_io_failure_total metric provides the number of IO failures that occurred when attempting to store a newly-compacted root.

sum by (pod) (rate(quarry_compaction_io_failure_total[$interval]))

The quarry_compaction_upload_failure_total metric provides the number of upload failures that occurred when attempting to upload a newly-compacted root.

sum by (pod) (rate(quarry_compaction_upload_failure_total[$interval]))

Adding the two rates together gives an overall error rate for compactions.
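
One way to combine the two failure counters is sketched below. It assumes that every Cache service pod exports both counters; a pod that exports only one of them drops out of the result, because the `+` operator only matches series present on both sides:

```promql
# Combined compaction failure rate per pod (IO failures plus upload failures).
  sum by (pod) (rate(quarry_compaction_io_failure_total[$interval]))
+
  sum by (pod) (rate(quarry_compaction_upload_failure_total[$interval]))
```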

Eviction

The Cache service periodically evicts (or removes) data from the local hot cache to prevent disk utilization from climbing to an unhealthy level. Use the metrics below to understand when eviction occurs and how long it takes.

Files evicted

The quarry_eviction_file_total_count metric provides the number of files evicted from any given Cache service pod.

sum by (pod) (rate(quarry_eviction_file_total_count[$interval]))

Estimated bytes to evict

The quarry_eviction_estimate_in_bytes metric provides the estimated size in bytes that can be evicted from any given Cache service pod.

sum by (pod) (rate(quarry_eviction_estimate_in_bytes[$interval]))

Time taken

The quarry_eviction_time_total_ns metric provides the amount of time taken to perform a cache eviction.

sum by (pod) (rate(quarry_eviction_time_total_ns[$interval]))

High and low watermarks

The quarry_eviction_low_watermark_in_bytes metric exposes the configured "low watermark" setting in bytes. The Cache service uses this threshold to decide when to stop evicting data from the local hot cache.

max(quarry_eviction_low_watermark_in_bytes)

The quarry_eviction_high_watermark_in_bytes metric exposes the configured "high watermark" setting in bytes. The Cache service uses this threshold to decide when to start evicting data from the local hot cache.

max(quarry_eviction_high_watermark_in_bytes)
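
Since eviction starts at the high watermark and stops at the low watermark, the difference between the two settings gives a rough sense of how much data a single eviction pass is expected to remove. A minimal sketch, assuming both settings are uniform across pods so that `max` is representative:

```promql
# Approximate bytes removed per eviction pass: high watermark minus low watermark.
max(quarry_eviction_high_watermark_in_bytes)
-
max(quarry_eviction_low_watermark_in_bytes)
```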

Queries

Measuring query performance is critical in understanding the health of the database. The metrics below provide insight into how many queries are being run, how long they take, and how many results they are returning.

Query count

The quarry_search_query_count metric provides the number of queries issued to the database. This is particularly helpful in measuring total system read throughput.

sum by (endpoint, pod) ( rate(quarry_search_query_count[$interval]) )

Document count

The quarry_search_result_doc_count metric provides the number of documents returned by queries to the system.

sum by (endpoint, pod) ( rate(quarry_search_result_doc_count[$interval]) )

Search duration

The quarry_search_duration_ns_sum / quarry_search_duration_ns_count calculation provides the mean duration of queries. It can be broken out per endpoint.

sum by (pod, endpoint) (
  rate(quarry_search_duration_ns_sum[$interval]) 
  / 
  rate(quarry_search_duration_ns_count[$interval])
)

Blocking time

The quarry_search_blocking_task_duration_ns metric provides the amount of time taken while blocking to serve a read request.

sum by (pod, endpoint) (rate(quarry_search_blocking_task_duration_ns[$interval]))

Cache hit ratio

The quarry_aggregation_cache_hit_total / (quarry_aggregation_cache_miss_total + quarry_aggregation_cache_hit_total) calculation provides the query cache hit ratio, when query aggregation caching is enabled.

sum by (pod) (
    rate(quarry_aggregation_cache_hit_total[$interval])
    / 
    (
        rate(quarry_aggregation_cache_miss_total[$interval])
        + 
        rate(quarry_aggregation_cache_hit_total[$interval])
    )
)
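
For a single cluster-wide figure rather than one value per pod, the same ratio can be computed after summing hits and misses across all pods. This is a sketch of that variant, not a query from the sample dashboard:

```promql
# Overall aggregation cache hit ratio across all pods, between 0 and 1.
sum(rate(quarry_aggregation_cache_hit_total[$interval]))
/
(
  sum(rate(quarry_aggregation_cache_miss_total[$interval]))
  +
  sum(rate(quarry_aggregation_cache_hit_total[$interval]))
)
```

When no cached aggregations are requested during the window, the denominator is zero and the expression evaluates to NaN, which typically shows up as a gap in the graph.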

Rehydration

Rehydration is the process of automatically reading data from object storage when queried. The metrics below provide insight into when rehydration occurs, and how long it takes.

Average rehydration time

The quarry_ensure_roots_duration_ns_sum / quarry_ensure_roots_duration_ns_count calculation provides the mean time taken to rehydrate roots from object storage.

sum by (pod) (
  rate(quarry_ensure_roots_duration_ns_sum[$interval])
  /
  rate(quarry_ensure_roots_duration_ns_count[$interval])
)

Number of roots rehydrated

The quarry_ensure_roots_download_count metric provides the number of roots rehydrated from object storage.

sum by (pod) (rate(quarry_ensure_roots_download_count[$interval]))