How to monitor Elasticsearch performance
In this article, we will explore the key metrics you can monitor in Elasticsearch to ensure optimal performance and availability of your cluster. We will cover the following topics:

1. Search and indexing performance
2. Memory and garbage collection
3. Host-level system and network metrics
4. Cluster health and node availability
5. Resource saturation and errors

By monitoring these key areas, you can gain valuable insight into how your Elasticsearch cluster is performing and identify potential issues before they become critical problems.

1. Search and Indexing Performance

Search performance in Elasticsearch is determined by the time it takes to execute a search request against a given index or set of indices. Search latency depends on several factors, including:

- Query complexity: Complex queries with many filters and aggregations take longer to process than simpler ones.
- Number of shards: Each shard is searched in parallel, so adding shards can improve search performance. If there are too many shards, however, the overhead of coordinating across them can actually slow searches down.
- Hardware resources: The CPU and memory available on each node also affect search performance.

To monitor search performance, keep an eye on the following metrics:

- Search request latency: The time it takes for a search request to be processed, from when Elasticsearch receives it until the response is returned. Sudden spikes in search latency can indicate a problem with your cluster.
- Search requests per second: An overall measure of how much search traffic your cluster is handling at any given time. A steady increase in the search rate is a sign that demand is growing and the cluster may need to be scaled up accordingly.

Indexing performance is determined by the rate at which new documents can be added to or updated in an index. Indexing latency depends on several factors, including:

- Bulk size: When adding or updating multiple documents, a single bulk request is more efficient than sending an individual request per document. If the bulk size is too large, however, it can hurt performance through increased memory usage and longer garbage collection pauses.
- Number of shards: Each shard indexes documents independently, so more shards allow documents to be added or updated in parallel. As with search, too many shards can slow indexing down because of coordination overhead.
- Hardware resources: The CPU and memory available on each node also affect indexing performance.

To monitor indexing performance, keep an eye on the following metrics (a sketch for collecting both the search and indexing metrics follows this section):

- Index request latency: The time it takes for an indexing request to be processed, from when Elasticsearch receives it until the response is returned. Sudden spikes in indexing latency can indicate a problem with your cluster.
- Index requests per second: An overall measure of how much indexing traffic your cluster is handling at any given time. A steady increase in the indexing rate is a sign that data volume is growing and the cluster may need to be scaled up accordingly.
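These search and indexing counters are exposed through Elasticsearch's node stats API. The following sketch, which is not part of the original article, uses Python and the requests library to estimate cluster-wide throughput and average latency from two samples of GET _nodes/stats/indices; the cluster address (localhost:9200, no authentication) and the 60-second sampling interval are assumptions to adjust for your environment.

```python
import time
import requests

# Assumed cluster address; adjust for your deployment.
ES_URL = "http://localhost:9200"

def sample_indices_stats():
    """Sum cumulative search/indexing counters across all nodes."""
    stats = requests.get(f"{ES_URL}/_nodes/stats/indices", timeout=10).json()
    totals = {"query_total": 0, "query_time_in_millis": 0,
              "index_total": 0, "index_time_in_millis": 0}
    for node in stats["nodes"].values():
        totals["query_total"] += node["indices"]["search"]["query_total"]
        totals["query_time_in_millis"] += node["indices"]["search"]["query_time_in_millis"]
        totals["index_total"] += node["indices"]["indexing"]["index_total"]
        totals["index_time_in_millis"] += node["indices"]["indexing"]["index_time_in_millis"]
    return totals

def report(interval_seconds=60):
    """Compare two samples to estimate throughput and average latency."""
    first = sample_indices_stats()
    time.sleep(interval_seconds)
    second = sample_indices_stats()

    d_queries = second["query_total"] - first["query_total"]
    d_query_ms = second["query_time_in_millis"] - first["query_time_in_millis"]
    d_index = second["index_total"] - first["index_total"]
    d_index_ms = second["index_time_in_millis"] - first["index_time_in_millis"]

    print(f"search rate:    {d_queries / interval_seconds:.1f} queries/s")
    if d_queries:
        print(f"search latency: {d_query_ms / d_queries:.1f} ms/query (avg)")
    print(f"index rate:     {d_index / interval_seconds:.1f} docs/s")
    if d_index:
        print(f"index latency:  {d_index_ms / d_index:.1f} ms/doc (avg)")

if __name__ == "__main__":
    report()
```

Because the node stats counters are cumulative since node startup, rates and average latencies are derived by differencing two samples rather than reading a single snapshot.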
2. Memory and Garbage Collection

Elasticsearch runs on the Java Virtual Machine (JVM), which means it relies heavily on memory management techniques such as garbage collection. To keep your cluster performing well, monitor both the amount of heap memory in use and the frequency and duration of garbage collection events.

Heap memory usage: By default, Elasticsearch starts each node with a 1GB JVM heap; depending on the size and complexity of your dataset, you will likely need to adjust this setting. To monitor heap memory usage, keep an eye on the following metrics (see the sketch after this section):

- Heap used percentage: The percentage of the allocated heap currently in use by the Elasticsearch process. If this value consistently stays above 85%, the node is running low on heap space and the cluster may need more memory or more nodes.
- Heap committed percentage: In addition to tracking how much heap is in use, monitor how much heap memory the JVM has reserved (committed). If committed heap consistently sits above 95% of the maximum, the node is approaching its memory ceiling.

Garbage collection: The JVM periodically runs garbage collection to reclaim unused heap memory. If too many objects are being allocated and discarded in a short period, garbage collection drives up CPU usage and lengthens response times for user queries. To monitor garbage collection, keep an eye on the following metrics:

- Total time spent in garbage collection: How much time each node's JVM has spent in garbage collection over a given period (e.g., the past hour). If garbage collection consistently consumes more than about 10% of that time, the cluster is likely suffering from memory pressure.
- Number of minor and major garbage collections: Minor (young-generation) collections occur frequently but each reclaims only a small amount of heap; major (old-generation) collections are much less frequent but can reclaim large amounts of heap at once, often with longer pauses. Tracking both types separately shows how efficiently your cluster is using its available memory.
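Heap and garbage collection statistics are available from the same node stats API under the jvm section. Below is a minimal sketch, again assuming an unauthenticated cluster at localhost:9200, that prints heap usage and cumulative GC activity per node; the 85% warning threshold mirrors the guideline above and is only a starting point.

```python
import requests

# Assumed cluster address; adjust for your deployment.
ES_URL = "http://localhost:9200"

def check_jvm(heap_warn_pct=85):
    """Report heap usage and garbage collection counters for every node."""
    stats = requests.get(f"{ES_URL}/_nodes/stats/jvm", timeout=10).json()
    for node in stats["nodes"].values():
        name = node["name"]
        mem = node["jvm"]["mem"]
        gc = node["jvm"]["gc"]["collectors"]

        heap_used_pct = mem["heap_used_percent"]
        committed_pct = 100.0 * mem["heap_committed_in_bytes"] / mem["heap_max_in_bytes"]

        # "young" collections are the frequent, minor GCs; "old" collections
        # are the rarer, major GCs that can pause the node noticeably.
        young = gc["young"]
        old = gc["old"]

        flag = "  <-- heap pressure" if heap_used_pct > heap_warn_pct else ""
        print(f"{name}: heap used {heap_used_pct}% "
              f"(committed {committed_pct:.0f}%), "
              f"young GC {young['collection_count']}x/{young['collection_time_in_millis']}ms, "
              f"old GC {old['collection_count']}x/{old['collection_time_in_millis']}ms{flag}")

if __name__ == "__main__":
    check_jvm()
```

As with the search and indexing counters, the GC counts and times are cumulative since node startup, so trending them over time (or differencing samples) is more informative than a single reading.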
3. Host-Level System and Network Metrics

In addition to monitoring application-specific metrics within Elasticsearch itself, it is important to collect host-level system and network metrics from each of your nodes. This gives you a more complete picture of the health and performance of your entire infrastructure. Some key areas to focus on include:

- Disk space usage: Running out of disk space can lead to severe performance degradation or, in extreme cases, data loss. Monitor both the free space remaining on each node and the rate at which it is being consumed over time.
- CPU utilization: Sustained high CPU usage indicates that one or more processes on a node are consuming excessive processing power and may need to be tuned to reduce their resource footprint.
- Network throughput: Monitoring traffic sent and received by each node can reveal bottlenecks or capacity constraints that limit the scalability and performance of your deployment.

4. Cluster Health and Node Availability

Keeping your Elasticsearch cluster highly available and resilient is critical to minimizing downtime for your search application. To that end, monitor both the overall health of the cluster and the availability of each individual node. Some key areas to focus on include (a sketch combining these checks with the host-level metrics above follows this section):

- Cluster status (green, yellow, red): Elasticsearch uses a simple traffic-light system to report whether all primary and replica shards are assigned to nodes. Green means all shards are allocated and functioning properly. Yellow means one or more replica shards are unassigned, but searches still return complete results. Red means at least one primary shard is unassigned: searches will return only partial results, and indexing into the missing shard will fail until the issue is resolved.
- Number of nodes: Primary and replica shards should be distributed evenly across the available nodes. By tracking how many nodes are online at any given time, you can quickly detect a drop in the node count that may indicate a problem with one or more nodes.
- Number of initializing and unassigned shards: When indices are created, or nodes join or leave the cluster, shards pass through an "initializing" state while Elasticsearch prepares them on their assigned nodes. Shards that are allocated successfully transition to "started"; shards that cannot be placed on any node remain "unassigned". Tracking how many shards sit in each of these states, and for how long, shows how well the cluster is keeping up with changes to its underlying dataset.
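Cluster status, node count, and shard allocation come from the cluster health API, while per-node disk and process CPU figures are available from the node stats API. The sketch below is an illustration rather than a prescribed implementation: the cluster address, the expected node count, and the 85% disk threshold are assumptions, and OS-level CPU and network fields are omitted because their exact paths vary across Elasticsearch versions.

```python
import requests

# Assumed cluster address; adjust for your deployment.
ES_URL = "http://localhost:9200"

def check_cluster(expected_nodes=3, disk_warn_pct=85):
    """Check overall cluster health, node count, and per-node disk/CPU pressure."""
    health = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()
    print(f"status: {health['status']}  "
          f"nodes: {health['number_of_nodes']}  "
          f"initializing shards: {health['initializing_shards']}  "
          f"unassigned shards: {health['unassigned_shards']}")

    if health["status"] == "red":
        print("  at least one primary shard is unassigned!")
    if health["number_of_nodes"] < expected_nodes:
        print(f"  expected {expected_nodes} nodes, only "
              f"{health['number_of_nodes']} are responding")

    # Host-level view: disk usage and Elasticsearch process CPU per node.
    stats = requests.get(f"{ES_URL}/_nodes/stats/fs,process", timeout=10).json()
    for node in stats["nodes"].values():
        fs = node["fs"]["total"]
        used_pct = 100.0 * (fs["total_in_bytes"] - fs["available_in_bytes"]) / fs["total_in_bytes"]
        cpu_pct = node["process"]["cpu"]["percent"]
        flag = "  <-- low disk space" if used_pct > disk_warn_pct else ""
        print(f"{node['name']}: disk used {used_pct:.0f}%, process CPU {cpu_pct}%{flag}")

if __name__ == "__main__":
    check_cluster()
```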
5. Resource Saturation and Errors

In addition to monitoring performance-related metrics, it is important to watch for resource saturation and error conditions so that emerging issues can be addressed before they escalate into more serious problems. Some key areas to focus on include (see the sketch at the end of the article):

- Thread pool queues and rejections: Each node maintains several thread pools (for search, indexing, bulk operations, and so on) that control how work consumes memory and CPU. Track both the number of queued operations and the number of rejections, which occur when a request is turned away because a pool's queue is already full. Persistent queues or rising rejection counts mean the nodes cannot keep up with their workload.
- Cache usage metrics: Elasticsearch uses two main caches, the fielddata cache and the filter cache, to speed up certain queries by reducing the amount of data that must be read from disk. If these caches grow too large, however, they consume heap memory that the rest of the system needs and can end up hurting performance. Monitor both the current size of each cache and the rate of evictions, which occur when entries are forced out to free up space.
- Pending tasks: The elected master node handles cluster-level administrative tasks such as creating indices and assigning shards to nodes. If the master becomes overloaded, pending tasks accumulate, cluster changes are applied with significant delay, and operations waiting on them can time out. Monitor both the number of pending tasks and how long they have been waiting.

By monitoring these key areas on a regular basis, you can see how efficiently your cluster is processing search requests and absorbing changes to its underlying dataset. That visibility lets you spot performance bottlenecks and resource saturation before they limit the scalability of your search application, and take proactive measures before small problems escalate into more serious ones.
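To tie the saturation signals above together, here is one final sketch that surfaces thread pool queues and rejections, cache sizes and evictions, and the master node's pending task backlog. As before, the cluster address is an assumption, and the cache field names shown (fielddata and query_cache) reflect Elasticsearch 2.x and later, where the filter cache is reported as the query cache.

```python
import requests

# Assumed cluster address; adjust for your deployment.
ES_URL = "http://localhost:9200"

def check_saturation():
    """Surface thread pool pressure, cache evictions, and pending master tasks."""
    stats = requests.get(f"{ES_URL}/_nodes/stats/thread_pool,indices", timeout=10).json()
    for node in stats["nodes"].values():
        name = node["name"]

        # Rejections mean a thread pool's queue filled up and work was refused;
        # the search and bulk/write pools are usually the ones to watch.
        for pool, tp in node["thread_pool"].items():
            if tp["rejected"] > 0 or tp["queue"] > 0:
                print(f"{name}: thread pool '{pool}' "
                      f"queue={tp['queue']} rejected={tp['rejected']}")

        # Evictions indicate the caches are being forced to drop entries.
        fielddata = node["indices"]["fielddata"]
        query_cache = node["indices"]["query_cache"]
        print(f"{name}: fielddata {fielddata['memory_size_in_bytes']} bytes, "
              f"{fielddata['evictions']} evictions; "
              f"query cache {query_cache['memory_size_in_bytes']} bytes, "
              f"{query_cache['evictions']} evictions")

    # Tasks waiting on the elected master node (index creation, shard
    # assignment, mapping updates, and so on).
    pending = requests.get(f"{ES_URL}/_cluster/pending_tasks", timeout=10).json()
    print(f"pending cluster tasks: {len(pending['tasks'])}")

if __name__ == "__main__":
    check_saturation()
```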
Company
Datadog
Date published
Sept. 26, 2016
Author(s)
Emily Chang
Word count
6039
Hacker News points
2
Language
English