Key metrics for monitoring VMware vSphere

Company

Datadog

Date Published

Nov. 19, 2020

Author

Jordan Obey

Word count

6113

Language

English

Hacker News points

None

URL

www.datadoghq.com/blog/vsphere-metrics

Summary

To ensure optimal performance in a vSphere environment, it is crucial to monitor key metrics that provide insight into resource usage and overall health. These metrics cover CPU utilization, memory usage, disk I/O, network throughput, and tasks and events. By setting up alerts on them, you can identify potential issues before they impact your virtual machines (VMs) or the applications running on them. This guide discusses why these key vSphere metrics matter, gives examples of how to interpret their values, and covers some common causes of performance degradation in a vSphere environment along with ways to address them.

CPU Metrics:
1. CPU usage (%): The percentage of CPU resources currently being used by all VMs running on an ESXi host. If this value consistently exceeds 80%, it may indicate that your VM configuration is not optimized, or that multiple VMs are contending for resources.
2. CPU ready (%): The percentage of time that a VM has spent waiting for access to the physical CPU resources of its ESXi host. High values (e.g., > 5%) can point to the same causes: a VM configuration that is not optimized, or resource contention between VMs.
3. CPU core usage (%): The percentage of CPU cores currently being used by all VMs running on an ESXi host. As with CPU usage, a value that consistently exceeds 80% may indicate a non-optimized VM configuration or resource contention between VMs.

Memory Metrics:
1. Memory usage (%): The percentage of physical memory currently being used by all VMs running on an ESXi host. If this value consistently exceeds 80%, it may indicate that your VM configuration is not optimized, or that multiple VMs are contending for memory.
2. Balloon driver (vmmemctl) capacity: Each VM in vSphere can have a balloon driver (named vmmemctl) installed on it. If an ESXi host runs low on physical memory that it needs to allocate, it can reclaim memory from the guest physical memory of virtual machines by sending requests to the balloon drivers to "inflate" by gathering unused memory from the VM. The ESXi host can take that memory from the "inflated" balloon driver and deallocate the appropriate mapped host physical memory, which it can then allocate to other VMs. This technique is known as memory ballooning.
3. Memory swapped in/out: When an ESXi host provisions a virtual machine, it allocates physical disk storage files known as swap files. Swap file size is determined by the VM's configured memory size, less any reserved memory. For instance, if a VM is configured with 3 GB of memory and has a 1 GB reservation, it will have a 2 GB swap file. By default, a VM's swap files are collocated with its virtual disk, on shared storage.
4. Active memory versus consumed memory: For the VMkernel to accurately discern how much memory is actively in use by VMs, it would need to monitor every memory page that has been read from or written to. This process, however, would require too much overhead. Instead, the VMkernel uses statistical sampling to generate an estimate of each VM's active memory usage.
5. Memory usage: At the VM level, the mem.usage metric measures what percentage of its configured memory a VM is actively using. Ideally, a VM should not always be using all of its configured memory. If it is consistently using a large portion of its configured memory, the VM will be less resilient to spikes in memory usage if its ESXi host cannot allocate additional memory.

Disk Metrics:
1. Disk commands aborted: In vSphere, a single storage device cluster may hold datastores that serve many virtual machines. If there is a surge of commands from virtual machines to the storage hardware where datastores are located, that storage may become overloaded and unresponsive.
2. Disk bus resets: If a storage device is overwhelmed with too many read and write commands from an ESXi host, or if it encounters a hardware issue and fails to abort commands, it will clear out all commands waiting in its queue. This is called a disk bus reset.
3. Datastore provisioned capacity and actual VM usage: Storage is a finite resource. The diskspace.provisioned.latest metric tracks how much storage space is available on the datastores that the ESXi host communicates with, while virtualDisk.actualUsage lets you monitor how much disk space the VMs running on that host are actively using.
4. Disk latency: Monitoring latency is key to ensuring that your VMs are communicating with their virtual disks efficiently and without delay. Total disk latency measures the time it takes, in milliseconds, for an ESXi host to process a request sent from a VM to a datastore.
5. Queue latency: Depending on their configuration, storage devices like LUNs have a limited number of commands they can queue at any one time. When the volume of virtual machine commands sent from an ESXi host exceeds what a storage device can queue itself, those commands will begin to queue in the VMkernel.
6. Disk throughput: To ensure that your datastores, ESXi hosts, and VMs are processing read and write commands without interruption, monitor their I/O throughput for visibility into their activity.

Network Metrics:
1. Network received and network transmitted: These metrics track the network throughput, in kilobytes per second, of the object you're observing, whether it's a host or a VM.

Tasks and Events:
By default, vSphere records tasks and events that occur in the VMs, ESXi hosts, and the vCenter Server of your virtual environment. These can include user logins, VM power-downs, certificate expirations, and host connects/disconnects. Monitoring the events in these log files can help you stay aware of overall activity within your vSphere clusters, perform audits, and investigate any issues that occur in your environment.

In conclusion, monitoring key metrics in a vSphere environment is essential for maintaining optimal performance and ensuring that resources are being used efficiently. By setting up alerts for these metrics and regularly reviewing their values, you can identify potential issues before they impact your virtual machines or the applications running on them.
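
The thresholds and the swap file arithmetic mentioned in this summary can be sketched in a few lines of Python. This is a minimal illustration, not part of vSphere or Datadog: the function names, the metric keys (`cpu_usage`, `cpu_ready`, `mem_usage`), and the reading of "consistently exceeds" as "every sample in the window is above the threshold" are all assumptions made for the example. The threshold values (80% and 5%) are the ones quoted in the summary and should be treated as starting points.

```python
from typing import Dict, List, Sequence

# Warning thresholds quoted in the summary, in percent.
# The metric keys are hypothetical names chosen for this sketch.
THRESHOLDS: Dict[str, float] = {
    "cpu_usage": 80.0,
    "cpu_ready": 5.0,
    "mem_usage": 80.0,
}


def swap_file_size_gb(configured_gb: float, reserved_gb: float) -> float:
    """Swap file size is the VM's configured memory less its reservation."""
    if reserved_gb < 0 or reserved_gb > configured_gb:
        raise ValueError("reservation must be between 0 and configured memory")
    return configured_gb - reserved_gb


def consistently_above(samples: Sequence[float], threshold: float) -> bool:
    """True only if the window is non-empty and every sample exceeds the threshold.

    Assumption: "consistently" means all samples in the window, which avoids
    flagging a single short spike.
    """
    return len(samples) > 0 and all(value > threshold for value in samples)


def breached_metrics(windows: Dict[str, Sequence[float]]) -> List[str]:
    """Return the metric names whose whole sample window is above its threshold."""
    return [
        name
        for name, threshold in THRESHOLDS.items()
        if name in windows and consistently_above(windows[name], threshold)
    ]
```

For the example in the text, `swap_file_size_gb(3, 1)` gives a 2 GB swap file. Requiring a whole window of samples to exceed the threshold, rather than a single reading, is one simple way to alert on sustained pressure instead of momentary spikes; a real alerting setup would likely use averages or percentiles over a time window instead.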