Monitoring your EKS cluster with Datadog

Company

Datadog

Date Published

April 4, 2019

Author

Maxim Brown

Word count

4967

Language

English

Hacker News points

None

URL

www.datadoghq.com/blog/eks-monitoring-datadog

Summary

In the first part of this series, we covered how to use Amazon EKS (Elastic Kubernetes Service) to deploy and manage containerized applications on AWS. We also discussed some of the key metrics you should monitor in order to maintain optimal performance for your EKS cluster and its underlying infrastructure. In this second part, we will go over how to use Datadog to monitor Amazon EKS clusters.

Datadog is a monitoring service that provides deep insights into applications running on Kubernetes, Docker, AWS services, and other technologies. It includes real-time metrics visualization, alerting, log management, distributed tracing for microservices architectures, and more.

Datadog’s AWS integration collects CloudWatch metrics for the infrastructure behind your cluster without requiring you to install anything on your nodes or pods. To collect Kubernetes-level metrics, custom metrics from your applications, logs, traces, and process data, you will need to deploy the Datadog Agent to your nodes.

In this post, we’ll cover how to use Datadog to monitor an EKS cluster by:

- Deploying the Datadog Agent to your nodes using a Kubernetes DaemonSet
- Enabling Datadog’s AWS integrations for CloudWatch metrics and events
- Collecting, visualizing, and alerting on key EKS metrics with Datadog
- Using tags to filter and sort your resources by metadata from various sources
- Automatically discovering services running in your containers using Autodiscovery
- Enabling log collection from your nodes and pods
- Instrumenting your applications to send custom metrics to Datadog
- Setting up alerts based on metric thresholds, anomaly detection, or forecasts

By the end of this post, you should have a good understanding of how to use Datadog to monitor Amazon EKS clusters. Let’s get started!
## Part 2: Monitoring Amazon EKS with Datadog

In this section, we will go over how to deploy and configure the components you need to start monitoring your EKS cluster with Datadog. We assume that you have already set up an EKS cluster as described in Part 1 of this series.

### Installing the Datadog Agent on your nodes using a Kubernetes DaemonSet

Datadog provides a Helm chart for deploying its Agent to your Kubernetes nodes. Helm is a package manager for Kubernetes that makes it easy to install, upgrade, and manage applications running on your cluster. If you haven’t already done so, follow the instructions in the official Helm documentation to set up Helm on your local machine or CI/CD pipeline.

Once you have installed Helm, add the Datadog repository by running:

```bash
$ helm repo add datadog https://helm.datadoghq.com
```

Next, update your list of available charts from all repositories with:

```bash
$ helm repo update
```

Now you can install the Datadog Agent as a release named `datadog-agent` in its own namespace by running:

```bash
$ helm install datadog-agent datadog/datadog \
    --namespace datadog-agent --create-namespace \
    --set datadog.apiKey=<YOUR_DATADOG_API_KEY>
```

Replace `<YOUR_DATADOG_API_KEY>` with the API key for your Datadog account. You can find this under API Keys in your organization settings in the Datadog web interface.

This command will create a Kubernetes DaemonSet named `datadog-agent`, which will automatically deploy one instance of the Datadog Agent to each node in your cluster. The DaemonSet manifest includes environment variables that tell the Agent how to communicate with your Datadog account, as well as volume mounts and hostPort configurations for collecting metrics from Docker containers running on your nodes.

You can view the status of the deployed DaemonSet by running:

```bash
$ kubectl get daemonsets -n datadog-agent
```

This should show the `datadog-agent` DaemonSet with one Agent pod running on each node in your cluster.
If you don’t see any instances, make sure that Helm has successfully installed the chart by checking its status with:

```bash
$ helm list -n datadog-agent
```

If there are errors or warnings associated with the installation, consult the official Datadog Agent documentation for troubleshooting tips.

### Enabling Datadog’s AWS integrations for CloudWatch metrics and events

Datadog can automatically collect CloudWatch metrics and events from your AWS account to provide visibility into the performance and health of the infrastructure components behind your EKS cluster, such as EC2 instances, ELB load balancers, and EBS volumes. To enable this integration, you will need to configure an IAM role with read-only access to CloudWatch metrics and events.

First, create a new IAM policy that includes the necessary permissions for querying CloudWatch data:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:DescribeRule",
        "events:ListRules"
      ],
      "Resource": "*"
    }
  ]
}
```

Next, create a new IAM role and attach this policy to it. Make sure that the trust relationship for this role allows `sts:AssumeRole` from Datadog’s AWS account, scoped with the external ID shown in Datadog’s AWS integration setup, so that Datadog can assume the role to read data from your account.

Finally, add an environment variable named `DD_CLOUDWATCH_REGION` to the Datadog Agent’s DaemonSet, with the value set to the region where your EKS cluster is running (e.g., `us-west-2`). This tells the Agent which AWS CloudWatch API endpoint to use when querying for metrics and events. Because the DaemonSet is managed by Helm, set the variable through the chart’s `datadog.env` value and redeploy by running:

```bash
$ helm upgrade datadog-agent datadog/datadog -n datadog-agent \
    --set datadog.apiKey=<YOUR_DATADOG_API_KEY> \
    --set 'datadog.env[0].name=DD_CLOUDWATCH_REGION' \
    --set 'datadog.env[0].value=<YOUR_AWS_REGION>'
```

Replace `<YOUR_AWS_REGION>` with the region where your EKS cluster is running.
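The policy and role steps above can also be scripted with the AWS CLI. The following is a minimal sketch, not a definitive procedure: the policy name, role name, file names, and account ID are hypothetical placeholders, and it assumes you have saved the policy document and a trust policy for the role as local JSON files. By default it only prints the commands it would run; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names and paths; replace them with your own.
POLICY_NAME="${POLICY_NAME:-DatadogCloudWatchReadOnly}"
ROLE_NAME="${ROLE_NAME:-DatadogIntegrationRole}"
POLICY_FILE="${POLICY_FILE:-datadog-cloudwatch-policy.json}"
TRUST_FILE="${TRUST_FILE:-datadog-trust-policy.json}"
AWS_ACCOUNT_ID="${AWS_ACCOUNT_ID:-123456789012}"  # placeholder account ID

# Print each command instead of running it unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# 1. Create the read-only CloudWatch policy from the saved JSON document.
run aws iam create-policy \
  --policy-name "$POLICY_NAME" \
  --policy-document "file://$POLICY_FILE"

# 2. Create the role with the trust policy you wrote for it.
run aws iam create-role \
  --role-name "$ROLE_NAME" \
  --assume-role-policy-document "file://$TRUST_FILE"

# 3. Attach the policy to the role.
run aws iam attach-role-policy \
  --role-name "$ROLE_NAME" \
  --policy-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:policy/${POLICY_NAME}"
```

Running the script once in dry-run mode lets you review the three `aws iam` calls before anything is created in your account.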
### Collecting, visualizing, and alerting on key EKS metrics with Datadog

With the Datadog Agent deployed to your nodes and the AWS integration collecting CloudWatch metrics and events from your account, you should start seeing data flowing into Datadog within a few minutes. You can view this data in real time in various sections of the Datadog web interface, such as:

- The Metrics Explorer for visualizing time series charts based on metric queries
- The Live Containers dashboard for monitoring the status and performance of your containers across all nodes in your cluster
- The Host Map view for seeing a high-level overview of your node infrastructure
- The Log Explorer for searching, filtering, and analyzing logs from your applications running on the cluster
- The APM Trace view for diving deep into individual distributed request traces across multiple services and components

Datadog includes out-of-the-box dashboards and widgets designed to help you monitor key performance indicators (KPIs) for Amazon EKS clusters. These include:

- Kubernetes Cluster Overview dashboard, which provides a high-level summary of the overall health and performance of your cluster, including resource usage, network traffic, and storage utilization.
- Kubernetes Node Overview dashboard, which shows detailed metrics for each individual node in your cluster, such as CPU usage, memory usage, disk I/O operations, and network throughput.
- Kubernetes Pod Overview dashboard, which provides similar information at the pod level instead of the node level.

You can customize these built-in dashboards or create your own by using Datadog’s query language to construct complex metric queries and visualizations. You can also use tags to filter and sort your resources based on metadata from various sources, such as Kubernetes pod labels, Docker container image names, and AWS EC2 instance types.
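To make the role of tags in queries concrete, here is a small sketch that assembles a Datadog metric query string of the form `<aggregator>:<metric>{<tag filter>} by {<group tags>}`. The helper name `dd_query` is hypothetical and exists only for illustration; the metric and tag names are examples that assume the Kubernetes integration is reporting data.

```shell
# Build a metric query string: <aggregator>:<metric>{<filter>} by {<group>}
# dd_query is a hypothetical helper, not part of any Datadog tool.
dd_query() {
  local agg="$1" metric="$2" filter="${3:-*}" group="${4:-}"
  local query="${agg}:${metric}{${filter}}"
  # Only append a grouping clause when group tags were given.
  if [ -n "$group" ]; then
    query="${query} by {${group}}"
  fi
  printf '%s\n' "$query"
}

# Average CPU usage in one namespace, split out per pod.
dd_query avg kubernetes.cpu.usage.total "kube_namespace:default" "pod_name"
```

The filter narrows the query to resources carrying a given tag, and the grouping clause produces one series per tag value, which is how a single query can be scoped to a namespace and broken down by pod.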
In addition to visualizing metrics in real time, Datadog provides a number of alerting options that allow you to detect potential issues before they cause serious problems for your infrastructure and its users. These include threshold alerts based on specific metric values or rate changes, anomaly detection using machine learning algorithms, and forecasts based on historical data trends. Datadog integrates with popular notification services like PagerDuty, Slack, and Microsoft Teams, making it easy to configure alert notifications that automatically reach the right teams when something goes wrong. You can read more about how to use Datadog’s alerts in our documentation.

In the next section, we will go over some additional features and capabilities of Datadog that you may find useful for monitoring Amazon EKS clusters.

### Additional features and capabilities of Datadog for monitoring Amazon EKS clusters

Datadog includes a number of other features and integrations that can help you gain deeper insights into the performance and health of your EKS cluster and its underlying infrastructure components. Some examples include:

- Automatic service discovery using Kubernetes annotations, which allows Datadog to automatically identify what’s running in your containers and start collecting metrics from those services without needing any manual configuration.
- Log management with automatic log ingestion, parsing, and enrichment for various technologies you may be running on your EKS cluster, such as Kubernetes, Docker, and AWS CloudWatch logs.
- Distributed tracing for microservices architectures using the Datadog Agent’s built-in support for open source tracing libraries like Jaeger, Zipkin, and OpenTracing.
- Process monitoring with real-time visibility into individual processes running on your nodes and containers, including information about CPU usage, memory usage, network connections, and file descriptors.
- Custom metric reporting from your applications running on the cluster using the DogStatsD protocol or one of Datadog’s supported client libraries for various programming languages like Python, Java, Go, and Ruby.

By leveraging these additional features and capabilities of Datadog, you can gain even more visibility into the behavior and performance of your EKS cluster and its underlying infrastructure components. This will enable you to make better-informed decisions about how to optimize and scale your applications running on the cluster in order to meet changing business requirements or user demands.

In conclusion, Amazon EKS provides a powerful platform for deploying and managing containerized applications on AWS. However, monitoring the performance and health of these applications and their underlying infrastructure components can be challenging due to the dynamic nature of containerized environments. Datadog is an ideal solution for monitoring Amazon EKS clusters because it integrates with both Kubernetes and AWS CloudWatch metrics and events, making it easy to collect, visualize, and alert on key performance indicators (KPIs) related to your cluster infrastructure. Additionally, Datadog includes a wide range of other features and integrations that can help you gain deeper insights into the behavior and performance of your applications running on the cluster.

If you don’t yet have a Datadog account, you can sign up for a 14-day free trial and start monitoring your EKS clusters today.
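As an illustration of the custom metric reporting over DogStatsD mentioned in the feature list, the sketch below formats a DogStatsD datagram (`<name>:<value>|<type>|#<tags>`) and sends it over UDP. It is a minimal example under stated assumptions: the function names and the metric name are hypothetical, it relies on bash’s `/dev/udp` redirection, and it assumes the Agent’s DogStatsD port (8125 by default) is reachable at the host given by `DD_AGENT_HOST`.

```shell
# Format a DogStatsD datagram: <name>:<value>|<type>[|#tag1:v1,tag2:v2]
# Types include c (count) and g (gauge).
dogstatsd_datagram() {
  local name="$1" value="$2" type="$3" tags="${4:-}"
  local datagram="${name}:${value}|${type}"
  # Tags are optional and follow a "|#" separator.
  if [ -n "$tags" ]; then
    datagram="${datagram}|#${tags}"
  fi
  printf '%s' "$datagram"
}

# Send one datagram to the Agent over UDP (bash-only /dev/udp feature).
send_metric() {
  local host="${DD_AGENT_HOST:-127.0.0.1}"
  local port="${DD_DOGSTATSD_PORT:-8125}"
  dogstatsd_datagram "$@" > "/dev/udp/${host}/${port}"
}

# Example (hypothetical metric name), left commented out so that sourcing
# this file does not send anything:
# send_metric checkout.orders.processed 1 c "env:prod,service:checkout"
```

In an application you would normally use one of the client libraries rather than raw datagrams, but the wire format shows what those libraries send to the Agent on each call.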