
How to monitor istiod

What's this blog post about?

In this tutorial, we will show you how to monitor the newer, monolithic versions of Istio using Prometheus and Grafana. We will also demonstrate how to use Datadog for comprehensive visibility into your Istio cluster. First, let's introduce some key concepts related to monitoring Istio:

- Metrics are numerical values that represent a specific aspect of the system being monitored. In the context of Istio, metrics can include request rates, error counts, and latency distributions for service-to-service traffic within your mesh.
- Logs provide detailed information about individual events or transactions within your system. For example, you might use logs to investigate a specific spike in request latencies or an unexpected increase in the number of 500 response codes from one of your services.
- Traces are sequences of related requests that pass through multiple services and components within your mesh. By collecting traces from Envoy proxies, you can visualize the architecture of your service mesh and understand how traffic flows between different parts of your system.

Istio exposes telemetry data from istiod and your mesh of Envoy proxies in Prometheus format, and you can access this data by enabling several popular monitoring tools that Istio includes as a pre-configured bundle. To get greater detail for ad hoc investigations, you can access istiod and Envoy logs via kubectl, and troubleshoot your Istio configuration with istioctl analyze.

Installing Istio add-ons

Each Envoy proxy and istiod container in your Istio cluster exposes a Prometheus metrics endpoint that emits the metrics we introduced earlier. You can access istiod metrics at <ISTIOD_CONTAINER_IP>:15014/metrics, and Envoy metrics at <ENVOY_CONTAINER_IP>:15000/stats.

You can quickly set up monitoring for your cluster by enabling Istio's out-of-the-box add-ons. Istio's Prometheus add-on uses Kubernetes' built-in service discovery to fetch the DNS addresses of istiod pods and Envoy proxy containers. You can then open Istio's Grafana dashboards (which we introduced in Part 2) to visualize metrics for istiod and your service mesh. And if you enable Zipkin or Kiali, you can visualize traces collected from Envoy, which helps you understand your mesh's architecture and the performance of service-to-service traffic.

Beginning with version 1.4.0, Istio has deprecated Helm as an installation method, and you can install Istio's monitoring add-ons by using the istioctl CLI. To install the add-ons, run the following command:

```
istioctl install --set <KEY1>=<VALUE1> --set <KEY2>=<VALUE2>
```

The second column in the table below shows the value of the --set flags you should add to enable specific add-ons. Once you've enabled an add-on, you can open it by running the command in the third column.

| Add-on | How to enable | How to view |
|---|---|---|
| Prometheus | --set values.prometheus.enabled=true | istioctl dashboard prometheus |
| Grafana | --set values.grafana.enabled=true | istioctl dashboard grafana |
| Kiali | --set values.tracing.enabled=true --set values.kiali.enabled=true | istioctl dashboard kiali |
| Zipkin | --set values.tracing.enabled=true --set values.tracing.provider=zipkin | istioctl dashboard zipkin |
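
For example, to enable the Prometheus and Grafana add-ons together, you can combine the corresponding flags from the table in a single istioctl install run. This is a minimal sketch based on the flags above; adjust the --set flags to match the add-ons you want:

```
# Enable the Prometheus and Grafana add-ons in one pass
istioctl install \
  --set values.prometheus.enabled=true \
  --set values.grafana.enabled=true

# Open the Grafana dashboards in your browser
istioctl dashboard grafana
```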

Istio and Envoy logging

Both istiod and Envoy log error messages and debugging information that you can use to get more context for troubleshooting (we addressed this in more detail in Part 2). istiod publishes logs to stdout and stderr by default. You can access istiod logs with the kubectl logs command, using the -l app=istiod option to collect logs from all istiod instances. The -f flag prints new logs to stdout as they arrive:

```
kubectl logs -f -l app=istiod -n istio-system
```

Envoy access logs are disabled by default. You can run the following command to configure Envoy to print its access logs to stdout:

```
istioctl install --set meshConfig.accessLogFile="/dev/stdout"
```

To print logs for your Envoy proxies, run the following command:

```
kubectl logs -f <SERVICE_POD_NAME> -c istio-proxy
```

If you want to change the format of your Envoy logs and the type of information they include, you can use the --set flag in istioctl install to configure two options. First, you can set global.proxy.accessLogEncoding to JSON (the default is TEXT) to enable structured logging in this format. Second, the accessLogFormat option lets you customize the fields that Envoy prints within its access logs, as we discussed in more detail in Part 2.

istioctl analyze

If your Istio metrics are showing unexpected traffic patterns, anomalously low sidecar injection counts, or other issues, you may have misconfigured your Istio deployment. You can use the istioctl analyze command to see if this is the case. To check for configuration issues in all Kubernetes namespaces, run the following command:

```
istioctl analyze --all-namespaces
```

The output will be similar to the following:

```
Warn [IST0102] (Namespace app) The namespace is not enabled for Istio injection. Run 'kubectl label namespace app istio-injection=enabled' to enable it, or 'kubectl label namespace app istio-injection=disabled' to explicitly mark it as not needing injection
Info [IST0120] (Policy grafana-ports-mtls-disabled.istio-system) The Policy resource is deprecated and will be removed in a future Istio release. Migrate to the PeerAuthentication resource.
Error: Analyzers found issues when analyzing all namespaces. See https://istio.io/docs/reference/config/analysis for more information about causes and resolutions.
```

In this case, the first warning explains why mesh metrics are missing in the app namespace: we have not yet enabled sidecar injection. After enabling automatic sidecar injection for the app namespace, we can watch our sidecar injection metrics to ensure our configuration is working.
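
As the warning's suggested remediation shows, enabling injection is a matter of labeling the namespace. The sketch below also restarts a deployment, since pods created before the label was applied only receive sidecars once they are recreated; web-frontend is a hypothetical deployment name:

```
# Enable automatic sidecar injection for the app namespace
kubectl label namespace app istio-injection=enabled

# Recreate existing pods so the injection webhook can add Envoy sidecars
# (web-frontend is a hypothetical deployment in the app namespace)
kubectl -n app rollout restart deployment web-frontend
```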

Monitoring istiod with Datadog

Datadog gives you comprehensive visibility into the health and performance of your mesh by enabling you to visualize and alert on all the data that Istio generates within a single platform. This makes it easy to navigate between metrics, traces, and logs, and to set intelligent alerts. In this section, we'll show you how to monitor istiod with Datadog.

Datadog monitors your Istio deployment through a collection of Datadog Agents, which are designed to maximize visibility with minimal overhead. One Agent runs on each node in your cluster and gathers metrics, traces, and logs from local Envoy and istiod containers. The Cluster Agent passes cluster-level metadata from the Kubernetes API server to the node-based Agents, along with any configurations needed to collect data from Istio. This enables node-based Agents to get comprehensive visibility into your Istio cluster, and to enrich metrics with cluster-level tags.

Set up Datadog's Istio integration

We recommend that you install the Datadog Operator within your Istio cluster using the Helm package manager. The Operator will track the states of Datadog resources, compare them to user configurations, and apply changes accordingly. In this section, we will show you how to:

- Deploy the Datadog Operator
- Annotate your Istio services so the Datadog Agents can discover them

Deploy the Datadog Operator

Before you deploy the Datadog Operator, you should create a Kubernetes namespace for all of your Datadog-related resources. This way, you can manage them separately from your Istio components and mesh services, and exempt your Datadog resources from Istio's automatic sidecar injection. Run the following command to create a namespace for your Datadog resources:

```
kubectl apply -f - <<EOF
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "datadog",
    "labels": {
      "name": "datadog"
    }
  }
}
EOF
```

Next, create a manifest called dd_agent.yaml that the Datadog Operator will use to install the Datadog Agents. Note that you'll only need to provide the keys below the spec section of a typical Kubernetes manifest (the Operator installation process will fill in the rest):

dd_agent.yaml

```
credentials:
  apiKey: "<DATADOG_API_KEY>" # Fill this in
  appKey: "<DATADOG_APP_KEY>" # Fill this in

# Node Agent configuration
agent:
  image:
    name: "datadog/agent:latest"
  config:
    # The Node Agent will tolerate all taints, meaning it can be deployed to
    # any node in your cluster.
    # https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
    tolerations:
      - operator: Exists

# Cluster Agent configuration
clusterAgent:
  image:
    name: "datadog/cluster-agent:latest"
  config:
    metricsProviderEnabled: true
    clusterChecksEnabled: true
  # We recommend two replicas for high availability
  replicas: 2
```

You'll need to include your Datadog API key within the manifest. You won't have to provide an application key to collect data from Istio, but one is required if you want the Datadog Operator to send data to Datadog for troubleshooting Datadog Agent deployments. You can find your API and application keys within Datadog.

Once you've created your manifest, use the following Helm (version 3.0.0 and above) and kubectl commands to install and configure the Datadog Operator:

```
helm repo add datadog https://helm.datadoghq.com
helm install -n datadog my-datadog-operator datadog/datadog-operator
kubectl -n datadog apply -f "dd_agent.yaml" # Configure the Operator
```

The helm install command uses the -n datadog flag to deploy the Datadog Operator and the resources it manages into the datadog namespace we created earlier. After you've installed the Datadog Operator, it will deploy the node-based Agents and Cluster Agent automatically:

```
$ kubectl get pods -n datadog
NAME                                              READY   STATUS    RESTARTS   AGE
datadog-datadog-operator-5656964cc6-76mdt         1/1     Running   0          1m
datadog-operator-agent-hp8xs                      1/1     Running   0          1m
datadog-operator-agent-ns6ws                      1/1     Running   0          1m
datadog-operator-agent-rqhmk                      1/1     Running   0          1m
datadog-operator-agent-wkq64                      1/1     Running   0          1m
datadog-operator-cluster-agent-68f8cf5f9b-7qkgr   1/1     Running   0          1m
datadog-operator-cluster-agent-68f8cf5f9b-v8cp6   1/1     Running   0          1m
```

Automatically track Envoy proxies within your cluster

The node-based Datadog Agents are pre-configured to track Envoy containers running on their local hosts. This means that Datadog will track mesh metrics from your services as soon as you have deployed the node-based Agents.
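
If you want to confirm that a node-based Agent has discovered the Envoy containers on its host, you can inspect its check status. This is a minimal sketch; the pod name is taken from the example output above, and the exact check names listed may vary with your Agent version:

```
# Show the running checks on one node-based Agent
kubectl -n datadog exec -it datadog-operator-agent-hp8xs -- agent status

# Look for an envoy entry in the "Running Checks" section of the output
```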

Configure the Datadog Agents to track your Istio deployment

Since Kubernetes can schedule istiod and service pods on any host in your cluster, the Datadog Agent needs to track the containers running the relevant metrics endpoints, no matter which pods run them. We'll show you how to configure Datadog to use endpoints checks, which collect metrics from the pods backing the istiod Kubernetes service. With endpoints checks enabled, the Cluster Agent ensures that each node-based Agent is querying the istiod pods on its local host. The Cluster Agent then populates an Istio Autodiscovery template for each node-based Agent.

The Datadog Cluster Agent determines which Kubernetes services to query by extracting configurations from service annotations. It then fills in the configurations with up-to-date data from the pods backing these services. You can configure the Cluster Agent to look for the istiod service by running the following script (note that the service you'll patch is called istio-pilot in version 1.5.x and istiod in version 1.6.x):

```
kubectl -n istio-system patch service <istio-pilot|istiod> --patch "$(cat<<EOF
metadata:
  annotations:
    ad.datadoghq.com/endpoints.check_names: '["istio"]'
    ad.datadoghq.com/endpoints.init_configs: '[{}]'
    ad.datadoghq.com/endpoints.instances: |
      [
        {
          "istiod_endpoint": "http://%%host%%:15014/metrics",
          "send_histograms_buckets": true
        }
      ]
EOF
)"
```

The ad.datadoghq.com/endpoints.instances annotation includes the configuration for the Istio check. Once the istiod service is annotated, the Cluster Agent will dynamically fill in the %%host%% variable with the IP address of each istiod pod, then send the resulting configuration to the Agent on the node where that pod is running. After applying this change, you'll see istiod metrics appear within Datadog.
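
To verify that the patch took effect, you can read the annotations back from the service. A quick sketch, assuming the 1.6.x service name (istiod):

```
# Print the Autodiscovery annotations on the istiod service
kubectl -n istio-system get service istiod -o jsonpath='{.metadata.annotations}'
```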

Get the visibility you need into istiod's performance

Datadog helps you monitor metrics for istiod, Envoy, and 650+ integrations, giving you insights into every component of your Istio cluster. For example, you can create a dashboard that visualizes request rates from your mesh alongside Kubernetes resource utilization (below), as well as metrics from common Kubernetes infrastructure components like Harbor and CoreDNS. Istio's most recent benchmark shows that Envoy proxies consume 0.6 vCPU for every 1,000 requests per second, so to ensure that all services remain available, you should keep an eye on per-node resource utilization as you deploy new services or scale existing ones.

You can also use Datadog to get a clearer picture of whether to scale your istiod deployment. By tracking istiod's work and resource utilization metrics, you can understand when istiod pods are under heavy load and scale them as needed. The dashboard below shows key throughput metrics for each of istiod's core functions (handling sidecar injection requests, pushing xDS requests, creating certificates, and validating configurations), along with high-level resource utilization metrics (CPU and memory utilization) for the istiod pods in your cluster.

Cut through complexity with automated alerts

You can reduce the time it takes to identify issues within a complex Istio deployment by letting Datadog notify you if something is amiss. Datadog enables you to create automated alerts based on your istiod and mesh metrics, plus APM data, logs, and other information that Datadog collects from your cluster. In this section, we'll show you two kinds of alerts: metric monitors and forecast monitors.

Metric monitors can notify your team if a particular metric's value crosses a threshold for a specific period of time. This is particularly useful when monitoring automated sidecar injections, which are critical to Istio's functionality. You'll want to know as soon as possible if your cluster has added new pods without injecting sidecars so you can check for misconfigurations or errors within the injection webhook. You can do this by setting a metric monitor on the difference between sidecar_injection_success_total and sidecar_injection_requests_total, which tells you how many sidecar injection requests were skipped or resulted in an error. If you get alerted that this value is unusually low, you can immediately investigate.
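
As a rough sketch of what such a monitor might look like when created through Datadog's monitors API, the example below alerts when injection requests outpace successful injections over the past hour. The metric names (istio.sidecar_injection.requests.count and istio.sidecar_injection.success.count) are assumptions about how Istio's Prometheus counters map into Datadog, and the threshold of five is an arbitrary example; check both against your own account before using this:

```
# Hypothetical monitor definition; metric names and threshold are assumptions
curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
    "name": "Sidecar injection requests are being skipped or failing",
    "type": "query alert",
    "query": "sum(last_1h):sum:istio.sidecar_injection.requests.count{*} - sum:istio.sidecar_injection.success.count{*} > 5",
    "message": "Injection requests are outpacing successful injections. Check the injection webhook. @your-team",
    "options": {"thresholds": {"critical": 5}}
  }'
```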

Company
Datadog

Date published
Sept. 23, 2021

Author(s)
Paul Gottschling


