Deploying the OpenTelemetry Collector to AKS

While investigating some issues users raised around the OpenTelemetry Collector running in AKS, I found a few nuances that are worth noting. In this article, I’ll go over some changes you have to implement in your values.yaml to make it work for you.

What is the OpenTelemetry Collector?

The Collector is the focal point for telemetry inside your cluster. Instead of your containerized applications sending directly to your OpenTelemetry-capable backend (the place that allows you to ask questions of your telemetry), you send that data to an internal location first, then forward the data on.
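
In Collector terms, that means wiring a receiver (the internal location your applications send to) into an exporter (which forwards the data on) through a pipeline. Here's a minimal sketch of that idea in Collector configuration, with a placeholder backend endpoint you'd replace with your own; in the Helm chart, this block sits under the top-level config: key in values.yaml.

receivers:
  otlp:                        # applications send OTLP to the Collector
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlphttp:                    # the Collector forwards the data on
    endpoint: https://your-backend.example.com   # placeholder
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]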

Why is the Collector useful in AKS?

When you run applications in a Kubernetes cluster, they’re more “portable.” The applications themselves don’t realize that they’re running in Kubernetes, and everything they need is injected in an agnostic way.

However, from an observability perspective, knowledge of the surrounding environment is important to get a full understanding of what’s going on. That could be the metrics associated with the pod that’s hosting the code, or it could be the information about the node that the pod is running on. This surrounding information can help answer questions like:

  • Was the CPU high on the pod that served this request?
  • Was the node that served this request under heavy network load?
  • Are the requests being evenly distributed over multiple pods?

These questions, and the answers we get, help us truly understand how our application works inside the wider context of production.


How to deploy the Collector in AKS

The OpenTelemetry Collector team provides Helm charts that make it incredibly easy to deploy the Collector.

However, the recommended approach to deploying Collectors in K8s doesn't work as-is on AKS, so we need to make some small adjustments to the process.

Kubeletstats and certificates

The first nuance with the default Collector deployment on AKS is that the kubeletstats receiver won't work and will throw an error about certificates. In AKS, the Kubelet API server uses self-signed certificates instead of certificates signed by the kube-root CA. We can get around this by adding a property to our values.yaml that allows the receiver to skip validation of the certificate authority while still using the TLS endpoint and the security tokens.

Add this to your values.yaml file for the Collector that receives Kubelet stats.

config:
  receivers:
    kubeletstats:
      # AKS kubelets use self-signed certificates, so skip verifying the
      # certificate authority while keeping TLS and token authentication
      insecure_skip_verify: true
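
For context, here's roughly how that sits in a fuller values.yaml, assuming you deploy the chart in daemonset mode with its kubeletMetrics preset enabled (a sketch; adjust it to however you've configured the receiver):

mode: daemonset
presets:
  kubeletMetrics:
    enabled: true                  # adds the kubeletstats receiver for you
config:
  receivers:
    kubeletstats:
      insecure_skip_verify: true   # combined with the receiver config the preset generates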

Kubernetes attributes processor and NAT

The second nuance is that the K8s attributes processor will fail to look up any of the pods if you're using kubenet networking for your AKS cluster, which is the default networking mode for AKS.

This means that your telemetry won't get enriched by the Collector with information like the deployment name, node name, etc. This information provides the context that links our infrastructure metrics to our application telemetry, so without it, we're a bit blind.
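
For reference, this is roughly how the processor is typically configured: by default it associates incoming telemetry with a pod via the source IP of the connection, which is the detail that the NAT behaviour described below interferes with. A representative sketch (the metadata list is illustrative):

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.node.name
        - k8s.deployment.name
    pod_association:
      - sources:
          # match the connection's source IP to a pod IP -- this is the
          # lookup that fails when traffic arrives via a NAT gateway
          - from: connection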

The default setup for the Collector uses a DaemonSet and has all your applications send their telemetry to the node IP, using the K8s Downward API to expose that IP as an environment variable. For example:

env:
  - name: "OTEL_COLLECTOR_NAME"
    valueFrom:
      fieldRef:
        # Downward API: the IP of the node this pod is scheduled on
        fieldPath: status.hostIP

However, in AKS with kubenet networking, calls to that host/node IP are proxied through a NAT gateway. As a result, the Collector only sees a connection from an IP like 10.244.2.1 instead of the pod IP, so the pod lookup fails. We can fix this by using a K8s Service.

The node-local (DaemonSet) approach is preferred because it bypasses the software networking layer by routing locally: telemetry doesn't flow between nodes, which matters because cross-node traffic can incur significant costs if it ends up traversing VNETs or regions. Switching to a Service would normally lose that guarantee, but K8s introduced something called internalTrafficPolicy, which solves the issue by keeping traffic on the local node where possible.

Add this to your values.yaml for the Collector that is deployed as a DaemonSet:

service:
  # expose the DaemonSet Collector pods behind a ClusterIP Service
  enabled: true

Instead of using the downward API in your applications, add a reference to the namespace and the service name. The service name is generated from the name of the Helm release plus the string opentelemetry-collector. For example:

Helm Release Name: otelcol
Namespace: observability
Service name: otelcol-opentelemetry-collector

Your configuration for the applications would be:

env:
  - name: "OTEL_COLLECTOR_NAME"
    value: otelcol-opentelemetry-collector.observability
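
If your SDKs read the standard OTLP environment variables, one option is to derive the exporter endpoint from that value. A sketch assuming OTLP over gRPC on the Collector's default port 4317 (use 4318 for OTLP/HTTP):

env:
  - name: "OTEL_COLLECTOR_NAME"
    value: otelcol-opentelemetry-collector.observability
  - name: "OTEL_EXPORTER_OTLP_ENDPOINT"
    # $(VAR) expansion works because OTEL_COLLECTOR_NAME is declared above it
    value: "http://$(OTEL_COLLECTOR_NAME):4317"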

By doing this, your applications will now send to the service instead of directly to the pod. However, since the Helm chart's default policy for the Collector's service is internalTrafficPolicy: Local, requests will still resolve to the Collector pod on the same node.
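
If you want to confirm that's in place, the rendered Service should carry that field. A trimmed, illustrative sketch of what to look for (only the fields relevant here; your chart version may render more):

apiVersion: v1
kind: Service
metadata:
  name: otelcol-opentelemetry-collector
  namespace: observability
spec:
  type: ClusterIP
  # keeps traffic on the node that received it, so apps reach their local Collector pod
  internalTrafficPolicy: Local
  ports:
    - name: otlp
      port: 4317
      protocol: TCP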

Conclusion

Setting up the OpenTelemetry Collector in Azure Kubernetes Service (AKS) is really easy, but there are a couple of nuances that make it feel like it can't work out of the box. That said, they're pretty easy to get around without you having to create your own Helm charts.

I hope this article was helpful! If you’d like to read more about OpenTelemetry, I wrote a series on OpenTelemetry best practices. Happy learning!
