New eGuide takes a closer look at Prometheus, ELK and Jaeger:
Open source tooling has its benefits. There are no licensing costs, and you're free to download it, get started, and modify it over time. If something goes wrong when running the software, you can dive into the code and debug it. Plus, if you need a specific feature, no one is stopping you from developing it and working to get it contributed upstream so the entire community benefits.

However, the zero cost of getting started doesn't last for the long term. You have to maintain the tools and invest engineering time, which takes away from other important initiatives. As you scale and grow, complexity rises and operations can get gnarly pretty quickly. Docs and ongoing support can be limited when you need them most, and over time you end up managing your primary service plus a whole set of tools (and the systems that run them) just so you can manage that primary service. It can feel like a double whammy. On top of that, not everyone on the team ends up using the same tools, and the lack of shared visibility causes confusion and frustration.
Popular open source tools fall into one of three buckets – logs, metrics, and tracing. Many vendors refer to these as ‘the three pillars of observability’, and we feel that is a misleading characterization. Observability is much more than having access to important system data so you understand what is happening at any point in time. It’s about having the ability to deeply introspect and ask new questions when there’s an issue or incident that needs attention. It’s also about the ongoing learning and institutional memory you build up over time by observing production across the entire software delivery lifecycle. For today, this eGuide focuses on three specific open source tools that each serve a distinct purpose but fall short of addressing important challenges for most engineering teams. Inside the guide, you’ll find Honeycomb’s “hot take” on what these three tools are generally used for and how they fit into the observability landscape:
- Prometheus – A time series database for metrics
- Elasticsearch/Logstash/Kibana – called “ELK” for short – A log storage, processing, and querying stack
- Jaeger – A system for distributed tracing
Metrics are essential for understanding the general health of the system, and they serve as indicators of how well it’s behaving in production. CPU, memory, and disk throughput tell you about stability and service reliability. However, when it comes to solving problems or seeing how new code behaves in production, you need to query high-resolution data that tells you how users are actually experiencing your service. In fact, Prometheus discourages using data that has too many dimensions – aka high cardinality – which is exactly the type of data that helps you solve problems quickly.
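To make the cardinality point concrete, here’s a minimal sketch using the Python prometheus_client library. The metric and label names are hypothetical, not from the guide:

```python
from prometheus_client import Counter, start_http_server

# Low-cardinality labels like method and status are what Prometheus expects.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests served",
    ["method", "status"],
)

start_http_server(8000)  # expose /metrics for Prometheus to scrape
REQUESTS.labels(method="GET", status="200").inc()

# A label such as user_id would create a separate time series per user:
# the kind of high-cardinality data Prometheus discourages, yet exactly
# what you want when debugging how individual users experience the service.
```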
Most services generate a high volume of logs over time, and many teams rely on searching through them to understand what happened at any point in time when faced with an issue or customer inquiry. Logs are often treated like an auditable archive and are most useful when you want to know what happened on a single machine. On a distributed cluster, things get more complicated, and tools such as ELK are required. While ELK helps you store, parse, and query system logs, you must already know what you’re searching for, so answering application-level unknown unknowns is nearly impossible.
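For illustration, here’s a minimal sketch of that kind of search using the Elasticsearch Python client (assuming an 8.x client; the index pattern and field names are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# You have to know what to look for up front: an index pattern,
# the fields your pipeline parsed out, and a value to match.
resp = es.search(
    index="app-logs-*",
    query={
        "bool": {
            "filter": [{"term": {"service.keyword": "checkout"}}],
            "must": [{"match": {"message": "timeout"}}],
        }
    },
    size=20,
)

for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```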
When teams are under pressure to scale and maintain a reliable service, the desire to adopt distributed tracing grows. Jaeger, which came out of Uber, gives teams the ability to see end to end how service requests perform. In the trace waterfall view, spans and their associated metadata tell you where latency and errors occur, assuming you have instrumented your service and built in telemetry. Jaeger is great for getting started with tracing, but if you use it in production you must choose between Apache Cassandra and Elasticsearch for persisting traces. That in turn means another data store, more resources, and more opportunities for failure.
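As a sketch of what that instrumentation looks like, here’s a hypothetical service emitting spans with the OpenTelemetry Python SDK and shipping them over OTLP to a locally running Jaeger collector (the service and attribute names are made up):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Recent Jaeger versions accept OTLP directly, so spans can be sent
# to the collector's gRPC endpoint.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.provider", "example-pay")
    # downstream calls made here show up as child spans in the trace waterfall
```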
If you are using any of these open source tools today, or perhaps you’re considering taking this approach, make sure you understand the ideal use cases and existing capabilities, plus the team effort required to get up and running, both now and over time. As observability increases in popularity, consider using a tool that gives you high-resolution visibility into your service when you need to quickly debug an incident and proactively optimize the system for the best user experience. Oh, and do let us know what you think of the guide.