December 19, 2023
Modern systems look very different from how they did just a few years ago. For the most part, development organizations have moved away from building traditional monoliths toward developing containerized applications running across highly distributed infrastructure. While this change has made systems inherently more resilient, the added complexity has made it both more important and more challenging to identify and address problems at their root cause when issues occur.
Part of the solution to this challenge lies in leveraging tools and platforms that can effectively monitor the health of services and infrastructure. To that end, this post will explain best practices for Prometheus monitoring of services and infrastructure. In addition, it will outline the reasons why Prometheus alone is not enough to monitor the complex, highly distributed system environments in use today.
Prometheus is an open-source monitoring and alerting toolkit that was first developed by SoundCloud in 2012 for cloud-native metrics monitoring.
In the world of monitoring and observability, we have three primary data types: logs, metrics, and traces. Metrics serve as the data stopwatch that helps you track service level objectives (SLOs) and service level indicators (SLIs) as time series data.
This is great for teams that rely heavily on histograms or high-cardinality data (metrics with many unique label combinations). However, many customers need more out of their observability environments, and these days most have adopted OpenTelemetry to unify collection across all three data types. If you want to learn more about unified collection and its benefits, check out this article: Unified Kubernetes Monitoring with OpenTelemetry.
Let’s take a look at what Prometheus can monitor, its architecture and how it works in practice.
Organizations use Prometheus to collect metrics regarding service and infrastructure performance. Depending upon the use case, Prometheus metrics may include performance markers such as CPU utilization, memory usage, request counts, requests per second, exception counts and more. When leveraged effectively, this data can assist organizations in identifying system issues in a timely manner.
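To make this concrete, here is a minimal sketch of how an application might expose such metrics using the official Python client library (prometheus_client). The metric names, labels, and port are illustrative, not taken from any particular service.

```python
# Minimal sketch using the official Python client (pip install prometheus-client).
# Metric names, labels, and the port are illustrative, not prescriptive.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled", ["endpoint"])
EXCEPTIONS = Counter("app_exceptions_total", "Total unhandled exceptions")
IN_PROGRESS = Gauge("app_requests_in_progress", "Requests currently being handled")
LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

def handle_request():
    # Track concurrency and latency, and count the request by endpoint label.
    with IN_PROGRESS.track_inprogress(), LATENCY.time():
        REQUESTS.labels(endpoint="/checkout").inc()
        time.sleep(random.uniform(0.01, 0.2))  # simulated work

if __name__ == "__main__":
    start_http_server(8000)  # exposes metrics at http://localhost:8000/metrics
    while True:
        try:
            handle_request()
        except Exception:
            EXCEPTIONS.inc()
```

When Prometheus scrapes this process, it receives the metrics in the plain-text exposition format, for example a line like app_requests_total{endpoint="/checkout"} 42.0.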
The Prometheus server is central to the Prometheus architecture and performs the actual monitoring work. It is made up of three major components:
Time series database - This component is responsible for storing metrics data. This data is stored as a time series, meaning that the data is represented in the database as a series of timestamped data points belonging to the same metric and set of labeled dimensions.
Worker for data retrieval - This component does exactly what its name implies: it pulls metrics from “targets,” which can be applications, services or other system infrastructure components. It then takes these metrics and pushes them to the time series database. The data retrieval worker collects these metrics by scraping HTTP endpoints on the targets; in Prometheus terminology, each scrapeable endpoint is known as an instance.
By default, this endpoint is <host-address>/metrics. For targets that do not expose Prometheus-formatted metrics natively, you configure a Prometheus exporter. At its core, an exporter is a service that fetches metrics from the target, formats them properly, and exposes the /metrics endpoint so that the data retrieval worker can pull the data for storage in the time series database. For jobs that cannot be scraped, such as short-lived, service-level batch jobs, the Prometheus Pushgateway acts as an intermediary: the job pushes its time series to the Pushgateway, and Prometheus scrapes them from there. (A configuration sketch follows this component list.)
HTTP server - This server accepts queries written in the Prometheus query language (PromQL) and pulls the matching data from the time series database. The HTTP server can be leveraged by the built-in Prometheus graph UI or by other data visualization tools, such as Grafana, to provide developers and IT personnel with an interface for querying and visualizing these metrics in a useful, human-friendly format.
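Pulling those pieces together, here is a sketch of a scrape configuration for the data retrieval worker; the job names, targets, and intervals are illustrative.

```yaml
# prometheus.yml (sketch) - job names, targets, and intervals are illustrative.
global:
  scrape_interval: 15s          # how often the data retrieval worker scrapes targets

scrape_configs:
  - job_name: "checkout-service"
    metrics_path: /metrics      # the default, shown here for clarity
    static_configs:
      - targets: ["checkout-1:8000", "checkout-2:8000"]

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]   # host-level metrics via an exporter
```

And a couple of PromQL queries you might run against the HTTP server, using the illustrative metric names from the earlier Python sketch:

```promql
# Per-second request rate over the last five minutes, broken down by endpoint
sum by (endpoint) (rate(app_requests_total[5m]))

# 95th percentile request latency derived from the histogram buckets
histogram_quantile(0.95, sum by (le) (rate(app_request_latency_seconds_bucket[5m])))
```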
The Prometheus Alertmanager is also worth mentioning here. Alerting rules can be set up within the Prometheus configuration to define conditions that trigger an alert when they are met. When this happens, the Prometheus server pushes alerts to the Alertmanager. From there, the Alertmanager handles deduplicating, grouping and routing these alerts to the proper personnel via email or another alerting integration.
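As a sketch of what such a rule might look like (the threshold, labels, and names are illustrative and reuse the example metrics from above):

```yaml
# alert-rules.yml (sketch) - thresholds, labels, and names are illustrative.
groups:
  - name: checkout-service
    rules:
      - alert: HighErrorRate
        expr: sum(rate(app_exceptions_total[5m])) / sum(rate(app_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Checkout error rate has been above 5% for 10 minutes"
```

Prometheus loads rule files like this through the rule_files setting in its configuration, and the alerting section points it at one or more Alertmanager instances, which then handle the grouping and routing described above.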
As we know, modern development architectures have a much higher level of complexity than those of more than a decade ago. Today's systems contain many servers running containerized applications and services, such as a Kubernetes cluster. These services are loosely coupled, calling one another in order to provide functionality to the end user. Architecturally, these services might also be spread across multiple cloud environments. The complex nature of these systems can have the effect of obscuring the causes of failures.
Organizations need granular insight into system behavior to address this challenge, and collecting and aggregating log event data is critical to this pursuit. This log data can be correlated with performance metrics, enabling organizations to gain the insights and context necessary for efficient root cause analysis. While Prometheus collects metrics, it does not collect log data. Therefore, it does not provide the level of detail necessary to support effective incident response on its own.
Furthermore, Prometheus faces challenges when scaled significantly — a situation often unavoidable in the era of highly distributed modern systems. Prometheus was not originally built to query and aggregate metrics across multiple instances. Configuring it to do so, typically through federation or remote storage, adds complexity to the organization’s Prometheus deployment. This complicates the process of attaining a holistic view of the entire system, which is a critical aspect of performing incident response with any level of efficiency.
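To give a sense of that extra configuration, below is a sketch of hierarchical federation, in which a global Prometheus server scrapes the /federate endpoint of per-cluster servers. The hostnames and match expressions are illustrative.

```yaml
# Global Prometheus server scraping per-cluster servers via /federate (sketch).
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job="checkout-service"}'
        - '{__name__=~"node_.*"}'
    static_configs:
      - targets: ["prometheus-cluster-a:9090", "prometheus-cluster-b:9090"]
```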
Finally, Prometheus was not built to retain metrics data for long periods of time. Access to this type of historical data can be invaluable for organizations managing complex environments. For one, organizations may want to analyze these metrics to detect patterns that occur over a few months or even a year to gain an understanding of system usage during a specific time period. Such insights can dictate scaling strategies when systems may be pushed to their limits.
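For context, local retention on the Prometheus server is governed by startup flags and defaults to 15 days of time-based retention; longer horizons generally mean adding remote storage. A sketch of the relevant flags:

```shell
# Local TSDB retention is set via server flags (sketch); the time-based default
# is 15 days, and data is removed once either limit below is exceeded.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=90d \
  --storage.tsdb.retention.size=200GB
```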
While Prometheus is a great tool for gathering high-level metrics for SLOs and SLIs, site reliability engineers and security analysts must drill down into logs to find what exactly may have gone wrong. For that reason, you need unified telemetry collection across all three data types: logs, metrics, and traces. We need to shed outdated legacy processes and mindsets to innovate and use the newest best practices to ensure the best possible digital customer experience.
All of these challenges can be best addressed by leveraging unified Kubernetes monitoring with Sumo Logic’s OpenTelemetry Collector and setting up the latest Helm Chart. Additionally, OTel enables auto-instrumentation capabilities, saving you a lot of time in setting up your collectors. You can still aggregate Prometheus data alongside this collector, but there is no reason to use it as middleware for metrics unless your infrastructure calls for specific, specialized capabilities, for example, deep familiarity with PromQL or particular histograms unavailable in the Sumo Logic monitoring environment. Learn more in our ultimate guide to OpenTelemetry.
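As a rough sketch of what that Helm setup looks like, the commands below use the public sumologic-kubernetes-collection chart; the release name, namespace, and value keys may differ by chart version, so treat them as assumptions to verify against the current documentation.

```shell
# Sketch: installing the Sumo Logic Kubernetes collection Helm chart.
# Release name, namespace, and value keys may vary by chart version.
helm repo add sumologic https://sumologic.github.io/sumologic-kubernetes-collection
helm repo update
helm upgrade --install my-collection sumologic/sumologic \
  --namespace sumologic --create-namespace \
  --set sumologic.accessId="<ACCESS_ID>" \
  --set sumologic.accessKey="<ACCESS_KEY>" \
  --set sumologic.clusterName="my-cluster"
```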
Using OpenTelemetry as a standard helps you achieve a smaller collection footprint and save time on instrumentation while following security and monitoring best practices. Learn more about Kubernetes monitoring in this ebook.