October 7, 2020
Platform architects, SREs, developers, and DevOps staff responsible for mission-critical modern apps know that shaving 15 minutes off service incidents that take 20 minutes to resolve, four times a year, is the difference between meeting a 99.99% availability objective and missing it. After a successful worldwide preview, Sumo Logic Observability is now ready for site reliability engineers, DevOps staff, developers, and platform engineers to resolve incidents faster, maximize availability, and optimize their cloud infrastructure, microservices, and application operations for reliability objectives.
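The arithmetic behind that claim can be checked with a minimal sketch. It assumes a 365-day year and counts only the four incidents mentioned above; the function names are illustrative and not part of any Sumo Logic API.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year allowed by an availability objective."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def annual_downtime_minutes(incidents_per_year: int, minutes_per_incident: float) -> float:
    """Total yearly downtime if every incident takes the same time to resolve."""
    return incidents_per_year * minutes_per_incident


budget = downtime_budget_minutes(99.99)   # roughly 52.56 minutes per year
before = annual_downtime_minutes(4, 20)   # four incidents at 20 minutes each
after = annual_downtime_minutes(4, 5)     # the same incidents resolved 15 minutes faster

print(f"budget={budget:.2f} min, before={before} min, after={after} min")
print("meets objective before:", before <= budget)
print("meets objective after:", after <= budget)
```

Under these assumptions, 80 minutes of yearly downtime exceeds the budget while 20 minutes fits comfortably within it.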
Reliability as an outcome is not new. Modern applications rely on diverse combinations of cloud infrastructure, container and orchestration tools, and technologies. These increasingly distributed environments are inherently complex to troubleshoot. Dependencies between layers in an application stack, such as those between microservices and their underlying cloud resources, make troubleshooting cumbersome. Modern applications also emit huge amounts of logs, metrics, traces, and metadata: it’s not unusual for an application to generate hundreds of gigabytes of logs per day, several tens of thousands of time series per minute, several million traces per day, and metadata from hundreds of app and infrastructure entities. Even mature development teams like Sumo Logic's face new problems more often than not because, simply put, unknown behaviors and failures are a property of distributed systems that you have to deal with. Furthermore, all data used for troubleshooting needs to be protected and secured.
In what follows, we describe the latest Sumo Logic Observability innovations, starting with an example of troubleshooting an incident in a modern application. Consider the highly simplified mobile banking application shown below. In this example, the app is built on Kubernetes and AWS infrastructure, but the same process can be applied to other technology stacks or application architectures. Consumers' bill payment transactions flow through the AWS Application Load Balancer (ALB) to the payment-service, which is orchestrated by Kubernetes. The payment-service posts transactions to the accounts-service (another Kubernetes service), which stores transactions in the RDS database.
An elevated error rate for the payment-service would be the first sign of trouble that triggers an alert to an on-call engineer. The engineer would have to hypothesize and diagnose several scenarios that might be causing the elevated errors:
Suppose the on-call engineer determines that excessive connections to the RDS instance overloaded the database, which increased the accounts-service response time and in turn caused the payment-service errors. While the immediate resolution might involve provisioning additional or larger RDS instances, deeper troubleshooting is required to determine why the RDS instance got into such a state in the first place. The overload may have been caused by poorly written queries, underlying AWS issues, software flaws (e.g. connections that were left open), or bad architecture (e.g. single points of failure).
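The chain of reasoning in this example, from the alerting service down to the database, amounts to walking a dependency graph. The following is a minimal sketch assuming the topology described above; the `DEPENDS_ON` map and the `downstream_entities` helper are hypothetical illustrations, not how Sumo Logic Observability represents or traverses entities.

```python
from collections import deque

# Hypothetical dependency map for the simplified banking app:
# each entity lists the entities it calls directly.
DEPENDS_ON = {
    "alb": ["payment-service"],
    "payment-service": ["accounts-service"],
    "accounts-service": ["rds"],
    "rds": [],
}


def downstream_entities(start, graph):
    """Return every entity reachable from `start`, nearest dependencies first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


# Entities worth inspecting when the payment-service alert fires.
suspects = downstream_entities("payment-service", DEPENDS_ON)
print(suspects)
```

Starting from the payment-service, the walk surfaces the accounts-service and then the RDS instance, which mirrors the order in which the on-call engineer would check them.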
Sumo Logic Observability generalizes the workflow implied in this example into the three broad stages of delivering reliability outcomes highlighted in the figure below:
Of course, none of this works without collecting logs, metrics, traces, and metadata at the application, microservices, cloud, orchestrator, and container layers. These datasets, by themselves, are merely silos. To accelerate troubleshooting, as shown in the example, the user should be able to connect the dots between logs, metrics, and traces by pivoting on entities (either services or resources): from the initial alert (an error log, in the example), to a microservice transaction trace (e.g. for the payment-service or accounts-service), to a metric for a Kubernetes pod or deployment or an AWS resource.
Sumo Logic Observability’s entity-driven workflow is at the core of capabilities for monitoring, diagnosing, and troubleshooting modern apps as described below.
Sumo Logic Observability combines logs, metrics, and trace datasets into a single platform and leverages an entity model that enables users to correlate signals between logs, metrics, and traces as they go from an alert to root cause. These entities are discovered automatically from the metadata across logs, metrics, and traces generated by the application and its infrastructure.
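To make the idea of pivoting on entities concrete, here is a minimal sketch that groups signals by a shared metadata field. The sample records, the `entity` field name, and the `group_by_entity` helper are all made up for illustration and do not reflect Sumo Logic's actual data model or API.

```python
from collections import defaultdict

# Made-up sample signals; each carries an "entity" metadata field.
SIGNALS = [
    {"type": "log", "entity": "payment-service", "detail": "HTTP 500 on POST /pay"},
    {"type": "trace", "entity": "payment-service", "detail": "slow call to accounts-service"},
    {"type": "trace", "entity": "accounts-service", "detail": "slow database query"},
    {"type": "metric", "entity": "rds", "detail": "database connections near limit"},
]


def group_by_entity(signals, key="entity"):
    """Index signals as entity -> signal type -> list of details."""
    index = defaultdict(lambda: defaultdict(list))
    for signal in signals:
        index[signal[key]][signal["type"]].append(signal["detail"])
    # Convert to plain dicts so missing lookups raise instead of silently creating keys.
    return {entity: dict(by_type) for entity, by_type in index.items()}


entities = group_by_entity(SIGNALS)
print(sorted(entities))               # entities discovered from the metadata
print(entities["payment-service"])    # logs and traces for one entity, side by side
```

Because every signal is keyed by the same entity name, an engineer can move from the error log for the payment-service to its traces, and from there to the metrics of the resources it depends on.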
As it relates to monitoring, Sumo Logic Observability now includes:
For diagnosing incidents, Sumo Logic Observability now includes:
For troubleshooting incidents, Sumo Logic Observability now includes the following advanced analytics innovations:
Underpinning these capabilities is expanded support for open source frameworks, including OpenTelemetry for tracing data and Telegraf for increasing the breadth of technologies we collect metrics from. Our existing Redis and NGINX apps are now enhanced to leverage both logs and metrics. We have also added new apps for JMX and for the NGINX Ingress Controller, a common component in Kubernetes stacks.
To support observability outcomes without breaking budgets, Sumo Logic Observability now includes the ability for customers to tier data based on analytics requirements, an industry-first credits-based licensing model for ultimate flexibility, and cardinality-independent pricing for ephemeral resources in container environments. The Sumo Logic platform is end-to-end encrypted, has a 24x7 security operations center, and is certified and attested for PCI-DSS, HIPAA, AICPA-SOC2, ISO 27001, GDPR, and FedRAMP (in progress).
In subsequent blog posts, we will delve into additional details for each of these capabilities.