New standards for observability and reliability

Not so long ago in my career, it was relatively acceptable for infrastructure or development teams to solve a problem by rebooting a server or just “turning things off and on again.” It didn’t matter what caused the problem or how long the reboot would keep things fixed, provided they were fixed for now.

Security teams were always held to a different standard. It wasn’t good enough to say that things were now secure or that they didn’t know why or how a security incident had occurred. Security teams were accountable for knowing what happened, how it was fixed, and how that same vulnerability could be prevented in the future.

Increasingly, engineering teams responsible for the reliability and performance of critical, customer-facing applications are being held to this same level of accountability. This is where adopting a DevSecOps practice can be crucial to an organization.

Why are standards for observability and reliability increasing?

As organizations have embraced digital transformation, more and more of an organization’s revenue, reputation, and overall success are tied to mission-critical applications. Whether it’s a payment portal, a shopping cart, or a single button that allows a user to share an experience with others, it’s vital that your technology works and – from the end user’s perspective – works flawlessly.

Whether your application goes down because of a bug that slipped unnoticed into production or because of an exploited, previously unidentified vulnerability, the result is the same: the end user’s experience suffers. And that’s just the beginning. We often hear stories about the cost of downtime, with estimates ranging from over $6k per minute to $16k per minute (roughly $1 million per hour) or more, depending on your industry and size.
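To make the per-minute figures concrete, here is a minimal sketch of the arithmetic. It assumes a flat per-minute cost, which is a simplification, and the function name `downtime_cost` is purely illustrative.

```python
def downtime_cost(minutes: float, cost_per_minute: float) -> float:
    """Estimate the cost of an outage, assuming a flat per-minute rate."""
    if minutes < 0 or cost_per_minute < 0:
        raise ValueError("minutes and cost_per_minute must be non-negative")
    return minutes * cost_per_minute


# Per-minute estimates quoted above, in US dollars.
LOW_ESTIMATE = 6_000
HIGH_ESTIMATE = 16_000

hourly_low = downtime_cost(60, LOW_ESTIMATE)
hourly_high = downtime_cost(60, HIGH_ESTIMATE)
print(f"One hour of downtime: ${hourly_low:,.0f} to ${hourly_high:,.0f}")
```

At the high end, 60 minutes at $16k per minute comes to $960,000, which is where the "roughly $1 million per hour" figure comes from.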

This cost can come from lost revenue, tarnished reputation, diminished customer trust and confidence, and other brand impacts. In fact, as technology has expanded into almost every industry, customers have higher expectations for their digital experiences, and it’s even easier for them to switch solutions after a negative experience.

At Sumo Logic, a recent IDC ROI report on our business value showed an 82% reduction in unplanned downtime. When every moment of delay costs money, reputation, and valued customers, reliability, security, and observability are non-negotiable.

That said, chaos and unpredictability are widespread. No matter how well you build and secure your business, you’ll need to anticipate the unknown and prepare for increasing complexity. That is the foundation of modern observability and security.

What are the new standards for reliability and observability?

As a CEO, the first question I’ll ask if there’s a problem is, of course, “How can I help?” But soon thereafter, it’s “What happened?” and “How can we make sure this doesn’t happen again?”

In the security world, these questions are typically answered by investigating audit logs, conducting forensic analysis, identifying the root cause, and analyzing how the organization should modify its security posture to avoid similar incidents. Now, observability teams need to do the same.

Uncovering the root cause of infrastructure and application reliability issues has become increasingly challenging. Maintaining highly resilient, cloud-native, cost-effective applications at scale in a way that meets the expectations of modern customers is well beyond a human-scale problem. It is a machine-scale problem and arguably an AI-scale problem for bleeding-edge applications.

Optimizing reliability with accountability requires the right telemetry, AI and machine learning, as well as real-time customer journey insights mapped to clear business objectives. Ultimately, business outcomes matter most, and engineers have used SRE practices to reverse engineer what Sumo Logic calls reliability management.

Telemetry – Metrics give you directional input on where an issue is occurring, and traces give you directional input on which part of your stack may be contributing to the issue as it relates to customer transactions. Logs, however, are the critical telemetry that provides the atomic-level insights needed to identify the actual root cause of your reliability issues.
AI and ML – When we say that apps are complex and only growing more complicated, just think of Netflix with over 1,000 microservices or the staggering rate of change behind Amazon’s tens of millions of deployments per year. Organizations grapple with so much data that AI must assist in all parts of the monitoring and troubleshooting lifecycle, including data correlation, anomaly detection, root cause analysis, change intelligence, and even incident remediation. The goal is to minimize MTTR so that organizations can serve their customers and grow their business.
Real-time insights – Technical teams need true reliability management to build powerful log- and metric-based SLIs that measure and report on reliability, customer impact, and business objectives. Without measuring how customers are impacted, you can only report on metrics that don’t map well to business objectives. For example, reporting that 25 percent of customers in EMEA had a negative experience purchasing a product, and that five percent of those customers left the website with their cart open, is a much better way to communicate business impact against objectives than reporting a “five-minute outage with the CartService.”
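As a rough illustration of a log-based SLI that reports customer impact rather than service uptime, the sketch below computes the two figures from the EMEA example out of parsed checkout log events. It is not Sumo Logic’s API: the `CheckoutEvent` fields, the `impact_summary` function, and the sample data are all hypothetical and assume each log line has already been parsed into a structured record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutEvent:
    """One parsed checkout log record (hypothetical schema)."""
    customer_id: str
    region: str
    succeeded: bool       # did the purchase complete without error?
    abandoned_cart: bool  # did the customer leave with items still in the cart?


def impact_summary(events, region):
    """Summarize customer impact for one region from checkout events."""
    in_region = [e for e in events if e.region == region]
    customers = {e.customer_id for e in in_region}
    impacted = {e.customer_id for e in in_region if not e.succeeded}
    abandoned = {
        e.customer_id
        for e in in_region
        if not e.succeeded and e.abandoned_cart
    }
    return {
        "customers": len(customers),
        # Share of customers in the region with at least one failed checkout.
        "impacted_pct": 100 * len(impacted) / len(customers) if customers else 0.0,
        # Share of impacted customers who also left with an open cart.
        "abandoned_pct_of_impacted": (
            100 * len(abandoned) / len(impacted) if impacted else 0.0
        ),
    }


# Synthetic sample: 80 EMEA customers, 20 failed checkouts, 1 abandoned cart.
events = [
    CheckoutEvent(f"c{i}", "EMEA", succeeded=(i >= 20), abandoned_cart=(i == 0))
    for i in range(80)
] + [CheckoutEvent("a1", "AMER", succeeded=True, abandoned_cart=False)]

summary = impact_summary(events, "EMEA")
print(summary)
```

Counting distinct customers rather than raw events keeps a single customer who retries several times from inflating the impact figure, which is one reasonable design choice among several.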

Sumo Logic is built to support teams across the DevSecOps lifecycle. Sometimes this work is centralized purely in observability teams or security teams, but we’ve learned from experience that customers who center on shared data built on logs are typically the most successful.

As observability teams evolve and enhance monitoring, organizations need to hold development, security and operations teams to the same standards of reliability. I’m excited to see how improved observability and security evolve.

Learn more about monitoring and troubleshooting with Sumo Logic’s SaaS Log Analytics Platform.

From “rebooting” to reliable and secure applications: Optimizing the customer experience
