March 2, 2023
There is something unique about how Sumo Logic CTO, Christian Beedgen, presents at events. At Illuminate, he expanded upon ideas he shared at SLOconf, turning reliability management into a logical and fundamentally humane solution.
I may not be as entertaining as Christian when he presents, but if you want the summary without the jokes or details, this blog is for you. Read on to learn why this approach to reliability management is a pragmatic way to deal with a wide range of challenges that companies face while implementing digital transformation. The full talk, memes included, is also available as a recording.
Wait! I know digital transformation can feel so passé. As geeks, nerds, and digital natives, we’ve all been thinking about digital transformation since before the before times. But the rest of the world is still catching up.
Digital transformation is sweeping the business landscape of the 21st century. Integrating digital technology into all areas of business changes how organizations operate and deliver value to customers. COVID’s impact accelerated this transformation as businesses were forced online to conduct operations.
Digital transformation is not just about technology. It’s also about a cultural change that requires organizations to continually challenge existing methods and systems, get comfortable with experimentation, and embrace failure as a means of learning and improving.
Digital interactions, such as online banking, food delivery, shopping, travel, and education, have dramatically changed how customers interact with businesses, and their expectations for those experiences. The digital customer experience is becoming increasingly central to our way of life.
As businesses shift into the digital world, the success of your business relies on the success of your digital experience. Providing a secure and reliable digital experience has become the main battlefield in inching ahead of the competition.
How is your digital experience being delivered? Through applications. The applications that drive your business can be easily defined: they are mission-critical and revenue-generating. Applications are directly linked to your digital experience – their reliability and security are directly linked to business success.
Among the many things a business needs to be for people, the most fundamental are reliable and secure.
What is reliability today? The pinnacle of a reliable digital experience is availability and performance at all times, globally. For example, buying online involves a complex ecosystem of interconnected applications that work together – and today’s customers are demanding, with little tolerance for downtime. Any perceived friction in the digital experience can quickly result in bounce or churn.
Security is another non-negotiable. With the growing volume of Personally Identifiable Information (PII) shared online, security is indispensable to delivering a seamless digital experience. It is critical to safeguard customer information from potential breaches or leaks. This is not only a moral obligation but also a requirement for self-preservation, as a security breach can lead to severe consequences for the business and the customer.
All of this is happening as widespread cloud adoption moves in lockstep with digital transformation. Today, cloud-native applications are what enable always-on global businesses.
Moving to the cloud improves efficiency across industries but also introduces new challenges in terms of reliability and security. The reality is that the digital experience is at the mercy of the reliability and security of cloud-native applications, and there is much to learn and unlearn in how we operate digital businesses in the cloud.
Achieving reliability takes work. As the second law of thermodynamics teaches us, the entropy of an isolated system never decreases: left alone, systems tend to degrade into a more disordered state. Chaos truly is the way of the universe.
As digital businesses, we must accept that chaos is an inherent aspect of our shared digital universe. This concept has significant implications for us – in running cloud-native apps, we are all responsible for various systems, whether they are commercially important or even life-preserving.
Operators and administrators today attempt to anticipate the chaos through monitoring.
I don't think simply alerting on all available resource consumption metrics in your infra is going to achieve anything other than burning your team to a crisp. And trust me, this is yet another one of those many things we have ourselves learned the hard way at Sumo. We went through many iterations of the on-call zombie apocalypse. Yes, we all like coffee but come on.
Monitoring can help you manage systems effectively, but it is reactive and focuses on known issues. Handling the chaos of running and managing cloud-native applications takes much more than this approach.
Observability is not just a 21st-century way of saying monitoring. Observability results from recognizing that chaos cannot be anticipated by relying on what we know.
In contrast to monitoring, observability invokes the mindset of anticipating the unknown. We achieve observability by instrumenting systems so that telemetry about their behavior is available. Today’s applications and systems are too complex to just attach a debugger.
By centralizing telemetry into a place where we can explore what the systems are likely to be doing, it’s possible to troubleshoot and understand what is going on when the universe tries to pull one over on us.
However, even with observability as it is viewed today, we have a limited understanding of the purpose of our activities and the tools we build and use to support them. The framing of Application Performance Management and Observability (APM and O) falls dramatically short of what truly matters. We can’t just blindly follow the false prophet of Observability.
The industry is prone to following a false prophet because all of this is nothing more than a means to an end. Sure, it is exciting and yes, it is absolutely critical but it is just a means. It is most definitely not the end… The way I see it, the end is reliability.
Ultimately, the outcome is what matters. And, the outcome that counts for a business in the digital age is to have a reliable digital experience for the apps that drive the business.
When we apply an outcome-focused lens, we are striving for reliability. Outcomes do not just materialize. We need to manage them, and we always want to start at the end and work our way backward.
We need to put our focus on reliability management.
But how can we get there? And what mindset shift do we need to make to achieve this?
As operators and administrators, we must excuse ourselves from the perceived need to look at the world in absolutes. In reality, we don't need to be absolutely reliable. This is how we achieve reliability – moving away from the idea of absolute reliability and aiming for a more pragmatic approach.
Learning from Google's Site Reliability Engineering credo, we already know that one hundred percent is the wrong reliability target.
Aiming for 100% reliability will slow down processes and create unnecessary hurdles between code and production. Competition is everywhere, looking to overtake you. This is the reality we are dealing with: a successful digital business must be agile, and its applications must be agile.
Keeping up ultimately boils down to your ability to ship code as fast as, if not faster than, your competitors. To compete in a fast-paced, digital-native environment, you need to be reliably reliable enough, and more reliable than your competitors.
When you aim for 100%, any change presents a risk to perfect availability and performance. Arbitrary hurdles are erected between you and your code going to production.
The universe doesn't wait for anyone.
So, how can we be the very best of good without falling into the trap of trying to be perfect?
We can sustain ourselves in a chaotic universe by adopting the Service Level Objective (SLO) approach, taking a page from Google's SRE. The SLO approach embraces the idea that perfection is not practical, and instead focuses on the customer experience.
It prioritizes what truly matters rather than worrying about every infrastructure layer.
We can create error budgets and troubleshoot when necessary by managing our applications against the nines instead of the perfect 100%. When we exceed our error budgets, we know we need to focus on improving the reliability of our system, identifying and prioritizing the parts of the system that need attention instead of trying to monitor and optimize every aspect of the infrastructure. This could mean making changes to the code, implementing new security measures, or upgrading infrastructure.
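To make "managing against the nines" concrete, here is a minimal sketch of how an availability target translates into an error budget. The function name `error_budget` and the sample targets are illustrative assumptions, not something taken from the talk or from any Sumo Logic API.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the amount of unreliability a target tolerates over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    (Hypothetical helper, for illustration only.)
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    # The budget is simply the part of the window that is NOT promised.
    return window * (1.0 - slo_target)


window = timedelta(days=30)
for target in (0.99, 0.999, 0.9999):
    minutes = error_budget(target, window).total_seconds() / 60
    print(f"target {target:.2%}: about {minutes:.1f} minutes of budget per 30 days")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes in which things are allowed to go wrong, while a 100% target leaves none at all.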
To illustrate, consider a 30-day sliding window. You take the small fraction left over when the target is subtracted from 100% (0.1% for a 99.9% target) and allocate it as a budget over time. In this example, you have a budget for the last 30 days, during which a small amount of failure is acceptable. This is a vital and fundamentally practical aspect of this methodology, as it means that you don't have to immediately alert people in the middle of the night for every minor observation of a failure.
As a result, the SLO methodology is fundamentally humane. It offers reliable digital customer experiences while giving your team freedom from the tyranny of the pager. (Or the Slack message, or the late-night text from your boss.)
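The sliding window can be sketched in a few lines. This is an assumption-laden illustration: the class name `SlidingWindowBudget`, the per-day request counts, and the request-based definition of "failure" are all hypothetical choices made for the example.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowBudget:
    """Track error budget consumption over the most recent N days.

    Hypothetical helper: each day is stored as (good_requests, total_requests).
    """

    def __init__(self, slo_target: float, window_days: int = 30) -> None:
        if not 0.0 < slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        self.slo_target = slo_target
        # A bounded deque drops the oldest day automatically, which is
        # what makes the window "slide".
        self.days: Deque[Tuple[int, int]] = deque(maxlen=window_days)

    def record_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def budget_remaining(self) -> float:
        """Fraction of the budget left: 1.0 is untouched, below 0 is overspent."""
        total = sum(t for _, t in self.days)
        if total == 0:
            return 1.0
        bad = total - sum(g for g, _ in self.days)
        allowed_bad = total * (1.0 - self.slo_target)
        return 1.0 - bad / allowed_bad


tracker = SlidingWindowBudget(slo_target=0.999)
for _ in range(29):
    tracker.record_day(good=100_000, total=100_000)  # quiet days
tracker.record_day(good=99_000, total=100_000)       # one rough day
print(f"budget remaining: {tracker.budget_remaining():.0%}")
```

In this made-up scenario, a single bad day uses up about a third of the month's budget. That is worth a conversation, but not necessarily a 3 a.m. page.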
At Sumo Logic, we are incredibly excited about the SLO approach and its potential to revolutionize reliability management. To channel this excitement, we are actively participating in the OpenSLO project, which aims to provide a vendor-independent way to express and manage SLOs as code. Our team is fully committed to supporting this project, and we are also hard at work incorporating the SLO methodology into our product.
Within Sumo, you can easily define SLOs, create monitors from them, and receive alerts when your error budget is being consumed. We also provide detailed overview dashboards for SLOs, allowing you to understand their current state and error budget consumption over time.
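One common way to decide when budget consumption deserves an alert is a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is a generic illustration of that idea and is not a description of how Sumo Logic monitors are implemented. The function names and the threshold of 14.4 (which corresponds to spending 2% of a 30-day budget in one hour) are assumptions chosen for the example.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate relative to the error rate the SLO allows.

    A value of 1.0 means the budget would last exactly one full window.
    (Hypothetical helper, for illustration only.)
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def should_page(bad: int, total: int, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page a human only when the budget is burning far faster than planned."""
    return burn_rate(bad, total, slo_target) >= threshold


# Counts observed over a short lookback, for example the last hour (made-up numbers).
print(should_page(bad=1, total=1_000, slo_target=0.999))   # slow burn
print(should_page(bad=20, total=1_000, slo_target=0.999))  # fast burn
```

The design choice here is that a few failures at the expected rate stay silent, and only a burn fast enough to threaten the whole window wakes someone up.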
The fast-paced nature of the digital world can be overwhelming. You can stay ahead of the curve by focusing on reliability as an outcome and managing toward it.
Beating the universe is not easy, but by adopting a reliability management mindset, you can improve the digital experience for your customers and stay ahead of the chaos. The SLO approach is a practical and effective methodology for achieving this goal.
Let's embrace the radical realism required to sustain ourselves in a chaotic universe and start managing for reliability.
Download our guide to reliability management for more practical advice on how to get started.