Ensuring scalability, performance, and cost-effectiveness is a constant challenge for cloud-native log management and observability. At Sumo Logic, we faced this challenge head-on by transitioning from a stateful, Lucene-based architecture to a completely stateless, Parquet-based architecture. This transformation lets us improve data storage efficiency, reduce operational complexity, and meet the demands of an ever-increasing data scale.
Let’s walk through the journey of how we made this architectural shift, the reasoning behind it, the challenges we faced, and the benefits that come from adopting a stateless architecture powered by Parquet.
The starting point: Lucene-based architecture
We used to create Lucene indexes from customers’ logs. Lucene is a widely used search engine library that makes querying data fast. Because each Lucene index includes an inverted index, we achieved a compression of only 4x: for every 100GB of raw logs, we created a 25GB index.
Downloading these indexes took time, so we had to cache them locally on disk. Because 95% of queries covered the last 15 days, we cached only the last 15 days of indexes. As a result, we maintained two clusters: cached and cacheless. The cached cluster mainly served queries for the past 15 days. If the cached cluster was unable to serve a query, we routed it to the cacheless cluster instead. The cacheless cluster therefore served all queries beyond 15 days and acted as a fallback for the cached cluster.
Because of this stateful architecture, we faced several challenges:
Scalability
To cache the last 15 days of indexes, we provisioned hardware for each customer according to their daily ingest. But some customers ran so many queries that the cached cluster could not serve them all. Because cached clusters are not autoscalable, the overflow queries went to the cacheless cluster, which made them slow. Those slow queries triggered our internal alerts and made our on-call load very high.
Operational overhead
In cases where queries were not getting served via the cached cluster, the customers would complain of slow performance and we had to provision more hardware for them. This led to a lot of operational overhead for the engineering team.
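The routing behaviour described above can be sketched in a few lines. This is only an illustration: the 15-day window comes from the description above, while `CachedCluster`, `route_query`, and the simple capacity model are hypothetical names and assumptions, not the actual Sumo Logic implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# The 15-day window is taken from the article; everything else is assumed.
CACHE_WINDOW = timedelta(days=15)


@dataclass
class CachedCluster:
    """Fixed-size cluster holding recent indexes on local disk (hypothetical model)."""
    capacity: int        # concurrent queries the provisioned hardware can serve
    in_flight: int = 0   # queries currently running

    def has_capacity(self) -> bool:
        return self.in_flight < self.capacity


def route_query(query_start: datetime, now: datetime, cached: CachedCluster) -> str:
    """Return the cluster that serves a query whose time range begins at query_start."""
    if now - query_start > CACHE_WINDOW:
        return "cacheless"  # data older than 15 days is never cached
    if cached.has_capacity():
        return "cached"     # fast path: the index is already on local disk
    return "cacheless"      # fallback: the index must be downloaded first


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
idle = CachedCluster(capacity=2)
busy = CachedCluster(capacity=2, in_flight=2)

recent_when_idle = route_query(now - timedelta(days=3), now, idle)
recent_when_busy = route_query(now - timedelta(days=3), now, busy)
old_query = route_query(now - timedelta(days=30), now, idle)
```

In this sketch the same three-day-old query lands on a different cluster depending only on how busy the fixed-size cached cluster is, which is the source of both the scaling problem and the uneven query times.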
The transition: moving to a stateless, Parquet-based architecture
Recognizing these challenges, we decided to pivot to a stateless architecture using Parquet as the primary data storage format. This shift marked a significant change in how Sumo Logic managed, stored, and queried data.
Why Parquet?
Apache Parquet is a columnar storage format optimized for analytics use cases, making it a natural fit for log management and observability. Its design allows for efficient compression and encoding schemes, which translates into reduced storage costs and faster query performance.
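To make the columnar idea concrete, here is a toy sketch of why grouping values by column helps encoding. It is not Parquet's implementation: Parquet's real dictionary and run-length/bit-packing encodings are more elaborate, and the sample rows and the helper names (`to_columns`, `run_length_encode`, `dictionary_encode`) are invented for illustration.

```python
from itertools import groupby

# Hypothetical log records, stored row by row.
rows = [
    {"level": "INFO", "service": "api", "status": 200},
    {"level": "INFO", "service": "api", "status": 200},
    {"level": "INFO", "service": "api", "status": 200},
    {"level": "WARN", "service": "api", "status": 200},
    {"level": "WARN", "service": "auth", "status": 500},
    {"level": "ERROR", "service": "auth", "status": 500},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per field."""
    return {key: [record[key] for record in records] for key in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer code into a dictionary of distinct values."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


columns = to_columns(rows)
level_runs = run_length_encode(columns["level"])
# -> [("INFO", 3), ("WARN", 2), ("ERROR", 1)]
level_dictionary, level_codes = dictionary_encode(columns["level"])
# -> (["INFO", "WARN", "ERROR"], [0, 0, 0, 1, 1, 2])
```

Once the values of one field sit next to each other, long runs and small sets of distinct values become visible to the encoder. In a row layout the same values are interleaved with unrelated fields, so such simple encodings have much less to work with.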
Parquet’s columnar storage format provides better compression and performance compared to Lucene. By storing data in columns, similar data types are grouped together, allowing for more efficient compression. This reduces the storage footprint significantly, which is crucial for handling the large volumes of data Sumo Logic ingests. With Parquet we could achieve 16x compression, which means that for every 100GB of raw data we create a Parquet file of roughly 6GB. This is what unlocked the stateless architecture: with far smaller files, we had much less need to keep the data cached.
Since moving completely to a cacheless architecture, we have observed several benefits:
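The arithmetic behind these figures is simple enough to check directly. The sketch below uses only the two compression ratios quoted in this article (4x for Lucene, 16x for Parquet); the function name `stored_size_gb` is made up for the example.

```python
def stored_size_gb(raw_gb: float, compression_ratio: float) -> float:
    """Size on disk for a given amount of raw data and a compression ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_gb / compression_ratio


raw_gb = 100
lucene_gb = stored_size_gb(raw_gb, 4)     # 25.0 GB index
parquet_gb = stored_size_gb(raw_gb, 16)   # 6.25 GB Parquet file
reduction = 1 - parquet_gb / lucene_gb    # 0.75, i.e. 75% less data to store and fetch
```

Going from 4x to 16x means each query has a quarter as many bytes to fetch for the same raw data, which is why local caching matters so much less.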
Scalable
This architecture is completely auto-scalable. We moved from EC2 instances to Kubernetes, making autoscaling even faster.
No operational overhead
Now that the architecture is auto-scalable, we no longer need to provision hardware. To scale, we only need to set minReplicas and leave the rest to K8s.
Performance predictability
In the earlier architecture, query performance was not predictable. The same query over the same time range could take very different amounts of time, depending on whether it was served from the cached or the cacheless cluster. With the new architecture, we were able to bring predictability to query performance and remove such discrepancies.
Unlocking Flex Licensing
This architecture allowed us to unlock a new scan-based pricing model in Flex. When we provisioned hardware on the basis of ingest, it was extremely hard to charge customers for the data they scanned. With this architecture change, we can charge based on actual usage.
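For readers unfamiliar with how Kubernetes decides on a replica count, the core rule documented for its HorizontalPodAutoscaler can be sketched as follows. This is an assumption-laden illustration: the article only mentions minReplicas, so the use of a HorizontalPodAutoscaler, the `max_replicas` bound, and all metric values below are hypothetical. The sketch also leaves out the tolerance and stabilization behaviour of the real controller.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Scale proportionally to metric pressure, then clamp to the configured bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical numbers: the target is 60% average utilisation per pod.
scale_up = desired_replicas(4, 90, 60, min_replicas=2, max_replicas=20)    # 6
scale_down = desired_replicas(4, 10, 60, min_replicas=2, max_replicas=20)  # 2 (floor)
capped = desired_replicas(10, 300, 60, min_replicas=2, max_replicas=20)    # 20 (ceiling)
```

The lower bound keeps a baseline of query capacity warm even when load is light, while everything above it follows demand rather than a per-customer hardware estimate.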
Final thoughts
Moving from a stateful Lucene-based architecture with 15-day cached data on disk to a stateless Parquet-based system has been a monumental shift at Sumo Logic. The transition has provided us with a more scalable, resilient, and cost-effective platform that can keep pace with the growing demands of modern observability and log management.
At Sumo Logic, we are committed to continuous innovation, and this transition is just one of the many steps we’re taking to build a platform that delivers unmatched performance, scalability, and reliability for our customers.
Discover more about our latest product innovations on our what’s new page, or start your free trial to try it for yourself.