October 31, 2024
Sumo Logic Mo Copilot is a natural language assistant that helps first responders derive insights from logs and resolve issues faster using contextual suggestions and plain English queries. It has been in preview since May 2024 with dozens of customers.
Choosing a foundation model was a critical step in its development. Let’s explore our high-level requirements for Copilot, the role of foundation models and the rationale for standardizing on Amazon Bedrock.
As a reminder, Amazon Bedrock “is a fully managed service that provides a single API to access and utilize various high-performing foundation models (FMs) from leading AI companies. It offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI practices”.
One of the critical first steps in the development of Sumo Logic Mo Copilot was to establish customer and user outcomes through formal indicators.
Early on, we realized that customers use Sumo Logic to get answers during a crisis, such as an application incident or potential security compromise, and recover quickly. Rather than requiring them to formulate a question, we chose to recommend insights based on our best understanding of the incident context. We refer to this feature as contextual suggestions. If required, users can ask a question in plain English.
In either case, the AI translates the suggestion or question into a Sumo Logic Log Analytics platform query. Contextual suggestions are a powerful way to democratize Sumo Logic insights for early-career users, while the natural language interface works well for practitioners who need a little help writing Sumo Logic log queries.
Given this troubleshooting and investigation context, our overall goal is to minimize resolution time for application and security incidents. Resolution time is difficult to measure directly, as incident response involves many steps, including forming hypotheses, collating multiple data sources, narrowing down to a resolution, and taking corrective action.
While Copilot influences some of these steps, by suggesting insights so users do not have to formulate hypotheses from scratch and by translating plain-English questions, it is not an incident response system. Instead, we chose to measure indicators that correlate with minimizing user frustration and delays during incidents, such as:
Accuracy of translations from natural language to Sumo Logic queries: 90%+ on labeled data
Click-through rate on contextual suggestions
Latency of suggestions and translations: p95 < 3 seconds
Other indicators include number of users per customer and number of queries per user per customer. If Mo promotes ease of use, we would expect to see high numbers for these indicators compared to log search, all other things being equal.
Unit economics was also a requirement: we wanted to make sure that the marginal query from a customer remained profitable for Sumo Logic.
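To make these indicators concrete, here is a minimal sketch of how they could be computed from an offline evaluation run; the record structure, field names, and per-token prices are illustrative assumptions, not our production pipeline or real Bedrock pricing:

```python
import math
import statistics

# Hypothetical evaluation records, one per labeled natural-language question.
# "correct" would come from comparing the generated query against the labeled query.
eval_records = [
    {"correct": True,  "latency_ms": 1450, "input_tokens": 900,  "output_tokens": 120},
    {"correct": True,  "latency_ms": 2100, "input_tokens": 1100, "output_tokens": 150},
    {"correct": False, "latency_ms": 2900, "input_tokens": 1300, "output_tokens": 180},
]

# Accuracy on labeled data (target: 90%+).
accuracy = sum(r["correct"] for r in eval_records) / len(eval_records)

# Nearest-rank p95 latency (target: under 3 seconds).
latencies = sorted(r["latency_ms"] for r in eval_records)
p95_latency_ms = latencies[math.ceil(0.95 * len(latencies)) - 1]

# Illustrative unit economics: the per-1K-token prices are placeholders only.
PRICE_PER_1K_INPUT, PRICE_PER_1K_OUTPUT = 0.00025, 0.00125
cost_per_query = statistics.mean(
    r["input_tokens"] / 1000 * PRICE_PER_1K_INPUT + r["output_tokens"] / 1000 * PRICE_PER_1K_OUTPUT
    for r in eval_records
)

print(f"accuracy={accuracy:.2f}  p95_latency_ms={p95_latency_ms}  cost_per_query=${cost_per_query:.5f}")
```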
Unlike many classical ML features in Sumo Logic, Copilot attracted privacy and compliance scrutiny from customers even during our requirements discovery process. Given the popularity of OpenAI and ChatGPT, customers understandably assumed that Copilot was powered by such a public commercial model. Customers valued the power of GenAI but wanted guarantees that none of their data would be accessible to the model provider. As a result, for Copilot, we were looking for GenAI models that would be compatible with our existing compliance posture.
Given the nascency and rapid pace of GenAI innovation, like many software companies, we were learning by doing. As a result, we were also keen to keep our options open as we learned more about open source models, fine tuning, RAG, agentic approaches, prompt routing, and other emerging trends. Regional availability had to align with Sumo Logic’s deployment strategy.
Lastly, as explained in our MLOps framework, production-grade AI features require hypothesis- and experimentation-driven iteration, with the ability to observe the AI in production with respect to customer KPIs and operational metrics.
To summarize, our requirements for a foundation model were:
Ability to meet latency, accuracy, and COGS goals
Compatible with data privacy and compliance requirements
Flexibility to handle emerging requirements
Regional footprint
Compatibility with Sumo Logic MLOps and Observability requirements
At the outset, we realized that Large Language Models (LLMs) were fundamentally different from the classical ML we were used to. The inner workings of LLMs are still the subject of active academic research. Unlike classical ML projects, our early hypothesis-driven experimentation focused on coaxing outcomes (accuracy with low latency) out of the technology rather than on understanding our problem domain and related datasets.
We felt that public benchmarks of LLM performance on standardized tasks were not meaningful for our use cases. So we conducted a bakeoff between three prominent commercial foundation models.
We curated an AWS/WAF dataset with labels for query translations and metrics for accuracy and latency. Accuracy was measured based on the ability of the LLM to translate queries according to a query complexity rubric: complex queries have five or more pipe ("|") delimiters, as explained in this blog. A key finding was that smaller language models (e.g. Anthropic Claude Haiku) had acceptable accuracy with dramatically lower latency and cost. Table 1 lists the results of our point-in-time assessment; note that foundation models have evolved significantly since then.
Table 1: Model leaderboard for AWS/WAF labeled dataset
Dataset: Complexity | Metric | Anthropic Claude Instant v1 | Anthropic Claude 2.1 | OpenAI GPT-4 Turbo
AWS/WAF: Low complexity | Latency (msec) | 2359 | 8100 | 2216
AWS/WAF: Low complexity | Accuracy (similarity) | 0.79 | 0.79 | 0.84
AWS/WAF: Medium complexity | Latency (msec) | 3734 | 12230 | 3631
AWS/WAF: Medium complexity | Accuracy (similarity) | 0.61 | 0.64 | 0.66
AWS/WAF: High complexity | Latency (msec) | 3026 | 12422 | 3558
AWS/WAF: High complexity | Accuracy (similarity) | 0.39 | 0.41 | 0.39
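As an illustration of the complexity rubric, the sketch below buckets a query by counting pipe delimiters. Only the five-or-more-pipes cutoff for complex queries comes from the rubric above; the low/medium boundary and the example query are assumptions for this sketch:

```python
def query_complexity(query: str) -> str:
    """Bucket a Sumo Logic log query by the number of pipe-delimited stages.

    The five-pipe threshold for "high" follows the rubric described above;
    the low/medium boundary is an illustrative assumption.
    """
    pipes = query.count("|")
    if pipes >= 5:
        return "high"
    if pipes >= 2:  # assumed boundary for this sketch
        return "medium"
    return "low"

# Hypothetical query with two pipe-delimited stages.
example = '_sourceCategory=aws/waf | json field=_raw "action" | count by action'
print(query_complexity(example))  # -> "medium"
```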
Next, as with all our AI/ML features, we undertook a legal and compliance review of Sumo Logic Mo Copilot. Hyperscalers (e.g. AWS, Azure) had recognized early that most customers were apprehensive about LLMs using customer data for model training purposes and had created privacy and confidentiality guarantees. For example, the Amazon Bedrock documentation states: “Amazon Bedrock helps ensure that your data stays under your control. When you tune a foundation model, we base it on a private copy of that model. This means your data is not shared with model providers, and is not used to improve the base models.”
As a result, it was apparent early on that licensing foundation models directly from their creators was not appropriate. For Sumo Logic, AWS is already an approved data sub-processor, so licensing Amazon Bedrock was no different in its compliance implications from the other AWS PaaS and IaaS services we use.
Amazon Bedrock’s abstraction of foundation models meant that we had the choice to assess and use best-in-class models, from commercial and open source providers, for our use cases should the need arise. This flexibility was a key advantage compared to the Azure OpenAI Service. In recent months, Amazon Bedrock has also matured with respect to fine-tuning and RAG approaches, and we plan to assess them for future releases of our Copilot.
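This abstraction shows up directly in the API: switching providers is largely a matter of passing a different model identifier. Below is a minimal sketch of a translation call using the Bedrock Converse API via boto3; the prompt is deliberately simplified and the model ID is just one example, not necessarily what Copilot uses:

```python
import boto3

# Bedrock runtime client; the region must be one where Bedrock and the chosen model are available.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def translate_to_query(question: str, model_id: str = "anthropic.claude-3-haiku-20240307-v1:0") -> str:
    """Ask a foundation model to translate a plain-English question into a log query.

    Swapping providers is mostly a change of model_id; the prompt here is a
    deliberately simplified placeholder, not Copilot's production prompt.
    """
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": f"Translate to a Sumo Logic log query: {question}"}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```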
Sumo Logic’s global deployment requirements aligned well with AWS’ roll out of Amazon Bedrock.
Finally, our classical ML features use a number of AWS PaaS services (e.g. Amazon SageMaker, Amazon Athena), so we had sufficient experience with ML observability on AWS. We plan to grow our competencies for incremental GenAI requirements. For example, our data security blog describes security monitoring best practices for Amazon Bedrock.
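As a small illustration of what that monitoring can look like, Bedrock emits per-model invocation metrics to Amazon CloudWatch. The namespace, metric, and dimension names below follow our reading of the AWS documentation and should be treated as assumptions to verify:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Pull average invocation latency for one model over the last hour (5-minute buckets).
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Bedrock",
    MetricName="InvocationLatency",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```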
In short, given its incumbency, flexibility and compliance posture, Amazon Bedrock was an easy choice.
Data and control flows in Sumo Logic Copilot.
In the figure above, blue components are Sumo Logic. Red components are AWS PaaS services, including Amazon Bedrock. The AI curates suggestions from the customer’s log sources by analyzing the schema and values referenced in them. A similar approach is used to power natural language questions posed by the user.
In either case, the system creates a Sumo Logic logs query and executes it like other log queries in the Sumo Logic platform. The network interactions within Sumo Logic conform to our Service Oriented Architecture (SOA) framework. Calls to AWS are via public AWS API, subject to security and compliance controls.
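Sketched in code, the flow looks roughly like the following; the class and function names are illustrative assumptions, not Copilot's actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class LogContext:
    """Schema and sample values gathered from the customer's selected log source."""
    source_category: str
    fields: list[str]
    sample_values: dict[str, list[str]]

def build_prompt(context: LogContext, question: str | None) -> str:
    """Combine log-source context with either a user question or a request for suggestions."""
    task = question or "Suggest useful queries for troubleshooting this log source."
    return f"Source: {context.source_category}\nFields: {', '.join(context.fields)}\nTask: {task}"

def copilot_flow(context: LogContext, question: str | None, llm, search_api):
    """Translate context (and an optional question) into a query, then run it like any other log search."""
    query = llm(build_prompt(context, question))  # call into Amazon Bedrock
    return search_api(query)                      # executed by the Sumo Logic log search backend
```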
Actual results for Copilot latency and accuracy on labeled datasets, comprising 5,000+ example translations from natural language to Sumo Logic queries, are in the table below. For context, we set out to achieve 90%+ translation accuracy, which we achieve for simple queries.
Translation complexity | Latency (msec) | Weighted Accuracy
Complex | 2,631 | 0.76 |
Moderate | 2,115 | 0.81 |
Simple | 1,512 | 0.97 |
Table 2: Sumo Logic Copilot metrics for the natural language to Sumo Logic query translation task
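One simple way to score a generated translation against its labeled reference is sequence similarity. The sketch below uses Python's difflib as an illustrative stand-in; it is not the scoring function behind the accuracy numbers above:

```python
from difflib import SequenceMatcher

def translation_similarity(generated: str, reference: str) -> float:
    """Score two queries by normalized token-sequence similarity (0.0 to 1.0).

    An illustrative stand-in for a production scoring function; real evaluation
    might also check that the generated query parses and returns the same results.
    """
    return SequenceMatcher(None, generated.split(), reference.split()).ratio()

# Hypothetical generated query vs. labeled reference query.
print(translation_similarity(
    "_sourceCategory=aws/waf | count by action",
    "_sourceCategory=aws/waf | count by action, clientIp",
))
```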
Amazon Bedrock has been a crucial enabler for time to market for Sumo Logic Mo Copilot by meeting our requirements, while also providing the foundation to evolve as we add additional use cases over time.
Learn more about how log analytics are vital for AI innovation.