January 28, 2025 By Sakshi Jain

Generative AI QE: Insights from testing Sumo Logic Mo Copilot

Generative AI is transforming industries by automating tasks and delivering AI tools, such as the Sumo Logic Mo Copilot AI assistant, to enhance operational efficiency. But these advancements also challenge traditional quality engineering (QE) methodologies.

Unlike conventional software testing, AI models produce dynamic, context-sensitive outputs, requiring a new approach to validation and testing.

At Sumo Logic, we faced these challenges firsthand while testing Mo Copilot. So, how did we streamline best practices for QE when designing a new AI solution? Let’s walk through the strategies we implemented, the lessons we learned, and how they contributed to delivering an optimal AI assistant for all your log search needs.

What is Sumo Logic Mo Copilot?

Sumo Logic Mo Copilot is an AI-powered assistant designed to help you gain insights from logs and resolve issues faster using natural language queries. Whether you need insights or to troubleshoot issues, Mo Copilot converts plain English questions into accurate Sumo Logic queries.

Mo Copilot also provides Explore suggestions, which are recommended queries based on your selected source category, such as AWS WAF. While these features are user-friendly and improve efficiency, they also pose unique testing challenges that require a new testing approach.
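To make that concrete, here is a purely illustrative example of the kind of translation involved; both the question and the query below are hypothetical examples written for this post, not output captured from Mo Copilot:

```python
# Hypothetical illustration of the translation Mo Copilot performs:
# a plain-English question paired with the kind of Sumo Logic query
# it might produce. Neither side is taken from the product itself.
example = {
    "question": "How many requests did AWS WAF block in the last hour, by client IP?",
    "generated_query": (
        '_sourceCategory=aws/waf "BLOCK" '
        '| json "action", "httpRequest.clientIp" as action, client_ip '
        '| where action = "BLOCK" '
        '| count by client_ip | sort by _count'
    ),
}

print(example["question"])
print(example["generated_query"])
```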

Challenges faced while testing Sumo Logic Mo Copilot

Testing features of Generative AI models, such as Mo Copilot, differs fundamentally from traditional testing due to the subjective and dynamic nature of their outputs.

Some of the key challenges we faced include:

  • Subjectivity and dynamic outputs: By nature, natural language processing (NLP) models interpret and generate responses that vary significantly depending on input phrasing. Ensuring consistent and accurate outputs was a significant hurdle.

  • Diverse input handling: Customers can provide random and unexpected inputs, making it crucial to test edge cases effectively and handle the inputs at the system level.

  • Test data generation: Creating data to cover various scenarios, including valid, invalid, and borderline cases, was a continuous effort.

  • Scalability: Testing required simulating real-world environments with large-scale data ingestion while ensuring consistent performance and accuracy.

  • Defining metrics: Establishing relevant metrics to identify regressions and ensure output quality posed unique challenges.

  • Efficient issue reporting: Logging every translation failure or suggestion mismatch in Jira was not feasible.

Strategies we implemented to maintain QE best practices

To address the challenges above, we employed several new testing approaches to adhere to QE best practices.

Reverse prompt engineering

Testing a generative AI model requires diverse and realistic data. We leveraged two key approaches for this:

  • Using Sumo Logic’s App Catalog, which houses a rich collection of dashboards and queries, we reverse-engineered NLP inputs to simulate real-world use cases. This helped us ensure our test data closely mirrored customer scenarios, providing more accurate and reliable outputs.

  • Beyond using in-product models, we also explored external models like ChatGPT to generate synthetic data. While Mo Copilot utilized AWS Bedrock’s model, we used ChatGPT to produce alternative test cases. This approach helped diversify our dataset and allowed us to validate the model’s robustness without overlap (a minimal sketch of this follows the list).
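As a sketch of that second approach, the snippet below asks an external model to reverse-engineer the natural-language question a user might have asked to arrive at an existing query. The OpenAI client usage, model name, prompt wording, and the sample query are illustrative assumptions, not the tooling we actually ran:

```python
# Sketch: reverse prompt engineering with an external model.
# Given an existing query (for example, one pulled from the App Catalog),
# ask a general-purpose LLM to write the natural-language question a user
# might have asked to produce it. Model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_nl_question(sumo_query: str) -> str:
    """Return a plausible plain-English question for a given Sumo Logic query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You write the natural-language question a Sumo Logic "
                        "user would ask to produce a given log search query."},
            {"role": "user", "content": f"Query:\n{sumo_query}\n\nQuestion:"},
        ],
    )
    return response.choices[0].message.content.strip()

catalog_query = '_sourceCategory=aws/waf | where action = "BLOCK" | count by client_ip'
print(generate_nl_question(catalog_query))
```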

Performance testing with synthetic data

Testing across more than 200 Sumo Logic apps required simulating large-scale environments. Using an in-house tool, we continuously generated and ingested synthetic data into one organization. While manual testing covered 4–5 apps, this automation helped us evaluate performance at scale and detect bottlenecks effectively.
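We can’t share the in-house tool itself, but a minimal sketch of continuous synthetic ingestion looks roughly like the following, assuming a Sumo Logic HTTP Logs Source endpoint; the event shape, batch size, and endpoint placeholder are illustrative only:

```python
# Minimal sketch of continuous synthetic log ingestion into a Sumo Logic
# HTTP Logs Source. The endpoint URL, log shape, and ingestion rate are
# placeholders; the real in-house tool covered 200+ app schemas.
import json
import random
import time

import requests

HTTP_SOURCE_URL = "https://endpoint.collection.sumologic.com/receiver/v1/http/REPLACE_ME"

def synthetic_waf_event() -> dict:
    """Fabricate one AWS-WAF-like event (fields are illustrative)."""
    return {
        "timestamp": int(time.time() * 1000),
        "action": random.choice(["ALLOW", "BLOCK"]),
        "httpRequest": {"clientIp": f"10.0.0.{random.randint(1, 254)}"},
    }

def ingest_batch(batch_size: int = 100) -> None:
    """POST a newline-delimited batch of synthetic events to the HTTP Source."""
    payload = "\n".join(json.dumps(synthetic_waf_event()) for _ in range(batch_size))
    resp = requests.post(HTTP_SOURCE_URL, data=payload,
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()

if __name__ == "__main__":
    while True:          # run continuously to simulate steady ingestion
        ingest_batch()
        time.sleep(1)
```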

Sumo-on-Sumo feedback loop

Instead of manually logging issues in Jira, we built a feedback mechanism within the Sumo Logic UI using thumbs-up and thumbs-down buttons. Feedback from these interactions was automatically logged to Sumo Logic dashboards, allowing developers to analyze and address issues efficiently and ensuring rapid feedback and continuous improvement.
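Conceptually, the loop is simple: the UI handler turns each thumbs-up or thumbs-down click into a structured log event that Sumo Logic itself ingests and charts. The sketch below assumes a hypothetical event schema and an HTTP Source endpoint; it is not the actual feedback pipeline:

```python
# Sketch of the "Sumo-on-Sumo" feedback loop: a thumbs-up/down click is
# emitted as a structured log event, ingested back into Sumo Logic, and
# charted on a dashboard. Field names and the endpoint are illustrative.
import json
import time

import requests

FEEDBACK_SOURCE_URL = "https://endpoint.collection.sumologic.com/receiver/v1/http/REPLACE_ME"

def record_feedback(question: str, generated_query: str, thumbs_up: bool) -> None:
    """Send one feedback event for a translation the user just rated."""
    event = {
        "timestamp": int(time.time() * 1000),
        "feature": "copilot_translation",
        "question": question,
        "generated_query": generated_query,
        "rating": "thumbs_up" if thumbs_up else "thumbs_down",
    }
    requests.post(
        FEEDBACK_SOURCE_URL,
        data=json.dumps(event),
        headers={"X-Sumo-Category": "copilot/feedback"},  # hypothetical source category
    ).raise_for_status()
```

A dashboard panel scoped to that source category can then track the thumbs-down rate over time and surface translation failures without anyone filing a ticket.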

Guardrails for relevance

We implemented strict guardrails to ensure Mo Copilot only responded to Sumo Logic-related queries. Testing these boundaries was critical to prevent irrelevant or misleading responses, safeguarding the product’s usability and customer trust.
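In practice, testing those boundaries boils down to asserting that off-topic prompts are declined rather than translated. A minimal pytest-style sketch, assuming a hypothetical copilot_client fixture whose translate() method reports refusals:

```python
# Sketch of guardrail tests: off-topic prompts should be declined, not
# translated into queries. The `copilot_client` fixture and the shape of
# its translate() result are hypothetical.
import pytest

OFF_TOPIC_PROMPTS = [
    "What's the weather in Denver today?",
    "Write me a poem about databases.",
    "Who won the football game last night?",
]

@pytest.mark.parametrize("prompt", OFF_TOPIC_PROMPTS)
def test_off_topic_prompts_are_declined(prompt, copilot_client):
    result = copilot_client.translate(prompt)
    assert result.declined, f"Expected a refusal for off-topic prompt: {prompt!r}"
```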

Golden data for regression testing

Regression is a key concern with AI models. To mitigate this, we created a golden dataset and integrated it into our automated testing pipeline. Each new prompt or model update was evaluated against this dataset to ensure consistent performance.
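A minimal sketch of wiring such a golden dataset into an automated suite follows; the JSONL file format, field names, and the copilot_client fixture are assumptions for illustration, not our actual pipeline:

```python
# Sketch of golden-data regression testing: each (question, expected query)
# pair in the golden set is re-run against the current model/prompt version.
# The JSONL format and the `copilot_client` fixture are illustrative.
import json

import pytest

def load_golden_set(path: str = "golden_set.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]

@pytest.mark.parametrize("case", load_golden_set(), ids=lambda c: c["question"][:40])
def test_golden_translation(case, copilot_client):
    generated = copilot_client.translate(case["question"]).query
    # Exact-match is the simplest check; in practice a semantic or
    # execution-based comparison is more forgiving of cosmetic differences.
    assert generated == case["expected_query"]
```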

Metrics for relevance and accuracy

Collaborating closely with the Sumo Logic AI Model team, we established clear metrics, such as relevance scores and accuracy rates, to monitor the model’s performance and identify issues early. Some additional metrics the team added include the following (a sketch of computing a few of them appears after the list):

  • Velocity: The speed at which the AI model processes and delivers accurate results.

  • Over-specificity: The tendency of the AI model to provide answers that are overly detailed or narrow beyond the user's intent.

  • Diversity: The range and variety of responses or outputs generated by the AI model.

  • Relevance: The degree to which the AI model's output aligns with the user's query or intended context.

  • Execution: The AI model's ability to perform tasks or generate outputs without errors or inefficiencies.
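As a rough illustration of how a few of these can be tracked over an evaluation run, the sketch below aggregates per-case results into summary numbers; the field names and scoring are assumptions, not the team's actual definitions:

```python
# Sketch of computing simple evaluation metrics over a batch of results.
# `results` is a hypothetical list of dicts produced by an evaluation run;
# field names and scoring are illustrative.
from statistics import mean

def summarize(results: list[dict]) -> dict:
    return {
        # share of generated queries that parsed and ran without error ("execution")
        "execution_rate": mean(1.0 if r["executed_ok"] else 0.0 for r in results),
        # share of runs whose output matched the golden expectation
        "accuracy_rate": mean(1.0 if r["matches_golden"] else 0.0 for r in results),
        # average human- or model-assigned relevance score (0..1)
        "relevance": mean(r["relevance_score"] for r in results),
        # crude diversity proxy: fraction of distinct queries generated
        "diversity": len({r["generated_query"] for r in results}) / len(results),
        # average end-to-end latency in seconds (a proxy for "velocity")
        "avg_latency_s": mean(r["latency_s"] for r in results),
    }
```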

Dogfooding and early customer previews for real-world validation

Internal testing and early customer previews were instrumental in gathering diverse feedback and ensuring Mo Copilot performed at its best. We identified edge cases and improved the model based on real-world usage patterns.

Key lessons learned from QE testing

  1. Fast iteration is crucial
    AI models are updated frequently, and prompt changes can have cascading effects. Establishing a robust automation pipeline with golden data is essential to catch regressions early.

  2. Metrics matter
    Clear and well-defined metrics are vital for monitoring AI model performance and maintaining quality. Close collaboration with the AI model team allowed us to refine these metrics.

  3. Centralized monitoring
    We created a common Sumo Logic dashboard to monitor critical KPIs, such as API performance, model performance, and accuracy metrics. This gave the team a unified view of the system's health and facilitated rapid response to any anomalies.

  4. Adaptability in QE
    Testing Generative AI models requires a shift from traditional QE methodologies. Teams must embrace new strategies such as reverse prompt engineering, continuous feedback mechanisms, and scalable synthetic testing to keep pace with the dynamic nature of AI systems.

Experience Mo Copilot in action

As Generative AI continues to evolve, so too will our testing strategies. Testing Sumo Logic Mo Copilot taught us invaluable lessons about adapting to the unique challenges of Generative AI.

By leveraging techniques like reverse prompt engineering and Sumo-on-Sumo feedback loops, we ensured that Mo Copilot achieved and maintained the standard of accuracy, relevance, and scalability that our customers expect.

Want to see Mo Copilot in action for yourself? Try Sumo Logic today with our 30-day free trial.

Sakshi Jain

Senior Manager Engineering - QE

Sakshi is a Senior Quality Engineering Manager with over 13 years of experience in testing mobile apps and SaaS platforms. Passionate about testing and solving challenging problems, she leverages innovative approaches to ensure quality in modern software solutions.

Outside of work, she enjoys spending time with family or traveling to explore different cultures.
