Reduce downtime and move from reactive to proactive monitoring.
November 27, 2023
It’s time to stop firefighting. With Sumo Logic’s AWS Observability, companies like Snoop have been able to simplify data collection, achieve unified visibility across AWS accounts and regions and leverage machine learning to troubleshoot — fast.
This re:Invent, we’re excited to showcase how our capabilities for AWS have evolved. Offering a unified approach to monitoring and troubleshooting for AWS, Sumo Logic lets DevOps and SRE teams improve the reliability of their services and cut troubleshooting toil in just a few clicks.
Looking for lightning-speed troubleshooting? Here’s how Sumo Logic can help you find the root cause and reclaim your time.
In the fast-paced world of e-commerce, timely order processing and inventory updates are crucial for maintaining customer satisfaction. But what happens when an efficient, serverless architecture starts showing intermittent delays?
Here, the order processing and inventory update system for our e-commerce site uses Amazon SQS for queuing orders, AWS Lambda for the core business logic, and Amazon RDS as the persistent data store. Customers report intermittent delays when placing orders and during checkout.
To understand what might be going wrong, you first need a centralized view of your AWS environment that brings together your relevant logs and metrics. With AWS observability, you unlock a comprehensive view across your AWS accounts, regions and individual namespaces. This content is provided out of the box after deploying the solution via the CloudFormation template or Terraform.
AWS observability comes with pre-built alerts for different AWS services, including Amazon SQS, AWS Lambda, and Amazon RDS. These alerts can notify you about the issue with the e-commerce site. In our example, the “Amazon SQS - Message processing not fast enough” alert was triggered.
From the alert, you can determine the characteristics of the issue: how often it triggers, how long it has been unresolved, and other relevant details. You can also see how long messages wait in the queue before they are processed.
Now, with this knowledge, the troubleshooting begins.
You start your investigation by diving into SQS, where messages from the Order Processing Service are queued. CloudWatch metrics for SQS provide the first clues.
You observe that NumberOfMessagesSent is much higher than NumberOfMessagesReceived, indicating that messages are being queued faster than they are being consumed. The ApproximateAgeOfOldestMessage metric shows that some messages have been in the queue for a long time, which could indicate a bottleneck.
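The reasoning above can be sketched as a small check over one period of SQS metrics. This is only an illustration: the `QueueSnapshot` type, the `backlog_signals` helper, the 300-second age threshold, and the sample numbers are all assumptions made for this example, not part of Sumo Logic or AWS tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """One period of SQS CloudWatch metrics (hypothetical container type)."""
    number_of_messages_sent: int
    number_of_messages_received: int
    approximate_age_of_oldest_message: float  # seconds


def backlog_signals(snapshot: QueueSnapshot, max_age_seconds: float = 300.0) -> list[str]:
    """Return the reasons a queue looks backed up; an empty list means healthy."""
    signals = []
    # More messages sent than received means the queue is growing.
    if snapshot.number_of_messages_sent > snapshot.number_of_messages_received:
        signals.append("producers are outpacing consumers")
    # An old head-of-queue message means consumers are falling behind.
    if snapshot.approximate_age_of_oldest_message > max_age_seconds:
        signals.append("oldest message exceeds the age threshold")
    return signals


# Illustrative numbers only; they are not taken from the incident.
print(backlog_signals(QueueSnapshot(1200, 400, 900.0)))
```

Both signals firing together matches the pattern described here: the queue is filling faster than it drains, and the oldest messages are aging.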
Next, you turn your attention to AWS Lambda, responsible for processing SQS messages to update your inventory. Log entries give evidence of prolonged function execution and timeouts, suggesting potential issues with the Lambda function's efficiency or resource allocation.
Here, Sumo Logic’s out-of-the-box dashboards for AWS Lambda error analysis surface the log entries that show these timeouts.
Because the Lambda function interacts with an Amazon RDS instance, checking RDS would be your next step.
The RDS performance metrics show high CPU utilization and errors related to database locks.
Again, Sumo Logic’s out-of-the-box dashboards for Amazon RDS error log analysis help to locate particular log error messages confirming the database issue.
2023-11-09T01:45:00Z [ERROR] Deadlock found when trying to get lock; try restarting transaction
A closer look at the out-of-the-box RDS slow query log analysis dashboard reveals suboptimal queries that significantly drag down performance.
# Query_time: 899.00 Lock_time: 0.594385 Rows_sent: 45 Rows_examined: 54392
SELECT * FROM inventory;
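To make the numbers in that log entry easier to reason about, here is a minimal sketch that parses the statistics header and compares rows examined with rows returned. The function names are hypothetical, and the ratio is only a rough heuristic chosen for this example, not a rule taken from the dashboard.

```python
import re

# Matches the statistics header written before each slow-query entry.
HEADER = re.compile(
    r"Query_time:\s*(?P<query_time>[\d.]+)\s+"
    r"Lock_time:\s*(?P<lock_time>[\d.]+)\s+"
    r"Rows_sent:\s*(?P<rows_sent>\d+)\s+"
    r"Rows_examined:\s*(?P<rows_examined>\d+)"
)


def parse_slow_query_header(line: str) -> dict:
    """Extract timing and row counts from a slow-query log header line."""
    match = HEADER.search(line)
    if match is None:
        raise ValueError("not a slow-query header line")
    return {
        "query_time": float(match["query_time"]),
        "lock_time": float(match["lock_time"]),
        "rows_sent": int(match["rows_sent"]),
        "rows_examined": int(match["rows_examined"]),
    }


def examined_per_row_sent(stats: dict) -> float:
    """Rows read for every row returned; a high value hints at a scan."""
    return stats["rows_examined"] / max(stats["rows_sent"], 1)


stats = parse_slow_query_header(
    "# Query_time: 899.00 Lock_time: 0.594385 Rows_sent: 45 Rows_examined: 54392"
)
print(stats["query_time"], round(examined_per_row_sent(stats)))
```

For the entry above, the database examined roughly 1,200 rows for each row it returned, which suggests the query is reading far more data than it needs.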
You can see that the culprit is a full table scan caused by a missing index.
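To illustrate how a missing index leads to a full table scan, the sketch below uses Python's built-in SQLite as a stand-in for the RDS database. The table schema, the `sku` column, the filter in the query, and the index name are all assumptions made for this example; the real inventory table and query may look different.

```python
import sqlite3

# SQLite stands in for the RDS database; schema and sku column are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (id INTEGER PRIMARY KEY, sku TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO inventory (sku, quantity) VALUES (?, ?)",
    [(f"SKU-{i}", i % 50) for i in range(1000)],
)

QUERY = "SELECT * FROM inventory WHERE sku = ?"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's plan description for QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("SKU-42",)).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return "; ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no index on sku: the whole table is scanned
conn.execute("CREATE INDEX idx_inventory_sku ON inventory (sku)")
plan_after = query_plan(conn)   # with the index: a targeted search

print(plan_before)
print(plan_after)
```

The first plan reports a scan of the table and the second reports a search through the new index. The exact wording of the plan text varies between SQLite versions, and the production database has its own `EXPLAIN` output, so treat this purely as a model of the behavior.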
By thoroughly examining each component of the serverless architecture, you can now address the delays. As next steps, you can adjust the Lambda function's timeout settings and increase its memory allocation. Additionally, you can add an index to the inventory table in the RDS database to speed up the problematic query.
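As a rough sketch of the Lambda adjustment, the helper below wraps the `update_function_configuration` call that boto3's Lambda client provides. The helper name, the `DryRunClient` stand-in, the function name `inventory-updater`, and the chosen values are assumptions for illustration; appropriate settings depend on your own workload.

```python
def raise_lambda_limits(lambda_client, function_name: str,
                        timeout_seconds: int, memory_mb: int) -> dict:
    """Apply a new timeout and memory size to a Lambda function."""
    # Lambda accepts timeouts of 1-900 seconds and memory of 128-10240 MB.
    if not 1 <= timeout_seconds <= 900:
        raise ValueError("Lambda timeouts range from 1 to 900 seconds")
    if not 128 <= memory_mb <= 10240:
        raise ValueError("Lambda memory ranges from 128 to 10240 MB")
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Timeout=timeout_seconds,
        MemorySize=memory_mb,
    )


class DryRunClient:
    """Hypothetical stand-in that echoes the request instead of calling AWS."""

    def update_function_configuration(self, **kwargs) -> dict:
        return kwargs


# Dry run with made-up values; pass boto3.client("lambda") to apply for real.
print(raise_lambda_limits(DryRunClient(), "inventory-updater", 60, 1024))
```

Passing the client in as a parameter keeps the helper easy to try without AWS credentials. Raising the limits treats the symptom, so it is worth pairing with the query fix so the function is not simply waiting longer on a slow database.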
Without a unified view of your AWS environment and the ability to pivot between services and centralized logs, getting to the root cause of this issue would have been extremely difficult, if not impossible.
Looking to reclaim your time? Get started today with AWS observability, which you can deploy in minutes via the CloudFormation template or Terraform. Learn more and start your trial here.