May 1, 2013
The enterprise software market is being shaken to its foundation, and Etsy’s open-source tool StatsD is one of the tools providing the vibrations. One of the most interesting tools spawned by the DevOps movement, StatsD delivers highly specific (and highly relevant) metrics directly from Etsy’s code, providing a smoother experience than relying on the more generic metrics provided by application performance management (APM) vendors. With just a few lines of code, developers can measure any part of their application they choose, in the way they choose. This is very similar to the freedom that developers gain with a proper log analysis tool: they can dump any data they want into a log and analyze it later. Freed from the issue of storage, and from the mechanics of log analysis, they can focus on using the data to enhance performance management, troubleshooting, business intelligence, and more.
First, a little background on StatsD. The StatsD project started at Flickr and was expanded at Etsy. This is appropriate, since John Allspaw and his team helped kick-start the DevOps movement at Flickr before coming over to Etsy. From a technical perspective, StatsD is, in their own words:
A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services.
So, StatsD modules forward clear-text metrics over UDP. StatsD supports a few different types of metrics, as well as analytics, but for the sake of simplicity, we will only cover two areas here: Counting and Timing.
The counting metric sends the metric name, the amount to increment or decrement, and optionally the sample rate:
Counter.sample:1|c
The timing metric looks very similar, with a metric name and value:
timing.sample:320|ms
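To make the wire format concrete, here is a minimal Python sketch (not part of the Perl setup used in this post) that builds the two message types above and sends each one as a UDP datagram. The helper names format_counter, format_timing, and send_metric are hypothetical, and the host and port defaults are assumptions to adjust for your own listener.

```python
import socket


def format_counter(name, delta=1, sample_rate=None):
    """Build a StatsD counter line such as 'Counter.sample:1|c'."""
    line = f"{name}:{delta}|c"
    if sample_rate is not None:
        # The optional sample rate is appended as '|@<rate>'.
        line += f"|@{sample_rate}"
    return line


def format_timing(name, millis):
    """Build a StatsD timing line such as 'timing.sample:320|ms'."""
    return f"{name}:{int(millis)}|ms"


def send_metric(line, host="localhost", port=8125):
    """Send one clear-text metric line as a single UDP datagram.

    Host and port are assumptions; point them at your own listener.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    send_metric(format_counter("Counter.sample"))
    send_metric(format_timing("timing.sample", 320))
```

Because UDP is fire-and-forget, send_metric returns without knowing whether anything received the datagram, which is part of what keeps this kind of instrumentation cheap to add to application code.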
For current users of StatsD, the question might be: Why would I want to put this in Sumo Logic, as opposed to using a tool like Graphite to create dashboards? Sumo Logic provides a few key benefits when combined with StatsD metrics:
To generate the data, I created a simple Perl script using the StatsD Perl module Net::Statsd. I then created a Syslog Source on a Linux Collector, listening on the standard syslog port, 514. The Sumo Logic Syslog Source, essentially a listener for text over UDP, can receive the StatsD messages just fine.
One caveat, though: since the StatsD messages do not include a timestamp, Sumo Logic will assign the ingest time as the timestamp. This means that it is essential that you set the timezone setting correctly. I tested this with thousands of events, and there were no issues.
To generate some interesting and relevant metrics, I added extra logic to my Perl script to create some patterns with the rand() function and some math:
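To illustrate what "a listener for text over UDP" that stamps messages with the ingest time looks like, here is a small Python sketch. It is only an illustration under stated assumptions, not how the Sumo Logic Collector is implemented: the function names are hypothetical, and it binds to an OS-chosen port on the loopback interface rather than port 514.

```python
import socket
from datetime import datetime, timezone


def receive_with_ingest_time(sock, bufsize=1024):
    """Return (ingest_time, text) for the next datagram.

    The StatsD payload carries no timestamp, so the receiver stamps it
    with its own clock on arrival (UTC here, to avoid timezone ambiguity).
    """
    data, _addr = sock.recvfrom(bufsize)
    return datetime.now(timezone.utc), data.decode("ascii")


def loopback_roundtrip(message):
    """Send `message` to a throwaway local listener and return what it saw."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        listener.settimeout(2.0)         # do not block forever if nothing arrives
        port = listener.getsockname()[1]
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
            sender.sendto(message.encode("ascii"), ("127.0.0.1", port))
        return receive_with_ingest_time(listener)


if __name__ == "__main__":
    ingest_time, text = loopback_roundtrip("site.logins:1|c")
    print(ingest_time.isoformat(), text)
```

The timestamp comes entirely from the receiver's clock, which is why a wrong timezone setting on the receiving side would shift every event.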
use Net::Statsd;

# Configure where to send events.
# That's where your StatsD daemon is listening.
$Net::Statsd::HOST = 'localhost';  # Default
$Net::Statsd::PORT = 514;          # Syslog Source port used here (StatsD's usual default is 8125)

# Initial values
$basepercent = 0.50;
$webTime     = 50;
$appTime     = 100;
$dbTime      = 150;
$basecount   = 5;

# Infinite loop
while (1) {
    # Smooth a random factor into the running base percentage
    $basepercent = ($basepercent + (rand(100) + 50)/100)/2;

    # Derive a timing for each tier and send it
    $webTime = $basepercent*($webTime + 50 + rand(750))/2;
    $appTime = $basepercent*($appTime + 100 + rand(1000))/2;
    $dbTime  = $basepercent*($dbTime + 150 + rand(1200))/2;
    Net::Statsd::timing('web.time', $webTime);
    Net::Statsd::timing('app.time', $appTime);
    Net::Statsd::timing('db.time', $dbTime);

    # Increment the login counter a varying number of times
    $k = 0;
    $basecount = $basepercent*($basecount + rand(5))/2;
    while ($k < $basecount) {
        Net::Statsd::increment('site.logins');
        $k++;
    }

    sleep(5 + rand(10));
}
Once the metrics were successfully being ingested into Sumo Logic, I needed to create some useful searches and Dashboard Monitors. With the StatsD counter function, I simply wanted to extract the data, drop it into 1m buckets, and sum up the number of increments to the counter over each minute.
The key-value structure of a StatsD message can be easily parsed with our keyvalue operator. Basically, I just told Sumo Logic to look for a lowercase key name with “.” in it ([a-z\.]+) and a numerical value (\d+). I only searched for “site.logins”, but you could use the statement to look for any number of different counters in the same dashboard.
_sourceCategory=*StatsD*
| keyvalue regex "([a-z\.]+?):(\d+?)\|c" "site.logins" as logins
| timeslice by 1m
| sum(logins) by _timeslice
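For readers who want to see the same logic outside Sumo Logic, here is a rough Python equivalent of the counter search: extract the key and value, bucket each event into its minute, and sum per bucket. It is a sketch under assumptions, not how the query is executed internally: the function name is hypothetical, and events are assumed to arrive as (ingest time, raw message) pairs.

```python
import re
from collections import defaultdict
from datetime import datetime

# Lowercase key containing dots, then an integer value, then the '|c' counter marker.
COUNTER_RE = re.compile(r"([a-z.]+):(\d+)\|c")


def sum_counter_by_minute(events, key="site.logins"):
    """Sum counter increments per one-minute bucket.

    events: iterable of (datetime, raw_message) pairs.
    Returns {minute_start: total_increment} for the requested key.
    """
    totals = defaultdict(int)
    for ts, message in events:
        match = COUNTER_RE.search(message)
        if match and match.group(1) == key:
            minute = ts.replace(second=0, microsecond=0)  # 1m timeslice
            totals[minute] += int(match.group(2))
    return dict(totals)


if __name__ == "__main__":
    sample = [
        (datetime(2013, 5, 1, 12, 0, 5), "site.logins:1|c"),
        (datetime(2013, 5, 1, 12, 0, 40), "site.logins:1|c"),
        (datetime(2013, 5, 1, 12, 1, 10), "site.logins:1|c"),
        (datetime(2013, 5, 1, 12, 1, 15), "web.time:320|ms"),  # not a counter, ignored
    ]
    for minute, total in sorted(sum_counter_by_minute(sample).items()):
        print(minute, total)
```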
With the timing metrics, an average over each minute seems most relevant (though other functions like max, min, or standard deviation could be useful here). I pulled out all three timings together by looking for a key that looks like *.time, using the pattern (?<tier>[a-z]+)\.time. Since I named my metrics web.time, app.time, and db.time, I was able to put each of the “tier” metrics on the same graph.
_sourceCategory=*StatsD* AND time
| parse regex "(?<tier>[a-z]+)\.time:(?<test_time>\d+)\|ms"
| timeslice by 1m
| avg(test_time) by _timeslice, tier
| transpose row _timeslice column tier
As I ran each of these searches, I clicked the “Add to Dashboard” button on the far right to add them to a newly created StatsD dashboard. I included a screenshot below (the tier metrics are on the left, and the counter is on the right):
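The timing search can be sketched the same way in Python: capture the tier and the time with named groups, average per minute and tier, and return one row per minute with one column per tier (the effect of the transpose step). The function name and the (ingest time, raw message) input shape are assumptions made for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Named groups mirror the 'tier' and 'test_time' fields in the search above.
TIMING_RE = re.compile(r"(?P<tier>[a-z]+)\.time:(?P<test_time>\d+)\|ms")


def avg_time_by_minute_and_tier(events):
    """Average timings per one-minute bucket and tier.

    events: iterable of (datetime, raw_message) pairs.
    Returns {minute_start: {tier: average_ms}}.
    """
    # minute -> tier -> [running total, sample count]
    sums = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for ts, message in events:
        match = TIMING_RE.search(message)
        if not match:
            continue
        minute = ts.replace(second=0, microsecond=0)  # 1m timeslice
        cell = sums[minute][match.group("tier")]
        cell[0] += int(match.group("test_time"))
        cell[1] += 1
    return {
        minute: {tier: total / count for tier, (total, count) in tiers.items()}
        for minute, tiers in sums.items()
    }


if __name__ == "__main__":
    sample = [
        (datetime(2013, 5, 1, 12, 0, 5), "web.time:100|ms"),
        (datetime(2013, 5, 1, 12, 0, 30), "web.time:300|ms"),
        (datetime(2013, 5, 1, 12, 0, 45), "db.time:150|ms"),
        (datetime(2013, 5, 1, 12, 1, 2), "web.time:50|ms"),
    ]
    for minute, tiers in sorted(avg_time_by_minute_and_tier(sample).items()):
        print(minute, tiers)
```

Swapping the average for a max, min, or standard deviation only changes what is accumulated per cell; the bucketing by minute and tier stays the same.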
You can see from this example how easy it is to analyze data in the StatsD format. Once the data is in Sumo Logic, the sky is the limit in terms of what you can do with it. There are other metrics and backend functions that Sumo Logic can support over the long term, but this simple integration provides the majority of functionality needed.
Let us know what you think, and sign up for a free account to try it out yourself!