Are you a bit unsure about the difference between log aggregation and Application Performance Monitoring (APM)? If so, you’re hardly alone. These are closely related types of operations, and it can be easy to conflate them—or assume that if you are doing one of them, there’s no reason to do the other.
In this post, we'll compare log aggregation and APM, look at how these two data collection and analysis domains relate to each other, and explain why it is important to address each of them with a suite of domain-appropriate tools rather than a single tool.
Defining APM
First, let’s look at Application Performance Monitoring, or APM. Note that APM can stand for both Application Performance Monitoring and Application Performance Management, and in most of the important ways, these terms really refer to the same thing—monitoring and managing the performance of software under real-world conditions, with emphasis on the user experience, and the functional purpose of the software.
Since we’ll be talking mostly about the monitoring side of cloud APM, we’ll treat the acronym as being interchangeable with Application Performance Monitoring, but with the implicit understanding that it includes the performance management functions associated with APM.
What does APM monitor, and what does it manage?
Most of the elements of APM fall into two key areas: user experience and resource-related performance. While these two areas interact (resource use, for example, can have a strong effect on user experience), there are significant differences in the ways in which they are monitored (and, to a lesser degree, managed).
APM: User Experience
The most basic way to monitor application performance in terms of user experience is to monitor response time. How long does it take after a user clicks on an application input element for the program to display a response? And more to the point, how long does it take before the program produces a complete response (i.e., a full database record displayed in the correct format, rather than a partial record or a spinning cursor)?
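As a rough illustration of what measuring response time can look like, here is a minimal Python sketch. The names `timed_call` and `summarize` are hypothetical helpers invented for this example, not part of any APM product; a real tool would instrument the application for you. The sketch times a complete call and then summarizes a batch of samples, since a single measurement says little about what users actually experience.

```python
import math
import time
from statistics import median


def timed_call(handler, *args, **kwargs):
    """Run handler once and return (result, elapsed_seconds).

    The clock stops only after handler returns, so the figure reflects
    the complete response rather than the first partial output.
    """
    start = time.perf_counter()
    result = handler(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def summarize(samples):
    """Summarize response times (in seconds) from a non-empty list."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile: the smallest sample that is at least
    # as large as 95% of all samples.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "median": median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }
```

Reporting a high percentile alongside the median is a common choice because slow outliers, which the median hides, are often what users notice.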
Load is Important
Response time, however, is highly dependent on load—the conditions under which the application operates, and in particular, the volume of user requests and other transactions, as well as the demand placed on resources used by the application.
To be accurate and complete, user experience APM should include in-depth monitoring and reporting of response time and related metrics under expected load, under peak load (including unreasonably high peaks, since unreasonable conditions and events are rather alarmingly common on the Internet), and under continuous high load (an important but all too often neglected element of performance monitoring and stress testing).
Much of the peak-level and continuous high-level load monitoring, of course, will need to be done under test conditions, since it requires application of the appropriate load, but it can also be incorporated into real-time monitoring by means of reasonably sophisticated analytics: report performance (and load) when load peaks above a specified level, or when it remains above a specified level for a given minimum period of time.
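The rule described above (report when load peaks above one level, or stays above another level for a minimum period) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `load_events` is a hypothetical function, the samples are assumed to arrive as `(timestamp_seconds, load)` pairs in time order, and the thresholds are whatever values suit your application.

```python
def load_events(samples, peak_level, sustained_level, min_duration):
    """Yield ("peak", t, load) and ("sustained", t, load) events.

    samples: iterable of (timestamp_seconds, load) pairs in time order.
    A "peak" event is emitted for every sample above peak_level.
    A "sustained" event is emitted once per episode, as soon as load has
    stayed above sustained_level for at least min_duration seconds.
    """
    above_since = None   # start time of the current high-load episode
    reported = False     # whether this episode was already reported
    for t, load in samples:
        if load > peak_level:
            yield ("peak", t, load)
        if load > sustained_level:
            if above_since is None:
                above_since = t
                reported = False
            if not reported and t - above_since >= min_duration:
                yield ("sustained", t, load)
                reported = True
        else:
            # Load dropped back down, so the episode is over.
            above_since = None
```

Each event carries the load value, so the performance figures recorded at the same timestamp can be reported next to it.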
APM: Resource Use
Resource-based performance monitoring is the other key element of APM. How is the application using resources such as CPU, memory, storage, and I/O? When analyzing these metrics, the important numbers to look at are generally the percentage of the resource used and the percentage still available. Strictly speaking, collecting these numbers falls within the realm of metrics monitoring more than APM proper, and it requires tools dedicated to metrics monitoring.
If percent used for any resource (such as compute, storage or memory usage) approaches the total available, that can (and generally should) be taken as an indication of a potential performance bottleneck. It may then become necessary to allocate a greater share of the resource in question (either on an ongoing basis, or under specified conditions) in order to avoid such bottlenecks. Remember: bottlenecks don’t just slow down the affected processes. They may also bring all actions dependent on those processes to a halt.
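A minimal sketch of that check, assuming the used and total figures for each resource have already been collected by a metrics tool. The function names and the 90% default threshold are illustrative assumptions, not recommendations from any particular product.

```python
def percent_used(used, total):
    """Return the share of a resource in use, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def near_capacity(resources, threshold=90.0):
    """Return {name: percent_used} for resources at or above threshold.

    resources: dict mapping a resource name to a (used, total) pair
    expressed in the same unit (cores, GiB, IOPS, and so on).
    """
    flagged = {}
    for name, (used, total) in resources.items():
        pct = percent_used(used, total)
        if pct >= threshold:
            flagged[name] = pct
    return flagged
```

For example, `near_capacity({"memory": (8, 16), "disk": (450, 500)})` flags only the disk, which is at 90% of its capacity.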
Once Again, Load
Resource use, like response time, should be monitored and analyzed not only under normal expected load, but also under peak and continuous high loads. Continuous high loads in particular are useful for identifying potential bottlenecks which might not otherwise be detected.
Log Aggregation
It should be obvious from the description of APM that it can make good use of logs, since the various logs associated with the deployment of a typical Internet-based application provide a considerable amount of performance-related data. Much of the monitoring that goes into APM, however, is not necessarily log-based, and many of the key functions which logs perform are distinct from those required by APM.
Logs as Historical Records
Logs form an ongoing record of the actions and state of the application, its components, and its environment; in many ways, they serve as a historical record for an application. As we indicated, much of this data is at least somewhat relevant to performance (load level records, for example), but much of it is focused on areas not closely connected with performance:
- Logs, for example, are indispensable when it comes to analyzing and tracing many security problems, including attempted break-ins. Log analysis can detect suspicious patterns of user activity, as well as unusual actions on the part of system or application resources.
- Logs are a key element in maintaining compliance records for applications operating in a regulated environment. They can also be important in identifying details of specific transactions and other events when they require verification, or are in dispute.
- Logs can be very important in tracing the history and development of functional problems, both at the application and infrastructure level—as well as in analyzing changes in the volume or nature of user activity over time.
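To make the security case concrete, here is a small Python sketch that scans log lines for repeated failed logins from the same address. The log format (`LOGIN_FAILED user=... ip=...`) and the function name `suspicious_sources` are made up for this example; real logs will need a pattern that matches their own layout.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for illustration only:
#   2024-05-01T10:00:00Z LOGIN_FAILED user=alice ip=10.0.0.5
FAILED_LOGIN = re.compile(r"LOGIN_FAILED\s+user=(?P<user>\S+)\s+ip=(?P<ip>\S+)")


def suspicious_sources(lines, min_failures=3):
    """Return {ip: failure_count} for addresses with repeated failures."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group("ip")] += 1
    return {ip: n for ip, n in counts.items() if n >= min_failures}
```

A check like this only works if the relevant logs are retained and searchable, which is why the historical role of logs matters.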
APM tools can also provide historical visibility into your environment, but they do it in a different way and at a different level. They trace performance issues to specific lines of code. This is a different kind of visibility and is not a substitute for the insight you gain from using log aggregation with historical data in order to research or analyze issues after they have occurred.
The Need for Log Aggregation
The two greatest problems associated with logs are the volume of data generated by logging, and the often very large number of different logs generated by the application and its associated resources and infrastructure components. Log aggregation is the process of automatically gathering logs from disparate sources and storing them in a central location. It is generally used in combination with other log management tools, as well as log-based analytics.
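At its simplest, aggregation means merging several log streams into one time-ordered stream tagged by source. The Python sketch below shows that core idea only. It assumes each source is already sorted by time and that every line starts with a UTC timestamp in one consistent ISO 8601 format, followed by a space; the `aggregate` function is hypothetical. Production tools also handle collection over the network, parsing of mixed formats, and durable storage.

```python
import heapq


def aggregate(sources):
    """Merge per-source log lines into one time-ordered stream.

    sources: dict mapping a source name to an iterable of lines, each
    starting with an ISO 8601 UTC timestamp and already sorted by time.
    Returns an iterator of (timestamp, source, message) tuples.
    """
    def tagged(name, lines):
        for line in lines:
            timestamp, _, message = line.rstrip("\n").partition(" ")
            yield (timestamp, name, message)

    streams = [tagged(name, lines) for name, lines in sources.items()]
    # Timestamps in one fixed ISO 8601 format sort correctly as strings,
    # so the tuples can be merged lazily without loading every log.
    return heapq.merge(*streams)
```

Because `heapq.merge` is lazy, the merged stream can be consumed line by line even when the individual logs are very large.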
It should be clear at this point that APM and log aggregation are not only different; it also does not make sense for a single tool to handle both tasks. It is, in fact, asking far too much of any one tool to take care of all of the key tasks required by either domain.
Each of them requires a full suite of tools, including monitoring, analytics, a flexible dashboard system, and a full-featured API. A suite of tools that can fully serve both domains, such as that offered by Sumo Logic, can, on the other hand, give you full-stack visibility into your network, infrastructure, and application logs, along with the ability to search across them.