There are many vendors of and approaches to implementing Observability. Before we go through them all, however we first need to ask, what is Observability really (what distinguishes it from monitoring), and who needs Observability? In other words, what kinds of applications and situations warrant investing in purchasing and/or implementing an Observability solution?
What is Observability?
As with any new term in the monitoring or management industry, every vendor with a solution that might be considered Observability tries to define Observability in a manner advantageous to them. So is here is a proposed definition. Observability means that you have the data that you need (the logs, metrics, traces and dependency maps) for every single unit of work that your application and its underlying system software perform that is of interest to the business. A unit of work can be a business transaction (checkout at an online store), a call between two microservices, a sequence of calls between a set of microservices (a trace and its component spans), a batch job, a file transfer from one location to another, a write of a specific value to a database or a file or a click of a mouse on a screen by a user. Of interest to the business means that if this unit of work fails, if it is slow (latency or response time) or if the application and the system cannot handle the required volume of work (throughput) then there are negative business consequences (loss of revenue, loss of market share, impacts to customer satisfaction, etc.).
Every single unit of work, means that you are not ever missing the required data for ANY of those transactions or units of work. You have that data (logs, metrics, traces and dependency maps) for every single one of them. This is essential because what sets observability apart from just plain old monitoring is that monitoring inherently relies upon sampling. Most application performance management products sample the environment every minute. Instana, the leader in real time APM samples every second. The exception to sampling on the part of the APM vendors is that the vendors that have implemented distributed tracing, generally are able to collect the traces and spans for every trace and span. The challenge lies in collecting the metrics, logs, and dependencies with the same fidelity as the traces.
Observability creates a set of requirements that monitoring vendors have to meet in order to be considered a viable Observability Platform:
- Comprehensive instrumentation of every unit of work for metrics, logs, traces and dependencies
- High cardinality analytics to sort the good observations from the ones that require analysis and attention
- A real-time and high scalable big data back end that can simultaneously handle the ingest of all of this data, the analysis of all of this data, and the queries from users who are using the system and expect real time data and responses to their queries.
An Overview of Observability, Its Data Types, Its Analytics and Its Business Benefits
Who Needs Observability?
So when is Application Performance Management (APM) not good enough and when is Observability necessary? Observability is necessary under the following conditions:
- Your application performs a large number of units of work per unit of time and for business reasons you cannot afford to miss any of them. For example if you an online retailer, ideally no user or customer should ever have a bad experience. You should know about every single slow interaction and every single error seen by every user. Another example would be if you are a financial institution doing hundreds or thousands of funds transfers in a second. None of them should ever fail and you again need to know about every single one of them.
- The load on your application varies a lot and as a result the number of instances of your application scale up and down as well. If you have hundreds or thousands of microservices in a Kubernetes cluster being scaled up and down, you are going to need Observability because things may come and go so quickly that a tool that samples every minute could miss the existence of a microservice that only lives for a few seconds.
- You are aggressively rolling updates into production via your CI/CD process and you need to know immediately after an update whether it is good to be expanded upon or whether it needs to be rolled back. Waiting an average of 30 seconds (which is what happens when your tool samples every minute) potentially means 30 seconds of slow or failed transactions which in a high volume system is completely unacceptable.
Who (Which Vendors) Can You Buy Observability From?
At this point in our industry, every monitoring vendor claims to be an AIOps vendor. We are not quite there with Observability yet, but we are getting there fast. So here comes our attempt to accurately characterize the available options in this space, list the differences between them, and articulate the tradeoffs involved in implementing them and using them:
- LightStep – LightStep gets credit for getting the Observability party started. Ben Sigelman the Founder of LightStep implemented the first large scale distributed tracing system while at Google. This is documented in the Dapper Paper from Ben. LightStep is able to collect traces and spans for every call between any set of microservices. LightStep therefore meets the standard of not sampling and not missing anything. The issues with LightStep are that it is only appropriate for microservices based applications (it does not work very well for monoliths and N-Tier distributed SOA applications since it has no code profiling for these types of applications), and that the developer of the application must instrument their code with the LightStep tracing libraries. This opens up a huge debate as to whether or not your developers should spend their time instrumenting their code and how you be be sure that all of your developers follow the same standards for how this instrumentation is done.
- Honeycomb – The Founder of Honeycomb, Charity Majors is a thought leader and innovator in the Observability space, and also gets credit for getting the party started. The key innovation from Honeycomb is a big data back end that can handle “wide logs” that include metrics, traces, and dependencies. So if you are willing to instrument your code with the Honeycomb libraries, and instrument it comprehensively and consistently across your team, then you can implement a true observability solution with Honeycomb. Note that this also subjects you to the debate as to whether or not this is a good use of your precious developer resources.
- Instana – Instana is an APM solution built from the ground up for highly scaled out and dynamic microservices environments. It combines comprehensive tracing (it does not sample traces, it collects them all), with one-second metric collection, log collection, and collection of the dependency maps for every transaction and trace. It is one of only two APM vendors whose agents dynamically inject instrumentation into containers and application run-times which is the perfect approach for highly dynamic environments undergoing rapid change and evolution. It is crucial to note that Instana does not require any effort on the part of developers to instrument their code – this is all done automatically by the Instana agent. Instana has recently added code profiling via its acquisition of StackImpact, making it a solution that spans monoliths, N-Tier distributed applications, and microservices based applications. Instana also recently acquired BeeInstant which when integrated into the Instana back will give Instana the kind of real time and high scalable back end required for Observability at scale. See our write up on these two acquisitions here. An excellent interview of Pavlo Baron the CTO of Instana on the subject of observability is here.
- New Relic – New Relic was the first APM vendor to exclusively deliver its solution on a SaaS basis (New Relic hosts the back end for all of its customers) so New Relic has been subjected to torrents of monitoring data for longer than any other APM vendor. At FutureStack 19 New Relic announced its New Relic One Observability Platform with the inclusion of distributed tracing, open collection of data from non-New Relic sources, and the ability to build custom applications on top of the New Relic platform. New Relic has also announced the New Relic DB a highly scalable and high performance metric platform. This puts New Relic in a strong position to build a complete Observability Platform. It just needs to add comprehensive metric and dependency map collection, along with the associated high cardinality analytics. New Relic also needs to add an agent that instruments as automatically as those from Instana and Dynatrace.
- Dynatrace – Several years ago, Dynatrace embarked upon a complete rewrite of its APM solution with the goal of having an enterprise grade solution that was suitable for the diversity in the modern enterprise, and completely modern scaled out and dynamic microservices based architectures. At this point the new Dynatrace product has effectively replaced the legacy AppMon product although there are still many customers using the old AppMon product. The strongest features of the new Dynatrace platform are its coverage of enterprise infrastructure and applications, its topology mapping across the entire hardware and software set of infrastructure and services that support each application, and dynamic agent (like Instana) that automatically detects what is running in the container and injects the appropriate instrumentation. Dynatrace has an excellent discussion of its approach to observability here. Dynatrace just needs to work on comprehensive and real-time metric and dependency collection and analysis.
Summary
Not every application needs observability. But if you need to know the performance and quality of every interaction between your application and its users, and within your application, then observability is an absolute requirement. However, implementing comprehensive observability needs to be done with great care so that you are sure that you get the results that you want at a cost in terms of money and time acceptable to you.