This article explains the basics of observability for developers. We'll look at why observability should interest you, its current level of maturity, and what to look out for to make the most of its potential.
Two years ago, James Governor of the developer analyst firm RedMonk remarked, "Observability is making the transition from being a niche concern to becoming a new frontier for user experience, systems, and service management in web companies and enterprises alike." Today, observability is hitting the mainstream.
As 2022 gets underway, you should expect to hear more about observability as a concern to take seriously. However, lots of developers are still unsure about what observability actually is—and some of the descriptions of the subject can be vague and imprecise. Read on to get a foundation in this emerging topic.
What is observability?
The discipline of observability grew out of several separate strands of development, including application performance monitoring (APM) and the need to make orchestrated systems such as Kubernetes more comprehensible. Observability aims to provide highly granular insights into the behavior of systems, along with rich context.
Overall, implementing observability is conceptually fairly simple. To enable observability in your projects, you should:
- Instrument systems and applications to collect relevant data (e.g. metrics, traces, and logs).
- Send this data to a separate external system that can store and analyze it.
- Provide visualizations and insights into systems as a whole (including query capability for end users).
The final step, the query and visualization capabilities, is key to the power of observability.
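To make the first of these steps concrete, here is a minimal sketch in Java using the OpenTelemetry API (covered later in this article). The class, tracer, and attribute names are hypothetical, and the sketch assumes an SDK with an exporter has already been configured, so that finished spans are shipped to an external backend where the querying and visualization of the third step take place.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

public class CheckoutHandler {
    // Step 1: instrument the code through a vendor-neutral API.
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("com.example.checkout");

    void handleCheckout(String orderId) {
        Span span = tracer.spanBuilder("handleCheckout").startSpan();
        try {
            span.setAttribute("order.id", orderId); // rich context for later queries
            // ... business logic ...
        } finally {
            // Step 2: on end(), the configured exporter ships the span to a
            // separate external system, which provides the query and
            // visualization capabilities of step 3.
            span.end();
        }
    }
}
```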
The theoretical background for the approach comes from control theory, which essentially asks, "How well can the internal state of a system be inferred from outside?"
This requirement has been set down as a constraint in a blog post by Charity Majors: "Observability requires that you not have to predefine the questions you will need to ask, or optimize those questions in advance." Meeting this constraint requires the collection of sufficient data to accurately model the system's internal state.
Incident resolution is a natural fit for observability, and it's where the practice originated. Site reliability engineers (SREs) and DevOps teams typically focus on incident response. They are interested in a holistic understanding of complex behavior that replaces fragmentary or pre-judged views based on just one or two pieces of the overall system.
But if the right data is being collected, the stakeholders for observability are much broader than just SREs, production support, and DevOps folks. Observability gives rise to different goals depending on the stakeholder group using the data.
The promise of a wider use for observability can be seen in some of the discussions about whether observability systems should also collect business-relevant metrics, costs, etc. Such data provides additional context and possible use cases for an observability system, but some practitioners argue that it dilutes the goal of the system.
Why is observability important?
As applications increasingly move to the cloud, they are becoming more complex. There are typically more services and components in modern applications, with more complex topology as well as a much faster pace of change (driven by practices such as CI/CD).
The growing complexity of applications parallels the increasing popularity of technologies with genuinely new behaviors that were created for the cloud. The highlights here include containerized environments, dynamically scaling services (especially Kubernetes), and Function-as-a-Service deployments such as AWS Lambda.
This new world makes root cause analysis and incident resolution potentially a lot harder, yet the same questions still need answering:
- What is the overall health of my solution?
- What is the root cause of errors and defects?
- What are the performance bottlenecks?
- Which of these problems could impact the user experience?
Observability is therefore at the heart of architecting robust, reliable systems for the cloud. The search for an architectural solution to these questions is perhaps best expressed in a post by Cindy Sridharan: "Since it's still not possible to predict every single failure mode a system could potentially run into or predict every possible way in which a system could misbehave, it becomes important that we build systems that can be debugged armed with evidence and not conjecture."
This new cloud-native world is especially important to Red Hat users because so much of it is based on open source and open standards. The core runtimes, instrumentation components, and telemetry are all open source, and observability components are managed through industry bodies such as the Cloud Native Computing Foundation (CNCF).
What data do we need to collect?
Observability data is often conceptualized in terms of three pillars:
- Distributed traces
- Metrics and monitoring
- Logs
Although some have questioned the value of this categorization, it is a relatively simple mental model, and so is quite useful for developers who are new to observability. Let's discuss each pillar in turn.
Distributed traces
A distributed trace is a record of a single service invocation, usually corresponding to a single request from an individual user. The trace includes the following metadata about each request:
- Which instance was called
- Which container it was running on
- Which method was invoked
- How the request performed
- What the results were
In distributed architectures, a single service invocation typically triggers downstream calls to other services. These calls, which contribute to the overall trace, are known as spans, so a trace forms a tree structure of spans. Each span has associated metadata.
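To make the tree structure concrete, here is a hedged Java sketch using the OpenTelemetry tracing API; the span and attribute names are hypothetical. The parent span represents the incoming request, and each downstream call is recorded as a child span within it.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OrderService {
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("com.example.orders");

    void placeOrder() {
        // Root span of the trace: the incoming request.
        Span parent = tracer.spanBuilder("POST /orders").startSpan();
        try (Scope ignored = parent.makeCurrent()) {
            // Child span for a downstream call; because the parent is the
            // current span in the context, the child attaches to it and the
            // trace forms a tree.
            Span child = tracer.spanBuilder("inventory-service.reserve").startSpan();
            try (Scope ignoredChild = child.makeCurrent()) {
                child.setAttribute("inventory.items", 3L); // span-level metadata
                // ... call the downstream service ...
            } finally {
                child.end();
            }
        } finally {
            parent.end();
        }
    }
}
```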
The span perspective corresponds to the extrinsic view of service calls found in traditional monitoring. Distributed traces are used to instrument service calls that are request-response oriented; calls that do not fit this pattern are harder for tracing to account for.
Metrics and monitoring
Metrics are numbers measuring specific activity over a time interval. A metric is typically encoded as a tuple consisting of a timestamp, name, value, and dimensions. The dimensions are a set of key-value pairs that describe additional metadata about the metric. Furthermore, it should be possible for a data storage engine to aggregate values across the dimensions of a particular metric to create a meaningful result.
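As an illustration, here is a sketch of how such a metric might be recorded with the OpenTelemetry metrics API in Java; the counter name and dimensions are hypothetical. Each call to add() produces a measurement that a backend can later aggregate across the supplied dimensions (for example, total requests per status code).

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

public class RequestMetrics {
    private static final Meter meter =
        GlobalOpenTelemetry.get().getMeter("com.example.http");

    // The metric's name, description, and unit; the timestamp is added
    // automatically when each measurement is recorded.
    private static final LongCounter requests = meter
        .counterBuilder("http.server.requests")
        .setDescription("Number of HTTP requests served")
        .setUnit("1")
        .build();

    void recordRequest(String method, int statusCode) {
        // The key-value pairs are the dimensions that a backend can
        // aggregate across when answering queries.
        requests.add(1, Attributes.builder()
            .put("http.method", method)
            .put("http.status_code", (long) statusCode)
            .build());
    }
}
```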
There are many examples of metrics across many different aspects of a software system:
- System metrics (including CPU, memory, and disk usage)
- Infrastructure metrics (e.g., from AWS CloudWatch)
- Application metrics (such as APM or error tracking)
- User and web tracking scripts (e.g., from Google Analytics)
- Business metrics (e.g., customer sign-ups)
Metrics can be gathered from all of the different levels on which the application operates, from very low-level operating system counters all the way up to human and business-scale metrics. Unlike logs or traces, the data volume of metrics does not scale linearly with request traffic—the exact relationship varies based on the type of metric being collected.
Logs
Logs constitute the third pillar of observability. They are defined as immutable records of discrete events that happen over time. Depending on the implementation, logs come in three basic forms: plain text, structured, and binary. Not all observability products support all three.
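To illustrate the difference between the plain-text and structured styles, here is a hedged Java sketch using the SLF4J logging API with MDC (Mapped Diagnostic Context); the field names are hypothetical, and whether the output ends up as plain text or as structured JSON is decided by the logging backend's configuration.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class PaymentProcessor {
    private static final Logger log = LoggerFactory.getLogger(PaymentProcessor.class);

    void processPayment(String orderId, long amountCents) {
        // Plain-text style: the context is baked into the message string.
        log.info("Processing payment of {} cents for order {}", amountCents, orderId);

        // Structured style: context is attached as discrete fields via MDC,
        // so a JSON layout can emit them as queryable key-value pairs.
        MDC.put("order.id", orderId);
        MDC.put("amount.cents", Long.toString(amountCents));
        try {
            log.info("payment.processed");
        } finally {
            MDC.clear();
        }
    }
}
```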
Examples of logs include:
- System and server logs (syslog)
- Firewall and network system logs
- Application logs (e.g., Log4j)
- Platform and server logs (e.g., Apache, NGINX, databases)
Observability tools: Open source offerings on the rise
The APM/monitoring market segment used to be dominated by proprietary vendors. In response, various free and open source software projects were started independently or spun out of tech companies. Early examples include Prometheus for metrics, and Zipkin and Jaeger for tracing. In the logging space, the "ELK stack" (Elasticsearch, Logstash, and Kibana) became popular and gained significant market share.
As software continues to become more complex, more and more resources are required to provide a credible set of instrumentation components. For proprietary observability products, this trend creates duplication and inefficiency. The market has hit an inflection point, and it is becoming more efficient for competing companies to collaborate on an open source core and compete on features further up the stack (as well as on pricing).
This is not an unusual dynamic for open source: as the market segment migrates from APM to observability, offerings are switching from proprietary to open source-driven. The move to open source can also be partly attributed to the influence of observability startups that have been fully or partially open source since their inception.
One key milestone was the merger of the OpenTracing and OpenCensus projects to form OpenTelemetry, a major project within CNCF. The project is still maturing, but is gaining momentum. An increasing number of users are investigating and implementing OpenTelemetry, and this number seems set to grow significantly during 2022. A recent survey from the CNCF showed that 49 percent of users were already using OpenTelemetry, and that number is rising rapidly.
The state of OpenTelemetry
Some developers are still unsure exactly what OpenTelemetry is. The project offers a set of standards, formats, client libraries, and associated software components. The standards are explicitly cross-platform and not tied to any particular technology stack.
OpenTelemetry provides a framework that integrates with open source and commercial products and can collect observability data from apps written in many languages. Because the implementations are open source, they are at varying levels of technical maturity, depending on the interest that OpenTelemetry has attracted in specific language communities.
From the Red Hat perspective, the Java/JVM implementation is particularly relevant, being one of the most mature implementations. Components in other major languages and frameworks, such as .NET, Node.js, and Go, are also fairly mature.
The implementations work with applications running on bare metal or virtual machines, but OpenTelemetry overall is definitely a cloud-first technology.
It's also important to recognize what OpenTelemetry is not. It isn't a data ingest, storage, backend, or visualization component. Such components must be provided either by other open source projects or by vendors to produce a full observability solution.
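As an illustration, the following sketch wires the OpenTelemetry Java SDK to an external backend through an OTLP exporter; the endpoint and service name are hypothetical. The point is that the system behind that endpoint, not OpenTelemetry itself, is what stores, queries, and visualizes the data.

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class TelemetryBootstrap {
    public static OpenTelemetry init() {
        // Identify the service that the telemetry belongs to.
        Resource resource = Resource.getDefault().merge(Resource.create(
            Attributes.of(AttributeKey.stringKey("service.name"), "checkout-service")));

        // The exporter sends data (over the open OTLP protocol) to whatever
        // backend you have chosen, whether an open source stack or a vendor.
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://observability-backend.example.com:4317") // hypothetical
            .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .setResource(resource)
            .build();

        return OpenTelemetrySdk.builder()
            .setTracerProvider(tracerProvider)
            .buildAndRegisterGlobal();
    }
}
```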
Observability vs. APM
You may wonder if observability is just a new and flashier name for application performance monitoring. In fact, observability has a number of advantages for the end user over traditional APM:
- Vastly reduced vendor lock-in
- Open specification wire protocols
- Open source client components
- Standardized architecture patterns
- Increasing quantity and quality of open source backend components
In the next two years, you should expect to see a further strengthening of key open source observability projects, as well as market consolidation onto a few segment leaders.
Java observability
Java is hugely important to Red Hat's users, and there are particular challenges to the adoption of observability in Java stacks. Fundamentally, Java technology was designed for a world where JVMs ran on bare metal in data centers, so Java applications do not always map easily onto tooling designed for containerized environments.
The world is changing, however. Cloud-native deployments—especially containers—are here and being adopted quickly, albeit at varying rates across different parts of the industry.
The traditional Java application lifecycle consists of a number of phases: bootstrap, intense class loading, warmup (with JIT compilation), and finally a long-lived steady state (lasting for days or weeks) with relatively little class loading or JIT. This model is challenged by cloud deployments, where containers might live for much shorter time periods and cluster sizes might be dynamically readjusted.
As it moves to containers, Java has to ensure that it remains competitive along several key axes, including footprint, density, and startup time. Fortunately, ongoing research and development within OpenJDK is trying to make sure that the platform continues to optimize for these characteristics—and Red Hat is a key contributor to this work.
If you're a Java developer looking to adapt to this new world, the first thing to do is plan for observability at both the application and organization level. OpenTelemetry is likely to be the library of choice for many developers for both tracing and metrics. Existing libraries, especially Micrometer, are also likely to have a prominent place in the landscape. In fact, interoperability with existing components within the Java ecosystem is a key goal for OpenTelemetry.
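As a brief illustration of the Micrometer side, the sketch below registers a counter against a Micrometer MeterRegistry; the metric name and tag are hypothetical. Given the interoperability goal mentioned above, measurements recorded this way should be able to flow into an OpenTelemetry-based pipeline through bridge components rather than requiring re-instrumentation.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class SignupMetrics {
    // In a real application the registry is usually supplied by the framework;
    // SimpleMeterRegistry keeps this sketch self-contained.
    private final MeterRegistry registry = new SimpleMeterRegistry();

    private final Counter signups = Counter.builder("customer.signups")
        .description("Number of completed customer sign-ups")
        .tag("channel", "web")   // hypothetical dimension
        .register(registry);

    void onSignupCompleted() {
        signups.increment();
    }
}
```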
Status and roadmap
OpenTelemetry has several subprojects that are at different levels of maturity.
- The Distributed Tracing specification is at v1.0 and is being widely deployed into production systems. It replaces OpenTracing completely, and the OpenTracing project has officially been archived. The Jaeger project, one of the most popular distributed tracing backends, has also discontinued its client libraries and will default to OpenTelemetry protocols going forward.
- The OpenTelemetry Metrics project is not quite as advanced, but it is approaching v1.0 and General Availability (GA). At the time of writing, the protocol is at the Stable stage and the API is at Feature Freeze. It is anticipated that the project might reach v1.0/GA during April 2022.
- Finally, the Logging specification is still in Draft stage and is not expected to reach v1.0 until late 2022. There is an acknowledged amount of work still to do on the specification, and the working groups are actively seeking participation.
OpenTelemetry as a whole will be considered v1.0/GA when the Metrics standard reaches v1.0 alongside Tracing.
The major takeaways are that observability is reaching more and more developers and is noticeably gathering steam. Some analysts even anticipate that OpenTelemetry formats will become the largest single contributor to observability traffic as early as 2023.
Red Hat and the OpenTelemetry Collector
As the industry starts to embrace OpenTelemetry, it's important to help users decide what to do with all the telemetry data that is being generated. One key piece is the OpenTelemetry Collector. This component runs as a network service that can receive, proxy, and transform data. It enables users to keep data and process it internally, or forward it to a third party.
The Collector helps solve one of the major hurdles to adoption faced by a new standard such as OpenTelemetry: dealing with legacy applications and infrastructure that already exist. The Collector can understand many different legacy protocols and payloads and translate them into OpenTelemetry-supported protocols. In turn, these open protocols can be consumed by the vast number of vendors that embrace the specification. This unification of older protocols is a major shift in the observability space, and is going to offer a level of flexibility that we haven't seen before.
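From the application's point of view, sending data to a Collector looks like sending it to any other OTLP endpoint. The brief sketch below (in Java, using 4317, the default OTLP gRPC port) points a span exporter at a Collector running alongside the application; the Collector's own pipeline configuration then decides how the data is transformed and where it is forwarded.

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;

public class CollectorExport {
    // Export to a local OpenTelemetry Collector instead of directly to a
    // backend; the Collector forwards (and optionally transforms) the data
    // according to its own pipeline configuration.
    static OtlpGrpcSpanExporter collectorExporter() {
        return OtlpGrpcSpanExporter.builder()
            .setEndpoint("http://localhost:4317") // default OTLP/gRPC port
            .build();
    }
}
```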
Red Hat is deeply committed to the OpenTelemetry community and has released the OpenTelemetry Collector as a component of the Red Hat OpenShift Container Platform, branded as OpenShift distributed tracing data collection. This helps our users take advantage of all the capabilities the OpenTelemetry Collector has to offer in order to provide a better, more open approach to observability.
The core architectural capabilities of the Collector can be summarized as follows:
- Usability: A reasonable default configuration that works out of the box
- Performance: Performant under varying loads and configurations
- Observability: A good example of an observable service
- Extensibility: Customizable without touching the core code
- Unification: A single codebase that supports traces, metrics, and logs
Conclusion
Observability is an emerging set of ideas and DevOps practices that help handle the complexity of modern architectures and applications, rather than any specific set of products.
Observability absorbs and extends classic monitoring systems, and helps teams identify the root cause of issues. More broadly, it allows stakeholders to answer questions about their application and business, including forecasting and predictions about what could go wrong.
A diverse collection of tools and technologies are in use, which leads to a large matrix of possible deployments. This has architectural consequences, so teams need to understand how to set up their observability systems in a way that works for them.
One key technology is OpenTelemetry. It is rapidly gaining popularity, but is still maturing, and the open source groups and standards need more participation, especially by end users.