Starting a journey in the observability team.

Nicolas Takashi
4 min read · Apr 19, 2021

At the beginning of the year, I had the opportunity to join the observability team, and I accepted the challenge without hesitation.

Today, I am starting something different. The main idea is to write a kind of logbook describing the challenges I face as a Software Engineer in an observability project.

Image by PixaBay

I got inspired by two of my teammates: schulzwill, who is pioneering this journey by my side, and el_luis_parada, who is doing the same thing by documenting his journey studying Site Reliability Engineering. You can check out his content by clicking here.

Now that you know who inspires me and what the goal of these posts about O11y is, let’s get started.

But first, a spoiler: I won’t dive into real code or tools today, because we need to start at the beginning, understanding what observability means and a couple of core concepts that I had to learn along this journey.

What does observability mean?

Observability is the new kid on the buzzword block in the IT industry, and you will see many companies using it to sell their products and to show how good they are at developing software.

But the truth is that observability is not new at all. As we can see on this Wikipedia page, observability comes from control theory, where it is a measure of how well the internal states of a system can be inferred from its external outputs.

Now you must be asking yourself something like this:

  • What does observability mean for an IT professional?
  • How can we take advantage of observability?
  • How do we build observable platforms?

If it makes you feel any better, I asked the same questions when I heard about O11y for the first time, so let’s answer them.

Observability for IT Professionals

All IT professionals need to operate production systems to ensure that those systems are working as expected. When those systems aren’t working well, we must know why and how to troubleshoot them properly.

Let’s use the old but gold e-commerce example: a simple user journey to buy new shoes on the user’s favorite e-commerce platform.

To keep things simple, I will not cover all the existing flows related to shipping, stock, etc.

Sequence Diagram to buy new shoes

Looking at the example above, we can see that there are several interactions between the user and the e-commerce platform.

Now imagine that user journey on a big e-commerce platform when it is facing significant events such as:

  • Sales Season
  • Black Friday
  • Singles Day

During those events, the user interactions can create hundreds or thousands of network calls with many different request flows.

Troubleshooting systems in the scenario described above without observability will certainly be hard work. Because of that, IT professionals need observable platforms so they can operate those systems based on data, looking at Metrics, Logs, and Traces.

How to create observable platforms?

Sorry, it’s not an easy job, but I can guarantee that it is a satisfying one.

If you pay attention to details, you noticed that I marked three words in bold in the previous section: Metrics, Logs, and Traces, which are considered the three pillars of observability.

Looking at those three pillars, you might think that building observable platforms is quite simple, but don’t jump to conclusions. Having Metrics, Logs, and Traces does not mean that you have observability in your system.

More than tools and patterns, observability is the ability to relate all the information that is collected to answer questions such as:

  • Why is “y” broken?
  • What went wrong during the release of feature “x”?
  • Why has system performance degraded over the past few months?
  • What did my service look like at point “x”?
  • Is this system issue affecting specific users or all of them?
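
To make the idea of relating signals a bit more concrete without reaching for any real tool, here is a toy sketch in plain Python. Everything in it is an assumption made up for illustration: the Span and LogLine types, the trace_id field used to join them, and the sample data. The principle it tries to show is that a metric tells you that something is wrong, while a shared identifier lets you jump to the traces and logs that may tell you why.

```python
from dataclasses import dataclass


@dataclass
class LogLine:
    trace_id: str   # hypothetical identifier shared by all signals of one request
    service: str
    message: str


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: float
    error: bool


def error_rate(spans):
    """Metric: fraction of spans that ended in error (0.0 when there are none)."""
    return sum(s.error for s in spans) / len(spans) if spans else 0.0


def explain_failures(spans, logs):
    """Relate the signals: for each failed span, collect the logs sharing its trace_id."""
    logs_by_trace = {}
    for line in logs:
        logs_by_trace.setdefault(line.trace_id, []).append(line.message)
    return {s.trace_id: logs_by_trace.get(s.trace_id, []) for s in spans if s.error}


# Made-up sample data for the "buy new shoes" journey.
spans = [
    Span("t1", "checkout", 120.0, False),
    Span("t2", "checkout", 950.0, True),
    Span("t3", "payment", 80.0, False),
    Span("t4", "checkout", 110.0, False),
]
logs = [
    LogLine("t1", "checkout", "order created"),
    LogLine("t2", "checkout", "calling payment service"),
    LogLine("t2", "payment", "card provider timeout"),
]

print(error_rate(spans))              # the metric says something is failing
print(explain_failures(spans, logs))  # traces and logs hint at which request and why
```

In this sketch the metric alone only says that one request in four failed; it is the join on trace_id that points to the payment timeout behind that failure.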

Many tools will help you collect the proper information about Metrics, Logs, and Traces, such as:

  • Datadog (needs a commercial license)
  • Splunk (needs a commercial license)
  • New Relic (needs a commercial license)
  • Elasticsearch (free and open source)
  • Prometheus (free and open source)
  • Jaeger (free and open source)

There is an ocean of possibilities for making systems observable: you can build your own O11y platform or buy a vendor solution.

How to take advantage of Observability

We start to take advantage of observable platforms when we begin to answer the questions mentioned above.

Observable systems provide us with data that helps both tech and business make good decisions, such as:

  • Predict how to scale your infrastructure based on past events.
  • Discover the path a request takes inside your system.
  • Create alerts based on unexpected behavior.

These are just a few examples of things that can be improved or implemented when we have observable systems. As we can see, both tech and business can take advantage of that and provide genuine solutions to their customers.

Conclusion

Observability is de facto the new kid on the block, so all the other kids want to play with it, especially when they can also play with things like Kubernetes and microservices architecture.

As you can imagine, observability is a vast topic, and I can’t summarize everything in a single blog post. But we now have the basic knowledge to move forward into the following subjects.

I hope you enjoyed it.

Let me know what you think.

Is observability a crucial practice for companies?

Nicolas Takashi

I love to speak, teach, and write about distributed systems, cloud computing, architecture, systems engineering, and APIs.