Platform Monitoring — First concepts.

Nov 23, 2021

Well, it's been a long time since I started sharing my experience as a software engineer working in an observability team, but guess who's back?

Before we jump into advanced topics, I want to cover the fundamentals so that engineers at any level can follow this content. Today, I'm going to introduce monitoring as one of the pillars of observability.

What is monitoring?

Monitoring provides detailed visibility into performance, availability, user experience, and resource utilization, helping us deliver consistent platform performance with a low MTTR (Mean Time To Resolution).
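As a quick illustration of the metric itself, MTTR is commonly computed as the total time spent resolving incidents divided by the number of incidents. The sketch below assumes each incident is recorded as a (detected, resolved) pair; the function name and the incident timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents, for illustration only.
INCIDENTS = [
    (datetime(2021, 11, 1, 10, 0), datetime(2021, 11, 1, 10, 30)),  # 30 minutes
    (datetime(2021, 11, 8, 14, 0), datetime(2021, 11, 8, 15, 30)),  # 90 minutes
]

if __name__ == "__main__":
    print(mean_time_to_resolution(INCIDENTS))
```

With these two sample incidents the average works out to one hour.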

Why is monitoring critical?

When a platform experiences performance issues or is unavailable, the company that owns that platform risks losing customers. Monitoring tools provide real-time performance and availability insights that allow teams to react quickly when an issue arises.

Why should a platform use a monitoring tool?

A monitoring tool gives an organization several advantages, such as:

Reduce MTTR

A monitoring tool helps engineers understand what a normal day looks like by using performance metrics to set baselines, define proper alerts based on those metrics, and identify the root cause of performance or availability issues.
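One simple way to turn a baseline into an alert, sketched under assumptions: treat the mean of recent samples as the baseline and flag values that sit more than a few standard deviations above it. The sample latencies, the function names, and the 3-sigma threshold are illustrative choices, not a recommendation from any particular tool.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (mean, sample standard deviation) of historical samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return mean(samples), stdev(samples)


def should_alert(value, baseline, sigmas=3.0):
    """True when value is more than `sigmas` deviations above the baseline mean."""
    average, deviation = baseline
    return value > average + sigmas * deviation


# Hypothetical response times in milliseconds on a normal day.
NORMAL_DAY_MS = [100, 102, 98, 101, 99, 100, 103, 97]

if __name__ == "__main__":
    baseline = build_baseline(NORMAL_DAY_MS)
    for latency in (105, 250):
        print(latency, should_alert(latency, baseline))
```

Real alerting rules are usually richer than this (for example, requiring the condition to hold for several minutes), but the idea of comparing current values against a measured baseline is the same.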

Increase revenue

A monitoring tool helps engineering teams quickly identify critical performance or availability issues that frustrate the platform's customers. Frustrated customers start looking for other platforms to meet their needs, which directly impacts the company's revenue.

Reduce costs

A monitoring tool helps an organization understand what a normal day looks like by showing how much memory and CPU an application actually needs and whether the resources of a virtual machine are over-provisioned.

With that information, the engineering team can fine-tune resource utilization, shut down unneeded machines, and save infrastructure money.
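A rough sketch of that right-sizing reasoning: given what a machine has allocated and the peak usage observed by monitoring, suggest an allocation that keeps the peak at a chosen target utilization. The function name, the numbers, and the 50% target are assumptions for illustration only.

```python
def headroom_report(allocated, peak_usage, target_utilization=0.7):
    """Suggest an allocation so that observed peak usage sits at the target."""
    if allocated <= 0:
        raise ValueError("allocated must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    suggested = peak_usage / target_utilization
    return {
        "peak_utilization": peak_usage / allocated,
        "suggested_allocation": suggested,
        "over_provisioned": suggested < allocated,
    }


if __name__ == "__main__":
    # Hypothetical VM: 16 GiB of memory allocated, 4 GiB used at peak.
    print(headroom_report(allocated=16, peak_usage=4, target_utilization=0.5))
```

In this made-up case the machine peaks at 25% of its memory, so an 8 GiB allocation would still leave half of it as headroom.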

Monitoring strategies

There are two types of monitoring strategies: open-box monitoring, formerly known as white-box monitoring, and closed-box monitoring, formerly known as black-box monitoring.

Both strategies are important, and each provides a different kind of visibility into platform performance and availability.

Open-Box Monitoring

Open-box monitoring gives rich insight into the internals of a service or application. To better understand that, let's use the following simple Python service as an example.
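The snippet that was originally embedded here is not available, so what follows is a stand-in sketch rather than the original code. To stay dependency-free it uses only the Python standard library and writes the Prometheus text exposition format by hand; the metric name app_requests_total, the port, and the handler are assumptions. A real service would normally rely on the official Prometheus client library, which also adds default runtime metrics that this sketch does not produce.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 0  # hypothetical metric: number of HTTP requests served


def render_metrics(requests_total):
    """Render a single counter in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total HTTP requests served.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {requests_total}\n"
    )


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS_TOTAL
        REQUESTS_TOTAL += 1
        if self.path == "/metrics":
            body = render_metrics(REQUESTS_TOTAL)
        else:
            body = "Hello, world!\n"
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    # Serve on localhost:8000 (an arbitrary choice) until interrupted.
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

Running the script and requesting /metrics returns the counter in a format a Prometheus server can scrape.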

That simple service exposes Prometheus-style metrics on the /metrics endpoint. Instrumenting a service or application with a Prometheus client isn't complex at all, but if you are not familiar with it, I recommend taking a look at the official documentation.

From now on, when we invoke the metrics endpoint, we get the Prometheus metrics page, which returns a set of default metrics based on your tech stack.

Given that it is a simple example that only returns the default Python metrics, we can start adding custom metrics to monitor the service more precisely.
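To make the idea of a custom metric concrete, here is a sketch of a counter labeled by outcome and rendered in the Prometheus text format. The LabeledCounter class and the metric name orders_total are hypothetical and exist only for this illustration; in a real service the official Prometheus client library provides Counter, Gauge, and Histogram types for this purpose.

```python
from collections import defaultdict


class LabeledCounter:
    """A tiny stand-in for a Prometheus counter with a single label."""

    def __init__(self, name, help_text, label):
        self.name = name
        self.help_text = help_text
        self.label = label
        self._values = defaultdict(int)

    def inc(self, label_value, amount=1):
        """Increase the counter for one label value; counters never go down."""
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[label_value] += amount

    def render(self):
        """Return the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for value in sorted(self._values):
            lines.append(
                f'{self.name}{{{self.label}="{value}"}} {self._values[value]}'
            )
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    orders = LabeledCounter("orders_total", "Orders processed.", "status")
    orders.inc("paid")
    orders.inc("paid")
    orders.inc("failed")
    print(orders.render(), end="")
```

A business-level metric like this one tells us something the default runtime metrics cannot: how the service is doing at its actual job.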

Closed-Box Monitoring

Closed-box monitoring, formerly known as black-box monitoring, is a strategy to monitor services from the outside using artificial probes. It means that we don't leverage any internal service insights beyond the probe result.

What are artificial probes?

An artificial probe can be an HTTP request, a TCP socket connection, or a script that executes a specific task against the service or application.
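The first two probe types can be sketched with the Python standard library alone. The function names, timeouts, and the URL in the usage example are assumptions for illustration; dedicated probing tools offer far more options than this.

```python
import socket
import urllib.error
import urllib.request


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_probe(url, timeout=2.0, expected_status=200):
    """Return True if a GET request to url answers with the expected status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == expected_status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the status code is still available.
        return error.code == expected_status
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable host, timeout, or malformed URL all count as a failed probe.
        return False


if __name__ == "__main__":
    # Hypothetical targets; replace them with a service you own.
    print(tcp_probe("127.0.0.1", 8000))
    print(http_probe("http://127.0.0.1:8000/metrics"))
```

Note that both probes only see what an outside client would see: a success or a failure, with no knowledge of why.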

Why is closed-box monitoring necessary?

Closed-box monitoring is fundamental to implementing synthetic monitoring, which periodically simulates user behavior, and it is a strategy that many companies already use.
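A synthetic check can be thought of as a scripted user journey: a list of steps run in order, stopping at the first failure. The sketch below is a simplified assumption of how such a runner might look; the step names and the lambda actions are placeholders for real requests, and a scheduler would call run_journey at a fixed interval.

```python
import time


def run_journey(steps):
    """Run (name, action) steps in order; stop at the first failing step."""
    results = []
    for name, action in steps:
        started = time.monotonic()
        try:
            ok = bool(action())
        except Exception:
            # Any error raised by a step counts as a failed step.
            ok = False
        results.append(
            {"step": name, "ok": ok, "seconds": time.monotonic() - started}
        )
        if not ok:
            break
    return results


if __name__ == "__main__":
    # Placeholder actions; real steps would issue requests against the platform.
    journey = [
        ("open home page", lambda: True),
        ("log in", lambda: True),
        ("add item to cart", lambda: False),
        ("check out", lambda: True),
    ]
    for result in run_journey(journey):
        print(result["step"], "ok" if result["ok"] else "FAILED")
```

Stopping at the first failure mirrors what a real user would experience: if logging in is broken, the checkout step is never reached.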

For more information about synthetic monitoring and how closed-box monitoring supports it, check the official Grafana documentation.

Conclusion

As we can see, monitoring is crucial for any organization, no matter its size. Using open-box and closed-box monitoring strategies, an organization can respond quickly to incidents, scale its business, and increase customer happiness. More sophisticated monitoring implementations can help the organization predict incidents and proactively address performance and reliability issues.

As you may imagine, there are a bunch of monitoring tools available on the market, such as:

Zabbix
Nagios
Prometheus

We always use Prometheus as an example in these blog posts since it's a cloud-native technology that can monitor both traditional and cloud infrastructure.

Monitoring is a large and deep subject that we will cover step by step in the following posts. I hope you have enjoyed it. Please feel free to share your feedback or questions.

Written by Nicolas Takashi
