Platform Monitoring — First concepts.
Well, it's been a long time since I started sharing my experience as a software engineer working on an observability team, but guess who's back?
Before we jump into advanced topics, I want to cover the fundamentals so that engineers at any level can follow this content. Today, I'm going to introduce you to monitoring as one of the pillars of observability.
What is monitoring?
Monitoring provides detailed visibility into performance, availability, user experience, and resource utilization, helping us deliver consistent platform performance with a low MTTR (Mean Time To Resolution).
Why is monitoring critical?
When a platform experiences performance issues or is unavailable, the company that owns that platform risks losing customers. Monitoring tools provide real-time performance and availability insights that allow teams to react quickly when an issue arises.
Why should a platform use a monitoring tool?
A monitoring tool gives an organization several advantages, such as:
Reduce MTTR
A monitoring tool helps engineers understand what a normal day looks like by using performance metrics to establish baselines, set proper alerts based on those baselines, and identify the root cause of performance or availability issues faster.
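To make the idea of a baseline more concrete, here is a minimal sketch of how an alert threshold could be derived from metrics collected on normal days. The "mean plus three standard deviations" rule, the function names, and the latency samples are all assumptions chosen for illustration, not what any particular monitoring tool does.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Return (baseline, threshold), where threshold = mean + k * stddev."""
    baseline = mean(samples)
    return baseline, baseline + k * stdev(samples)


def should_alert(value, samples, k=3.0):
    """True when the current value is above the threshold derived from history."""
    _, threshold = baseline_threshold(samples, k)
    return value > threshold


# Hypothetical p95 latencies (in ms) observed during normal days.
normal_latency_ms = [120, 118, 125, 130, 122, 119, 127]
```

With those samples, a reading slightly above the average stays quiet, while a reading several times higher would trigger the alert. Real alerting rules usually also consider how long the value stays above the threshold.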
Increase revenue
A monitoring tool helps the engineering teams quickly identify a critical performance or availability issue that frustrates the platform's customers. Frustrated customers start looking for other platforms to meet their needs, which directly impacts the organization's revenue.
Reduce costs
A monitoring tool helps an organization understand what a normal day looks like by showing how much memory and CPU an application actually needs and whether the resources of a virtual machine are over-provisioned.
With that information, the engineering team can fine-tune resource utilization, shut down machines that are not needed, and save infrastructure money.
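As a rough illustration of that reasoning, the sketch below flags virtual machines whose observed peak usage stays well below what was provisioned. The `VmUsage` type, the 50% cut-off, and the fleet numbers are hypothetical; in practice the peaks would come from your monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class VmUsage:
    name: str
    provisioned_cpu_cores: float
    peak_cpu_cores: float
    provisioned_memory_gib: float
    peak_memory_gib: float


def peak_utilization(vm):
    """Return (cpu, memory) peak utilization as fractions of provisioned capacity."""
    return (
        vm.peak_cpu_cores / vm.provisioned_cpu_cores,
        vm.peak_memory_gib / vm.provisioned_memory_gib,
    )


def over_provisioned(vms, max_peak_utilization=0.5):
    """Names of VMs whose CPU and memory peaks are both below the cut-off."""
    flagged = []
    for vm in vms:
        cpu, memory = peak_utilization(vm)
        if cpu < max_peak_utilization and memory < max_peak_utilization:
            flagged.append(vm.name)
    return flagged


# Hypothetical fleet: api-1 peaks at about a quarter of its capacity.
fleet = [
    VmUsage("api-1", 8, 2.0, 32, 9.0),
    VmUsage("db-1", 16, 13.0, 64, 51.0),
]
```

Calling `over_provisioned(fleet)` returns the candidates for a smaller instance size, which is only a starting point for a human decision.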
Monitoring strategies
There are two types of monitoring strategies: open-box monitoring, formerly known as white-box monitoring, and closed-box monitoring, formerly known as black-box monitoring.
Both strategies are important, and each provides a different view of the platform's performance and availability.
Open-Box Monitoring
Open-box monitoring gives rich insight into the service or application from the inside, using metrics that the service itself exposes. To better understand that, let's use a simple Python service as an example.
That simple service exposes Prometheus-style metrics on the /metrics endpoint. Instrumenting a service or application with a Prometheus client isn't complex at all, but if you are not familiar with it, I recommend taking a look at the official documentation.
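A minimal sketch of such a service could look like the code below. It is not the official Prometheus client: to stay dependency-free, it hand-writes the Prometheus text exposition format using only the standard library, which the client library would normally do for you. The port, the handler, and the two metrics (named after metrics the official Python client typically exposes) are assumptions for illustration.

```python
import platform
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
# Content type used by the Prometheus text exposition format.
CONTENT_TYPE = "text/plain; version=0.0.4; charset=utf-8"


def render_metrics() -> str:
    """Build the /metrics page by hand (a client library normally does this)."""
    version = platform.python_version()
    lines = [
        "# HELP python_info Python platform information",
        "# TYPE python_info gauge",
        f'python_info{{version="{version}"}} 1.0',
        "# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.",
        "# TYPE process_start_time_seconds gauge",
        f"process_start_time_seconds {START_TIME}",
    ]
    return "\n".join(lines) + "\n"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode("utf-8")
            content_type = CONTENT_TYPE
        else:
            body = b"Hello from the example service\n"
            content_type = "text/plain; charset=utf-8"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port: int = 8000) -> None:
    """Serve until interrupted; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `run()` starts the service, and a request to `http://127.0.0.1:8000/metrics` returns the page that a Prometheus server would scrape.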
From now on, when we invoke the metrics endpoint, we get the Prometheus metrics page, which returns a set of default metrics based on your tech stack.
Since this simple example only returns the default Python metrics, we can start adding custom metrics to monitor the service more precisely.
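To give an idea of what a custom metric looks like, here is a small sketch of a labelled counter that renders itself in the Prometheus text format. `SimpleCounter` is a hypothetical stand-in written for this post; the official client ships its own `Counter` type with a different API, and this sketch skips details such as label value escaping.

```python
import threading
from collections import defaultdict


class SimpleCounter:
    """Hypothetical, dependency-free counter with labels (illustration only)."""

    def __init__(self, name, help_text, label_names=()):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)
        self._lock = threading.Lock()

    def inc(self, amount=1.0, **labels):
        """Increase the counter for one label combination; counters never go down."""
        if amount < 0:
            raise ValueError("a counter can only be increased")
        key = tuple(str(labels[name]) for name in self.label_names)
        with self._lock:
            self._values[key] += amount

    def render(self):
        """Return the HELP/TYPE header plus one sample line per label combination."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            samples = sorted(self._values.items())
        for key, value in samples:
            if self.label_names:
                pairs = ",".join(
                    f'{n}="{v}"' for n, v in zip(self.label_names, key)
                )
                lines.append(f"{self.name}{{{pairs}}} {value}")
            else:
                lines.append(f"{self.name} {value}")
        return "\n".join(lines) + "\n"
```

A service would call something like `requests_total.inc(method="GET", status="200")` on every request and append `requests_total.render()` to its metrics page, which lets us ask questions such as "how many requests failed per HTTP method?".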
Closed-Box Monitoring
Closed-box monitoring, formerly known as black-box monitoring, is a strategy to monitor services from the outside using artificial probes. It means that we don't leverage any internal service insights beyond the probe result.
What are artificial probes?
An artificial probe can be an HTTP request, a TCP socket connection, or a script that executes a specific task against the service or application.
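The first two kinds of probe can be sketched with the Python standard library alone. The `ProbeResult` type, the function names, and the "any 2xx status is healthy" rule are assumptions for illustration; dedicated probing tools offer many more checks.

```python
import socket
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    success: bool
    duration_seconds: float
    detail: str


def tcp_probe(host, port, timeout=3.0) -> ProbeResult:
    """Succeeds if a TCP connection can be opened within the timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return ProbeResult(True, time.monotonic() - start, "connected")
    except OSError as exc:
        return ProbeResult(False, time.monotonic() - start, str(exc))


def http_probe(url, timeout=3.0) -> ProbeResult:
    """Succeeds if the URL answers with a 2xx status within the timeout."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = 200 <= response.status < 300
            return ProbeResult(ok, time.monotonic() - start, f"HTTP {response.status}")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return ProbeResult(False, time.monotonic() - start, f"HTTP {exc.code}")
    except OSError as exc:
        # Covers URLError, refused connections, DNS failures, and timeouts.
        return ProbeResult(False, time.monotonic() - start, str(exc))
```

Notice that both probes only know what an outside user would know: whether the service answered, how it answered, and how long it took.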
Why is closed-box monitoring necessary?
Closed-box monitoring is fundamental to implementing synthetic monitoring, which periodically simulates user behavior, and this is a strategy that many companies already use.
For more information about synthetic monitoring and how closed-box monitoring helps that implementation, check the official Grafana documentation.
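To illustrate what "simulating user behavior periodically" can mean, the sketch below runs a list of named steps in order, stops at the first failure, and repeats the whole journey on an interval. The function names and the step format are hypothetical, and this is not how Grafana or any other product implements synthetic monitoring.

```python
import time


def run_synthetic_check(steps):
    """Run (name, action) steps in order and stop at the first failing step.

    Each action is a callable returning True on success. An exception counts
    as a failure. Returns a list of (name, ok, duration_seconds) tuples.
    """
    results = []
    for name, action in steps:
        start = time.monotonic()
        try:
            ok = bool(action())
        except Exception:
            ok = False
        results.append((name, ok, time.monotonic() - start))
        if not ok:
            break  # later steps depend on this one, like a real user journey
    return results


def run_periodically(steps, interval_seconds, iterations, report=print):
    """Repeat the check a fixed number of times, reporting each run."""
    for _ in range(iterations):
        report(run_synthetic_check(steps))
        time.sleep(interval_seconds)
```

A step could be anything a user would do, such as opening the home page, logging in, or placing an order, each one implemented as a probe against the running platform.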
Conclusion
As we can see, monitoring is crucial for any organization, no matter its size. Using open-box and closed-box monitoring strategies, the organization can respond quickly to incidents, scale its business, and increase customer happiness. More sophisticated monitoring implementations can help the organization predict incidents and proactively approach performance and reliability issues.
As you may imagine, there are a bunch of monitoring tools available on the market, such as:
- Zabbix
- Nagios
- Prometheus
We always use Prometheus as an example in these blog posts since it's a cloud-native technology that monitors both traditional and cloud infrastructure.
Monitoring is a large and deep subject that we will be covering step by step in the following posts. I hope you have enjoyed it. Please feel free to share your feedback or questions.