Nagios! – The world’s most popular monitoring system. Love it or hate it, it’s here to stay. Way back in 2005, when Nagios started gaining popularity in the industry, I took notice and started exploring the world of monitoring. I didn’t get a chance to implement it until 2010, though, when our server count began increasing beyond a dozen.
As a first step, I selected Nagios Core, as it was the industry leader with years of development behind it. With Nagios we were never left having to explain why an unforeseen infrastructure outage hurt the company’s bottom line. Everything from switches, servers and firewalls to application logic functions() was monitored and graphed via Nagios+Pnp4Nagios.
Not many months had gone by when the virtual server concept became popular and server costs started dropping. A little later, in 2012, cloud computing took off and reduced costs even further. Nagios Core had to keep up with the ever-growing number of services it was monitoring (from a few hundred services to a few thousand). Once we passed 3000 services, we noticed a minor but annoying lag in the Nagios check latency: it was growing from a few seconds to a few dozen seconds on average. We simply ignored it, but the delay kept piling up on the service checks until it was simply unbearable. The check latency had now grown from a few seconds to over 5 minutes. After spending months trying to tweak Nagios, I decided to go back to square one: the Nagios installation documentation.
Buried deep in the documents, I found the following topic: “Bulk processing of Performance Data” – bulk mode with NPCD. https://docs.pnp4nagios.org/pnp-0.6/config
It took me no more than 10 minutes to configure the new Bulk Mode (NPCD daemon) for Pnp4Nagios. And behold, the Nagios latency dropped back from 5+ minutes to microseconds. Today we monitor over 5000 services (ranging from firewalls, protocols, software and application functions() to switches and routers), not to mention several hundred cloud servers, while maintaining a 1.2s max latency between cycles.
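For anyone wondering why such a small change makes such a big difference: in bulk mode with NPCD, Nagios no longer launches a processing command for every single check result. It just appends a line to a file, that file is moved into a spool directory at intervals, and NPCD works through the spool on its own schedule. Below is a rough Python sketch of that append, rotate and drain pattern. It is only an illustration of the idea, not Nagios or Pnp4Nagios code; the function names, file names and the simplified line format are all made up for this example.

```python
import os
import tempfile
import time


def record_result(perfdata_file, host, service, perfdata, now=None):
    """Append one check result to the shared file (cheap: no process is started)."""
    ts = int(now if now is not None else time.time())
    # Simplified, hypothetical line layout: tab-separated KEY::value pairs.
    line = (
        f"TIMET::{ts}\tHOSTNAME::{host}\t"
        f"SERVICEDESC::{service}\tSERVICEPERFDATA::{perfdata}\n"
    )
    with open(perfdata_file, "a", encoding="utf-8") as fh:
        fh.write(line)


def rotate_to_spool(perfdata_file, spool_dir, now=None):
    """Move the accumulated file into the spool directory under a timestamped name."""
    if not os.path.exists(perfdata_file):
        return None  # nothing was recorded since the last rotation
    ts = int(now if now is not None else time.time())
    target = os.path.join(spool_dir, f"service-perfdata.{ts}")
    os.replace(perfdata_file, target)
    return target


def drain_spool(spool_dir):
    """Process every spooled file in one pass, the way a separate daemon would."""
    records = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                parts = line.rstrip("\n").split("\t")
                records.append(dict(part.split("::", 1) for part in parts))
        os.remove(path)  # a file is deleted once it has been handled
    return records


def demo():
    """Record two results, rotate once, then drain the spool."""
    with tempfile.TemporaryDirectory() as root:
        spool = os.path.join(root, "spool")
        os.mkdir(spool)
        perfdata_file = os.path.join(root, "service-perfdata")
        record_result(perfdata_file, "web01", "HTTP", "time=0.012s", now=1000)
        record_result(perfdata_file, "web01", "Load", "load1=0.42", now=1001)
        rotate_to_spool(perfdata_file, spool, now=1015)
        return drain_spool(spool)


if __name__ == "__main__":
    for record in demo():
        print(record)
```

The point of the design is that the expensive part (parsing and writing graph data) is taken out of the check loop, so the scheduler only ever pays for a file append.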
As time progressed and open source graphing became popular, my colleagues found a better graphing solution than the basic Pnp4Nagios. Even the aggregated graphs from Pnp4Nagios (distributed templates and special templates) could not compete with Grafana’s superior dashboards, dynamic graphing and multiple data sources. I just had to have the best of graphing with the best of monitoring, which brought me to the following documentation. https://support.nagios.com/kb/article/nagios-core-using-grafana-with-pnp4nagios-803.html
Implementing and configuring Grafana for Nagios couldn’t have been any easier. After integrating the two, many of our headaches were solved once and for all. Today we enjoy the luxury of Nagios monitoring that proactively resolves many issues before human intervention is even required (through Event Handlers, but that’s another day’s topic) 🙂