Rules to Restore Sanity to Your IT Monitoring Tools

March 19, 2015

“If I had eight hours to chop down a tree, I’d spend six hours sharpening my ax.” – Abraham Lincoln.

The scenario is always the same: the IT department picked up a new monitoring system and it has all the bells and whistles. Web interface? Check. Monitors Windows top to bottom? Check. Monitors network infrastructure? Check. Monitors VMware health? Check.

Absolutely no one pays attention to it and all the alerts pile up in some obscure inbox subfolder? Double check.

This scenario plays out the same whether it is with SolarWinds, PRTG, ServerCheck, or something else. Companies invest in top-of-the-line monitoring systems that end up misconfigured or not configured at all, so they do nothing but take up resources. I always picture these monitoring systems sitting in a corner of the server room, ignored, like the unstable guy in a sandwich sign that reads, “The end is nigh.” And why wouldn’t I? When configured incorrectly, your monitoring system is essentially the boy who cried wolf.

Help is here. I’m not here to provide click-by-click setup instructions or screenshots, but I can help you plan and prepare to implement your monitoring system. I’m here to help you sharpen your ax.

If I could sum this blog post up in one sentence, it would be: Your monitoring system should get meaningful information to the right people at the right time.

Shall we begin?

Rule 1: Configure Judiciously

Monitor for information and events that are useful to you and your business. Most often, I see monitoring systems that are configured to monitor anything and everything. While it’s tempting to really, really dig into all of your systems, you run the risk of information overload. Too much information is as bad as no information at all.

Rule 2: Use the Right Tools

Avoid one-size-fits-all IT monitoring tools. The monitoring criteria for file servers are different from those for print servers. Ninety-five percent memory usage on a SQL server is SQL being SQL. Ninety-five percent memory usage on a Citrix server indicates that the server is either overutilized or underpowered. Some things overlap, but not enough to justify the same criteria for all servers, switches, etc. Instead, I recommend templates for different roles that are then tailored to each individual system.
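To make the template idea concrete, here is a minimal Python sketch. It is not tied to any particular monitoring product: the role names, the `Thresholds` fields, and every number in it are assumptions made up for illustration, so substitute whatever your own baselines tell you.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    """Alert thresholds for one system (all values are illustrative)."""
    memory_warn_pct: float
    memory_crit_pct: float
    disk_free_warn_gb: float


# Hypothetical per-role templates. SQL is expected to sit near full memory,
# so its memory thresholds are far looser than the Citrix ones.
ROLE_TEMPLATES = {
    "sql": Thresholds(memory_warn_pct=98.0, memory_crit_pct=99.5, disk_free_warn_gb=20.0),
    "citrix": Thresholds(memory_warn_pct=85.0, memory_crit_pct=95.0, disk_free_warn_gb=10.0),
    "file": Thresholds(memory_warn_pct=90.0, memory_crit_pct=97.0, disk_free_warn_gb=50.0),
}


def thresholds_for(role: str, **overrides: float) -> Thresholds:
    """Start from the role template, then apply per-system overrides."""
    return replace(ROLE_TEMPLATES[role], **overrides)


# A SQL server with an unusually large data volume keeps the SQL memory
# thresholds but gets its own disk threshold.
big_sql = thresholds_for("sql", disk_free_warn_gb=100.0)
```

The point of the design is that the role carries the defaults and each system only records where it differs, which keeps the exceptions visible.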

Rule 3: Don’t Cry “Wolf”

Be stingy with the “system down” alerts. “System down” is the IT equivalent of the “red alert” in Star Trek. Captain Kirk didn’t get called to the bridge with shields up and phasers locked every time an asteroid got within a few kilometers, right? Your monitoring system should not raise the red alert unless there is truly an outage. Speaking of reacting to red alerts and downed systems…

Rule 4: Head off Problems

Use your monitoring system to be proactive as well as reactive. Back to Star Trek: imagine if Kirk were called to the bridge and a red alert raised only after the Romulan Warbird was in range, shields raised, and firing on the Enterprise. Would Sulu have a job for very long, provided the Enterprise survived? Now imagine it’s Saturday evening and you get an alarm that your Exchange server has less than 1 GB of space on C:. You get this alert via text because backpressure has kicked in and your Exchange server isn’t sending mail. You call your boss, expand the drive, and all’s good. That’s being reactive. Use your monitoring system to warn you of potential problems before they are problems. Why react to problems when you can prevent them?
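One way to picture the difference is a two-tier threshold on free disk space, sketched below in Python. The 1 GB critical level comes from the Exchange example above; the 10 GB warning level and the function name are assumptions chosen only for illustration.

```python
def classify_free_space(free_gb: float, warn_gb: float = 10.0, crit_gb: float = 1.0) -> str:
    """Return "critical", "warning", or "ok" for a volume's free space.

    warn_gb is the proactive threshold (assumed value); crit_gb is the
    point at which the system is effectively down.
    """
    if free_gb < crit_gb:
        return "critical"  # reactive: something is already broken
    if free_gb < warn_gb:
        return "warning"   # proactive: fix it before it becomes an outage
    return "ok"
```

With a warning tier in place, the Exchange server in the story would have raised a flag days earlier, while there was still room to expand the drive calmly.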

Rule 5: Limit Proactive Alerts to Working Hours

Set schedules for alerting. Where you can, limit informational or warning messages to business hours. Emails, text messages, etc. take time and attention from you and your team to process. If you or your team members get proactive alerts on nights and weekends, these are things you are not going to address until later anyway. Filling up your inbox with non-emergency alerts during off hours only increases the likelihood that real alerts will be ignored. Reserve night and weekend alerts for those critical system down “red alerts.”
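As a rough sketch of that schedule in Python: critical alerts go out at any hour, everything else waits for business hours. The Monday-to-Friday, 8:00 to 17:00 window and the function names are assumptions; most monitoring products expose the same idea as a notification schedule in their own settings.

```python
from datetime import datetime, time

# Assumed business hours: Monday to Friday, 08:00 up to (not including) 17:00.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(when: datetime) -> bool:
    """True if `when` falls on a weekday inside the business-hours window."""
    is_weekday = when.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and BUSINESS_START <= when.time() < BUSINESS_END


def should_notify_now(severity: str, when: datetime) -> bool:
    """Critical alerts always go out; all others wait for business hours."""
    if severity == "critical":
        return True
    return in_business_hours(when)
```

A warning that arrives on Saturday evening is not lost under this scheme; it is simply held until someone is in a position to act on it.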

Rule 6: Understand how Monitoring Uses Resources

Monitoring takes resources. All too often I see monitoring systems set up to run dozens of WMI or SNMP queries every minute (or even every 30 seconds!). Each of these queries uses network and system resources. As they add up, they can negatively impact system and network performance. As with a doctor, your first directive is to do no harm. If the difference between your file server being “up” and being out of space is 30 seconds, you have much bigger fish to fry.
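A quick back-of-the-envelope calculation shows how fast the queries add up. The Python sketch below uses made-up numbers (200 devices, 12 checks each); plug in your own counts.

```python
def queries_per_hour(devices: int, checks_per_device: int, interval_seconds: int) -> int:
    """Total WMI/SNMP queries issued per hour at a given polling interval."""
    polls_per_hour = 3600 // interval_seconds
    return devices * checks_per_device * polls_per_hour


# Hypothetical environment: 200 devices with 12 checks each.
every_30_seconds = queries_per_hour(200, 12, 30)   # 288,000 queries per hour
every_5_minutes = queries_per_hour(200, 12, 300)   # 28,800 queries per hour
```

In this example, moving from a 30-second interval to a 5-minute interval cuts the query volume by a factor of ten, and for something like disk space you are unlikely to miss anything that matters.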

Rule 7: Keep Your Monitoring Processes Up to Date

Revisit and review your monitoring system on a regular basis. As your network expands, updates, and changes, so will your monitoring needs. Start simple with your monitoring and work your way into more detailed and complex monitoring policies.

If you have some questions about how you can put these rules into your monitoring practices, send us an email or give us a call: 502-240-0404.