Cluster Monitoring

From UF HPC Wiki


This page is a working list of things to do in order to maintain a cluster.

Monitoring

Depending on the size of your cluster, you may want extensive monitoring tools. At the very least, make sure you have a tool that shows at a glance which machines are up and which are down.

  • Nagios: Gives a quick overview of the status of your cluster, and also has tools for more in-depth monitoring of individual nodes and services. Highly recommended.
  • Ganglia: Used more for performance measurement. It produces graphs of what each node is doing, including load, CPU utilization, I/O performance, and network traffic. On a large cluster, however, it can generate fairly heavy network traffic.
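
For a very small cluster, or as a stopgap before Nagios is set up, the "what is up and what is down" overview can be approximated with a short script. The following is only a minimal sketch under stated assumptions: it treats a node as "up" if a TCP connection to its SSH port (22) succeeds, which is a rough proxy and not how Nagios itself checks hosts. The function names and the node names in the usage comment are hypothetical.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def is_up(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_nodes(hosts, port=22, timeout=2.0, workers=32):
    """Probe all hosts concurrently and return a {host: bool} mapping."""
    hosts = list(hosts)
    if not hosts:
        return {}
    with ThreadPoolExecutor(max_workers=min(workers, len(hosts))) as pool:
        results = pool.map(lambda h: is_up(h, port, timeout), hosts)
        return dict(zip(hosts, results))


def format_report(status):
    """Render one line per node plus a summary line."""
    lines = [f"{host:<20} {'UP' if up else 'DOWN'}"
             for host, up in sorted(status.items())]
    down = sum(1 for up in status.values() if not up)
    lines.append(f"{len(status) - down} up, {down} down")
    return "\n".join(lines)


# Example usage (node names are placeholders for your own host list):
#   print(format_report(check_nodes(["node01", "node02", "node03"])))
```

A node that answers on port 22 may still be unhealthy, so this only covers the quick-glance case; per-service checks and performance graphs are what the tools listed above are for.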

Other Tips

  • Configure smartd to email a centralized account instead of the local root account on each node. This way you can spot a drive that is going bad before it actually fails. SMART technology has come a long way and can give good early indications of an impending drive failure.
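
As an illustration of the smartd tip, the sketch below builds the kind of smartd.conf line that sends warnings to one shared mailbox. It assumes the standard smartd.conf directives DEVICESCAN (scan for all devices), -a (default monitoring) and -m (mail address), plus -M test to send a test message when smartd starts. The helper name and the example address are hypothetical, and how the line is distributed to each node is left to whatever configuration management you use.

```python
def smartd_directive(mail_to, device="DEVICESCAN", send_test_mail=False):
    """Build one smartd.conf line that mails SMART warnings to mail_to.

    mail_to is the centralized account; device defaults to DEVICESCAN,
    which tells smartd to scan for all devices it can monitor.
    """
    if "@" not in mail_to or any(ch.isspace() for ch in mail_to):
        raise ValueError(f"not a usable mail address: {mail_to!r}")
    parts = [device, "-a", "-m", mail_to]
    if send_test_mail:
        # -M test makes smartd send one test message at startup,
        # which confirms that mail from the node actually arrives.
        parts += ["-M", "test"]
    return " ".join(parts)


# Example usage (the address is a placeholder):
#   line = smartd_directive("hpc-admin@example.com", send_test_mail=True)
#   # then write `line` to smartd.conf on each node and restart smartd
```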