Cluster Monitoring

From UF HPC Wiki


This is a working list of things to do in order to maintain a cluster.

Monitoring

Depending on the size of your cluster, you may want some very extensive monitoring tools. At the very least, make sure you have a monitoring tool that shows at a glance which machines are up and which are down.

  • Nagios: This is a tool that can give you a quick overview of the status of your cluster, but also has tools which allow more in-depth monitoring of individual nodes and services. Highly recommended.
  • Ganglia: This is used more as a performance measuring stick. It gives wonderful graphs of what each node is doing, including load, CPU utilization, I/O performance, network traffic, etc. Unfortunately, the monitoring itself can generate some pretty heavy network traffic if your cluster is a large one.
  • Torque/PBS/Moab/Maui monitoring: Not necessarily a live monitoring capability, but this is good for looking at past statistics of how your cluster is being used.
    • qstat/showq will show you at any given time how the queue is being utilized and how the nodes are being used.
    • pbsnodes -l will show you what nodes are currently offline/down.
    • The accounting logs from Torque can also be used. We have written scripts that take the data from these accounting logs and inject it into a MySQL database, from which we can then mine the information for websites. At one time there was a method for Moab/Maui to have the same kind of data injected into a database, but it appears that functionality has been removed from the software in more recent releases.
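
The accounting-log approach above can be sketched roughly as follows. This is not the script we actually run; it is a minimal illustration that assumes the usual Torque accounting record layout (timestamp;record type;job id;space-separated key=value pairs) and uses Python's built-in sqlite3 as a stand-in for MySQL. The table name, column names, and sample records are made up for the example.

```python
import sqlite3
from datetime import datetime

# Made-up sample records in the assumed Torque accounting layout:
# MM/DD/YYYY HH:MM:SS;record_type;job_id;key=value key=value ...
SAMPLE = [
    "04/01/2013 10:00:00;S;101.head;user=alice group=hpc queue=batch",
    "04/01/2013 11:30:00;E;101.head;user=alice group=hpc queue=batch "
    "resources_used.walltime=01:30:00 Exit_status=0",
    "04/01/2013 12:00:00;E;102.head;user=bob group=hpc queue=batch "
    "resources_used.walltime=00:15:30 Exit_status=0",
]


def parse_record(line):
    """Split one accounting line into (timestamp, type, job_id, attrs)."""
    stamp, rec_type, job_id, message = line.rstrip("\n").split(";", 3)
    attrs = {}
    # Simplification: values that contain spaces are not handled here.
    for token in message.split():
        key, sep, value = token.partition("=")
        if sep:
            attrs[key] = value
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return when, rec_type, job_id, attrs


def walltime_seconds(text):
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def load_end_records(conn, lines):
    """Insert job-end ("E") records into a hypothetical jobs table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, username TEXT, queue TEXT, "
        "ended TEXT, walltime_s INTEGER)"
    )
    for line in lines:
        if line.count(";") < 3:
            continue  # skip blank or malformed lines
        ended, rec_type, job_id, attrs = parse_record(line)
        if rec_type != "E":
            continue  # only job-end records carry resources_used.*
        walltime = attrs.get("resources_used.walltime")
        conn.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?, ?)",
            (
                job_id,
                attrs.get("user"),
                attrs.get("queue"),
                ended.isoformat(),
                walltime_seconds(walltime) if walltime else None,
            ),
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_end_records(conn, SAMPLE)
    query = (
        "SELECT username, SUM(walltime_s) FROM jobs "
        "GROUP BY username ORDER BY username"
    )
    for row in conn.execute(query):
        print(row)
```

In practice the lines would come from the daily files in the server's accounting directory rather than from a list in the script, and the same INSERT logic could be pointed at a MySQL connection instead.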

Other Tips

  • Configure smartd so that it emails a centralized account instead of the local root account of each node. This way you can tell that a drive is going bad before it actually fails. SMART technology has come a long way, and can give you some great indications of when a drive is about to fail.
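
As a rough illustration of the smartd tip, a single line in smartd.conf along these lines should be enough on most nodes. The address is a placeholder, and the file location (commonly /etc/smartd.conf) varies by distribution.

```
# Scan all detected drives, run the default set of checks (-a), and mail
# warnings to one shared mailbox instead of root on the local node.
# hpc-alerts@example.com is a placeholder address.
# -M test sends a single test message when smartd starts, which confirms
# that mail delivery from the node actually works.
DEVICESCAN -a -m hpc-alerts@example.com -M test
```

smartd ignores any configuration lines that follow the first DEVICESCAN entry, so replace the stock DEVICESCAN line rather than adding a second one. The node also needs a working mail command for the messages to be delivered.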