Last night, just as I was falling asleep, I heard my phone ring out a text message tone. (Incidentally, it’s this sound, a customized version of one found on the Monty Python website. It can be startling.)
I leaned over to look at it. The head node from the cluster was checking in to let me know it was feeling a little warm (80 degrees F). The temperature in our server room keeps creeping up. Today it was blamed on a campus-wide chilled-water problem. The cluster is only using about 1/4 of the cooling capacity in the server room, and is almost the only machine in there. We should not be having these problems.
I use the IPMI capabilities of the head node to check the internal sensor every ten minutes, and it emails and texts me if the temperature exceeds some threshold. It is hard, though, to get a good idea of what’s going on from a single temperature at a single time. Has the temp been slowly rising? Has it jumped from 77 to 95 in ten minutes?
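A minimal sketch of how those two questions could be answered from a log of readings. This is not the monitoring script itself: the function name `temperature_changes`, the `(datetime, temp_f)` tuple layout, and the ten-minute default window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def temperature_changes(readings, window=timedelta(minutes=10)):
    """Return (recent_change, overall_change) in degrees F.

    readings: list of (datetime, temp_f) tuples, oldest first (assumed layout).
    recent_change compares the latest reading with the newest reading that is
    at least `window` older, falling back to the oldest reading if none is.
    overall_change compares the latest reading with the oldest one.
    Returns None when there are fewer than two readings.
    """
    if len(readings) < 2:
        return None
    latest_time, latest_temp = readings[-1]
    oldest_temp = readings[0][1]
    reference_temp = oldest_temp
    for when, temp in readings:
        if latest_time - when >= window:
            # Readings are oldest first, so the last match is the newest
            # reading that is still old enough.
            reference_temp = temp
    return latest_temp - reference_temp, latest_temp - oldest_temp
```

A large recent change points to a sudden jump, while a small recent change with a larger overall change points to a slow climb.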
To make this easier to check, I wrote a little script that generates a plot from the last 24 hours of my temperature log. You can see the result below. The image is pulled straight from the cluster, so it should be up to date regardless of when you’re viewing this post. If you’re looking at it shortly after this is posted, you can still see the spike that woke me up and the sharper one that got me worried this morning.
The temperature is of course not jumping like that; the precision of the sensor is limited. The rightmost edge is “now”, in case that’s not clear. I used xmgrace to produce this plot from the command line, driven by a script. If you’d like any of the code for this, let me know.
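In the meantime, here is a rough sketch of what the "last 24 hours" step might look like. It is not the script running on the cluster. The log format (one `epoch_seconds temp_f` pair per line) and the function name `last_day_for_plot` are assumptions for illustration.

```python
import time


def last_day_for_plot(log_lines, now=None, span_seconds=24 * 3600):
    """Return 'hours_relative_to_now temp_f' rows for recent readings.

    log_lines: iterable of 'epoch_seconds temp_f' strings (assumed format).
    Rows keep the order of the log; readings older than span_seconds or
    stamped in the future are dropped.
    """
    if now is None:
        now = time.time()
    rows = []
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        stamp, temp = float(parts[0]), float(parts[1])
        age = now - stamp
        if 0 <= age <= span_seconds:
            # Hours are negative so that "now" (0) lands on the right edge.
            rows.append(f"{(stamp - now) / 3600:.3f} {temp:.1f}")
    return rows
```

Written to a file one row per line, this is plain two-column text, which Grace can read as an x-y data set, so something like its batch mode could turn it into an image without opening the GUI.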