Category Archives: Linux

Killing Zombies

Occasionally, on our cluster, a node will crash. If a job was running on it that spanned multiple nodes, sometimes the other nodes won’t get the message that their fellow has crashed, and they will just keep running whatever processes are on them.

I call these “Zombie” processes, because they just lumber along eating CPU time and rotting, keeping other jobs from using the node. Today, after noticing a particularly bad zombie infestation, I finally created a script that checks for zombified machines and then restarts them. This script is compatible with Torque and relies on Scyld Beowulf “b-commands” and IPMI, but you could easily replace them with similar utilities like rsh or ssh.


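#!/bin/bash
# For all of the nodes in the main cluster...
for NODE in $(seq 0 119); do
     # Get the 5-minute load average and convert it to an integer.
     # The load averages are always the last three fields of uptime's
     # output, so count from the end of the line rather than the start.
     LOAD=$(bpsh "$NODE" uptime | awk '{ print $(NF-1) }' | tr -d ',')
     LOAD=$(printf %1.0f "${LOAD:-0}")
     # Figure out whether the node should be running anything.
     # Caveat: a bare number also matches longer numbers that contain it
     # (1 matches 10, 11, ...), so adapt the pattern to your node names.
     ASSIGNED=$(qstat -f | grep -c "$NODE")
     # If the node is running something but shouldn't be, reboot it.
     if [ "$LOAD" -gt 1 ] && [ "$ASSIGNED" -eq 0 ]; then
          echo "Node $NODE is a zombie! Kicking." >> /root/logs/zombies.log
          # This relies on IPMI. Replace the (some user) and (some password)
          # placeholders with real credentials before running.
          ipmitool -H 10.54.2.$(( 100 + NODE )) -U (some user) -P (some password) power reset
     fi
done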
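
For clusters without the Scyld b-commands, the two decisions the script makes (how loaded is the node, and is it a zombie) can be pulled out into small functions and fed from ssh instead. This is only a sketch under assumptions: the function names are made up for illustration, and it presumes passwordless ssh to each node.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative and not part of the
# original script.

# Print the 5-minute load average from one line of `uptime` output,
# rounded to an integer. The load averages are always the last three
# fields, so count from the end of the line.
load_from_uptime() {
    local load
    load=$(printf '%s\n' "$1" | awk '{ print $(NF-1) }' | tr -d ',')
    printf '%1.0f\n' "${load:-0}"
}

# Succeed (exit status 0) if a node is busy but has no job assigned.
# $1 = integer load, $2 = number of jobs assigned to the node
is_zombie() {
    [ "$1" -gt 1 ] && [ "$2" -eq 0 ]
}
```

With a (hypothetical) host called node17, the load would come from something like `load_from_uptime "$(ssh node17 uptime)"`, and the job count from qstat as in the script above.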

You can download the file directly: Zombie Checker

(Here’s another post on Zombies).

How to Cook a Server (Sunny Side Up?)

Last night, just as I was falling asleep, I heard my phone ring out a text message tone. (Incidentally, it’s this sound, a customized version of one found on the Monty Python website. It can be startling.)

I leaned over to look at it. The head node from the cluster was checking in to let me know it was feeling a little warm (80 degrees F). The temperature in our server room keeps creeping up. Today it was blamed on a campus-wide chilled-water problem. The cluster is only using about 1/4 of the cooling capacity in the server room, and is almost the only machine in there. We should not be having these problems.

I use the IPMI capabilities of the head node to check the internal sensor every ten minutes, and it emails and texts me if the temperature exceeds some threshold. It is hard, though, to get a good idea of what’s going on from a single temperature at a single time. Has the temp been slowly rising? Has it jumped from 77 to 95 in ten minutes?
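
The monitoring script itself is not shown here, but the idea can be sketched in a few functions: log each reading, compare it with a threshold, and look at the change since the previous reading to tell a slow creep from a sudden jump. The log format (epoch seconds, then the temperature) and the function names are assumptions made for this example, and how the reading is obtained from IPMI and how the alert is sent are left out.

```shell
#!/bin/bash
# Hypothetical sketch; the log format ("<epoch seconds> <temperature>")
# and the function names are assumptions, not the actual monitoring script.

# Append a reading to the log. $1 = log file, $2 = temperature
log_temp() {
    printf '%s %s\n' "$(date +%s)" "$2" >> "$1"
}

# Print the change between the last two logged readings (0 if there is
# only one reading so far). $1 = log file
last_change() {
    tail -n 2 "$1" | awk '{ prev = cur; cur = $2 }
        END { if (NR == 2) print cur - prev; else print 0 }'
}

# Succeed if the reading exceeds the threshold. $1 = temperature, $2 = threshold
too_hot() {
    awk -v t="$1" -v max="$2" 'BEGIN { exit !(t > max) }'
}
```

Run from cron every ten minutes, a wrapper would call `log_temp`, then send a message only when `too_hot` succeeds, including the output of `last_change` so the alert says how fast things are moving.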

To make this easier to check, I wrote a little script that generates a plot based on the last 24 hours of my temperature log. You can see the results below. The image below is pulled straight from the cluster, so it should be up to date regardless of when you’re viewing this post. If you’re looking at it pretty shortly after this is posted, you can still see the spike that woke me up and the sharper one that got me worried this morning.

temperature plot over the last 24 hours

The temperature is of course not really jumping in steps like that; the sensor just has limited precision. The rightmost edge is “now”, if that’s not clear. I used xmgrace to produce this plot from a command-line script. If you’d like any of the code for this, let me know.
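
As a rough idea of the data preparation behind such a plot, here is a sketch that cuts a log down to the last 24 hours and rescales the time axis so that 0 is “now”. It assumes a log of "epoch-seconds temperature" lines, which is a guess at the format rather than the real one, and the function name is made up.

```shell
#!/bin/bash
# Hypothetical sketch: assumes a log of "<epoch seconds> <temperature>" lines.

# Print readings from the last 24 hours as "hours-relative-to-now temperature",
# so the rightmost point (0) is "now". $1 = log file, $2 = current epoch time
last_24h() {
    awk -v now="$2" 'now - $1 <= 86400 {
        printf "%.2f %s\n", ($1 - now) / 3600, $2
    }' "$1"
}
```

The two-column output is plain x/y data of the kind Grace reads. Assuming a Grace build with PNG support, something like `last_24h temps.log "$(date +%s)" > temps.dat` followed by `gracebat temps.dat -hdevice PNG -printfile temps.png` should produce an image without opening the GUI; the file names here are placeholders.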

UNIX Toolbox

UNIX-like operating systems are immensely powerful. They give one access to the minutest details of the operating system with command-line utilities. The major downside of command-line interfaces is that it is not readily apparent which commands are available and what they do. One can spend hours poring through man pages looking for related program names, instructions and examples to accomplish a simple task.

One way out of this is to use command references. When I first started using Linux, I bought a boxed edition of Red Hat (version 7, I think), which came with a wrist-rest sticker. This long sticker was designed to be stuck on a plastic keyboard wrist rest and contained some common BASH commands such as ls, cd, and mv, with brief examples. This was very helpful to me as a new Linux/UNIX user.

After 8 years of using and administering Linux and Mac OS X (BSD) machines, I have a pretty good handle on the command line. Nonetheless, I was very happy to find this UNIX toolbox document via nixCraft.

Some highlights that I discovered and will be employing from now on include:

  • fuser -m /home – Find out what programs have files open on the /home (or another) partition.
  • sysctl hw – get extensive hardware information on BSD (including OS X) systems.
  • dmidecode – Get BIOS information. I actually learned about this in the past two weeks. Sometimes it’s necessary to create /dev/mem (sudo mknod /dev/mem c 1 1) before this will work. One handy use for this is to get machine serial numbers without having to visit the datacenter.
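
As an example of the serial-number trick, here is a small sketch that picks the serial number out of `dmidecode -t system` output. The function name is hypothetical; the "Serial Number:" label is what dmidecode prints in its System Information section.

```shell
#!/bin/bash
# Hypothetical helper: read `dmidecode -t system` output on standard input
# and print the first serial number found.
serial_from_dmidecode() {
    awk -F': ' '/^[[:space:]]*Serial Number:/ { print $2; exit }'
}
```

Typical use would be `sudo dmidecode -t system | serial_from_dmidecode`. dmidecode can also print the value directly with `dmidecode -s system-serial-number`, which is simpler when only that one field is needed.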

These are just the most interesting examples from the first ten pages. Aside from simple commands, the document also includes instructions for complicated and infrequently-used but occasionally-necessary tasks that I never remember how to do off-hand. One could spend half an hour or so reading man pages and HOW-TOs online to find the right incantation, or just find the precise instructions in this toolbox. Such groan-inducing operations include:

  • Mounting SAMBA partitions.
  • Mounting loopback devices such as CD images.
  • Burning CD images from the command line.
  • Converting between DOS and UNIX text file formats.
  • Basic database administration.

and several others. I plan on printing the booklet version of this PDF at lab tomorrow and keeping it at my desk and in the server room.
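
To give a flavor of one of these, converting between DOS and UNIX text formats comes down to removing or adding the carriage return at the end of each line. The following is a minimal sketch with made-up function names, not necessarily the recipe the toolbox itself gives.

```shell
#!/bin/bash
# Convert DOS (CRLF) text on standard input to UNIX (LF) line endings
# by deleting every carriage return.
dos_to_unix() {
    tr -d '\r'
}

# Convert UNIX (LF) text on standard input to DOS (CRLF) line endings.
unix_to_dos() {
    awk '{ printf "%s\r\n", $0 }'
}
```

For example, `dos_to_unix < notes_dos.txt > notes_unix.txt` (both file names are placeholders) writes a converted copy and leaves the original alone.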

For possibly-NSFW (but text-only) entertainment, check out a BASH of another variety.

Comments from ‘file’ (and ‘tar’)

Occasionally when using Linux (or Mac OS X) I’ll notice a tongue-in-cheek output message from a utility. Today, it was ‘file’, a program that uses magic numbers and other tricks to tell you about the contents of a file:


[brock@stilgar][Darwin]-(~/Workspace/RvPacing/bridge/flma2memfem)-> file bridge_w_surf.flma
bridge_w_surf.flma: ASCII text, with very long lines
[brock@stilgar][Darwin]-(~/Workspace/RvPacing/bridge/flma2memfem)->

Emphasis mine (on “with very long lines”). Thanks for the commentary, file.
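
The “magic numbers” are simply recognizable byte sequences at the start of a file. A much-simplified imitation of the idea, with a hypothetical function name and only three signatures, might look like this; the real file utility knows far more formats and also inspects the text itself, which is where remarks like the one above come from.

```shell
#!/bin/bash
# Hypothetical, much-simplified imitation of what `file` does: read the
# first four bytes of a file and compare them with a few known signatures.
guess_type() {
    local magic
    # First four bytes as lowercase hex with the spacing removed.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$magic" in
        1f8b*)    echo "gzip compressed data" ;;
        89504e47) echo "PNG image data" ;;
        25504446) echo "PDF document" ;;
        *)        echo "unknown" ;;
    esac
}
```

Anything that does not start with one of these signatures, including plain ASCII text, simply falls through to "unknown" in this sketch.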

ADDENDUM: Just now, tar gave me this little message because I forgot to provide some source data:

[brock@stilgar][Darwin]-(~/models)-> tar cfjv bridge_iso.tar.bz2
tar: Cowardly refusing to create an empty archive
Try `tar --help' or `tar --usage' for more information.

That one, I’ve seen before.