Occasionally a node on our cluster will crash. If it was part of a job that spanned multiple nodes, the other nodes sometimes don’t get the message that their fellow has crashed, and they just keep running whatever processes are on them.
I call these “zombie” processes, because they just lumber along eating CPU time and rotting, keeping other jobs from using the node. Today, after noticing a particularly bad zombie infestation, I finally wrote a script that checks for zombified machines and restarts them. A node counts as zombified if its load is above 1 while Torque has no job assigned to it. The script works with Torque and relies on Scyld Beowulf “b-commands” (bpsh) and IPMI, but you could easily swap in similar utilities such as rsh or ssh.
#!/bin/bash
# For all of the nodes in the main cluster...
for NODE in $(seq 0 119); do
    # Get the 1-minute load average and round it to an integer.
    # Splitting on "load average: " avoids relying on a fixed field number,
    # which shifts with the format of the uptime string.
    LOAD=$(bpsh "$NODE" uptime | awk -F'load average: ' '{ print $2 }' | cut -d, -f1)
    LOAD=$(printf %1.0f "${LOAD:-0}")
    # Figure out whether the node should be running anything.
    # Note: this is a substring match, so node 1 also matches 10 or 101 (and any
    # other line containing the digit); it errs toward leaving nodes alone.
    ASSIGNED=$(qstat -f | grep -c "$NODE")
    if [ "$ASSIGNED" -gt 0 ]; then
        ASSIGNED=1
    fi
    # If the node is running something but shouldn't be, reboot it.
    if [ "$LOAD" -gt 1 ] && [ "$ASSIGNED" -eq 0 ]; then
        echo "Node $NODE is a zombie! Kicking." >> /root/logs/zombies.log
        # This relies on IPMI
        ipmitool -H 10.54.2.$(( 100 + NODE )) -U "(some user)" -P "(some password)" power reset
    fi
done
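If you want to try the detection logic without touching a real node, the checks in the script above can be pulled out into small functions and fed canned input. This is only a sketch: the function names are made up for illustration, and the node-matching helper assumes your nodes appear in Torque's exec_host field as n<number> followed by a slash (for example n12/0+n13/0), which may not match your cluster's naming.

```shell
#!/bin/bash
# Hypothetical helpers; none of these names come from the original script.

# Print the 1-minute load average from one line of `uptime` output.
# Matching on the "load average:" label avoids depending on a field number,
# which changes with how long the machine has been up.
load_from_uptime() {
    printf '%s\n' "$1" | sed -n 's/.*load average: *\([0-9.]*\).*/\1/p'
}

# Succeed when a node looks like a zombie: rounded load above 1 and no job
# assigned (the same rule as the script above).
#   $1 = load average (may be empty), $2 = 1 if a job is assigned, else 0
is_zombie() {
    local load
    load=$(printf '%1.0f' "${1:-0}")
    [ "$load" -gt 1 ] && [ "$2" -eq 0 ]
}

# Succeed when node number $1 appears in exec_host string $2.
# ASSUMPTION: nodes are listed as n<number>/<cpu>, joined by "+".
# Anchoring on "+n<number>/" keeps node 1 from matching n12.
node_in_exec_host() {
    case "+$2" in
        *"+n$1/"*) return 0 ;;
        *)         return 1 ;;
    esac
}
```

In the real loop you would pass in the output of bpsh $NODE uptime and the exec_host lines from qstat -f in place of the canned strings. Matching whole node names also avoids the case where node 1 looks busy just because node 12 is.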
You can download the file directly: Zombie Checker
(Here’s another post on Zombies).