I’ve been totally absent from most of my life for the last week because of some problems we had with our code on the cluster. My jobs kept dying for no apparent reason, taking down compute nodes in the process. After a while I narrowed it down to the point where restart files (from a previous simulation) are read. It turns out that the way the files were read (and they were read that way for a good reason) was really brutal on the network: it involved way too much communication. This was okay for smaller models, but I’m currently running the largest model we’ve ever run in the lab.
After conversations with our current programmer and our former one, and about six hours of coding last night, the restart files are now read in a less naughty way, and my jobs are running reliably.
—–
Tomorrow I am leaving for about two and a half weeks in New Orleans and Mandeville! I have an early flight, preceded by an even earlier train ride to the airport. I should be hooked up to the “tubes” and (New Year’s resolution, here I come) updating the blog more often with the blow-by-blow as I try to get enough data for a Heart Rhythm conference abstract in time for the deadline, despite all of the sundry delays with the cluster.
*gasps for breath*
—–
Also, Penguin liked my cluster video so much that they put it on their front page.