Using UNIX (and Linux) for Research in silicio

There is a ranked list of the most powerful computers on the planet: Top500.org. (It looks like they’ve recently done a site redesign for the better.) Statistics on these computers are available here. Of the top 500 most powerful computers on the planet, a measly 1.20% run a non-UNIX operating system (OS). 76.20% run an unspecified Linux (as opposed to, say, a particular version of Red Hat or SuSE Linux). According to the overall family rankings here, a total of 85.20% run some variant of Linux. The remaining operating systems on the list are all various UNIXes, including BSD and Mac OS X.

Of course, Microsoft would like a piece of the High-Performance Computing (HPC) market. They have a very shiny site here, touting the “next generation of high-performance computing”. This is not a trend that will take off. Sure, a few sites will adopt Windows for their HPC machines, most likely to parallelize existing Windows applications in the business space, but UNIX will remain the dominant force in HPC for a long time to come. Scientists are the primary users of high-performance computing, and Windows doesn’t have what it takes to do science in silicio.

The core philosophy of UNIX is that a given utility should do one job, and do that job well. As such, all of the various UNIX-like operating systems are composed of thousands of little binaries (compiled applications) and scripts that are fast, efficient, and reliable. Many of them are designed to be strung together. That is what inspired this post.

Two weeks ago, using our new cluster, I ran an entire study’s worth of simulations in two days, producing approximately 50,000 data files with about a million data points apiece. It would be impossible for me to analyze all of these files manually within the next year. Just one of the three experiments in the study comprises 100 sub-experiments. Even with a script (which I did write) to analyze the data from a single sub-experiment, running it by hand for each one would still mean invoking the script 300 times across the study. I have seen scientists who use Windows do exactly that (or make an undergrad do it), but it wastes time and human brain power, and it’s extremely error prone.

Instead, it’s preferable to automate the process; this is what computers are really good at. That is, doing something extremely dull without getting bored, and without making the mistakes that commonly result from boredom on the part of a human. UNIX is predisposed to automating such tasks, not only because of all of the aforementioned utilities included in the operating system stack (i.e. peripheral programs), but because of its command-line interface. There is no true graphical equivalent to a command-line interface. Apple has attempted something like it with Automator, but it’s a very distant second to a command shell like bash, the C shell, and friends. (I should add that Microsoft is developing an equivalent shell for Windows, code-named “Monad” and now called “PowerShell”, showing that they’ve finally realized they need to match the capabilities of a 40-year-old operating system architecture.)

One can use a shell in two ways: interactively or scripted. In interactive mode, one specifies a command or simple set of commands and presses enter. For example:

[brock@boh][Linux]-(~)-> for i in `seq 0 10`; do echo $i; done
0
1
2
3
4
5
6
7
8
9
10
[brock@boh][Linux]-(~)->

This bit: [brock@boh][Linux]-(~)-> is called the command prompt. It lets you know that the shell is ready and waiting for your input. Command prompts can typically be configured to show useful information. Mine shows my user name (brock), the machine I’m on (boh, our cluster), the operating system (Linux), and the current directory in parentheses (~, meaning my home directory). The prompt is set by assigning some formatting information to a special variable. In bash, this variable is PS1, probably meaning “Prompt String 1”. Mine can be shown as follows:

[brock@boh][Linux]-(~)-> echo $PS1
[\u@\[\e[32m\]\h\[\e[0m\]][Linux]-(\w)->
[brock@boh][Linux]-(~)->

Those characters are a combination of plain characters to print (like [, ], (, ), @, - and >) and special sequences (starting with a backslash, as in \u for the user name; the \e sequences here turn the green color for the host name on and off). Anyway, what I did in the first example above was run through a simple loop and print the numbers 0 through 10. I did this using a combination of some shell keywords and some of those little UNIX utilities.
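As an aside, a simplified prompt along these lines, minus the color escapes, could be set like this (a sketch, not my exact prompt):

```shell
# \u = user name, \h = host name, \w = current working directory
PS1='[\u@\h][Linux]-(\w)-> '
```

bash expands the backslash sequences every time it draws the prompt; the color version wraps \h in \e[32m and \e[0m escape sequences to turn the host name green.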

BASH keywords: for, do, done; plus backticks (`)
Utilities: seq, echo

The BASH keywords (for, do, done) define the loop structure, while backticks (`) perform command substitution: the enclosed command is executed and its output is substituted in its place. In this case, the seq utility is being executed. It produces a sequence of numbers, from start to finish, with a step of 1 by default. The step can be changed by supplying a middle argument between the start and end values. The loop tells BASH to take each number generated by seq and print it to the screen with echo.
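For instance, here is seq with a step of 2, written with the $(...) command-substitution syntax that bash also supports (equivalent to backticks, but easier to nest):

```shell
#!/bin/bash
# The optional middle argument to seq is the step size
evens=$(seq 0 2 10)
echo "$evens"    # prints 0 2 4 6 8 10, one number per line

# $(...) does the same job as backticks
for i in $(seq 0 2 10); do
    echo "got $i"
done
```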

Using a pre-written script with the shell allows one to specify several commands in a row, and to execute the same set of actions reliably and without manual intervention. Here’s the script I developed last night, and that I’m planning to run 100 times (automatically) later today. I’ve included some comments (lines starting with #) to explain what the script does:


#!/bin/bash
# This first line tells Linux how to run the script.
# It can also be run like "bash (script name)", but
# that is more cumbersome than simply
# calling the script directly
#
# Here I set some useful variables
START=0
MAX=`maxfile`
# note that MAX is set using the output of
# another script that I wrote called "maxfile".
# It finds the highest-numbered data file
# from our simulator in the current directory.
#
# Next I extract some numbers and run calculations
# Most of the $1, $2, $3 etc in this script refer to the
# 1st, 2nd, 3rd and so on "arguments" that I give the script.
# They are automatically assigned to these numbered
# variables by BASH. However, the $3s below are actually
# commands for the "awk" utility, telling it
# to print the entry in the third column of the file specified
# by the BASH variable $2. I give the script an input file
# for $2 by specifying it when I run the script. "Awk" is
# designed to process text files with a bunch of columns
# of data in them. It's very fast.
S1=`awk '{ if (NR == 2) print $3 }' "$2"`
S2=`awk '{ if (NR == 3) print $3 }' "$2"`
# The previous two lines extract stimulus times
# from a stimulus file for our simulator.
# This tells me what the rate of stimulation
# (or "basic cycle length") is. I need it because I
# want to split up the data according to the
# basic cycle length
#
# The next line takes the stimulus times from above
# and calculates the difference between the two
# times, scales it appropriately with the basic calculator
# program "bc", and strips off the trailing zeros
# with the stream editor, "sed".
INCR=`echo "scale=0; ( $S2 - $S1 ) * 1000" | bc | sed "s/\.000//"`
#
# Here I save the calculated increment to a file for later
echo $INCR > found.period
#
# Calculating period boundaries for later
END=$(( $START + $INCR ))
#
# Count the stimuli for naming the output files
STIM=0
#
# This is a "while loop". What it says is:
# "While the end time of the current period
# is less than the latest time for which we have
# data..."
while [ $END -le $MAX ]
#
# and here's what we do as long as the "while"
# conditions from above are met
do
# Use my "ctrace" program to extract an action
# potential trace for the specified time period
# from the data
${HOME}/local/bin/ctrace -n $1 -s $START -e $END -l $3 > S${STIM}_N$3.trace
# Use another one of my scripts to calculate
# the time between activations
# in the extracted action potential trace
periodcalc S${STIM}_N$3.trace >> $4
# Increment START to begin where this period ended
START=$END
# Increment END to the end of the next period
END=$(( $START + $INCR ))
# Update the stimulus count
STIM=$(( $STIM + 1 ))
#
# The loop is done
done

This looks a lot simpler without all of the comments. I recommend you try to follow the script below, but look at the commented version above for guidance on what each line does:


#!/bin/bash
#
START=0
MAX=`maxfile`
S1=`awk '{ if (NR == 2) print $3 }' "$2"`
S2=`awk '{ if (NR == 3) print $3 }' "$2"`
INCR=`echo "scale=0; ( $S2 - $S1 ) * 1000" | bc | sed "s/\.000//"`
echo $INCR > found.period
END=$(( $START + $INCR ))
STIM=0
#
while [ $END -le $MAX ]
do
${HOME}/local/bin/ctrace -n $1 -s $START -e $END -l $3 > S${STIM}_N$3.trace
periodcalc S${STIM}_N$3.trace >> $4
START=$END
END=$(( $START + $INCR ))
STIM=$(( $STIM + 1 ))
done

The extra #s are just to keep the HTML formatting happy.

This script processes about 2 seconds’ worth of simulation time, normally output as 2000 data files. It splits the data into somewhere between 10 and 20 periods, depending on the basic cycle length. For each stimulus strength, I have 10 directories corresponding to 10 intrinsic action potential durations. In each of those 10 directories, I have 10 subdirectories corresponding to 10 different stimulation rates, expressed as a multiplier of a reentry’s intrinsic frequency. That makes 100 combinations, so I have to run the above script in 100 different ways. I could go into each directory manually and run the script with the appropriate parameters (major pain in the ass), or I could script the running of this script. I’ll opt for the latter.

This is near real-time science. I’m actually going to switch from composing this blog entry and write the script I need, then come back and write about it. (Be right back).

OK, here’s the script:

#!/bin/bash
#
# The loop items come in two parts: one value
# starting with 0 (095), and the rest starting
# with 1 (105 through 185 in steps of 10).
# There's probably a cleverer way to do this, but
# for such a simple script this is sufficient
for i in 095 `seq 105 10 185`
do
cd ${i}msAPD80
for j in `seq 0 9`
do
cd P0${j}/OUTPUT
pwd
echo ${HOME}/local/bin/period_traces.sh nrvm ../nrvm.boundary 75231 ${i}msAPD80_P0${j}.trace
cd ../../
done
cd ../
done

This script contains two ‘for’ loops, one inside the other. Deep down in the middle, I call the script that I explained above, “period_traces.sh”. However, as astute readers may have noticed, this script doesn’t actually run that command yet: I put “echo” in front of it, so that it just prints the command instead. I also execute the “pwd” command to show the current directory. I do this as a check before actually running a functional script: it shows me what directory the script is in, and what command will be executed there. Once I’ve checked the output, I remove the “pwd” and the “echo”, and run the real thing.
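A variation on that check, so the script doesn’t need editing between the dry run and the real run, is to route every command through a small wrapper controlled by an environment variable. This is a sketch of the general pattern, not what my script actually does:

```shell
#!/bin/bash
# run: execute its arguments normally, unless DRYRUN is set,
# in which case just print what would have been run
run() {
    if [ -n "$DRYRUN" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

DRYRUN=yes
run rm -rf results_backup    # prints "would run: rm -rf results_backup"
```

Running the script with DRYRUN set in the environment previews every command; unsetting it executes them for real.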

Here’s one of the graphs I produced by running the results through a great plotting utility called xmgrace:

[Figure: Sample Traces (corrected). The vertical axis label was incorrect when the post was first published and has since been fixed.]

Of course, this is neither complete nor polished, but it’s a good place to start when trying to evaluate the results within such a large data set. I will have 10 plots like this (with up to 10 traces apiece) for each of the three high-level experiments.
