Category Archives: Tools of the Trade


Five years in the lab: looking back, then forward

About this time five years ago, I was a nervous junior undergraduate studying Biomedical Engineering at Tulane University. I had just been accepted as an undergraduate member of Dr. Natalia Trayanova’s computational cardiac electrophysiology lab. The goal at that time was to complete a research project for my undergraduate thesis.


So very many things have happened since then. Here are the highlights:

  • 2002: Started learning the ropes of the lab
  • 2003: Continued to familiarize myself with the computers and code in use in the lab. The most powerful machine in our possession was an SGI with 8 processors (the Origin 300 listed here). There was almost always a wait to use those processors. Spent my summer vacation working in the lab. This was the first time I was paid to do research. Sometime during this year (I think) I created the lab wiki using MoinMoin. By this time I was administering the lab computers and was sick of answering the same questions over and over. In desperation I created a wiki and started putting answers on it, referring people to the wiki when I was asked a question. The wiki is now (as of November 2007) huge, and contains basically all of the documentation of everything used in the lab, as well as gigabytes upon gigabytes of attached models, data, and images.
  • 2004: Graduated from Tulane with my Bachelor of Science in Engineering (BSE) degree. Joined the lab as a graduate student. Sometime in 2004 (I think), Tulane acquired a Linux Networx cluster, and we owned 20 nodes in that cluster.
  • 2005: Shortly after returning from my trip to Niger, Katrina struck New Orleans. The lab was scattered. Few people in the lab had access to their data. A few lab members actually snuck past armed guards to get our file servers and some workstations from our lab at Tulane. We took up residence in St. Louis, MO for two and a half months, aided by our colleagues in the labs of Drs. Yoram Rudy and Igor Efimov at Washington University. By the end of the year, we had returned to a slowly-recovering New Orleans.
  • 2006: Dr. Trayanova accepted a position as a professor at Johns Hopkins University. Almost the entire lab transferred to JHU and moved to Baltimore, MD.
  • 2007: In April, I began discussing a cluster purchase with High Performance Computing (HPC) companies. Around that time, the weather warmed up, the server room could no longer be adequately cooled, and we started limping by on 4 compute nodes. By the end of July, we had placed an order for a new cluster. We moved from Clark Hall into the newly-completed though poorly-named Computational Science and Engineering Building. In mid-November, most of our new cluster arrived, though FedEx dropped and destroyed one rack, and the cluster was not completely set up.

That brings us to the present day. Now, looking forward a little:

In the next two weeks, the cluster set-up will be completed. We will have free rein on 140 compute nodes (20 old, 120 new), all managed from one head node. The new nodes will be connected by the fastest Infiniband interconnects available on the market, and each node will have 8 GB of RAM available, with the potential to hold 64 GB each. There are four 3.0 GHz Opteron cores per node, yielding a total of 480 processors and 960 GB of RAM on the new nodes alone.

To give you some perspective on what that means, let me give you some details about the kinds of models we run. When I joined the lab, our two largest models consisted of a 4 mm thick slice of the canine heart, and a very smooth, idealized model of the rabbit heart. These models are composed of 1.6 million and 0.82 million tetrahedral elements, respectively. It took something like an hour of wall-clock time per millisecond of simulation time to run these models. (In other words, to get one millisecond's worth of simulation data it was necessary to wait about an hour.) We could run one or two simulations at a time, at that speed.

My newest model, and currently the largest model in use in the lab, is composed of 28 million tetrahedral elements. On a cluster similar to our new one (Lonestar on TeraGrid), using 32 processors, it takes about 22 minutes of wall-clock time to simulate one millisecond in the model. Using a crude estimated unit of speed of (minutes real time / millisecond simulation time / million tetrahedral elements), and focusing only on the number of simulations we can run at once, not the number of CPUs required:

  • Old way: 60 minutes / 1 ms / 0.82 million tets = 73 minutes / ms sim time / million tets
  • New way: 22 minutes / 1 ms / 28 million tets = 0.79 minutes / ms sim time / million tets
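
For anyone who wants to check the arithmetic, here is a minimal Python sketch of that crude metric. The function name is my own invention for illustration; the inputs are just the figures quoted above.

```python
def minutes_per_ms_per_million_tets(wall_minutes, sim_ms, million_tets):
    """Crude cost metric: wall-clock minutes per simulated ms per million elements."""
    return wall_minutes / sim_ms / million_tets


# Old way: the 0.82-million-element rabbit model at about an hour per ms.
old = minutes_per_ms_per_million_tets(60, 1, 0.82)
# New way: the 28-million-element model at about 22 minutes per ms.
new = minutes_per_ms_per_million_tets(22, 1, 28)

# Ratio of the two costs, i.e. how many times cheaper each element has become.
speedup = old / new

print(f"old: {old:.1f}  new: {new:.2f}  speedup: {speedup:.0f}x")
```

Keep in mind that this ignores the number of CPUs involved, just as the estimate above does, so it is a rough yardstick rather than a benchmark.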

We have increased our simulation speed by almost 100-fold. We can run two to four simulations of that size at a time, vs. one or two the old way. But that’s not all. We can now run bigger models. Much bigger models. We are now capable of running something the size of a dog heart (we have verified this). More importantly, we now have the technical capacity to run a model the size of the human heart, with a resolution near that of the size of a cardiac cell, and to model contraction in addition to electrical activity. It remains only to develop such models. We are prepared to store the results: the new cluster has a storage capacity of 28 TB online, with the ability to add something like 40 or 50 TB more simply by expanding the existing storage device.

In my time in the lab, I have watched our abilities expand from serial jobs with relatively small models to massively parallel jobs with the capacity to model electrical and mechanical activity in the human heart. We are just beginning a very exciting time in the lab and in the field, and what’s really killing me is the fact that there’s so much more to tell you.

But I can’t just yet.

(This post was partly inspired by a conversation with Maria and Amanda)

Cluster Update

So, the cluster is here.

The good news: I think we’ll be able to use it (albeit without Infiniband for now) by the end of the day.

The bad news: FedEx clearly dropped one entire rack. Yes. Unbelievable. About $110,000 worth of equipment. When it got here, there were plastic parts coming out from under the door of the crate. From the looks of it, it faceplanted off of a loading dock or a forklift. The tilt sensor was definitely tripped. Some of it may still be good, but it all has to be thoroughly inspected. Penguin is just building us a whole new rack. We lucked out in a way — it was the only rack with no ‘special’ components. All of the other racks have something special. One has the head node (not a big deal). One has the Infiniband switch (very expensive with long lead time to replace), and one has a fileserver with a Xyratex storage unit (also very expensive with long lead time). This rack just had compute nodes and a regular gigabit switch.

I’ll have more news and pictures once we get further along.

New cluster to arrive tomorrow

If you’re subscribed to my RSS feed (hi, mom!) you might already have seen these pictures. Nonetheless, I’ll add them here with a little explanation. First, the front of the cluster:

New Cluster

I suggest that you click on it and go to the Flickr page, as I’ve added some notes to the various parts of the picture, pointing out the major pieces. Here’s the back:

Cluster Cabling

As the title of the post states, it’s supposed to arrive here in Baltimore, from San Francisco, sometime tomorrow. I’m guessing in the afternoon. A couple of guys from Penguin will be here at 13:00. That’s the good news. Now to the bad news — the news that kept me from falling asleep for hours on Friday night.

It won’t be ready to go when it arrives.

No, contrary to all of the lovely visions I was given about the cluster rolling out of the crates, and being ready to go, it will not be ready to go when it gets here. This is in large part due to all of the delays on the part of AMD, and their apparent inability to ship anything when they say they will. Now, I’m not sure just how un-ready it will be. If I can get all of the wires hooked up, and the Infiniband MPI libraries are ready to go, I should be able to have simulations running on it by Tuesday. Penguin’s software people are not excited about this, and want me to wait. We’ll see how things look tomorrow.

November and December are our deadline season, with Heart Rhythm abstracts due on January 4th. Now more than ever, we have a huge backlog of simulations to run on large models, requiring the high-performance power that this cluster packs. However, the complete set-up is not due to start until November 28th. That is no good. To make matters worse, our favorite TeraGrid cluster in Texas, Lonestar, is on the lam (no pun intended) until further notice.

I’m not superstitious, but if I were, I’d be crossing my fingers right now.

Amazon EC2’s New Images

Amazon.com is really the leader right now in so-called “cloud computing”, where there’s some anonymous cluster of servers that you somehow use for various reliable services.

They started with “S3”, meaning “Simple Storage Service”. S3 is a back-end web service that allows you to place, retrieve, and share files via some sort of web back-end interface. As I’m not really a web developer, that interface is beyond me for the moment. However, various programs have cropped up that act as a front-end to S3, my favorite being JungleDisk. Once JungleDisk is fired up, you have what appears to be a network disk with unlimited storage. It is, in fact, effectively unlimited, though it of course comes at a cost. There’s a bandwidth cost whenever you move things to and from, and an ongoing storage cost. The ongoing storage cost is something like a charge based on your “average daily balance” of storage used, and is currently charged at a mere $0.15/GB/month. So, if you want to store 100 GB, that’s $15/mo. Not bad. Remember, too, that you’re never paying for storage that you don’t need, unlike some other services.
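
To make the “average daily balance” idea concrete, here is a rough Python sketch. It assumes the month's charge is simply the average number of gigabytes stored per day times the $0.15 rate quoted above; that is my reading of the billing, not Amazon's actual formula, and it leaves out bandwidth charges entirely. The function name is made up for illustration.

```python
RATE_PER_GB_MONTH = 0.15  # USD per GB per month, the storage rate quoted above


def monthly_storage_cost(daily_gb, rate=RATE_PER_GB_MONTH):
    """Charge the month's average stored volume at a flat per-GB rate.

    Assumption: billing follows an "average daily balance" of storage used.
    Bandwidth costs are not included.
    """
    average_gb = sum(daily_gb) / len(daily_gb)
    return average_gb * rate


# 100 GB kept for a whole 30-day month.
steady = monthly_storage_cost([100] * 30)
# Nothing for the first 15 days, then 100 GB for the remaining 15.
half_month = monthly_storage_cost([0] * 15 + [100] * 15)

print(f"full month: ${steady:.2f}  half month: ${half_month:.2f}")
```

Under that assumption, data that only sits there for half the month costs about half as much, which is the sense in which you never pay for storage you don't need.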

After they got S3 stabilized, Amazon.com came out with another beta “cloud” service, the “EC2”, for “Elastic Compute Cloud”. They let you run virtual machine images, basically, charged at an hourly rate. This was kind of interesting, and had some usefulness for a lot of people, but the machines were somewhat small and weak.

No more.

Within the last few days, they launched larger machine images with an x86_64 architecture, up to 4 virtual cores, and nearly 16 GB of available RAM. Now we’re talking something interesting for people like me, and our lab. As a test, I created my own custom image, containing our simulation software, and fired it up. For $0.80/hour, I can get the equivalent of 2 of our 2-core Opteron cluster nodes. It’s a little slower than that, but only just. It also probably doesn’t scale as well as our current system, and I know it won’t scale as well as the system that will be arriving on Monday, but something tells me that this will change. I’m not the only one out there that wants to run cluster applications on EC2, as a quick Google search will reveal.

Furthermore, this is a relatively cheap and ubiquitous platform that would allow someone to run high-performance applications without the overhead of purchasing a complete cluster. It would be good for, say, starting a business that required a large amount of computing power without having to purchase, store, feed, cool, and house all of that hardware up front. Once things got rolling it would be possible to use revenue to purchase and maintain such dedicated hardware.

The one major downside of EC2, as I see it, is that it doesn’t save any of the data on the machine once the machine is shut down. One has to ship the data off to S3 (no charge) or another machine (bandwidth charges apply) before shutting it down. Nonetheless, as the tools for interacting with S3 improve, I expect that this limitation will disappear as well.

I should note that I’m not affiliated with either Amazon.com or JungleDisk in any way, except as a happy user.

Finding related articles graphically

When doing a literature search, it’s a good idea to start from a few articles and then (if they are along the lines of what you are looking for) use their references and articles that reference them to expand the search.

One handy way of doing that is with the HubMed Graph Browser. You get to it by finding an article (like mine here) and then selecting the “Graph” link next to “Related” in the line of options at the bottom.

Once you load the TouchGraph, you can see the related articles, change the depth of relationships graphed, zoom in and out, and so on. It can be a nice alternative to the normal related articles list, graphically showing distance and relation.
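
The underlying idea of starting from a few articles and following references outward, with a depth limit, is essentially a breadth-first walk over a citation graph. Here is a minimal Python sketch of that idea. It is not how HubMed or TouchGraph work internally (I don't know that), and the article IDs and the `links` dictionary are hypothetical placeholders.

```python
from collections import deque


def expand_search(seeds, links, max_depth=2):
    """Breadth-first walk outward from a few seed articles.

    `links` maps an article ID to the IDs it cites or is cited by.
    Returns a dict of {article_id: depth at which it was first reached}.
    """
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        article = queue.popleft()
        if depth[article] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in links.get(article, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[article] + 1
                queue.append(neighbor)
    return depth


# Hypothetical toy graph; the IDs are placeholders, not real articles.
toy_links = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["F"],
}

found = expand_search(["A"], toy_links, max_depth=2)
print(found)
```

Raising `max_depth` plays the same role as changing the depth of relationships in the graph browser: you see more distant relatives of the starting article, at the cost of a more crowded picture.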