Category Archives: Linux

K9

So, you haven’t heard much from me in a while. The little one has been using up great amounts of my time and attention. (And she’s totally worth it).

However, I’m finally getting back into the swing of things with work, around the house, etc. I’ve started roasting my own coffee, which is great, and Amanda got me an espresso machine, due to arrive within hours, in which to use my freshly roasted beans.

The thing that has been using up the greatest amount of my “free” time (whatever that means these days) has been hacking on an improved email client for the new “Google Phone”, the G1. The included email client was utter crap, to put it kindly, not even as sophisticated as the client on my little Razr2 v8 flip phone. However, someone forked the open-source email client that came with the device into a project called K-9. The name is derived from that of an old UNIX-y email client called mutt, the idea being that K-9 (canine) is a sort of androidy name for a dog, I guess.

Anyway, the client has been rapidly improving. I haven’t done much (just a few bug fixes here and there, and possibly the addition of some bugs, hope not), but several people are working on it. It’s already getting rave reviews in the Android market, particularly since people have only the crappy built-in client as an alternative. Heh. I don’t know Java, really, but that hasn’t been stopping me. It’s not that different from Python and C++. Anyway, if you have a G1, check it out. We’re improving performance, fixing bugs, and adding features all of the time.

Calculating Large Numbers in C

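As a corollary to my last post, it’s important to be careful when calculating file seek positions (if you’re skipping around that way). It turns out it’s necessary to cast every number used in the calculation to a large integer type, such as unsigned long int, before doing the arithmetic. Otherwise the intermediate result is computed in a narrower type (typically int) and can overflow before it is ever stored in the larger variable.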
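As a minimal sketch of that cast (the helper name `record_offset` and the fixed-size-record layout are assumptions made up for illustration, not code from my simulations), the operands are widened before the multiplication rather than after it:

```c
#include <stdint.h>

/* Hypothetical helper: byte offset of record number `index` in a file of
 * fixed-size records. Both inputs are plain ints, as loop counters often are. */
int64_t record_offset(int index, int record_size)
{
    /* Cast each operand first, so the multiplication itself is done in
     * 64 bits. Writing (int64_t)(index * record_size) would be too late:
     * the product would already have overflowed as an int. */
    return (int64_t)index * (int64_t)record_size;
}
```

With `index = 3000000` and `record_size = 1000`, the offset is 3000000000 bytes, which does not fit in a 32-bit signed int. The sketch uses `int64_t` rather than unsigned long int only because its width is the same on every platform.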

By the way, Rob had some helpful comments on that last post. (Thanks Rob!)

C++ ifstream and the 2 GB Limit

Any system that encodes values in some set number of places has a limit on the values that can be held. For example, old, mechanical cash registers were physically limited in the number of digits they could ring up. Likewise, modern LCD cash registers are limited by the number of digits available on the screen, though they may be able to hold longer numbers than they can show. The “Y2k” problem was also a result of such limitations.

Binary encoding of course has similar limits. Two bits can hold four values. Three bits can hold eight values. The relationship there is exponential: x bits are limited to 2^x values.

One thing that programs might want to keep track of internally is the current location in a file that one is reading or writing. Like a bookmark, there are one or more variables that can be used to store locations in a file. Typically these are simply integers, indicating some sort of offset from the beginning or end of the file. These bookmarks impose a limitation. No file positions past 2^x offsets may be recorded, or often even reached.

It turns out that in C++, in Linux, x = 31. This provides 2^31 = 2147483648 positions. Given that a position is normally a byte (B), we are then limited to 2147483648 B, or 2097152 kB, or 2048 MB, or 2 GB. With our newest, largest models, this is a problem. My latest simulations are producing data files that are around 21 GB when uncompressed.
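The arithmetic behind that 2 GB figure can be checked with a few lines of C. This is only a sketch; the function names are invented for illustration.

```c
#include <stdint.h>

/* Number of distinct values (here: byte positions) representable in `bits`
 * bits. Only valid for bits < 64, since the result must fit in 64 bits. */
uint64_t positions_for_bits(unsigned bits)
{
    return (uint64_t)1 << bits;
}

/* Convert a byte count to binary kilobytes, megabytes, and gigabytes. */
uint64_t bytes_to_kb(uint64_t bytes) { return bytes / 1024; }
uint64_t bytes_to_mb(uint64_t bytes) { return bytes / (1024 * 1024); }
uint64_t bytes_to_gb(uint64_t bytes) { return bytes / (1024ULL * 1024 * 1024); }
```

Calling `positions_for_bits(31)` gives 2147483648, and passing that through the three converters gives 2097152, 2048, and 2 respectively.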

It turns out that in 64-bit Linux, libc does not have this problem. It is therefore possible to use normal C file I/O commands to process large files (on most modern systems, which incorporate Large File Support or LFS). After searching high and low on the interwebs for a way to convince C++ to use larger variables for file I/O, and not finding much, I caved and spent all of five minutes changing my code over to use C I/O.
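For anyone making the same switch, here is a rough sketch of what large-file-safe C I/O can look like. It is not the code from my simulations: the helper `read_at` is hypothetical, and it assumes a POSIX system where `fseeko()` and `off_t` are available. On 64-bit Linux `off_t` is already 64 bits wide; the `_FILE_OFFSET_BITS` define only matters for 32-bit builds.

```c
/* Feature macros must come before any #include. */
#define _POSIX_C_SOURCE 200809L  /* expose fseeko() under strict -std= modes */
#define _FILE_OFFSET_BITS 64     /* ask glibc for a 64-bit off_t on 32-bit builds */
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: read `len` bytes starting at byte `offset` of an
 * already-open file. Returns 0 on success, or -1 if the seek fails or
 * fewer than `len` bytes could be read. */
int read_at(FILE *fp, off_t offset, void *buf, size_t len)
{
    /* fseeko() is the POSIX variant of fseek() that takes an off_t
     * instead of a long, so offsets past 2 GB can be expressed. */
    if (fseeko(fp, offset, SEEK_SET) != 0)
        return -1;
    return fread(buf, 1, len, fp) == len ? 0 : -1;
}
```

The offset passed in should itself be computed in a wide type, for the reasons given in the post on calculating large numbers.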

What made this bug a little more difficult to track down than it might have been is that for some reason, Mac OS X Leopard does not suffer from this problem with C++ stream I/O. Someone at Apple must have allocated a few more bits for file position pointers. My code was therefore working fine on my sorta-64-bit Mac but not on our fully-64-bit cluster or our workstations running recent versions of Linux.

Ultimately it would be nice to write some kind of wrapper or overload the C++ I/O functions to do things correctly, but for the moment my code is working properly and I am happy.

Do you have any better ideas for getting around this 2GB limit?

Advanced Bash Scripting

I have written before about the usefulness of command-line scripting in computational science.

Today, while looking for some information on various file test operators in bash (e.g. to check whether a file or directory exists), I found this amazing guide. As the author puts it,

This tutorial assumes no previous knowledge of scripting or programming, but progresses rapidly toward an intermediate/advanced level of instruction . . . all the while sneaking in little snippets of UNIX® wisdom and lore. It serves as a textbook, a manual for self-study, and a reference and source of knowledge on shell scripting techniques.

For instructional purposes, the examples are sprinkled with little comments like, “explain why this is the case…”, to test your knowledge as you go through the manual. This would make it excellent for use as a textbook on basic programming ideas. It is even available in PDF format, and was updated March 18th of 2008.

I can assure you that every new member of the lab will be getting a link to this guide from me. Proper knowledge of shell scripting is an amplifier of one’s productivity. An investment of a few hours learning the basics will probably return a hundred-fold savings of time over a few months. More advanced concepts are naturally learned as more difficult scenarios are encountered. I’ll be writing soon about some of the more sophisticated issues I’ve encountered using shell scripting.