Category Archives: Tools of the Trade

Your Emails — They are not secure

In other news from the house-hunting front, we’ve been working with lenders to finance the purchase of a house. Lenders want a lot of information. They want bank statements, driver’s license copies, landlord information, tax returns, income statements, current address, credit card statements, letters of employment and so on. Of course, they also want that ubiquitous, unchangeable, universal secret password, the social security number.

You would think, given the nature of this collection of information, and the rising prevalence and cost of identity theft, that these people would be careful with this information. If you’re cynical or just a realist, maybe you wouldn’t think that. Anyway, you’d be wrong. One of the first lenders we dealt with EMAILED A COMPLETE, FILLED-OUT COPY of the application form to us for signatures. No encryption whatsoever. It was like an identity theft starter kit. After we confronted them about it, they said they had no idea this was insecure, and offered to fax or FedEx the documents instead.

If you don’t already know this, you really need to know: Email, without any special add-ons, is the opposite of secret. It is the digital equivalent of a postcard — anyone along the way can read it, and you have no idea who will be along the way. Would you tape your social security card to the back of a postcard and send it across the country? Furthermore, there’s no guarantee that an email’s “From:” address is accurate, as you may have deduced from spam email that you’ve received. All it takes to forge it is changing a string of text when putting the message together.

There are ways to use email to send secure, confidential communications. Probably the most universal and robust way is with PGP or (preferably) GPG. The main reason these solutions aren’t used more widely is that encrypted communication is difficult to do correctly. Keys have to be generated, passphrases selected, and keys exchanged, signed, managed, and sometimes even revoked. A number of pieces have to fit together, including the encryption engine, mail program plug-ins, and file encryption software. The difficulty of using proper encryption is not, however, an excuse for sending my SSN in plain text via email. When used with good enough ciphers, email can be safe even from the prying eyes of the US Government, who would have to spend hundreds or thousands of years of computer time attempting to crack your key. Furthermore, with or without encrypting the message, cryptographic signatures may be used to verify that the purported sender of the message is in fact the true sender. This addresses the problem of From address forgery, provided you have verified that the signing key really belongs to the sender.

Should you wish to send encrypted e-mail my way, you may find my public key here.

Calculating Large Numbers in C

As a corollary to my last post, it’s important to be careful when calculating file seek positions (if you’re skipping around that way). It turns out it’s necessary to cast the numbers used in the calculation to a large integer type, such as unsigned long int, before doing the arithmetic. Otherwise the intermediate result is computed in the smaller type and can overflow before it is ever stored in the larger one.

By the way, Rob had some helpful comments on that last post. (Thanks Rob!)

C++ ifstream and the 2 GB Limit

Any system that encodes values in some set number of places has a limit on the values that can be held. For example, old, mechanical cash registers were physically limited in the number of digits they could ring up. Likewise, modern LCD cash registers are limited by the number of digits available on the screen, though they may be able to hold longer numbers than they can show. The “Y2k” problem was also a result of such limitations.

Binary encoding of course has similar limits. Two bits can hold four values. Three bits can hold eight values. The relationship there is exponential: x bits are limited to 2^x values.

One thing that programs might want to keep track of internally is the current location in a file that one is reading or writing. Like a bookmark, one or more variables can be used to store locations in a file. Typically these are simply integers, indicating some sort of offset from the beginning or end of the file. These bookmarks impose a limitation: if they are x bits wide, file positions past 2^x offsets cannot be recorded, and often cannot even be reached.

It turns out that in C++, in Linux, x = 31. This provides 2^31 = 2147483648 positions. Given that a position is normally a byte (B), we are then limited to 2147483648 B, or 2097152 kB, or 2048 MB, or 2 GB. With our newest, largest models, this is a problem. My latest simulations are producing data files that are around 21 GB when uncompressed.

It turns out that in 64-bit Linux, libc does not have this problem. It is therefore possible to use normal C file I/O commands to process large files (on most modern systems, which incorporate Large File Support or LFS). After searching high and low on the interwebs for a way to convince C++ to use larger variables for file I/O, and not finding much, I caved and spent all of five minutes changing my code over to use C I/O.

What made this bug a little more difficult to track down than it might have been is that for some reason, Mac OS X Leopard does not suffer from this problem with C++ stream I/O. Someone at Apple must have allocated a few more bits for file position pointers. My code was therefore working fine on my sorta-64-bit Mac but not on our fully-64-bit cluster or our workstations running recent versions of Linux.

Ultimately it would be nice to write some kind of wrapper or overload the C++ I/O functions to do things correctly, but for the moment my code is working properly and I am happy.

Do you have any better ideas for getting around this 2GB limit?