Any system that encodes values in some set number of places has a limit on the values that can be held. For example, old, mechanical cash registers were physically limited in the number of digits they could ring up. Likewise, modern LCD cash registers are limited by the number of digits available on the screen, though they may be able to hold longer numbers than they can show. The “Y2K” problem was also a result of such limitations.
Binary encoding of course has similar limits. Two bits can hold four values. Three bits can hold eight values. The relationship there is exponential — x bits are limited to 2^x values.
One thing that programs might want to keep track of internally is the current location in a file being read or written. Like bookmarks, one or more variables can be used to store locations in a file. Typically these are simply integers, indicating some sort of offset from the beginning or end of the file. These bookmarks impose a limitation: if the integer has x bits, no file position past 2^x offsets may be recorded, or often even reached.
It turns out that in C++, in Linux, x = 31. This provides 2^31 = 2147483648 positions. Given that a position is normally a byte (B), we are then limited to 2147483648 B, or 2097152 kB, or 2048 MB, or 2 GB. With our newest, largest models, this is a problem. My latest simulations are producing data files that are around 21 GB when uncompressed.
It turns out that in 64-bit Linux, libc does not have this problem. It is therefore possible to use normal C file I/O commands to process large files (on most modern systems, which incorporate Large File Support or LFS). After searching high and low on the interwebs for a way to convince C++ to use larger variables for file I/O, and not finding much, I caved and spent all of five minutes changing my code over to use C I/O.
What made this bug a little more difficult to track down than it might have been is that for some reason, Mac OS X Leopard does not suffer from this problem with C++ stream I/O. Someone at Apple must have allocated a few more bits for file position pointers. My code was therefore working fine on my sorta-64-bit Mac but not on our fully-64-bit cluster or our workstations running recent versions of Linux.
Ultimately it would be nice to write some kind of wrapper or overload the C++ I/O functions to do things correctly, but for the moment my code is working properly and I am happy.
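Such a wrapper could be as simple as an RAII class around the C calls. A rough sketch (the class name and interface are my own invention, not any standard API):

```cpp
// Request 64-bit off_t before any headers are pulled in.
#define _FILE_OFFSET_BITS 64
#include <cstdio>
#include <stdexcept>
#include <string>

// Minimal large-file wrapper: C stdio underneath, a C++-ish surface on top.
class LargeFile {
public:
    LargeFile(const std::string& path, const char* mode) {
        f_ = std::fopen(path.c_str(), mode);
        if (!f_) throw std::runtime_error("cannot open " + path);
    }
    ~LargeFile() { if (f_) std::fclose(f_); }

    // FILE* handles cannot be sensibly copied.
    LargeFile(const LargeFile&) = delete;
    LargeFile& operator=(const LargeFile&) = delete;

    // 64-bit seek/tell via the LFS calls fseeko/ftello.
    void seek(long long offset, int whence = SEEK_SET) {
        if (fseeko(f_, static_cast<off_t>(offset), whence) != 0)
            throw std::runtime_error("seek failed");
    }
    long long tell() const { return static_cast<long long>(ftello(f_)); }

    std::size_t read(void* buf, std::size_t n) {
        return std::fread(buf, 1, n, f_);
    }
    std::size_t write(const void* buf, std::size_t n) {
        return std::fwrite(buf, 1, n, f_);
    }

private:
    std::FILE* f_ = nullptr;
};
```

A fuller solution, along the lines of the Josuttis stream-buffer examples, would derive from `std::streambuf` and override `seekoff`/`seekpos`, so existing `<<` and `>>` code keeps working unchanged; the class above just covers the read/write/seek/tell that a simulation's binary I/O actually needs.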
Do you have any better ideas for getting around this 2GB limit?
Josuttis, never leave home without it.
For output buffers:
http://www.josuttis.com/libbook/toc.html
http://www.josuttis.com/libbook/io/outbuf2.hpp.html
For input buffers:
http://www.josuttis.com/libbook/io/inbuf1.hpp.html
This code should also do the trick:
http://www-f9.ijs.si/~matevz/docs/C++/annotations-5.2.0a/cplusplus19.html
Just to clarify, open the file with the large file attribute, and then pass it to one of these classes.
Did you try your C I/O code on your Mac? Because on my Mac (using leopard), it seems that fseek/ftell are also limited by the 2GB threshold…