
Tuesday, 7 August 2012

Performance of memory mapped files for streams

Memory mapped files, or mmap files, map a file into virtual memory, allowing the file's data to be addressed as if it were in memory, without the overhead of actually copying everything into memory. Mmap files are part of POSIX and therefore available on practically all platforms. I investigated the performance of mmap files for numerical weather prediction files, i.e. files larger than 1GB, where large chunks of data must be read as a stream.
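
For readers who have not used mmap directly: a minimal sketch of the plain POSIX calls could look roughly like this. The file name is a made-up placeholder; the post itself uses the Boost wrapper shown below.

    // Minimal sketch: mapping a file with the plain POSIX mmap API.
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <iostream>

    int main() {
        const char* fileName = "forecast.felt";  // placeholder file name
        int fd = open(fileName, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

        // Map the whole file read-only; the kernel pages data in on demand.
        void* addr = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        // The file content can now be addressed like ordinary memory.
        const char* data = static_cast<const char*>(addr);
        std::cout << "first byte: " << int(data[0]) << std::endl;

        munmap(addr, st.st_size);
        close(fd);
        return 0;
    }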

The library that reads the data uses many seekg and read calls on C++ iostreams. By switching the C++ iostream to a boost::iostreams::mapped_file_source, the program could be used without major changes:

        // requires <boost/iostreams/device/mapped_file.hpp>
        // and <boost/iostreams/stream.hpp>
        using namespace boost::iostreams;
        typedef stream<mapped_file_source> mmStream;  // stream backed by a memory-mapped file
        mmStream* mmistream = new mmStream();
        mmistream->open(mapped_file_source(fileName));
        feltFile_ = mmistream;

This worked as expected, and feltFile_ kept working with the old data-reading calls:
    feltFile_->seekg(pos, ios_base::beg);
    feltFile_->read((char*) ret.get(), blockWords * sizeof(word));
On my 32bit platform, I ran into the first problem with files larger than 2GB. Mmap files need an address space larger than the file size, which on 32bit Linux is about 3.5GB. For files between 2GB and 3.5GB, there is a problem with the file-position pointer of iostream, which seems to use a signed 32bit integer and therefore cannot address positions beyond 2^31-1 bytes (about 2GB).

64bit platforms are around the corner, so I continued my tests with 1.5GB files, ignoring the 32bit issues. For compatibility, my code simply catches the mmap exception and reopens the large files as standard streams.
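
A minimal sketch of that fallback, using the same names as the snippet above, could look like this. Catching std::exception is an assumption on my side; the exception actually thrown by Boost.Iostreams may be more specific.

    // Sketch: try the memory-mapped stream first, fall back to std::ifstream
    // if mapping fails (e.g. too little address space on 32bit).
    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/iostreams/stream.hpp>
    #include <exception>
    #include <fstream>
    #include <iostream>
    #include <string>

    std::istream* openFeltFile(const std::string& fileName) {
        using namespace boost::iostreams;
        typedef stream<mapped_file_source> mmStream;
        try {
            mapped_file_source src(fileName);  // may throw for files that cannot be mapped
            mmStream* mmistream = new mmStream();
            mmistream->open(src);
            return mmistream;
        } catch (const std::exception& e) {
            // reopen the large file as a standard stream
            std::cerr << "mmap failed (" << e.what() << "), falling back to ifstream" << std::endl;
            return new std::ifstream(fileName.c_str(), std::ios::binary);
        }
    }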

Performance measurements showed no benefit of using mmap files over standard streams, even when reading the same file in parallel. There was even a slight, insignificant performance advantage for std::streams. Mmap files don't seem to make sense for stream-like readers, even though it is very simple to switch from one to the other. I guess mmap makes more sense where the file would otherwise be slurped into memory. I removed the use of mapped_files after the test, but only to simplify the code.

Wednesday, 6 August 2008

Choosing the right database

Databases come in all shapes and sizes. On the high end there are Oracle, IBM and others I forgot. Then you have a set of free alternatives to those, think MySQL and Postgres, just to name some. For smaller applications which don't need concurrent access, there are several embedded ones: SQLite, Derby, HSQL. And if you don't have many tables and you don't like SQL, you can choose file-based databases, e.g. BerkeleyDB, ndbm, gdbm and all of those that were so popular in the 90s before every programmer needed to know SQL. And all non-programmers believe that the file system or even the tape library is a database - and honestly, aren't they right?

I might have had a quite common learning curve with databases: I started out using CSV files, then I was very pleased with BerkeleyDB while creating a graphical overview of all runners of the Köln-Marathon, and since I had to learn SQL at the end of the 90s anyway, I installed PostgreSQL and played with it, building an online calendar for my band.

I started more serious work, that is, projects which still exist, in 2002.

The online dictionary Heinzelnisse was first built using PostgreSQL. It was an obvious choice: it was installed on the machine I intended to use, I knew SQL well enough by then, and I needed several tables and concurrent access. When Heinzelnisse started to have some users a year later, I recognized how bad the choice was: PostgreSQL 6 was extremely slow on 32bit Solaris Sparc machines. And the machine was already 5 years old, running at 166MHz (and 4GB) at that time. The installation was so old that the administrator never managed to get mod_perl running there; in the end it took more than 10s to answer a query - that was much too slow.

So I went back to using BerkeleyDB. It was very well integrated into Perl, and fast enough on that computer; queries ran in under 2s. A year or two later, I switched to an extremely cheap professional hoster, which cost about 1$/month. They only had MySQL installed, so I had to switch again. I was never really happy with MySQL, since I very often had problems with the character columns, which behave quite strangely in some cases, i.e. Tür = tür = tyr, but it was only a matter of some learning and of using binary data instead of characters. The dictionary ran nicely, but I was thrown out by the cheap hoster, since my application was too CPU-demanding - they only limited bandwidth and disk space.

Currently Heinzelnisse is hosted on a virtual machine at Tektonic. Even though I can now install whichever database I like, I still stick to MySQL-MyISAM, since it works with very little memory, but I don't think I would have chosen MySQL if I hadn't been limited by the cheap webhosters.

Another application I was heavily involved in is Webdab, the European air emission database. This database started with Sybase and later switched to Oracle 7. When we had to make it available on the Internet in 2002, we knew that the license costs would be too much for our budget, and we had to find an alternative. We found one in PostgreSQL and never regretted that decision, though we had to sort out some performance problems in the beginning.

So, choosing the right database is not simple. In my opinion, there are often more non-technical than technical reasons for selecting a database. Maybe the most important thing to remember is to embed the database into the application in such a way that it is possible to switch to another database at a later stage.
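
One way to read that last point, sketched in C++ with made-up names: let the application talk to a small interface of its own and keep all SQL (or BerkeleyDB calls) behind it, so that a change of backend only touches one class. The in-memory backend below just stands in for a real one.

    // Sketch: hide the storage behind an application-level interface, so
    // swapping e.g. PostgreSQL for MySQL only touches one class.
    // All names are made up for illustration.
    #include <iostream>
    #include <string>
    #include <vector>

    struct DictEntry {
        std::string german;
        std::string norwegian;
    };

    // What the application actually needs, expressed without SQL.
    class DictionaryStore {
    public:
        virtual ~DictionaryStore() {}
        virtual std::vector<DictEntry> lookup(const std::string& word) = 0;
    };

    // Trivial in-memory backend standing in for a real SQL-backed one;
    // a MySqlDictionaryStore etc. would implement the same interface.
    class InMemoryDictionaryStore : public DictionaryStore {
    public:
        void add(const DictEntry& e) { entries_.push_back(e); }
        std::vector<DictEntry> lookup(const std::string& word) {
            std::vector<DictEntry> hits;
            for (size_t i = 0; i < entries_.size(); ++i)
                if (entries_[i].german == word || entries_[i].norwegian == word)
                    hits.push_back(entries_[i]);
            return hits;
        }
    private:
        std::vector<DictEntry> entries_;
    };

    int main() {
        InMemoryDictionaryStore store;
        DictEntry e = { "Haus", "hus" };
        store.add(e);
        std::cout << store.lookup("Haus").size() << " hit(s)" << std::endl;
        return 0;
    }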