
Tuesday, 7 August 2012

Performance of memory mapped files for streams

Memory mapped files, or mmap-files, map a file into virtual memory, allowing the file's data to be addressed as if it were in memory, without the overhead of actually copying everything into memory. Mmap-files are part of POSIX and therefore available on practically all platforms. I investigated the performance of mmap-files for numerical weather prediction files, i.e. files larger than 1GB, where large chunks of data must be read as a stream.
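For reference, a minimal sketch of the underlying POSIX calls (error handling mostly omitted, names illustrative); the boost wrapper used below hides these details:

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map a whole file read-only into virtual memory; the returned pointer can be
    // indexed like an in-memory array, the kernel pages the data in on demand.
    const char* mapWholeFile(const char* path, std::size_t& size)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return 0;
        struct stat st;
        fstat(fd, &st);
        size = st.st_size;
        void* addr = mmap(0, size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd); // the mapping stays valid after closing the descriptor
        return (addr == MAP_FAILED) ? 0 : static_cast<const char*>(addr);
    }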

The library that reads the data uses many seekg and read calls on C++ iostreams. By switching the C++ iostream to a boost::iostreams::mapped_file_source, the program could be used without major changes:

        #include <boost/iostreams/device/mapped_file.hpp>
        #include <boost/iostreams/stream.hpp>

        using namespace boost::iostreams;
        // an istream whose underlying device is a memory-mapped, read-only file
        typedef stream<mapped_file_source> mmStream;
        mmStream* mmistream = new mmStream();
        mmistream->open(mapped_file_source(fileName));
        feltFile_ = mmistream;

This worked as expected, and feltFile_ kept working with the old data-reading calls:
    feltFile_->seekg(pos, ios_base::beg);
    feltFile_->read((char*) ret.get(), blockWords * sizeof(word));
On my 32-bit platform I ran into the first problem with files larger than 2GB. Mmap-files need an address space larger than the file size, which on 32-bit Linux is ~3.5GB. For files between 2GB and 3.5GB there is a problem with the file-position pointer of iostream, which seems to use a signed 32-bit integer, so offsets beyond 2^31-1 bytes (2GB) overflow.

64-bit platforms are around the corner, so I continued my tests with 1.5GB files, ignoring the 32-bit issues. For compatibility, my code simply catches the mmap exception and reopens the large files as standard streams.
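A minimal sketch of that fallback, assuming feltFile_ is a plain std::istream* and that the failed mapping surfaces as a std::exception; openFeltFile and the exact exception type are illustrative, not the actual library code:

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <boost/iostreams/stream.hpp>
    #include <exception>
    #include <fstream>
    #include <istream>
    #include <string>

    // Hypothetical helper: try the memory-mapped stream first, and fall back
    // to a standard file stream when mapping fails (e.g. not enough address
    // space for a large file on a 32-bit platform).
    std::istream* openFeltFile(const std::string& fileName)
    {
        using namespace boost::iostreams;
        try {
            stream<mapped_file_source>* mmistream = new stream<mapped_file_source>();
            mmistream->open(mapped_file_source(fileName));
            return mmistream;
        } catch (const std::exception&) {
            // reopen as a standard stream; the caller uses it the same way
            return new std::ifstream(fileName.c_str(), std::ios::binary);
        }
    }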

Performance measurements showed no benefit of mmap-files over standard streams, even when reading the same file in parallel. There was even a slight, insignificant performance advantage for std::streams. Mmap-files don't seem to make sense for stream-like readers, even though it is very simple to switch from one to the other. I guess mmap makes more sense where the file would otherwise be slurped into memory. I removed the use of mapped_files after the test, but just to simplify the code.

Wednesday, 27 August 2008

Microbenchmark of SSE in C++ and Java

I am currently developing a file converter for meteorological data, called Fimex. Those files are usually in the NetCDF or GRIB data format and contain several GB of regularly gridded multi-dimensional data. The data is thus similar to film, with the exception that there may be more dimensions, usually x, y and time plus height, and I don't want to apply any lossy compression algorithm.

When I wanted to put the program into production, it was much too slow. I had tested it on my laptop, where it ran fine, and I expected it to run even faster on the server: the server had faster disks, and although it was a bit older, its CPU ran at approximately the same frequency. The application is mainly doing IO, so what was going wrong?

The main difference between the server and the laptop is the cache: the server has 0.5MB, while the laptop has 4MB. Analysing the program showed that I was jumping around in the data. A 200x200 grid of double data fits nicely into 0.5MB, but since I have it at 60 different heights, I had to make sure not to apply the operation across all height levels before having finished all x,y data. This cache-friendly access order doubled the performance.
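A minimal sketch of that reordering, assuming a flat array laid out level by level as data[z][y][x]; process() and the sizes are illustrative, not the actual Fimex code:

    #include <cstddef>
    #include <vector>

    void process(double& value) { value *= 2.0; } // stand-in for the real per-point operation

    void run(std::vector<double>& data, std::size_t nx, std::size_t ny, std::size_t nz)
    {
        // cache-unfriendly: the inner loop strides over all 60 levels, so each
        // 200x200 slice (~320kB of doubles) is evicted before it is reused
        for (std::size_t y = 0; y < ny; ++y)
            for (std::size_t x = 0; x < nx; ++x)
                for (std::size_t z = 0; z < nz; ++z)
                    process(data[(z * ny + y) * nx + x]);

        // cache-friendly: finish one x,y slice before moving on to the next
        // level, so the working set stays within the 0.5MB cache
        for (std::size_t z = 0; z < nz; ++z)
            for (std::size_t y = 0; y < ny; ++y)
                for (std::size_t x = 0; x < nx; ++x)
                    process(data[(z * ny + y) * nx + x]);
    }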

According to gprof, the remaining and most important bottleneck was a simple sqrt which I needed to normalize my vectors. I tested different software implementations, but the performance only got worse. After a while I recognized that the SSE2 and AltiVec SIMD extensions of modern chips have a hardware implementation of sqrt, but this isn't used by default by gcc on an x86 system; it is only the default on x86-64. Enabling SSE2 with -mfpmath=sse -msse2 increased the performance considerably again, and I was finally happy with the results, even on the server.
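An illustrative example of the kind of loop involved; normalize() is a stand-in, not the actual Fimex code. Built with plain g++ -O2 on 32-bit x86 the square root goes through the x87/libm path, while adding -mfpmath=sse -msse2 lets gcc use the SSE2 hardware square root instead:

    #include <cmath>
    #include <cstddef>

    // Normalize 2D vectors given as separate u/v component arrays;
    // the std::sqrt here was the hotspot gprof pointed at.
    void normalize(double* u, double* v, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            double len = std::sqrt(u[i] * u[i] + v[i] * v[i]);
            if (len > 0.0) {
                u[i] /= len;
                v[i] /= len;
            }
        }
    }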

I tried to repeat the performance gain with a micro-benchmark, doing exactly the same operation with dummy data. That C++ and Java code can be downloaded here. The results were a bit disappointing: I got less than 20% performance gain with SSE enabled. In addition, the code ran faster on the server than on the laptop, which is in contrast to Fimex. I translated the same code to Java. The Java code is a bit simpler, but runs at approximately the same speed as the SSE code (-1%), so SSE seems to be on by default in the Java HotSpot VM. I'm not 100% happy with the micro-benchmark results, but at least I'm happy with my code, which is now in production and running fast enough.
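For illustration, a micro-benchmark of the same shape (a sketch, not the downloadable code): fill dummy data, time the normalization, and compare a build with -mfpmath=sse -msse2 against one without.

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <ctime>
    #include <vector>

    int main()
    {
        const std::size_t n = 10 * 1000 * 1000;
        std::vector<double> u(n, 3.0), v(n, 4.0); // dummy data, length 5 everywhere

        std::clock_t start = std::clock();
        for (std::size_t i = 0; i < n; ++i) {
            double len = std::sqrt(u[i] * u[i] + v[i] * v[i]);
            u[i] /= len;
            v[i] /= len;
        }
        std::clock_t end = std::clock();

        std::printf("normalization: %.3f s\n",
                    double(end - start) / CLOCKS_PER_SEC);
        return 0;
    }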