Wednesday, 27 August 2008

Microbenchmark of SSE in C++ and Java

I am currently developing a file converter for meteorological data, called Fimex. The files are usually in NetCDF or GRIB format and contain several GB of regularly gridded, multi-dimensional data. The data is thus similar to film, except that it may have more dimensions (usually x, y, time and height) and that I don't want to apply any lossy compression algorithm.

When I wanted to put the program into production, it was much too slow. I had tested it on my laptop, where it ran fine, and I expected it to run faster on the server. The server had faster disks and, although a bit older, a CPU running at approximately the same frequency. The application is mainly doing I/O, so what was going wrong?

The main difference between the server and the laptop is the CPU cache: the server has 0.5 MB, while the laptop has 4 MB. Analysing the program showed that I was jumping around in the data. A 200x200 grid of double values (about 320 kB) fits nicely into 0.5 MB, but since I have it at 60 different heights, I had to make sure that I finish the operation on all x,y data of one height before moving on to the next height, rather than running through all heights first. This cache-friendly access order doubled the performance.
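To illustrate the access order, here is a minimal C++ sketch. It is not the Fimex code: the memory layout (x varying fastest, one contiguous slab per height), the function names and the scaling operation are all assumptions made for the example. Both functions compute the same result and differ only in the order in which they touch memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed (hypothetical) layout: value(x, y, z) lives at (z * ny + y) * nx + x,
// so x varies fastest and each height level z is one contiguous nx*ny slab.
inline std::size_t flatIndex(std::size_t x, std::size_t y, std::size_t z,
                             std::size_t nx, std::size_t ny) {
    return (z * ny + y) * nx + x;
}

// Cache-unfriendly order: for every (x, y) point, run through all heights.
// Consecutive accesses are nx*ny doubles apart, so each one may miss the cache.
void scaleColumnwise(std::vector<double>& data, std::size_t nx, std::size_t ny,
                     std::size_t nz, double factor) {
    assert(data.size() == nx * ny * nz);
    for (std::size_t y = 0; y < ny; ++y)
        for (std::size_t x = 0; x < nx; ++x)
            for (std::size_t z = 0; z < nz; ++z)
                data[flatIndex(x, y, z, nx, ny)] *= factor;
}

// Cache-friendly order: finish one height level completely before starting
// the next one, so memory is walked sequentially.
void scaleLayerwise(std::vector<double>& data, std::size_t nx, std::size_t ny,
                    std::size_t nz, double factor) {
    assert(data.size() == nx * ny * nz);
    for (std::size_t z = 0; z < nz; ++z)
        for (std::size_t y = 0; y < ny; ++y)
            for (std::size_t x = 0; x < nx; ++x)
                data[flatIndex(x, y, z, nx, ny)] *= factor;
}
```

With a 200x200 grid and 60 heights, the layer-wise version works on one slab of about 320 kB at a time, which is the part that fits into the smaller cache.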

According to gprof, the remaining and most important bottleneck was a simple sqrt, which I needed to normalize my vectors. I tested different software implementations, but the performance only got worse. After a while, I realized that the SSE2 and AltiVec SIMD extensions of modern chips have a hardware implementation of sqrt, but gcc doesn't use it by default on an x86 system; it is only the default on x86-64. Enabling SSE2 with -mfpmath=sse -msse2 increased the performance considerably again, and I was finally happy with the results, even on the server.
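The kind of loop in question can be sketched as follows. This is an assumed, simplified version with 2-D vectors stored in two arrays; the function name and the data layout are hypothetical and not taken from Fimex.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalize each 2-D vector (u[i], v[i]) to unit length, in place.
// Zero-length vectors are left untouched to avoid dividing by zero.
void normalizeVectors(std::vector<double>& u, std::vector<double>& v) {
    assert(u.size() == v.size());
    for (std::size_t i = 0; i < u.size(); ++i) {
        // This sqrt is the call that showed up in the profile.
        const double norm = std::sqrt(u[i] * u[i] + v[i] * v[i]);
        if (norm > 0.0) {
            u[i] /= norm;
            v[i] /= norm;
        }
    }
}
```

On 32-bit x86, such a file would be compiled with something like `g++ -O2 -mfpmath=sse -msse2 normalize.cpp` (the file name and the -O2 level are only examples) to make gcc do the floating-point math in SSE2 instead of on the x87 unit.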

I tried to reproduce the performance gain with a micro-benchmark doing exactly the same operation on dummy data. The C++ and Java code can be downloaded here. The results were a bit disappointing: with SSE enabled, I gained less than 20%. In addition, the micro-benchmark ran faster on the server than on the laptop, which is the opposite of what I saw with Fimex. I then translated the same code to Java. The Java code is a bit simpler and runs at approximately the same speed as the SSE code (-1%), so SSE seems to be on by default in the Java HotSpot VM. I'm not 100% happy with the micro-benchmark results, but at least I'm happy with my code, which is now in production and running fast enough.
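For readers who want to try something similar, a micro-benchmark of this kind could look roughly like the sketch below. It is not the downloadable code: the names, the dummy data and the use of std::chrono (which requires C++11) are assumptions for illustration. The checksum is returned only so that the compiler cannot drop the timed loop.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

struct BenchResult {
    double seconds;   // wall-clock time spent in the timed loop
    double checksum;  // sum of all norms, keeps the loop from being optimized away
};

// Time `repeats` passes of the vector-norm computation over n dummy vectors.
BenchResult benchmarkSqrt(std::size_t n, int repeats) {
    // Dummy data: arbitrary non-zero values, not real meteorological fields.
    std::vector<double> u(n), v(n);
    for (std::size_t i = 0; i < n; ++i) {
        u[i] = 1.0 + static_cast<double>(i % 100);
        v[i] = 2.0 + static_cast<double>(i % 50);
    }

    double checksum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < n; ++i)
            checksum += std::sqrt(u[i] * u[i] + v[i] * v[i]);
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = stop - start;
    return {elapsed.count(), checksum};
}
```

Building this once with and once without -mfpmath=sse -msse2 and comparing the reported seconds for the same n and repeats gives the kind of comparison described above. Keep in mind that such a small loop does not reproduce the I/O and cache behaviour of the real program.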
