tirsdag 7. august 2012

Performance of memory mapped files for streams

Memory mapped files, or mmap files, map a file into virtual memory, allowing the file's data to be addressed as if it were in memory, without the overhead of actually copying everything into memory. Mmap files are part of POSIX and therefore available on practically all platforms. I investigated the performance of mmap files for numerical weather prediction files, i.e. files larger than 1GB, where large chunks of data must be read as a stream.

The library that reads the data performs many seekg and read calls on C++ iostreams. By switching the C++ iostream to a boost::iostreams::mapped_file_source, the program could be reused without major changes:

        using namespace boost::iostreams;
        typedef stream<mapped_file_source> mmStream;
        mmStream* mmistream = new mmStream();
        mmistream->open(mapped_file_source(fileName));
        feltFile_ = mmistream;

This worked as expected, and feltFile_ kept working with the old data-reading calls:
    feltFile_->seekg(pos, ios_base::beg);
    feltFile_->read((char*) ret.get(), blockWords * sizeof(word));
On my 32bit platform, I ran into the first problem with files larger than 2GB. Mmap files need an address space larger than the file size, which on 32bit Linux is ~3.5GB. For files between 2GB and 3.5GB, there is a problem with the stream position type of iostream, which seems to use a signed 32bit integer.

64bit platforms are around the corner, so I continued my tests with 1.5GB files, ignoring the 32bit issues. For compatibility, my code simply catches the mmap exception and reopens large files as standard streams.

Performance measurements showed no benefit from using mmap files over standard streams, even when reading the same file in parallel. There was even a slight, statistically insignificant advantage for std::streams. Mmap files don't seem to make sense for stream-like readers, even though it is very simple to switch from one to the other. I guess mmap makes more sense where the file would otherwise be slurped into memory. I removed the use of mapped_files after the test, but just to simplify the code.
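The same stream-versus-map comparison can be sketched outside C++ as well; Java exposes mmap through FileChannel.map. This is a minimal sketch of the two read paths, using a small temporary file rather than a 1GB weather file, only to show that both paths see identical data:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapVsStream {
    public static void main(String[] args) throws IOException {
        // create a small test file standing in for the large data file
        Path p = Files.createTempFile("mmap-test", ".bin");
        byte[] data = new byte[1 << 20]; // 1 MB
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(p, data);

        // read sequentially through a memory mapping
        long mapSum = 0;
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            while (buf.hasRemaining()) mapSum += buf.get() & 0xFF;
        }

        // read sequentially through an ordinary buffered stream
        long streamSum = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(p))) {
            int b;
            while ((b = in.read()) != -1) streamSum += b;
        }

        System.out.println(mapSum == streamSum); // both paths see identical data
        Files.deleteIfExists(p);
    }
}
```

For sequential access like this, the OS readahead makes the buffered stream just as fast as the mapping, which matches the measurements above.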

fredag 11. mai 2012

Accessing a Fortran-library from C

I recently had to work with a scientist who had written a nice library in his preferred language, Fortran. The library was useful, and since we wanted to publish the results on our webservers, we had to connect it to Apache. Unfortunately, Apache isn't written in Fortran, so we had to create a C wrapper around the Fortran code.

This starts quite simply. A subroutine like
subroutine calcThis(in, out)
real, intent(IN) :: in
real, intent(OUT):: out
end subroutine calcThis
can be called from C like
calcthis_(&in, &out);
Note that Fortran functions appear in C in lower case, usually with a trailing underscore, and that all arguments, even input variables, are passed by reference.

The problem starts when asking: what is this real, a float or a double? The sad answer is: either, because the type is defined by the compiler, or rather by the compiler options. For gcc, -fdefault-real-8 makes the default real 8 bytes wide, and promotes double precision to 16 bytes unless -fdefault-double-8 is also given. Since Fortran 2003, this can be controlled much better using the ISO_C_BINDING module, which allows using C datatypes explicitly, like REAL(KIND=C_DOUBLE).
This was originally intended to help Fortran access C libraries, but it also helps the other way around. Unfortunately, at least in gcc-4.4, it does not check whether the datatypes are used inconsistently, so it is still possible to intermix a 4-byte real with an 8-byte real and so on.

Strings are even trickier. While a C string is only a character array with a terminating 0 (char*), a Fortran string is an object with an internally defined length. The calling convention for intermixing with C is not standardized, but it is often (gfortran, pgf) handled the following way:
fortranfunc_(char* string1, char* string2, int strLen1, int strLen2)
i.e., the lengths of the strings are appended as the last arguments of the function. This can get very complicated, e.g. when variable-length arrays of strings are used:
integer, intent(IN) :: aryLen
character*256, intent(IN) :: strAry(aryLen)
Also here, the ISO_C_BINDING can help:
integer, intent(IN) :: aryLen
character(LEN=1, KIND=C_CHAR), intent(IN) :: strAry(aryLen*256)

The length is always 1, since only a pointer is passed. The allocated length is then the length of a single string times the array length.
Unfortunately gfortran (and, I think, other compilers too) doesn't convert between the string types automatically, so copying a C string to a Fortran string needs special care. On the net I found some Fortran functions which should do this, but I didn't manage to get them compiled and had to change them into Fortran subroutines to make them work:
SUBROUTINE C_F_STRING (C_STR, F_STRING)
USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_CHAR, C_NULL_CHAR
IMPLICIT NONE
CHARACTER(LEN=1,KIND=C_CHAR), INTENT(IN) :: C_STR(*)
CHARACTER(LEN=*), INTENT(INOUT)          :: F_STRING ! output
INTEGER                      :: I, N

N = LEN(F_STRING)
DO I = 1, N
   IF (C_STR(I) .EQ. C_NULL_CHAR) THEN
      F_STRING(I:) = ' '       ! blank-pad the remainder
      EXIT
   END IF
   F_STRING(I:I) = C_STR(I)
END DO

END SUBROUTINE C_F_STRING


SUBROUTINE F_C_STRING (F_STRING, C_STRING)
USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_CHAR, C_NULL_CHAR
IMPLICIT NONE
CHARACTER(LEN=*), INTENT(IN) :: F_STRING
CHARACTER(LEN=1,KIND=C_CHAR), INTENT(INOUT) :: C_STRING(LEN(F_STRING)+1)
INTEGER                      :: N, I

N = LEN_TRIM(F_STRING)
DO I = 1, N
   C_STRING(I) = F_STRING(I:I)
END DO
C_STRING(N + 1) = C_NULL_CHAR

END SUBROUTINE F_C_STRING

It should be noted that the C string must be allocated one byte more than the Fortran string, to make room for the terminating 0. The subroutines are then used like:
! initialization
CHARACTER(LEN=1, KIND=C_CHAR), INTENT(INOUT) :: rep250(251*maxrep)
CHARACTER(LEN=1, KIND=C_CHAR), INTENT(IN) :: desc100(101)
CHARACTER(LEN=250) :: f_rep250(maxrep)
CHARACTER(LEN=100) :: f_desc100

! convert to fortran strings
CALL C_F_STRING(desc100,f_desc100)
! the array case
DO i = 1, maxrep
   CALL C_F_STRING(rep250((i-1)*251+1) ,f_rep250(i))
END DO

! do something with the fortran strings
...

! convert to C-strings
CALL F_C_STRING(f_desc100, desc100)
DO i = 1, maxrep
   CALL F_C_STRING(f_rep250(i), rep250((i-1)*251+1))
END DO
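The conversion logic of the two subroutines (copy up to the terminating 0 and blank-pad in one direction, trim trailing blanks and append the 0 in the other) is not Fortran-specific; as a sketch, here is the same logic in Java, with a made-up 10-character record width standing in for the 250/251-byte layout above:

```java
import java.nio.charset.StandardCharsets;

public class FortranStrings {
    // C_F_STRING: copy up to the terminating 0, blank-pad to the fixed width
    static String cToFortran(byte[] cStr, int width) {
        int n = 0;
        while (n < width && n < cStr.length && cStr[n] != 0) n++;
        StringBuilder sb = new StringBuilder(new String(cStr, 0, n, StandardCharsets.US_ASCII));
        while (sb.length() < width) sb.append(' '); // Fortran strings are blank-padded
        return sb.toString();
    }

    // F_C_STRING: trim trailing blanks, append the terminating 0
    static byte[] fortranToC(String fStr) {
        int n = fStr.length();
        while (n > 0 && fStr.charAt(n - 1) == ' ') n--; // like LEN_TRIM
        byte[] out = new byte[fStr.length() + 1];       // one extra byte for the 0
        System.arraycopy(fStr.getBytes(StandardCharsets.US_ASCII), 0, out, 0, n);
        out[n] = 0;
        return out;
    }

    public static void main(String[] args) {
        byte[] c = "hello\0garbage".getBytes(StandardCharsets.US_ASCII);
        String f = cToFortran(c, 10);
        System.out.println("[" + f + "]");  // prints "[hello     ]"
        byte[] back = fortranToC(f);
        System.out.println(back[5] == 0);   // prints "true"
    }
}
```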
Since this might require a lot of additional wrapper code, it should be automated. In my case, I wrote a pseudo header including tags like
FWRAP(real workWithStrings)[
                      IN integer maxrep,
                      INOUT character*250(maxrep) rep250,
                      IN character*100 desc100]
which got processed by a Perl script that writes both the C header file and the f90 wrapper functions. Good information about the ISO_C_BINDING can be found in the IBM Linux Compilers handbook.

søndag 16. oktober 2011

Testing hsqldb database for the cloud

Cloud services have been around for some time now. While my currently preferred deployment architecture is a virtual private server (VPS), VPS offerings have not seen much attention lately. My current VPS still costs $15/month, the same as in 2005, though it now has 512MB of RAM rather than 64MB and 3x the disk space. A comparable server in the cloud should be much cheaper (approx. $0/month). But before switching to a PaaS like Google AppEngine or an IaaS like Amazon EC2, some technical issues need to be overcome; in particular, none of these offer access to the filesystem or SQL databases for free. Google has just started to address the reprogramming issue with Google Cloud SQL, but this is not available for free.

My application Heinzelnisse comes with 5 different information databases. 3 of them are static, or at least upgraded less frequently than the application itself, while the remaining 2 are a forum and a wiki. The idea is now to move the static dbs from the MySQL database into the application war/jar file. The 3 static dbs come from spreadsheet tables and are, in spreadsheet format, less than 5MB in total. Therefore, I'm trying to redeploy them to an embedded database which can be run from a jar.

I used hsqldb in the current version, 2.2.5. Minor changes to the SQL schema were needed to create the database. I had to make sure to set removeAbandoned=true in the DBCP connection pool settings.

When using the default in-memory tables, the table data grew from 1MB in the spreadsheet to 30MB in the data.script startup file. The performance was comparable to MySQL; I could still serve about 30 requests/s as long as I used the server jvm. The problem with in-memory tables is the long startup time and huge memory requirements. It took 184s to read the .script file on startup. While the application had been running stable with -Xmx76m before, I now had to increase this to more than 160MB. Using that amount of memory for a 1MB file was not acceptable.

Then I changed the in-memory tables to disk-cached tables. The startup time decreased to 12s, and the application was stable again with 76MB of jvm memory. But the data is now stored on disk in a .data file, which is 48MB in my case. Performance didn't change, at least not in my test with unchanged queries and therefore perfect caching conditions. So far, so nice, but file access is not allowed in the cloud. hsqldb allows resource tables in a jar, but it wasn't clear whether they work with disk-cached tables.

Testing disk-cached tables from a jar didn't work, so I had to ask on the mailing list whether it is supported at all. The answer was fast and positive, even with some tips on how to do it, but I still didn't get it working. After scanning through the code I found an obvious bug, and my patch was accepted for the coming 2.2.6, but it is obvious that this type of deployment has not been used since the start of the 1.9 release some years ago.

I then also tried the 1.8 release, which is still the most stable version of hsqldb. This again required minor changes to the schema, but I got it installed quickly. Unfortunately, performance of 1.8 seems to be much worse than 2.2: I didn't manage more than 4 requests/s, versus 30 with mysql and hsqldb 2.2. I didn't investigate where the bottleneck was, though.

hsqldb seems to be a nice, feature-rich java database. But there still seems to be a gap between in-memory databases and full file-based databases: I didn't manage to serve about 100000 rows from memory with fast startup time and low memory consumption. My next try will be to hand-tune these tables.
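Hand-tuning could mean bypassing SQL entirely for the static tables: ship them as flat files inside the jar and load them into plain Java maps at startup. This is only a sketch of that idea; the dictionary-style data and the tab-separated format are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaticTable {
    // one lookup map, built once at startup from a tab-separated resource
    static Map<String, List<String>> load(BufferedReader in) throws IOException {
        Map<String, List<String>> table = new HashMap<String, List<String>>();
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split("\t", 2); // key \t value
            if (cols.length == 2) {
                List<String> vals = table.get(cols[0]);
                if (vals == null) {
                    vals = new ArrayList<String>();
                    table.put(cols[0], vals);
                }
                vals.add(cols[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // in the real application this would come from
        // getClass().getResourceAsStream("/dict.tsv") inside the war/jar
        String data = "warum\twhy\nwarum\thvorfor\nhaus\thouse\n";
        Map<String, List<String>> dict = load(new BufferedReader(new StringReader(data)));
        System.out.println(dict.get("warum")); // prints "[why, hvorfor]"
        System.out.println(dict.size());       // prints "2"
    }
}
```

Startup is then just a linear read of a few MB, and memory holds only the data plus map overhead, with no SQL engine in between.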

tirsdag 22. september 2009

Installing rpm packages on a deb-system without root access

This blog is a bit off topic, and more or less a note to myself. I just wanted to install OpenOffice 3 on my work machine. Since I have no root access on that linux machine, and the administrators are usually busy, I did the following:
  • Download the deb-package from openoffice.org (there are internally rpm-files)
  • Create a private rpm database: mkdir -p /disk1/heiko/rpmdb; rpm --dbpath /disk1/heiko/rpmdb --initdb
  • Install the rpm packages without dependencies: rpm --nodeps --dbpath /disk1/heiko/rpmdb --prefix /disk1/heiko/local/OO3.1 --install ooobasis3.1-*.rpm openoffice.org*.rpm
Now I can use openoffice 3.1 by starting /disk1/heiko/local/OO3.1/openoffice.org3/program/soffice

torsdag 30. april 2009

Jetty on a cheap virtual server

In my last blog, I wrote about my first tries at using Jetty to improve the memory footprint on a cheap ($15) virtual private server with 300MB, running the complete application, including database and webserver. Those results were derived from benchmarks of the main part of the application, without changing the java settings. Now it is time to come back and report the results after running jetty 6.1.14 in production. It should be noted that I had 100% uptime for the application during the first two months of 2009, while still running on tomcat. Shortly after switching to Jetty, the uptime dropped below 98%, for several reasons:
  • Running java 1.6.0_11 in client mode was a very bad idea. Jetty simply stalled after 3 hours without any error or warning message. I've seen similar behaviour with tomcat 6.0.18, while tomcat 6.0.14 had been running for weeks without problems with the same java version in client mode. I don't know why, but lesson learned: when running a server application, use the server JVM.
  • A concurrency bug in my code showed up. The load didn't increase, but jetty seems to run the code much more concurrently than tomcat did. After fixing the code, that problem was solved.
  • A break-in to a different VPS on the same machine caused a lot of problems with extremely slow network and disk access. After the provider shut down that VPS, everything got better. This was maybe the main cause of the uptime reduction, and it has more to do with VPS handling than with Jetty.
With jetty, I managed to reduce the memory footprint. The application was still running well with new java options. On tomcat, I had to use:
JAVA_OPTS='-server -Xmx96m -Xms96m -Djava.awt.headless=true -XX:+UseConcMarkSweepGC -XX:PermSize=24m -XX:MaxPermSize=24m'
Now I'm using without problems:
JAVA_OPTS='-server -Xmx76m -Xms32m -Djava.awt.headless=true -XX:+UseSerialGC
-XX:MaxPermSize=20m -XX:ReservedCodeCacheSize=16m'
The ReservedCodeCacheSize is a setting I only found out about recently, and it is 46MB by default on an x86 server machine. With some additional tuning of the database, reducing its memory requirements from 36MB to 25MB while slightly increasing the cache hit ratio from 1% to 2%, I'm now running the application within 220MB, and I'm not afraid of getting more than 50 hits/s.

Another important adjustment I made to increase stability was to preallocate a database connection for each running thread. Usually, a connection pool with 2-5 spare connections would be enough, but on a VPS you never know how long it takes to start a new process. I have seen a simple 'ls' take 10-20s, and the 10s timeout for waiting on a database connection was often exceeded, in particular in the morning when people start using my application.
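The preallocation idea can be sketched as a fixed-size pool that creates every connection up front, before any request arrives. This is a simplified sketch, with a generic factory standing in for the real JDBC DataSource and dummy strings standing in for java.sql.Connection:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

public class PreallocatedPool<T> {
    private final BlockingQueue<T> idle;

    // create all connections up front, one per expected worker thread,
    // so nothing has to be opened while the VPS is slow or under load
    public PreallocatedPool(int size, Callable<T> factory) throws Exception {
        idle = new ArrayBlockingQueue<T>(size);
        for (int i = 0; i < size; i++)
            idle.add(factory.call());
    }

    public T borrow(long timeoutMs) throws InterruptedException {
        T conn = idle.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (conn == null)
            throw new IllegalStateException("no connection within " + timeoutMs + " ms");
        return conn;
    }

    public void release(T conn) {
        idle.add(conn);
    }

    public static void main(String[] args) throws Exception {
        final int[] counter = {0};
        PreallocatedPool<String> pool = new PreallocatedPool<String>(3,
            new Callable<String>() {
                public String call() { return "conn-" + counter[0]++; }
            });
        String c = pool.borrow(1000);
        System.out.println(c);                 // prints "conn-0"
        pool.release(c);
        System.out.println(pool.borrow(1000)); // prints "conn-1"
    }
}
```

With all connections created at startup, the only wait at request time is for another thread to return one, never for the database to accept a new connection.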

tirsdag 10. mars 2009

Tomcat 6.0.18 vs Jetty 6.1.14

When I started programming servlets some years ago, Tomcat (5) was the natural choice of servlet container. It wasn't just the best-known container; it was also blessed by SUN as the reference implementation, so it always had the latest compatible features. In addition, Tomcat came bundled with my IDE, so I didn't even think about using a different container. Over the years, the situation has changed a bit. SUN moved to Glassfish/Grizzly as the reference implementation, and I got the impression that the hype moved from Tomcat to Jetty as the preferred container.

Being lazy, I wouldn't have thought about switching if it weren't for some problems I had when upgrading from Tomcat 6.0.14 to 6.0.18. Suddenly, my jsp pages were full of errors and my cookies threw errors. Both problems were conflicts with the specs: I had used 'empty(var)' in my code instead of 'empty (var)' (watch the space), and I hadn't encoded the cookie values. But I don't want to see such problems turn up when I just patch because of a security advisory.

Another reason to try something else was memory consumption. Memory is cheap, but when running servlets on a virtual machine from a hosting provider, the price increases close to linearly with memory, since cpu power is usually no issue for people sharing servers.

I then decided to give Jetty a try, since it is known to be small and well suited for embedded applications, which usually have memory limits. My first impression wasn't very good: the download is quite big, I didn't manage to compile jetty myself, and I couldn't find good documentation. But after some trial-and-error sessions, I managed to get jetty running. The missing documentation turned out to be my biggest problem when setting up all the features I needed for my application, namely virtual hosts, a JNDI database pool, and compressed html pages. Searching for documentation on the web didn't help much, since the configuration seems to have changed between Jetty 4, 5 and 6; the chance of getting a wrong hint is higher than of getting a correct one. The best source of documentation is the javadocs. There it becomes clear that Jetty is mostly used as an embedded container, configured at the java level, rather than as a standalone application. The configuration file reads like java translated to xml rather than something user-friendly. There were some other inconveniences, but all could be solved.

Finally everything was set up, and I could start testing both containers for performance and, more importantly, for memory consumption. I used the flags which I had found to work best for my application on tomcat on a two-processor machine:
JAVA_OPTS='-Xmx96m -Xms96m -Djava.awt.headless=true -XX:+UseConcMarkSweepGC -XX:PermSize=24m -XX:MaxPermSize=24m'

and I tested with:
ab -c 20 -n 10000 'http://localhost:8080/dict?searchItem=warum'
I measured the first, the second, the third and the fourth 10000 requests; total memory consumption was measured with top (RES).

                 #/s 10000   #/s 20000   #/s 30000   #/s 40000   Memory consumption (30000)
jetty -server       224         391         511         512         148m
tomcat -server      248         525         639          33         170m
jetty -client       383         408         414         414          78m
tomcat -client      496         564         543          46         126m

The results show that tomcat's memory consumption is more than 20% larger than jetty's, while using a client vm saves another 20%. For performance, approximately the opposite is the case. In addition, after about 35000 requests, tomcat's performance drops dramatically; analysing this further, it is caused mainly by the GC running permanently. I haven't seen that under normal load, where I have approx. 70000 hits a day, so I think this must be some kind of session data which is kept in tomcat until the session times out after 30min. Talking about real-world usage: I have had jetty running for some days now, and I don't notice any differences. Performance is good, and memory consumption approximately the same as before. Maybe the most important lesson I have learned: run with the client vm when running low on memory.

tirsdag 16. desember 2008

Comparing 64bit and 32bit Linux and Java

I'm running a Tomcat server which needs to allocate large amounts (>1GB) of memory. For that, it would make sense to run Linux and Java in 64bit mode, to be able to allocate much more than 2GB (Java) or 3.5GB (Linux) of memory. On the other hand, there are a lot of rumours on the net that 64bit Linux uses twice as much memory or more. There is a good article, including measurements, explaining that 64bit java needs approx. 50% more memory for strings and integers.

Here are my results, tested on the same machine as virtual xen hosts, using debian-etch in i386 and amd64 mode. Both machines got 256MB of memory and no swap:

Action                       32bit                64bit
free after boot              219M                 217M
file-size c-prog             10M                  12M
data-allocation              max 217M             217M
free after data-allocation   224M                 224M
java -server                 max 208M @ Xmx214M   max 203M @ Xmx234M
java -client                 max 209M @ Xmx214M   n.a.


Both the C and the java program simply allocate some memory. The java program can be seen at the end of this blog. It tried to allocate 256 buffers of 1MB each, until it threw an OutOfMemoryError. The Xmx settings were raised just high enough that Linux didn't kill the JVM with 'Out of Memory'. It was astonishing to see that Xmx could be set higher on the 64bit jvm, while the maximum memory available within java was slightly lower.

The results are reassuring: 64bit Linux does not require much more memory than 32bit Linux, at least not for a server machine with only 64bit libraries installed and applications which don't use many pointers. Desktop machines may require much more if the 32bit libraries also need to be loaded for some 32bit-only programs. The biggest difference is code size (20%), but gcc is known to have different default options in 32bit and 64bit mode.

Example of code run to test data allocation:

public class Main {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: Main buffers size (in MB)");
            System.exit(1);
        }
        int buffers = Integer.parseInt(args[0]);
        int size = Integer.parseInt(args[1]) * 1024 * 1024 / 4; // ints per buffer
        java.util.Vector<int[]> store = new java.util.Vector<int[]>(buffers);
        for (int i = 0; i < buffers; i++) {
            store.add(i, new int[size]);
            System.err.println("allocated buffer " + i);
        }
    }
}