Wednesday, January 21, 2009

A performance update

I've continued playing with the LLVM. I discovered that when generating code for the normcdf and Black-Scholes functions I did not tell the LLVM that the called functions (exp etc.) are actually pure. That meant that the LLVM couldn't perform common subexpression elimination (CSE) properly.

So here are some updated numbers for computing option prices for 10,000,000 options:

  • Pure Haskell: 8.7s
  • LLVM: 2.0s
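
For reference, the pricer being benchmarked is Black-Scholes; the actual benchmark code is in the previous post, but as a rough sketch in the same overloaded style, a call price in terms of the normcdf shown below looks something like this:

-- Sketch only; the real benchmark code is in the previous post and may differ in detail.
blackScholesCall s x t r v = s * normcdf d1 - x * exp (-r*t) * normcdf d2
  where d1 = (log (s/x) + (r + v*v/2) * t) / (v * sqrt t)
        d2 = d1 - v * sqrt t
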
Just as a reminder, the code for normcdf looks like this:
normcdf x = x %< 0 ?? (1 - w, w)
  where w = 1.0 - 1.0 / sqrt (2.0 * pi) * exp(-l*l / 2.0) * poly k
        k = 1.0 / (1.0 + 0.2316419 * l)
        l = abs x
        poly = horner coeff 
        coeff = [0.0,0.31938153,-0.356563782,1.781477937,-1.821255978,1.330274429] 
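The horner function is not shown above; it is just Horner's rule applied to a coefficient list given lowest order first, so for the pure Haskell version something like this will do:

horner :: Num a => [a] -> a -> a
horner cs x = foldr (\c acc -> c + x * acc) 0 cs
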
A noteworthy thing is that exactly the same code can be used both for the pure Haskell and the LLVM code generation; it's just a matter of overloading.
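
To make the pure Haskell side concrete, the comparison and selection operators can be as simple as the definitions below. In the real code these operators are overloaded (presumably as class methods), so the very same normcdf source can also elaborate into LLVM code; the definitions here are only an illustrative pure-Haskell sketch, not the bindings' actual API.

infix  4 %<
infixr 3 ??

-- Pure Haskell versions: an ordinary comparison and a plain if-then-else selection.
(%<) :: Ord a => a -> a -> Bool
x %< y = x < y

(??) :: Bool -> (a, a) -> a
b ?? (t, e) = if b then t else e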

Vectors

A very cool aspect of the LLVM is that it has vector instructions. On the x86 these translate into using the SSE extensions of the processor and can speed up computations by doing things in parallel.

Again, by using overloading, the exact same code can be used to compute over vectors of numbers as with individual numbers.

So what about performance? I used four-element vectors of 32-bit floating point numbers and got these results:

  • Pure Haskell: 8.7s
  • LLVM, scalar: 2.0s
  • LLVM, vector: 1.1s
  • C, gcc -O3: 2.5s
Some caveats if you try out vectors in the LLVM:
  • Only on Mac OS X does the LLVM package give you fast primitive functions, because that's the only platform that seems to provide vectorized versions of them as standard.
  • The vector version of floating point comparison has not been implemented in the LLVM yet.
  • Do not use two-element vectors of 32-bit floats. This will generate code that is wrong on the x86. I sent in a bug report about this, but was told that it is a feature and not a bug. (I kid you not.) To make the code right you have to manually insert EMMS instructions.
  • The GHC FFI is broken for all operations that allocate memory for a Storable, e.g., alloca, with, withArray, etc. These operations do not take the alignment into account when allocating. This means that, e.g., a vector of four floats may end up on 8-byte alignment instead of 16-byte, which generates a segfault. (A possible workaround is sketched right after this list.)
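
For the last point, until the FFI is fixed one workaround is to over-allocate and align the pointer by hand. A rough sketch follows; withAligned is a hypothetical helper, not part of the bindings, and it assumes the vector type's Storable instance reports the right alignment (e.g. 16 for a four-float vector):

import Control.Exception (bracket)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Ptr (Ptr, alignPtr, castPtr)
import Foreign.Storable (Storable, alignment, poke, sizeOf)

-- Like 'with', but allocates a little extra and aligns the pointer manually,
-- so that e.g. a vector of four floats really ends up on a 16-byte boundary.
withAligned :: Storable a => a -> (Ptr a -> IO b) -> IO b
withAligned v act =
    bracket (mallocBytes bytes :: IO (Ptr Word8)) free $ \raw -> do
        let p = castPtr (alignPtr raw (alignment v))
        poke p v
        act p
  where
    bytes = sizeOf v + alignment v - 1
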
On the whole, I'm pretty happy with the LLVM performance now; about 8 times faster than ghc on this example.

[Edit:] Added point about broken FFI.

12 comments:

  1. Haskell + LLVM is really nice!

    I wonder what LLVM could do to the old raytracing benchmark you wrote about. At the least, it would allow the raytracer to read the model at runtime and still be fast.

  2. Yeah, unfortunately MMX/3DNow!'s abuse of the FPU register space really does make the EMMS stuff a manual requirement.

    It's complicated by the fact that the x86 and x64 ABIs default to calling functions in FPU rather than MMX mode, but that's implicit in signatures, etc. There is no type system in place to enforce the use of the correct FPU mode.

    I guess it really is a feature, unless you want to limit access to certain intrinsics in the LLVM bindings with some phantom type parameter indicating known FPU state.

  3. I've been messing around with auto-vectorisation in GCC for the past week now and I thought I should let you know that (I think) you haven't enabled it properly.

    Although -O3 tells GCC (version >= 4.3 I believe) to attempt to vectorise simple loops, it has no idea if your CPU instruction set actually supports SIMD (e.g. MMX, SSE, SSE2), and so ultimately chooses to do nothing. Moreover, unless you explicitly set "-mfpmath=sse", GCC will not vectorise floating point instructions (unless you're on a 64-bit platform, in which case this option is the default).

    Hence, grab the latest beta of GCC from http://hpc.sourceforge.net/ (right now at 4.4.0) and give these flags a shot:

    gcc -O3 -march=core2 -mfpmath=sse -mmmx -msse -msse2 -msse3

    where "core2" matches the architecture of your platform. Look through the man page of GCC to find out what that should be for you.

    Of course, by adding "-fprofile-generate" to that flag list, running the app, then re-compiling with "-fprofile-use", you could use profile guided optimisation to further improve performance but that's beyond the scope of discussion.

    Just to emphasise how much of a difference this makes (i.e. between a basic "-O3" and "-O3 + _goodness_"), I've doubled the speed of a Mersenne Twister implementation I'm using from here:

    http://www-personal.umich.edu/~wagnerr/MersenneTwister.html

    Hope this helps, cheers.
    Asim

  4. I may be wrong about -O3 implying -ftree-vectorize, as the man pages don't specify this. Please be safe and explicitly set it, making the new GCC flag-set:

    gcc -O3 -ftree-vectorize -ftree-vectorizer-verbose=7 -march=core2 -mfpmath=sse -mmmx -msse -msse2 -msse3

    The verbose flag set to 7 spits out god awful amounts of information about what GCC can and cannot vectorize and for what reasons.

  5. Edward: For functions with internal linkage the compiler has full knowledge and should be able to do it both right and fast. For other functions, I'd rather have right than fast.

  6. Asymptote: I have gcc 4.0.1, and I'm not going to jump through hoops to get a later one. For that compiler, your extra flags made no difference.

    When I release the next version of the bindings I'll include the benchmarking code, and anyone can try it.

    BTW, to get good speedups you have to have vector versions of log and exp.

  7. augustss: I have no idea if GCC generates vectorised versions of log and exp, so it'll probably make no difference. I also have no idea when the auto-vectorization code made it into the stable GCC releases, but it's pretty primitive so far; I've read that Intel's compiler is far superior.

  8. Asymptote: gcc did not vectorize; at first it was because the function calls were not inlined. When I forced them to be inlined, gcc complained that there were too many basic blocks in the loop.

  9. augustss: I'm getting that error a lot too. I think the auto-vectorisation code is still in its early stages in GCC and can't handle more than the simplest scenarios.

    By the way, after doing a few searches online it seems that people have been doing some work using inline assembly to write SSE2 implementations of log and exp, like here:

    http://gruntthepeon.free.fr/ssemath/

    If LLVM does this automatically without all this hassle, that's amazing, right?

    Also, I've just realised that Intel's C++ compiler is free for non-commercial use, so I'm curious to see just how it compares with GCC and LLVM. We'll see.

  10. Asymptote: Yes, I found that French code too, but it was easier for me to use vecLib on the Mac.

    LLVM does no work to speed up calls to log, exp, etc. It's my Haskell layer on top of it that calls specialized vector functions when they are available.

  11. I've searched around, but have not found any info on precisely the call required to hint that an extern function in LLVM can be thought of as pure.
