Things that amuse me

Tuesday, November 06, 2007

Benchmarking ray tracing, Haskell vs. OCaml

On his web site about OCaml Jon Harrop has a benchmark for a simple ray tracing program written in a number of languages. When I saw it I wondered why Haskell was doing so badly, especially since this benchmark was taken as some kind of proof that Haskell performs poorly in "real life". So I have rerun the benchmarks. I've also rewritten the Haskell versions of the programs. The Haskell versions on Jon's web site were OK, but they were too far from the OCaml versions for my taste. I prefer to keep the programs very similar in a situation like this. My rewrite of the benchmarks from OCaml to Haskell was done with a minimum of intelligence. Here are the only things I did that needed creative thought:

Use the type Vector to represent vectors instead of a tuple. This allows the components to be strict.
Use the type Scene instead of a tuple to represent a scene. The tuple used in the OCaml code uses the dubious feature of equi-recursive types (even Xavier thinks it's strange enough to have a flag to enable it).
Rewrite the loop that computes a pixel's value using an accumulating updatable variable into a list comprehension that sums the list.
Finally, the compiler flags needed a bit of tweaking to get good performance, even though "-O3 -funbox-strict-fields -fexcess-precision -optc-ffast-math" were pretty obvious.

In addition to this I made the code look a little more Haskellish, e.g., using overloading to allow + and - on vectors. This is really just minor syntactic changes, but makes the code more readable.

To make the program size comparison fair I removed some dead code from the OCaml code.

I then reran the benchmarks using Haskell, OCaml, and C++.

The benchmarks are five programs that starts from very simple ray tracing and specializing the program more and more to speed it up.

The numbers are in the tables below. The time is execution time in second, the characters are non-white characters in the file, and the lines are the number of lines in the file. To ease comparison I also include the relative numbers compared to OCaml (smaller numbers are better).

Interestingly, and unlike Jon's benchmark, the Haskell code is always smaller than the OCaml code. Furthermore, the Haskell code ranges from much faster to slightly faster than the OCaml code. Again, this is very unlike Jon's benchmark. I find the unoptimized version of the benchmark especially interesting since Haskell is 5 times(!) faster than OCaml. I've not investigated why, but I suspect laziness.

Results

The programs, ray1-ray5, are variations on the ray tracer as given on Jon's web site. I've used the same size metrics as Jon does.

Haskell: My Haskell code compiled with ghc 6.8.1
Haskell old: Jon's Haskell code, compiled with ghc 6.8.1
Haskell old 6.6: Jon's Haskell code, compiled with ghc 6.1.1
OCaml: Jon's OCaml code
C++: Jon's C++ code

Time: execution time is second
Char: number of non-white chracters in the program
Lines: number of lines in the program
Rel T: execution time relative to OCaml
Rel C: non-white characters relative to OCaml
Rel L: lines relative to OCaml
Mem: Maximum resident memory

ray1	Time	Chars	Lines	Rel T	Rel C	Rel L	Mem
Haskell	15.3	1275	51	0.202	0.990	1.020	5M
Haskell, old	15.8	1946	88	0.208	1.511	1.760	9M
Haskell, old 6.6	28.1	1946	88	0.370	1.511	1.760	9M
OCaml	75.9	1288	50	1.000	1.000	1.000	18M
C++	8.1	2633	122	0.106	2.044	2.440	8M

ray2	Time	Chars	Lines	Rel T	Rel C	Rel L	Mem
Haskell	11.5	1457	50	0.206	0.912	0.943	12M
Haskell, old	12.0	2173	99	0.215	1.360	1.868	35M
Haskell, old 6.6	21.1	2173	99	0.379	1.360	1.868	35M
OCaml	55.8	1598	53	1.000	1.000	1.000	15M
C++	6.1	3032	115	0.108	1.897	2.170	8M

ray3	Time	Chars	Lines	Rel T	Rel C	Rel L	Mem
Haskell	9.7	1794	62	0.970	0.919	0.939	12M
Haskell, old	11.1	2312	103	1.112	1.184	1.561	35M
Haskell, old 6.6	19.7	2312	103	1.984	1.184	1.561	35M
OCaml	10.0	1953	66	1.000	1.000	1.000	15M
C++	5.4	3306	143	0.545	1.693	2.167	8M

ray4	Time	Chars	Lines	Rel T	Rel C	Rel L	Mem
Haskell	8.5	1772	66	0.985	0.867	0.957	12M
Haskell, old	11.7	2387	110	1.360	1.168	1.594	36M
Haskell, old 6.6	19.2	2387	110	2.235	1.168	1.594	35M
OCaml	8.6	2043	69	1.000	1.000	1.000	11M
C++	5.0	3348	149	0.584	1.639	2.159	8M

ray5	Time	Chars	Lines	Rel T	Rel C	Rel L
Haskell	7.0	2246	95	0.999	0.878	0.950
OCaml	7.0	2559	100	1.000	1.000	1.000
C++	4.7	3579	142	0.674	1.399	1.420

The source code is available in a Darcs repository.

Software and hardware details

Hardware: MacBook, Intel Core Duo 2GHz, 2MB L2 Cache, 1GB 667MHz DRAM

Software:

Haskell compiler: ghc-6.8.1
OCaml compiler: 3.10.0
g++: gcc version 4.0.1 (Apple Computer, Inc. build 5367)

Compilation commands:

ghc: -O3 -fvia-C -funbox-strict-fields -optc-O3 -fexcess-precision -optc-ffast-math -funfolding-keeness-factor=10
OCaml: -rectypes -inline 100 -ffast-math -ccopt -O3
g++: -O3 -ffast-math

Target architecture is x86 (even though the processor is x86_64 capable).

Some observations

Haskell should really have the definitions of infinity and epsilon_float in a library. They are quite useful. Also, having them in a library would have made the Haskell code somewhat shorter and faster.

Converting these programs from OCaml to Haskell was very mechanical; it could almost be done with just sed.

I'm glad version 5 of the benchmark didn't show much improvement, because it's a really ugly rewrite. :)

Note that the code is all Haskell98, no strange extensions (even though -funbox-strict-fields deviates subtly from H98).

Conclusion

Benchmarking is tricky. I'm not sure why my and Jon's numbers are so different. Different hardware, slightly different programs, different software.

Haskell is doing just fine against OCaml on this benchmark; the Haskell programs are always smaller and faster.

Edit: Updated tables with more numbers.

PS: Phil Armstrong wrote the Haskell code on Jon's web site and I took some code from his original.

Labels: Benchmark, Haskell

20 Comments:

Adam Jones said...: One of my first thoughts when reading this is that you aren't just comparing implementations of the ray tracer, you are also comparing the compilers and (possibly) platforms involved. Harrop used GHC 6.6.1 and didn't list what kind of system it was run on.

Are your results due to the rewrites you have done, improvements to GHC, or somewhere in between? I'd say probably the last one, but it is hard to conclude for certain. It also is somewhat troubling that you are comparing your modified version of the Haskell and OCaml code. It might be worthwhile, at least for the sake of forstalling arguments, to run another benchmark against the original OCaml code and see how it fares on your system.

Either way, it is good to see that Haskell is more than competitive.; Tuesday, November 6, 2007 at 6:32:00 PM GMT
sigfpe said...: Great work!

Did you look at the memory footprint of the executables?

I've never really had too many worries about the speed of Haskell for ray-tracing, and a 2x drop in performance might be acceptable to many people if it makes code development easier. The thing I'm interested in is how performance degrades as you start stressing your machine out. How do you think the various languages will perform when you have, say, 2Gb of RAM, your scene takes up 1.9GB, and the renders take hours?; Tuesday, November 6, 2007 at 6:36:00 PM GMT
chessguy said...: Just wanted to say, I agree with Adam's post. Would be nice to see numbers from the original code in the same tables as the rest.; Tuesday, November 6, 2007 at 7:13:00 PM GMT
augustss said...: I'll try to run the original Haskell programs, and also look at memory foot print.; Tuesday, November 6, 2007 at 8:38:00 PM GMT
Don Stewart said...: You use -O3, however -O3 is clamped to be just -O2 in GHC 6.8, but in 6.6 and earlier, -O3 is closer to -Onot, if I recall (as it wasn't maintained, so only some optimisations are enabled). -O3 might therefore be worse than -O2 for the older compilers.

In the shootout we typically use only: -O2 -optc-O2 -optc-march=pentium4 -optc-ffast-math -funbox-strict-fields, so you might double check using -O2 -optc-O2; Tuesday, November 6, 2007 at 10:22:00 PM GMT
Unknown said...: On the website, it says Philip Armstrong wrote the Haskell code, not Jon Harrop.; Tuesday, November 6, 2007 at 10:46:00 PM GMT
chak said...: Great stuff. I think it shows again that (1) you need to get an expert of each language to get fair benchmarks and (2) we need more material on how to write efficient Haskell programs (on the other hand, that you should use datatypes with strict fields in such situations is described in GHC's User Guide).; Tuesday, November 6, 2007 at 11:43:00 PM GMT
Chris Pacejo said...: Quick note: the -ccopt -O3 passed to ocamlopt isn't actually doing anything; -ccopt is only used when compiling C extensions. O'Caml programs themselves are compiled directly to assembly.; Tuesday, November 6, 2007 at 11:45:00 PM GMT
Phil said...: Yeah, I wrote the Haskell version that appears on Jon's webpages. I don't pretend to be an expert Haskell programmer; indeed if you look back in the archives of the haskell-cafe mailing list you'll find a sizable thread on the topic of optimising my version: http://www.haskell.org/pipermail/haskell-cafe/2007-June/027235.html

It's nice to see that GHC 6.8 makes such a big difference to my original code.

From what I remember there are *significant* differences between the 64-bit and 32-bit x86 binaries produced by both the OCaml compiler and GHC, so you need to watch out for that when making performance comparisons (I think Jon is comparing 64-bit versions but ICBW).

As far as the code goes, I had considered making Vector a Num, but decided against it on personal taste grounds. I'm not a big fan of big list comprehensions either, but perhaps you guessed that...; Wednesday, November 7, 2007 at 2:08:00 PM GMT
augustss said...: Phil, I'm sorry I forgot to mention you as the original author. I stole code from you after all. :)

It's not clear to me if Jon is comparing 32 or 64 bit versons. If it's 64 bits it could account for the huge differences.; Wednesday, November 7, 2007 at 2:47:00 PM GMT
Phil said...: Apology accepted :)

I'll probably merge a few bits of yours into my version just for completeness' sake.

I think that Jon is reporting 64-bit timings IIRC.; Wednesday, November 7, 2007 at 4:08:00 PM GMT
Jon Harrop said...: Your results look similar to the old 32-bit results on my site (the ones in red), which is not surprising if you are still running your 64-bit CPU in legacy mode.

What results do you get in 64-bit mode?; Thursday, November 8, 2007 at 1:41:00 AM GMT
augustss said...: I don't know what I'd get in 64 bit mode because I don't know how to generate 64 bit binaries on my Mac.

But I know native code generation for 64 bits is not that good for ghc yet.

I'm not sure running in 64 bit mode is always the best (in any language). For this kind of benchmark we're after fastest execution after all, and if you get that in 32 bit mode, fine by me.

But I'll investigate 64 bit mode.; Thursday, November 8, 2007 at 1:50:00 AM GMT
Jon Harrop said...: In OCaml and C++, 64-bit is much faster than 32-bit for almost all tasks.

Looks like you need the latest version of Mac OS to get 64-bit. I'm quite surprised by that because Linux has been 64-bit for several years now. I had AMD64 Debian Linux on my old laptop, for example.

Anyway, our results agree for your legacy system. On my 64-bit machine, you Haskell is almost twice as fast as the current implementations and shorter to boot. That makes Haskell much more competitive!

So I shall update my pages with your new implementations (and attribution) ASAP. Many thanks.; Thursday, November 8, 2007 at 2:19:00 AM GMT
Unknown said...: Would you please explain how you picked the compilation flags?; Tuesday, November 13, 2007 at 3:51:00 AM GMT
zhzm said...: It's interesting how different figures might be on different platforms. I've just compiled and run ray5 on a 32-bit Windows box (core2duo 2GHz) with GHC 6.8.3 and OCaml 3.10.2 with flags from your post. I've got 29 seconds for Haskell and 9 seconds for OCaml.

Seems like your OCaml installation is screwed. Or did you run bytecode variant? ;); Thursday, January 8, 2009 at 11:29:00 AM GMT
Jon Harrop said...: We should parallelize this benchmark in both languages.; Thursday, January 8, 2009 at 8:35:00 PM GMT
Alp Mestan said...: And what about optimisations in the OCaml code ?

Has it been natively compiled ?
Is there some use of laziness etc in the OCaml code, as it is possible in OCaml too.
Etc...; Friday, January 9, 2009 at 6:25:00 PM GMT
Jon Harrop said...: @Alp Mestan

The OCaml code is optimized and natively compiled. Laziness is of little benefit in this benchmark. The latest code and results are available on our site although the discussion is out of date.; Friday, January 9, 2009 at 6:54:00 PM GMT
rei said...: Ah, Jon, the notorious troll. Amusing to see him trolling Haskell before he switched to F# and added Caml to his hit list.; Thursday, June 13, 2013 at 1:23:00 AM GMT+1