the informal ramblings of a formal language researcher

Wednesday, February 01, 2006

on parallelizing a garbage collector

So here are the lessons I learned from attempting to parallelize a stop-and-copy garbage collector, as implemented in the Larceny runtime, using Cilk's primitives as the basis for the parallelization.

At a high level, based on Flood's paper on parallel collection on SMP machines, I decided to start with a naive recursive stop-and-copy collection algorithm, and parallelize that using Cilk's spawn and sync primitives. (The Flood paper points out the utility of a work-stealing approach to load-balancing, which the Cilk runtime gives you "for free.")

Of course, the devil is in the details. First of all, all I had to start with was an kernel collector that uses Cheney's algorithm, which is not trivial to parallelize (as far as I can tell). So I had to develop a recursive algorithm that behaved in a manner compatible with the behavior of the Cheney collector (this is not as trivial as it sounds; the devil's in the details).

Even after I developed the recursive algorithm, I discovered a new set back: the Cilk language, as currently documented, has a restriction that you can only call Cilk functions from other Cilk functions. That is, Cilk functions can call out to C functions, but not vice versa.

This immediately leads to a huge problem for Larceny, because the Scheme code is meant to be able to call into the garbage collector. So that means that the Scheme code needs to be output as Cilk code, not C code. And in fact, the entire runtime and compiler system needs to be shifted to use Cilk instead of C. ACK!

Luckily, the very newest version of Cilk has a beta interface for calling Cilk functions from C code. Unluckily, the interface is indeed beta, and there were some bugs that were showstoppers for me (in hindsight, there were workarounds for the bugs, but that's neither here nor there).

So, here's the important lessons from the project:

  1. Reverse engineering is hard. Even when you're going from a complex algorithm to a simple one, there's always details in "real systems" that get overlooked. (For me, it was particular invariants in terms of the Cheney algorithm deliberately not traversing certain structures at certain times, and it was overloading the generation bits to control this, even though it was still relevant even for non-generational collection.

  2. If you want your language extension/variant A to interoperate with language B, it needs to go in both directions. You may think it will suffice to have one layered on top of the other, but in the end, you will be wrong, and your users will hate you. You're better off designing that interoperability in from the beginning. (This is relevant for languages that seek to interoperate with the CLR or the JVM, in that your extension language should define classes that themselves can be invoked or extended by terms in C# or Java, respectively.

Followers