I’ve been working on a control-flow analysis for Scheme 48 for a while now. In this analysis, functions represented as maps are omnipresent: variable environments map bindings to abstract values, binding environment map this to that, and so on. It’s quite easy to represent these functions as association lists, which is how they are represented now. Well, obviously a kind of hash table would be much faster.

However, most of the domains of these functions are actually represented by complex Scheme record types which makes it hard to come up with a reasonable hash function. Actually, in many cases it would be the best solution to use the memory address of the Scheme record as the hash value: Many of the association lists use eq? as the comparison operator anyways. However, you can’t do that in Scheme: There is no way to get the memory address of an object. Which makes sense: The Scheme implementation is free to move the program values around at any time. And implementations with a moving garbage collector do this all the time. Hence, a hash table that uses the address as a key must be rehashed after each garbage collection. Some implementations such as PLT Scheme offer such functionality and call such tables eq?-hash tables.

During the last weeks I started to implement eq?-hash tables (which I call id-tables) for Scheme 48. That just seemed to be a good idea. Here is how they work. I have added a new “stob type” (think of it as a record type defined by the VM) for id-tables to the VM and added new instructions for adding, searching, and deleting items in that table. An id-table consists of two fields: A counter field that says how many values are currently stored in the table and a vector which is the actual hash table. The entries of the vector are a lists of pairs. Each pair consists of the object used as the key and the object used as the value for this key. It’s necessary to remember the key object since the hash function (the address of the key object modulo the vector size) may map more than one object to the same vector index. Well, this is just the usual simple representation of a hash table.

Things get interesting when a garbage collection occurs: Scheme 48’s two space copier moves the key objects (and all other objects) two the new heap space, hence the hash key of the object changes. The id-table and it’s content is treated like any other Scheme object. Hence, after the garbage collection there is a copy of the id-table in the new heap space. All addresses in the table’s vector are updated to the corresponding new addresses. That’s fine, but in general the objects are now at the wrong index: We need to rehash. Rehashing is quite easy: Traverse all entries of the vector, re-calculate the hash-key for each key object and either move the entry to a new index in the vector or leave it untouched (in case the hash functions maps the old and new address to the same index). All this is done in place: The rehashing modifies the copy of the hash table instead of copying the contents to a fresh vector. The latter would probably be a little bit easier and faster, but harder to implement: rehashing happens just after the garbage collection and it’s a bad idea to allocate new heap space in this situation (Actually, these post-gc cleanup hooks are part of the GC process). However, the in-place rehashing has the disadvantage that the algorithm actually might re-calculate the hash function for each object twice (worst case). I think that’s a fair trade-off, calculating the hash function should be very fast.

A diffrent solution would be to treat the id-tables specially during the garbage collection. But I found that complicated: The objects in the table don’t have their new addresses yet and doing the whole copying and so on is a tedious business. The advantage would be that the id-tables are easy to find: The GC algorithm traverses the heap and this way just stumbles over the id-tables. The post-gc rehashing mechanism I have implemented has to involve some extra machinery to find the id-tables: The constructor for id-tables registers the freshly created id-table in a global list. The rehashing procedure uses this list to find all id-tables.

I should mention that at the moment I have only considered the two space copy collector of Scheme 48. So, the id-tables won’t work with the upcoming new generational collector yet. Actually, id-tables is the second approach. I started with an implementation of stable names, but I gave up this approach. Various reasons, might tell later.

Technorati Tags: , , , ,