chicken-hackers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Chicken-hackers] [PATCH] [5] Simplify weak symbol GC and make it the de


From: Peter Bex
Subject: [Chicken-hackers] [PATCH] [5] Simplify weak symbol GC and make it the default
Date: Sat, 3 Sep 2016 20:39:52 +0200
User-agent: Mutt/1.5.23 (2014-03-12)

Hi all,

Now that #1173 has been fixed, I would like to simplify the garbage
collection code dealing with the symbols.

This mail is a bit long, so I'm adding sections to make it easier
to digest.


== Problems with the current symbol GC

The current code is quite tricky and hard to understand:

- It uses a fixed-size "open coded" (if that's the term) hash table,
   which means sometimes a symbol won't end up in the table at all
   due to too many collisions.  This also means memory use is unbounded,
   more or less: symbols not in the table are always kept.
- There's some randomisation involved, which makes bugs hard to track,
   because of the fixed table size: on some runs, a symbol will end up
   in the table, making it eligible for collection, on other runs it
   won't because the random factor has changed.
- The code uses type punning to squeeze a counter into the low bits
   of a pointer.  This is extremely clever, but also tricky.
- Because the ordering of symbols and their buckets is indeterminate,
   some symbols will be collected differently than others, making it
   yet harder to understand: the symbol is only entered in the table
   when its bucket is encountered, but if the symbol itself is
   encoutered while it's in the table, its counter will be incremented.
   If the symbol is encountered first, it won't increment any counter
   but when we see the bucket, it will saturate the counter (WTF!)
   because the symbol is already forwarded.  I spent a lot of time
   thinking the bug for #1173 was here.
- The forwarding pointer resolution code is very tricky, especially
   now with the fix for #1173.

Besides all these things, the fixed table size will mean only up to
997 symbols can be collected on each major GC.  So, there's a slowdown
if there are many symbols.  It will needlessly copy unreferenced symbols
around during GC.


== First patch: simplify GC

The attached patch (0001) replaces the weak table with a simpler system:

We now keep track of whether a symbol is supposed to be kept around, by
making "set!" (or "define") and the plist manipulation procedures smarter.
When a symbol has a global definition or a non-empty plist, it *must*
stick around, so we change its bucket, to indicate this!

Symbols that are disposable will be stored in a "weak" symbol table
bucket.  Symbols that must stick around will be stored in a "strong"
bucket.  This way, we can rely on the standard tracing of our GC:

For a weak bucket, the GC doesn't mark its symbol.  So, the symbol can
be collected unless there's another reference to it.  For a strong
bucket, we simply mark its symbol, so it will stay.


At the end of a major or reallocating GC, we walk the symbol tables, and
update all weak bucket symbol pointers by dereferencing forwarding
pointers.  If we see a weak bucket with a symbol that wasn't copied to
the target heap area, that symbol was collected, and the bucket can be
removed from the chain.  Statically allocated symbols are always kept.


There's one weird part about the implementation: it sets the bucket's
header for weak symbols (C_BUCKET_TYPE | C_SPECIALBLOCK_BIT) in order
to have the GC skip the first slot (which is the symbol).  Ideally,
I'd like to use another bit to indicate that it's a weak reference, but
we're all out of special bits!


I also updated the symbol GC test: it now has zero tolerance for symbols
that are kept around.  This works, because symbol tracking is now precise.
The test is fully enabled again and can fail the entire test suite,
because there's no hash table randomisation factor anymore that can screw
things up.


== Second patch: making symbol GC optional

The second patch is very straightforward: it simply removes the -:w
option, making symbol GC the default. There's a diff hunk for the
update_symbol_tables() function, but that's reindentation due to a
removed if() check.

There is no reason why we should let symbol GC be optional.  The Ruby
community has also had a similar transition to enable GC of symbols,
mostly because an attacker could initiate a resource consumption attack
by generating lots and lots of large symbols, which would never be
cleaned up, eating more and more memory.  But regardless of that, the
overhead of copying unused symbols is unnecessary, so GCing the symbols
should almost always be faster.

To check this, I ran the benchmarks again.  See the attached
"benchmark.diff", which shows the difference between plain CHICKEN 5 [1],
CHICKEN 5 with -:w [2] and CHICKEN 5 with these patches [3].
As you can see, [3] is fastest in most situations.  Especially the
"knucleotide" benchmark really benefits from this patch, as it generates
a lot of symbols.  It also uses less memory with the patch than without
(8.4 MB versus 5.7 MB).  Overall, in my benchmark runs, I saw a difference
of a minute less spent in major GC time (over 10 minutes in total).


I hope this all makes sense.  If you need more info about why and how,
let me know and I'll try to explain.  The patches themselves also have
extensive commit messages to explain what they do, and I've tried to add
comments here and there to clarify some more.

Cheers,
Peter

Attachment: 0001-Simplify-and-improve-reclaimability-of-symbol-GC.patch
Description: Text Data

Attachment: 0002-Make-weak-symbol-GC-the-default.patch
Description: Text Data

Attachment: benchmark.diff
Description: Text Data

Attachment: signature.asc
Description: Digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]