[Gcl-devel] Omnibus gcl/acl2 performance post


From: Camm Maguire
Subject: [Gcl-devel] Omnibus gcl/acl2 performance post
Date: Tue, 01 Oct 2013 15:04:27 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

Greetings!  And thank you so much for this very helpful report!

This post is long, so feel free to skip over any uninteresting detail.
The short news is, 2.6.10pre is now even with or better than 2.6.8.

=============================================================================
In the analysis below, several regressions have been run collecting
the following statistics:

i=$(find books -name "*.cert.out");
for j in $i; do 
  awk '/seconds runtime/ {k=$(NF-2);} 
       /\[S?GC for/ {i++;a=gensub(".*T=([0-9]*).*","\\1","g")+0;j+=a}
       /\[SGC on/ {l++} 
       END {print k,i,j,l,m}' m=$j $j;done | sort -n

giving the runtime seconds, the number of gc calls, the amount of gc
time, and the number of sgc on/off calls, which can be compared using
(e.g. files r2.6.10pre vs r2.6.8)

join <(awk '{print $5,$1,$2,$3,$4}' r2.6.10pre|sort) \
     <(awk '{print $5,$1,$2,$3,$4}' r2.6.8|sort) | awk '{print $2-$6,$0}' | sort -n

in turn giving a report like

-188.79 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 554.34 141 3726 3
-127.23 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 1039.93 198 12635 4
-84.02 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 20 210 2 151.25 113 3797 4
...
16.2 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 171.22 44 817 2
16.56 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 486.32 50 1378 2 469.76 103 2498 3
20.79 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 441.88 80 1868 3
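
A report like this can be reduced to a single net figure by summing its first column. The following is only a minimal sketch: it assumes the join output has been saved with one record per line, and the function name net_runtime_delta is hypothetical.

```shell
# Sum the per-book runtime differences (column 1) of a comparison report.
# Assumes one record per line, as produced by the join/awk pipeline above.
# The function name is illustrative only.
net_runtime_delta() {
  awk '{ total += $1; n++ }
       END { printf "%d books, net %.2f seconds\n", n, total }' "$1"
}
```

A negative total means the first file named in the join (here r2.6.10pre) spent less runtime overall than the second.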

Perhaps your knowledge of the various books and their algorithms might
shed light as to why certain factors produce these particular results
in the discussion below.

I'll be placing the results at http://people.debian.org/~camm/acl2/
with hopefully self-explanatory names.
=============================================================================


Performance issues:
=======================================

1) beginning relblock allocation

Issue: when gcl saves an image, it eliminates the hole and minimizes
the relblock space to make the disk image smaller.  All gc statistics
used in balancing the heap and reducing overall gc time
(e.g. si::*optimize-maximum-pages*) are reset to 0.  This seems to be
appropriate, as the type of calculation involved in building an image
is not necessarily representative of that done in using it.  So on
startup, one has somewhat of a 'shrink wrapped' image.  It takes some
time for gcl to gather new statistics and expand the heap from this
state.  The idea is to expand each page type so that its size vis a
vis the rate of allocation is equivalent to the others.  Thus, if for
some reason one starts with a larger allocation of a given type by
fiat, the gc will be triggered by the other types, and they will scale
according to the fiat type.  As 2.6.8 did not shrink wrap the
relblock, it starts with a very large allocation, causing the heap to
grow more quickly and save gc time relative to 2.6.10.

Current status: At startup, 2.6.10 now scales the new hole size as
a set fraction of the number of (dynamically determined) available
pages, and the relblock size as a multiple of the non-relocatable
heap.  Both of these operations cost nothing, but are somewhat
ad-hoc.  The functions #'si::set-starting-hole-divisor and
#'si::set-starting-relblock-heap-multiple are provided to tune if
necessary the defaults of 10 and 2.
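
As a rough illustration of the startup arithmetic just described, the sketch below models the hole as the available pages divided by the divisor, and the relblock as the non-relocatable heap times the multiple. This is an assumed reading of "set fraction" and "multiple", not gcl code; the function name and the page counts are made up, and only the defaults of 10 and 2 come from the text above.

```shell
# Hypothetical model of the startup sizing; not taken from the gcl source.
starting_sizes() {
  available_pages=$1         # dynamically determined available pages (made-up input)
  heap_pages=$2              # non-relocatable heap pages (made-up input)
  hole_divisor=${3:-10}      # default divisor named in the post
  relblock_multiple=${4:-2}  # default multiple named in the post
  echo "hole=$(( available_pages / hole_divisor )) relblock=$(( heap_pages * relblock_multiple ))"
}

starting_sizes 500000 20000
```

With these made-up inputs the model yields a hole of 50000 pages and a relblock of 40000 pages; a smaller divisor or a larger multiple would start the image with more room before the first gc.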

2) cons size

Issue: 2.6.9 and forward defaults to a two word cons, whereas 2.6.8
has a three word cons, the first word being a type word.  This is to
save space and restore the build on more limited 32bit machines, which
has now been achieved. (acl2 6.2 is now in Debian testing).  It should
also help with memory bandwidth.  There is however an extra branch
required in typing an object, and more if immediate fixnums are in
force.  In principle, this should easily be dominated by the cost of
referencing the pointer, but this should be tested.

Current status: we have an extra configure switch

--enable-widecons ; which defaults to "no".  

Thankfully widecons is a net loss of some 7 min, with a profile like
(r2.6.10pre vs r2.6.10pre.widecons):

-44.1 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 956.80 114 7651 2
-15.9 books/centaur/vl/transforms/xf-sizing.cert.out 134.27 36 790 2 150.17 37 880 2
-11.77 books/centaur/regression/common.cert.out 177.72 66 2074 2 189.49 51 1516 2
...
15.44 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 447.23 52 1537 2
17.25 books/workshops/2004/legato/support/generic-theory-tail-recursion-mult.cert.out 194.77 34 624 2 177.52 31 566 2
33.19 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 486.32 50 1378 2 453.13 56 1483 2


3) immediate fixnums

Issue: 2.6.9 and forward support immediate fixnums, the rationale
having been to lower memory requirements as described above.  These
provide faster inlined arithmetic and comparisons, though require
extra branching when typing an object.  When a two word cons and
immediate fixnums are both present, up to four branches can be
required in a typing, as immediate fixnums can now appear in the cdr
of the cons inspected to determine the type.  We should see if this is
a win too.

Current status: We now have the following configure switches

--enable-immfix ; defaults to yes, can be disabled with  --disable-immfix

--enable-fastimmfix=xx ; try to get at least an xx bit wide fixnum table
                       ; centered on the NULL address, which has no boxing cost
                       ; default is 64, meaning use high memory
                       ; immediate fixnums requiring arithmetic to box.
   
--enable-safecdr       ; do not place immediate fixnums in cdr, but
                       ; boxed versions instead, and speed up typing
                       ; accordingly.  Defaults to "no"
--enable-safecdrdbg    ; debug the above algorithm and error on failure

 
The dominant fixnum cost is allocation, and this has already been
virtually eliminated in acl2 via its use of
#'allocate-bigger-fixnum-range.  So for acl2, it's really just the
typing cost vs. the arithmetic acceleration and saving of a pointer
dereference. 

--disable-immfix is a slight net gain (30 sec), with a profile like
(r2.6.10pre vs r2.6.10pre.no-immfix):

-107.32 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 20 210 2 174.55 87 3042 2
-10.62 books/unicode/utf8-decode.cert.out 116.08 26 325 2 126.70 24 316 2
-9.86 books/centaur/tutorial/intro.cert.out 159.80 68 1905 2 169.66 60 1793 2
...
12.08 books/workshops/1999/ste/inference.cert.out 149.30 20 287 2 137.22 24 294 2
13.54 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 173.88 27 463 2
16.07 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 349.48 88 1950 2

I anticipate this is not likely to hold up when other immediate fixnum
accelerations are backported from master, e.g. accelerating eql. 

Several of these experiments are still running.  I'll post more if desired.

4) general gc:

Several inner gc loops have been optimized, significantly speeding up
contiguous gc in particular, and to a lesser extent, relocatable.  All
contiguous gc used to also collect relocatable by default -- this is
now separated for efficiency.  Traditionally, (si::gbc t) collected
everything, (si::gbc 1) relocatable, and (si::gbc nil) cells only.
This is still the case, with (si::gbc 0) added to collect contiguous
only.   

echo '(time (progn (setq si::*optimize-maximum-pages* nil)(dotimes (i 1000) (si::gbc ???))))' \
  | ./saved_ansi_gcl | grep "^gbc time"

???    2.6.8  2.6.10pre
===    =====  =========
t      9.500  7.510
1      7.250  6.980
0      -----  6.699
nil    7.210  6.050
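
The rows of this table can be regenerated by substituting each argument for ??? in turn. The following is a minimal sketch of such a loop: the helper names gbc_form and run_gbc_timings are hypothetical, the image path is passed in by the caller (e.g. ./saved_ansi_gcl), and the Lisp form is the one quoted above.

```shell
# Build the timing form for one si::gbc argument (form text taken from the post).
gbc_form() {
  printf '(time (progn (setq si::*optimize-maximum-pages* nil)(dotimes (i 1000) (si::gbc %s))))\n' "$1"
}

# Feed each variant to the given image and keep only the "gbc time" line.
# Prints ----- when no such line appears, e.g. if the image rejects the argument.
run_gbc_timings() {
  image=$1
  for arg in t 1 0 nil; do
    printf '%s\t' "$arg"
    gbc_form "$arg" | "$image" 2>/dev/null | grep '^gbc time' || echo '-----'
  done
}
```

Calling run_gbc_timings ./saved_ansi_gcl once per build gives one column of the table per run.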

5) contblock allocation

Some sgc bit logic has changed, particularly regarding contiguous
blocks, making the large contiguous allocation I recommended earlier
now possibly obsolete.  I have not tested this, and leaving it in
should cause minimal harm.

When I get around to it, I really need to implement a trampoline at
the end of .data for all pc32 relocations, rendering the whole issue
moot.  This will likely complicate many platforms, however.

Finally, I've used your save-gprof.lsp file, and have found one bug:

diff -u save-gprof.lsp.ori save-gprof.lsp
--- save-gprof.lsp.ori  2013-10-01 13:56:55.739771000 -0500
+++ save-gprof.lsp      2013-10-01 13:57:15.712921000 -0500
@@ -302,7 +302,7 @@
 
                                     `(setq ,sym-initial-cbd
                                            (,sym-pathname-os-to-unix
-                                            (namestring (truename ""))
+                                            (namestring (truename "")) :unix
                                             ,sym-state))
                                     `(,sym-f-put-global
                                       ',sym-cbd


Take care,
-- 
Camm Maguire                                        address@hidden
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah


