gcl-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gcl-devel] fast-fixnums branch


From: Matt Kaufmann
Subject: Re: [Gcl-devel] fast-fixnums branch
Date: Fri, 11 Oct 2013 06:50:39 -0500

Hi, Camm --

That looks promising!

>> I'm considering merging this into 2.6.10pre.

I didn't notice any mention of a downside, in which case I imagine
you'll do this.  If you do, and you want me to test the result with
the development copy of ACL2 (and latest books), let me know.  (Or you
can obtain all that here: http://acl2-devel.googlecode.com/)

-- Matt
   From: Camm Maguire <address@hidden>
   Cc: address@hidden
   Date: Thu, 10 Oct 2013 20:45:32 -0400

   Greetings!

   Just a quick note here about the new branck off of 2.6.10pre --
   fast-fixnums.  The idea was that as long as we backport immediate
   fixnums for storage savings, we might as well use them to speed up
   arithmetic. 

   This branch narrows the gap with ccl to less than 4% in my tests.
   Also almost 20% gain on the maxima testsuite.

   I'm considering merging this into 2.6.10pre.

   If anyone is interested I can share my little becnhmark lisp code and
   results against ccl, sbcl and clisp.  The functions optimized are:

   LOGEQV LOGCOUNT LDB DPB ABS LOGTEST LOGORC2 LOGNOR LOGORC1 LOGNAND
   LOGANDC1 LOGANDC2 SIGNUM BOOLE MIN LOGNOT MAX FLOOR CEILING
   DEPOSIT-FIELD ODDP EQUALP GCD INTEGER-LENGTH LCM TRUNCATE LDB-TEST REM
   MOD LOGXOR LOGBITP + = = LOGAND < >= LOGIOR > ASH - <= PLUSP MINUSP
   EQUAL EQ ZEROP EQL *

   Turns out that equal on lists needs to be very fast.

   bignum performance still needs work.

   This stuff suggests a possible paradigm for inlining decisions -- always
   tricky.  Perhaps code that can be done in register should be inlined,
   then call when we need a pointer dereference.  

   Take care,

   Matt Kaufmann <address@hidden> writes:

   > Hi, Camm --
   >
   > Congratulations!  I did timings using ACL2 Version 6.3 (which was
   > released just this morning), and here are the results, which indeed
   > show some speed-up.
   >
   > ; GCL 2.6.8 CLtL1
   > 30144.603u 1176.089s 1:19:27.64 656.9%     0+0k 484552+4273744io 381pf+0w
   >
   > ; GCL 2.6.10pre CLtL1
   > ; (from today: git commit 7e321ad46dbb96449960e5c4cd6605b2af3a971c)
   > 29950.747u 1029.940s 1:18:17.05 659.5%     0+0k 155464+6549560io 5pf+0w
   >
   > I've included both of the above results in the timing results for
   > various Lisps, here:
   >
   > http://www.cs.utexas.edu/users/moore/acl2/v6-3/new.html#performance
   >
   > Mostly I don't have much insight from your extensive data to add
   > beyond what you are seeing.  But I might have a useful comment or two.
   > For starters, consider these two lines from your message:
   >
   > -188.79 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 
554.34 141 3726 3
   > -127.23 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 
1039.93 198 12635 4
   >
   > I think this says that 2.6.10pre is 188.79 seconds faster than 2.6.8
   > for the first book (188.79 is (- 554.34 365.55)) and 44.1 seconds
   > faster for the second.  I suspect that these proofs generate rather
   > large amounts of "output", which we're not seeing but which is being
   > stored by ACL2's "gag-mode" feature for potential future replay (more
   > on that below).  So maybe there's an unusually large amount of
   > garbage, and your improvements for garbage collection and/or reduction
   > of cons words are helping.  (The figure of -44.1 for "r2.6.10pre vs
   > r2.6.10pre.widecons" on elementary-bounders might back that up, unless
   > I'm reading it backwards!)
   >
   > (Details about the "gag-mode" issue above: I plan to fix that problem
   > in the future, so that storing that output doesn't happen when proof
   > output is inhibited, as is the case during book certification.
   > Actually, I made such a fix just to the above elementary-bounders book
   > so that I could certify it on my Mac, using 2.6.8 with its limited
   > maxpages.  But your tests probably don't use that fix.  When I install
   > a fix in the ACL2 sources I'll send you a patch to play with, if you
   > ask me to do so.)
   >
   > Here's another that I can probably explain:
   >
   > -107.32 books/workshops/2013/greve-slind/defung/defung-test.cert.out 67.23 
20 210 2 174.55 87 3042 2
   >
   > I think that one makes VERY heavy use of fixnums, especially the form
   > (def::ung rectest (x) ...), without declarations or such to help.
   >
   > Finally, thanks for the fix to save-gprof.lsp, which I haven't looked
   > at for many years.  I'm pretty sure that instead of adding the
   > argument :unix, the added argument should be the name of the current
   > os, which in the ACL2 package is (os (w *the-live-state*)).  If you
   > think I'm wrong, please let me know.  Note that this expression does
   > indeed evaluate to :unix on a linux machine.
   >
   > Thanks --
   > -- Matt
   >    From: Camm Maguire <address@hidden>
   >    Cc: address@hidden
   >    Date: Tue, 01 Oct 2013 15:04:27 -0400
   >
   >    Greetings!  And thank you so much for this very helpful report!
   >
   >    This post is long, so feel free to skip over any uninteresting detail.
   >    The short news is, 2.6.10pre is now even or better with 2.6.8.
   >
   >    
=============================================================================
   >    In the analysis below, several regressions have been run collecting
   >    the following statistics:
   >
   >    i=$(find books -name "*.cert.out");
   >    for j in $i; do 
   >      awk '/seconds runtime/ {k=$(NF-2);} 
   >      /\[S?GC for/ {i++;a=gensub(".*T=([0-9]*).*","\\1","g")+0;j+=a}
   >      /\[SGC on/ {l++} 
   >      END {print k,i,j,l,m}' m=$j $j;done | sort -n
   >
   >    giving the runtime seconds, the number of gc calls, the amount of gc
   >    time, and the number of sgc on/off calls, which can be compared using
   >    (e.g. files r2.6.10pre vs r2.6.8)
   >
   >    join <(awk '{print $5,$1,$2,$3,$4}' r2.6.10pre|sort) <(awk '{print 
$5,$1,$2,$3,$4}' r2.6.8|sort) | awk '{print $2-$6,$0}' | sort -n
   >
   >    in turn giving a report like
   >
   >    -188.79 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 
2 554.34 141 3726 3
   >    -127.23 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 
2 1039.93 198 12635 4
   >    -84.02 books/workshops/2013/greve-slind/defung/defung-test.cert.out 
67.23 20 210 2 151.25 113 3797 4
   >    ...
   >    16.2 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 
171.22 44 817 2
   >    16.56 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 
486.32 50 1378 2 469.76 103 2498 3
   >    20.79 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 
441.88 80 1868 3
   >
   >    Perhaps your knowledge of the various books and their algorithms might
   >    shed light as to why certain factors produce these particular results
   >    in the discussion below.
   >
   >    I'll be placing the results at http://people.debian.org/~camm/acl2/
   >    with hopefully self-explanatory names.
   >    
=============================================================================
   >
   >
   >    Performance issues:
   >    =======================================
   >
   >    1) beginning relblock allocation
   >
   >    Issue: when gcl saves an image, it eliminates the hole and minimizes
   >    the relblock space to make the disk image smaller.  All gc statistics
   >    used in balancing the heap and reducing overall gc time
   >    (e.g. si::*optimize-maximum-pages*) are reset to 0.  This seems to be
   >    appropriate, as the type of calculation involved in building an image
   >    is not necessarily representative of that done is using it.  So on
   >    startup, one has somewhat of a 'shrink wrapped' image.  It takes some
   >    time for gcl to gather new statistics and expand the heap from this
   >    state.  The idea is to expand each page type so that its size vis a
   >    vis the rate of allocation is equivalent to the others.  Thus, if for
   >    some reason one starts with a larger allocation of a given type by
   >    fiat, the gc will be triggered by the other types, and they will scale
   >    according to the fiat type.  As 2.6.8 did not shrink wrap the
   >    relblock, it starts with a very large allocation, causing the heap to
   >    grow more quickly and save gc time relative to 2.6.10.
   >
   >    Current status: At startup, 2.6.10 now scales the new hole size as
   >    a set fraction of the number of (dynamically determined) available
   >    pages, and the relblock size as a multiple of the non-relocatable
   >    heap.  Both of these operations cost nothing, but are somewhat
   >    ad-hoc.  The functions #'si::set-starting-hole-divisor and
   >    #'si::set-starting-relblock-heap-multiple are provided to tune if
   >    necessary the defaults of 10 and 2.
   >
   >    2) cons size
   >
   >    Issue: 2.6.9 and forward defaults to a two word cons, whereas 2.6.8
   >    has a three word cons, the first word being a type word.  This is to
   >    save space and restore the build on more limited 32bit machines, which
   >    has now been achieved. (acl2 6.2 is now in Debian testing).  It should
   >    also help with memory bandwidth.  There is however an extra branch
   >    required in typing an object, and more if immediate fixnums are in
   >    force.  In principle, this should easily be dominated by the cost of
   >    referencing the pointer, but this should be tested.
   >
   >    Current status: we have an extra configure switch
   >
   >    --enable-widecons ; which defaults to "no".  
   >
   >    Thankfully widecons is a net loss of some 7 min, with a profile like
   >    (r2.6.10pre vs r2.6.10pre.widecons):
   >
   >    -44.1 books/tau/bounders/elementary-bounders.cert.out 912.70 97 6315 2 
956.80 114 7651 2
   >    -15.9 books/centaur/vl/transforms/xf-sizing.cert.out 134.27 36 790 2 
150.17 37 880 2
   >    -11.77 books/centaur/regression/common.cert.out 177.72 66 2074 2 189.49 
51 1516 2
   >    ...
   >    15.44 books/centaur/defrstobj/basic-tests.cert.out 462.67 51 1471 2 
447.23 52 1537 2
   >    17.25 
books/workshops/2004/legato/support/generic-theory-tail-recursion-mult.cert.out 
194.77 34 624 2 177.52 31 566 2
   >    33.19 books/workshops/2013/greve-slind/defung/defung-stress.cert.out 
486.32 50 1378 2 453.13 56 1483 2
   >
   >
   >    3) immediate fixnums
   >
   >    Issue: 2.6.9 and forward support immediate fixnums, the rationale
   >    having been to lower memory requirements as described above.  These
   >    provide faster inlined arithmetic and comparisons, though require
   >    extra branching when typing an object.  When a two word cons and
   >    immediate fixnums are both present, up to four branches can be
   >    required in a typing, as immediate fixnums can now appear in the cdr
   >    of the cons inspected to determine the type.  We should see if this is
   >    a win too.
   >
   >    Current status: We now have the following configure switches
   >
   >    --enable-immfix ; defaults to yes, can be disabled with  
--disable-immfix
   >
   >    --enable-fastimmfix=xx ; try to get at least an xx bit wide fixnum 
table centered
   >                      ; on the NULL address, which has no boxing cost
   >                      ; default is 64, meaning use high memory
   >                      ; immediate fixnums requiring arithmetic to box.
   >
   >    --enable-safecdr       ; do not place immediate fixnums in cdr, but
   >                      ; boxed versions instead, and speed up typing
   >                      ; accordingly.  Defaults to "no"
   >    --enable-safecdrdbg    ; debug the above algorithm and error on failure
   >
   >
   >    The dominant fixnum cost is allocation, and this has already been
   >    virtually eliminated in acl2 via its use of
   >    #'allocate-bigger-fixnum-range.  So for acl2, its really just the
   >    typing cost vs. the arithmetic acceleration and saving of a pointer
   >    dereference. 
   >
   >    --disable-immfix is a slight net gain (30 sec), with a profile like
   >    (r2.6.10pre vs r2.6.10pre.no-immfix):
   >
   >    -107.32 books/workshops/2013/greve-slind/defung/defung-test.cert.out 
67.23 20 210 2 174.55 87 3042 2
   >    -10.62 books/unicode/utf8-decode.cert.out 116.08 26 325 2 126.70 24 316 
2
   >    -9.86 books/centaur/tutorial/intro.cert.out 159.80 68 1905 2 169.66 60 
1793 2
   >    ...
   >    12.08 books/workshops/1999/ste/inference.cert.out 149.30 20 287 2 
137.22 24 294 2
   >    13.54 books/misc/misc2/reverse-by-separation.cert.out 187.42 26 467 2 
173.88 27 463 2
   >    16.07 books/rtl/rel9/support/lib3.delta1/seed.cert.out 365.55 88 2103 2 
349.48 88 1950 2
   >
   >    I anticipate this is not likely to hold up when other immediate fixnum
   >    accelerations are backported from master, e.g. accelerating eql. 
   >
   >    Several of these experiments are still running.  I'll post more if 
desired.
   >
   >    4) general gc:
   >
   >    Several inner gc loops have been optimized, significantly speeding up
   >    contiguous gc in particular, and to a lesser extent, relocatable.  All
   >    contiguous gc used to also collect relocatable by default -- this is
   >    now separated for efficiency.  Traditionally, (si::gbc t) collected
   >    everything, (si::gbc 1) relocatable, and (si::gbc nil) cells only.
   >    This is still the case, with (si::gbc 0) added to collect contiguous
   >    only.   
   >
   >    echo '(time (progn (setq si::*optimize-maximum-pages* nil)(dotimes (i 
1000) (si::gbc ???))))' | ./saved_ansi_gcl |grep "^gbc time"
   >
   >    ???    2.6.8  2.6.10pre
   >    ===    =====  =========
   >    t      9.500  7.510
   >    1      7.250  6.980
   >    0      -----  6.699
   >    nil    7.210  6.050
   >
   >    5) contblock allocation
   >
   >    some sgc bit logic has changed, particularly regarding contiguous
   >    blocks, making the large contiguous allocation I recommended earlier
   >    now possibly obsolete.  I have not tested this, and leaving it in
   >    should cause minimal harm.
   >
   >    When I get around to it, I really need to implement a trampoline at
   >    the end of .data for all pc32 relocations, rendering the whole issue
   >    mute.  This will likely complicate many platforms, however.
   >
   >    Finally, I've used your save-gprof.lsp file, and have found one bug:
   >
   >    diff -u save-gprof.lsp.ori save-gprof.lsp
   >    --- save-gprof.lsp.ori  2013-10-01 13:56:55.739771000 -0500
   >    +++ save-gprof.lsp      2013-10-01 13:57:15.712921000 -0500
   >    @@ -302,7 +302,7 @@
   >
   >                                    `(setq ,sym-initial-cbd
   >                                           (,sym-pathname-os-to-unix
   >    -                                            (namestring (truename ""))
   >    +                                            (namestring (truename "")) 
:unix
   >                                            ,sym-state))
   >                                    `(,sym-f-put-global
   >                                      ',sym-cbd
   >
   >
   >    Take care,
   >    -- 
   >    Camm Maguire                                        address@hidden
   >    
==========================================================================
   >    "The earth is but one country, and mankind its citizens."  --  
Baha'u'llah
   >
   >
   >
   >
   >

   -- 
   Camm Maguire                                     address@hidden
   ==========================================================================
   "The earth is but one country, and mankind its citizens."  --  Baha'u'llah




reply via email to

[Prev in Thread] Current Thread [Next in Thread]