octave-patch-tracker
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Octave-patch-tracker] [patch #10167] interpreter: Speedup patchset


From: Petter Tomner
Subject: [Octave-patch-tracker] [patch #10167] interpreter: Speedup patchset
Date: Wed, 5 Jan 2022 06:08:43 -0500 (EST)
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0

URL:
  <https://savannah.gnu.org/patch/?10167>

                 Summary: interpreter: Speedup patchset
                 Project: GNU Octave
            Submitted by: petter
            Submitted on: Wed 05 Jan 2022 11:08:40 AM UTC
                Category: Core : other
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
        Originator Email: 
             Open/Closed: Open
         Discussion Lock: Any

    _______________________________________________________

Details:

Hi!

I have made some patches in a try to speed up the interpreter which
I hope might come to use.

I post them all here to give some overview, since many of them look
a bit silly and meaningless in isolation. There are 22 patches that
I tried to keep small and isolated to make things palatable.

I have verified the changes with Callgrind in isolation.

I have already posted some to the list but they are in this set too.

The patches do the following:
    * Use memory pool to decrease allocations for octave_value scalars,
      octave_lvalue and stack frames.

      The way I clean up the memory pool at thread exit might 
      be abit sketchy, since thread_local destructors seems 
      to only trigger if the object is "used". Should probably 
      be tested on a Windows machine too. Might need to be
      ifdef:ed out on quirky platforms, but a leak at exit
      on some platforms might be OK too.

    * Use a "tiny vector" which stores its elements in an union
      of an array and a std::vector, to avoid mallocs and keep
      things on the stack, for small amount of elements.

      This is useful in alot of places where a std::list or octave_value_list
      hardly ever have many elements anyway.

    * Replace dynamic_cast with static_cast in scalar operators.

    * Use direct calls in scalar operators (instead of virtual calls) to
      avoid pointer indirection and look ups.

    * Cache function look up in the identifier. Using a weak_ptr to a 
      shared_ptr as signal for invalidation. (Not for classes since they
      have no unique type_id.)

      Cache is invalidated on load_path update or autoload. Should 
      there be more places that invalidates the cache?

    * Some more smaller fixes.


There seem to be no official benchmark, but the speed increase for 
"interpreter heavy" code, .i.e. where most time is not spent in 
BLAS, is measurable.


function i = foo ()
 tic;
 for i = 1:1000000
   i = 1+2+3+4+5+6 + sin (7);
 end
 toc;
end


For such a loop, the run time decreases with 70%. (2.7s->0.8s)

Julia has some micro benchmark which is engineered in such a way
that it is terrible for Octave (all but one test no vectorization). The run
time 
decreases with about 40% in that on average.

https://github.com/JuliaLang/Microbenchmarks/blob/master/perf.m

Roland Baudin made a benchmark some time ago, the number looks like this 
for his "interpreter test":
http://roland65.free.fr/benchmarks/benchmarks-0.2.pdf

 fib(24) ------------------ 0.722477 s  - 0.364349 s
 subsets(1:20,7)----------- 1.00886 s   - 0.615895 s 
 irn55_rand(1,1e6)--------- 1.08363 s   - 0.654606 s 
 merge_sort, 10000 elts --- 0.972122 s  - 0.623209 s 
 qSort, 40000 elts -------- 1.20522 s   - 0.796537 s 
 25 times harmvect(1e6)---- 0.254404 s  - 0.230258 s
 harmloop(1e6)------------- 0.652904 s  - 0.38412 s  
 3 times fannkuch(7)------- 1.11088 s   - 0.692495 s
 mon_lu 400x400------------ 0.065699 s  - 0.059072 s 
 crible(3e6)--------------- 0.078116 s  - 0.063907 s 
 p = make_perm(20000)------ 0.905211 s  - 0.497369 s 
 4 times q = inv_perm(p)--- 0.669331 s  - 0.459765 s
 fftx 2^15 elts------------ 1.75846 s   - 1.10528 s  
 30 times pascal(1029)----- 0.190787 s  - 0.151445 s
 hmeans, 8x8000, K=4 ------ 0.083367 s  - 0.07835 s  
 simplexe 200 cstr 300 var- 0.85787 s   - 0.750403 s 
 loop_call_f (30000 elts)-- 1.8533 s    - 1.35763 s  
 loop_call_p (50000 elts)-- 0.939411 s  - 0.269355 s 
 form_vect1(20000)--------- 0.031274 s  - 0.022097 s 
 form_vect2(20000)--------- 0.127493 s  - 0.116317 s 
 form_vect3(20000)--------- 0.026516 s  - 0.019685 s 
 loop1(1e6)---------------- 0.55983 s   - 0.352572 s 
 loop2(1e6)---------------- 1.44872 s   - 0.94482 s  
 loop3(3e5)---------------- 0.652665 s  - 0.405102 s 
 test_bool_and_comp_ops---- 0.047887 s  - 0.04963 s  
 test_find----------------- 0.188028 s  - 0.196211 s 
 prime_factors(160001)----- 0.517081 s  - 0.175626 s 
 extraction test----------- 0.076488 s  - 0.065315 s 
 insertion test------------ 0.044466 s  - 0.042847 s 
 ===================================    ============
 temps total-------------- 18.1325 s    11.5443 s 


Some of the patches breaks ABI for the .so lib.

__run_test_suite__ runs fine.

I have attached the patches as a zip-file.



    _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Wed 05 Jan 2022 11:08:40 AM UTC  Name:
octave_interpreter_patchset_20220105.zip  Size: 66KiB   By: petter
patchset zip
<http://savannah.gnu.org/patch/download.php?file_id=52615>

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/patch/?10167>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/




reply via email to

[Prev in Thread] Current Thread [Next in Thread]