[Octave-patch-tracker] [patch #10167] interpreter: Speedup patchset
From: Petter Tomner
Subject: [Octave-patch-tracker] [patch #10167] interpreter: Speedup patchset
Date: Wed, 5 Jan 2022 06:08:43 -0500 (EST)
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0
URL:
<https://savannah.gnu.org/patch/?10167>
Summary: interpreter: Speedup patchset
Project: GNU Octave
Submitted by: petter
Submitted on: Wed 05 Jan 2022 11:08:40 AM UTC
Category: Core : other
Priority: 5 - Normal
Status: None
Privacy: Public
Assigned to: None
Originator Email:
Open/Closed: Open
Discussion Lock: Any
_______________________________________________________
Details:
Hi!
I have made some patches in an attempt to speed up the interpreter,
which I hope may prove useful.
I post them all here to give some overview, since many of them look
a bit silly and meaningless in isolation. There are 22 patches that
I have tried to keep small and isolated to make things palatable.
I have verified the changes with Callgrind in isolation.
I have already posted some to the list, but those are included in this set too.
The patches do the following:
* Use a memory pool to decrease allocations for octave_value scalars,
octave_lvalue and stack frames.
The way I clean up the memory pool at thread exit might
be a bit sketchy, since thread_local destructors seem
to trigger only if the object is "used". It should probably
be tested on a Windows machine too. It might need to be
#ifdef-ed out on quirky platforms, but a leak at exit
on some platforms might be OK too.
* Use a "tiny vector", which stores its elements in a union
of an array and a std::vector, to avoid mallocs and keep
things on the stack for small numbers of elements.
This is useful in a lot of places where a std::list or octave_value_list
hardly ever has many elements anyway.
* Replace dynamic_cast with static_cast in the scalar operators.
* Use direct calls in the scalar operators (instead of virtual calls) to
avoid pointer indirection and lookups.
* Cache the function lookup in the identifier, using a weak_ptr to a
shared_ptr as the signal for invalidation. (Not for classes, since they
have no unique type_id.)
The cache is invalidated on a load_path update or on autoload. Should
there be more places that invalidate the cache?
* Some smaller fixes.
There seems to be no official benchmark, but the speed increase for
"interpreter heavy" code, i.e. where most of the time is not spent in
BLAS, is measurable.
function i = foo ()
  tic;
  for i = 1:1000000
    i = 1+2+3+4+5+6 + sin (7);
  end
  toc;
end
For such a loop, the run time decreases by 70% (2.7 s -> 0.8 s).
Julia has a set of micro-benchmarks which is engineered in such a way
that it is terrible for Octave (all tests but one use no vectorization).
The run time decreases by about 40% on average in that suite.
https://github.com/JuliaLang/Microbenchmarks/blob/master/perf.m
Roland Baudin made a benchmark some time ago; the numbers for his
"interpreter test" look like this (left column before the patchset,
right column after):
http://roland65.free.fr/benchmarks/benchmarks-0.2.pdf
fib(24) ------------------ 0.722477 s - 0.364349 s
subsets(1:20,7)----------- 1.00886 s - 0.615895 s
irn55_rand(1,1e6)--------- 1.08363 s - 0.654606 s
merge_sort, 10000 elts --- 0.972122 s - 0.623209 s
qSort, 40000 elts -------- 1.20522 s - 0.796537 s
25 times harmvect(1e6)---- 0.254404 s - 0.230258 s
harmloop(1e6)------------- 0.652904 s - 0.38412 s
3 times fannkuch(7)------- 1.11088 s - 0.692495 s
mon_lu 400x400------------ 0.065699 s - 0.059072 s
crible(3e6)--------------- 0.078116 s - 0.063907 s
p = make_perm(20000)------ 0.905211 s - 0.497369 s
4 times q = inv_perm(p)--- 0.669331 s - 0.459765 s
fftx 2^15 elts------------ 1.75846 s - 1.10528 s
30 times pascal(1029)----- 0.190787 s - 0.151445 s
hmeans, 8x8000, K=4 ------ 0.083367 s - 0.07835 s
simplexe 200 cstr 300 var- 0.85787 s - 0.750403 s
loop_call_f (30000 elts)-- 1.8533 s - 1.35763 s
loop_call_p (50000 elts)-- 0.939411 s - 0.269355 s
form_vect1(20000)--------- 0.031274 s - 0.022097 s
form_vect2(20000)--------- 0.127493 s - 0.116317 s
form_vect3(20000)--------- 0.026516 s - 0.019685 s
loop1(1e6)---------------- 0.55983 s - 0.352572 s
loop2(1e6)---------------- 1.44872 s - 0.94482 s
loop3(3e5)---------------- 0.652665 s - 0.405102 s
test_bool_and_comp_ops---- 0.047887 s - 0.04963 s
test_find----------------- 0.188028 s - 0.196211 s
prime_factors(160001)----- 0.517081 s - 0.175626 s
extraction test----------- 0.076488 s - 0.065315 s
insertion test------------ 0.044466 s - 0.042847 s
=================================== ============
total time---------------- 18.1325 s - 11.5443 s
Some of the patches break ABI for the .so lib.
__run_test_suite__ runs fine.
I have attached the patches as a zip file.
_______________________________________________________
File Attachments:
-------------------------------------------------------
Date: Wed 05 Jan 2022 11:08:40 AM UTC Name:
octave_interpreter_patchset_20220105.zip Size: 66KiB By: petter
patchset zip
<http://savannah.gnu.org/patch/download.php?file_id=52615>
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/patch/?10167>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/