[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
No-threading poke, VM performance
From: |
Luca Saiu |
Subject: |
No-threading poke, VM performance |
Date: |
Mon, 21 Dec 2020 14:35:20 +0100 |
User-agent: |
Gnus (Gnus v5.13), GNU Emacs 27.0.50, x86_64-pc-linux-gnu |
Hello José, and the others.
José told me that poke's performance in general could be improved. As
requested by him I am sending this writeup recording our IRC
conversation, and more. I apologise if by not proposing to immediately
solve these problems myself I will make my proposals feel preachy. I
think they are interesting anyway for VM writers.
I would like some very small microbenchmark as well; for the time being
I will be using something like this:
var euclid
= lambda (long a, long b) long :
{ while (a != b)
if (a < b)
b -= a;
else
a -= b;
return a; }
.vm disassemble function euclid
.vm disassemble function/n euclid
There are a few obvious problems:
a) pushvar relies on C function calls.
This should be trivial to optimise by turning the function into a macro
and specialising for common instruction arguments. The same holds for
popvar, and probably related instructions such as pushe.
I am not worried about this.
b1) Arithmetic is *extremely* expensive in the common case on the PVM,
as it involves GC help allocation of results, and more.
I would suggest changing the data representation by following Lisp's
idea of fixnums: representing unboxed integers which are small enough to
fit in a word, along with their tag bits. If, for example, you needed
10 bits for tags (I am being very pessimistic here: in practice you will
need fewer) then, on a 64-bit machine, you could represent with a
"long-unboxed" tag any 64-bit number fitting sign-extended in 54 bits.
Your arithmetic operations will involve a conditional, fast-branching to
uncommonly executed code in case a result does not fit.
I would also recommend handling out-of-line any case where the
*argument* is not unboxed: all that is non-critical code, which will be
executed rarely and will have no effect on performance. You can do all
of that in C functions, and the common case using CPP macros only.
I also see much more bit twiddling than I was expecting. This is a long
subtraction on the PVM:
# subl (77 B):
0x00007fcbc92214e0 49 8b 45 00 movq 0x0(%r13),%rax
0x00007fcbc92214e4 48 83 e0 f8 andq $0xfffffffffffffff8,%rax
0x00007fcbc92214e8 48 8b 48 08 movq 0x8(%rax),%rcx
0x00007fcbc92214ec 8d 71 01 leal 0x1(%rcx),%esi
0x00007fcbc92214ef 48 8b 38 movq (%rax),%rdi
0x00007fcbc92214f2 ba 3f 00 00 00 movl $0x3f,%edx
0x00007fcbc92214f7 89 d0 movl %edx,%eax
0x00007fcbc92214f9 29 c8 subl %ecx,%eax
0x00007fcbc92214fb 89 c1 movl %eax,%ecx
0x00007fcbc92214fd 48 d3 e7 shlq %cl,%rdi
0x00007fcbc9221500 48 d3 ff sarq %cl,%rdi
0x00007fcbc9221503 4c 89 f1 movq %r14,%rcx
0x00007fcbc9221506 48 83 e1 f8 andq $0xfffffffffffffff8,%rcx
0x00007fcbc922150a 48 8b 01 movq (%rcx),%rax
0x00007fcbc922150d 2b 51 08 subl 0x8(%rcx),%edx
0x00007fcbc9221510 89 d1 movl %edx,%ecx
0x00007fcbc9221512 48 d3 e0 shlq %cl,%rax
0x00007fcbc9221515 48 d3 f8 sarq %cl,%rax
0x00007fcbc9221518 48 29 c7 subq %rax,%rdi
0x00007fcbc922151b 48 8b 44 24 50 movq 0x50(%rsp),%rax
0x00007fcbc9221520 ff d0 callq *%rax
0x00007fcbc9221522 4d 89 75 08 movq %r14,0x8(%r13)
0x00007fcbc9221526 49 83 c5 08 addq $0x8,%r13
0x00007fcbc922152a 49 89 c6 movq %rax,%r14
Even ignoring the allocation near the end, this really seems too much.
This is JitterLisp's subtraction. I believe your code should end up
not bigger than it:
# primitive-primordial-minus/fR 0x560af169f860 (45 B):
0x00007f853fc7e085 49 f7 06 0f 00 00 00 testq $0xf,(%r14)
0x00007f853fc7e08c 0f 85 2e 00 00 00 jne 0x00007f853fc7e0c0
0x00007f853fc7e092 48 f7 c1 0f 00 00 00 testq $0xf,%rcx
0x00007f853fc7e099 0f 85 21 00 00 00 jne 0x00007f853fc7e0c0
0x00007f853fc7e09f 49 8b 06 movq (%r14),%rax
0x00007f853fc7e0a2 48 29 c8 subq %rcx,%rax
0x00007f853fc7e0a5 0f 80 15 00 00 00 jo 0x00007f853fc7e0c0
0x00007f853fc7e0ab 48 89 c1 movq %rax,%rcx
0x00007f853fc7e0ae 49 83 ee 08 subq $0x8,%r14
Before saying that this code is also too complex for a subtraction
remember that this is a dynamically typed language, and look at what it
does:
Tag checking of the first argument, read from memory from the stack
under-top: branch out-of-line if the argument is not a fixnum:
0x00007f853fc7e085 49 f7 06 0f 00 00 00 testq $0xf,(%r14)
0x00007f853fc7e08c 0f 85 2e 00 00 00 jne 0x00007f853fc7e0c0
Same thing for the second argument, on the top of the TOS-optimised
stack and therefore in a register:
0x00007f853fc7e092 48 f7 c1 0f 00 00 00 testq $0xf,%rcx
0x00007f853fc7e099 0f 85 21 00 00 00 jne 0x00007f853fc7e0c0
Subtraction *with overflow checking*: branch out-of-line on overflow:
0x00007f853fc7e09f 49 8b 06 movq (%r14),%rax
0x00007f853fc7e0a2 48 29 c8 subq %rcx,%rax
0x00007f853fc7e0a5 0f 80 15 00 00 00 jo 0x00007f853fc7e0c0
If we did not overflow copy the subtraction result to its place, which is
the TOS:
0x00007f853fc7e0ab 48 89 c1 movq %rax,%rcx
Adjust the stack, which is now one element shorter. This is a nip operation:
0x00007f853fc7e0ae 49 83 ee 08 subq $0x8,%r14
The complexity of tagging and operations on tagged data is in CPP macros.
Now, my fixnum representation is clever in the sense that it uses the
tag 0 for fixnums in the low bits (currently three bits: I could afford at
least four on 64-bits platforms), which allows sum and subtraction to
work on *tagged* data.
A tag valued zero is also faster to check, compared to any other value.
The one type you choose to tag that way will be faster to handle.
Compare, for example, car:
# primitive-car/fR 0x560af16a0780 (20 B):
0x00007f853fc7d3f5 48 8d 41 fe leaq -0x2(%rcx),%rax
0x00007f853fc7d3f9 48 a9 07 00 00 00 testq $0x7,%rax
0x00007f853fc7d3ff 0f 85 12 00 00 00 jne 0x00007f853fc7d417
0x00007f853fc7d405 48 8b 49 fe movq -0x2(%rcx),%rcx
In JitterLisp car does not work on nil. Here leaq is just a way to
compute a three-operand subtraction on x86_64. In practice in order to
check whether the argument is a cons I subtract the desired tag (0x2) from
the tagged pointer, and test if the result's tag bits are zero. This
check costs one more instruction because the tag is non-zero.
Then branch away on non-cons.
(And then, finally, do the actual work: load, using an offset of -0x2 to
compensate for the tag.)
If you adopt this style you will have the same asymmetry: the zero tag
will be slightly faster to handle than the others. There will be *one*
integer type preferred for performance.
In practice, even if the distinction between boxed and unboxed numbers
would still be invisible to the Poke user, we are talking about a
dynamic-typing approach.
Which brings me to my last point c). Before that, a digression on
something which should be easy to improve:
b2) Comparisons materialise their result, which is boxed (!), on the
stack
Look at the disassembly for
ltl
nip2
bzi $L
, which you can find compiling the euclid function.
This is a typical problem with stack code, which on a Jittery VM can be
solved with rewriting at least in typical cases.
Here one big mistake is making the Boolean result be a *boxed* integer.
And of course, as always in these cases, you are computing two
conditionals at run time instead of one: one for knowing which Boolean
to push, another depending on the Boolean value.
I have no constructive proposals for avoiding this pattern in Poke's
compiler, but rewrites alone might suffice. You could rewrite to a new
"branch-on-not-less" instruction in this case, if needed even
reproducing the same stack effect possibly by PUSH_UNSPECIFIED.
c) Uncommonly executed code is inline.
There is a lot of code inside PVM instruction following the pattern
if (ERROR CONDITION)
PVM_RAISE (...);
else
...
Apart from the problems of PVM_RAISE, which right now is not replicable
because it relies on string literals, this makes VM instructions very
large, with branches which are taken in the common case. It also
pollutes the instruction cache with unused code, and having large
instructions will be a problem later when we optimise deeper stack
accesses, which will rely on more replication.
Not doing this may interfere with Poke's compiler logic. Let me first
show you the desired code; I will end with a constructive proposal.
JitterLisp does not follow this practice yet, but its VM instructions
are designed for it. Say that I want to do something with the result of
a subtraction:
...push the two subtraction operands on the stack...
primitive-primordial-minus $slow_path
$after_subtraction:
...use the subtraction result...
$slow_path:
subtraction-in-c
branch $after_subtraction
The important point here is that the slow path has negligible impact on
performance: it can check for type errors (not needed in Poke) and for
correct but uncommon tags in the input or in the output (needed in
Poke). This code can be as complicated as we need, and it can be
written in comfortable unrestricted C, using a function. The slow path,
if there is no failure, will branch back into the common path and that
the consumer of the subtraction result does not need to be aware of
anything special.
This should also happen for safe points (currently "sync" in the PVM).
Look at example-vms/uninspired/examples/signals.vm , which is a
cleaned-up version of an example I originally wrote for you on the
mailing list.
The safe-point VM instruction itself should compile to two or three
hardware instructions, with a non-taken branch:
# safe-point/fR 0x55b2e7ba21f0 (11 B):
0x00007f5589c7ea11 48 83 7b f0 00 cmpq $0x0,-0x10(%rbx)
0x00007f5589c7ea16 0f 85 08 00 00 00 jne 0x00007f5589c7ea24
The code for handling notifications, in case they actually arrive, will
be in the slow path.
Important remark: Jitter's new GC will encourage this style: allocation
is extremely fast in the common path -- at four instructions, cheaper
than a C function call plus return in C, with a non-taken branch.
The slow path will call into C.
allocate-cons %r0, $slow_path
$after_allocation:
...Fill the new cons in %r0...
$slow_path:
...some instruction which can afford to be inefficient...
branch $after_allocation
This style should be pervasive. Modifying the Poke compiler to provide
for it in every place seems hard; however I have a proposal.
José would it be easy to integrate this hypothetical Jitter feature into
Poke's code generator?
Jittery VM subsections
======================
Right now code is generated for a Jittery VM in a strictly sequential
order. If I have to compile
if EXP1 then
STMT1
else
STMT2
...
the code will be something like:
[EXP1]
branch-on-false $else
[STMT1]
branch $end
$else:
[STMT2]
$end:
...
And the code generator has to generate, for example, the instructions
for STMT1 before the instructions for STMT2.
I would like to add a feature similar to Gas subsections (the name for
Jitter should be different)
From the point of view of the compiler the code could also be:
[EXP1]
branch-on-false $else
[STMT1]
branch $end
.pushsubsection
$else:
[STMT2]
.popsubsection
$end:
...
This does not seem to make things easier, but now imagine that the else
branch be very unlikely. Then this variant becomes attractive (two
taken branches in the else case, no taken branch in the then case):
[EXP1]
branch-on-false $else
[STMT1]
$end:
...
$else:
[STMT2]
branch $end
This code would be easier to generate if we could afford not to follow
the instruction order at generation time:
[EXP1]
branch-on-false $else
[STMT1]
.pushsubsection
$else:
[STMT2]
branch $end
.popsubsection
$end:
...
In fact I already follow this logic in my compilers, even without
subsections, by making my code generators recursive; but I doubt that
this would work with Poke. The advantage I imagine here is generating
slow paths at the same time as fast paths. Could something like this be
hidden in your macro instructions using RAS or some other Poke
mechanism? If it could then solving the problems in c) would become
much less disruptive.
I plan to provide the "subsection" stack in Jitter. It should not be
terribly hard.
What do you think?
--
Luca Saiu
* My personal web site: http://ageinghacker.net
* GNU epsilon: http://www.gnu.org/software/epsilon
* Jitter: http://ageinghacker.net/projects/jitter
I support everyone's freedom of mocking any opinion or belief, no
matter how deeply held, with open disrespect and the same unrelented
enthusiasm of a toddler who has just learned the word "poo".
signature.asc
Description: PGP signature
- No-threading poke, VM performance,
Luca Saiu <=