From: Igor Lesik
Subject: 50% of all time spent in victim_tlb_hit() !? (or case when OVPSim beats QEMU hands down)
Date: Thu, 14 Sep 2023 05:09:49 +0000
Hi.
I came across a case where OVPSim shamelessly outperforms QEMU. In a test with
8 CPUs, single-threaded OVPSim is 4 times faster than QEMU tcg-single and ~30%
faster than QEMU mttcg.
I constructed a simple test case that reproduces it.
When I profiled the test, I saw that QEMU spends ~50% of all its time inside
the function victim_tlb_hit (according to the perf tool).
Setup:
1. For both QEMU and OVPSim I made a simple machine with 8 RISC-V CPUs and one
RAM (system mode).
2. The host machine is x86 with 4 cores and only 1 thread per core, so 4 HW
threads in total.
3. The test is "bare metal", no OS.
4. All CPUs run the same program, with no explicit synchronization in the code.
5. Both QEMU and OVPSim use semihosting EXIT, and the simulation ends when the
"last" exit happens.
Test:
```
#define N (10000000ul * 60ul)
#define M (1024*1024)
int my_main(int argc, char* argv[]) {
    volatile long unsigned int a = 0;
    /* Two 8 MiB arrays per CPU (M elements of 8 bytes each). */
    volatile long unsigned int b[M] = {};
    volatile long unsigned int c[M] = {};
    /* i starts at 1, so the division by i below is always defined. */
    for (long unsigned int i = 1; i < N; i++) {
        int j = i % M;
        a += i;
        a |= (b[j] * i);
        b[j] += a & (c[j] / i);
        c[j] += i + a;
        a += b[j] - c[j];
    }
    //consume a
    return 0;
}
```
Perf report:
```
46.78% qemu-system-riscv64 [.] victim_tlb_hit
23.68% qemu-system-riscv64 [.] helper_le_ldq_mmu
4.46% qemu-system-riscv64 [.] helper_latch_ld_dest_reg_id
```
Annotated victim_tlb_hit:
```
│ jne 1f9
│ lea (%rax,%r9,1),%rcx
│ add $0x130,%rcx
0.25 │ mov $0x7,%edi
0.29 │126:shl $0x4,%rsi
0.39 │ mov %rdx,%r8
1.65 │ shl $0x5,%r8
0.35 │ add 0x1fa8(%rax,%rsi,1),%r8
0.32 │139:mov $0x1,%esi
0.37 │ xchg %esi,(%rax)
51.86 │ test %esi,%esi
│ je 150
│ jmp 148
│146:pause
```
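The hot sequence above (`mov $0x1`, `xchg` with memory, `test`, then a `pause` loop) has the shape of a test-and-set spinlock acquire. The following is a minimal C11 sketch of that pattern for illustration only. The names `spin_t`, `spin_trylock`, `spin_lock` and `spin_unlock` are hypothetical and this is not QEMU's actual lock code.

```c
#include <stdatomic.h>

/* Hypothetical lock type for illustration; not QEMU's definition. */
typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} spin_t;

/* One acquire attempt: atomically swap in 1 and look at the old value.
 * On x86 an atomic exchange is typically emitted as `xchg reg, (mem)`,
 * followed by a `test` of the returned value. Returns 1 on success. */
static int spin_trylock(spin_t *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Acquire: retry the exchange until the old value was 0. */
static void spin_lock(spin_t *l)
{
    while (!spin_trylock(l)) {
        /* Wait with plain loads until the lock looks free, so the
         * cache line is not bounced by repeated exchanges. A CPU
         * relax hint (`pause` on x86) would normally go here. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed)) {
        }
    }
}

/* Release: store 0 so the next exchange sees the lock as free. */
static void spin_unlock(spin_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Even when the lock is never contended, every acquire still executes an atomic read-modify-write, so a lock taken once per guest memory access can dominate a profile. perf also tends to attribute the cost of such an instruction to the one that follows it, which may be why the `test` line carries the 51.86%.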
Results:
1. OVPSim single is 4 times faster than QEMU tcg-single.
2. OVPSim single is ~30% faster than QEMU mttcg.
3. When M is changed from 1M to 2, OVPSim single is 2 times faster than QEMU
tcg-single and 2 times slower than QEMU mttcg.
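One possible reading of the M dependence, offered as an assumption rather than a measured fact: if the software TLB is direct-mapped and indexed by the low bits of the virtual page number, then b[j] and c[j], which sit 8 MiB apart when M is 1M, could land in the same slot on every iteration and keep evicting each other. The sketch below only does that index arithmetic. The page size, the TLB size, the contiguous layout of b and c, and the helper names are all assumptions chosen for illustration, not values taken from QEMU.

```c
#include <stdint.h>

#define PAGE_BITS   12u    /* assumed 4 KiB target pages */
#define TLB_ENTRIES 256u   /* assumed TLB size, must be a power of two */

/* Slot of a direct-mapped TLB: low bits of the virtual page number. */
static unsigned tlb_index(uint64_t vaddr)
{
    return (unsigned)((vaddr >> PAGE_BITS) & (TLB_ENTRIES - 1u));
}

/* Replay the index pattern of the test (j = i % m, 8-byte elements) and
 * count iterations in which b[j] and c[j] are on different pages that
 * map to the same TLB slot. */
static unsigned long count_conflicts(uint64_t b_base, uint64_t c_base,
                                     unsigned long m, unsigned long n)
{
    unsigned long conflicts = 0;
    for (unsigned long i = 1; i < n; i++) {
        unsigned long j = i % m;
        uint64_t pb = b_base + (uint64_t)8 * j;
        uint64_t pc = c_base + (uint64_t)8 * j;
        int same_page = (pb >> PAGE_BITS) == (pc >> PAGE_BITS);
        if (!same_page && tlb_index(pb) == tlb_index(pc)) {
            conflicts++;
        }
    }
    return conflicts;
}
```

With a hypothetical base address of 0x80000000 and c placed directly after b, calling `count_conflicts(base, base + 8ul * 1024 * 1024, 1024 * 1024, 1000)` reports a conflict on every iteration, while with M = 2 (c only 16 bytes after b) both elements share a page and no conflicts are counted. That would be consistent with the test leaning much less on the victim TLB path in the M = 2 run, but it needs to be confirmed against the real TLB parameters.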
Question: does anyone have an idea or intuition about how the QEMU code could
be improved to speed up the simulation in cases like this?
Thanks,
Igor