50% of all time spent in victim_tlb_hit() !? (or case when OVPSim beats QEMU hands down)

qemu-devel

From:	Igor Lesik
Subject:	50% of all time spent in victim_tlb_hit() !? (or case when OVPSim beats QEMU hands down)
Date:	Thu, 14 Sep 2023 05:09:49 +0000

Hi.

I came across a case where OVPSim shamelessly outperforms QEMU. In a test with
8 CPUs, single-threaded OVPSim is 4 times faster than QEMU tcg-single and ~30%
faster than QEMU mttcg.

I constructed a simple test case that reproduces it.
When I profiled the test, I saw that QEMU spends ~50% of all time inside the
function victim_tlb_hit (according to the perf tool).

Setup:
1. For both QEMU and OVPSim I made a simple machine with 8 RISC-V CPUs and one
RAM (system mode).
2. The host machine is x86 with 4 cores and 1 thread per core, so only 4 HW
threads.
3. The test is "bare metal", no OS.
4. All CPUs run the same program, with no explicit synchronization in the code.
5. Both QEMU and OVPSim use semihosting EXIT, and the simulation ends when the
"last" exit happens.

Test:

```
#define N (10000000ul * 60ul)
#define M (1024*1024)

int my_main(int argc, char* argv[]) {

  volatile long unsigned int a = 0;
  volatile long unsigned int b[M] = {};
  volatile long unsigned int c[M] = {};

  for (long unsigned int i = 1; i < N; i++) {
      int j = i % M;
      a += i;
      a |= (b[j] * i);
      b[j] += a & (c[j] / i);
      c[j] += i + a;
      a += b[j] - c[j];
  }

  // consume a (this step is elided in the post; returning it is one option)
  return (int)a;
}
```
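
To put the test's memory footprint in numbers, here is a small sketch of the
arithmetic. It assumes an 8-byte unsigned long (LP64 on RV64) and 4 KiB guest
pages; neither is stated above, and the helper names are made up for
illustration.

```c
#include <stdint.h>

/* Assumptions, not taken from the post: 8-byte unsigned long (LP64)
 * and 4 KiB guest pages. */
enum { ELEM_SIZE = 8, PAGE_SIZE_BYTES = 4096 };

/* Bytes occupied by the two arrays b[] and c[] for a given M. */
uint64_t working_set_bytes(uint64_t m)
{
    return 2u * m * ELEM_SIZE;
}

/* Minimum number of pages needed to hold both arrays
 * (more if the arrays are not page-aligned). */
uint64_t working_set_pages(uint64_t m)
{
    return (working_set_bytes(m) + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}

/* Consecutive loop iterations whose index j stays within one page
 * of a single array, since j advances by one element per iteration. */
uint64_t iterations_per_page(void)
{
    return PAGE_SIZE_BYTES / ELEM_SIZE;
}
```

Under these assumptions, M = 1024*1024 gives 16 MiB and at least 4096 pages per
CPU, while M = 2 fits in a single page. That is the difference between the two
configurations compared in the results.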

Perf report:

```
  46.78%  qemu-system-riscv64      [.] victim_tlb_hit
  23.68%  qemu-system-riscv64      [.] helper_le_ldq_mmu
   4.46%  qemu-system-riscv64      [.] helper_latch_ld_dest_reg_id
```

Annotated disassembly of victim_tlb_hit:
```
       │    jne    1f9
       │    lea    (%rax,%r9,1),%rcx
       │    add    $0x130,%rcx
  0.25 │    mov    $0x7,%edi
  0.29 │126:shl    $0x4,%rsi
  0.39 │    mov    %rdx,%r8
  1.65 │    shl    $0x5,%r8
  0.35 │    add    0x1fa8(%rax,%rsi,1),%r8
  0.32 │139:mov    $0x1,%esi
  0.37 │    xchg   %esi,(%rax)
 51.86 │    test   %esi,%esi
       │    je     150
       │    jmp    148
       │146:pause
```
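
The xchg / test / pause sequence in the listing has the shape of a
test-and-set spinlock acquire, and the 51.86% on the test instruction is
probably sample skid from the locked xchg just before it. Below is a minimal
sketch of that pattern in C11 atomics. It is an illustration only, not QEMU's
actual code, and the type and function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical lock type for illustration; not QEMU's definition. */
typedef struct {
    atomic_int value;   /* 0 = free, 1 = held */
} sketch_spinlock;

void sketch_spin_lock(sketch_spinlock *l)
{
    /* Atomically store 1 and fetch the old value (the xchg in the
     * listing). An old value of 0 means we took the lock. */
    while (atomic_exchange_explicit(&l->value, 1,
                                    memory_order_acquire) != 0) {
        /* Lock was held: wait with plain loads until it looks free,
         * then retry the exchange. A real implementation would put
         * a CPU pause hint in this inner loop. */
        while (atomic_load_explicit(&l->value,
                                    memory_order_relaxed) != 0) {
            /* spin */
        }
    }
}

void sketch_spin_unlock(sketch_spinlock *l)
{
    /* Release the lock so that another thread's exchange sees 0. */
    atomic_store_explicit(&l->value, 0, memory_order_release);
}
```

Note that the exchange is a full atomic read-modify-write even when the lock is
free, so taking it on every call has a cost even without contention.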


Results:
1. OVPSim single is 4 times faster than QEMU tcg-single.
2. OVPSim single is ~30% faster than QEMU mttcg.
3. When M is changed from 1M to 2, OVPSim single is 2 times faster than QEMU
   tcg-single and 2 times slower than QEMU mttcg.

Question: does anyone have an idea or intuition about how the QEMU code could
be improved to speed up simulation in cases like this?

Thanks,
Igor
