Re: [RFC PATCH 0/6] target/ppc: Improve 4xx and 440 tlbwe
From: Nicholas Piggin
Subject: Re: [RFC PATCH 0/6] target/ppc: Improve 4xx and 440 tlbwe
Date: Thu, 07 Dec 2023 14:22:06 +1000

On Thu Dec 7, 2023 at 11:35 AM AEST, BALATON Zoltan wrote:
> Hello,
>
> On Wed, 15 Nov 2023, BALATON Zoltan wrote:
> > On Tue, 14 Nov 2023, Nicholas Piggin wrote:
> >> Well I split out these patches and looked a bit closer and added
> >> a few more things.
> >>
> >> I think it may be a bit too much to do the optimisations for
> >> this release, because 4xx TLB flushing has some quirks too so
> >> it's not just simple implementation of 4xx scheme in 440. We
> >> could try for next time.
> >>
> >> The bug fix patch 1 maybe we should do. We haven't been able to
> >> confirm it fixes anything but there was mention of occasional
> >> random crashes.
> >
> > I did some quick testing of this series and found that patch 1 alone makes
> > it slower and is not known to fix any issue, so I'd say don't commit just
> > this patch without the rest. The current version works well enough, so we
> > can live with that until the next version. With the other patches it's
> > faster, and the last patch does make a difference: it makes it a bit
> > faster. I did not record the numbers and only did one measurement, so it's
> > only approximate, but unless you plan to take the whole series now, keep
> > patch 1 for the next devel cycle as well.
>
> We've done some more experiments and I've collected some numbers now. The
> test was running lame to convert a wav file to mp3 right after boot and
> then getting "info jit" output after it finished. The same executable runs
> on pegasos2 and sam460ex, so we can compare sam460ex before and after this
> series, and against pegasos2 as well. These were run on the same host
> machine, so the numbers should be comparable. (This test also hits the slow
> FPU emulation on the PPC target, which is another reason it runs slowly.)
>
> On pegasos2 I get:
>
> Encoding as 44.1 kHz j-stereo MPEG-1 Layer III VBR(q=2)
> Frame | CPU time/estim | REAL time/estim | play/CPU | ETA
> 1149/1149 (100%)| 0:33/ 0:33| 0:33/ 0:33| 0.8982x| 0:00
> QEMU 8.1.92 monitor - type 'help' for more information
> Accelerator settings:
> one-insn-per-tb: off
>
> Translation buffer state:
> gen code size 29666515/1023052800
> TB count 52723
> TB avg target size 24 max=2048 bytes
> TB avg host size 325 bytes (expansion ratio: 13.4)
> cross page TB count 0 (0%)
> direct jump count 31917 (60%) (2 jumps=25829 48%)
> TB hash buckets 24452/32768 (74.62% head buckets used)
> TB hash occupancy 33.37% avg chain occ. Histogram: [0,10)%|▆ █ ▅▁▃▁▁|[9
> TB hash avg chain 1.018 buckets. Histogram: 1|█▁3
>
> Statistics:
> TB flush count 0
> TB invalidate count 7841
> TLB full flushes 0
> TLB partial flushes 13298
> TLB elided flushes 100190
> [TCG profiler not compiled]
>
> On sam460ex *without* this series:
>
> Frame | CPU time/estim | REAL time/estim | play/CPU | ETA
> 1149/1149 (100%)| 0:37/ 0:37| 0:37/ 0:37| 0.8093x| 0:00
> QEMU 8.1.92 monitor - type 'help' for more information
> Accelerator settings:
> one-insn-per-tb: off
>
> Translation buffer state:
> gen code size 32917427/1023052800
> TB count 60534
> TB avg target size 22 max=2048 bytes
> TB avg host size 306 bytes (expansion ratio: 13.9)
> cross page TB count 0 (0%)
> direct jump count 37047 (61%) (2 jumps=29011 47%)
> TB hash buckets 26619/32768 (81.23% head buckets used)
> TB hash occupancy 40.02% avg chain occ. Histogram: [0,10)%|▅ █ ▆▁▄▁▂|[9
> TB hash avg chain 1.035 buckets. Histogram: 1|█▁3
>
> Statistics:
> TB flush count 0
> TB invalidate count 5629
> TLB full flushes 0
> TLB partial flushes 508238
> TLB elided flushes 7680722
> [TCG profiler not compiled]
>
> On sam460ex *with* this series:
>
> Frame | CPU time/estim | REAL time/estim | play/CPU | ETA
> 1149/1149 (100%)| 0:34/ 0:34| 0:34/ 0:34| 0.8595x| 0:00
> QEMU 8.1.92 monitor - type 'help' for more information
> Accelerator settings:
> one-insn-per-tb: off
>
> Translation buffer state:
> gen code size 33094883/1023052800
> TB count 60607
> TB avg target size 22 max=2048 bytes
> TB avg host size 308 bytes (expansion ratio: 13.9)
> cross page TB count 0 (0%)
> direct jump count 37093 (61%) (2 jumps=29038 47%)
> TB hash buckets 26682/32768 (81.43% head buckets used)
> TB hash occupancy 40.12% avg chain occ. Histogram: [0,10)%|▅ █ ▆▁▄▁▂|[9
> TB hash avg chain 1.034 buckets. Histogram: 1|█▁3
>
> Statistics:
> TB flush count 0
> TB invalidate count 5628
> TLB full flushes 0
> TLB partial flushes 73
> TLB elided flushes 1143
> [TCG profiler not compiled]

Great, thanks for the numbers.

> The excessive TLB flushes are resolved; there are now even far fewer than
> on pegasos2, which uses a G4 CPU. I wonder why, and whether that could be
> reduced further for books as well. It still runs slower on sam460ex than on
> pegasos2, but that will need further profiling to find the next
> bottleneck.

G4 uses segments and a hash table? I think the problem there is that the QEMU
TLB does not match that MMU well, so a TLBIE address cannot easily be matched
to a QEMU TLB address.

So it would not be as trivial to improve as this series was. It could be an
interesting project. I think you need some way to quickly map a hash
virtual address to the possible segment effective addresses that could
be mapping it, so that you can invalidate those addresses (effective
addresses are what the TCG TLBs cache).

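
As a rough illustration of that mapping (a sketch only, not QEMU code): assuming
the classic 32-bit hash MMU layout, with 16 segment registers each holding a
24-bit VSID and a 16-bit page index inside each 256MB segment, the candidate
effective addresses for one virtual page could be found as below. The helper
name va_to_candidate_eas and the constants are hypothetical, and the linear
scan stands in for whatever faster reverse index a real implementation might
want.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 32-bit hash MMU layout; these names are illustrative only. */
#define NUM_SEGMENTS    16          /* top 4 EA bits select a segment register */
#define SEGMENT_SHIFT   28          /* each segment covers 256MB */
#define PAGE_SHIFT      12          /* 4KB pages */
#define PAGE_INDEX_MASK 0xFFFFu     /* 16-bit page index within a segment */
#define VSID_MASK       0x00FFFFFFu /* low 24 bits of a segment register */

/*
 * Hypothetical helper: given the current segment registers and a virtual
 * page (vsid, page_index), write every effective address that currently
 * maps to that virtual page into out[] and return how many were found.
 * Those are the addresses a per-page flush would have to cover.
 */
size_t va_to_candidate_eas(const uint32_t sr[NUM_SEGMENTS],
                           uint32_t vsid, uint32_t page_index,
                           uint32_t out[NUM_SEGMENTS])
{
    size_t n = 0;

    assert(sr != NULL && out != NULL);

    for (uint32_t i = 0; i < NUM_SEGMENTS; i++) {
        /* Ignore the flag bits above the VSID when comparing. */
        if ((sr[i] & VSID_MASK) == (vsid & VSID_MASK)) {
            out[n++] = (i << SEGMENT_SHIFT) |
                       ((page_index & PAGE_INDEX_MASK) << PAGE_SHIFT);
        }
    }
    return n;
}
```

With only 16 segments the scan is cheap; the harder part in practice is
probably that several segments can share a VSID, so one virtual page may need
more than one effective address invalidated.
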
Thanks,
Nick