qemu-devel
Re: [PATCH v14 04/10] accel/tcg: add jit stats and time to TBStatistics


From: Wu, Fei
Subject: Re: [PATCH v14 04/10] accel/tcg: add jit stats and time to TBStatistics
Date: Thu, 1 Jun 2023 14:48:06 +0800
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0

On 6/1/2023 9:08 AM, Richard Henderson wrote:
> On 5/30/23 01:35, Fei Wu wrote:
>> +static void collect_jit_profile_info(void *p, uint32_t hash, void *userp)
>> +{
>> +    struct jit_profile_info *jpi = userp;
>> +    TBStatistics *tbs = p;
>> +
>> +    jpi->translations += tbs->translations.total;
>> +    jpi->ops += tbs->code.num_tcg_ops;
>> +    if (stat_per_translation(tbs, code.num_tcg_ops) > jpi->ops_max) {
>> +        jpi->ops_max = stat_per_translation(tbs, code.num_tcg_ops);
>> +    }
>> +    jpi->del_ops += tbs->code.deleted_ops;
>> +    jpi->temps += tbs->code.temps;
>> +    if (stat_per_translation(tbs, code.temps) > jpi->temps_max) {
>> +        jpi->temps_max = stat_per_translation(tbs, code.temps);
>> +    }
>> +    jpi->host += tbs->code.out_len;
>> +    jpi->guest += tbs->code.in_len;
>> +    jpi->search_data += tbs->code.search_out_len;
>> +
>> +    jpi->interm_time += stat_per_translation(tbs, gen_times.ir);
>> +    jpi->opt_time += stat_per_translation(tbs, gen_times.ir_opt);
>> +    jpi->la_time += stat_per_translation(tbs, gen_times.la);
>> +    jpi->code_time += stat_per_translation(tbs, gen_times.code);
>> +
>> +    /*
>> +     * The restore time covers how long we have spent restoring state
>> +     * from a given TB (e.g. recovering from a fault). It is therefor
>> +     * not related to the number of translations we have done.
>> +     */
>> +    jpi->restore_time += tbs->tb_restore_time;
>> +    jpi->restore_count += tbs->tb_restore_count;
>> +}
> 
> Why do sums of averages (stats_per_translation) instead of accumulating
> the complete total and dividing by jpi->translations?
> 
Right, there is some inconsistency here.

>> +void dump_jit_exec_time_info(uint64_t dev_time, GString *buf)
>> +{
>> +    static uint64_t last_cpu_exec_time;
>> +    uint64_t cpu_exec_time;
>> +    uint64_t delta;
>> +
>> +    cpu_exec_time = tcg_cpu_exec_time();
>> +    delta = cpu_exec_time - last_cpu_exec_time;
>> +
>> +    g_string_append_printf(buf, "async time  %" PRId64 " (%0.3f)\n",
>> +                           dev_time, dev_time / (double)NANOSECONDS_PER_SECOND);
>> +    g_string_append_printf(buf, "qemu time   %" PRId64 " (%0.3f)\n",
>> +                           delta, delta / (double)NANOSECONDS_PER_SECOND);
>> +    last_cpu_exec_time = cpu_exec_time;
>> +}
>> +
>> +/* dump JIT statisticis using TCGProfile and TBStats */
> 
> "statistics"
> 
ok.

>> diff --git a/accel/tcg/tcg-accel-ops.c b/accel/tcg/tcg-accel-ops.c
>> index 3973591508..749ad182f2 100644
>> --- a/accel/tcg/tcg-accel-ops.c
>> +++ b/accel/tcg/tcg-accel-ops.c
>> @@ -70,10 +70,17 @@ void tcg_cpus_destroy(CPUState *cpu)
>>   int tcg_cpus_exec(CPUState *cpu)
>>   {
>>       int ret;
>> +    uint64_t ti;
>> +
>>       assert(tcg_enabled());
>> +    ti = profile_getclock();
>> +
>>       cpu_exec_start(cpu);
>>       ret = cpu_exec(cpu);
>>       cpu_exec_end(cpu);
>> +
>> +    qatomic_add(&tcg_ctx->prof.cpu_exec_time, profile_getclock() - ti);
> 
> You can't qatomic_add a 64-bit value on a 32-bit host.

Right, I only changed the counters to size_t and didn't fix the time
part. Would it be possible to support this with some kind of lock on
32-bit hosts, as a generic fallback? After all, 32-bit hosts don't seem
very popular, so it might be acceptable to sacrifice a little
performance there.

For me, the times 'dev_time' and 'cpu_exec_time' are only there to give
the developer a sense of how much of the total time is spent on stages
such as "ir" and "code". Alex mentioned in another thread that it's
arguable whether they are useful. If not, we can remove them.

> Nor is tcg_ctx a good place to put this.
> 
> If you want to accumulate per-cpu data, put it on the cpu where there's
> no chance of racing with anyone.
> 
> Finally, I suspect that this isn't even where you want to accumulate
> time for execution as separate from translation.  You'd to that in
> cpu_exec_enter/cpu_exec_exit.
> 
>> @@ -296,6 +315,8 @@ static TBStatistics *tb_get_stats(tb_page_addr_t phys_pc, target_ulong pc,
>>       new_stats->cs_base = cs_base;
>>       new_stats->flags = flags;
>>       new_stats->stats_enabled = get_default_tbstats_flag();
>> +    new_stats->tbs = g_ptr_array_sized_new(4);
> 
> Why default to 4?  Is that just a guess, or are we really recomputing
> tbs that frequently?  Is there a good reason not to use g_ptr_array_new()?
> 
Is this the same question as in 2019 :) I cannot find the original link
right now; I will try to find it again later.

>> diff --git a/accel/tcg/translator.c b/accel/tcg/translator.c
>> index 80ffbfb455..a60a92091b 100644
>> --- a/accel/tcg/translator.c
>> +++ b/accel/tcg/translator.c
>> @@ -19,7 +19,7 @@
>>   #include "exec/plugin-gen.h"
>>   #include "exec/replay-core.h"
>>   -static void gen_tb_exec_count(TranslationBlock *tb)
>> +static inline void gen_tb_exec_count(TranslationBlock *tb)
> 
> Why?
> 
will remove.

>>   {
>>       if (tb_stats_enabled(tb, TB_EXEC_STATS)) {
>>           TCGv_ptr ptr = tcg_temp_ebb_new_ptr();
>> @@ -147,6 +147,11 @@ void translator_loop(CPUState *cpu, TranslationBlock *tb, int *max_insns,
>>       tb->size = db->pc_next - db->pc_first;
>>       tb->icount = db->num_insns;
>>   +    /* Save number of guest instructions for TB_JIT_STATS */
>> +    if (tb_stats_enabled(tb, TB_JIT_STATS)) {
>> +        tb->tb_stats->code.num_guest_inst += db->num_insns;
>> +    }
> 
> Why do this here, as opposed to the block in tb_gen_code?
> 
It keeps the accounting close to the scene of the action?

>> +#define stat_per_translation(stat, name) \
>> +    (stat->translations.total ? stat->name / stat->translations.total : 0)
> 
> Not a fan of this macro, and as per above, the reason for doing the
> division here is questionable.
> 
>> diff --git a/include/qemu/timer.h b/include/qemu/timer.h
>> index 9a91cb1248..ad0da18a5f 100644
>> --- a/include/qemu/timer.h
>> +++ b/include/qemu/timer.h
>> @@ -989,4 +989,10 @@ static inline int64_t cpu_get_host_ticks(void)
>>   }
>>   #endif
>>   +static inline int64_t profile_getclock(void)
>> +{
>> +    return get_clock();
>> +}
> 
> Why?  You use get_clock directly most places.
> 
It was kept for the ones still in TCGProfile, but I think it's okay to
remove it.

>> +/* Timestamps during translation  */
>> +typedef struct TCGTime {
>> +    uint64_t start;
>> +    uint64_t ir_done;
>> +    uint64_t opt_done;
>> +    uint64_t la_done;
>> +    uint64_t code_done;
>> +} TCGTime;
> 
> int64_t would match the result of get_clock().
> 
OK. BTW, there are still several occurrences of uint64_t for time
values in the existing code.

>> +
>> +/*
>> + * The TCGProfile structure holds data for the lifetime of the translator.
>> + */
>> +typedef struct TCGProfile {
>> +    /* exec time of this vcpu */
>> +    int64_t cpu_exec_time;
> 
> TCGContext is not per-cpu, and therefore TCGProfile does not correspond
> either.
> 
>> @@ -608,6 +630,7 @@ struct TCGContext {
>>         /* Exit to translator on overflow. */
>>       sigjmp_buf jmp_trans;
>> +    TranslationBlock *current_tb;
> 
> TCGContext.gen_tb already exists.
> 
There is no user of it actually; I will remove it.

>> -int64_t tcg_cpu_exec_time(void);
>> +uint64_t tcg_cpu_exec_time(void);
> 
> Why?  (Also, probably wants removing, per above.)
> 
The original patch mentioned 'make it an uint64_t as we won't be dealing
in negative numbers.'

>> --- a/softmmu/runstate.c
>> +++ b/softmmu/runstate.c
>> @@ -728,9 +728,18 @@ static bool main_loop_should_exit(int *status)
>>   int qemu_main_loop(void)
>>   {
>>       int status = EXIT_SUCCESS;
>> +#ifdef CONFIG_TCG
>> +    uint64_t ti;
>> +#endif
>>         while (!main_loop_should_exit(&status)) {
>> +#ifdef CONFIG_TCG
>> +        ti = profile_getclock();
>> +#endif
>>           main_loop_wait(false);
>> +#ifdef CONFIG_TCG
>> +        dev_time += profile_getclock() - ti;
>> +#endif
>>       }
> 
> What is this intending to collect?  Because I don't think it measures
> anything.  Certainly nothing related to TCG, CPUs or even devices.
> 
It's exported to the HMP command via dump_jit_exec_time_info() together
with cpu_exec_time.

>> +    /* ? won't this end up op_opt - op = del_op_count ? */
>> +    if (tb_stats_enabled(s->gen_tb, TB_JIT_STATS)) {
>> +        s->gen_tb->tb_stats->code.deleted_ops++;
>> +    }
> 
> Not quite.  We can emit new ops as well.  But how useful this is on its
> own is debatable.
> 
okay, so I will remove the comment.

>> +/* avoid copy/paste errors */
>> +#define PROF_ADD(to, from, field)                       \
>> +    do {                                                \
>> +        (to)->field += qatomic_read(&((from)->field));  \
>> +    } while (0)
> 
> It is only used twice.
> In addition, you can't use qatomic_read of a 64-bit variable on a 32-bit
> host.
> You really really need to build e.g. i386.
> 
>> @@ -5879,6 +5923,7 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb, uint64_t pc_start)
>>           }
>>       }
>>   +
>>   #ifdef CONFIG_DEBUG_TCG
> 
> Stray.
> 
will fix.

Thanks,
Fei.

> 
> r~



