qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup r


From: Fabiano Rosas
Subject: Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races
Date: Wed, 07 Feb 2024 10:31:51 -0300

Peter Xu <peterx@redhat.com> writes:

> On Tue, Feb 06, 2024 at 06:51:12PM -0300, Fabiano Rosas wrote:
>> Based-on: 20240202102857.110210-1-peterx@redhat.com
>> [PATCH v2 00/23] migration/multifd: Refactor ->send_prepare() and cleanups
>> 20240202102857.110210-1-peterx@redhat.com">https://lore.kernel.org/r/20240202102857.110210-1-peterx@redhat.com
>> 
>> Hi,
>> 
>> For v3 I fixed the refcounting issue spotted by Avihai. The situation
>> there is a bit clunky due to historical reasons. The gist is that we
>> have an assumption that channel creation never fails after p->c has
>> been set, so when 'p->c == NULL' we have to unref and when 'p->c !=
>> NULL' the cleanup code will do the unref.
>
> Yes, this looks good to me.  That's a good catch.
>
> I'll leave at least one more day for Avihai and/or Dan to have another
> look.  My r-b persist as of now on patch 5.
>
> Actually I think the conditional unref is slightly tricky, but it's not its
> own fault, IMHO, OTOH it's more about a1af605bd5ad where p->c is slightly
> abused.  My understanding is we can avoid that conditional unref with below
> patch 1 as a cleanup (on top of this series).  Then patch 2 comes all
> alongside.

Yes, I even wrote some code to always set p->c and leave the unref to
the cleanup routine. Doing reference counting in the middle of the code
like that leaves us exposed to new code relying on p->c being valid
during cleanup. There's already yank and the cleanup itself which expect
p->c to be valid.

However, I couldn't get past the uglyness of overwriting p->c, so I kept
the changes minimal for this version.

I'm also wondering whether we can remove the TLS handshake thread
altogether now that we moved the multifd_send_setup call into the
migration thread. My (poor) understanding is that a1af605bd5ad hit the
issue that the QIOTask completion would just execute after
multifd_save_setup returned. Otherwise I don't see how adding a thread
to an already async task would have helped. But I need to think about
that a bit more.

>
> We don't need to rush on these, though, we should fix the thread race
>first because multiple of us hit it, and all cleanups can be done
>later.

I said we should not merge these two changes right now, but I take that
back. I'll leave it up to you, there doesn't seem to be that much impact
in adding them.

>
> =====
> From 0830819d86e08c5175d6669505aa712a0a09717f Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Wed, 7 Feb 2024 10:08:35 +0800
> Subject: [PATCH 1/2] migration/multifd: Cleanup TLS iochannel referencing
>
> Commit a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to
> blocking handshake") introduced a thread for TLS channels, which will
> resolve the issue on blocking the main thread.  However in the same commit
> p->c is slightly abused just to be able to pass over the pointer "p" into
> the thread.
>
> That's the major reason we'll need to conditionally free the io channel in
> the fault paths.
>
> To clean it up, using a separate structure to pass over both "p" and "tioc"
> in the tls handshake thread.  Then we can make it a rule that p->c will
> never be set until the channel is completely setup.  With that, we can drop
> the tricky conditional unref of the io channel in the error path.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/multifd.c | 37 +++++++++++++++++++++++--------------
>  1 file changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index adfe8c9a0a..4a85a6b7b3 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -873,16 +873,22 @@ out:
>  
>  static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
>  
> +typedef struct {
> +    MultiFDSendParams *p;
> +    QIOChannelTLS *tioc;
> +} MultiFDTLSThreadArgs;
> +
>  static void *multifd_tls_handshake_thread(void *opaque)
>  {
> -    MultiFDSendParams *p = opaque;
> -    QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c);
> +    MultiFDTLSThreadArgs *args = opaque;
>  
> -    qio_channel_tls_handshake(tioc,
> +    qio_channel_tls_handshake(args->tioc,
>                                multifd_new_send_channel_async,
> -                              p,
> +                              args->p,
>                                NULL,
>                                NULL);
> +    g_free(args);
> +
>      return NULL;
>  }
>  
> @@ -892,6 +898,7 @@ static bool multifd_tls_channel_connect(MultiFDSendParams 
> *p,
>  {
>      MigrationState *s = migrate_get_current();
>      const char *hostname = s->hostname;
> +    MultiFDTLSThreadArgs *args;
>      QIOChannelTLS *tioc;
>  
>      tioc = migration_tls_client_create(ioc, hostname, errp);
> @@ -906,11 +913,14 @@ static bool 
> multifd_tls_channel_connect(MultiFDSendParams *p,
>      object_unref(OBJECT(ioc));
>      trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname);
>      qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing");
> -    p->c = QIO_CHANNEL(tioc);

This assignment also meant multifd_send_channel_destroy() would call
object_unref on the tioc object. Removing it means
qio_channel_tls_finalize() will never be called.

It also means the socket channel (ioc) refcount will be decremented one
too many times, due to the object_unref above^.

So I think we should find a point where tioc is not needed anymore and
unref it and remove the object_unref(ioc) above.

Right?

> +
> +    args = g_new0(MultiFDTLSThreadArgs, 1);
> +    args->tioc = tioc;
> +    args->p = p;
>  
>      p->tls_thread_created = true;
>      qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker",
> -                       multifd_tls_handshake_thread, p,
> +                       multifd_tls_handshake_thread, args,
>                         QEMU_THREAD_JOINABLE);
>      return true;
>  }
> @@ -923,6 +933,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>  
>      migration_ioc_register_yank(ioc);
>      p->registered_yank = true;
> +    /* Setup p->c only if the channel is completely setup */
>      p->c = ioc;
>  
>      p->thread_created = true;
> @@ -976,14 +987,12 @@ out:
>  
>      trace_multifd_new_send_channel_async_error(p->id, local_err);
>      multifd_send_set_error(local_err);
> -    if (!p->c) {
> -        /*
> -         * If no channel has been created, drop the initial
> -         * reference. Otherwise cleanup happens at
> -         * multifd_send_channel_destroy()
> -         */
> -        object_unref(OBJECT(ioc));
> -    }
> +    /*
> +     * For error cases (TLS or non-TLS), IO channel is always freed here
> +     * rather than when cleanup multifd: since p->c is not set, multifd
> +     * cleanup code doesn't even know its existance.
> +     */
> +    object_unref(OBJECT(ioc));
>      error_free(local_err);
>  }
>  
> -- 
> 2.43.0
> =====
> From 9e574c3216f6459e3a808096d905e2554d962cad Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Wed, 7 Feb 2024 10:24:39 +0800
> Subject: [PATCH 2/2] migration/multifd: Drop registered_yank
>
> With a clear definition of p->c protocol, where we only set it up if the
> channel is fully established (TLS or non-TLS), registered_yank boolean will
> have equal meaning of "p->c != NULL".
>
> Drop registered_yank by checking p->c instead.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

This one looks good. I know it depends on the previous patch, but if you
plan to add it:

Reviewed-by: Fabiano Rosas <farosas@suse.de>

> ---
>  migration/multifd.h | 2 --
>  migration/multifd.c | 7 +++----
>  2 files changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 8a1cad0996..b3fe27ae93 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -78,8 +78,6 @@ typedef struct {
>      bool tls_thread_created;
>      /* communication channel */
>      QIOChannel *c;
> -    /* is the yank function registered */
> -    bool registered_yank;
>      /* packet allocated len */
>      uint32_t packet_len;
>      /* guest page size */
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 4a85a6b7b3..278453cf84 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -648,11 +648,11 @@ static int multifd_send_channel_destroy(QIOChannel 
> *send)
>  
>  static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
>  {
> -    if (p->registered_yank) {
> +    if (p->c) {
>          migration_ioc_unregister_yank(p->c);
> +        multifd_send_channel_destroy(p->c);
> +        p->c = NULL;
>      }
> -    multifd_send_channel_destroy(p->c);
> -    p->c = NULL;
>      qemu_sem_destroy(&p->sem);
>      qemu_sem_destroy(&p->sem_sync);
>      g_free(p->name);
> @@ -932,7 +932,6 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>      qio_channel_set_delay(ioc, false);
>  
>      migration_ioc_register_yank(ioc);
> -    p->registered_yank = true;
>      /* Setup p->c only if the channel is completely setup */
>      p->c = ioc;
>  
> -- 
> 2.43.0
> ====
>
> Thanks,



reply via email to

[Prev in Thread] Current Thread [Next in Thread]