[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup r
From: |
Fabiano Rosas |
Subject: |
Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races |
Date: |
Wed, 07 Feb 2024 10:31:51 -0300 |
Peter Xu <peterx@redhat.com> writes:
> On Tue, Feb 06, 2024 at 06:51:12PM -0300, Fabiano Rosas wrote:
>> Based-on: 20240202102857.110210-1-peterx@redhat.com
>> [PATCH v2 00/23] migration/multifd: Refactor ->send_prepare() and cleanups
>> 20240202102857.110210-1-peterx@redhat.com">https://lore.kernel.org/r/20240202102857.110210-1-peterx@redhat.com
>>
>> Hi,
>>
>> For v3 I fixed the refcounting issue spotted by Avihai. The situation
>> there is a bit clunky due to historical reasons. The gist is that we
>> have an assumption that channel creation never fails after p->c has
>> been set, so when 'p->c == NULL' we have to unref and when 'p->c !=
>> NULL' the cleanup code will do the unref.
>
> Yes, this looks good to me. That's a good catch.
>
> I'll leave at least one more day for Avihai and/or Dan to have another
> look. My r-b persist as of now on patch 5.
>
> Actually I think the conditional unref is slightly tricky, but it's not its
> own fault, IMHO, OTOH it's more about a1af605bd5ad where p->c is slightly
> abused. My understanding is we can avoid that conditional unref with below
> patch 1 as a cleanup (on top of this series). Then patch 2 comes all
> alongside.
Yes, I even wrote some code to always set p->c and leave the unref to
the cleanup routine. Doing reference counting in the middle of the code
like that leaves us exposed to new code relying on p->c being valid
during cleanup. There's already yank and the cleanup itself which expect
p->c to be valid.
However, I couldn't get past the uglyness of overwriting p->c, so I kept
the changes minimal for this version.
I'm also wondering whether we can remove the TLS handshake thread
altogether now that we moved the multifd_send_setup call into the
migration thread. My (poor) understanding is that a1af605bd5ad hit the
issue that the QIOTask completion would just execute after
multifd_save_setup returned. Otherwise I don't see how adding a thread
to an already async task would have helped. But I need to think about
that a bit more.
>
> We don't need to rush on these, though, we should fix the thread race
>first because multiple of us hit it, and all cleanups can be done
>later.
I said we should not merge these two changes right now, but I take that
back. I'll leave it up to you, there doesn't seem to be that much impact
in adding them.
>
> =====
> From 0830819d86e08c5175d6669505aa712a0a09717f Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Wed, 7 Feb 2024 10:08:35 +0800
> Subject: [PATCH 1/2] migration/multifd: Cleanup TLS iochannel referencing
>
> Commit a1af605bd5 ("migration/multifd: fix hangup with TLS-Multifd due to
> blocking handshake") introduced a thread for TLS channels, which will
> resolve the issue on blocking the main thread. However in the same commit
> p->c is slightly abused just to be able to pass over the pointer "p" into
> the thread.
>
> That's the major reason we'll need to conditionally free the io channel in
> the fault paths.
>
> To clean it up, using a separate structure to pass over both "p" and "tioc"
> in the tls handshake thread. Then we can make it a rule that p->c will
> never be set until the channel is completely setup. With that, we can drop
> the tricky conditional unref of the io channel in the error path.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> migration/multifd.c | 37 +++++++++++++++++++++++--------------
> 1 file changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/migration/multifd.c b/migration/multifd.c
> index adfe8c9a0a..4a85a6b7b3 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -873,16 +873,22 @@ out:
>
> static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque);
>
> +typedef struct {
> + MultiFDSendParams *p;
> + QIOChannelTLS *tioc;
> +} MultiFDTLSThreadArgs;
> +
> static void *multifd_tls_handshake_thread(void *opaque)
> {
> - MultiFDSendParams *p = opaque;
> - QIOChannelTLS *tioc = QIO_CHANNEL_TLS(p->c);
> + MultiFDTLSThreadArgs *args = opaque;
>
> - qio_channel_tls_handshake(tioc,
> + qio_channel_tls_handshake(args->tioc,
> multifd_new_send_channel_async,
> - p,
> + args->p,
> NULL,
> NULL);
> + g_free(args);
> +
> return NULL;
> }
>
> @@ -892,6 +898,7 @@ static bool multifd_tls_channel_connect(MultiFDSendParams
> *p,
> {
> MigrationState *s = migrate_get_current();
> const char *hostname = s->hostname;
> + MultiFDTLSThreadArgs *args;
> QIOChannelTLS *tioc;
>
> tioc = migration_tls_client_create(ioc, hostname, errp);
> @@ -906,11 +913,14 @@ static bool
> multifd_tls_channel_connect(MultiFDSendParams *p,
> object_unref(OBJECT(ioc));
> trace_multifd_tls_outgoing_handshake_start(ioc, tioc, hostname);
> qio_channel_set_name(QIO_CHANNEL(tioc), "multifd-tls-outgoing");
> - p->c = QIO_CHANNEL(tioc);
This assignment also meant multifd_send_channel_destroy() would call
object_unref on the tioc object. Removing it means
qio_channel_tls_finalize() will never be called.
It also means the socket channel (ioc) refcount will be decremented one
too many times, due to the object_unref above^.
So I think we should find a point where tioc is not needed anymore and
unref it and remove the object_unref(ioc) above.
Right?
> +
> + args = g_new0(MultiFDTLSThreadArgs, 1);
> + args->tioc = tioc;
> + args->p = p;
>
> p->tls_thread_created = true;
> qemu_thread_create(&p->tls_thread, "multifd-tls-handshake-worker",
> - multifd_tls_handshake_thread, p,
> + multifd_tls_handshake_thread, args,
> QEMU_THREAD_JOINABLE);
> return true;
> }
> @@ -923,6 +933,7 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
>
> migration_ioc_register_yank(ioc);
> p->registered_yank = true;
> + /* Setup p->c only if the channel is completely setup */
> p->c = ioc;
>
> p->thread_created = true;
> @@ -976,14 +987,12 @@ out:
>
> trace_multifd_new_send_channel_async_error(p->id, local_err);
> multifd_send_set_error(local_err);
> - if (!p->c) {
> - /*
> - * If no channel has been created, drop the initial
> - * reference. Otherwise cleanup happens at
> - * multifd_send_channel_destroy()
> - */
> - object_unref(OBJECT(ioc));
> - }
> + /*
> + * For error cases (TLS or non-TLS), IO channel is always freed here
> + * rather than when cleanup multifd: since p->c is not set, multifd
> + * cleanup code doesn't even know its existance.
> + */
> + object_unref(OBJECT(ioc));
> error_free(local_err);
> }
>
> --
> 2.43.0
> =====
> From 9e574c3216f6459e3a808096d905e2554d962cad Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Wed, 7 Feb 2024 10:24:39 +0800
> Subject: [PATCH 2/2] migration/multifd: Drop registered_yank
>
> With a clear definition of p->c protocol, where we only set it up if the
> channel is fully established (TLS or non-TLS), registered_yank boolean will
> have equal meaning of "p->c != NULL".
>
> Drop registered_yank by checking p->c instead.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
This one looks good. I know it depends on the previous patch, but if you
plan to add it:
Reviewed-by: Fabiano Rosas <farosas@suse.de>
> ---
> migration/multifd.h | 2 --
> migration/multifd.c | 7 +++----
> 2 files changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 8a1cad0996..b3fe27ae93 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -78,8 +78,6 @@ typedef struct {
> bool tls_thread_created;
> /* communication channel */
> QIOChannel *c;
> - /* is the yank function registered */
> - bool registered_yank;
> /* packet allocated len */
> uint32_t packet_len;
> /* guest page size */
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 4a85a6b7b3..278453cf84 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -648,11 +648,11 @@ static int multifd_send_channel_destroy(QIOChannel
> *send)
>
> static bool multifd_send_cleanup_channel(MultiFDSendParams *p, Error **errp)
> {
> - if (p->registered_yank) {
> + if (p->c) {
> migration_ioc_unregister_yank(p->c);
> + multifd_send_channel_destroy(p->c);
> + p->c = NULL;
> }
> - multifd_send_channel_destroy(p->c);
> - p->c = NULL;
> qemu_sem_destroy(&p->sem);
> qemu_sem_destroy(&p->sem_sync);
> g_free(p->name);
> @@ -932,7 +932,6 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
> qio_channel_set_delay(ioc, false);
>
> migration_ioc_register_yank(ioc);
> - p->registered_yank = true;
> /* Setup p->c only if the channel is completely setup */
> p->c = ioc;
>
> --
> 2.43.0
> ====
>
> Thanks,
- [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races, Fabiano Rosas, 2024/02/06
- [PATCH v3 1/6] migration/multifd: Join the TLS thread, Fabiano Rosas, 2024/02/06
- [PATCH v3 2/6] migration/multifd: Remove p->running, Fabiano Rosas, 2024/02/06
- [PATCH v3 3/6] migration/multifd: Move multifd_send_setup error handling in to the function, Fabiano Rosas, 2024/02/06
- [PATCH v3 5/6] migration/multifd: Unify multifd and TLS connection paths, Fabiano Rosas, 2024/02/06
- [PATCH v3 4/6] migration/multifd: Move multifd_send_setup into migration thread, Fabiano Rosas, 2024/02/06
- [PATCH v3 6/6] migration/multifd: Add a synchronization point for channel creation, Fabiano Rosas, 2024/02/06
- Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races, Peter Xu, 2024/02/06
- Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races,
Fabiano Rosas <=
- Re: [PATCH v3 0/6] migration/multifd: Fix channel creation vs. cleanup races, Peter Xu, 2024/02/07