Re: [PATCH 1/3 gnumach] vm_map: Fix deadlock in vm code
From: Sergey Bugaev
Subject: Re: [PATCH 1/3 gnumach] vm_map: Fix deadlock in vm code
Date: Wed, 21 Feb 2024 22:22:32 +0300
On Wed, Feb 21, 2024 at 9:44 PM Samuel Thibault <samuel.thibault@gnu.org> wrote:
> Damien Zammit, le mer. 21 févr. 2024 13:45:59 +0000, a écrit:
> > Authored-by: Sergey Bugaev <bugaevc@gmail.com>
> IIRC you got to see which kind of deadlock we are protecting from here?
> It'd be very useful to document it here, so next people reading this get
> to know.
I didn't document the exact deadlock in the code, because the deadlock
is only one manifestation of the actual issue: the entry was being
modified by other threads while it was being accessed inside
vm_fault_wire(). For one thing, that is technically undefined
behavior; but it will also, of course, break any logic that works
with the entry's offset / address range, so the outcome could be
worse than a deadlock. What I wanted to emphasize in the code is
that we must make sure the entry is not concurrently read and
modified; I was going to describe the actual deadlock we've
witnessed in the commit message.
So let's go with that; I'll re-send this patch myself with a detailed
commit message and reworked comment.
> > [...] We trust that the
> > * kernel threads are well-behaved, and therefore will
> > * not do anything destructive to this region of the map
> > * while we have it unlocked.
>
> Are we not here precisely in the case that they are not well-behaved?
No. The threads are well-behaved, but it turns out that's not quite
enough: as the new comment says, in the face of clipping and
coalescing of entries, even well-behaved, well-intentioned operations
on *adjacent* VM regions can affect the same entry. In fact, the
entry can get freed altogether if it gets coalesced into the previous
entry, which would create a use-after-free.
> The comment probably deserves an update, to make the situation clear.
I thought it was clear enough, but evidently it wasn't. I'll amend it.
> > + * Once we unlock the map, even well-intentioned operations
> > + * on adjacent VM regions can end up affecting our entry,
> > + * due to clipping and coalescing entries. So, make a
> > + * temporary copy of the entry, and pass that to vm_fault_wire()
> > + * instead of the original.
>
> We need to be very careful with such a thing. If it's a copy that is
> given to further function calls, we need to make sure that they won't
> try to modify it, i.e. add const qualifiers on the entry parameter all
> along to vm_fault_wire and its callees being passed entry.
That's a good idea, yes.
>
> That being said, this doesn't seem completely safe to me: in the end
> vm_fault_wire_fast uses the object field, are we sure that the object
> won't disappear? Or are we sure it won't because "kernel threads are
> well-behaved"? (but I don't really know how we are sure of this, so this
> probably also deserves documenting).
We are sure because kernel threads (or rather, any threads running in
the kernel) are well-behaved, yes. We trust that they won't unmap the
mapping while it's being wired, and the mapping (the entry) keeps the
object alive.
But... now that I think of it, we do also coalesce objects. So we
better keep a second reference on the object over vm_fault_wire
indeed.
> > + *
> > * HACK HACK HACK HACK
> > */
> > if (vm_map_pmap(map) == kernel_pmap) {
> > + /*
> > + * TODO: Support wiring more than one entry
> > + * in the kernel map at a time.
> > + */
> > + assert(start->vme_next == end);
>
> Are we sure that this currently is never to happen?
Not really. I don't expect the kernel to do this anywhere, but I don't
know for sure. Please try running some complex workloads with this
patch and see whether the assertion ever fires.
There isn't anything fundamentally incompatible with supporting
multi-entry wiring here, I just didn't want to do an alloca.
> > + entry_copy = *start;
> > vm_map_unlock(map); /* trust me ... */
> > - } else {
> > - vm_map_lock_set_recursive(map);
> > - vm_map_lock_write_to_read(map);
> > + vm_fault_wire(map, &entry_copy);
>
> This is also assuming that for non-kernel_pmap there is just one entry
> for which to call vm_fault_wire, while userland can very well ask for
> wiring a wide range of memory, spanning various objects and whatnot.
No, this is still the kernel map branch (note that the "else" is being
deleted by the patch).
> AIUI you'd rather want the converse: inline the code for the kernel_pmap
> case, which is much more trivial since start->vme_next == end and thus
> no for loop to perform, just call vm_fault_wire and the locking stuff
> around, and be done.
That's exactly what this patch does, yes.
Sergey