Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()


From: David Hildenbrand
Subject: Re: [Qemu-devel] [PATCH v1 2/4] s390x/tcg: Introduce probe_read_access()
Date: Wed, 21 Aug 2019 19:37:53 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.8.0

On 21.08.19 19:26, Richard Henderson wrote:
> On 8/21/19 2:22 AM, David Hildenbrand wrote:
>> +/*
>> + * Make sure the read access is permitted and TLB entries are created. In
>> + * very rare cases it might happen that the actual accesses might need
>> + * new MMU translations. If the page tables were changed in between, we
>> + * might still trigger a fault. However, this seems to barely happen, so we
>> + * can ignore this for now.
>> + */
>> +void probe_read_access(CPUS390XState *env, uint64_t addr, uint64_t len,
>> +                       uintptr_t ra)
>> +{
>> +#ifdef CONFIG_USER_ONLY
>> +    if (!guest_addr_valid(addr) || !guest_addr_valid(addr + len - 1) ||
>> +        page_check_range(addr, len, PAGE_READ) < 0) {
>> +        s390_program_interrupt(env, PGM_ADDRESSING, ILEN_AUTO, ra);
>> +    }
>> +#else
>> +    while (len) {
>> +        const uint64_t pagelen = -(addr | -TARGET_PAGE_MASK);
>> +        const uint64_t curlen = MIN(pagelen, len);
>> +
>> +        cpu_ldub_data_ra(env, addr, ra);
>> +        addr = wrap_address(env, addr + curlen);
>> +        len -= curlen;
>> +    }
>> +#endif
>> +}
> 
> I don't think this is really the right approach, precisely because of the
> comment above.
> 
> I think we should
> 
> (1) Modify the generic probe_write to return the host address,
>     akin to tlb_vaddr_to_host except it *will* fault.
> 
> (2) Create a generic version of probe_write for CONFIG_USER_ONLY,
>     much like the one you have done for target/s390x.
> 
> (3) Create generic version of probe_read that does the same.
> 
> (4) Rewrite fast_memset and fast_memmove to fetch all of the host
>     addresses before doing any modifications.  The functions are
>     currently written as if len can be very large, handling any
>     number of pages.  Except that's not true.  While there are
>     several kinds of users apart from MVC, two pages are sufficient
>     for all users.
> 
>     Well, should be.  We would need to adjust do_mvcl to limit the
>     operation to TARGET_PAGE_SIZE (CC=3, cpu-determined number of
>     bytes moved without reaching end of first operand).
>     Which is probably a good idea anyway.  System mode should not
>     spend forever executing one instruction, as it would if you
>     pass in a 64-bit length from MVCLE.
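
(As a rough, target-specific sketch of the direction of (1)/(2) -- the name
probe_write_host() and the exact plumbing are placeholders for illustration,
not an existing API:)

/*
 * Sketch: probe one page for writing and return the host address,
 * raising the guest fault via the ordinary helpers if the access is
 * not permitted.
 */
static void *probe_write_host(CPUS390XState *env, target_ulong addr,
                              int mmu_idx, uintptr_t ra)
{
#ifdef CONFIG_USER_ONLY
    if (!guest_addr_valid(addr) ||
        page_check_range(addr, 1, PAGE_WRITE) < 0) {
        s390_program_interrupt(env, PGM_ADDRESSING, ILEN_AUTO, ra);
    }
    return g2h(addr);
#else
    void *host = tlb_vaddr_to_host(env, addr, MMU_DATA_STORE, mmu_idx);

    if (!host) {
        /* Force the translation (and any guest fault) by storing back the
         * byte that is already there, then retry the lookup. */
        cpu_stb_data_ra(env, addr, cpu_ldub_data_ra(env, addr, ra), ra);
        host = tlb_vaddr_to_host(env, addr, MMU_DATA_STORE, mmu_idx);
    }
    return host;
#endif
}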


Hah, guess what, I implemented a similar variant of "fetch all
of the host addresses" *but* it is not as easy as you might
think (sorry for the bad news).

There are certain cases where we can't get access to the raw host
page, namely cpu watchpoints, LAP, and NOTDIRTY. In summary: this won't
work as you describe. (my first approach did exactly this)

The following patch requires another re-factoring
(tcg_s390_cpu_mmu_translate), but you should get the idea.
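
(As a worked example of the overlap case that access_copy() in the patch
below has to reproduce -- the buffer here is purely illustrative:)

/* MVC with dest = src + 4 and length 8 copies strictly left to right: */
char mem[12] = "abcdefgh";          /* src = &mem[0], dest = &mem[4] */
for (int i = 0; i < 8; i++) {
    mem[4 + i] = mem[i];            /* bytes 4..7 are re-read after update */
}
/*
 * The 8 destination bytes end up as "abcdabcd" (mem[] is now
 * "abcdabcdabcd"), unlike memmove(), which would produce "abcdabcdefgh".
 * Copying in chunks of at most dest - src (= 4) bytes matches this
 * byte-by-byte semantics, which is why do_helper_mvc() limits chunk_size
 * to the address difference for the destructive overlap case.
 */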



From 0cacd2aea3dbc25e93492cca04f6c866b86d7f8a Mon Sep 17 00:00:00 2001
From: David Hildenbrand <address@hidden>
Date: Tue, 20 Aug 2019 09:37:11 +0200
Subject: [PATCH v1] s390x/tcg: Fault-safe MVC (MOVE) implementation

MVC can cross page boundaries. In case we fault on the second page, we
have already partially copied data. If the operands overlap, we would
trigger that fault after having partially moved data, possibly having
already overwritten parts of the original data. When continuing after the
fault, we would then move the already modified data, not the original
data - very bad.

glibc started to use MVC for forward memmove() and is able to trigger
exactly this corruption (via rpmbuild and rpm). Fedora 31 (rawhide)
currently fails to install as we trigger rpm database corruption due to
this bug.

We need a way to translate a virtual address range to individual pages that
we can access later on without triggering faults. Probing all virtual
addresses once before the read/write is not sufficient - the guest could
have modified the page tables (e.g., write-protect, map out) in between,
so we could still fault on any new tlb_fill() - we have to avoid any new
MMU translations during the actual access.

Unfortunately, there are TLB entries for which we cannot get a host
address (esp. watchpoints, LAP, NOTDIRTY) - in these cases we cannot avoid
a new MMU translation via the ordinary ld/st helpers. Let's fall back
to guest physical addresses in these cases, which we access via
cpu_physical_memory_(read|write).

This change reduced the boot time for s390x guests (to the prompt) from
~1:29 min to ~1:19 min in my tests. For example, LAP-protected pages are
now only translated once when writing to them using MVC, and we don't
always fall back to byte-based copies.

We will want to use the same mechanism for other accesses as well (e.g.,
MVCL), so prepare for that right away.

Signed-off-by: David Hildenbrand <address@hidden>
---
 target/s390x/mem_helper.c | 213 +++++++++++++++++++++++++++++++++++---
 1 file changed, 200 insertions(+), 13 deletions(-)

diff --git a/target/s390x/mem_helper.c b/target/s390x/mem_helper.c
index 91ba2e03d9..1ca293e00d 100644
--- a/target/s390x/mem_helper.c
+++ b/target/s390x/mem_helper.c
@@ -24,8 +24,10 @@
 #include "exec/helper-proto.h"
 #include "exec/exec-all.h"
 #include "exec/cpu_ldst.h"
+#include "exec/cpu-common.h"
 #include "qemu/int128.h"
 #include "qemu/atomic128.h"
+#include "tcg_s390x.h"
 
 #if !defined(CONFIG_USER_ONLY)
 #include "hw/s390x/storage-keys.h"
@@ -104,6 +106,181 @@ static inline void cpu_stsize_data_ra(CPUS390XState *env, uint64_t addr,
     }
 }
 
+/*
+ * An access covers one page, except for the start/end of the translated
+ * virtual address range.
+ */
+typedef struct S390Access {
+    union {
+        char *haddr;
+        hwaddr paddr;
+    };
+    uint16_t size;
+    bool isHaddr;
+} S390Access;
+
+/*
+ * Prepare access to a virtual address range, guaranteeing we won't trigger
+ * faults during the actual access. Sometimes we can't get access to the
+ * host address (e.g., LAP, cpu watchpoints/PER, clean pages, ...). Then, we
+ * translate to guest physical addresses instead. We'll have to perform
+ * slower, indirect, access to these physical addresses then.
+ */
+static void access_prepare_idx(CPUS390XState *env, S390Access access[],
+                               int nb_access, vaddr vaddr, target_ulong size,
+                               MMUAccessType access_type, int mmu_idx,
+                               uintptr_t ra)
+{
+    int i = 0;
+    int cur_size;
+
+    /*
+     * After we obtained the host address of a TLB entry, that entry might
+     * be invalidated again - e.g., via tlb_set_dirty(), via another
+     * tlb_fill(). We assume here that it is fine to temporarily store the
+     * host address to access it later - we didn't agree to any tlb flush and
+     * there seems to be no mechanism protecting the return value of
+     * tlb_vaddr_to_host().
+     */
+    while (size) {
+        g_assert(i < nb_access);
+        cur_size = adj_len_to_page(size, vaddr);
+
+        access[i].isHaddr = true;
+        access[i].haddr = tlb_vaddr_to_host(env, vaddr, access_type, mmu_idx);
+        if (!access[i].haddr) {
+            access[i].isHaddr = false;
+            tcg_s390_cpu_mmu_translate(env, vaddr, cur_size, access_type,
+                                       mmu_idx, false, &access[i].paddr,
+                                       NULL, ra);
+        }
+        access[i].size = cur_size;
+
+        vaddr += cur_size;
+        size -= cur_size;
+        i++;
+    }
+
+    /* let's zero-out the remaining entries, so we have a size of 0 */
+    if (i < nb_access) {
+        memset(&access[i], 0 , sizeof(S390Access) * (nb_access - i));
+    }
+}
+
+static void access_prepare(CPUS390XState *env, S390Access access[],
+                           int nb_access, target_ulong vaddr, target_ulong size,
+                           MMUAccessType access_type, uintptr_t ra)
+{
+    int mmu_idx = cpu_mmu_index(env, false);
+
+    access_prepare_idx(env, access, nb_access, vaddr, size, access_type,
+                       mmu_idx, ra);
+}
+
+static void access_set(CPUS390XState *env, S390Access write[], int nb_write,
+                       uint8_t byte, target_ulong size)
+{
+    target_ulong cur_size;
+    void *buf = NULL;
+    int w = 0;
+
+    while (size) {
+        g_assert(w < nb_write);
+        if (!write[w].size) {
+            w++;
+            continue;
+        }
+
+        cur_size = MIN(size, write[w].size);
+        if (write[w].isHaddr) {
+            memset(write[w].haddr, byte, cur_size);
+            write[w].haddr += cur_size;
+        } else {
+#ifndef CONFIG_USER_ONLY
+            if (!buf) {
+                buf = g_malloc(TARGET_PAGE_SIZE);
+                memset(buf, byte, TARGET_PAGE_SIZE);
+            }
+            cpu_physical_memory_write(write[w].paddr, buf, cur_size);
+            write[w].paddr += cur_size;
+#else
+            g_assert_not_reached();
+#endif
+        }
+        write[w].size -= cur_size;
+        size -= cur_size;
+    }
+    g_free(buf);
+}
+
+/*
+ * Copy memory in chunks up to chunk_size. If the ranges don't overlap or
+ * if it's a forward move, this function behaves like memmove().
+ *
+ * To achieve a backwards byte-by-byte copy (e.g., MVC), the chunk_size
+ * must not be bigger than the address difference (in the worst case, 1 byte).
+ */
+static void access_copy(CPUS390XState *env, S390Access write[], int nb_write,
+                        S390Access read[], int nb_read, target_ulong size,
+                        target_ulong chunk_size)
+{
+    target_ulong cur_size;
+    void *buf = NULL;
+    int r = 0;
+    int w = 0;
+
+    g_assert(chunk_size > 0);
+    chunk_size = MIN(chunk_size, TARGET_PAGE_SIZE);
+
+    while (size) {
+        g_assert(w < nb_write);
+        g_assert(r < nb_read);
+        if (!write[w].size) {
+            w++;
+            continue;
+        }
+        if (!read[r].size) {
+            r++;
+            continue;
+        }
+        cur_size = MIN(MIN(MIN(size, write[w].size), read[r].size), chunk_size);
+
+        if (write[w].isHaddr && read[r].isHaddr) {
+            memmove(write[w].haddr, read[r].haddr,
+                    cur_size);
+            write[w].haddr += cur_size;
+            read[r].haddr += cur_size;
+#ifndef CONFIG_USER_ONLY
+        } else if (!write[w].isHaddr && read[r].isHaddr) {
+            cpu_physical_memory_write(write[w].paddr,
+                                      read[r].haddr, cur_size);
+            write[w].paddr += cur_size;
+            read[r].haddr += cur_size;
+        } else if (write[w].isHaddr && !read[r].isHaddr) {
+            cpu_physical_memory_read(read[r].paddr,
+                                     write[w].haddr, cur_size);
+            write[w].haddr += cur_size;
+            read[r].paddr += cur_size;
+        } else {
+            if (!buf) {
+                buf = g_malloc(chunk_size);
+            }
+            cpu_physical_memory_read(read[r].paddr, buf, cur_size);
+            cpu_physical_memory_write(write[w].paddr, buf, cur_size);
+            write[w].paddr += cur_size;
+            read[r].paddr += cur_size;
+#else
+        } else {
+            g_assert_not_reached();
+#endif
+        }
+        write[w].size -= cur_size;
+        read[r].size -= cur_size;
+        size -= cur_size;
+    }
+    g_free(buf);
+}
+
 static void fast_memset(CPUS390XState *env, uint64_t dest, uint8_t byte,
                         uint32_t l, uintptr_t ra)
 {
@@ -302,24 +479,34 @@ uint32_t HELPER(oc)(CPUS390XState *env, uint32_t l, uint64_t dest,
 static uint32_t do_helper_mvc(CPUS390XState *env, uint32_t l, uint64_t dest,
                               uint64_t src, uintptr_t ra)
 {
-    uint32_t i;
+    /* 256 bytes cannot cross more than two pages */
+    S390Access read[2];
+    S390Access write[2];
 
     HELPER_LOG("%s l %d dest %" PRIx64 " src %" PRIx64 "\n",
                __func__, l, dest, src);
+    l++;
 
-    /* mvc and memmove do not behave the same when areas overlap! */
-    /* mvc with source pointing to the byte after the destination is the
-       same as memset with the first source byte */
+    g_assert(l <= 256);
+    access_prepare(env, write, ARRAY_SIZE(write), dest, l, MMU_DATA_STORE, ra);
+
+    /*
+     * The result of MVC is as if moving single bytes from left to right
+     * (in contrast to memmove()). It can be used like memset().
+     */
     if (dest == src + 1) {
-        fast_memset(env, dest, cpu_ldub_data_ra(env, src, ra), l + 1, ra);
-    } else if (dest < src || src + l < dest) {
-        fast_memmove(env, dest, src, l + 1, ra);
-    } else {
-        /* slow version with byte accesses which always work */
-        for (i = 0; i <= l; i++) {
-            uint8_t x = cpu_ldub_data_ra(env, src + i, ra);
-            cpu_stb_data_ra(env, dest + i, x, ra);
-        }
+        access_set(env, write, ARRAY_SIZE(write),
+                   cpu_ldub_data_ra(env, src, ra), l);
+        return env->cc_op;
+    }
+
+    access_prepare(env, read, ARRAY_SIZE(read), src, l, MMU_DATA_LOAD, ra);
+    if (dest < src || src + l <= dest) {
+        access_copy(env, write, ARRAY_SIZE(write), read, ARRAY_SIZE(read), l,
+                    TARGET_PAGE_SIZE);
+    } else if (src < dest) {
+        access_copy(env, write, ARRAY_SIZE(write), read, ARRAY_SIZE(read), l,
+                    dest - src);
     }
 
     return env->cc_op;
-- 
2.21.0


-- 

Thanks,

David / dhildenb


