Re: [Gluster-devel] important change to syncop infra

gluster-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gluster-devel] important change to syncop infra

From:	Xavier Hernandez
Subject:	Re: [Gluster-devel] important change to syncop infra
Date:	Wed, 11 Dec 2013 13:26:53 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.1

I missed the const attribute. Thanks for the explanation. This shouldeliminate the THIS problem, as you proved, however other cases are stillpossible that won't work correctly.

Basically, any TLS pointer passed to user code instead of beingdereferenced will have problems. In this case, uuid and lkowner havethis problem. Consider the following code:

@@ -1129,6 +1129,7 @@ gf_defrag_migrate_data (xlator_t *this,gf_defrag_info_t *defrag, loc_t *loc,

         struct timeval           start          = {0,};
         int32_t                  err            = 0;
         int                      loglevel       = GF_LOG_TRACE;
+        char *                   str_uuid       = NULL;

         gf_log (this->name, GF_LOG_INFO, "migrate data called on %s",
                 loc->path);

@@ -1275,8 +1276,10 @@ gf_defrag_migrate_data (xlator_t *this,gf_defrag_info_t *defrag, loc_t *loc,* and also that guarantees that file has tobe mostly

                          * migrated */

+                        str_uuid = uuid_utoa(entry_loc.gfid);
                         ret = syncop_getxattr (this, &entry_loc, &dict,
GF_XATTR_LINKINFO_KEY);

+ gf_log(this->name, GF_LOG_DEBUG, "uuid in tls:%s", str_uuid);

                         if (ret < 0) {
                                 if (errno != ENODATA) {
                                         loglevel = GF_LOG_ERROR;


It seems obvious, but this is the generated assembly code anyway:

1279
    2e8f:       48 8b 7c 24 60          mov    0x60(%rsp),%rdi

2e94: e8 00 00 00 00 callq 2e99<gf_defrag_migrate_data+0x719> <<<=== get pointer to TLS object

                        2e95: R_X86_64_PLT32    uuid_utoa-0x4
1280
    2e99:       4c 8b 74 24 20          mov    0x20(%rsp),%r14
    2e9e:       48 8b 54 24 70          mov    0x70(%rsp),%rdx

2ea3: 48 8d 0d 00 00 00 00 lea 0x0(%rip),%rcx #2eaa <gf_defrag_migrate_data+0x72a>

                        2ea6: R_X86_64_PC32     .LC82-0x4
    2eaa:       48 8b 74 24 30          mov    0x30(%rsp),%rsi
1279

2eaf: 49 89 c7 mov%rax,%r15 <<<=== store the TLS pointer

1280
    2eb2:       4c 89 f7                mov    %r14,%rdi

2eb5: e8 00 00 00 00 callq 2eba<gf_defrag_migrate_data+0x73a> <<<=== possible thread switch

                        2eb6: R_X86_64_PLT32 syncop_getxattr-0x4
1282

2eba: 4c 89 3c 24 mov%r15,(%rsp) <<<=== use the TLS pointer

    2ebe:       49 8b 3e                mov    (%r14),%rdi

2ec1: 4c 8d 0d 00 00 00 00 lea 0x0(%rip),%r9 # 2ec8<gf_defrag_migrate_data+0x748>

                        2ec4: R_X86_64_PC32     .LC83-0x4

2ec8: 48 8d 15 00 00 00 00 lea 0x0(%rip),%rdx #2ecf <gf_defrag_migrate_data+0x74f>

                        2ecb: R_X86_64_PC32     .rodata+0xbc

2ecf: 48 8d 35 00 00 00 00 lea 0x0(%rip),%rsi #2ed6 <gf_defrag_migrate_data+0x756>

                        2ed2: R_X86_64_PC32     .LC1-0x4
1280
    2ed6:       89 c5                   mov    %eax,%ebp
1282
    2ed8:       41 b8 08 00 00 00       mov    $0x8,%r8d
    2ede:       31 c0                   xor    %eax,%eax
    2ee0:       b9 02 05 00 00          mov    $0x502,%ecx

2ee5: e8 00 00 00 00 callq 2eea<gf_defrag_migrate_data+0x76a>

                        2ee6: R_X86_64_PLT32    _gf_log-0x4

The str_uuid pointer used can point to the uuid of another thread thatcould have been modified. I don't know very well the internals ofsyncops and synctask, but I think it's even possible that another taskrunning on the same thread while the current one was sleeping couldmodify the contents of str_uuid.

This case seems very obvious and useless but maybe other combinationsare harder to detect.

I think that currently they are only used by uuid_utoa() andlkowner_utoa() to write log messages, which seems safe, but we will haveto be careful when using syncops to avoid possible future problems likethis.


Xavi

El 11/12/13 12:02, Pranith Kumar Karampuri ha escrit:

it's really a very interesting and hard to find bug, but I'm wondering
how we can prevent this to happen in the general case. There might be
other operations based on pointers to thread local storage that will
suffer this problem. Probably 'errno' is one of the most dangerous, but
TLS is also used to resolve 'THIS' in a very similar way to errno, and
there are temporary uuid and lkowner values also stored into TLS. More
things could appear in the future. These are also potential cases to
have problems with syncops.

An access to THIS before and after a syncop might trigger this bug, right ?

for errno it happens because it has the attribute __attribute__((const)).

# grep errno_location /usr/include/bits/errno.h
extern int *__errno_location (void) __THROW __attribute__ ((__const__));
#   define errno (*__errno_location ())

(The following article explains it best IMO: http://lwn.net/Articles/285332)

But it is not there for lkowner, uuid-buf, THIS. Just to confirm, I added the 
following code
------------------------------------------------------
+        xlator_t                *tmp_this = NULL;

gf_log (this->name, GF_LOG_INFO, "migrate data called on %s",

                  loc->path);
@@ -1258,8 +1259,10 @@ gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t 
*defrag, loc_t *loc,
                           * and also that guarantees that file has to be mostly
                           * migrated */

+ tmp_this = THIS;

                          ret = syncop_getxattr (this, &entry_loc, &dict,
                                                 GF_XATTR_LINKINFO_KEY);
+                        tmp_this = THIS;
                          if (ret < 0) {
                                  if (errno != ENODATA) {
                                          loglevel = GF_LOG_ERROR;
@@ -1267,7 +1270,7 @@ gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t 
*defrag, loc_t *loc,
                                  } else {
                                          loglevel = GF_LOG_TRACE;
                                  }
-                                gf_log (this->name, loglevel, "%s: failed to "
+                                gf_log (tmp_this->name, loglevel, "%s: failed to 
"
-----------------------------------------------------
assembly code it generated:
=====================================================
1262                            tmp_this = THIS;
    0x00007faf17120dbc <+1788>:   xor    %eax,%eax
    0x00007faf17120dbe <+1790>:   callq  0x7faf171191d0 <address@hidden> 
<<----------------first call to THIS

1263                            ret = syncop_getxattr (this, &entry_loc, &dict,
    0x00007faf17120dc3 <+1795>:   mov    0x60(%rsp),%rdx
    0x00007faf17120dc8 <+1800>:   mov    0x28(%rsp),%rsi
    0x00007faf17120dcd <+1805>:   lea    0x2b550(%rip),%rcx        # 
0x7faf1714c324
    0x00007faf17120dd4 <+1812>:   mov    0x18(%rsp),%rdi
    0x00007faf17120dd9 <+1817>:   callq  0x7faf17118a30 <address@hidden>
    0x00007faf17120dde <+1822>:   mov    %eax,%ebp

1264                                                   GF_XATTR_LINKINFO_KEY);
---Type <return> to continue, or q <return> to quit---
1265                            tmp_this = THIS;
    0x00007faf17120de0 <+1824>:   xor    %eax,%eax
    0x00007faf17120de2 <+1826>:   callq  0x7faf171191d0 <address@hidden> 
<<----------------second call to THIS
    0x00007faf17120de9 <+1833>:   mov    (%rax),%r15

1266                            if (ret < 0) {
    0x00007faf17120de7 <+1831>:   test   %ebp,%ebp
    0x00007faf17120dec <+1836>:   js     0x7faf1712100e 
<gf_defrag_migrate_data+2382>

1267                                    if (errno != ENODATA) {
    0x00007faf1712100e <+2382>:   mov    0x50(%rsp),%rax 
<<<---------------------------------------------check that errno does not callq 
__errno_location
    0x00007faf17121013 <+2387>:   mov    $0x9,%ebp
    0x00007faf17121018 <+2392>:   mov    (%rax),%edi
    0x00007faf1712101a <+2394>:   cmp    $0x3d,%edi

================================================

I think it's very difficult to track all these cases and handle them
correctly. Another solution could be to tell the compiler to forget
previous pointer values when a syncop is called, but I don't see any way
to do so. Since it's storing pointers to values, any barrier or volatile
seems useless.

We can do this by putting __attribute__((noinline)) for such functions. But 
since errno is not in our control we can't do that.

Xavi


El 11/12/13 09:51, Pranith Kumar Karampuri ha escrit:

hi,
      We found a day-1 bug when syncop_xxx() infra is used inside a synctask
      with compilation optimization (CFLAGS -O2). This bug has been dormant
      for at least 2 years.
There are around ~400(rebalance, replace-brick, bd, self-heal-daemon,
quota, fuse lock/fd migration) places where syncop is used in the code
base all of which are potential candidates which can take the hit.

I sent first round of patch at http://review.gluster.com/6475 to catch
regressions upstream.
These are the files that are affected by the changes I introduced to fix
this:

   api/src/glfs-fops.c                             |  36
   ++++++++++++++++++++++++++++++++++
   api/src/glfs-handleops.c                        |  15 ++++++++++++++
   api/src/glfs-internal.h                         |   7 +++++++
   api/src/glfs-resolve.c                          |  10 ++++++++++
   libglusterfs/src/syncop.c                       | 117
   
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
   xlators/cluster/afr/src/afr-self-heald.c        |  45
   +++++++++++++++++++++++++++++-------------
   xlators/cluster/afr/src/pump.c                  |  12 ++++++++++--
   xlators/cluster/dht/src/dht-helper.c            |  24
   +++++++++++++++--------
   xlators/cluster/dht/src/dht-rebalance.c         | 168
   
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------
   xlators/cluster/dht/src/dht-selfheal.c          |   6 ++++--
   xlators/features/locks/src/posix.c              |   3 ++-
   xlators/features/qemu-block/src/bdrv-xlator.c   |  15 ++++----------
   xlators/features/qemu-block/src/qb-coroutines.c |  14 ++++++++++----
   xlators/mount/fuse/src/fuse-bridge.c            |  16 ++++++++++-----

Please review your respective component for these changes in gerrit.

Thanks
Pranith.

Detailed explanation of the Root cause:
We found the bug in 'gf_defrag_migrate_data' in rebalance operation:

Lets look at interesting parts of the function:

int
gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t
*loc,
                          dict_t *migrate_data)
{
.....
code section - [ Loop ]
          while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
                                         &entries)) != 0) {
.....
code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno
by a thread)
                  /* Need to keep track of ENOENT errno, that means, there
                  is no
                     need to send more readdirp() */
                  readdir_operrno = errno;
.....
code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
                          ret = syncop_getxattr (this, &entry_loc, &dict,
                                                 GF_XATTR_LINKINFO_KEY);
code section - [ ERRNO-2]   (checking for failures of syncop_getxattr().
This may not always be executed in same thread which executed [SYNCOP-1])
                          if (ret < 0) {
                                  if (errno != ENODATA) {
                                          loglevel = GF_LOG_ERROR;
                                          defrag->total_failures += 1;
.....
}

the function above could be executed by thread(t1) till [SYNCOP-1] and code
from [ERRNO-2] can be executed by a different thread(t2) because of the
way syncop-infra schedules the tasks.

when the code is compiled with -O2 optimization this is the assembly code
that is generated:
   [ERRNO-1]
1165                        readdir_operrno = errno; <<---- errno gets
expanded as *(__errno_location())
     0x00007fd149d48b60 <+496>:        callq  0x7fd149d410c0
     <address@hidden>
     0x00007fd149d48b72 <+514>:        mov    %rax,0x50(%rsp) <<------
     Address returned by __errno_location() is stored in a special location
     in stack for later use.
     0x00007fd149d48b77 <+519>:        mov    (%rax),%eax
     0x00007fd149d48b79 <+521>:        mov    %eax,0x78(%rsp)
....
   [ERRNO-2]
1281                                        if (errno != ENODATA) {
     0x00007fd149d492ae <+2366>:        mov    0x50(%rsp),%rax <<-----
     Because it already stored the address returned by __errno_location(),
     it just dereferences the address to get the errno value. BUT THIS CODE
     NEED NOT BE EXECUTED BY SAME THREAD!!!
     0x00007fd149d492b3 <+2371>:        mov    $0x9,%ebp
     0x00007fd149d492b8 <+2376>:        mov    (%rax),%edi
     0x00007fd149d492ba <+2378>:        cmp    $0x3d,%edi

The problem is that __errno_location() value of t1 and t2 are different. So
[ERRNO-2] ends up reading errno of t1 instead of errno of t2 even though
t2 is executing [ERRNO-2] code section.

When code is compiled without any optimization for [ERRNO-2]:
1281                                        if (errno != ENODATA) {
     0x00007fd58e7a326f <+2237>:        callq  0x7fd58e797300
     <address@hidden><<--- As it is calling __errno_location() again
     it gets the location from t2 so it works as intended.
     0x00007fd58e7a3274 <+2242>:        mov    (%rax),%eax
     0x00007fd58e7a3276 <+2244>:        cmp    $0x3d,%eax
     0x00007fd58e7a3279 <+2247>:        je     0x7fd58e7a32a1
     <gf_defrag_migrate_data+2287>

Fix:
We decided to make syncop_xxx() return (-errno) value as the return value
in case of errors and all the functions which make syncop_xxx() will need
to use (-ret) to figure out the reason for failure in case of syncop_xxx()
failures.

Pranith

_______________________________________________
Gluster-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/gluster-devel

_______________________________________________
Gluster-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/gluster-devel

[Prev in Thread]

Current Thread

[Next in Thread]

[Gluster-devel] important change to syncop infra, Pranith Kumar Karampuri, 2013/12/11
- Re: [Gluster-devel] important change to syncop infra, Xavier Hernandez, 2013/12/11
  - Re: [Gluster-devel] important change to syncop infra, Pranith Kumar Karampuri, 2013/12/11
    - Re: [Gluster-devel] important change to syncop infra, Xavier Hernandez <=
    - Re: [Gluster-devel] important change to syncop infra, Pranith Kumar Karampuri, 2013/12/11
    - Re: [Gluster-devel] important change to syncop infra, John Mark Walker, 2013/12/11
- Re: [Gluster-devel] important change to syncop infra, Anand Subramanian, 2013/12/12
  - Re: [Gluster-devel] important change to syncop infra, Anand Avati, 2013/12/12
    - Re: [Gluster-devel] important change to syncop infra, Pranith Kumar Karampuri, 2013/12/12

Prev by Date: Re: [Gluster-devel] important change to syncop infra
Next by Date: Re: [Gluster-devel] important change to syncop infra
Previous by thread: Re: [Gluster-devel] important change to syncop infra
Next by thread: Re: [Gluster-devel] important change to syncop infra
Index(es):
- Date
- Thread