Re: [Gluster-devel] important change to syncop infra

gluster-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Gluster-devel] important change to syncop infra

From:	Xavier Hernandez
Subject:	Re: [Gluster-devel] important change to syncop infra
Date:	Wed, 11 Dec 2013 11:04:42 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.1

Hi Pranith,

it's really a very interesting and hard to find bug, but I'm wonderinghow we can prevent this to happen in the general case. There might beother operations based on pointers to thread local storage that willsuffer this problem. Probably 'errno' is one of the most dangerous, butTLS is also used to resolve 'THIS' in a very similar way to errno, andthere are temporary uuid and lkowner values also stored into TLS. Morethings could appear in the future. These are also potential cases tohave problems with syncops.


An access to THIS before and after a syncop might trigger this bug, right ?

I think it's very difficult to track all these cases and handle themcorrectly. Another solution could be to tell the compiler to forgetprevious pointer values when a syncop is called, but I don't see any wayto do so. Since it's storing pointers to values, any barrier or volatileseems useless.


Xavi


El 11/12/13 09:51, Pranith Kumar Karampuri ha escrit:

hi,
     We found a day-1 bug when syncop_xxx() infra is used inside a synctask 
with compilation optimization (CFLAGS -O2). This bug has been dormant for at 
least 2 years.
There are around ~400(rebalance, replace-brick, bd, self-heal-daemon, quota, 
fuse lock/fd migration) places where syncop is used in the code base all of 
which are potential candidates which can take the hit.

I sent first round of patch at http://review.gluster.com/6475 to catch 
regressions upstream.
These are the files that are affected by the changes I introduced to fix this:

  api/src/glfs-fops.c                             |  36 
++++++++++++++++++++++++++++++++++
  api/src/glfs-handleops.c                        |  15 ++++++++++++++
  api/src/glfs-internal.h                         |   7 +++++++
  api/src/glfs-resolve.c                          |  10 ++++++++++
  libglusterfs/src/syncop.c                       | 117 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------
  xlators/cluster/afr/src/afr-self-heald.c        |  45 
+++++++++++++++++++++++++++++-------------
  xlators/cluster/afr/src/pump.c                  |  12 ++++++++++--
  xlators/cluster/dht/src/dht-helper.c            |  24 +++++++++++++++--------
  xlators/cluster/dht/src/dht-rebalance.c         | 168 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------------------------
  xlators/cluster/dht/src/dht-selfheal.c          |   6 ++++--
  xlators/features/locks/src/posix.c              |   3 ++-
  xlators/features/qemu-block/src/bdrv-xlator.c   |  15 ++++----------
  xlators/features/qemu-block/src/qb-coroutines.c |  14 ++++++++++----
  xlators/mount/fuse/src/fuse-bridge.c            |  16 ++++++++++-----

Please review your respective component for these changes in gerrit.

Thanks
Pranith.

Detailed explanation of the Root cause:
We found the bug in 'gf_defrag_migrate_data' in rebalance operation:

Lets look at interesting parts of the function:

int
gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
                         dict_t *migrate_data)
{
.....
code section - [ Loop ]
         while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
                                        &entries)) != 0) {
.....
code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno by a 
thread)
                 /* Need to keep track of ENOENT errno, that means, there is no
                    need to send more readdirp() */
                 readdir_operrno = errno;
.....
code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
                         ret = syncop_getxattr (this, &entry_loc, &dict,
                                                GF_XATTR_LINKINFO_KEY);
code section - [ ERRNO-2]   (checking for failures of syncop_getxattr(). This 
may not always be executed in same thread which executed [SYNCOP-1])
                         if (ret < 0) {
                                 if (errno != ENODATA) {
                                         loglevel = GF_LOG_ERROR;
                                         defrag->total_failures += 1;
.....
}

the function above could be executed by thread(t1) till [SYNCOP-1] and code 
from [ERRNO-2] can be executed by a different thread(t2) because of the way 
syncop-infra schedules the tasks.

when the code is compiled with -O2 optimization this is the assembly code that 
is generated:
  [ERRNO-1]
1165                        readdir_operrno = errno; <<---- errno gets expanded 
as *(__errno_location())
    0x00007fd149d48b60 <+496>:        callq  0x7fd149d410c0 <address@hidden>
    0x00007fd149d48b72 <+514>:        mov    %rax,0x50(%rsp) <<------ Address 
returned by __errno_location() is stored in a special location in stack for later use.
    0x00007fd149d48b77 <+519>:        mov    (%rax),%eax
    0x00007fd149d48b79 <+521>:        mov    %eax,0x78(%rsp)
....
  [ERRNO-2]
1281                                        if (errno != ENODATA) {
    0x00007fd149d492ae <+2366>:        mov    0x50(%rsp),%rax <<-----  Because 
it already stored the address returned by __errno_location(), it just dereferences the 
address to get the errno value. BUT THIS CODE NEED NOT BE EXECUTED BY SAME THREAD!!!
    0x00007fd149d492b3 <+2371>:        mov    $0x9,%ebp
    0x00007fd149d492b8 <+2376>:        mov    (%rax),%edi
    0x00007fd149d492ba <+2378>:        cmp    $0x3d,%edi

The problem is that __errno_location() value of t1 and t2 are different. So 
[ERRNO-2] ends up reading errno of t1 instead of errno of t2 even though t2 is 
executing [ERRNO-2] code section.

When code is compiled without any optimization for [ERRNO-2]:
1281                                        if (errno != ENODATA) {
    0x00007fd58e7a326f <+2237>:        callq  0x7fd58e797300 
<address@hidden><<--- As it is calling __errno_location() again it gets the location 
from t2 so it works as intended.
    0x00007fd58e7a3274 <+2242>:        mov    (%rax),%eax
    0x00007fd58e7a3276 <+2244>:        cmp    $0x3d,%eax
    0x00007fd58e7a3279 <+2247>:        je     0x7fd58e7a32a1 
<gf_defrag_migrate_data+2287>

Fix:
We decided to make syncop_xxx() return (-errno) value as the return value in 
case of errors and all the functions which make syncop_xxx() will need to use 
(-ret) to figure out the reason for failure in case of syncop_xxx() failures.

Pranith

_______________________________________________
Gluster-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/gluster-devel

[Prev in Thread]

Current Thread

[Next in Thread]

[Gluster-devel] important change to syncop infra, Pranith Kumar Karampuri, 2013/12/11
- Re: [Gluster-devel] important change to syncop infra, Xavier Hernandez <=
  - Re: [Gluster-devel] important change to syncop infra, Pranith Kumar Karampuri, 2013/12/11
    - Re: [Gluster-devel] important change to syncop infra, Xavier Hernandez, 2013/12/11
    - Re: [Gluster-devel] important change to syncop infra, Pranith Kumar Karampuri, 2013/12/11
    - Re: [Gluster-devel] important change to syncop infra, John Mark Walker, 2013/12/11
- Re: [Gluster-devel] important change to syncop infra, Anand Subramanian, 2013/12/12
  - Re: [Gluster-devel] important change to syncop infra, Anand Avati, 2013/12/12
    - Re: [Gluster-devel] important change to syncop infra, Pranith Kumar Karampuri, 2013/12/12

Prev by Date: Re: [Gluster-devel] Could we add a MAINTAINERS file to the glusterfs sources?
Next by Date: Re: [Gluster-devel] important change to syncop infra
Previous by thread: [Gluster-devel] important change to syncop infra
Next by thread: Re: [Gluster-devel] important change to syncop infra
Index(es):
- Date
- Thread