[Gluster-devel] preventing gfid-mismatches because of crashes in afr


From: Pranith Kumar Karampuri
Subject: [Gluster-devel] preventing gfid-mismatches because of crashes in afr
Date: Tue, 11 Mar 2014 11:37:35 -0400 (EDT)

hi,

   Traditionally, afr only records in extended attributes which directories are 
good and which are stale. At self-heal time it does a full directory scan, 
deletes the stale entries and creates the missing ones. There are two problems 
with this approach:
1) Creating, deleting or renaming even a single entry requires a full scan of 
the directory.
2) If both bricks crash at the same time while a rename is in progress, it can 
lead to same-name, different-gfid split-brains.
   Example:
            0) dir1 has file 'a' with gfid-a, dir2 has file 'b' with gfid-b.
            1) The user executes rename dir1/a -> dir2/b on the mount, 
overwriting the original file 'b'.
            2) On brick-0 the rename succeeds, so dir1 no longer has 'a' and 
dir2 has file 'b' with gfid-a.
            3) At this point both brick processes go down (data center 
shutdown, etc.), so brick-1 still has dir1 with file 'a' with gfid-a and dir2 
with file 'b' with gfid-b.
            4) When both bricks are back up, dir1 can be healed conservatively: 
'a' is recreated with gfid-a on brick-0, healed from brick-1 (incorrectly 
undoing the rename).
            5) But for dir2, brick-0 has a file 'b' with gfid-a whereas brick-1 
has a file 'b' with gfid-b, and afr at the moment doesn't store any information 
to figure out which one is correct.

To address this issue, the granularity of the preop/postop of entry operations 
needs to be made finer.
A filename inside a directory can be uniquely identified by the entry-tuple 
(parent-gfid, entryname, entry-gfid).
Example: dir2/b on brick-1 in the example above can be represented as 
(gfid-of-dir2, b, gfid-b).

So we need to remember this information for every entry fop, along with whether 
the entry is coming 'in' to the directory or going 'out' of it.
In the previous example we would have remembered that dir2/b with gfid-b is 
going out of the directory, so that entry could be deleted and dir2/b with 
gfid-a could be healed from brick-0.
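
To make the idea concrete, here is a rough sketch of how the remembered 
direction could break the tie for dir2/b. It is only an illustration: 
EntryTuple and pick_survivor are hypothetical names, gfids are shown as plain 
strings, and nothing here is existing afr code.

```python
from typing import Dict, List, NamedTuple, Optional


class EntryTuple(NamedTuple):
    """(parent-gfid, entryname, entry-gfid), as described in the mail."""
    pargfid: str
    name: str
    gfid: str


def pick_survivor(candidates: List[EntryTuple],
                  pending: Dict[EntryTuple, str]) -> Optional[EntryTuple]:
    """Return the one candidate that is not marked as going 'out'.

    Returns None when the pending marks do not leave exactly one
    candidate, i.e. when the conflict still cannot be resolved.
    """
    keep = [c for c in set(candidates) if pending.get(c) != "out"]
    return keep[0] if len(keep) == 1 else None


# State after the crash in the example: same name, different gfids.
on_brick0 = EntryTuple("gfid-of-dir2", "b", "gfid-a")
on_brick1 = EntryTuple("gfid-of-dir2", "b", "gfid-b")

# The pre-op of the rename recorded that (dir2, b, gfid-b) is going out.
pending = {on_brick1: "out"}

survivor = pick_survivor([on_brick0, on_brick1], pending)
print(survivor)
```

Without the pending mark the function returns None, which corresponds to the 
situation today where afr cannot tell which 'b' is correct.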

The solution that we come up with should have the following functionalities 
broadly:
1) Given an entry-tuple it should be able to remember that it is going in or 
out of that directory.
2) Given an existing entry-tuple it should be able to forget it.
3) Given an entry-tuple, we should be able to query if that entry-tuple is 
going in/out.
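
The three requirements above amount to a small interface. A minimal in-memory 
sketch, assuming hypothetical names (EntryIndex, remember, forget, query) and 
entry-tuples written as plain (pargfid, name, gfid) tuples, could look like 
this; a real implementation would of course have to persist the state on the 
brick.

```python
from typing import Dict, Optional, Tuple

# (parent-gfid, entryname, entry-gfid)
Entry = Tuple[str, str, str]


class EntryIndex:
    """In-memory stand-in for the per-brick entry index (illustration only)."""

    def __init__(self) -> None:
        self._pending: Dict[Entry, str] = {}

    def remember(self, entry: Entry, direction: str) -> None:
        """1) Record that the entry is going 'in' or 'out' of its directory."""
        if direction not in ("in", "out"):
            raise ValueError("direction must be 'in' or 'out'")
        self._pending[entry] = direction

    def forget(self, entry: Entry) -> None:
        """2) Drop a previously remembered entry; unknown entries are ignored."""
        self._pending.pop(entry, None)

    def query(self, entry: Entry) -> Optional[str]:
        """3) Return 'in', 'out', or None if nothing is pending for the entry."""
        return self._pending.get(entry)


index = EntryIndex()
entry = ("gfid-of-dir2", "b", "gfid-b")
index.remember(entry, "out")      # pre-op
state_during_fop = index.query(entry)
index.forget(entry)               # post-op
state_after_fop = index.query(entry)
```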

This is one possible way to address this issue:
0) Create the directory .glusterfs/indices/entry and two files, 'in' and 'out', 
in that directory.
1) Every time creat/mknod/symlink/link/mkdir happens, create a hardlink at the 
path .glusterfs/indices/entry/pargfid/gfid/filename to 
'.glusterfs/indices/entry/in' as part of pre-op.
2) Every time unlink/rmdir happens, create a hardlink at the path 
.glusterfs/indices/entry/pargfid/gfid/filename to 
'.glusterfs/indices/entry/out' as part of pre-op.
3) Every time rename happens, create the following 2 or 3 hardlinks:
   - .glusterfs/indices/entry/old-pargfid/gfid/old-filename to 
'.glusterfs/indices/entry/out'
   - .glusterfs/indices/entry/new-pargfid/gfid/new-filename to 
'.glusterfs/indices/entry/in'
and, if the destination exists:
   - .glusterfs/indices/entry/new-pargfid/existing-file-gfid/new-filename to 
'.glusterfs/indices/entry/out'
4) Delete the same files as part of post-op.
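
The steps above can be tried out on any filesystem that supports hardlinks. 
The following is a sketch only, not brick code: the function names are 
hypothetical, gfids are written as printable strings so they can be used as 
directory names, and a temporary directory stands in for the brick root.

```python
import os
import tempfile


def init_index(brick_root):
    """Step 0: create .glusterfs/indices/entry with the 'in' and 'out' files."""
    base = os.path.join(brick_root, ".glusterfs", "indices", "entry")
    os.makedirs(base, exist_ok=True)
    for anchor in ("in", "out"):
        open(os.path.join(base, anchor), "a").close()
    return base


def remember(base, pargfid, gfid, name, direction):
    """Pre-op: hardlink pargfid/gfid/name to the 'in' or 'out' file."""
    path = os.path.join(base, pargfid, gfid, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    os.link(os.path.join(base, direction), path)


def query(base, pargfid, gfid, name):
    """Return 'in' or 'out' depending on which file the link points to."""
    path = os.path.join(base, pargfid, gfid, name)
    if not os.path.lexists(path):
        return None
    for direction in ("in", "out"):
        if os.path.samefile(path, os.path.join(base, direction)):
            return direction
    return None


def forget(base, pargfid, gfid, name):
    """Post-op (step 4): delete the link and any directories left empty."""
    path = os.path.join(base, pargfid, gfid, name)
    os.unlink(path)
    for directory in (os.path.join(base, pargfid, gfid),
                      os.path.join(base, pargfid)):
        try:
            os.rmdir(directory)
        except OSError:
            break  # still holds other pending entries


def rename_preop(base, old_par, old_name, new_par, new_name, gfid,
                 existing_gfid=None):
    """Step 3: two links, plus a third if the destination already exists."""
    remember(base, old_par, gfid, old_name, "out")
    remember(base, new_par, gfid, new_name, "in")
    if existing_gfid is not None:
        remember(base, new_par, existing_gfid, new_name, "out")


# rename dir1/a -> dir2/b from the example, where dir2/b already exists.
with tempfile.TemporaryDirectory() as brick:
    base = init_index(brick)
    rename_preop(base, "gfid-dir1", "a", "gfid-dir2", "b",
                 "gfid-a", existing_gfid="gfid-b")
    print(query(base, "gfid-dir2", "gfid-b", "b"))
```

Because every pending entry is a link to one of two files, checking the 
direction only needs a comparison of the link target, and the link count of 
'in' and 'out' indicates whether anything is pending at all.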

To improve upon the solution we can do some optimizations:
The maximum filename length is 255 bytes, and pargfid and gfid take 16 bytes 
each.
So:
1) If the name of the file that is created/deleted/renamed is <= 223 bytes 
(filename-max-len(255) - twice-gfid-len(32) = 223), then instead of 
representing the entry-tuple as pargfid/gfid/filename (i.e. two directories and 
a filename) it can be represented as a single modified filename, 
pargfidgfidfilename: the first 16 bytes are the pargfid, the next 16 bytes the 
gfid and the rest the filename. This filename is then used as the link to 'in' 
or 'out'. (2 mkdirs are saved)
2) If the name is > 223 and <= 239 bytes (filename-max-len(255) - gfid-len(16) 
= 239), we can probably use pargfid/gfidfilename as the link. (1 mkdir is saved)
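
The choice between the three layouts can be sketched as below. The function 
name is hypothetical, and the thresholds are computed from the two constants 
rather than hard-coded: 255 - 32 = 223 for the single-component form and 
255 - 16 = 239 for the two-component form. As in the mail, gfids are treated 
as opaque 16-byte strings; raw gfid bytes can contain '/' or NUL, so a real 
implementation would need an encoding that is safe in filenames.

```python
NAME_MAX = 255   # maximum length of one path component, in bytes
GFID_LEN = 16    # length of a gfid, in bytes


def index_components(pargfid: bytes, gfid: bytes, name: bytes) -> list:
    """Return the path components to create under .glusterfs/indices/entry.

    Picks the layout with the fewest directories such that no single
    component is longer than NAME_MAX.
    """
    if len(pargfid) != GFID_LEN or len(gfid) != GFID_LEN:
        raise ValueError("gfids must be exactly 16 bytes")
    if len(name) <= NAME_MAX - 2 * GFID_LEN:
        # pargfidgfidfilename: no directories needed (2 mkdirs saved)
        return [pargfid + gfid + name]
    if len(name) <= NAME_MAX - GFID_LEN:
        # pargfid/gfidfilename: one directory needed (1 mkdir saved)
        return [pargfid, gfid + name]
    # pargfid/gfid/filename: the general form
    return [pargfid, gfid, name]


par = b"P" * GFID_LEN
gfid = b"G" * GFID_LEN
for length in (10, 223, 224, 239, 240, 255):
    parts = index_components(par, gfid, b"n" * length)
    print(length, len(parts), max(len(p) for p in parts))
```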

Let me know your thoughts, and do let me know if there is an easier way that 
can satisfy all the functionalities I listed above.

Thanks to Niels for listening to the initial approach and reading the initial 
draft :-).

Pranith


