Re: [Gluster-devel] self heal problem


From: Stephan von Krawczynski
Subject: Re: [Gluster-devel] self heal problem
Date: Wed, 24 Mar 2010 16:03:53 +0100

Hi Tejas,

to my knowledge the situation does not derive from what you call
"handcrafted". My personal point of interest is how the code is able to
falsely self-heal at all, since all parameters it could possibly use for the
decision (and that I collected with getfattr and stat) clearly show that the
decision should have gone the other way round.
What precisely led the code to do what it did? Are there other parameters
involved in the decision?
I would have no problem deleting the server tree on the backend if that
helped. Nevertheless I would really like to understand how the self-heal
decision is made and what it is actually based on, because it obviously
cannot be based on the stat values or the xattrs of the file in question.
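
For reference, here is how I currently read those changelog attributes (just
my understanding of the AFR xattr layout, please correct me if I am wrong):
each trusted.afr.<subvolume> value appears to pack three big-endian 32-bit
counters for pending data, metadata and entry operations. A small sketch I
run on the backend bricks to dump them (the remote1/remote2 names are simply
taken from my volume spec):

#!/usr/bin/env python3
# Dump the AFR pending counters of a file directly on the backend.
# Assumption: each trusted.afr.<subvolume> xattr is 12 bytes holding three
# big-endian 32-bit counters (pending data, metadata, entry operations).
import os
import struct
import sys

def pending_counters(path, subvolumes=("remote1", "remote2")):
    counters = {}
    for sub in subvolumes:
        raw = os.getxattr(path, "trusted.afr." + sub)  # 12 raw bytes
        counters[sub] = struct.unpack(">III", raw)     # (data, metadata, entry)
    return counters

if __name__ == "__main__":
    for sub, (data, meta, entry) in pending_counters(sys.argv[1]).items():
        print("%s: data=%d metadata=%d entry=%d" % (sub, data, meta, entry))

For the file in question this prints all zeroes on both servers (as the hex
dumps below show), which is exactly why I do not see how the decision could
have been derived from these attributes.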

--
Regards,
Stephan



On Wed, 24 Mar 2010 08:46:37 -0600 (CST)
"Tejas N. Bhise" <address@hidden> wrote:

> Hi Stephan,
> 
> GlusterFS keeps track of whether an operation happened on one copy but not
> on its replica, in case a replica was not accessible. The remote1 and
> remote2 attributes show that there is no operation pending on the other
> replica.
> 
> From the attributes you have shown, it seems that you have gone to
> the backend directly, bypassed glusterfs, and hand-crafted such a
> situation. The way the code is written, we do not think it can
> reach the state you have shown in your example.
> 
> The remote1 and remote2 attributes show all zeroes, which means
> that there were no operations pending on any server.
> 
> If this was not hand-crafted, then please give a detailed testcase that
> can lead to this situation based on just file size.
> 
> If this situation was hand-crafted, then it would be akin to
> overwriting the section of a disk that carries the metadata of a
> filesystem and then claiming that the FS is getting corrupted.
> 
> Please look at the code around the section you pointed to in the
> other mail; you will see the other higher-order checks that are
> made.
> 
> Regards,
> Tejas.
> 
> 
> 
> 
> ----- Original Message -----
> From: "Stephan von Krawczynski" <address@hidden>
> To: address@hidden
> Sent: Tuesday, March 23, 2010 7:33:17 PM GMT +05:30 Chennai, Kolkata, Mumbai, 
> New Delhi
> Subject: Re: [Gluster-devel] self heal problem
> 
> Let me show you this further information for one file that was falsely self-healed:
> 
> server1:
> 
> # getfattr -d -m '.*' -e hex <filename>
> getfattr: Removing leading '/' from absolute path names
> # file: <filename>
> trusted.afr.remote1=0x000000000000000000000000
> trusted.afr.remote2=0x000000000000000000000000
> trusted.posix.gen=0x4b9bb33c00001be6
> 
> # stat <filename>
>   File: <filename>
>   Size: 4509            Blocks: 16         IO Block: 4096   regular file
> Device: 804h/2052d      Inode: 16560280    Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2010-03-23 11:10:36.000000000 +0100
> Modify: 2010-03-23 00:32:25.000000000 +0100
> Change: 2010-03-23 12:36:40.000000000 +0100
> 
> 
> server2:
> 
> # getfattr -d -m '.*' -e hex <filename>
> getfattr: Removing leading '/' from absolute path names
> # file: <filename>
> trusted.afr.remote1=0x000000000000000000000000
> trusted.afr.remote2=0x000000000000000000000000
> trusted.posix.gen=0x4b9bb2f600001be6
> 
> # stat <filename>
>   File: <filename>
>   Size: 4024            Blocks: 8          IO Block: 4096   regular file
> Device: 804h/2052d      Inode: 42762291    Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2010-03-23 11:10:36.000000000 +0100
> Modify: 2010-03-23 14:32:23.000000000 +0100
> Change: 2010-03-23 14:32:23.000000000 +0100
> 
> 
> As you can see, the latest file version is on server2 (see the modify date) and is 
> _smaller_ in size.
> 
> Now, on client 2, an ls shows interesting values:
> 
> # ls -l <filename>
> -rw-r--r--  1 root root 4509 Mar 23 14:37 <filename>
> 
> As you can see here, the file date looks newer and the size clearly shows 
> that self-heal went wrong.
> 
> Consequently the server2 copy now looks like:
> 
> # stat <filename>
>   File: <filename>
>   Size: 4509            Blocks: 16         IO Block: 4096   regular file
> Device: 804h/2052d      Inode: 42762291    Links: 1
> Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2010-03-23 11:10:36.000000000 +0100
> Modify: 2010-03-23 00:32:25.000000000 +0100
> Change: 2010-03-23 14:41:13.000000000 +0100
> 
> The modification date went back and the file size increased, so the older file 
> version was chosen to overwrite the newer one.
> 
> -- 
> Regards,
> Stephan
> 
> 
> _______________________________________________
> Gluster-devel mailing list
> address@hidden
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
> 