[Gluster-devel] Re: New IDEA: The Checksumming xlator ( AFR Translator h

From:	Gareth Bult
Subject:	[Gluster-devel] Re: New IDEA: The Checksumming xlator ( AFR Translator have problem )
Date:	Thu, 17 Jan 2008 09:44:23 +0000 (GMT)

Hi,

Yes, I would agree these changes would improve the current implementation.

However, a "better" way would be for the client, on failing to write to ONE of 
the AFR volumes, to write the change to a logfile on the remaining volumes .. 
then for the recovering server to play back the logfile when it comes back up, 
or to recopy the file if the logs are insufficient or the file has been 
erased.

This would "seem" to be a very simple implementation .. 

Client;

Write to AFR
If Fail then
  if log file does not exist create log file
  Record file, version, offset, size and data in logfile

On server;

When recovering;

  for each entry in logfile
     if file age > most recent transaction
        re-copy whole file
     else
        replay transaction

  if all volumes "UP", remove logfile 

?????
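
A minimal sketch of the log-and-replay idea above, in Python. Everything in it
is hypothetical: Replica, LogEntry, afr_write and recover are illustrative
names, not GlusterFS APIs, and the replicas are in-memory dicts rather than
real volumes. It also assumes "insufficient logs" means the logged version does
not directly follow the stale copy's version, and it ignores erased files.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    up: bool = True
    files: dict = field(default_factory=dict)     # path -> bytearray
    versions: dict = field(default_factory=dict)  # path -> int


@dataclass
class LogEntry:
    path: str
    version: int  # version the file has after this write
    offset: int
    data: bytes   # size is len(data)


def apply_write(replica, path, version, offset, data):
    """Write data at offset, zero-padding the file if it is too short."""
    buf = replica.files.setdefault(path, bytearray())
    end = offset + len(data)
    if len(buf) < end:
        buf.extend(b"\0" * (end - len(buf)))
    buf[offset:end] = data
    replica.versions[path] = version


def afr_write(replicas, log, path, offset, data):
    """Client side: write to every live replica, log the write if any is down."""
    live = [r for r in replicas if r.up]
    if not live:
        raise OSError("no replica available")
    version = max(r.versions.get(path, 0) for r in live) + 1
    for r in live:
        apply_write(r, path, version, offset, data)
    if len(live) < len(replicas):
        log.append(LogEntry(path, version, offset, data))


def recover(stale, healthy, log):
    """Server side: replay the log, or recopy a file the log cannot repair."""
    recopied = set()
    for entry in log:
        if entry.path in recopied:
            continue  # a whole-file copy already includes this write
        if stale.versions.get(entry.path, 0) == entry.version - 1:
            apply_write(stale, entry.path, entry.version, entry.offset, entry.data)
        else:
            stale.files[entry.path] = bytearray(healthy.files[entry.path])
            stale.versions[entry.path] = healthy.versions[entry.path]
            recopied.add(entry.path)
    stale.up = True
    log.clear()  # all volumes are up again, so the log can go
```

The healthy replica is never locked here, which is the point of the proposal:
the file stays usable while the stale copy catches up.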

One of the REAL benefits of this is that the file is still available DURING a 
heal operation.
At the moment a HEAL only takes place when a file is being opened, and while 
the copy is taking place the file blocks ...

Gareth.

----- Original Message -----
From: "Angel" <address@hidden>
To: "Gareth Bult" <address@hidden>
Cc: address@hidden
Sent: 17 January 2008 08:47:06 o'clock (GMT) Europe/London
Subject: New IDEA: The Checksumming xlator ( AFR Translator have problem )

Hi Gareth

You said it!!, gluster is revolutionary!!

AFR does a good job, we only have to help AFR be a better guy!!

What we need is a checksumming translator!!

Suppose you have your posix volumes A and B on different servers.

So you are using AFR(A,B) on the client.

One of your AFRed nodes ( A ) fails and some time later comes back to life, but 
its backend filesystem got trashed and fsck'ed, and now there may be subtle 
differences in the files inside.

Your beloved 100GB XEN files now don't match between your "faulty" A node and 
your fresh B node!! 

AFR would notice this by means (I think) of xattrs on both files, that is, 
VERSION(FILE on node A) != VERSION(FILE on node B) or something like that.

But the real problem, as you pointed out, is that AFR only knows the files don't 
match, so it has to copy every byte of your 100GB image from B to A 
(automatically on self-heal or on file access).

That's many GBs (maybe PBs) going back and forth over the net. THIS IS VERY 
EXPENSIVE, we all know that.

Enter the Checksumming xlator (SHA1 or MD5, maybe MD4, as rsync seems to use 
that without any problem)

The Checksumming xlator sits atop your posix modules on every node. Whenever 
you request the xattr SHA1[block_number] on a file, the checksumming xlator 
intercepts this call, reads block number "block_number" from the file, 
calculates its SHA1 and returns it as an xattr key:value pair.

Now AFR can request the SHA1 blockwise on both servers and update only those 
blocks whose SHA1 doesn't match.
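
A minimal sketch of that blockwise comparison, in Python. The names block_sha1
and mismatched_blocks and the 128 KiB default block size are assumptions made
for illustration, not part of any existing xlator. Both copies are plain byte
strings here; in the proposed design each server would compute its own digests
and only the digests would cross the network.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # assumed block size, tune to taste


def block_sha1(data, block_number, block_size=BLOCK_SIZE):
    """Return the SHA1 hex digest of one block (what the xattr would carry)."""
    start = block_number * block_size
    return hashlib.sha1(data[start:start + block_size]).hexdigest()


def mismatched_blocks(a, b, block_size=BLOCK_SIZE):
    """Return the block numbers whose SHA1 differs between the two copies."""
    blocks = -(-max(len(a), len(b)) // block_size)  # ceiling division
    return [
        n for n in range(blocks)
        if block_sha1(a, n, block_size) != block_sha1(b, n, block_size)
    ]
```

For example, mismatched_blocks(b"aaaabbbbcccc", b"aaaaXbbbcccc", 4) reports
only block 1, so only that block would need to be sent.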

With a decent block size we can save a lot of data transfer on every heal.

-- If your faulty node lost its contents, you have to copy the whole 
100GB XEN files again
-- If the SHA1s mismatch, AFR can update only the differences, saving a lot of 
resources, like RSYNC does. 
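
The two cases above can be sketched in Python as one heal routine. heal_file is
a hypothetical helper, not an AFR function, and it compares two local files for
simplicity, whereas the real thing would compare digests fetched from each
server. It returns the number of data bytes it had to copy.

```python
import hashlib
import os
import shutil


def heal_file(src, dst, block_size=128 * 1024):
    """Make dst identical to src and return the number of bytes copied."""
    # Case 1: the stale node lost the file, so the whole file is copied.
    if not os.path.exists(dst):
        shutil.copyfile(src, dst)
        return os.path.getsize(src)

    # Case 2: both copies exist, so only mismatching blocks are rewritten.
    copied = 0
    offset = 0
    with open(src, "rb") as good, open(dst, "r+b") as stale:
        while True:
            want = good.read(block_size)
            if not want:
                break
            stale.seek(offset)
            have = stale.read(block_size)
            if hashlib.sha1(want).digest() != hashlib.sha1(have).digest():
                stale.seek(offset)
                stale.write(want)
                copied += len(want)
            offset += len(want)
        stale.truncate(offset)  # drop any trailing data src does not have
    return copied
```

With a single corrupted block, the return value is one block's worth of bytes
instead of the full file size.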

One more advanced feature would be to incorporate xdelta library functions, 
making it possible to generate binary patches against files...

Now we only need someone to implement this xlator :-)

Regards
 
On Thursday, 17 January 2008 01:49, wrote:
> Mmm...
> 
> There are a couple of real issues with self heal at the moment that make it a 
> minefield for the inexperienced.
> 
> Firstly there's the mount bug .. if you have two servers and two clients, and 
> one AFR, there's a temptation to mount each client against a different 
> server. Which initially works fine .. right up until one of the glusterfsd's 
> ends .. when it still works fine. However, when you restart the failed 
> glusterfsd, one client will erroneously connect to it (or this is my 
> interpretation of the net effect), regardless of the fact that self-heal has 
> not taken place .. and because it's out of sync, doing a "head -c1" on a file 
> you know has changed gets you nowhere. So essentially you need to remount 
> clients against non-crashed servers before starting a crashed server .. which 
> is not nice. (this is a filed bug)
> 
> Then we have us poor XEN users who store 100Gb's worth of XEN images on a 
> gluster mount .. which means we can live migrate XEN instances between 
> servers .. which is fantastic. However, after a server config change or a 
> server crash, it means we need to copy 100Gb between the servers .. which 
> wouldn't be so bad if we didn't have to stop and start each XEN instance in 
> order for self heal to register the file as changed .. and while self-heal is 
> re-copying the images, they can't be used, so you're looking at 3-4 mins of 
> downtime per instance.
> 
> Apart from that (!) I think gluster is a revolutionary filesystem and will go 
> a long way .. especially if the bug list shrinks .. ;-)
> 
> Keep up the good work :)
> 
> [incidentally, I now have 3 separate XEN/gluster server stacks, all running 
> live-migrate - it works!]
> 
> Regards,
> Gareth.
>

-- 
----------------------------
Clister UAH
----------------------------
