
Re: [Gluster-devel] Choice of Translator question


From: Gareth Bult
Subject: Re: [Gluster-devel] Choice of Translator question
Date: Thu, 27 Dec 2007 13:44:58 +0000 (GMT)

>The trusted.afr.version extended attribute tracks which file version is 
>being used, and on a read, all participating AFR members should respond 
>with this information, and any older/obsoleted file versions are 
>replaced by a newer copy from one of the valid AFR members (this is 
>self-heal)

Yes, understood.
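(For anyone following along, that attribute can be inspected directly on a 
brick with getfattr from the attr package - run it as root since it lives in 
the trusted.* namespace; the path below is just my test layout:

getfattr -d -m trusted.afr -e hex /export/stripe-1/database

)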

>I think they are planning striped reads per block (maybe definable) at a later 
>date.

Mmm, so at the moment, when it says AFR does striped reads, what it really 
means is that reads are only spread across replicas per file rather than per 
block, so it only helps if you have lots of relatively small files and not a 
few large files .. ???

>Read the file from a client (head -c1 FILE >/dev/null to force)

OR find /mountedfs -type f -exec head -c1 {} \; > /dev/null

.. which is good, but VERY inefficient for a large file-system.

>you could use the stripe translator over AFR to AFR chunks of the DB 
>file, thus allowing per chunk self-heal.

Mmm, my experimentation indicates that this does not happen. I've just spent 3 
hours trying to prove / disprove this with various configurations - AFR 
self-heals on a file basis, not on a stripe-chunk basis.

If I have 4 bricks arranged as two stripes of 2 bricks each with an AFR on 
top, any sort of self-heal replicates the entire DB.
If I have 4 bricks arranged as two AFRs with one stripe on top, I get the same 
thing.
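
For reference, the second layout (a stripe over two AFRs) looks roughly like 
this on the client side - volume names, hostnames and the 1MB block size are 
just my test values, and this is from memory rather than my exact spec:

volume remote1
  type protocol/client
  option transport-type tcp/client
  option remote-host brick1
  option remote-subvolume export1
end-volume

# remote2, remote3 and remote4 are defined the same way against the
# other three bricks

volume afr1
  type cluster/afr
  subvolumes remote1 remote2
end-volume

volume afr2
  type cluster/afr
  subvolumes remote3 remote4
end-volume

volume stripe0
  type cluster/stripe
  option block-size *:1MB    # stripe the file in 1MB chunks across the AFRs
  subvolumes afr1 afr2
end-volume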


>I'm not familiar enough with database file writing practices in general (not 
>to mention your 
>particular database's practices), or the stripe translator to tell 
>whether any of the following will cause you problems, but they are worth 
>looking into:

We're talking about flat files here, some with append, some with seek/write 
updates.

>1) Will the overhead the stripe translator introduces with a very large file 
>and relatively small chunks cause performance problems? (5G in 1MB stripes = 
>5000 parts...)

No, this would be fine if the AFR/stripe combination actually did a per-chunk 
self-heal.

>2) How will GlusterFS handle a write to a stripe that is currently 
>self-healing?  Block?

On self-heal the entire stripe file (which is big) is replicated, and both 
read and write operations block while that happens.

>3) Does the way the DB writes the DB file cause massive updates throughout the 
>file, or does it generally just append and update the indices, or something 
>completely different.  It could have an affect on how well something like this 
>works.

I don't think access speed is an issue; glusterfs is very quick. The issue is 
recovery, which appears not to operate as advertised!

>Essentially, using this layout, you are keeping track of which stripes have 
>changed and only have to sync those particular ones on self-heal. The longer 
>the downtime, the longer self-heal will take, but you can mitigate that 
>problem with an rsync of the stripes between the active and failed GlusterFS 
>nodes BEFORE starting glusterfsd on the failed node (make sure to get the 
>extended attributes too).
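
In other words, something like this, run on the failed node before glusterfsd 
comes back - "live-node" is a placeholder, and it assumes an rsync build that 
can actually carry the trusted.* extended attributes (-X/--xattrs; older 
builds need a patch for that):

# pull the stripe files across from the live node, preserving extended
# attributes, before starting glusterfsd again
rsync -avX live-node:/export/stripe-1/ /export/stripe-1/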

Ok, firstly, manual rsyncs sort of defeat the object of the exercise.
Secondly, having to go through this process every time a configuration is 
changed or glusterfsd is restarted is unworkable.
Thirdly, replicating many GBs of data hammers the IO system and slows down the 
entire cluster - again undesirable.

Being able to restart a glusterfsd without breaking the replicas would help, 
but I see no mention of this ...

>The above setup, if feasible, would mitigate restart cost, to the point where 
>only a few megs might need to be synced on a glusterfs restart.

Ok, well I appear to have both AFR and striping working; I can observe their 
operation at brick level and confirm they behave as expected.

Here's my basic test harness;

On the client system;

$ dd if=/dev/zero of=/mnt/stripe/database bs=1M count=1024

write.py
#!/usr/bin/python
# Overwrite a small marker deep inside the file, via the client mount.
io = open("/mnt/stripe/database", "r+")
io.seek(1024*1024*900)                  # jump to the 900MB mark
io.write("Change set version # 6\n")
io.close()

On the bricks I have;

read.py
#!/usr/bin/python
# Read the marker straight off the brick's copy of the stripe file,
# bypassing glusterfs.
io = open("/export/stripe-1/database", "r+")
io.seek(1024*1024*900)                  # same 900MB offset as write.py
print io.readline()
io.close()

When I run write.py on the client, both bricks show the correct change.
Then I kill glusterfsd on brick2.
Running write.py on the client shows an update on brick1, obviously not on 
brick2.
Restarting glusterfsd on brick2 shows a reconnect in the logs.
On the client: head -c1 database
This initiates a self-heal, shown in the logs with DEBUG turned on.
Running read.py on brick1 and brick2 blocks ...
An entire 1G chunk is copied to brick2.
read.py on bricks 1 and 2 then continues when the copy finishes ..

(!)
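
In shell terms the failure/recovery cycle I'm running is roughly this - the 
server spec path is just where my install keeps it:

# on brick2: simulate a failure
killall glusterfsd

# on the client: write while brick2 is down
./write.py

# on brick2: bring the server back with its usual spec file
glusterfsd -f /etc/glusterfs/glusterfs-server.vol

# on the client: read a byte through the mount to trigger self-heal
head -c1 /mnt/stripe/database > /dev/null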

I'm using fuse-2.7.2 from the repos and gluster 1.3.7 from the stable tgz ...

FYI: the fuse that comes with Ubuntu/Gutsy seems to cause gluster to crash 
under write load; I'm still waiting to see if the current CVS version solves 
the problem ...

Gareth.



