Re: Fwd: [Gluster-devel] proposals to afr

From:	Kevan Benson
Subject:	Re: Fwd: [Gluster-devel] proposals to afr
Date:	Wed, 24 Oct 2007 11:46:54 -0700
User-agent:	Thunderbird 2.0.0.6 (X11/20070728)

Alexey Filin wrote:

On 10/23/07, Kevan Benson <address@hidden> wrote:

Actually, I just thought of a major problem with this.  I think the
extended attributes need to be set as atomic operations.  Imagine the
case where two processes are writing the file at the same time, the op
counters could get very messed up.



Atomic operations are an ideal that is sometimes not possible in practice.
Ideal hardware exists only in the mind; on existing hardware, developers
always choose a compromise between complexity, performance, reliability,
flexibility, etc.

To keep the operation counter (or the version, if it is updated after each
operation) consistent, concurrent access to the same file has to be done:

* with one thread (so that concurrent operations on _one_ file are
serviced by _one_ thread only), which can provide atomicity with explicit
queuing
* or with sync primitive(s) for many threads.

I/O threads help to decrease latency when many clients use the same brick
(as, e.g., a glfs doc says) or to overlap network and disk I/O to increase
per-client performance (is that implemented in glfs?)

It depends on the context. Really, an atomic operation just means that
nothing else can interrupt the action while it's doing this one thing.
In the case of glusterfs, it could easily be achieved with a short
thread lock if it isn't already (I suspect it probably is). I'm
referring to atomic in the sense that there's only one extended
attribute, not multiple across threads. If two separate threads
(serving two separate requests) are acting on the same file, the op
counter as you defined it (an extended attribute) could itself become
inconsistent between different AFR subvolumes, depending on the order
the write requests are processed.
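To make the "short thread lock" idea concrete, here is a minimal Python sketch. It is not GlusterFS code: the `Replica` class, the `op_counter` attribute name, and the in-memory `xattrs` dict are all assumptions standing in for one subvolume's copy of a file and its extended attributes.

```python
import threading


class Replica:
    """Hypothetical in-memory stand-in for one subvolume's copy of a file."""

    def __init__(self):
        self.data = bytearray()
        self.xattrs = {"op_counter": 0}  # assumed attribute name
        self._lock = threading.Lock()

    def write(self, offset, payload):
        # Hold the lock so the data write and the counter increment
        # happen as one unit with respect to other threads.
        with self._lock:
            end = offset + len(payload)
            if len(self.data) < end:
                self.data.extend(b"\0" * (end - len(self.data)))
            self.data[offset:end] = payload
            self.xattrs["op_counter"] += 1


def hammer(replica, count):
    """Issue `count` one-byte writes at offset 0."""
    for _ in range(count):
        replica.write(0, b"x")


replica = Replica()
threads = [threading.Thread(target=hammer, args=(replica, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(replica.xattrs["op_counter"])
```

A lock like this only keeps the counter consistent on a single copy. It says nothing about the order in which two different subvolumes see the same pair of requests, which is the case I'm worried about.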


Imagine the following order of operations:
On subvolume A

First op: request(1).write("this is a request from program a") && opCounterIncrement
Second op: request(2).write("this is a request from program b") && opCounterIncrement

crash

On subvolume B

First op: request(2).write("this is a request from program b") && opCounterIncrement
Second op: request(1).write("this is a request from program a") && opCounterIncrement

crash

At this point the op counters would be the same, the trusted_afr_version
the same, the data different, and no self-heal would be triggered.
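The scenario can be reproduced with a small simulation. This is only a sketch: `Subvolume`, `needs_heal`, and the whole-file overwrite in `write` are assumptions made to keep it short, and the heal check compares metadata only, as in the case described above.

```python
class Subvolume:
    """Hypothetical model of one AFR subvolume holding a single file."""

    def __init__(self, name):
        self.name = name
        self.data = b""
        self.op_counter = 0
        self.afr_version = 1

    def write(self, payload):
        # Whole-file overwrite for simplicity: the last write wins.
        self.data = payload
        self.op_counter += 1


def needs_heal(x, y):
    # Decide from metadata alone, without comparing file contents.
    return (x.afr_version, x.op_counter) != (y.afr_version, y.op_counter)


req1 = b"this is a request from program a"
req2 = b"this is a request from program b"

a = Subvolume("A")
b = Subvolume("B")

# Subvolume A processes request 1 then request 2.
for request in (req1, req2):
    a.write(request)

# Subvolume B processes the same requests in the opposite order.
for request in (req2, req1):
    b.write(request)

print(needs_heal(a, b))    # False: counters and versions match
print(a.data == b.data)    # False: contents differ
```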

Now, I'm not familiar enough with the internals of GlusterFS to tell you
whether what I outlined above is even possible, but it is a race
condition I can see causing problems unless files are implicitly locked
by AFR writes. I'm not sure.

Another solution comes to mind.  Just set another extended attribute
denoting that the file is being written to currently (and unset it
afterwards).  If the AFR subvolume notices that the file is listed as
being written to but no clients have it open (I hope this is easily
determinable) a flag is returned for the file.  If all subvolumes return
this flag for the file in the AFR (and all the trusted_afr_versions are
the same), choose one version of the file (for example from the first
AFR subvolume) as the legit copy and copy it to the other AFR nodes.  It
doesn't matter which version is the most up to date, they will all be
fairly close, and since this is from a failed write operation there was
no guarantee the file was in a valid state after the write.  It
doesn't matter which copy you get, as long as it's consistent across AFR
members.



I like it more than the op counter. The advantage over the op counter is
that the flag is set only twice (open()/close()), so the overhead is
minimal (concurrent access to the flag has to be synchronized). The
disadvantage is that if an unclosed file is big enough, it sometimes has
to be copied when that is not required; this is acceptable if AFR crashes
are rare.

Wait, I assumed by operation you meant every specific write to the file,
so this op counter could be incremented quite a bit, but you just stated
it would only be set once as a flag, so maybe I'm misunderstanding you.
If it's incremented per actual file operation, quite a lot of increments
might happen. For example, using wget to save a remote file to disk
doesn't write everything at once; it does many writes as its buffer
fills with enough information to be worth writing to disk.
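As a rough illustration of the difference in overhead, the sketch below counts extended-attribute updates for one download under each scheme. The file size and write size are made-up numbers, and `metadata_updates` is a hypothetical helper, not anything from GlusterFS or wget.

```python
def metadata_updates(total_bytes, chunk_size):
    """Return (per-write counter updates, open/close flag updates)."""
    writes = -(-total_bytes // chunk_size)  # ceiling division
    per_write_counter = writes  # one increment per write() call
    open_close_flag = 2         # set on open(), cleared on close()
    return per_write_counter, open_close_flag


# Assumed example: a 100 MiB file written in 8 KiB chunks.
print(metadata_updates(100 * 1024 * 1024, 8 * 1024))  # (12800, 2)
```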

My thought above was a simple flag as to whether or not the file was
being written, just to denote whether it should be considered in a
consistent state if a crash happens.
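Here is a minimal sketch of that flag and the recovery rule from my earlier message. Everything in it is an assumption for illustration: the `Copy` record, the `writing` flag, the `open_clients` count, and the `heal` function are hypothetical and do not reflect how AFR actually stores or checks its state.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    """Hypothetical per-subvolume state for one file."""
    data: bytes
    afr_version: int
    writing: bool = False   # assumed "write in progress" attribute
    open_clients: int = 0


def begin_write(copy):
    copy.open_clients += 1
    copy.writing = True


def end_write(copy):
    copy.open_clients -= 1
    if copy.open_clients == 0:
        copy.writing = False  # clean close: flag removed


def is_interrupted(copy):
    # Flag still set although nobody has the file open.
    return copy.writing and copy.open_clients == 0


def heal(copies):
    """If every copy was interrupted mid-write and the versions agree,
    pick the first copy as the source and overwrite the others."""
    same_version = len({c.afr_version for c in copies}) == 1
    if same_version and all(is_interrupted(c) for c in copies):
        source = copies[0]
        for c in copies:
            c.data = source.data
            c.writing = False
        return True
    return False


copies = [Copy(b"program b", 1), Copy(b"program a", 1)]
for c in copies:
    begin_write(c)
for c in copies:
    c.open_clients = 0  # simulate a crash: clients vanish, flag remains

print(heal(copies), [c.data for c in copies])
```

The choice of the first copy as the source is arbitrary on purpose: after a failed write there was no guarantee about the contents anyway, so the only goal is that all AFR members end up identical.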

This whole conversation's gotten into somewhat esoteric territory that
requires more input from the GlusterFS team on whether it's even worth
considering doing stuff this way. Maybe they have a better solution in
the works? Any team members care to comment on their thoughts on this?


--

-Kevan Benson
-A-1 Networks
