Re: [Gluster-devel] Can I bring a development idea to Dev's attention?


From: Ed W
Subject: Re: [Gluster-devel] Can I bring a development idea to Dev's attention?
Date: Fri, 24 Sep 2010 11:07:20 +0100
User-agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.9) Gecko/20100915 Lightning/1.0b2 Thunderbird/3.1.4

 On 24/09/2010 08:25, Shehjar Tikoo wrote:
> Thanks, we have similar locking improvements in mind but cannot promise a date when these will be available. Some of the challenges that we'll need to think about are how to map any such locking scheme to standard locking behaviour for posix/nfsv3/v4/cifs.

Hi, thanks for replying

Whilst I can see that there is some optimisation to be had by combining brick-level locking with filesystem-level locking, I just want to clarify that my proposal was really about inter-brick locks, not about the top-level filesystem locking.

Just to clarify (and apologies if I'm trying to teach filesystem experts really obvious stuff...)

- The goal of any fileserver is to take async requests from lots of clients and arrange to serialise that access
- Until recently such fileservers have lived on a single machine, but with multiple async clients connecting
- Even in a single-server solution the bottleneck becomes that each client cannot cache *any* data, since it's not known whether the server copy has changed since we accessed it (even a microsecond earlier)
- The solution which has become popular (see CIFS, NFSv4 (?), GFS2, etc) is to offer clients an "optimistic lock", ie the client can acquire a token which, while held, means it can cache the data covered by that token and even apply writeback optimisations to it (obviously subject to whatever the application tolerates for unsynced data)
- This "optimistic lock" means that we effectively push the file locking out to the client, hence once a lock is acquired, further access by the client is no longer bounded by the network latency; under many circumstances this leads to massive speedups (a rough sketch of the idea follows this list)
- Clearly when a second client comes along and demands access to the same data, we need a process to break the lock and inform the first client that it needs to reacquire the lock (or revert to a kind of "write-through" access while waiting)
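
To make that last point concrete, here's a toy Python sketch of the lease/"optimistic lock" idea. Everything in it (LeaseServer, Client, the method names) is made up purely for illustration and isn't any real CIFS/NFS/Gluster API; the point is just that once a client holds the lease it answers repeat accesses from its own cache, and the server breaks the lease when a second client turns up:

import threading

# Toy model of the "optimistic lock" / lease idea: a client that holds a lease
# on a file serves reads from its local cache and buffers writes; when another
# client asks for the same file, the server breaks the lease, the first client
# hands back its dirty data, and the newcomer takes the lease over.
# All names here (LeaseServer, Client, ...) are invented for illustration.

class LeaseServer:
    def __init__(self):
        self._lock = threading.Lock()   # the server serialises all requests
        self._leases = {}               # path -> Client currently holding the lease
        self._data = {}                 # path -> authoritative contents

    def acquire_lease(self, path, client):
        """Grant the lease on `path` to `client`, breaking any existing lease."""
        with self._lock:
            holder = self._leases.get(path)
            if holder is not None and holder is not client:
                flushed = holder.break_lease(path)   # old holder must flush
                if flushed is not None:
                    self._data[path] = flushed
            self._leases[path] = client
            return self._data.get(path, b"")


class Client:
    def __init__(self, server, name):
        self.server, self.name = server, name
        self._cache = {}                # path -> locally cached contents

    def read(self, path):
        if path not in self._cache:     # first access: one network round trip
            self._cache[path] = self.server.acquire_lease(path, self)
        return self._cache[path]        # later accesses: served locally, no latency

    def write(self, path, data):
        if path not in self._cache:
            self.server.acquire_lease(path, self)
        self._cache[path] = data        # write-back; flushed when the lease breaks

    def break_lease(self, path):
        return self._cache.pop(path, None)   # lease revoked: hand back dirty data


if __name__ == "__main__":
    srv = LeaseServer()
    c1, c2 = Client(srv, "c1"), Client(srv, "c2")
    c1.write("/a", b"hello")   # c1 takes the lease and caches locally
    c1.read("/a")              # served from cache, no server contact at all
    print(c2.read("/a"))       # c2's access breaks c1's lease -> b'hello'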

So this process clearly benefits situations where access is largely serialised, ie a single client working on a file at a time. Excluding databases, however, this access pattern seems quite common for lots of applications.

So with regard to Gluster, I would see that we need this same type of locking implemented at the brick level. Hence, if you re-read the description above, each *gluster server* plays the role of one of the possible clients (think of the lower level being bricks talking to each other, and the upper level being clients talking to bricks). ie yes, posix locking needs to serialise access for every end client that connects to every brick, but we can also benefit from locking that serialises access between bricks (if 3,000 clients hammer one brick for a single file, then what we care about is that our single brick is allowed to read/write that file freely because it has informed the other bricks that it now holds a lock; serialising all the clients talking to that one brick is a separate problem).

So compared with traditional fileservers we actually need two levels of locking to serialise access: at one level we need to serialise clients' access to the filesystem, and lower down we need to serialise access between bricks.

I think an alternative way of looking at (and perhaps implementing) the situation could be something like:

- Consider two bricks with files replicated between them
- Client 1 accesses Brick 1 and requests File A
- Brick 1 contacts the other replicas and requests to become the "master replica" of that file. All future accesses to that file must now go through only Brick 1 while it remains in that "role"
- If Client 2 accesses Brick 1 and tries to do something with File A, then the normal filesystem locking must arrange for serialisation between Client 1 and Client 2; however, Brick 1 need not contact any other brick and there is no network latency penalty serving that file to Client 2 (obviously at some point one client will write data and we need to sync that, but read access incurs no network access)

- OK, now the trick is what happens when Client 3 accesses Brick 2 and requests File A... Somehow we need to wrest control back from Brick 1 and inform it that it's no longer the "master". A really simple solution to this (at least conceptually) is to proxy all access requests from Brick 2 back to Brick 1. This satisfies our requirement that accesses are serialised across bricks, and effectively there is still a "master" brick remaining in control.
- We can see that this setup is conceptually similar to having a traditional lock server arbitrating brick access to a given file, but in the example above we have implemented a distributed lock server, the lock server effectively becoming the same server as the one we hope is the "hot" server, so that we aren't incurring network latency to contact the lock server all the time.
- A further improvement would clearly be to have some kind of process whereby the "master brick" can move about, ie in the case above, if Client 3 starts to bash away at Brick 2 for File A, then Brick 2 is migrated to become the "master" and hold the lock, and now any access through Brick 1 must either proxy requests back to Brick 2 or re-acquire its lock (ie become the master). (A toy sketch of this scheme follows this list.)
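
To pin that down, here's a rough Python sketch of the master-replica idea. Brick, access() and the migrate flag are all invented names for illustration, not anything from the Gluster codebase, and real code would obviously need proper fencing, write synchronisation and failure handling:

# Toy sketch of the "master replica" idea from the example above: the first
# brick to touch a file becomes its master; any other replica either proxies
# requests to the master or takes the master role over (migration).

class Brick:
    def __init__(self, name, peers=None):
        self.name = name
        self.peers = peers or []          # the other replicas of this brick
        self.files = {}                   # path -> contents (local copy)
        self.master_of = set()            # paths this brick is currently master for

    def _find_master(self, path):
        for peer in self.peers:
            if path in peer.master_of:
                return peer
        return None

    def access(self, path, data=None, migrate=False):
        """Read (data is None) or write a file on behalf of a client."""
        if path in self.master_of:                    # fast path: no network hop
            return self._local(path, data)
        master = self._find_master(path)
        if master is None or migrate:                 # free, or we take the role over
            if master is not None:
                master.master_of.discard(path)
                self.files[path] = master.files.get(path)
            self.master_of.add(path)
            return self._local(path, data)
        return master._local(path, data)              # proxy to the current master

    def _local(self, path, data):
        if data is not None:
            self.files[path] = data
        return self.files.get(path)


if __name__ == "__main__":
    b1, b2 = Brick("brick1"), Brick("brick2")
    b1.peers, b2.peers = [b2], [b1]
    b1.access("/A", b"v1")                  # Brick 1 becomes master of /A
    print(b2.access("/A"))                  # Brick 2 proxies the read to Brick 1 -> b'v1'
    b2.access("/A", b"v2", migrate=True)    # Client 3 hammers Brick 2: the role migrates
    print(b1.access("/A"))                  # Brick 1 now proxies back to Brick 2 -> b'v2'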

OK, so the above is a very simple example of optimistic locking and could be trivially implemented using an external lock server which tracks which brick currently holds the lock for a given file (ie which brick can read/write it freely without first checking whether other bricks have modified it). A given brick which doesn't hold a lock on a file must first do roughly what it does already and contact the lock server to see whether another brick holds the lock. If not, it can acquire the lock itself. If the lock is held elsewhere, we need to either break the lock or proxy access requests to the server holding the lock.
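
Conceptually such an external lock server is little more than a table. The Python below is again purely illustrative (LockServer, acquire and steal are not real Gluster interfaces) and ignores failures entirely:

# Minimal sketch of the external lock server described above: a single table
# mapping each file to the brick currently allowed to read/write it freely.
# A brick without the lock asks here first; the server either grants it or
# reports (and optionally breaks) the current holder.

class LockServer:
    def __init__(self):
        self.holders = {}                     # path -> name of brick holding the lock

    def acquire(self, path, brick, steal=False):
        """Return True if `brick` now holds the lock on `path`."""
        holder = self.holders.get(path)
        if holder is None or holder == brick:
            self.holders[path] = brick        # free, or already ours: grant it
            return True
        if steal:                             # break the lock; holder must revalidate
            self.holders[path] = brick
            return True
        return False                          # caller should proxy to `holder` instead

    def holder_of(self, path):
        return self.holders.get(path)

    def release(self, path, brick):
        if self.holders.get(path) == brick:
            del self.holders[path]


if __name__ == "__main__":
    ls = LockServer()
    print(ls.acquire("/A", "brick1"))              # True: brick1 now owns /A
    print(ls.acquire("/A", "brick2"))              # False: brick2 should proxy to brick1
    print(ls.holder_of("/A"))                      # brick1
    print(ls.acquire("/A", "brick2", steal=True))  # True: lock broken and moved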

Really this is not so different to what is there today; it's simply an efficiency improvement, because we don't need to touch *every* brick for *every* file access. Instead we make some network requests on first access to a file and can then continue to touch that file for a period afterwards without needing further network traffic to the other bricks.

However, whilst some kind of implementation of the above could offer a huge performance speedup for many of the situations which come up on the mailing list, the issue is that the lock server becomes a) a bottleneck and b) a single point of failure. So the chain of thought almost certainly goes something like:

- Make the gluster bricks become the lock servers, ie they negotiate amongst themselves. Really this is roughly what happens right now, only it happens on every access, rather than access being "sticky" once acquired
- Now analyse all the corner cases where bricks go down while holding locks, or get segmented while holding/acquiring locks, and discover some tricky issues... (one common mitigation is sketched below)
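
On the first corner case, one common mitigation (an assumption on my part rather than anything Gluster currently does) is to make every lock a time-bounded lease, so that a lock held by a brick which has died simply expires. A toy sketch:

import time

# Locks as time-bounded leases: if the brick holding a lock goes down, other
# bricks can reclaim the file once the lease runs out instead of blocking
# forever.  LeaseTable and its TTL are illustrative assumptions only.

class LeaseTable:
    def __init__(self, ttl=10.0):
        self.ttl = ttl
        self.leases = {}                       # path -> (brick name, expiry time)

    def acquire(self, path, brick, now=None):
        now = time.monotonic() if now is None else now
        holder, expiry = self.leases.get(path, (None, 0.0))
        if holder in (None, brick) or expiry <= now:   # free, ours, or expired
            self.leases[path] = (brick, now + self.ttl)
            return True
        return False                           # still validly held by another brick


if __name__ == "__main__":
    t = LeaseTable(ttl=5.0)
    print(t.acquire("/A", "brick1", now=0.0))   # True: brick1 holds the lease
    print(t.acquire("/A", "brick2", now=1.0))   # False: brick1's lease still valid
    print(t.acquire("/A", "brick2", now=6.0))   # True: brick1 presumed dead, lease expired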

Paxos seems like a clever way of making the locking distributed without every node necessarily having a 100% consistent view of who owns which lock. By introducing a voting method it can remain robust in the face of failed machines, and new machines can be added without needing to store reliable state information (or at least this is true with the improvements described in the articles).
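
Just to show the voting intuition, here's a minimal single-decree Paxos round in Python, with the agreed value being which brick owns the lock on a file. It's only the textbook prepare/accept exchange with invented names; no retries, no real networking, no multi-Paxos:

# Minimal single-decree Paxos sketch, the agreed value being "which brick owns
# the lock on file X".  It shows the basic prepare/accept voting that lets a
# majority of surviving bricks agree even if a minority is down.

class Acceptor:
    def __init__(self):
        self.promised = -1           # highest proposal number we promised to honour
        self.accepted = (-1, None)   # (proposal number, value) we last accepted

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted          # promise, plus any prior acceptance
        return False, self.accepted

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Try to get `value` chosen under proposal number `n`; return the chosen value."""
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) <= len(acceptors) // 2:
        return None                              # no majority promised: retry with higher n
    # Paxos rule: if any acceptor already accepted something, adopt the value
    # with the highest proposal number instead of our own.
    prior_n, prior_value = max(granted, key=lambda acc: acc[0])
    if prior_value is not None:
        value = prior_value
    votes = sum(a.accept(n, value) for a in acceptors)
    return value if votes > len(acceptors) // 2 else None


if __name__ == "__main__":
    bricks = [Acceptor() for _ in range(3)]      # three bricks voting on lock ownership
    print(propose(bricks, n=1, value="brick1 owns lock on /A"))
    # A later, competing proposal learns the already-chosen owner rather than
    # overwriting it:
    print(propose(bricks, n=2, value="brick2 owns lock on /A"))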



Does that make sense? Apologies if the above is long-winded, but the point is really that the performance improvements come from pushing locks around between bricks, and this is probably distinct from client-level locking such as nfs/cifs/posix locking.

For advanced cluster filesystems such as GFS2, the general "optimistic locking" technique appears to show massive speed improvements (for many access patterns) and it's also likely to do so in Gluster. Really my original email jumped two steps and suggested an improved form of distributed locking, which itself could be used as the actual implementation, but other forms of distributed locking between bricks would be highly desirable also.

Thanks for listening

Ed W


