Re: [Gluster-devel] Improving real world performance by moving files clo

gluster-devel

From:	Derek Price
Subject:	Re: [Gluster-devel] Improving real world performance by moving files closer to their target workloads
Date:	Fri, 16 May 2008 11:44:16 -0400
User-agent:	Thunderbird 2.0.0.14 (Windows/20080421)

I mostly agree with you.  A few additional points are inlined below.

address@hidden wrote:

On Fri, 16 May 2008, Derek Price wrote:
address@hidden wrote:
Isn't that effectively the same thing? Unless there is quorum, DLM locks out the entire FS (it also does this when a node dies, until it gets definitive confirmation that it has been successfully fenced). For normal file I/O all nodes in the cluster have to acknowledge a lock before it can be granted.
Why? It requires a meta-data cache, but as long as every node in the quorum stores a given file's most recent revision # when any lock is granted, even if it doesn't actually sync the file data, then any quorum should be able to agree on what the version number of the most up-to-date copy of a file is. All nodes are required to report only if you assume that any given file has a small number of "owners" and that the querier doesn't know who the owner is.
That's to do with file versioning, not locking, though. What am I missing?

I'm assuming that versioning and locking can and should be combined. You've admitted the necessity for keeping copies of files synchronized and IO is always going to require some sort of lock to accomplish this. By having the quorum remain aware of what the most recent version of a given file is, whether that file is locked, and perhaps where copies of the file reside, you could reduce the number of nodes that must be consulted when a lock is needed.
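To make the quorum argument concrete, here is a minimal sketch of how a querier might derive the latest version from a majority of replies. It assumes, as described above, that every node in the quorum that granted a lock recorded the new revision number; the function name and the reply format (node name mapped to revision number) are hypothetical. Since any two majorities share at least one node, the highest revision reported by a majority is the current one.

```python
from typing import Dict, Optional


def latest_version(replies: Dict[str, int], cluster_size: int) -> Optional[int]:
    """Return the newest revision number known to a quorum.

    replies maps a (hypothetical) node name to the revision number that node
    has recorded for the file. Returns None when fewer than a majority of the
    cluster answered, because the result could then be stale.
    """
    quorum = cluster_size // 2 + 1
    if len(replies) < quorum:
        return None
    # Any two majorities overlap, so at least one replying node saw the
    # most recent lock grant and therefore the most recent revision.
    return max(replies.values())
```

With five nodes, three replies are enough: the two silent nodes cannot hold a newer revision that none of the three has heard of.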

I think you will also speed things up if you don't have to consult all nodes for every IO operation. If all available nodes must be consulted, then you introduce an implicit wait until a specified timeout for every IO request if any single node is down. With the quorum model, even before fencing takes place, almost half the nodes can go incommunicado and the rest can operate as efficiently as they did with all nodes in service.

If some HA and fault-tolerant DHT implementation exists that already handles atomic hash inserts with recognizable failures for keys that already exist, then perhaps that could take the place of DLM's quorum model, but I think any algorithm that requires contacting all nodes will prove to be a bad idea in the end.

To remain fault tolerant, this requires that servers make some effort to stay up-to-date with the meta-data cache, but maybe this could be dealt with efficiently with the DHT someone else brought up?
I'm not sure that so much metadata caching is actually necessary. If a file open brings the file to the local machine (this cannot be guaranteed because the local machine may be out of space, and it may be unable to free space by expunging an old file due to that file not being redundant enough in the network), then the metadata of that file, being attached to the file, is implicitly "cached". But this isn't really caching at all - it's migration.
The algorithm for opening a file might be as follows:
1) node broadcasts/multicasts an open request to all peers
2) peers that have the file available respond with the metadata (size, version, etc) they have and possibly their current load (to assist with load balancing by fetching the file from the least loaded peer)
3.1) if the file is available locally, agree a lock with other nodes, and use it.
3.2) if the file is not available locally, but there is enough space, fetch it and do 3.1)
3.3) if there isn't enough space locally to fetch the file, see if enough space can be freed. If this succeeds, do 3.2)
3.4) if space cannot be freed, use the file remotely from the least loaded peer.
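The decision part of the open steps above can be sketched roughly as follows. This is only an illustration: `PeerReply`, `plan_open`, and the action names are hypothetical, the broadcast itself is left out, and restricting the fetch source to peers holding the newest reported version is an added assumption rather than something the steps spell out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PeerReply:
    peer: str       # hypothetical peer identifier
    version: int    # file version reported by the peer
    load: float     # peer's current load, lower is better


def plan_open(replies: List[PeerReply], has_local_copy: bool, file_size: int,
              free_space: int, reclaimable_space: int) -> Tuple[str, Optional[str]]:
    """Choose an action for an open request, following steps 3.1 to 3.4."""
    if has_local_copy:
        return ("lock_and_use_local", None)             # step 3.1
    if not replies:
        return ("not_found", None)                      # no peer has the file
    # Assumption: only peers with the newest reported version are candidates.
    newest = max(r.version for r in replies)
    candidates = [r for r in replies if r.version == newest]
    source = min(candidates, key=lambda r: r.load).peer
    if file_size <= free_space:
        return ("fetch_then_lock", source)              # step 3.2
    if file_size <= free_space + reclaimable_space:
        return ("free_space_then_fetch", source)        # step 3.3
    return ("use_remote", source)                       # step 3.4
```

Here `reclaimable_space` stands for whatever the expunge check reports as safe to free; how that figure is obtained is outside this sketch.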
Expunging algorithm would be similar - broadcast a file status request (similar to 1) above). If enough nodes respond with the latest version of the file (set some threshold depending on how much redundancy is required), the local file can be removed and the space freed for a file that is more useful locally. This shouldn't really happen until the local data store starts to get full.
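The expunge check described here might look something like the sketch below. The function name and arguments are hypothetical; `min_copies` plays the role of the redundancy threshold, and the status replies are assumed to arrive as a mapping from peer name to the version that peer holds.

```python
from typing import Dict


def can_expunge(local_version: int, peer_versions: Dict[str, int],
                min_copies: int) -> bool:
    """Return True if the local copy can be dropped without losing redundancy.

    peer_versions maps a (hypothetical) peer name to the version it holds.
    The local copy is only expendable when at least min_copies other nodes
    hold the latest version of the file.
    """
    latest = max([local_version, *peer_versions.values()])
    remote_current = sum(1 for v in peer_versions.values() if v == latest)
    return remote_current >= min_copies
```

Note that a stale local copy is always expendable under this rule as long as enough peers hold the newer version, while an up-to-date local copy stays put if it is one of too few current copies.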

I might optimize the expunge algorithm slightly by having nodes with low loads volunteer to copy files that otherwise couldn't be expunged from a node. Better yet, perhaps, would be a background process that runs on lightly loaded nodes and tries to create additional redundant copies at some configurable tolerance beyond the "minimum # of copies" threshold. If copies beyond the minimum are only created on file access, then a heavily loaded node could quickly fill up its own disk with all the "redundant" copies of files and have to start relying on remote access, further bogging down the busy node.

Locking could be handled somewhat lazily - a lock request gets broadcast and as long as quorum peers respond, and there are no peers saying "no, I have that lock!", the lock can be granted. A lock can have TTL (in case a node dies while holding a lock), and the refresh should be expected if the node expects to keep the lock. This could be used to speed up locking (each node would have a list of currently valid locks, without the need to check explicitly, for example - it would only need to broadcast a lock-request when it looks like the lock can be granted).
For file delta writes, an AFR type mechanism could be used to send the deltas to all the nodes that have the file. This could all get quite tricky, because it might require a separate multicast group to be set up for up to every node combination subset, in order to keep the network bandwidth down (or you'd just end up broadcasting to all nodes, which means things wouldn't scale as switches should, it'd be more like using hubs).
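A rough sketch of the lazy locking idea, with a local table of known locks and a TTL on each entry. Everything here is hypothetical (`LockTable`, `try_lock`, the vote format), and counting the requesting node toward the quorum is an assumption. Time is passed in explicitly so the expiry logic is easy to follow.

```python
from typing import Dict, Optional, Tuple


class LockTable:
    """Each node's local view of currently valid locks."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.locks: Dict[str, Tuple[str, float]] = {}  # path -> (holder, expiry)

    def holder(self, path: str, now: float) -> Optional[str]:
        entry = self.locks.get(path)
        if entry is None or entry[1] <= now:
            return None  # never locked, or the TTL ran out (holder may be dead)
        return entry[0]

    def record(self, path: str, holder: str, now: float) -> None:
        # Recording again for the same holder acts as the TTL refresh.
        self.locks[path] = (holder, now + self.ttl)


def try_lock(table: LockTable, path: str, me: str, votes: Dict[str, bool],
             cluster_size: int, now: float) -> bool:
    """votes maps peer -> True (no objection) or False ("I have that lock")."""
    if table.holder(path, now) not in (None, me):
        return False  # known to be held elsewhere: no broadcast needed
    if any(ok is False for ok in votes.values()):
        return False  # some peer claims the lock
    quorum = cluster_size // 2 + 1
    if len(votes) + 1 < quorum:  # assumption: the requester counts itself
        return False
    table.record(path, me, now)
    return True
```

In a real node the `votes` would come from the broadcast replies, and the first check is what lets a node skip the broadcast entirely when its own table already shows a valid lock held by someone else.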
This would potentially have the problem that there are only 24 bits of IP multicast address space, but that should provide enough groups with sensible redundancy levels to cover all node combinations. This may or may not be way OTT complicated, though. There is probably a simpler and more sane solution.
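As a back-of-the-envelope check on that claim, the number of distinct groups needed when each file lives on exactly k of n nodes is the binomial coefficient C(n, k). The sketch below compares that with the 24-bit figure quoted above, which is taken as given rather than verified; the function names are hypothetical and `math.comb` needs Python 3.8 or later.

```python
from math import comb


def groups_needed(nodes: int, copies: int) -> int:
    """Number of distinct node subsets of size `copies` among `nodes` nodes."""
    return comb(nodes, copies)


def fits_in_address_space(nodes: int, copies: int, address_bits: int = 24) -> bool:
    """True if one multicast group per subset fits in 2**address_bits groups.

    address_bits defaults to the 24-bit figure from the message above.
    """
    return groups_needed(nodes, copies) <= 2 ** address_bits
```

For example, 100 nodes with 3 copies per file need 161,700 groups, which fits comfortably, while 1,000 nodes with 3 copies need 166,167,000, which exceeds 2**24 = 16,777,216.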

I'm not sure what overhead is involved in creating multicast groups, but they would only be required for files currently locked for write, so perhaps creating and discarding the multicast groups could be done in conjunction with creation and release of write locks.

It's also possible that you could reduce the complexity of this problem by simply discarding as many copies down to as close to the minimum # as other nodes will allow, on write. However, I think that might reduce some of the performance benefits this design otherwise gives each node. Perhaps there are some useful ideas on how to perform this complex synchronization already in the design of P2P file transfer networks? What would that be, something like implicit striping based on the locations of valid redundant copies/deltas?


Derek
--
Derek R. Price
Solutions Architect
Ximbiot, LLC <http://ximbiot.com>
Get CVS and Subversion Support from Ximbiot!

v: +1 248.835.1260
f: +1 248.246.1176
