From: Xavier Hernandez
Subject: [Gluster-devel] [RFC] A new caching/synchronization mechanism to speed up gluster
Date: Tue, 04 Feb 2014 10:07:22 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

Hi,

Currently, inodelk() and entrylk() are being used to make sure that changes happen synchronously on all bricks, avoiding data/metadata corruption when multiple clients modify the same inode concurrently. So far so good; however, I think this introduces a significant overhead to avoid a situation that will happen very rarely. It also limits the advantage of client-side caches.

I propose to implement a new translator that uses a MESI-like protocol (the protocol used to maintain memory coherency between the local caches of CPU cores). This translator would add virtually zero overhead when only one client is accessing a given inode, and an overhead comparable to the current implementation when there is contention.

Another advantage of this protocol is that it would make it possible to implement much more aggressive caching mechanisms on the client side, improving overall performance without losing any current features.

At a high level this is how it could work:

Each client tracks the state of each inode it uses (M - Modified, E - Exclusive, S - Shared, I - Invalid). All inodes start in the invalid state. When the client needs to write the inode, it asks all bricks for exclusive access. Once granted, the inode is in the exclusive state and any read/write operation can be performed locally on the client side, because it knows that nobody else will be modifying the inode. If the inode is successfully written (in the local cache), the state changes to modified. Eventually the changes are sent to the bricks in the background and the state goes back to exclusive, or to invalid if the inode is not needed anymore.
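To make the per-client transitions concrete, here is a minimal sketch of that state machine in Python. All names (ClientInode, acquire_exclusive, flush) are illustrative assumptions, not GlusterFS code, and the brick round-trip is assumed to always succeed:

```python
from enum import Enum

class State(Enum):
    INVALID = "I"    # no valid local copy
    SHARED = "S"     # read-only copy, possibly shared with other clients
    EXCLUSIVE = "E"  # sole clean copy; local reads/writes are safe
    MODIFIED = "M"   # sole copy with pending changes not yet on the bricks

class ClientInode:
    """Per-inode state as tracked on one client (sketch only)."""

    def __init__(self):
        self.state = State.INVALID  # all inodes are created invalid

    def acquire_exclusive(self):
        # Hypothetical request to all bricks; assumed granted here.
        self.state = State.EXCLUSIVE

    def write_local(self, data):
        # A local write is only legal once exclusive access is held.
        if self.state not in (State.EXCLUSIVE, State.MODIFIED):
            self.acquire_exclusive()
        self.state = State.MODIFIED  # dirty in the local cache

    def flush(self, still_needed=True):
        # Background write-back of pending changes to the bricks.
        if self.state == State.MODIFIED:
            self.state = State.EXCLUSIVE if still_needed else State.INVALID
```

A client would thus go I → E on the first write request, E → M after the local write, and M → E (or M → I) once the background flush completes.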

Now, if another client needs to read or write the same inode, it will send a request to all bricks. If the inode is in the exclusive or modified state on one of the clients, the bricks will notify the current owner of the inode to flush all pending changes. Once that completes, the new client will be granted exclusive (for a write request) or shared (for a read request) access to the inode. The former owner will leave the inode in the invalid state (for a write request) or the shared state (for a read request).

Multiple clients can read a shared inode simultaneously; however, if one client needs exclusive access to the inode, all other clients must set the inode's state to invalid before exclusive access is granted.
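The brick-side arbitration described in the last two paragraphs could look roughly like the following sketch. This is a simplified single-brick model under invented names (BrickArbiter, request_shared, request_exclusive); it only records which flush/invalidate notifications would be sent, rather than performing real network callbacks:

```python
class BrickArbiter:
    """Sketch of how a brick could arbitrate inode access between clients.

    All names and structure here are illustrative assumptions,
    not actual GlusterFS code.
    """

    def __init__(self):
        self.owner = None        # client holding exclusive/modified access
        self.readers = set()     # clients holding shared access
        self.notified = []       # (action, client) notifications sent

    def request_shared(self, client):
        if self.owner is not None and self.owner != client:
            # Current owner must flush pending changes, then drops to shared.
            self.notified.append(("flush", self.owner))
            self.readers.add(self.owner)
            self.owner = None
        self.readers.add(client)

    def request_exclusive(self, client):
        if self.owner is not None and self.owner != client:
            # Current owner flushes, and its copy becomes invalid.
            self.notified.append(("flush", self.owner))
            self.owner = None
        # Every other shared copy must be invalidated before granting.
        for r in self.readers:
            if r != client:
                self.notified.append(("invalidate", r))
        self.readers.clear()
        self.owner = client
```

For example, if client c1 holds exclusive access and c2 issues a read, the brick tells c1 to flush and both end up in the shared state; a later exclusive request from c3 invalidates both shared copies.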

The only synchronization point needed is to make sure that all bricks agree on the inode state and which client owns it. This can be achieved without locking using a method similar to what I implemented in the DFC translator.

Besides the lock-less architecture, the main advantage is that much more aggressive caching strategies can be implemented very near to the final user, considerably increasing the throughput of the file system. Special care has to be taken with things that can fail on background writes (basically brick space and user access rights). Those should be handled appropriately on the client side to guarantee the future success of writes.

Of course, this is only a high-level overview. A deeper analysis should be done to see what to do in each special case.

What do you think?

Xavi



