Re: [Gluster-devel] Gluster driver for Archipelago - Development process


From: Vijay Bellur
Subject: Re: [Gluster-devel] Gluster driver for Archipelago - Development process
Date: Wed, 04 Dec 2013 11:59:05 +0530
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.0

On 12/04/2013 06:45 AM, Anand Avati wrote:
On Tue, Dec 3, 2013 at 7:34 AM, Vijay Bellur <address@hidden> wrote:

    Adding gluster-devel as there is a good amount of detail on the
    ongoing integration with Archipelago.

                1. There are no async operations for
                open/create/close/stat/unlink, which are necessary for
                various operations of Archipelago.


            Is there more description on how various operations of
            Archipelago rely on async operations for open/create etc.?
            I must admit that I haven't gone through your code but will
            definitely do so to get a better understanding.


        Sure, I'll explain our rationale, but first let me provide some
        insight into the fundamental logic of Archipelago, to give the
        context in which we operate:

        An Archipelago volume is a COW volume, consisting of many contiguous
        pieces (volume chunks), typically 4MB in size. It is COW since it
        may share read-only chunks with other volumes (e.g. if the volumes
        are created from the same OS image) and creates new chunks to write
        to them. In order to refer to a chunk, we assign a name to it (e.g.
        volume1_0002) which can be considered as the object name (Rados) or
        file name (Gluster, NFS).

        The above logic is handled by separate Archipelago entities
        (mappers, volume composers). This means that the storage driver’s
        only task is to read/write chunks from and to the storage. Also,
        since there is one such driver per host, where 50 VMs can be
        running, it must handle a lot of chunks.

        Now, back to our storage driver and the need for fully asynchronous
        operation. When it receives a read/write request for a chunk, it
        will generally need to open the file, create it if it doesn’t
        exist, perform the I/O and finally close the file. Having
        non-blocking read/write but blocking open/create/close essentially
        makes the whole request a blocking request. This means that if the
        driver supports e.g. 64 in-flight requests, it needs to have 64
        threads to be able to manage all of them.
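
To illustrate the threading argument above, here is a minimal Python sketch (not Archipelago or gfapi code). It assumes every step of a chunk request (open, I/O, close) can be awaited without blocking; `fake_async_step` and `handle_chunk_request` are hypothetical names, and `asyncio.sleep` merely stands in for the storage round trips. Under that assumption, a single thread can keep all 64 requests in flight.

```python
import asyncio
import threading

in_flight = 0
peak_in_flight = 0
threads_used = set()


async def fake_async_step(delay=0.01):
    # Stand-in (assumption) for a non-blocking open/create/read/write/close.
    await asyncio.sleep(delay)


async def handle_chunk_request(name):
    """One chunk request: open (create if missing), do the I/O, close."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    threads_used.add(threading.get_ident())
    await fake_async_step()  # open, creating the chunk file if needed
    await fake_async_step()  # read or write
    await fake_async_step()  # close
    in_flight -= 1
    return name


async def main(n=64):
    names = ["volume1_%04d" % i for i in range(n)]
    return await asyncio.gather(*(handle_chunk_request(x) for x in names))


results = asyncio.run(main())
print(len(results), peak_in_flight, len(threads_used))
```

If any one of the three steps were a blocking call instead, the thread would stall inside it and the remaining requests could not make progress, which is why the blocking case needs one thread per in-flight request.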


    open/create/close are not completely synchronous in gluster with
    open-behind and write-behind translators loaded in the client side
    graph. open-behind and write-behind translators by default are part
    of the client side graph in GlusterFS. With open-behind translator,
    open() system call is short circuited by GlusterFS and the actual
    open happens in the background. For create, an open with O_CREAT |
    O_EXCL flags would be handled by open-behind. Similarly an actual
    close is done in the background by the write-behind translator. As
    such, Archipelago should not experience significant latency with
    open & close operations.


A create (O_CREAT with or without O_EXCL) is currently not handled by
open-behind and will always be synchronous with a network round trip.

My bad that I overlooked this behavior for create. Alex - would it work if we were to invoke a callback with user provided context for asynchronous creates from libgfapi?


An open() on an existing file is short-circuited by open-behind. But even
that might not be sufficient because the path resolver (lookup() etc.)
always works synchronously with network round trips for path based
operations. We will need to design new APIs for true async path based
operations (with path resolver also executed asynchronously).

True asynchronous behavior would be good to have. If tests with Archipelago do not show significant latency for open operations, we can possibly defer this to phase 2 of our integration and go ahead with the existing open-behind implementation for now.


        Let’s assume that open/create/close are indeed non-blocking or
        virtually nonexistent [1]. Most importantly, this would greatly
        reduce the read/write latency, especially for 4k requests. Another
        benefit is the ability to use a much smaller number of threads.
        However, besides read/write, there are also other operations that
        the driver must support, such as stat()ing or deleting a file. If
        these operations are blocking, a single stray delete or stat can
        stall our driver. Once more, it needs to have a lot of threads to
        be operational.


Currently stat() and unlink() are synchronous too.

Note that internally all the operations in gluster are asynchronous in
their true nature. gfapi provides "convenience wrappers" for these calls
in a synchronous way. It is trivial to expose the asynchronous calls
through gfapi, but we haven't done so for the path based operations
because there hasn't been a need thus far. And without even a single
consumer, we did not want to reason about the semantics of async path calls.

Do you have an example driver/header which shows the ideal behavior for
the async path based calls? Will you provide a context pointer and
expect to receive it in the callback? Or do you expect the API to return
a stub for the async call dispatch and poll on it?
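
The two API shapes asked about above can be sketched as follows. This is an illustration in Python, not gfapi: `stat_async_cb` and `stat_async_stub` are hypothetical names, a local thread pool stands in for the asynchronous dispatch, and `os.stat` on a temporary file stands in for the path based call.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=2)


def stat_async_cb(path, callback, context):
    """Style 1: the caller supplies a context and gets it back in the callback."""
    def run():
        try:
            result, err = os.stat(path), None
        except OSError as e:
            result, err = None, e
        callback(result, err, context)
    # The future is returned only so that this demo can wait for completion.
    return _pool.submit(run)


def stat_async_stub(path):
    """Style 2: the call returns a stub (here a Future) that the caller polls."""
    return _pool.submit(os.stat, path)


tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"hello")
tmp.close()

completed = []


def on_stat(result, err, context):
    size = result.st_size if err is None else None
    completed.append((context["request_id"], size))


stat_async_cb(tmp.name, on_stat, {"request_id": 7}).result()

stub = stat_async_stub(tmp.name)
while not stub.done():  # poll the stub until the call has finished
    time.sleep(0.001)
stub_size = stub.result().st_size

os.unlink(tmp.name)
_pool.shutdown()
```

In the first style the driver has to keep the per-request state reachable through the context; in the second it has to track the outstanding stubs and decide when to poll them.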

                2. There is no way to create notifications on a file (as
                Rados can with its objects).


            How are these notifications consumed?


        They are consumed by the lock/unlock operations that are also
        handled by our driver. For instance, the Rados driver can wait
        asynchronously for someone to unlock an object by registering a
        watch on the object and a callback function. Conversely, the unlock
        operation makes sure to send a notification to all watchers of the
        object. Thus, the lock/unlock operation can happen asynchronously
        [2].

        I have read that Gluster supports POSIX locks, but this is not the
        locking scheme we have in mind. We need a persistent type of lock
        that would stay on a file even if the process closed the file
        descriptor or, worse, crashed.


    How would you recover if a process that held the lock goes away
    forever? We can provide an option to make this kind of behavior
    possible with posix-locks translator.

        Our current solution is to create a “lock file”, e.g.
        “volume1_0002_lock”, with the owner name written in it. Thus, the
        lock/unlock operations generally happen as follows:

        a) Lock: Try to exclusively create a lock file. If successful,
        write the owner id to it. If not, sleep for 1 second and retry.
        b) Unlock: Read a lock file and its owner. If we are the owner,
        delete it. Else, fail.

        As you can see, this is not an elegant approach and is subject to
        race conditions. If Gluster can provide a better solution, we would
        be more than happy to know about it.
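
The lock-file scheme in steps a) and b) above can be sketched roughly as follows, using plain POSIX file calls from Python on a local directory rather than the actual Archipelago driver. The function names, the `retry_delay` parameter and the `max_tries` cut-off are assumptions added so the example terminates; the description above simply retries every second.

```python
import os
import tempfile
import time


def lock(directory, chunk, owner, retry_delay=1.0, max_tries=None):
    """a) Exclusively create <chunk>_lock and write the owner id into it."""
    path = os.path.join(directory, chunk + "_lock")
    tries = 0
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            tries += 1
            if max_tries is not None and tries >= max_tries:
                return False
            time.sleep(retry_delay)
            continue
        # Race: until this write lands, readers see a lock with no owner.
        with os.fdopen(fd, "w") as f:
            f.write(owner)
        return True


def unlock(directory, chunk, owner):
    """b) Read the lock file; delete it only if we are the owner."""
    path = os.path.join(directory, chunk + "_lock")
    try:
        with open(path) as f:
            current = f.read()
    except FileNotFoundError:
        return False
    if current != owner:
        return False
    # Race: the check above and this unlink are not one atomic step.
    os.unlink(path)
    return True


demo_dir = tempfile.mkdtemp()
a_locked = lock(demo_dir, "volume1_0002", "owner-a")
b_locked = lock(demo_dir, "volume1_0002", "owner-b",
                retry_delay=0.01, max_tries=2)
b_unlocked = unlock(demo_dir, "volume1_0002", "owner-b")
a_unlocked = unlock(demo_dir, "volume1_0002", "owner-a")
b_retry = lock(demo_dir, "volume1_0002", "owner-b", max_tries=1)
```

The lock survives a crash of its holder because it is just a file, which is also why a holder that never comes back leaves the chunk locked until someone removes the file by hand.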


Currently our locking API exposed through gfapi is POSIX-like
(synchronous, non-persistent). We have other internal locking mechanisms
(entry-lk for locking an abstract name/string in a directory and
inode-lk - nested range locks in a file) which are currently not exposed
through gfapi. But these are not persistent either. Providing an async API
version of these calls is not hard (if we have a good understanding
about things like whether the caller provides a context pointer or
expects a call specific stub from the API etc). However I'm not sure
(yet) how to provide persistent locks (and what the right behavior
should be in the event of crash of the caller). You may be able to
simulate (somewhat) persistent locks in the driver using a combination
of sync/async locking APIs + xattrs.
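
One possible reading of the "locking APIs + xattrs" suggestion above, sketched in Python with in-memory stand-ins only: a `threading.Lock` plays the role of the short-lived, non-persistent lock, and a dict plays the role of the file's extended attributes. `FakeFile`, `OWNER_KEY` and both functions are hypothetical and do not correspond to any gfapi call.

```python
import threading


class FakeFile:
    """Stand-in for a chunk file: a short-lived lock plus an xattr store."""

    def __init__(self):
        self.short_lock = threading.Lock()  # non-persistent lock (assumption)
        self.xattrs = {}                    # extended attributes (assumption)


OWNER_KEY = "user.archipelago.lock_owner"  # hypothetical xattr name


def persistent_lock(f, owner):
    # Hold the short-lived lock only while inspecting and updating the xattr.
    with f.short_lock:
        if OWNER_KEY in f.xattrs:
            return False
        f.xattrs[OWNER_KEY] = owner
        return True


def persistent_unlock(f, owner):
    with f.short_lock:
        if f.xattrs.get(OWNER_KEY) != owner:
            return False
        del f.xattrs[OWNER_KEY]
        return True


chunk = FakeFile()
got_a = persistent_lock(chunk, "host-a")
got_b = persistent_lock(chunk, "host-b")
rel_b = persistent_unlock(chunk, "host-b")
rel_a = persistent_unlock(chunk, "host-a")
got_b_again = persistent_lock(chunk, "host-b")
```

The ownership record lives in the attribute rather than in the lock itself, so it would outlive the file descriptor of the process that set it. The sketch does not answer the open question of what should happen when the recorded owner never returns.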

The requirement from Archipelago seems to necessitate avoiding clean up of locks during release/flush/disconnect. There are some complexities that we will run into if we want to provide this behavior. Understanding the lock recovery and cleanup semantics better will help us in determining the right way out here.

Regards,
Vijay



