
Re: [Gluster-devel] Barrier design issues wrt volume snapshot


From: Anand Avati
Subject: Re: [Gluster-devel] Barrier design issues wrt volume snapshot
Date: Thu, 6 Mar 2014 11:44:49 -0800




On Thu, Mar 6, 2014 at 11:19 AM, Krishnan Parthasarathi <address@hidden> wrote:


----- Original Message -----
> On Thu, Mar 6, 2014 at 12:21 AM, Vijay Bellur <address@hidden> wrote:
>
> > Adding gluster-devel.
> >
> >
> > On 03/06/2014 01:15 PM, Krishnan Parthasarathi wrote:
> >
> >> All,
> >>
> >> In recent discussions around the design (and implementation) of the
> >> barrier feature, a couple of things came to light.
> >>
> >> 1) The changelog xlator needs the barrier xlator to block unlink and
> >>    rename FOPs in the call path, in addition to the current list of
> >>    FOPs that are blocked in their callback path. This ensures that,
> >>    from the time barriering is enabled, the changelog has a bounded
> >>    queue of unlink and rename FOPs to be drained, committed to the
> >>    changelog file, and published.
> >>
> >
> Why is this necessary?

Georeplication, the only consumer of changelog today, cannot tolerate missing unlink/rename
entries in the changelog, even during the initial xsync-based crawl that runs until changelog
entries become available for the master volume.
So the changelog xlator needs to ensure that the last rotated
(publishable) changelog has entries for all the unlink(s)/rename(s) that made
it into the snapshot. For this, changelog needs the barrier xlator to block unlink/rename
FOPs in the call path too. Hope that helps.

This sounds like a very changelog-specific requirement, so it is best addressed in the changelog translator itself. If unlink/rmdir/rename FOPs must not be "in progress" during a snapshot, then we need to hold off new ops in the call path and trigger a log rotation; the rotation has to wait for completion of the ongoing FOPs anyway.
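The scheme described above (bar new call-path FOPs, drain the in-flight ones, then rotate) can be sketched roughly as follows. This is a minimal illustrative sketch, not the actual changelog translator code; the type and function names (`clog_barrier_t`, `fop_enter`, etc.) are hypothetical.

```c
/* Hypothetical sketch of a changelog-internal barrier: on snapshot,
 * block NEW unlink/rename calls in the call path, wait for in-flight
 * ones to complete, then rotate the changelog. Not GlusterFS code. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  drained;   /* signalled when in_flight hits 0  */
    pthread_cond_t  released;  /* signalled when the barrier drops */
    int             in_flight; /* unlink/rename FOPs in progress   */
    bool            barred;    /* true while a rotation is pending */
} clog_barrier_t;

/* Call path: new unlink/rename FOPs wait while the barrier is up. */
static void fop_enter(clog_barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    while (b->barred)
        pthread_cond_wait(&b->released, &b->lock);
    b->in_flight++;
    pthread_mutex_unlock(&b->lock);
}

/* Callback path: the last completing FOP wakes the rotator. */
static void fop_exit(clog_barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    if (--b->in_flight == 0)
        pthread_cond_broadcast(&b->drained);
    pthread_mutex_unlock(&b->lock);
}

/* Snapshot path: bar new FOPs, drain ongoing ones, rotate, release. */
static void rotate_changelog(clog_barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    b->barred = true;
    while (b->in_flight > 0)
        pthread_cond_wait(&b->drained, &b->lock);
    /* ... fsync and rename the changelog file here ... */
    b->barred = false;
    pthread_cond_broadcast(&b->released);
    pthread_mutex_unlock(&b->lock);
}
```

The key property is that `rotate_changelog` returns only after every unlink/rename that entered before the barrier has completed, so the rotated (publishable) changelog contains exactly the entries that made it into the snapshot.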
 

>
>
> >> 2) It is possible in a pure distribute volume that the following
> >>    sequence of FOPs results in the snapshots of bricks disagreeing
> >>    on the inode type for a file or directory:
> >>
> >>    t1: snap b1
> >>    t2: unlink /a
> >>    t3: mkdir /a
> >>    t4: snap b2
> >>
> >> where b1 and b2 are bricks of a pure distribute volume V.
> >>
> >> The above sequence can happen with the current barrier xlator design,
> >> since we allow unlink FOPs to go through to disk and only block their
> >> acknowledgement to the application. This implies a concurrent mkdir
> >> on the same name could succeed, since DHT does not serialize unlink
> >> and mkdir FOPs, unlike AFR.
> >>
> >> Avati,
> >>
> >> I hear that you have a solution for problem 2). Could you please
> >> start the discussion on this thread? It would help us decide how to
> >> go about the barrier xlator implementation.
> >
>
> The solution is really a long-pending implementation of dentry
> serialization in the resolver of the protocol server. Today we allow
> multiple FOPs that modify the same dentry to proceed in parallel. This
> results in hairy races (including the non-atomicity of rename) and has
> been an open issue for a while now. Implementing dentry serialization
> in the resolver will "solve" 2) as a side effect, so that is a better
> approach than making changes in the barrier translator.
>

I am not sure I understood how this works from the brief introduction above.
Could you explain a bit?

By dentry serialization, I mean that only one operation should be modifying a given <pargfid>/bname at a time. This needs changes in the resolver of the protocol server, and possibly some changes in the inode table. It is really about solving rare races, and I think it is something we need to work on independently of the snapshot requirements.
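One way to picture per-dentry serialization is a lock keyed on (pargfid, basename): every FOP that modifies <pargfid>/bname must hold that dentry's lock, so an unlink and a mkdir on the same name can never overlap, closing the race in 2). This is an illustrative sketch only, assuming a fixed hash-bucketed lock table; it is not the real resolver code and all names here are hypothetical.

```c
/* Hypothetical per-dentry lock table: hash (pargfid, bname) to one of
 * a fixed set of mutexes. Any FOP modifying that dentry takes the
 * slot lock first, serializing e.g. unlink vs. mkdir on one name. */
#include <pthread.h>
#include <stdint.h>

#define DENTRY_SLOTS 1024

static pthread_mutex_t dentry_locks[DENTRY_SLOTS];

/* Initialize the table once at startup. */
static void dentry_locks_init(void)
{
    for (int i = 0; i < DENTRY_SLOTS; i++)
        pthread_mutex_init(&dentry_locks[i], NULL);
}

/* FNV-1a over the 16-byte parent gfid plus the basename. */
static unsigned dentry_slot(const uint8_t pargfid[16], const char *bname)
{
    uint64_t h = 14695981039346656037ULL;
    for (int i = 0; i < 16; i++)
        h = (h ^ pargfid[i]) * 1099511628211ULL;
    for (const char *p = bname; *p; p++)
        h = (h ^ (uint8_t)*p) * 1099511628211ULL;
    return (unsigned)(h % DENTRY_SLOTS);
}

static void dentry_lock(const uint8_t pargfid[16], const char *bname)
{
    pthread_mutex_lock(&dentry_locks[dentry_slot(pargfid, bname)]);
}

static void dentry_unlock(const uint8_t pargfid[16], const char *bname)
{
    pthread_mutex_unlock(&dentry_locks[dentry_slot(pargfid, bname)]);
}
```

Note that a real implementation has more to handle: rename touches two dentries and must acquire both locks in a fixed order (e.g. ascending slot number) to avoid deadlock, and the t1..t4 sequence above is prevented only in combination with barriering, since the lock alone does not order the operations against the snapshots.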

Avati

