Re: [Gluster-devel] GlusterFS Snapshot internals


From: Rajesh Joseph
Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
Date: Tue, 8 Apr 2014 08:04:35 -0400 (EDT)

Hi Paul,

Whenever a brick comes online it performs a handshake with glusterd. The brick
will not send a notification to clients until the handshake is done. We are
planning to provide an extension to this handshake to recreate those missing
snaps.
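
A rough sketch of that ordering in Python (illustrative only, not glusterd
code; Brick, handshake and notify_clients are made-up names):

    # Illustrative model: a returning brick must finish its handshake
    # (where missed snapshots would be recreated) before clients are told
    # the brick is available.
    class Brick:
        def __init__(self, name):
            self.name = name
            self.handshake_done = False

        def handshake(self, missed_snaps):
            # Planned extension: recreate snaps missed while the brick was down.
            for snap in missed_snaps:
                print(f"recreating missed snapshot {snap} on {self.name}")
            self.handshake_done = True

        def notify_clients(self):
            assert self.handshake_done, "handshake must complete first"
            print(f"{self.name} is now visible to clients")

    b2 = Brick("b2")
    b2.handshake(missed_snaps=["S2", "S3", "S4"])
    b2.notify_clients()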

Best Regards,
Rajesh

----- Original Message -----
From: "Paul Cuzner" <address@hidden>
To: "Rajesh Joseph" <address@hidden>
Cc: "gluster-devel" <address@hidden>
Sent: Tuesday, April 8, 2014 12:49:13 PM
Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

Rajesh, 

Perfect explanation - the 'penny has dropped'. I was missing that the healing 
of a snap brick is based on the corresponding snap from its replica. 

One final question - I assume the scenario you mention about the brick coming 
back online before the snapshots are taken is theoretical and there are blocks 
in place to prevent this from happening? 

BTW, I'll get the BZ RFE's in by the end of my week, and will post the BZ's 
back to the list for info. 

Thanks! 

PC 

----- Original Message -----

> From: "Rajesh Joseph" <address@hidden>
> To: "Paul Cuzner" <address@hidden>
> Cc: "gluster-devel" <address@hidden>
> Sent: Tuesday, 8 April, 2014 5:09:10 PM
> Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> Hi Paul,

> It would be great if you can raise RFEs for both snap after restore and
> snapshot naming.

> Let's say your volume "Vol" has bricks b1, b2, b3 and b4.

> @0800 - S1 (snapshot volume) -> s1_b1, s1_b2, s1_b3, s1_b4 (these are the
> respective snap bricks, which are on independent thin LVs)

> @0830 - b2 went down

> @1000 - S2 (snapshot volume) -> s2_b1, x, s2_b3, s2_b4. Here we mark that
> the brick has a pending snapshot. Note that s2_b1 will have all the changes
> missed by b2 till 1000 hours. AFR will mark the pending changes on s2_b1.

> @1200 - S3 (Snapshot volume) -> s3_b1, x, s3_b3, s3_b4. This missed snapshot
> is also recorded.

> @1400 - S4 (Snapshot volume) -> s4_b1, x, s4_b3, s4_b4. This missed snapshot
> is also recorded.

> @1530 - b2 comes back. Before making it online we take snapshots s2_b2, s3_b2
> and s4_b2. Since all three snapshots are taken at nearly the same time,
> content-wise all of them will be in the same state. Now these bricks are
> added to their respective snapshot volumes. Note that till now no healing is
> done. After the addition the snapshot volumes will look like this:
> S2 -> s2_b1, s2_b2, s2_b3, s2_b4.
> S3 -> s3_b1, s3_b2, s3_b3, s3_b4.
> S4 -> s4_b1, s4_b2, s4_b3, s4_b4.
> After this b2 will come online, i.e. clients can access this brick. Now S2,
> S3 and S4 are healed: s2_b2 will get healed from s2_b1, s3_b2 from s3_b1,
> and so forth. This healing will take each snap brick to the point at which
> its snapshot was taken.

> If the brick comes online before these snapshots are taken, self heal will
> try to bring the brick (b2) to a point closer to the current time (@1530).
> Therefore it would not be consistent with the other bricks in its replica
> set.
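
> A minimal Python sketch of the pairing described above (illustrative only;
> the dictionary layout and function names are made up, not GlusterFS
> internals):

>     # When b2 returns at 1530 all of its missed snap bricks are created at
>     # once, then each one is healed from its partner in the same snapshot
>     # volume.
>     snapshots = {
>         "S2": {"b1": "s2_b1", "b2": None, "b3": "s2_b3", "b4": "s2_b4"},
>         "S3": {"b1": "s3_b1", "b2": None, "b3": "s3_b3", "b4": "s3_b4"},
>         "S4": {"b1": "s4_b1", "b2": None, "b3": "s4_b3", "b4": "s4_b4"},
>     }

>     def brick_returns(brick, replica_partner):
>         # Take every missed snap brick now, before the brick goes online.
>         for snap, bricks in snapshots.items():
>             if bricks[brick] is None:
>                 bricks[brick] = f"{snap.lower()}_{brick}"
>         # Only then is healing started, per snapshot volume, from the replica.
>         for snap, bricks in snapshots.items():
>             print(f"{snap}: heal {bricks[brick]} from {bricks[replica_partner]}")

>     brick_returns("b2", "b1")
>     # S2: heal s2_b2 from s2_b1, S3: heal s3_b2 from s3_b1, and so on.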

> Please let me know if you have more questions or clarifications.

> Best Regards,
> Rajesh

> ----- Original Message -----
> From: "Paul Cuzner" <address@hidden>
> To: "Rajesh Joseph" <address@hidden>
> Cc: "gluster-devel" <address@hidden>
> Sent: Tuesday, April 8, 2014 8:01:57 AM
> Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> Thanks Rajesh.

> Let me know if I should raise any RFE's - snap after restore, snapshot
> naming, etc

> I'm still being thick about the snapshot process with missing bricks. What
> I'm missing is the heal process between snaps - my assumption is that the
> snap of a brick needs to be consistent with the other brick snaps within the
> same replica set. Let's use a home-drive use case as an example - typically,
> I'd expect to see home directories getting snapped at 0800, 1000, 1200,
> 1400, 1600, 1800 and 2200 each day. So in that context, say we have a
> dist-repl volume with 4 bricks, b1<->b2, b3<->b4;

> @ 0800 all bricks are available, snap (S1) succeeds with a snap volume being
> created from all bricks
> --- files continue to be changed and added
> @ 0830 b2 is unavailable (D0). Gluster tracks the pending updates on b1,
> needed to be applied to b2
> --- files continue to be changed and added.
> @ 1000 snap requested - 3 of 4 bricks available, snap taken (S2) on b1, b3
> and b4 - snap volume activated
> --- files continue to change
> @ 1200 a further snap performed - S3
> --- files continue to change
> @ 1400 snapshot S4 taken
> --- files change
> @ 1530 missing brick 2 comes back online (D1)

> Now between disruption of D0 and D1 there have been several snaps. My
> understanding is that each snap should provide a view of the filesystem
> consistent at the time of the snapshot - correct?

> You mention:
> + brick2 comes up. At this moment we take a snapshot before we allow new I/O
> or heal of the brick. If multiple snaps are missed then all the snaps are
> taken at this time. We don't wait till the brick is brought to the same
> state as the other bricks.
> + brick2_s1 (snap of brick2) will be added to the s1 volume (snapshot volume).
> Self heal will take care of bringing brick2's state in line with its replica set.

> According to this description, if you snapshot b2 as soon as it's back online
> - that generates S2, S3 and S4 with content as at 08:30 - and lets self heal
> bring b2 up to the current time D1. However, doesn't this mean that S2, S3
> and S4 on brick2 are not equal to S2, S3 and S4 on brick1?

> If that is right, then if b1 is unavailable the corresponding snapshots on b2
> wouldn't support the recovery points of 1000,1200 and 1400 - which we know
> are ok on b1.

> I guess I'd envisaged snapshots working hand-in-glove with self heal to
> maintain the snapshot consistency - and may just be stuck on that thought.

> Maybe this is something I'll only get on a whiteboard - wouldn't be the first
> time :(

> I appreciate your patience in explaining this recovery process!

> ----- Original Message -----

> > From: "Rajesh Joseph" <address@hidden>
> > To: "Paul Cuzner" <address@hidden>
> > Cc: "gluster-devel" <address@hidden>
> > Sent: Monday, 7 April, 2014 10:12:53 PM
> > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> > Thanks Paul for your valuable comments. Please find my comments inline
> > below.

> > Please let us know if you have more questions or clarifications. I will try
> > to update the doc wherever more clarity is needed.

> > Thanks & Regards,
> > Rajesh

> > ----- Original Message -----
> > From: "Paul Cuzner" <address@hidden>
> > To: "Rajesh Joseph" <address@hidden>
> > Cc: "gluster-devel" <address@hidden>
> > Sent: Monday, April 7, 2014 1:59:10 AM
> > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals

> > Hi Rajesh,

> > Thanks for updating the design doc. It reads well.

> > I have a number of questions that would help my understanding;

> > Logging : The doc doesn't mention how the snapshot process is logged -
> > - will snapshot use an existing log or a new log?
> > [RJ]: As of now snapshot makes use of the existing logging framework.
> > - Will the log be specific to a volume, or will all snapshot activity be
> > logged in a single file?
> > [RJ]: The snapshot module is embedded in the gluster core framework.
> > Therefore the logs will also be part of the glusterd logs.
> > - will the log be visible on all nodes, or just the originating node?
> > [RJ]: Similar to glusterd, snapshot logs related to each node will be
> > visible on that node.
> > - will the high-level snapshot action be visible when looking from the
> > other nodes, either in the logs or at the CLI?
> > [RJ]: As of now the high-level snapshot action will be visible only in the
> > logs of the originator node. Though the CLI can be used to see the list and
> > info of snapshots from any other node.

> > Restore : You mention that after a restore operation, the snapshot will be
> > automatically deleted.
> > - I don't believe this is a prudent thing to do. Here's an example I've
> > seen a lot: an application has a programmatic error, leading to data
> > 'corruption' - devs work on the program, storage guys roll the volume back.
> > So far so good... devs provide the updated program, and away you go... BUT
> > the issue is not resolved, so you need to roll back again to the same point
> > in time. If you delete the snap automatically, you lose the restore point.
> > Yes, the admin could take another snap after the restore - but why add more
> > work to a recovery process where people are already stressed out :) I'd
> > recommend leaving the snapshot if possible, and letting it age out
> > naturally.
> > [RJ]: Snapshot restore is a simple operation wherein the volume bricks will
> > simply point to the brick snapshots instead of the original bricks.
> > Therefore once the restore is done we cannot use the same snapshot again.
> > We are planning to implement a configurable option which will automatically
> > take a snapshot of the snapshot to fulfill the above-mentioned requirement.
> > But with the given timeline and resources we will not be able to target it
> > in the coming release.
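
> > A rough sketch of why restore consumes the snapshot, and of the planned
> > "snapshot of the snapshot" option (illustrative Python; restore and
> > keep_restore_point are made-up names, not actual GlusterFS behaviour):

> >     # Restore re-points the volume's bricks at the snap bricks, so the
> >     # snapshot itself cannot be reused afterwards. The planned option
> >     # would first take a snapshot of the snapshot, keeping the restore
> >     # point available.
> >     def restore(volume, snap, keep_restore_point=False):
> >         preserved = None
> >         if keep_restore_point:
> >             # Hypothetical future behaviour.
> >             preserved = {name: brick + "_copy" for name, brick in snap.items()}
> >         volume.update(snap)      # bricks now point at the snap bricks
> >         return preserved         # None today; a reusable snap later on

> >     vol = {"b1": "orig_b1", "b2": "orig_b2"}
> >     s1 = {"b1": "s1_b1", "b2": "s1_b2"}
> >     print(restore(vol, s1, keep_restore_point=True))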

> > Auto-delete : Is this a post phase of the snapshot create, so the
> > successful creation of a new snapshot will trigger the pruning of old
> > versions?
> > [RJ]: Yes, if we reach the snapshot limit for a volume then the snapshot
> > create operation will trigger pruning of older snapshots.
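
> > As a minimal sketch of that post-create pruning step (the limit value and
> > names below are placeholders, not the real configuration):

> >     from collections import deque

> >     SNAP_LIMIT = 3           # placeholder; the real limit is configurable
> >     snaps = deque()          # oldest on the left, newest on the right

> >     def create_snapshot(name):
> >         snaps.append(name)
> >         # Post phase: a successful create triggers pruning of old snaps.
> >         while len(snaps) > SNAP_LIMIT:
> >             print("auto-delete: pruned", snaps.popleft())

> >     for n in ["S1", "S2", "S3", "S4"]:
> >         create_snapshot(n)   # creating S4 prunes S1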

> > Snapshot Naming : The doc states the name is mandatory.
> > - why not offer a default - volume_name_timestamp - instead of making the
> > caller decide on a name? Having this as a default will also make the list
> > under .snap more usable by default.
> > - providing a sensible default will make it easier for end users to
> > self-serve restores. More sensible defaults = more happy admins :)
> > [RJ]: This is a good-to-have feature; we will try to incorporate it in the
> > next release.
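
> > The suggested default would be straightforward to derive, e.g. (illustrative
> > only; default_snap_name is not an existing function):

> >     from datetime import datetime

> >     def default_snap_name(volume_name, when=None):
> >         # volume_name_timestamp, used only when the caller gives no name.
> >         when = when or datetime.now()
> >         return f"{volume_name}_{when.strftime('%Y%m%d%H%M%S')}"

> >     print(default_snap_name("homevol"))   # e.g. homevol_20140408100000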

> > Quorum and snaprestore : the doc mentions that when an offline brick comes
> > back, it will be snap'd before pending changes are applied. If I understand
> > the use of quorum correctly, can you comment on the following scenario;
> > - With a brick offline, we'll be tracking changes. Say after 1hr a snap is
> > invoked because quorum is met
> > - changes continue on the volume for another 15 minutes beyond the snap,
> > when the offline brick comes back online.
> > - at this point there are two points in time to bring the brick back to -
> > the brick needs the changes up to the point of the snap, then a snap of the
> > brick, followed by the 'replay' of the additional changes to get back to
> > the same point in time as the other replicas in the replica set.
> > - of course, the brick could be offline for 24 or 48 hours due to a
> > hardware fault - during which time multiple snapshots could have been made
> > - it wasn't clear to me from the doc how this scenario is dealt with?
> > [RJ]: The following action is taken in case we miss a snapshot on a brick
> > (a small sketch follows below).
> > + Let's say brick2 is down while taking snapshot s1.
> > + Snapshot s1 will be taken for all the bricks except brick2. We will
> > update the bookkeeping about the missed activity.
> > + I/O can continue to happen on the origin volume.
> > + brick2 comes up. At this moment we take a snapshot before we allow new
> > I/O or heal of the brick. If multiple snaps are missed then all the snaps
> > are taken at this time. We don't wait till the brick is brought to the same
> > state as the other bricks.
> > + brick2_s1 (snap of brick2) will be added to the s1 volume (snapshot
> > volume). Self heal will take care of bringing brick2's state in line with
> > its replica set.
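
> > A small sketch of that bookkeeping (illustrative Python; 'missed' and the
> > function names are made up, not GlusterFS structures):

> >     missed = {}    # brick name -> snapshots it missed while down

> >     def take_snapshot(snap, bricks_up, bricks_down):
> >         for b in bricks_up:
> >             print(f"{snap}: snapped {b}")
> >         for b in bricks_down:
> >             missed.setdefault(b, []).append(snap)   # record the miss

> >     def brick_back_online(brick):
> >         # All missed snaps are taken now, before new I/O or heal is allowed.
> >         for snap in missed.pop(brick, []):
> >             print(f"{snap}: late snap of {brick}, to be healed from its replica")

> >     take_snapshot("s1", bricks_up=["brick1"], bricks_down=["brick2"])
> >     brick_back_online("brick2")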

> > barrier : two things are mentioned here - a buffer size and a timeout
> > value.
> > - from an admin's perspective, being able to specify the timeout (secs) is
> > likely to be more workable - and will allow them to align this setting with
> > any potential timeout setting within the application running against the
> > gluster volume. I don't think most admins will know or want to know how to
> > size the buffer properly.
> > [RJ]: In the current release we are only providing the timeout value as a
> > configurable option. The buffer size is being considered as a configurable
> > option for a future release, or we may determine the optimal value
> > ourselves based on the user's system configuration.
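
> > A toy model of a barrier with a timeout, just to illustrate the behaviour
> > being configured (not the actual barrier translator):

> >     import queue, threading, time

> >     class Barrier:
> >         def __init__(self, timeout_secs):
> >             self.timeout = timeout_secs
> >             self.enabled = False
> >             self.held = queue.Queue()

> >         def enable(self):
> >             self.enabled = True
> >             # Failsafe: release automatically once the timeout expires.
> >             threading.Timer(self.timeout, self.release).start()

> >         def submit(self, fop):
> >             if self.enabled:
> >                 self.held.put(fop)   # hold the fop while the barrier is up
> >             else:
> >                 fop()

> >         def release(self):
> >             self.enabled = False
> >             while not self.held.empty():
> >                 self.held.get()()    # replay everything that was held

> >     barrier = Barrier(timeout_secs=2)
> >     barrier.enable()
> >     barrier.submit(lambda: print("write replayed after barrier release"))
> >     time.sleep(3)                    # give the timer time to fire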

> > Hopefully the above makes sense.

> > Cheers,

> > Paul C

> > ----- Original Message -----

> > > From: "Rajesh Joseph" <address@hidden>
> > > To: "gluster-devel" <address@hidden>
> > > Sent: Wednesday, 2 April, 2014 3:55:28 AM
> > > Subject: [Gluster-devel] GlusterFS Snapshot internals

> > > Hi all,

> > > I have updated the GlusterFS snapshot forge wiki.

> > > https://forge.gluster.org/snapshot/pages/Home

> > > Please go through it and let me know if you have any questions or
> > > queries.

> > > Best Regards,
> > > Rajesh

> > > [PS]: Please ignore previous mail. Accidentally hit send before
> > > completing
> > > :)

> > > _______________________________________________
> > > Gluster-devel mailing list
> > > address@hidden
> > > https://lists.nongnu.org/mailman/listinfo/gluster-devel


