
Re: [Gluster-devel] Stateless Nodes - HowTo - was Re: glusterfs-3.3.0qa34 released


From: Ian Latter
Subject: Re: [Gluster-devel] Stateless Nodes - HowTo - was Re: glusterfs-3.3.0qa34 released
Date: Wed, 29 May 2013 21:25:39 +1000

Hello,


  Following up on this thread, I upgraded to GlusterFS 3.3.1, where the glusterd 
behavior was slightly different.

  In 3.3.0 I observed;

>   When you do this in the startup process you can skip the "gluster 
> peer probe" and simply call the "gluster volume create" as we did in
> a non clustered environment, on every node as it boots (on every 
> boot, including the first).  The nodes that are late to the party will be 
> told that the configuration already exists, and the clustered volume
> *should* come up.

In 3.3.1 this was reversed: a clustered "glusterd" would refuse to accept a 
"gluster volume create" (for a distributed volume) if one of the nodes was down.  
Only once all nodes were up could the last booted node run the "gluster volume 
create" to kick off the cluster (after which any node could issue the "gluster 
volume start").

Further to this, so long as at least one node remains up ("exists"), the in-memory 
configuration of the clustered volumes persists within glusterd.  For example, for 
a two node cluster;
  - boot node 1 - clean drives, establish cluster relations and sit
  - boot node 2 - clean drives, establish cluster relations and configure a 
    cluster volume
  - reboot node 1 - clean drives, establish cluster relations and ..

  .. at this point node 1 and node 2 have NFS shares successfully running but 
only node 2 has its bricks successfully serving the DHT volume (observable via 
"gluster volume status").  Performing a "gluster volume stop" and "gluster 
volume start" on the clustered volume will reassert both nodes' bricks in the 
volume.
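
For clarity, that recovery amounts to the following sequence, run on any one 
node (a minimal sketch, assuming the clustered volume is named "myvolume" as in 
the configuration quoted below);

    # gluster volume status myvolume
    # gluster volume stop myvolume
    # gluster volume start myvolume
    # gluster volume status myvolume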

I have now codified this into a 28Mbyte firmware image - here;
  
http://midnightcode.org/projects/saturn/code/midnightcode-saturn-0.1-20130526230106-usb.burn.gz

I have documented the installation, configuration and operations of that image 
in a sizable manual - here;
  
http://midnightcode.org/papers/Saturn%20Manual%20-%20Midnight%20Code%20-%20v0.1.pdf


What I'm not happy about is my understanding of the recovery strategy, from a 
native GlusterFS perspective, for a DHT volume.  

I.e. when a node that is providing a brick to a gluster cluster volume reboots, 
what is the intended recovery strategy for that distributed volume with respect 
to the lost/found brick?  
  Is it recommended that the volume be stopped and started to rejoin the 
lost/found brick?  Or is there a transparent method for re-introducing the 
lost/found brick (from a client perspective) such as a "brick delete" then 
"brick add" for that node in the volume?  As it wasn't clear to me what the 
impact would be of removing and adding a brick to a DHT (either regarding the 
on disk data/attr state or the future performance of that DHT volume), I didn't 
pursue this path.  If you can "brick delete", is this then the preferred method 
for shutting down a node in order to cleanly umount the disk under the brick, 
rather than shutting down the entire volume whenever a single node drops out?
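
For reference, by "brick delete" and "brick add" I mean the remove-brick and 
add-brick CLI operations; a rough sketch of the sequence I was considering 
(using the volume and brick names from the configuration quoted below, and 
without having verified the data/attr or performance implications);

    # gluster volume remove-brick myvolume 192.168.179.102:/glusterfs/exports/hda start
    # gluster volume remove-brick myvolume 192.168.179.102:/glusterfs/exports/hda status
    # gluster volume remove-brick myvolume 192.168.179.102:/glusterfs/exports/hda commit
      ... node reboots, disk under the brick is cleanly unmounted and later re-mounted ...
    # gluster volume add-brick myvolume 192.168.179.102:/glusterfs/exports/hda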

I recognise that DHT is not an HA (highly available) solution; I'm simply 
looking for the recommended operational and recovery strategies for DHT volumes 
in multi-node configurations.

My timing isn't good - I understand that everyone here is (rightfully) focused 
on 3.4.0.  When time is available further down the track, could someone in the 
know please steer me through the current Gluster architecture with respect to 
the above query?



Thanks,



----- Original Message -----
>From: "Ian Latter" <address@hidden>
>To: <address@hidden>
>Subject:  [Gluster-devel] Stateless Nodes - HowTo - was Re:    
>glusterfs-3.3.0qa34 released
>Date: Fri, 17 May 2013 00:37:25 +1000
>
> Hello,
> 
> 
>   Well I can't believe that it's been more than a year since I started 
> looking into a stateless cluster implementation for GlusterFS .. time flies 
> eh.
> 
>   First - what do I mean by "stateless"?  I mean that;
>     - the user configuration of the operating environment is maintained 
> outside of the OS
>     - the operating system is destroyed on reboot or power off, and all OS 
> and application configuration is irrecoverably lost
>     - on each boot we want to get back to the user's preferred/configured 
> operating environment through the most normal methods possible (preferably, 
> the same commands and JIT-built config files that were used to configure the 
> system the first time should be used every time).
> 
>   In this way, you could well argue that the OE state is maintained in a type 
> of provisioning or orchestration tool, outside of the OS and application 
> instances (or in my case in the Saturn configuration file that is the only 
> persistent data maintained between running OE instances).
> 
>   Per the thread below, to get a stateless node (no clustering involved) we 
> would remove the xattr values from each shared brick, on boot;
>     removexattr(mount_point, "trusted.glusterfs.volume-id")
>     removexattr(mount_point, "trusted.gfid")
> 
>   And then we would populate glusterd/glusterd.info with an externally stored 
> UUID (to make it consistent across boots).  These three actions would allow 
> the CLI "gluster volume create" commands to run unimpeded - thanks to Amar 
> for that detail.
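> 
>   A boot-time shell sketch of those three actions - assuming the brick is 
> mounted at /glusterfs/exports/hda, that the externally stored UUID is held in 
> $STORED_UUID, and that glusterd.info takes the simple UUID=<value> form - 
> would be;
> 
>     setfattr -x trusted.glusterfs.volume-id /glusterfs/exports/hda
>     setfattr -x trusted.gfid /glusterfs/exports/hda
>     echo "UUID=${STORED_UUID}" > /etc/glusterd/glusterd.info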
> 
>   Note1: that we've only been experimenting with DHT/Distribute, so I don't 
> know if other Gluster xlator modules have pedantic needs in addition to the 
> above.
>   Note2: that my glusterd directory is in /etc (/etc/glusterd/glusterd.info), 
> whereas the current location in the popular distros is, I believe, /var/lib 
> (/var/lib/glusterd/glusterd.info), so I will refer to the relative path in 
> this message.
> 
> 
>   But we have finally scaled out beyond the limits of our largest chassis 
> (down to 1TB free) and need to cluster to add on more capacity via the next 
> chassis. Over the past three nights I've had a chance to experiment with 
> GlusterFS 3.3.0 (I will be looking at 3.4.0 shortly) and create a 
> "distribute" volume between two clustered nodes.  To get a stateless outcome 
> we then need to be able to boot one node from scratch and have it re-join the 
> cluster and volume from only the "gluster" CLI command/s.
> 
>   For what it's worth, I couldn't find a way to do this.  The peer probing 
> model doesn't seem to allow an old node to rejoin the cluster.
> 
>   So many thanks to Mike of FunWithLinux for this post and steering me in the 
> right direction;
>     http://funwithlinux.net/2013/02/glusterfs-tips-and-tricks-centos/
> 
>   The trick seems to be (in addition to the non-cluster configs, above) to 
> manage the cluster membership outside of GlusterFS.  On boot, we 
> automatically populate the relevant peer file
> (glusterd/peers/{uuid}) with the UUID, state=3, and hostname/IP address; one 
> file for each other node in the cluster (excluding the local node).  I.e.
> 
>     # cat /etc/glusterd/peers/ab2d5444-5a01-427a-a322-c16592676d29
>       uuid=ab2d5444-5a01-427a-a322-c16592676d29
>       state=3
>       hostname1=192.168.179.102
> 
>   Note that if you're using IP addresses as your node handle (as opposed to 
> host names) then you must retain the same IP address across boots for this to 
> work, lest you make modifications to the existing/running cluster nodes that 
> will require glusterd to be restarted.
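> 
>   On boot, then, each peer file can be generated with something as simple as 
> the following (a sketch, where $PEER_UUID and $PEER_ADDR stand in for the 
> externally stored values shown further below);
> 
>     printf 'uuid=%s\nstate=3\nhostname1=%s\n' ${PEER_UUID} ${PEER_ADDR} \
>         > /etc/glusterd/peers/${PEER_UUID}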
> 
>   When you do this in the startup process you can skip the "gluster peer 
> probe" and simply call the "gluster volume create" as we did in a non 
> clustered environment, on every node as it boots (on every boot, including 
> the first).  The nodes that are late to the party will be told that the 
> configuration already exists, and the clustered volume *should* come up.
> 
>   I am still experimenting, but I say "should" because you can sometimes see 
> a delay in the re-establishment of the clustered volume, and you can 
> sometimes see the clustered volume fail to re-establish. When it fails to 
> re-establish the solution seems to be a "gluster volume start" for that 
> volume, on any node.  FWIW I believe I'm seeing this locally because Saturn 
> tries to nicely stop all Gluster volumes on reboot, which is affecting the 
> cluster (of course) - lol - a little more integration work to do.
>   
> 
>   The external state needed then looks like this on the first node (101);
> 
>     set gluster server        uuid 6b481ebb-859a-4c2b-8b5f-8f0bba7c3b9a
>     set gluster peer0         uuid ab2d5444-5a01-427a-a322-c16592676d29
>     set gluster peer0         ipv4_address 192.168.179.102
>     set gluster volume0       name myvolume
>     set gluster volume0       is_enabled 1
>     set gluster volume0       uuid 00000000-0000-0000-0000-000000000000
>     set gluster volume0       interface eth0
>     set gluster volume0       type distribute
>     set gluster volume0       brick0 /dev/hda
>     set gluster volume0       brick1 192.168.179.102:/glusterfs/exports/hda
> 
>   And the external state needed looks like this on the second node (102);
> 
>     set gluster server        uuid ab2d5444-5a01-427a-a322-c16592676d29
>     set gluster peer0         uuid 6b481ebb-859a-4c2b-8b5f-8f0bba7c3b9a
>     set gluster peer0         ipv4_address 192.168.179.101
>     set gluster volume0       name myvolume
>     set gluster volume0       is_enabled 1
>     set gluster volume0       uuid 00000000-0000-0000-0000-000000000000
>     set gluster volume0       interface eth0
>     set gluster volume0       type distribute
>     set gluster volume0       brick0 192.168.179.101:/glusterfs/exports/hda
>     set gluster volume0       brick1 /dev/hda
> 
>   Note that I assumed that there was a per-volume UUID (currently all zeros) 
> that I would need to reinstate but haven't seen yet (presumably it's one 
> value that's currently being removed from the mount point xattrs on each 
> boot).
> 
> 
>   I hope that this information helps others who are trying to dynamically 
> provision and re-provision virtual/infrastructure environments.  I note that 
> this information covers a topic that has not been written up on the Gluster 
> site;
> 
>      HowTo - GlusterDocumentation
>      http://www.gluster.org/community/documentation/index.php/HowTo
>      [...]
>      Articles that need to be written
>      Troubleshooting
>        - UUID's and cloning Gluster instances
>        - Verifying cluster integrity
>      [...]
> 
> 
>   Please feel free to use this content to help contribute to that FAQ/HowTo 
> document.
> 
> 
> Cheers,
> 
> 
> ----- Original Message -----
> >From: "Ian Latter" <address@hidden>
> >To: "Amar Tumballi" <address@hidden>
> >Subject:  Re: [Gluster-devel] glusterfs-3.3.0qa34 released
> >Date: Wed, 18 Apr 2012 18:55:46 +1000
> >
> > 
> > ----- Original Message -----
> > >From: "Amar Tumballi" <address@hidden>
> > >To: "Ian Latter" <address@hidden>
> > >Subject:  Re: [Gluster-devel] glusterfs-3.3.0qa34 released
> > >Date: Wed, 18 Apr 2012 13:42:45 +0530
> > >
> > > On 04/18/2012 12:26 PM, Ian Latter wrote:
> > > > Hello,
> > > >
> > > >
> > > >    I've written a work around for this issue (in 3.3.0qa35)
> > > > by adding a new configuration option to glusterd
> > > > (ignore-strict-checks) but there are additional checks
> > > > within the posix brick/xlator.  I can see that the volume starts
> > > > but the bricks inside it fail shortly there-after, and that of
> > > > the 5 disks in my volume three of them have one volume_id and
> > > > two of them have another - so this isn't going to be resolved
> > > > without some human intervention.
> > > >
> > > >    However, while going through the posix brick/xlator I
> > > > found the "volume-id" parameter.  I've tracked it back
> > > > to the volinfo structure in the glusterd xlator.
> > > >
> > > >    So before I try to code up a posix inheritance for my
> > > > glusterd work around (ignoring additional checks so
> > > > that a new volume_id is created on-the-fly / as-needed),
> > > > does anyone know of a CLI method for passing the
> > > > volume-id into glusterd (either via "volume create" or
> > > > "volume set")?  I don't see one from the code ...
> > > > glusterd_handle_create_volume does a uuid_generate
> > > > and its not a feature of glusterd_volopt_map ...
> > > >
> > > >    Is a user defined UUID init method planned for the CLI
> > > > before 3.3.0 is released?  Is there a reason that this
> > > > shouldn't be permitted from the CLI "volume create" ?
> > > >
> > > >
> > > We don't want to bring in this option to CLI. That is because we don't 
> > > think it is right to confuse USER with more options/values. 'volume-id' 
> > > is a internal thing for the user, and we don't want him to know about in 
> > > normal use cases.
> > > 
> > > In case of 'power-users' like you, If you know what you are doing, the 
> > > better solution is to do 'setxattr -x trusted.volume-id $brick' before 
> > > starting the brick, so posix translator anyway doesn't get bothered.
> > > 
> > > Regards,
> > > Amar
> > > 
> > 
> > 
> > Hello Amar,
> > 
> >   I wouldn't go so far as to say that I know what I'm
> > doing, but I'll take the compliment ;-)
> > 
> >   Thanks for the advice.  I'm going to assume that I'll 
> > be revisiting this issue when we can get back into 
> > clustering (replicating distributed volumes).  I.e. I'm
> > assuming that on this path we'll end up driving out 
> > issues like split brain;
> >  
> > https://github.com/jdarcy/glusterfs/commit/8a45a0e480f7e8c6ea1195f77ce3810d4817dc37
> > 
> > 
> > Cheers,
> > 
> > 
> > 
> > --
> > Ian Latter
> > Late night coder ..
> > http://midnightcode.org/
> > 
> > _______________________________________________
> > Gluster-devel mailing list
> > address@hidden
> > https://lists.nongnu.org/mailman/listinfo/gluster-devel
> > 
> 
> 
> --
> Ian Latter
> Late night coder ..
> http://midnightcode.org/
> 
> _______________________________________________
> Gluster-devel mailing list
> address@hidden
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
> 


--
Ian Latter
Late night coder ..
http://midnightcode.org/


