>To: "glusterfs" <address@hidden
>Subject: [Gluster-devel] DHT-based Replication
>Date: Mon, 21 Nov 2011 21:34:30 -0200
> Good day everyone!
> It is a pleasure for me to write to this list again, after
so many years. I
> tested GlusterFS a lot about four years ago because of a
project I still
> have today. It was 2007 and I worked with Gluster 1.2 and
1.3. I like to
> think that I gave one or two good ideas that were used in
the project. Amar
> was a great friend back then (friend is the correct word
as he was on my
> Orkut's friends list). Unfortunately Gluster's performance
> enough, mostly because of the ASCII protocol and I had to
go to the HA-NFS
> By that time, a new clustering translator was created, it
was called DHT.
> In those four years that passed I studied DHT and even
> DHT-based internal system here at my company to create a
> (I didn't do any reconciliation thing though, that is the
hard part for
> me). It worked fine, it still does!
> What is the beauty of DHT? It is easy to find where a file
or a directory
> is in the cluster. It is just a hash operation! You just
create a cluster
> of N servers and everything should be distributed evenly
> Every player out there got the distribution part of
storage clusters right:
> GlusterFS, Cassandra, MongoDB, etc.
> What is not right in my opinion is how everybody does
> Everything works in pairs (or triples, or replica sets of
any number). The
> problem with pairs (and replica sets) is that in a 10
server cluster, if
> one fails only its pair will have to handle the double of
the usual load,
> while the other ones will be working with the usual load.
So we can assign
> at most 50% of its capacity to any storage server. Only
50%. 50% efficiency
> means we spend the double in hardware, rack space and
energy (the worst
> part). Even with RAID10 storage, the ones we use, have
this problem too. If
> a disk dies, one will get double read IOs while the other
ones are 50% idle
> (or should be).
> So, I have a suggestion that fixes this problem. I call it
> replication. It is different from DHT-based distribution.
> implemented it internally, it already worked, at least
here. Giving the
> amount of money and energy this idea saves, I think this
idea is worth a
> million bucks (on your hands at least). Even though it is
> I'm giving it to you guys for free. Just please give
credit if you are
> going to use it.
> It is very simple: hash(name) to locate the primary server
(as usual), and
> use another hash, like hash(name + "#copy2") for the
second copy and so on.
> You just have to certify that it doesn't fall into the
same server, but
> this is easy: hash(name + "#copy2/try2").
> So, say we have 11 storage servers. Yeah, now we can have
a prime number of
> servers and everything will still work fine. Imagine that
> #11 dies. The secondary copy of all files at #11 are
spread across all the
> other 10 servers, so those servers are now only getting a
load 10% bigger!
> Wow! Now I can use 90% of what my storages can handle and
still be up when
> one fails! Had I done this the way things are today, my
system would be
> down because there is no way a storage can handle 180% of
IO its capacity
> (by definition). So now I have 90% efficiency! For exactly
the same costs
> to have 50k users I can have 90k users now! That is a 45%
costs savings on
> storage, with just a simple algorithm change.
> So, what are your thoughts on the idea?
> Thank you very much!
> Best regards,
> Daniel Colchete