
Re: [Gluster-devel] bit rot support for glusterfs design draft v0.1


From: Paul Cuzner
Subject: Re: [Gluster-devel] bit rot support for glusterfs design draft v0.1
Date: Mon, 27 Jan 2014 17:18:04 -0500 (EST)




From: "shishir gowda" <address@hidden>
To: address@hidden
Sent: Monday, 27 January, 2014 6:30:13 PM
Subject: [Gluster-devel] bit rot support for glusterfs design draft v0.1

Hi All,

Please find the updated bit-rot design for glusterfs volumes.

Thanks to Vijay Bellur for his valuable inputs in the design.

Phase 1: File-level bit rot detection

The initial approach is to achieve bit rot detection at the file level,
where a checksum is computed for the complete file and verified during
access.

A single daemon (say BitD) per node will be responsible for all the
bricks of the node. This daemon will be registered with the gluster
management daemon, and any graph changes
(add-brick/remove-brick/replace-brick/stop bit-rot) will be handled
accordingly. BitD will register with the changelog xlator of every
brick on the node, and process changes from them.

Doesn't having a single daemon for all bricks, instead of a per-brick 'bitd', introduce the potential for a performance bottleneck?
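
One way a single daemon could sidestep that bottleneck is to keep a worker per brick inside the one process. A minimal sketch of that shape, where BitD, add_brick and the print placeholder are all hypothetical stand-ins and not actual GlusterFS interfaces:

    import queue
    import threading

    class BitD:
        # Hypothetical per-node daemon: one worker thread per brick, so a
        # slow brick does not stall changelog processing for the others.
        def __init__(self):
            self.workers = {}  # brick path -> (queue, thread)

        def add_brick(self, brick):
            # Graph change (add-brick): start a dedicated consumer.
            q = queue.Queue()
            t = threading.Thread(target=self._drain, args=(brick, q),
                                 daemon=True)
            self.workers[brick] = (q, t)
            t.start()

        def remove_brick(self, brick):
            # Graph change (remove-brick/replace-brick/stop): stop it.
            q, _ = self.workers.pop(brick)
            q.put(None)  # sentinel: worker exits

        def _drain(self, brick, q):
            # Stand-in for real changelog consumption and checksum work.
            for gfid in iter(q.get, None):
                print("would recompute checksum of", gfid, "on", brick)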


The changelog xlator would give the list of files (in terms of gfid)
which have changed during a defined interval. Checksums would have to
be computed for these, triggered either by the fd close() call for
non-NFS access, or by every write for anonymous-fd access (NFS). The
computed checksum, along with the timestamp of the computation, would
be saved as an extended attribute (xattr) of the file. By using the
changelog xlator, we avoid periodic scans of the bricks to identify
the files whose checksums need to be updated.

Using the changelog is a great idea, but I'd also see a requirement for an admin-initiated full scan, at least when bringing existing volumes under bitd control.

Also, what's the flow if the xattr is unreadable due to bit rot? In btrfs, metadata is typically mirrored.
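
For illustration, the per-file compute-and-store flow might look like the sketch below; the xattr keys ("user.bitrot.sha256", "user.bitrot.ctime") and the choice of SHA-256 are assumptions, not the draft's actual on-disk format:

    import hashlib
    import os
    import time

    def store_file_checksum(path):
        # Whole-file digest, streamed in 1 MiB chunks to bound memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        # Persist digest plus compute timestamp as xattrs (Linux only).
        os.setxattr(path, "user.bitrot.sha256", h.hexdigest().encode())
        os.setxattr(path, "user.bitrot.ctime", str(time.time()).encode())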


Upon access (open for non-anonymous-fd calls, every read for
anonymous-fd calls) from any client, the bit rot detection xlator
loaded on top of the bricks would recompute the checksum of the file,
allow the call to proceed if the checksums match, and fail it if they
mismatch. This introduces extra work for NFS workloads, and for large
files, which require a read of the complete file to recompute the
checksum (we try to solve this in phase 2).
Every read..? That sounds like such an overhead that admins would just turn it off.

I assume failing a read due to a checksum inconsistency in a replicated volume would trigger one of the other replicas to be used, so the issue is transparent to the end user/application.
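
A matching verify-on-open sketch, under the same assumed xattr names; the raised error marks the point where a replicated volume could retry against, and heal from, a good copy:

    import hashlib
    import os

    def verify_on_open(path):
        # Recompute the whole-file digest and compare with the stored xattr.
        expected = os.getxattr(path, "user.bitrot.sha256").decode()
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != expected:
            # Fail the access; the replication layer could serve a replica.
            raise IOError("bit rot suspected in %s" % path)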



Since the data write happens first, followed by a delayed checksum
compute, there is a window in which the data has been updated but the
checksum has yet to be computed. We should allow access to such files
if the file timestamp (mtime) has changed and is within a defined
range of the current time.
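
That staleness rule could be expressed roughly as below, reusing the checksum-timestamp xattr from the earlier sketch; the 120-second window is an invented placeholder for whatever tunable the design settles on:

    import os
    import time

    GRACE_SECONDS = 120  # invented placeholder for the tunable window

    def skip_verification(path):
        # A write newer than the last checksum compute is expected for a
        # short while; within that window, allow access without verifying.
        mtime = os.stat(path).st_mtime
        computed_at = float(os.getxattr(path, "user.bitrot.ctime"))
        return mtime > computed_at and (time.time() - mtime) <= GRACE_SECONDS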

Additionally, we could/should have the ability to switch off checksum
computation from the glusterfs side if the underlying FS
exposes/implements bit-rot detection (e.g., btrfs).
+1 Why re-invent the wheel!


Phase 2: Block-level (user-space/defined) bit rot detection and correction.

The eventual aim is to be able to heal/correct bit rot in files. To
achieve this, checksums are computed at a finer granularity, such as a
block (its size limited by the bit rot algorithm), so that we not only
detect bit rot, but also have the ability to restore the data.
Additionally, for large files, checking checksums at the block level
is more efficient than recomputing the checksum of the whole file on
each access.

In this phase, we could move checksum computation to the xlator loaded
on top of the posix translator on each brick. With every write, we
could compute the checksum, store it, and continue with the write, or
vice versa. Every access would also read/compute the checksum of the
requested block, check it against the saved checksum of that block,
and act accordingly. This would remove the dependency on the external
BitD and the changelog xlator.
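
A block-level variant of the earlier per-file sketch, with an assumed 128 KiB block size; a mismatch now pinpoints the damaged block rather than condemning the whole file, and only the accessed block is hashed:

    import hashlib

    BLOCK_SIZE = 128 * 1024  # assumed block size, not from the draft

    def block_checksums(path):
        # One digest per fixed-size block of the file.
        sums = []
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(BLOCK_SIZE), b""):
                sums.append(hashlib.sha256(block).digest())
        return sums

    def verify_block(path, index, stored):
        # Only the requested block is read and hashed on access.
        with open(path, "rb") as f:
            f.seek(index * BLOCK_SIZE)
            block = f.read(BLOCK_SIZE)
        return hashlib.sha256(block).digest() == stored[index]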

Additionally, using an error-correcting code (ECC) or
forward-error-correction (FEC) algorithm would enable us to correct a
few bits in a block which have gone corrupt. Recomputing the complete
file's checksum is also eliminated, as we are dealing with blocks of a
defined size.
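
As a toy illustration of the FEC idea (not the code the design would actually pick), a Hamming(7,4) codeword repairs any single flipped bit, which is exactly the "correct a few bits" property described above:

    def hamming74_encode(d1, d2, d3, d4):
        # Three parity bits protect four data bits (positions 1..7).
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_decode(c):
        # The syndrome is the 1-based position of a single-bit error.
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 | (s2 << 1) | (s3 << 2)
        if syndrome:
            c[syndrome - 1] ^= 1  # flip the corrupted bit back
        return c[2], c[4], c[5], c[6]  # recovered data bits

A production design would use a real FEC (e.g., Reed-Solomon) over much larger symbols; the toy above only shows the detect-and-repair mechanics.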

We require the ability to store these fine-grained checksums
efficiently, and extended attributes would not scale for this
implementation. Either a custom backing store or a DB would be
preferable in this instance.
So if there is a per-block checksum, won't our capacity overheads increase to store the extra metadata, on top of our existing replication/raid overhead?
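
On the capacity question: with the sizes assumed in the earlier sketch (a 32-byte SHA-256 digest per 128 KiB block), the raw checksum data is 32/131072, roughly 0.02% of the file data, so the real cost lies in how the metadata is stored and indexed. A sketch of a per-brick store, with SQLite standing in for whatever custom store or DB the design actually chooses:

    import sqlite3

    def open_store(db_path):
        # One row per (gfid, block index).
        db = sqlite3.connect(db_path)
        db.execute("""CREATE TABLE IF NOT EXISTS block_sums (
                          gfid   TEXT    NOT NULL,
                          idx    INTEGER NOT NULL,
                          digest BLOB    NOT NULL,
                          PRIMARY KEY (gfid, idx))""")
        return db

    def put_sum(db, gfid, idx, digest):
        db.execute("INSERT OR REPLACE INTO block_sums VALUES (?, ?, ?)",
                   (gfid, idx, digest))
        db.commit()

    def get_sum(db, gfid, idx):
        row = db.execute("SELECT digest FROM block_sums "
                         "WHERE gfid = ? AND idx = ?", (gfid, idx)).fetchone()
        return row[0] if row else None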

Where does Xavi's disperse volume fit into this? Would an erasure-coded volume lend itself more readily to those use cases (cold data) where bit rot is a key consideration?

If so, would a simpler bit rot strategy for gluster be:
1) disperse volume
2) btrfs checksums + plumbing to trigger heal when scrub detects a problem

I like simple :)



Please feel free to comment/critique.

With regards,
Shishir


