[PATCH] enhancement: modify md5sum to allow piping
From: Daniel Santos
Subject: [PATCH] enhancement: modify md5sum to allow piping
Date: Thu, 20 Dec 2012 16:09:38 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.11) Gecko/20121128 Thunderbird/10.0.11

There are many times, usually when doing system backups, maintenance,
recovery, etc., that I would like to pipe large files through md5sum to
produce or verify a hash so that I do not have to read the file multiple
times. This is especially the case when backing up a system from a
live CD across the network:

    dd if=/dev/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678

or

    tar c /mnt/sda3 | pbzip2 -c2 | netcat 192.168.1.123 45678

Attached is a preliminary patch set that allows for this, as in the
following example:

    dd if=/dev/sda3 | pbzip2 -c2 | md5sum -po /tmp/sda3.dat.bzip2.md5 |
        netcat 192.168.1.123 45678

-p is short for --pipe and -o <filename> is short for --outfile
<filename>. Then, on the receiving end, the hash can be determined as
the file is read, eliminating any worry about network corruption:

    netcat -l -p 45678 | md5sum -po sda3.dat.bzip2.rx.md5 > sda3.dat.bzip2

The only caveat is that you have to compare the sum files manually,
which you can do with diff: a small cost compared to re-reading a
200GiB file!
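
As a rough illustration of what a --pipe mode has to do, here is a
minimal sketch of a pass-through filter: it copies its input to its
output unchanged and hashes every block on the way. It is not taken
from the attached patches. The names pipe_and_hash, fnv1a_update and
PIPE_BLOCKSIZE are made up for the example, and FNV-1a stands in for
MD5 only so that the sketch stays self-contained.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical block size for this sketch only.  */
#define PIPE_BLOCKSIZE 32768

/* FNV-1a, used here ONLY as a tiny stand-in for MD5 so the example
   needs no external library.  It is not what md5sum computes.  */
static uint64_t
fnv1a_update (uint64_t h, const unsigned char *buf, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    {
      h ^= buf[i];
      h *= UINT64_C (0x100000001b3);   /* 64-bit FNV prime */
    }
  return h;
}

/* Copy IN to OUT unchanged while hashing every byte that passes
   through.  Store the result in *DIGEST.  Return 0 on success,
   -1 on a read or write error.  */
static int
pipe_and_hash (FILE *in, FILE *out, uint64_t *digest)
{
  static unsigned char buf[PIPE_BLOCKSIZE];
  uint64_t h = UINT64_C (0xcbf29ce484222325);   /* FNV offset basis */
  size_t n;

  while ((n = fread (buf, 1, sizeof buf, in)) > 0)
    {
      h = fnv1a_update (h, buf, n);
      if (fwrite (buf, 1, n, out) != n)
        return -1;
    }
  if (ferror (in) || fflush (out) != 0)
    return -1;

  *digest = h;
  return 0;
}
```

A driver would call pipe_and_hash (stdin, stdout, &digest) and then
write the digest to the file named by -o, so the data is read only
once on each side of the network.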
You can even get the sum prior to compression, although if you wanted to
avoid a duplicate read on the server end, you would have to decompress
as you read it and either store the file uncompressed or re-compress it.

    dd if=/dev/sda3 | md5sum -po /tmp/sda3.dat.md5 | pbzip2 -c2 |
        netcat 192.168.1.123 45678

with

    netcat -l -p 45678 | pbzip2 -cd | md5sum -po sda3.dat.rx.md5 > sda3.dat

The attached patch set is at a very early stage and has many problems:
* it does not yet comply with the GNU coding style (which is new to me)
* the API in gnulib is changed, which may break other apps
* all changes are lumped together and need to be broken apart into
  logical changes
* it has a few hacks that need to be cleaned up
Also, this patch set addresses a problem with gnulib's hash
functions, which contain a lot of copy & paste code. I've implemented
a mechanism to clean this up w/o a performance hit (as long as we're
using gcc 4.6.1+). This change should probably go into a separate
patch set & bug report.
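
The mail does not say what the clean-up mechanism is. One technique
that fits the description, shown here purely as an assumption and not
as the content of the attached patches, is to keep a single copy of
the stream-reading loop in an always-inline function that takes a
constant table of per-algorithm functions. That gives the compiler the
opportunity to specialize the loop for each caller. All names below
(hash_ops, hash_stream, the toy sum and xor "algorithms") are
hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

#if defined __GNUC__
# define ALWAYS_INLINE inline __attribute__ ((always_inline))
#else
# define ALWAYS_INLINE inline
#endif

/* Hypothetical per-algorithm descriptor.  */
struct hash_ops
{
  void (*init) (void *ctx);
  void (*update) (void *ctx, const unsigned char *buf, size_t n);
};

/* The one shared copy of the read loop.  */
static ALWAYS_INLINE int
hash_stream (const struct hash_ops *ops, FILE *in, void *ctx)
{
  unsigned char buf[4096];
  size_t n;

  ops->init (ctx);
  while ((n = fread (buf, 1, sizeof buf, in)) > 0)
    ops->update (ctx, buf, n);
  return ferror (in) ? -1 : 0;
}

/* Two toy algorithms standing in for md5, sha1 and friends.  */
static void
sum_init (void *ctx)
{
  *(unsigned long *) ctx = 0;
}

static void
sum_update (void *ctx, const unsigned char *buf, size_t n)
{
  unsigned long *s = ctx;
  size_t i;

  for (i = 0; i < n; i++)
    *s += buf[i];
}

static void
xor_init (void *ctx)
{
  *(unsigned long *) ctx = 0;
}

static void
xor_update (void *ctx, const unsigned char *buf, size_t n)
{
  unsigned long *x = ctx;
  size_t i;

  for (i = 0; i < n; i++)
    *x ^= buf[i];
}

static const struct hash_ops sum_ops = { sum_init, sum_update };
static const struct hash_ops xor_ops = { xor_init, xor_update };

/* Thin per-algorithm wrappers: no duplicated loop in the source.  */
int
sum_stream (FILE *in, unsigned long *result)
{
  return hash_stream (&sum_ops, in, result);
}

int
xor_stream (FILE *in, unsigned long *result)
{
  return hash_stream (&xor_ops, in, result);
}
```

Whether the indirect calls are actually resolved at compile time
depends on the compiler and optimization level, which may be why the
mail ties the "no performance hit" claim to a particular gcc version.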
Finally, after the cursory amount that I've worked with this code, I see
a number of other areas where I believe there's room for improvement.
* The copy & paste code problem (mentioned above)
* Centralize the location where BLOCKSIZE is defined and only verify
  it's a multiple of 64 in gnulib/lib/{md,sha}*.c
* Perhaps allow BLOCKSIZE to be defined at configure time? Honestly,
  I'm not familiar enough with the issues to be certain it would
  alter performance on any system, but I'm thinking about embedded
  targets, where reading 32k chunks may end up thrashing the cache
  but 8k or 4k would not. However, I don't think I would be in
  favor of this being a run-time parameter, as it would seem to be a
  lot of waste (and lost optimizations) for something that's probably
  pretty specific to the hardware and build target.
* Centralize compiler sniffing into a single gnulib header (like
  "compiler.h" or some such) and define the GCC_VERSION macro as
  described in
  http://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html.
* Make better use of __builtin_expect via portable likely/unlikely
  macros to make sure error handling code gets moved out of the main
  bodies of functions (which can save a cache miss here and there).
  Of course, this would require the above item to do cleanly.
* Introduce some tuning parameter in the configure script to choose
  between smaller code and larger, but more optimized, code. I bring
  this up mainly because in my re-work of the copy & pasted code, I
  see a large opportunity to create a much smaller executable (if
  needed), but one that would run slightly slower, which would
  usually be undesirable on a machine with plenty of RAM, storage
  and CPU cache.
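
To make the BLOCKSIZE, GCC_VERSION and likely/unlikely items above
more concrete, here is a minimal sketch of what such a shared header
might contain. The header layout and the full_blocks helper are
hypothetical. The GCC_VERSION formula follows the GCC documentation
linked above, the 32768 default mirrors the "32k chunks" mentioned in
the list, and the GCC 3.0 threshold for __builtin_expect is an
assumption on the conservative side.

```c
#include <stddef.h>

/* Single place for compiler sniffing.  */
#ifdef __GNUC__
# define GCC_VERSION (__GNUC__ * 10000 \
                      + __GNUC_MINOR__ * 100 \
                      + __GNUC_PATCHLEVEL__)
#else
# define GCC_VERSION 0
#endif

/* Portable branch hints; they degrade to a plain truth value when
   __builtin_expect is assumed to be unavailable.  */
#if GCC_VERSION >= 30000
# define likely(x)   __builtin_expect (!!(x), 1)
# define unlikely(x) __builtin_expect (!!(x), 0)
#else
# define likely(x)   (!!(x))
# define unlikely(x) (!!(x))
#endif

/* Single place for BLOCKSIZE.  A build could override it, e.g. with
   -DBLOCKSIZE=8192 on the compiler command line (a hypothetical
   stand-in for a configure-time option).  */
#ifndef BLOCKSIZE
# define BLOCKSIZE 32768
#endif

#if BLOCKSIZE % 64 != 0
# error "BLOCKSIZE must be a multiple of 64"
#endif

/* Example use: the error path is marked unlikely so the compiler may
   move it out of the hot path.  Returns the number of whole blocks
   in NBYTES, or -1 if NBYTES is negative.  */
static long
full_blocks (long nbytes)
{
  if (unlikely (nbytes < 0))
    return -1;
  return nbytes / BLOCKSIZE;
}
```

Because BLOCKSIZE is fixed at build time, the check costs nothing at
run time, which matches the preference above for a configure-time
rather than a run-time parameter.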
Obviously, these should be made into separate bug reports as well and I
can send separate emails for them if you like.
Daniel
Attachment: 0001-md5sum-pipe.patch
Description: Text Data

Attachment: 0001-piping-support.patch
Description: Text Data