
[rdiff-backup-users] Re: more info on 25gig files


From: rsync2eran
Subject: [rdiff-backup-users] Re: more info on 25gig files
Date: Fri, 01 Jul 2005 15:44:24 +0300
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050513 Thunderbird/1.0.2 Fedora/1.0.2-1.3.3 Mnenhy/0.7.2.0

Hi,


On 01/07/05 02:31, Donovan Baarda wrote:
> The number of pairs for n blocks is n*n/2, or 2^43, so that's a one in
> 2^21 chance...

You're thinking about colliding block signatures within a single file; I
was thinking about syncing two files that each independently look random
(e.g., compressed and/or encrypted). Then you have n blocks in the new
file times n blocks in the old file, so no factor of two.


> ie if you have 2 million 4GB files, you are likely to
> get a signature md4sum collision in one. That might not be comfortable,
> but it is probably not disturbing...

Then it seems our disturbance thresholds are different...
To put it another way, if you have 1000 users running a daily 4GB sync
over 3 years, it's likely that at least one of them will see a
corruption. For increased effect, change 4GB to the subject of this thread.
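For concreteness, here's the back-of-envelope arithmetic in Python. It
assumes 1KB blocks and a 64-bit truncated strong block checksum, which
is what the 2^43 / 2^21 figures above imply; those parameters are my
reading of the thread, not anything librsync guarantees:

```python
# Back-of-envelope check of the collision estimates above.
# Assumed parameters: 4 GB file, 1 KB blocks, 64-bit (truncated)
# strong block checksum.
n = 2**32 // 2**10              # blocks in a 4 GB file = 2^22
hash_space = 2.0**64

pairs_one_file = n * n / 2      # pairs within one signature: ~2^43
p_sync = pairs_one_file / hash_space   # ~2^-21 chance per sync

syncs = 1000 * 365 * 3          # 1000 users, daily syncs, 3 years
expected = p_sync * syncs       # expected corruptions overall
print(f"p per sync ~ {p_sync:.3g}, expected collisions ~ {expected:.2f}")
```

which comes out to roughly half an expected collision over the three
years -- i.e., a coin-flip chance that somebody gets silently corrupted.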


> In practice it is not quite that bad... as each match effectively skips
> over the matching block's bytes, so it is really one block per
> non-matching byte... ie the critical thing is not the file size, it's
> the number of changed bytes. In the case of a large database file with a
> small number of changes, you are usually in the clear.

That's because of the heuristic that considers first the next block in
the old file, right? Yes, I guess this does improve things on most
workloads.


>>As for the rollsum, I don't trust it even for non-malicious data. It's
>>too easy to believe it will be fooled by some natural structure in the
>>data. And as for malicious settings, well, last time we considered those
> 
> It has got to be worth something... even if it only counts for 4 bits,
> it still gives you 16 x the confidence.

ACK.


> Yeah, it all goes out the window with malicious data... the only fix for
> this is a variable seed for the signature... probably random is best.

Yes, that's if you want to increase the chances of a successful transfer
in malicious settings. Though just to prevent silent corruption, a
strong whole-file hash would suffice.


> Note that adding a variable seed allows you to correct a blocksum
> collision in the signature... if you detect one, just re-generate the
> signature with a different seed.
> 
> Note that detecting and correcting this would require seek()
> functionality on the input. You can't differentiate a matching blocksum
> from a genuine collision or identical block without comparing the actual
> block data.

To distinguish genuine collisions from identical blocks, you can keep
stronger hashes in memory (or temporary file) during delta generation.

Requiring seekable input would be quite a loss of functionality (e.g., I
use rdiff to sync huge backup tarballs piped directly from tar).


> You may be able to leave handling of signature collisions up to the
> application by simply reporting them and giving the application the
> opportunity to confirm and restart or resume the signature calculation.

That's a strange question to ask the application... And often
unanswerable (e.g., for unseekable input).

If you don't check those collisions within librsync (as above), maybe
it's better to just keep assuming that colliding blocks are identical,
go through with the transfer, and then check the whole-file checksum. If
it doesn't match, tell the application to redo everything with a new
random seed (and perhaps larger hashes, like rsync does automatically).
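That retry loop might look something like the following sketch, where
`sync_with_seed` is a hypothetical stand-in for the librsync-style
transfer (librsync has no such entry point today) and the whole-file
check is a straightforward SHA-1 comparison:

```python
import hashlib
import os

def synced_ok(src_path, dst_path):
    """Whole-file hash comparison after a transfer."""
    def file_digest(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.digest()
    return file_digest(src_path) == file_digest(dst_path)

def robust_sync(src, dst, sync_with_seed, max_tries=3):
    """Transfer, verify, and redo with a fresh seed on mismatch."""
    for _ in range(max_tries):
        seed = int.from_bytes(os.urandom(4), "big")  # fresh random seed
        sync_with_seed(src, dst, seed)  # assumed transfer primitive
        if synced_ok(src, dst):
            return True
    return False
```

In non-malicious settings the first retry should virtually always
succeed, so the extra cost is paid only in the astronomically rare
collision case.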


>>>The disturbing part is that any corruption that does occur is silently
>>>ignored.
>>
>>Yes. Do you think the integrity testing should be done by librsync, or
>>by every client app (with a prominent notice that they should)?
> 
> I'm not sure... I think it makes sense to put it into the signature.
> Whether that is done by librsync or rdiff, I don't know. In either case,
> it will require a change to the signature file.
> 
> If we are going to change the signature file, we may as well do it
> right, adding both a whole-file checksum and a variable seed.

Sounds good, especially the variable seed part. :-)

Since that forces a change in librsync, maybe use the opportunity to
change from MD4 to a less broken hash? Just a quick check on my 2GHz
Athlon XP:

* "rdiff signature" on large cached file:  109MB/sec
* OpenSSL MD5, 1KB blocks:                 246MB/sec
               8KB blocks:                 326MB/sec
* OpenSSL SHA1, 1KB blocks:                148MB/sec
                8KB blocks:                174MB/sec

So upgrading to MD5 is almost for free, and SHA1 won't slow things down
noticeably either. With randomized IV, both (but especially SHA1) should
be OK in this context.
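For anyone who wants to repeat the measurement, here is a rough
hashlib-based version of that micro-benchmark. Absolute numbers will
differ by CPU and OpenSSL build; only the MD5-vs-SHA1 ratio at each
block size is the point:

```python
import hashlib
import time

def throughput(algo, block_size, total=64 * 2**20):
    """Hash `total` bytes in `block_size` chunks; return MB/sec."""
    block = b"\0" * block_size
    h = hashlib.new(algo)
    start = time.perf_counter()
    for _ in range(total // block_size):
        h.update(block)
    elapsed = time.perf_counter() - start
    return total / elapsed / 2**20

for algo in ("md5", "sha1"):
    for bs in (1024, 8192):
        print(f"{algo:4s} {bs // 1024}KB blocks: "
              f"{throughput(algo, bs):.0f} MB/sec")
```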

BTW, a strong full-file hash would have its own computational cost: both
the sender and the receiver would need to compute it in addition to the
per-block hashes. In principle, for mostly-unchanged files this cost can
be made negligible by replacing the full-file hash with a "meta-hash":
as you transmit or receive, note the hash of the data represented by
each packet, and compute the meta-hash as the hash of all these hashes.
For literal data packets, the packet's hash is just the hash of the
literal content. But for a copy-block-from-old-file packet, the sender
takes the packet hash to be the (untruncated) hash of the *new* block,
while the receiver takes the packet hash to be the (untruncated) hash of
the copied *old* block -- both of these were computed anyway by the
respective parties! One catch is that, in the librsync framework, by the
patch stage the receiver no longer remembers the untruncated block
hashes it computed at the signature stage, and if it needs to recompute
them then it gets no benefit over a plain full-file hash. But the sender
(which is already doing the heavy lifting) does get the meta-hash
essentially for free.
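To make the meta-hash idea concrete, here is a toy Python sketch. The
packet representation is invented for illustration: a packet is either
(b"LIT", literal_bytes) or (b"CPY", block_hash), where block_hash is the
untruncated per-block hash the respective party already computed:

```python
import hashlib

def meta_hash(packets):
    """Hash of per-packet hashes, instead of rehashing the whole file."""
    meta = hashlib.sha1()
    for kind, payload in packets:
        if kind == b"LIT":
            # Literal data: hash the literal content itself.
            packet_hash = hashlib.sha1(payload).digest()
        else:
            # Copy packet: payload IS the already-computed block hash.
            packet_hash = payload
        meta.update(packet_hash)
    return meta.hexdigest()
```

If the block the receiver copies really equals the block the sender
matched, their packet hashes agree, so equal meta-hashes certify the
transfer; a blocksum collision makes them differ and gets caught.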

The computation of these meta-hashes can only be done inside librsync,
so even if you prefer to initially implement full-file hashes in the
straightforward way, keeping that option open seems reason enough to do
the integrity checking inside librsync.


> I'm now working at google, and am seriously considering making librsync
> my 20% project... hopefully this means I will do some serious work on it
> soon.

Ah, congratulations! Lucky Google!

  Eran



