guix-devel

Re: distributed substitutes: file slicing


From: pukkamustard
Subject: Re: distributed substitutes: file slicing
Date: Sun, 25 Jun 2023 09:48:11 +0000

Csepp <raingloom@riseup.net> writes:

> I have a question / suggestion about the distributed substitutes
> project: would downloads be split into uniformly sized chunks or could
> the sizes vary?

For the proposal that uses ERIS (https://issues.guix.gnu.org/52555) the
chunks are uniformly sized (32KiB).

> Specifically, in an extreme case where an update introduced a single
> extra byte at the beginning of a file, would that result in completely
> new chunks?

Yes, that would be the case.

ERIS uses fixed-size blocks, so in such extreme cases every block
changes and de-duplication is very poor.
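
To make the effect concrete, here is a minimal sketch. It is not the
ERIS encoding (ERIS also encrypts and pads blocks); it only hashes plain
fixed-size blocks with SHA-256, and the helper name `block_hashes` is
made up for illustration.

```python
import hashlib

BLOCK_SIZE = 32 * 1024  # 32 KiB, the block size mentioned above


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of data (the last block may be shorter)."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


# Deterministic pseudo-random stand-in for a file: exactly 4 blocks.
original = b"".join(
    hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(4096)
)
prepended = b"\x00" + original  # one extra byte at the beginning
appended = original + b"\x00"   # one extra byte at the end

base = set(block_hashes(original))
print(len(base & set(block_hashes(prepended))))  # blocks shared after prepending
print(len(base & set(block_hashes(appended))))   # blocks shared after appending
```

Prepending a byte shifts every block boundary, so no block is shared
with the original, whereas appending a byte leaves all four original
blocks intact.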

The reason for using fixed-size blocks is security/privacy. With
variable-sized blocks, the sizes are observable by a potential censor and
are also a function of the content itself. This leaks information about
the transferred content.

I believe there are documented cases of HTTPS connections being
blocked/censored based on size of requests [citation needed]. This is
something ERIS tries to prevent.

That being said, I think there is still room for optimizing the
de-duplication even with fixed-size blocks.

> An alternative I've been thinking about is this:
> find the store references in a file and split it along these references,
> optionally apply further chunking to the non-reference blobs.
>
> It's probably best to do this at the NAR level??

I like the idea!

If I understand correctly we would split whenever a store reference
appears. When a single store reference changes (this probably happens
quite often) then only the preceding block changes.
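
A rough sketch of that splitting, as I understand it. The regular
expression is only an approximation of what a store reference looks
like, and `split_at_references` is a hypothetical helper, not anything
that exists in Guix. Note that the resulting chunks are variable-sized.

```python
import re

# Approximate pattern for a store reference: /gnu/store/<32 chars>-<name>.
# This is an assumption for illustration, not the exact Guix grammar.
STORE_REF = re.compile(rb"/gnu/store/[0-9a-z]{32}-[-+._?=a-zA-Z0-9]+")


def split_at_references(data):
    """Cut data right after every store reference."""
    chunks, start = [], 0
    for match in STORE_REF.finditer(data):
        chunks.append(data[start:match.end()])
        start = match.end()
    if start < len(data):
        chunks.append(data[start:])  # trailing content without a reference
    return chunks


# Made-up store references; only the second one differs between versions.
bash = b"/gnu/store/" + b"a" * 32 + b"-bash-5.1.16"
hello_old = b"/gnu/store/" + b"b" * 32 + b"-hello-2.12"
hello_new = b"/gnu/store/" + b"c" * 32 + b"-hello-2.12"

old = b"#!" + bash + b"/bin/sh\nexec " + hello_old + b"/bin/hello\n"
new = b"#!" + bash + b"/bin/sh\nexec " + hello_new + b"/bin/hello\n"

changed = [
    i for i, (a, b) in enumerate(
        zip(split_at_references(old), split_at_references(new)))
    if a != b
]
print(changed)  # indices of the chunks that differ between old and new
```

Only the chunk that ends with the changed reference differs; the chunks
before and after it are identical in both versions.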

I think there is also a way to do something similar while preserving
fixed size blocks:

Maintain a lookup table for all store references appearing in a store
item. When serializing, this lookup table goes to the front (or back)
with appropriate padding so that it is block-aligned. All store
references in the remaining serialization are replaced by a reference to
the lookup table. Now, when a store reference changes, only the lookup
table changes; the remaining content stays the same and is
de-duplicated.
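
A minimal sketch of the lookup-table idea, under several assumptions:
the store-reference pattern is approximate, the placeholder format
(`\0REF<n>\0`) and the newline-separated table are invented for
illustration, and a real encoding would need escaping and a proper
decoder. The name `encode` is hypothetical.

```python
import re

BLOCK_SIZE = 32 * 1024  # 32 KiB blocks

# Approximate pattern for a store reference (an assumption, not the
# exact Guix grammar).
STORE_REF = re.compile(rb"/gnu/store/[0-9a-z]{32}-[-+._?=a-zA-Z0-9]+")


def encode(data, block_size=BLOCK_SIZE):
    """Return (padded lookup table, body with references replaced)."""
    table = []

    def substitute(match):
        ref = match.group(0)
        if ref not in table:
            table.append(ref)
        # Hypothetical placeholder: index into the lookup table.
        return b"\x00REF%d\x00" % table.index(ref)

    body = STORE_REF.sub(substitute, data)
    header = b"\n".join(table)
    # Pad the table to a multiple of the block size so that the body
    # always starts on a block boundary.
    header += b"\x00" * (-len(header) % block_size)
    return header, body


# Made-up store references; only the second one differs between versions.
bash = b"/gnu/store/" + b"a" * 32 + b"-bash-5.1.16"
hello_old = b"/gnu/store/" + b"b" * 32 + b"-hello-2.12"
hello_new = b"/gnu/store/" + b"c" * 32 + b"-hello-2.12"

old = b"#!" + bash + b"/bin/sh\nexec " + hello_old + b"/bin/hello\n"
new = b"#!" + bash + b"/bin/sh\nexec " + hello_new + b"/bin/hello\n"

header_old, body_old = encode(old)
header_new, body_new = encode(new)
print(body_old == body_new)      # is the body shared between versions?
print(header_old == header_new)  # is the lookup table shared?
```

In this example the two bodies are byte-for-byte identical and only the
block-aligned table differs, so every body block could be de-duplicated
across the two versions.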

A similar idea for also allowing de-duplication when individual files
change: https://codeberg.org/eris/eer/src/branch/eris-fs/eer/eris-fs/index.md

Also check out the Guix `wip-digests` branch. There are some related
interesting ideas there.

I'm working on rebasing and updating the decentralized substitute
patches. Sorry for the slowness. They would at first only address
block-wise transfer with a naive encoding that does not do very good
de-duplication. 

As outlined I think de-duplication can be added later and I think it's
great to start thinking about it and experimenting with ideas.

-pukkamustard


