bug#42162: Recovering source tarballs
From: Ludovic Courtès
Subject: bug#42162: Recovering source tarballs
Date: Mon, 20 Jul 2020 10:39:06 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
Hi!
There are many, many comments in your message, so I took the liberty of
replying only to its essence. :-)
zimoun <zimon.toutoune@gmail.com> skribis:
> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> For the now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs. There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could use "git-fetch". Today the source comes through url-fetch, but it
> could come through git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.
>
> Another example: the packages in gnu/packages/emacs-xyz.scm that come
> from elpa.gnu.org use "url-fetch" and could use "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-) I mean
> the 70% figure could be reduced somewhat.
The “no way around it” was about the situation today: it’s a fact that
70% of packages are built from tarballs, so we need to be able to fetch
them or reconstruct them.
However, the two examples above are good ideas as to the way forward: we
could start a url-fetch-to-git-fetch migration in these two cases, and
perhaps more.
>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time. Cuirass
>> jobset? Mcron job to preserve GC roots? Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help. At least for all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- both are
> equivalent from Guix's side.
>
> What about in addition push to IPFS? Feasible? Lookup issue?
Lookup issue. :-)  The hash in a CID is not just a raw blob hash.
Files are typically chunked beforehand, assembled as a Merkle tree, and
the CID is roughly the hash of the tree root.  So it would seem we can’t
use IPFS as-is for tarballs.
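To make the lookup issue concrete, here is a toy sketch — not the real
IPFS chunker or CID format, which additionally involves multihash and
multicodec encoding — showing why a raw sha256 of a tarball, which is
what Guix records, cannot match a Merkle-root hash computed over its
chunks:

```python
import hashlib

def raw_hash(data):
    """Plain content hash, as Guix stores for tarballs."""
    return hashlib.sha256(data).hexdigest()

def toy_merkle_root(data, chunk_size=4):
    """Toy stand-in for IPFS-style addressing: hash fixed-size chunks,
    then hash the concatenation of the chunk hashes as the 'root'."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    leaf_hashes = [hashlib.sha256(c).digest() for c in chunks]
    return hashlib.sha256(b"".join(leaf_hashes)).hexdigest()

data = b"example tarball bytes"
# The two schemes hash different byte strings, so the identifiers differ:
print(raw_hash(data) == toy_merkle_root(data))
```

Knowing a tarball's sha256 thus tells you nothing about its CID unless
you re-run the exact chunking pipeline over the original bytes.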
>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs. But that raises two questions:
>>
>> • If we no longer deal with tarballs but upstreams keep signing
>> tarballs (not raw directory hashes), how can we authenticate our
>> code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?
Not automatically; packagers are supposed to authenticate code when they
add a package (‘guix refresh -u’ does that automatically).
>> • SWH internally store Git-tree hashes, not nar hashes, so we still
>> wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>> tarball = metadata + tree
>
> There are different issues at different levels:
>
> 1. how to lookup? what information do we need to keep/store to be able
> to query SWH?
> 2. how to check the integrity? what information do we need to
> keep/store to be able to verify that SWH returns what Guix expects?
> 3. how to authenticate? where does the tarball metadata have to be
> stored if SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
> - upstream url
> - commit / tag
> - integrity (sha256)
>
> Fetching from SWH requires only the commit (lookup-revision) or the
> tag+url (lookup-origin-revision); then, from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?
Yes.
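As a hypothetical illustration of those two lookups, here are URL
builders in the shape of the public SWH REST API; the endpoint paths
mirror archive.softwareheritage.org but should be treated as a sketch —
the (guix swh) module is the authoritative client:

```python
from urllib.parse import quote

# Base of the public Software Heritage REST API (illustrative).
SWH_API = "https://archive.softwareheritage.org/api/1"

def revision_lookup_url(commit):
    """lookup-revision: a commit (sha1_git) identifier alone is enough."""
    return f"{SWH_API}/revision/{commit}/"

def origin_lookup_url(origin_url):
    """lookup-origin-revision: start from the origin URL; the tag is then
    resolved through the origin's latest visit and snapshot."""
    return f"{SWH_API}/origin/{quote(origin_url, safe='')}/visits/"
```

For example, `revision_lookup_url("a" * 40)` (a hypothetical all-`a`
commit id) yields a single self-contained query, whereas the origin
route needs the extra tag-resolution step.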
> Therefore, one way to fix lookup for url-fetch sources is to add an
> extra field mimicking the commit's role.
But today, we store tarball hashes, not directory hashes.
> The easiest is to store a SWHID, or an identifier from which the
> SWHID can be deduced.
>
> I have not checked the code, but something like this:
>
> https://pypi.org/project/swh.model/
> https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to the integrity field.
I’m skeptical about adding a field that is practically never used.
[...]
>> The code below can “disassemble” and “assemble” a tar. When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ’assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to run 'assemble-archive'?
We’d have a repo/database containing metadata indexed by tarball sha256.
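As a rough illustration of what such a record could contain — the field
set here is hypothetical, and a real disassembler (like the Scheme code
discussed above) must also record compression parameters and exact
header layout to rebuild the tarball bit-for-bit:

```python
import hashlib
import io
import tarfile

def disassemble(tarball_bytes):
    """Toy 'disassemble': record per-entry tar metadata plus the sha256
    of each file's content, keyed by the sha256 of the whole tarball —
    the index key the metadata repo/database would use."""
    entries = []
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes)) as tar:
        for member in tar:
            entry = {"name": member.name, "mode": member.mode,
                     "mtime": member.mtime, "size": member.size}
            if member.isfile():
                entry["sha256"] = hashlib.sha256(
                    tar.extractfile(member).read()).hexdigest()
            entries.append(entry)
    return {hashlib.sha256(tarball_bytes).hexdigest(): entries}
```

Reassembly would walk these entries in order, fetch each content blob by
its hash (from SWH or elsewhere), and re-emit headers and padding.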
> How should this database that maps tarball hashes to metadata be
> maintained? Git push hook? Cron task?
Yes, something like that. :-)
> What about foreign channels? Should they maintain their own map?
Yes, presumably.
> To summarize, it would work like this, right?
>
> at package time:
> - store an integrity identifier (today sha256-nix-base32)
> - disassemble the tarball
> - commit the metadata to another repo, using the path (address)
> sha256/base32/<identifier>
> - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
> - use the integrity identifier to query the database repo
> - lookup the SWHID from the database repo
> - fetch the data from SWH
> - or lookup the IPFS identifier from the database repo and fetch the
> data from IPFS, for another example
> - re-assemble the tarball using the metadata from the database repo
> - check integrity, authentication, etc.
That’s the idea.
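The addressing step in that workflow can be sketched as follows.  The
base32 alphabet and bit order are written from my recollection of the
Nix scheme that (guix base32) implements, so verify against that module
before relying on them:

```python
import hashlib
import os.path

# Nix-style base32 alphabet: 32 characters, omitting e, o, t, u
# (believed correct, but check (guix base32) to be sure).
NIX_BASE32_CHARS = "0123456789abcdfghijklmnpqrsvwxyz"

def nix_base32(digest):
    """Encode a raw digest in Nix-style base32 (low bits first)."""
    n = len(digest)
    length = (n * 8 - 1) // 5 + 1
    out = []
    for k in range(length - 1, -1, -1):
        b = k * 5
        i, j = b // 8, b % 8
        c = digest[i] >> j
        if i + 1 < n:
            c |= digest[i + 1] << (8 - j)
        out.append(NIX_BASE32_CHARS[c & 0x1f])
    return "".join(out)

def metadata_path(tarball_bytes):
    """Address in the metadata repo, mirroring the proposed
    sha256/base32/<identifier> layout."""
    return os.path.join("sha256", "base32",
                        nix_base32(hashlib.sha256(tarball_bytes).digest()))
```

A sha256 digest is 32 bytes, so the identifier component is always 52
base32 characters, giving stable, collision-resistant paths in the
database repo.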
> The metadata format (for disassembling) that you propose is schemish
> (obviously! :-)) but we could propose something more JSON-like.
Sure, if that helps get other people on-board, why not (though sexps
have lived much longer than JSON and XML together :-)).
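For instance, one metadata entry rendered in a JSON-like shape — the
field names here are purely illustrative, not the actual disassemble
format:

```python
import json

# Hypothetical per-file record from a disassembled tarball.
entry = {
    "name": "flowCore/DESCRIPTION",
    "mode": 0o644,
    "size": 1234,
    "sha256": "0" * 64,
}
print(json.dumps(entry, indent=2))
# The sexp rendering of the same record would read roughly:
# (entry (name "flowCore/DESCRIPTION") (mode #o644) (size 1234)
#        (sha256 "0000..."))
```

Either serialization carries the same information; the choice mostly
affects which tooling ecosystems can consume the database easily.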
Thanks,
Ludo’.
- bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020, Ludovic Courtès, 2020/07/02
- bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020, zimoun, 2020/07/02
- bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020, Ludovic Courtès, 2020/07/02
- bug#42162: Recovering source tarballs, zimoun, 2020/07/15
- bug#42162: Recovering source tarballs, Ludovic Courtès <=
- bug#42162: Recovering source tarballs, zimoun, 2020/07/20
- bug#42162: Recovering source tarballs, Dr. Arne Babenhauserheide, 2020/07/20
- bug#42162: Recovering source tarballs, zimoun, 2020/07/20
- bug#42162: Recovering source tarballs, Ludovic Courtès, 2020/07/21
- bug#42162: Recovering source tarballs, zimoun, 2020/07/21
- bug#42162: Recovering source tarballs, Ludovic Courtès, 2020/07/22
- bug#42162: Recovering source tarballs, Timothy Sample, 2020/07/30
- bug#42162: Recovering source tarballs, Ludovic Courtès, 2020/07/31