bug#42162: Recovering source tarballs
From: Ludovic Courtès
Subject: bug#42162: Recovering source tarballs
Date: Mon, 20 Jul 2020 10:39:06 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
Hi!
There are many, many comments in your message, so I took the liberty of
replying only to its essence. :-)
zimoun <zimon.toutoune@gmail.com> skribis:
> On Sat, 11 Jul 2020 at 17:50, Ludovic Courtès <ludo@gnu.org> wrote:
>
>> For the now, since 70% of our packages use ‘url-fetch’, we need to be
>> able to fetch or to reconstruct tarballs. There’s no way around it.
>
> Yes, but for example all the packages in gnu/packages/bioconductor.scm
> could use "git-fetch". Today the source comes through url-fetch, but it
> could come through git-fetch with https://git.bioconductor.org/packages/flowCore or
> git@git.bioconductor.org:packages/flowCore.
>
> Another example: the packages in gnu/packages/emacs-xyz.scm that come
> from elpa.gnu.org use "url-fetch" and could use "git-fetch", for
> example using
> http://git.savannah.gnu.org/gitweb/?p=emacs/elpa.git;a=tree;f=packages/ace-window;h=71d3eb7bd2efceade91846a56b9937812f658bae;hb=HEAD
>
> So I would be more reserved about the "no way around it". :-) I mean
> the 70% figure could be reduced somewhat.
The “no way around it” was about the situation today: it’s a fact that
70% of packages are built from tarballs, so we need to be able to fetch
them or reconstruct them.
However, the two examples above are good ideas as to the way forward: we
could start a url-fetch-to-git-fetch migration in these two cases, and
perhaps more.
>> In the short term, we should arrange so that the build farm keeps GC
>> roots on source tarballs for an indefinite amount of time. Cuirass
>> jobset? Mcron job to preserve GC roots? Ideas?
>
> Yes, preserving source tarballs for an indefinite amount of time will
> help. At least for all the packages where "lookup-content" returns #f,
> which means they are not in SWH or they are unreachable -- both are
> equivalent from Guix's side.
>
> What about in addition push to IPFS? Feasible? Lookup issue?
Lookup issue. :-)  The hash in a CID is not just a raw blob hash.
Files are typically chunked beforehand, assembled as a Merkle tree, and
the CID is roughly the hash of the tree root.  So it would seem we can’t
use IPFS as-is for tarballs.
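To make the lookup issue concrete, here is a toy sketch — not the real
IPFS chunker or CID format, which additionally involves multihash and
multicodec encoding — showing why a raw sha256 of a tarball, which is
what Guix records, cannot match a Merkle-root hash computed over its
chunks:

```python
import hashlib

def raw_hash(data):
    """Plain content hash, as Guix stores for tarballs."""
    return hashlib.sha256(data).hexdigest()

def toy_merkle_root(data, chunk_size=4):
    """Toy stand-in for IPFS-style addressing: hash fixed-size chunks,
    then hash the concatenation of the chunk hashes as the 'root'."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    leaf_hashes = [hashlib.sha256(c).digest() for c in chunks]
    return hashlib.sha256(b"".join(leaf_hashes)).hexdigest()

data = b"example tarball bytes"
# The two schemes hash different byte strings, so the identifiers differ:
print(raw_hash(data) == toy_merkle_root(data))
```

Knowing a tarball's sha256 thus tells you nothing about its CID unless
you re-run the exact chunking pipeline over the original bytes.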
>> For the future, we could store nar hashes of unpacked tarballs instead
>> of hashes over tarballs. But that raises two questions:
>>
>> • If we no longer deal with tarballs but upstreams keep signing
>> tarballs (not raw directory hashes), how can we authenticate our
>> code after the fact?
>
> Does Guix automatically authenticate code using signed tarballs?
Not automatically; packagers are supposed to authenticate code when they
add a package (‘guix refresh -u’ does that automatically).
>> • SWH internally store Git-tree hashes, not nar hashes, so we still
>> wouldn’t be able to fetch our unpacked trees from SWH.
>>
>> (Both issues were previously discussed at
>> <https://sympa.inria.fr/sympa/arc/swh-devel/2016-07/>.)
>>
>> So for the medium term, and perhaps for the future, a possible option
>> would be to preserve tarball metadata so we can reconstruct them:
>>
>> tarball = metadata + tree
>
> There are different issues at different levels:
>
> 1. how to lookup? what information do we need to keep/store to be able
> to query SWH?
> 2. how to check the integrity? what information do we need to
> keep/store to be able to verify that SWH returns what Guix expects?
> 3. how to authenticate? where does the tarball metadata have to be
> stored if SWH removes it?
>
> Basically, the git-fetch source stores 3 identifiers:
>
> - upstream url
> - commit / tag
> - integrity (sha256)
>
> Fetching from SWH requires only the commit (lookup-revision) or the
> tag+url (lookup-origin-revision); then, from the returned revision, the
> integrity of the downloaded data is checked using the sha256, right?
Yes.
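As a hypothetical illustration of those two lookups, here are URL
builders in the shape of the public SWH REST API; the endpoint paths
mirror archive.softwareheritage.org but should be treated as a sketch —
the (guix swh) module is the authoritative client:

```python
from urllib.parse import quote

# Base of the public Software Heritage REST API (illustrative).
SWH_API = "https://archive.softwareheritage.org/api/1"

def revision_lookup_url(commit):
    """lookup-revision: a commit (sha1_git) identifier alone is enough."""
    return f"{SWH_API}/revision/{commit}/"

def origin_lookup_url(origin_url):
    """lookup-origin-revision: start from the origin URL; the tag is then
    resolved through the origin's latest visit and snapshot."""
    return f"{SWH_API}/origin/{quote(origin_url, safe='')}/visits/"
```

For example, `revision_lookup_url("a" * 40)` (a hypothetical all-`a`
commit id) yields a single self-contained query, whereas the origin
route needs the extra tag-resolution step.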
> Therefore, one way to fix lookup for url-fetch sources is to add an
> extra field mimicking the commit's role.
But today, we store tarball hashes, not directory hashes.
> The easiest is to store a SWHID, or an identifier from which the
> SWHID can be deduced.
>
> I have not checked the code, but something like this:
>
> https://pypi.org/project/swh.model/
> https://forge.softwareheritage.org/source/swh-model/
>
> and at package time, this identifier is added, similarly to the integrity field.
I’m skeptical about adding a field that is practically never used.
[...]
>> The code below can “disassemble” and “assemble” a tar. When it
>> disassembles it, it generates metadata like this:
>
> [...]
>
>> The ’assemble-archive’ procedure consumes that, looks up file contents
>> by hash on SWH, and reconstructs the original tarball…
>
> Where do you plan to store the "disassembled" metadata?
> And where do you plan to run 'assemble-archive'?
We’d have a repo/database containing metadata indexed by tarball sha256.
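As a rough illustration of what such a record could contain — the field
set here is hypothetical, and a real disassembler (like the Scheme code
discussed above) must also record compression parameters and exact
header layout to rebuild the tarball bit-for-bit:

```python
import hashlib
import io
import tarfile

def disassemble(tarball_bytes):
    """Toy 'disassemble': record per-entry tar metadata plus the sha256
    of each file's content, keyed by the sha256 of the whole tarball —
    the index key the metadata repo/database would use."""
    entries = []
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes)) as tar:
        for member in tar:
            entry = {"name": member.name, "mode": member.mode,
                     "mtime": member.mtime, "size": member.size}
            if member.isfile():
                entry["sha256"] = hashlib.sha256(
                    tar.extractfile(member).read()).hexdigest()
            entries.append(entry)
    return {hashlib.sha256(tarball_bytes).hexdigest(): entries}
```

Reassembly would walk these entries in order, fetch each content blob by
its hash (from SWH or elsewhere), and re-emit headers and padding.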
> How should this database that maps tarball hashes to metadata be
> maintained? Git push hook? Cron task?
Yes, something like that. :-)
> What about foreign channels? Should they maintain their own map?
Yes, presumably.
> To summarize, it would work like this, right?
>
> at package time:
> - store an integrity identifier (today sha256-nix-base32)
> - disassemble the tarball
> - commit the metadata to another repo, using the path (address)
> sha256/base32/<identifier>
> - push to packages-repo *and* metadata-database-repo
>
> at future time: (upstream has disappeared, say!)
> - use the integrity identifier to query the database repo
> - lookup the SWHID from the database repo
> - fetch the data from SWH
> - or lookup the IPFS identifier from the database repo and fetch the
> data from IPFS, for another example
> - re-assemble the tarball using the metadata from the database repo
> - check integrity, authentication, etc.
That’s the idea.
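The addressing step in that workflow can be sketched as follows.  The
base32 alphabet and bit order are written from my recollection of the
Nix scheme that (guix base32) implements, so verify against that module
before relying on them:

```python
import hashlib
import os.path

# Nix-style base32 alphabet: 32 characters, omitting e, o, t, u
# (believed correct, but check (guix base32) to be sure).
NIX_BASE32_CHARS = "0123456789abcdfghijklmnpqrsvwxyz"

def nix_base32(digest):
    """Encode a raw digest in Nix-style base32 (low bits first)."""
    n = len(digest)
    length = (n * 8 - 1) // 5 + 1
    out = []
    for k in range(length - 1, -1, -1):
        b = k * 5
        i, j = b // 8, b % 8
        c = digest[i] >> j
        if i + 1 < n:
            c |= digest[i + 1] << (8 - j)
        out.append(NIX_BASE32_CHARS[c & 0x1f])
    return "".join(out)

def metadata_path(tarball_bytes):
    """Address in the metadata repo, mirroring the proposed
    sha256/base32/<identifier> layout."""
    return os.path.join("sha256", "base32",
                        nix_base32(hashlib.sha256(tarball_bytes).digest()))
```

A sha256 digest is 32 bytes, so the identifier component is always 52
base32 characters, giving stable, collision-resistant paths in the
database repo.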
> The metadata format (for disassembling) that you propose is schemish
> (obviously! :-)) but we could propose something more JSON-like.
Sure, if that helps get other people on-board, why not (though sexps
have lived much longer than JSON and XML together :-)).
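For instance, one metadata entry rendered in a JSON-like shape — the
field names here are purely illustrative, not the actual disassemble
format:

```python
import json

# Hypothetical per-file record from a disassembled tarball.
entry = {
    "name": "flowCore/DESCRIPTION",
    "mode": 0o644,
    "size": 1234,
    "sha256": "0" * 64,
}
print(json.dumps(entry, indent=2))
# The sexp rendering of the same record would read roughly:
# (entry (name "flowCore/DESCRIPTION") (mode #o644) (size 1234)
#        (sha256 "0000..."))
```

Either serialization carries the same information; the choice mostly
affects which tooling ecosystems can consume the database easily.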
Thanks,
Ludo’.
- bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020, Ludovic Courtès, 2020/07/02
- bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020, zimoun, 2020/07/02
- bug#42162: gforge.inria.fr to be taken off-line in Dec. 2020, Ludovic Courtès, 2020/07/02
- bug#42162: Recovering source tarballs, zimoun, 2020/07/15
- bug#42162: Recovering source tarballs, Ludovic Courtès <=
- bug#42162: Recovering source tarballs, zimoun, 2020/07/20
- bug#42162: Recovering source tarballs, Dr. Arne Babenhauserheide, 2020/07/20
- bug#42162: Recovering source tarballs, zimoun, 2020/07/20
- bug#42162: Recovering source tarballs, Ludovic Courtès, 2020/07/21
- bug#42162: Recovering source tarballs, zimoun, 2020/07/21
- bug#42162: Recovering source tarballs, Ludovic Courtès, 2020/07/22
- bug#42162: Recovering source tarballs, Timothy Sample, 2020/07/30
- bug#42162: Recovering source tarballs, Ludovic Courtès, 2020/07/31