guix-commits
branch master updated: website: Add post on software identification.


From: Ludovic Courtès
Subject: branch master updated: website: Add post on software identification.
Date: Mon, 04 Mar 2024 08:05:16 -0500

This is an automated email from the git hooks/post-receive script.

civodul pushed a commit to branch master
in repository guix-artwork.

The following commit(s) were added to refs/heads/master by this push:
     new 7e658b3  website: Add post on software identification.
7e658b3 is described below

commit 7e658b3ff206e033c2a6420808fce595dcfb0a57
Author: Ludovic Courtès <ludo@gnu.org>
AuthorDate: Mon Mar 4 14:04:37 2024 +0100

    website: Add post on software identification.
    
    * website/posts/software-identification.md: New file.
---
 website/posts/software-identification.md | 256 +++++++++++++++++++++++++++++++
 1 file changed, 256 insertions(+)

diff --git a/website/posts/software-identification.md b/website/posts/software-identification.md
new file mode 100644
index 0000000..2ccbbb6
--- /dev/null
+++ b/website/posts/software-identification.md
@@ -0,0 +1,256 @@
+title: Identifying software
+author: Ludovic Courtès, Maxim Cournoyer, Jan Nieuwenhuizen, Simon Tournier
+date: 2024-03-04 15:00
+tags: Security
+---
+
+What does it take to “identify software”?  How can we tell what software
+is running on a machine to determine, for example, what security
+vulnerabilities might affect it?
+
+In October 2023, the US Cybersecurity and Infrastructure Security Agency
+(CISA) published a white paper entitled [_Software Identification
+Ecosystem Option
+Analysis_](https://www.cisa.gov/resources-tools/resources/software-identification-ecosystem-option-analysis)
+that looks at existing options to address these questions.  The
+publication was followed by a [request for
+comments](https://www.regulations.gov/document/CISA-2023-0026-0001); our
+[comment](https://git.savannah.gnu.org/cgit/guix/maintenance.git/plain/doc/cisa-2023-0026-0001/cisa-2023-0026-0001.pdf)
+as Guix developers didn’t make it on time to be published, but we’d like
+to share it here.
+
+Software identification for cybersecurity purposes is a crucial topic,
+as the white paper explains in its introduction:
+
+> Effective vulnerability management requires software to be trackable
+> in a way that allows correlation with other information such as known
+> vulnerabilities […]. This correlation is only possible when different
+> cybersecurity professionals know they are talking about the same
+> software.
+
+The [Common Platform Enumeration
+(CPE)](https://en.wikipedia.org/wiki/Common_Platform_Enumeration)
+standard has been designed to fill that role; it is used to identify
+software as part of the well-known [Common Vulnerabilities and Exposures
+(CVE)](https://en.wikipedia.org/wiki/Common_Vulnerabilities_and_Exposures)
+process.  But CPE is showing its limits as an _extrinsic identification
+mechanism_: the human-readable identifiers chosen by CPE fail to capture
+the complexity of what “software” is.
+
+We think functional software deployment as implemented by Nix and Guix,
+coupled with the source code identification work carried out by Software
+Heritage, provides a unique perspective on these matters.
+
+# On Software Identification
+
+The *Software Identification Ecosystem Option Analysis* white paper
+released by CISA in October 2023 studies options towards the definition
+of *a software identification ecosystem that can be used across the
+complete, global software space for all key cybersecurity use cases*.
+
+Our experience lies in the design and development of
+[GNU Guix](https://guix.gnu.org), a package manager, software deployment
+tool, and GNU/Linux distribution, which emphasizes three key elements:
+**reproducibility, provenance tracking, and auditability**. We explain
+in the following sections our approach and how it relates to the goal
+stated in the aforementioned white paper.
+
+Guix produces binary artifacts of varying complexity from source code:
+package binaries, application bundles (container images to be consumed
+by Docker and related tools), system installations, system bundles
+(container and virtual machine images).
+
+All these artifacts qualify as “software” and so does source code. Some
+of this “software” comes from well-identified upstream packages,
+sometimes with modifications added downstream by packagers (patches);
+binary artifacts themselves are the byproduct of a build process where
+the package manager uses *other* binary artifacts it previously built
+(compilers, libraries, etc.) along with more source code (the package
+definition) to build them. How can one identify “software” in that
+sense?
+
+Software is dual: it exists in *source* form and in *binary*,
+machine-executable form. The latter is the outcome of a complex
+computational process taking source code and intermediary binaries as
+input.
+
+Our thesis can be summarized as follows:
+
+> **We consider that the requirements for source code identifiers differ
+> from the requirements to identify binary artifacts.**
+>
+> Our view, embodied in GNU Guix, is that:
+>
+> 1.  **Source code** can be identified in an unambiguous and
+>     distributed fashion through *inherent identifiers* such as
+>     cryptographic hashes.
+>
+> 2.  **Binary artifacts**, instead, need to be the byproduct of a
+>     *comprehensive and verifiable build process itself available as
+>     source code*.
+
+In the next sections, to clarify the context of this statement, we show
+how Guix identifies source code, how it defines the *source-to-binary*
+path and ensures its verifiability, and how it provides provenance
+tracking.
+
+# Source Code Identification
+
+Guix includes [package
+definitions](https://guix.gnu.org/manual/en/html_node/Defining-Packages.html)
+for almost 30,000 packages. Each package definition identifies its
+[origin](https://guix.gnu.org/manual/en/html_node/origin-Reference.html)—its
+“main” source code as well as patches. The origin is
+**content-addressed**: it includes a SHA256 cryptographic hash of the
+code (an *inherent identifier*), along with a primary URL to download
+it.
+
+Since source is content-addressed, the URL can be thought of as a hint.
+Indeed, **we connected Guix to the [Software
+Heritage](https://www.softwareheritage.org) source code archive**: when
+source code vanishes from its original URL, Guix falls back to
+downloading it from the archive. This is made possible thanks to the use
+of inherent (or intrinsic) identifiers both by Guix and Software
+Heritage.
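+
+As a rough illustration of why the URL is only a hint, here is a minimal
+Python sketch (not Guix's actual code). The `fetch_by_hash` function and
+the three stand-in sources are hypothetical, and the hash is compared in
+hexadecimal for simplicity: any source will do, as long as the bytes it
+returns match the expected SHA256.
+
+```python
+import hashlib
+from typing import Callable, Iterable, Optional
+
+
+def sha256_hex(data: bytes) -> str:
+    """Return the SHA256 of DATA as a hexadecimal string."""
+    return hashlib.sha256(data).hexdigest()
+
+
+def fetch_by_hash(expected: str,
+                  sources: Iterable[Callable[[], Optional[bytes]]]) -> bytes:
+    """Try each source in order; accept the first one whose content
+    matches the expected hash, wherever it comes from."""
+    for fetch in sources:
+        data = fetch()
+        if data is not None and sha256_hex(data) == expected:
+            return data
+    raise LookupError(f"no source provided content with hash {expected}")
+
+
+# Hypothetical stand-ins for network access (assumptions for this sketch).
+TARBALL = b"pretend this is a source tarball"
+expected = sha256_hex(TARBALL)
+
+
+def upstream_url() -> Optional[bytes]:
+    return None  # the original URL has vanished
+
+
+def tampered_mirror() -> Optional[bytes]:
+    return b"something else entirely"  # wrong content is rejected
+
+
+def archive() -> Optional[bytes]:
+    return TARBALL  # an archive that still holds the right content
+
+
+recovered = fetch_by_hash(expected, [upstream_url, tampered_mirror, archive])
+```
+
+The caller never needs to trust a particular location: the hash alone
+decides whether the downloaded bytes are the source code that was asked
+for.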
+
+More information can be found in this [2019 blog
+post](https://guix.gnu.org/en/blog/2019/connecting-reproducible-deployment-to-a-long-term-source-code-archive/)
+and in the documents of the [Software Hash Identifiers
+(SWHID)](https://www.swhid.org/) working group.
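+
+To give an idea of what such an inherent identifier looks like, the
+sketch below computes a SWHID for the content of a single file. It
+assumes, to the best of our understanding of the SWHID scheme, that a
+content identifier is the SHA1 Git computes for a blob, prefixed with
+`swh:1:cnt:`; the function name `swhid_for_content` is ours.
+
+```python
+import hashlib
+
+
+def swhid_for_content(data: bytes) -> str:
+    """Return a content SWHID for DATA, assuming it is the Git blob
+    hash: SHA1 over the header "blob <length>\\0" followed by the bytes."""
+    header = b"blob %d\0" % len(data)
+    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()
+
+
+# The identifier depends only on the bytes, not on a name or a URL.
+empty_id = swhid_for_content(b"")
+# empty_id == "swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
+```
+
+Anyone holding the same bytes computes the same identifier, with no
+registry or naming authority involved.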
+
+# Reproducible Builds
+
+Guix provides a **verifiable path from source code to binaries** by
+ensuring [reproducible builds](https://reproducible-builds.org). To
+achieve that, Guix builds upon the pioneering research work of Eelco
+Dolstra that led to the design of the [Nix package
+manager](https://nixos.org), with which it shares the same conceptual
+foundation.
+
+Namely, Guix relies on *hermetic builds*: builds are performed in
+isolated environments that contain nothing but explicitly-declared
+dependencies—where a “dependency” can be the output of another build
+process or source code, including build scripts and patches.
+
+An implication is that **builds can be verified independently**. For
+instance, for a given version of Guix, `guix build gcc`
+should produce the exact same binary, bit-for-bit. To facilitate
+independent verification, `guix challenge gcc` compares the
+binary artifacts of the GNU Compiler Collection (GCC) as built and
+published by different parties. Users can also compare to a local build
+with `guix build gcc --check`.
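+
+The idea behind such a comparison can be sketched in a few lines of
+Python. This is not how `guix challenge` is implemented; the function
+names and the builder labels are made up for the example, which simply
+groups parties by the hash of the artifact they published.
+
+```python
+import hashlib
+from collections import defaultdict
+from typing import Dict, List
+
+
+def compare_builds(builds: Dict[str, bytes]) -> Dict[str, List[str]]:
+    """Group builders by the SHA256 of the artifact each one produced."""
+    groups: Dict[str, List[str]] = defaultdict(list)
+    for builder, artifact in builds.items():
+        groups[hashlib.sha256(artifact).hexdigest()].append(builder)
+    return dict(groups)
+
+
+def is_reproducible(builds: Dict[str, bytes]) -> bool:
+    """True when every party obtained bit-for-bit identical output."""
+    return len(compare_builds(builds)) == 1
+
+
+# Hypothetical results for the same build, from three parties.
+matching = {"build-farm": b"\x7fELF-a", "mirror": b"\x7fELF-a",
+            "local": b"\x7fELF-a"}
+diverging = {"build-farm": b"\x7fELF-a", "mirror": b"\x7fELF-a",
+             "local": b"\x7fELF-b"}
+
+all_agree = is_reproducible(matching)
+groups = compare_builds(diverging)
+```
+
+A single group means all parties agree; more than one group points at
+either a non-deterministic build or a compromised builder, and in both
+cases it is worth investigating.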
+
+As with Nix, build processes are identified by *derivations*, which are
+low-level, content-addressed build instructions; derivations may refer
+to other derivations and to source code. For instance,
+`/gnu/store/c9fqrmabz5nrm2arqqg4ha8jzmv0kc2f-gcc-11.3.0.drv`
+uniquely identifies the derivation to build a specific variant of
+version 11.3.0 of the GNU Compiler Collection (GCC). Changing the
+package definition (patches being applied, build flags, set of
+dependencies), or similarly changing one of the packages it depends
+on, leads to a different derivation (more information can be found in
+[Eelco Dolstra's PhD
+thesis](https://edolstra.github.io/pubs/phd-thesis.pdf)).
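+
+The following Python sketch illustrates this property with a toy model.
+The `Derivation` class, its fields, and the hashing scheme are
+simplifications of ours and do not reproduce the actual format or hash
+encoding used by Guix and Nix; the point is only that the identifier
+covers the whole recipe, recursively.
+
+```python
+import hashlib
+from dataclasses import dataclass
+from typing import Tuple
+
+
+@dataclass(frozen=True)
+class Derivation:
+    """Toy build recipe: source, patches, flags, and input derivations."""
+    name: str
+    version: str
+    source_hash: str
+    patches: Tuple[str, ...] = ()
+    flags: Tuple[str, ...] = ()
+    inputs: Tuple["Derivation", ...] = ()
+
+    def identifier(self) -> str:
+        # Hash the full recipe, including the identifiers of all inputs,
+        # so that any change anywhere in the graph propagates upwards.
+        recipe = repr((self.name, self.version, self.source_hash,
+                       self.patches, self.flags,
+                       tuple(d.identifier() for d in self.inputs)))
+        digest = hashlib.sha256(recipe.encode()).hexdigest()[:32]
+        return f"{digest}-{self.name}-{self.version}.drv"
+
+
+# Made-up source hashes and package data, for illustration only.
+libc = Derivation("glibc", "2.35", "aaaa")
+libc_tuned = Derivation("glibc", "2.35", "aaaa", flags=("--enable-foo",))
+
+gcc = Derivation("gcc", "11.3.0", "bbbb", inputs=(libc,))
+gcc_patched = Derivation("gcc", "11.3.0", "bbbb",
+                         patches=("fix.patch",), inputs=(libc,))
+gcc_other_libc = Derivation("gcc", "11.3.0", "bbbb", inputs=(libc_tuned,))
+
+ids = {d.identifier() for d in (gcc, gcc_patched, gcc_other_libc)}
+```
+
+All three recipes are “gcc 11.3.0”, yet `ids` contains three distinct
+identifiers: one per way of building it.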
+
+Derivations form a graph that **captures the entirety of the build
+processes leading to a binary artifact**. In contrast, mere package
+name/version pairs such as `gcc 11.3.0` fail to capture the
+breadth and depth of the elements that lead to a binary artifact. This is a
+shortcoming of systems such as the **Common Platform Enumeration** (CPE)
+standard: it fails to express whether a vulnerability that applies to
+`gcc 11.3.0` applies to it regardless of how it was built,
+patched, and configured, or whether certain conditions are required.
+
+# Full-Source Bootstrap
+
+Reproducible builds alone cannot ensure the source-to-binary
+correspondence: the compiler could contain a backdoor, as demonstrated
+by Ken Thompson in *Reflections on Trusting Trust*. To address that,
+Guix goes further by implementing so-called **full-source bootstrap**:
+for the first time, literally every package in the distribution is built
+from source code, [starting from a very small binary
+seed](https://guix.gnu.org/en/blog/2023/the-full-source-bootstrap-building-from-source-all-the-way-down/).
+This gives an unprecedented level of transparency, allowing code to be
+audited at all levels, and improving robustness against the
+“trusting-trust attack” described by Ken Thompson.
+
+The European Union recognized the importance of this work through an
+[NLnet Privacy & Trust Enhancing Technologies (NGI0 PET)
+grant](https://nlnet.nl/project/GNUMes-fullsource/) allocated in
+2021 to Jan Nieuwenhuizen to further work on full-source bootstrap in
+GNU Guix, GNU Mes, and related projects, followed by [another
+grant](https://nlnet.nl/project/GNUMes-ARM_RISC-V/) in 2022 to expand
+support to the Arm and RISC-V CPU architectures.
+
+# Provenance Tracking
+
+We define provenance tracking as the ability **to map a binary artifact
+back to its complete corresponding source**. Provenance tracking is
+necessary to allow the recipient of a binary artifact to access the
+corresponding source code and to verify the source/binary correspondence
+if they wish to do so.
+
+The
+[`guix pack`](https://guix.gnu.org/manual/en/html_node/Invoking-guix-pack.html)
+command can be used to build, for instance, container images. Running
+`guix pack -f docker python --save-provenance` produces a
+*self-describing Docker image* containing the binaries of Python and its
+run-time dependencies. The image is self-describing because the
+`--save-provenance` flag leads to the inclusion of a
+*manifest* that describes which revision of Guix was used to produce
+this binary. A third party can retrieve this revision of Guix and from
+there view the entire build dependency graph of Python, view its source
+code and any patches that were applied, and recursively for its
+dependencies.
+
+To summarize, capturing the revision of Guix that was used is all it
+takes to *reproduce* a specific binary artifact. This is illustrated by
+[the `time-machine`
+command](https://guix.gnu.org/manual/en/html_node/Invoking-guix-time_002dmachine.html).
+The example below deploys, *at any time on any machine*, the specific
+build artifact of the `python` package as it was defined in this Guix
+commit:
+
+``` example
+guix time-machine -q --commit=d3c3922a8f5d50855165941e19a204d32469006f \
+  -- install python
+```
+
+In other words, because Guix itself defines how artifacts are built,
+**the revision of the Guix source coupled with the package name
+unambiguously identifies the package’s binary artifact**. As
+scientists, we build on this property to achieve reproducible research
+workflows, as explained in this [2022 article in *Nature Scientific
+Data*](https://doi.org/10.1038/s41597-022-01720-9); as engineers, we
+value this property to analyze the systems we are running and determine
+which known vulnerabilities and bugs apply.
+
+Again, a software bill of materials (SBOM) written as a mere list of
+package name/version pairs would fail to capture as much information.
+The **Artifact Dependency Graph (ADG) of
+[OmniBOR](https://omnibor.io/)**, while less ambiguous, falls short in
+two ways: it is too fine-grained for typical cybersecurity applications
+(at the level of individual source files), and it only captures the
+alleged source/binary correspondence of individual files but not the
+process to go from source to binary.
+
+# Conclusions
+
+Inherent identifiers lend themselves well to unambiguous source code
+identification, as demonstrated by Software Heritage, Guix, and Nix.
+
+However, we believe binary artifacts should instead be treated as the
+result of a computational process; it is that process that needs to be
+fully captured to support **independent verification of the
+source/binary correspondence**. For cybersecurity purposes, recipients
+of a binary artifact must be able to map it back to its source code
+(*provenance tracking*), with the additional guarantee that they must be
+able to reproduce the entire build process to verify the source/binary
+correspondence (*reproducible builds and full-source bootstrap*). As
+long as binary artifacts result from a reproducible build process,
+itself described as source code, **identifying binary artifacts boils
+down to identifying the source code of their build process**.
+
+These ideas are developed in the 2022 scientific paper [*Building a
+Secure Software Supply Chain with
+GNU Guix*](https://doi.org/10.22152/programming-journal.org/2023/7/1).


