Re: [gwl-devel] [GWL] (random) next steps?


From: Ricardo Wurmus
Subject: Re: [gwl-devel] [GWL] (random) next steps?
Date: Wed, 16 Jan 2019 23:08:34 +0100
User-agent: mu4e 1.0; emacs 26.1

Hi simon,

I wrote:

> We can connect a graph by joining the inputs of one process with the
> outputs of another.
>
> With a content addressed store we would run processes in isolation and
> map the declared data inputs into the environment.  Instead of working
> on the global namespace of the shared file system we can learn from Guix
> and strictly control the execution environment.  After a process has run
> to completion, only files that were declared as outputs end up in the
> content addressed store.
>
> A process could declare outputs like this:
>
>     (define the-process
>       (process
>         (name 'foo)
>         (outputs
>          '((result "path/to/result.bam")
>            (meta   "path/to/meta.xml")))))
>
> Other processes can then access these files with:
>
>     (output the-process 'result)
>
> i.e. the file corresponding to the declared output “result” of the
> process named by the variable “the-process”.

You wrote:

> From my point of view, there are two different paths:
>  1- the inputs-outputs are attached to the process/rule/unit
>  2- the processes/rules/units are pure functions and the
> `workflow' describes how to glue them together.
[…]
> On one hand, with path 1-, it is hard to reuse the process/rule
> because the composition is hard-coded in the inputs-outputs
> (duplication of the same process/rule with different inputs-outputs).
> The graph is written by the user when they write the inputs-outputs
> chain.
> On the other hand, with path 2-, it is difficult to provide both
> the inputs-outputs to the function and also the graph without
> duplicating some code.

I agree with this assessment.

I would like to note, though, that at least the declaration of outputs
works in both systems.  Only when an exact input is tightly attached to
a process/rule do we limit ourselves to the first path where composition
is inflexible.
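
To illustrate the second path, here is a minimal sketch in plain Guile
Scheme.  It is not GWL syntax: the `<proc>` record, `inputs-of`, and the
edge list are hypothetical names invented for this example, and `output`
only imitates the accessor quoted above.  Processes declare their outputs;
the workflow alone records which output feeds which process.

```scheme
;; Hypothetical sketch, not the GWL API.
(use-modules (srfi srfi-9))

;; A process declares named outputs but says nothing about its inputs.
(define-record-type <proc>
  (make-proc name outputs)
  proc?
  (name    proc-name)
  (outputs proc-outputs))   ; list of (label file-name) entries

;; Look up a declared output by label, like (output the-process 'result).
(define (output proc label)
  (let ((entry (assq label (proc-outputs proc))))
    (if entry
        (cadr entry)
        (error "undeclared output" (proc-name proc) label))))

(define filter-reads
  (make-proc 'filter '((result "filtered.fastq"))))

(define align-reads
  (make-proc 'align '((result "aligned.bam"))))

;; The workflow holds the edges of the graph.  Each edge reads:
;; consumer, producer, label of the producer's output.
(define workflow
  (list (list align-reads filter-reads 'result)))

;; Resolve the input files of PROC from the workflow's edges.
(define (inputs-of proc edges)
  (let loop ((edges edges) (acc '()))
    (cond ((null? edges) (reverse acc))
          ((eq? (car (car edges)) proc)
           (let ((edge (car edges)))
             (loop (cdr edges)
                   (cons (output (cadr edge) (caddr edge)) acc))))
          (else (loop (cdr edges) acc)))))

(inputs-of align-reads workflow)  ; => ("filtered.fastq")
```

Because `align-reads` never names its input, the same definition could be
reused in another workflow with a different edge list.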

> Last, is it useful to write the intermediate files to disk if they are
> not stored?
> In the thread [0], we discussed the possibility of streaming through pipes.
> Let's say, the simple case:
>    filter input > filtered
>    quality filtered > output
> and the piped version is better if you do not care about the filtered file:
>    filter input | quality > output
>
> However, the classic pipe does not fit this case:
>    filter input_R1 > R1_filtered
>    filter input_R2 > R2_filtered
>    align R1_filtered R2_filtered > output_aligned
> In general, one is not interested in keeping the files
> R{1,2}_filtered.  So why spend time writing them to disk and hashing
> them?
>
> In other words, is it doable to stream the `processes' at the process
> level?
[…]
> [0] http://lists.gnu.org/archive/html/guix-devel/2018-07/msg00231.html

For this to work at all, inputs and outputs must be declared.  This
wasn’t mentioned before, but it could of course be done in the workflow
declaration rather than in the individual process descriptions.

But even then it isn’t clear to me how to do this in a general fashion.
It may work fine for tools that write to I/O streams, but we would
probably need mechanisms to declare this behaviour.  It cannot be
generally inferred, nor can a process automatically change the behaviour
of its procedure to switch between the generation of intermediate files
and output to a stream.

The GWL examples show the use of the “(system "foo > out.file")” idiom,
which I don’t like very much.  I’d prefer to use "foo" directly and
declare the output to be a stream.
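
As a rough illustration of running a command directly and treating its
output as a stream, plain Guile can already do this with `open-pipe*`
from `(ice-9 popen)`: no shell and no redirection are involved, and the
standard output arrives as a port.  `call-with-command-output` is a
hypothetical helper written for this example, and `echo` merely stands in
for "foo"; how GWL would let a process declare such an output is left
open.

```scheme
(use-modules (ice-9 popen)    ; open-pipe*, close-pipe
             (ice-9 rdelim))  ; read-line

;; Hypothetical helper: run COMMAND (a list of strings) without a shell
;; and pass its standard output, as a port, to CONSUMER.
(define (call-with-command-output command consumer)
  (let* ((port   (apply open-pipe* OPEN_READ command))
         (result (consumer port)))
    (close-pipe port)          ; wait for the command to exit
    result))

;; "echo" stands in for "foo"; the consumer reads one line from the stream.
(call-with-command-output '("echo" "hello") read-line)  ; => "hello"
```

A downstream process could be handed the port instead of a file name, so
no intermediate file would need to be written or hashed.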

> Last, could we add a GWL session to the before-FOSDEM days?

The Guix Days are what we make of them, so yes, we can have a GWL
session there :)

--
Ricardo



