bug-datamash

Re: New program: rand(1)


From: Shawn Wagner
Subject: Re: New program: rand(1)
Date: Sun, 21 Aug 2022 12:54:09 -0700

I agree with everything Erik says (and that's speaking as someone who wants to add a major dependency, Guile, to datamash).

I do kind of feel that datamash should focus on working with existing data rather than generating new data sets, and that rand(1) might be better suited as a separate project, but I don't feel that strongly enough to protest much.

On Sun, Aug 21, 2022 at 2:28 AM Erik Auerswald <auerswal@unix-ag.uni-kl.de> wrote:
Hi Tim,

On 21.08.22 01:16, Tim Rice wrote:
>> Currently implemented are unif (continuous Uniform distribution), exp
>> (Exponential distribution), and norm (Normal distribution). I expect
>> to implement additional distributions in the coming weeks.
>
> Hmm, this may require more thought.
>
> Several of the probability distributions I was thinking of require the
> incomplete beta function to simulate efficiently. I thought this could
> be easily copy-pasted from other free software such as the GNU
> Scientific Library (GSL) or R. Now that I've actually taken a stab at
> it, I feel less confident.
>
> I found the R implementation to be inscrutable. It seems to be inspired
> by an algorithm that was published in the Communications of the ACM in
> the 1960s, without much discussion about which mathematical
> underpinnings it relies on. The algorithm pre-dated the first edition of
> Abramowitz & Stegun by a year.
>
> The GSL implementation is easier to follow, because it visibly relies on
> the continued-fraction expansion from Abramowitz & Stegun 26.5. However,
> there is a bit of a rabbit-hole of one function depending on another and
> another ad nauseam, so you can't just copy-paste one file. There are at
> least a thousand lines of code to curate.
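
For readers without the handbook at hand, the continued fraction referred to
above is, as far as I recall, the one listed as 26.5.8 in Abramowitz & Stegun
for the regularized incomplete beta function I_x(a,b). The equation number is
quoted from memory and should be checked against the handbook; the notation
(d_k for the partial numerators, B(a,b) for the complete beta function) is
the usual one and is not taken from the GSL or R sources.

```latex
I_x(a,b) \;=\; \frac{x^{a}(1-x)^{b}}{a\,B(a,b)}\;
  \cfrac{1}{1+\cfrac{d_1}{1+\cfrac{d_2}{1+\cdots}}},
\qquad
d_{2m+1} = -\frac{(a+m)(a+b+m)}{(a+2m)(a+2m+1)}\,x,
\qquad
d_{2m} = \frac{m\,(b-m)}{(a+2m-1)(a+2m)}\,x .
```

The formula itself is short. The bulk of a production implementation lies in
evaluating it robustly (log-gamma for B(a,b), switching to the symmetric form
1 - I_{1-x}(b,a) for large x, error handling), which is consistent with the
dependency chain described above.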

I do not like the idea of copying lots of code from an actively
maintained library instead of using the library.  Especially when
the intent is to just use that functionality, as opposed to using
it as a base for modification to implement some different, but
related, functionality.

> I see a few options and haven't decided between them yet, so feedback
> would be welcome:
>
> * Limit the scope of rand(1) to just what is currently implemented? This
> avoids baroquities like the incomplete beta function altogether, at a
> cost to feature completeness. It's easy to do, but unsatisfying.
>
> * Make GNU Datamash depend on GSL? This is also a fairly easy option.
> There are other benefits too: GSL comes with a broad suite of
> functionality, which may be useful in future GNU Datamash development.
> However, it is a fairly drastic change that will require adjustment both
> by packagers and developers. I am conscious of the advice in the GNU
> Coding Standards: "Do not induce new dependencies on other software
> lightly."

I think this would be fine if done as an optional dependency to
add more functionality.  Without GSL, rand(1) could IMHO still
function, but be limited to functionality not implemented with
GSL.

There might be a twist here, in that it seems possible that you
would have used GSL for the existing rand(1) functionality, if
you had intended to use GSL anyway.

With rand(1) being a new addition not yet included in a formal
GNU Datamash release it would seem OK to change course and require
GSL for rand(1), but neither datamash(1) nor decorate(1).  It
could be determined during ./configure if rand(1) can be built
or not, and only rand(1) excluded without GSL.
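
The optional-dependency idea can be sketched in C roughly as follows. This is
only an illustration under assumptions, not code from GNU Datamash: the macro
HAVE_GSL (imagined as being defined by ./configure when GSL is found), the
functions dist_supported() and draw_beta(), and the choice of "beta" as a
GSL-backed distribution are all hypothetical. The GSL calls themselves
(gsl_rng, gsl_ran_beta) are part of the real library.

```c
#include <stdbool.h>
#include <string.h>

#ifdef HAVE_GSL                 /* hypothetical macro set by ./configure */
# include <gsl/gsl_rng.h>
# include <gsl/gsl_randist.h>
#endif

/* Distributions that need no external library (the three named in the
   thread as currently implemented). */
static const char *const builtin_dists[] = { "unif", "exp", "norm" };

#ifdef HAVE_GSL
/* Hypothetical extras that would only exist in a GSL-enabled build. */
static const char *const gsl_dists[] = { "beta" };
#endif

/* Return true if this build can generate the named distribution. */
bool
dist_supported (const char *name)
{
  for (size_t i = 0; i < sizeof builtin_dists / sizeof *builtin_dists; i++)
    if (strcmp (name, builtin_dists[i]) == 0)
      return true;
#ifdef HAVE_GSL
  for (size_t i = 0; i < sizeof gsl_dists / sizeof *gsl_dists; i++)
    if (strcmp (name, gsl_dists[i]) == 0)
      return true;
#endif
  return false;
}

#ifdef HAVE_GSL
/* Thin wrapper: one Beta(a, b) variate from a caller-owned GSL generator. */
double
draw_beta (gsl_rng *rng, double a, double b)
{
  return gsl_ran_beta (rng, a, b);
}
#endif
```

Built without HAVE_GSL, the file compiles with no reference to GSL and
dist_supported ("beta") returns false, so the program could report an
"unsupported in this build" error instead of failing to build. The other
variant discussed above, requiring GSL for rand(1) as a whole, would instead
skip building the program entirely when ./configure does not find the library.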

The addition of rand(1) has already placed some burden on packagers,
since at least Ubuntu already has a "rand" package[1] containing
a rand(1) binary[2].  The Debian packaging system has provisions
to handle this and they probably need to be used for the next
GNU Datamash release.

[1] https://packages.ubuntu.com/kinetic/rand
[2] https://launchpad.net/rand

Adding an optional build and runtime dependency and thus adding
the decision whether to add it to the package or omit some
functionality seems fine to me.

> * Continue the work of integrating copy-pasted code from GSL into GNU
> Datamash? Aside from my immediate exasperation with this effort, there
> is an additional cost that future improvements to the external code
> won't necessarily make their way into our copy. Furthermore, as we
> continue implementing new features for GNU Datamash, we may see more and
> more copy-pasting from GSL going on. The longer we wait before making
> GSL a dependency, the more effort may be required down the track.

I concur that a simple copy & paste approach does not seem to
lead to maintainable code.

> * Implement something from scratch? I am not completely averse to this,
> but it increases duplication of effort between different GNU projects. I
> am also worried that with fewer eyes on GNU Datamash than GSL, I will
> introduce bugs that are not an issue in other implementations.

It seems to me as if doing this just to avoid a dependency on a
widely available library would not be worth it.

I consider the copy & paste approach to be similar to this.
Copying existing code into GNU Datamash would require taking ownership
of this code copy.  As such I would view it as a starting point for
a new implementation, because it is quite likely that it would diverge
over time from GSL anyway.

I too think that this risks introducing bugs that would have been
avoided by using GSL as a library.

> I guess the first and biggest decision is whether to make GSL a
> dependency. Let me know whether you think it would be a good idea or bad
> idea.

I would not mind if GSL were added as an optional dependency.
IMHO datamash(1) and decorate(1) should not require GSL.

HTH,
Erik

