bug-datamash

Re: New program: rand(1)


From: Tim Rice
Subject: Re: New program: rand(1)
Date: Sat, 20 Aug 2022 23:16:04 +0000

> Currently implemented are unif (continuous Uniform distribution), exp
> (Exponential distribution), and norm (Normal distribution). I expect to
> implement additional distributions in the coming weeks.

Hmm, this may require more thought.

Several of the probability distributions I was thinking of require the 
incomplete beta function to simulate efficiently. I thought this could be 
easily copy-pasted from other free software such as the GNU Scientific Library 
(GSL) or R. Now that I've actually taken a stab at it, I feel less confident.

I found the R implementation to be inscrutable. It seems to be inspired by an 
algorithm that was published in the Communications of the ACM in the 1960s, without 
much discussion about which mathematical underpinnings it relies on. The algorithm 
pre-dated the first edition of Abramowitz & Stegun by a year.

The GSL implementation is easier to follow, because it visibly relies on the 
continued-fraction expansion from Abramowitz & Stegun 26.5. However, there is a 
bit of a rabbit hole of one function depending on another and another ad nauseam, 
so you can't just copy-paste one file. There are at least a thousand lines of code 
to curate.
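
For illustration, here is a minimal sketch of what a continued-fraction 
evaluation of the regularized incomplete beta function can look like. It is 
not the GSL or R code: it is written in Python for brevity, the names 
reg_inc_beta and _beta_cf are hypothetical, and evaluating the fraction with 
the modified Lentz method, plus the switch to the symmetry relation 
I_x(a,b) = 1 - I_{1-x}(b,a), is the usual textbook approach and an assumption 
on my part about what a standalone version would do.

```python
import math


def _beta_cf(a, b, x, max_iter=300, eps=1e-12):
    """Evaluate the continued fraction for the incomplete beta function
    using the modified Lentz method (hypothetical helper)."""
    tiny = 1e-300  # guards against division by zero
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even-numbered coefficient of the fraction.
        num = m * (b - m) * x / ((a + m2 - 1.0) * (a + m2))
        d = 1.0 + num * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + num / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        h *= d * c
        # Odd-numbered coefficient of the fraction.
        num = -(a + m) * (a + b + m) * x / ((a + m2) * (a + m2 + 1.0))
        d = 1.0 + num * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + num / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            return h
    raise ArithmeticError("continued fraction did not converge")


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for a, b > 0."""
    if a <= 0.0 or b <= 0.0:
        raise ValueError("a and b must be positive")
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    # Prefactor x^a (1-x)^b / B(a, b), computed in log space.
    log_front = (a * math.log(x) + b * math.log1p(-x)
                 - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))
    front = math.exp(log_front)
    # The fraction converges quickly only for x below this threshold;
    # otherwise use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a).
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b
```

Even this stripped-down version needs a log-gamma function and some care 
about where the fraction converges, which hints at why the library versions 
pull in so many supporting routines.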

I see a few options, and I haven't decided between them yet, so feedback would be welcome:

* Limit the scope of rand(1) to just what is currently implemented? This avoids 
baroquities like the incomplete beta function altogether, at a cost to feature 
completeness. It's easy to do, but unsatisfying.

* Make GNU Datamash depend on GSL? This is also a fairly easy option. There are other 
benefits too: GSL comes with a broad suite of functionality, which may be useful in 
future GNU Datamash development. However, it is a fairly drastic change that will require 
adjustment both by packagers and developers. I am conscious of the advice in the GNU 
Coding Standards: "Do not induce new dependencies on other software lightly."

* Continue the work of integrating copy-pasted code from GSL into GNU Datamash? 
Aside from my immediate exasperation with this effort, there is an additional 
cost that future improvements to the external code won't necessarily make their 
way into our copy. Furthermore, as we continue implementing new features for 
GNU Datamash, we may see more and more copy-pasting from GSL going on. The 
longer we wait before making GSL a dependency, the more effort may be required 
down the track.

* Implement something from scratch? I am not completely averse to this, but it 
increases duplication of effort between different GNU projects. I am also 
worried that with fewer eyes on GNU Datamash than GSL, I will introduce bugs 
that are not an issue in other implementations.

I guess the first and biggest decision is whether to make GSL a dependency. Let 
me know whether you think that would be a good idea or a bad one.

~ Tim


