Where to discuss replacement of Rmath?

dmbates · March 12, 2018, 7:54pm

A long-standing objective related to the Distributions and StatsFuns packages is to provide native Julia versions of various pdf (probability density or mass) and cdf (cumulative distribution) functions. The Rmath package is used to evaluate many of these but it has a couple of disadvantages: its license (GPLv2) and its lack of flexibility with respect to types.

Where is a good place to discuss replacing these functions? I would have suggested creating an issue in StatsFuns but recently there were additions to very old issue in Distributions related to this?

A general question is to what extent can a native Julia implementation be based on the Rmath code without being in violation of the GPL? I realize that a Julia developer reading the Rmath code then implementing a similar function in Julia would be a violation of the license. Can a third-party read the code and provide a textual description of the method?

If there are references to papers they can sometimes be used but some journals have rather bizarre conditions. I believe ACM Trans. Math. Software was the one of the worst for imposing conditions on algorithms that it published.

I would like to see native Julia implementations because of the flexibility and maintainability but I don’t want to taint such an effort by participating if there would be a problem with my being involved in some of the original code in R.

ararslan · March 12, 2018, 8:32pm

Yes indeed, that works perfectly well within the terms of the licenses and I believe is a fairly common way to get around GPL poisoning.

I don’t think that would be a problem so long as the Julia code is not derived from the R code.

jonathanBieler · March 12, 2018, 10:15pm

I’ve also been frustrated with Rmath sometimes. I think the way to go is to create a package with the specifications (list of functions with a short description and references if available, that would be quite similar to Rmath.jl) and tests, then we can start fill in the functions. Once everything works it could be swapped in Distributions.

Maybe have of a more Julian naming convention would also be nice (the original names could be kept for compatibility).

Perhaps there’s some other implementations (in Python or others) that are less problematic licence-wise for reference ?

Nosferican · March 13, 2018, 2:38am

If I am not mistaken RMath is written in C, no? One of the biggest challenges in R (and Python) was that the implementations were obscure (e.g., in C). I would be quite happy with a Julian implementation that sacrifices some efficiency for transparency and accessibility (one of the biggest gains from having it in Julia) at least in its beginnings. I am familiar with the API as an R user and of Distributions.jl, so if there is some organized structure I would be happy to take on some of the work.

dpsanders · March 13, 2018, 4:39am

A good option for doing tricky integrals may be TaylorIntegration.jl

cc @PerezHz, @lbenet

stevengj · March 13, 2018, 1:44pm

I seriously doubt it.

Generic quadrature schemes are usually at least 1000x slower than specialized implementations for special functions (using standard techniques like continued fraction expansions, taylor/asymptotic series, and rational approximants), in my experience.

TaylorIntegration can also compute high-order Taylor series approximations, but these are usually employed only in a small region of the parameter space (usually near zeros). And the Taylor series around zeros of special functions are usually only a Google search away, anyway.

ApproxFun can compute Chebyshev approximants (but not minimax or rational approximants last I checked), which can be useful on intervals of the domain of single-variable functions, but usually this is at best a small part of the tricks used in a good special-function implementation, and is not particularly useful for two (or more) variable functions (you can use a product of series, but there are usually much better methods for multivariate special functions).

Implementing special functions well is a matter of finding good approximations for the functions in various portions of their domain, and figuring out the cutoff points at which to stitch together different approximations, and figuring out the corner cases where you need special care. I don’t think it’s been successfully automated except for approximating single-variable functions on finite intervals.

Fortunately, this kind of mathematical knowledge is not copyrightable, so a clean-room re-implementation of libRmath should be possible. If someone just goes through the source code and pulls out a description of which approximants are used in which portions of the domain for each function, along with any mathematical references that are cited, that solves the hardest part of the problem.

dmbates · March 13, 2018, 3:40pm

AFAIK the uses of libRmath functions in the Distributions package go through the StatsFuns package. Instead of calling, say, the libRmath function dpois directly one would call poispdf or poislogpdf from StatsFuns. The replacement should take place in the StatsFuns package. For example, the libRmath functions for the Normal distribution are no longer used. The Julia functions for the Normal distribution are in StatsFuns/src/distrs/norm.jl.

I believe that the intention is to include more native Julia code in StatsFuns/src/distrs/*.jl to the point where libRmath is no longer needed. I’m not sure if the long term plan is to keep the Julia code in this package or to migrate it to the Distributions package. The current recommendation is not to use the distribution-related functions in this package directly. That is, instead of

julia> StatsFuns.norminvcdf(0.975)
1.9599639845400583

use

julia> Distributions.quantile(Normal(), 0.975)
1.9599639845400583

dmbates · March 13, 2018, 3:49pm

My vision of how things would work is that there be one or more issues in the StatsFuns package related to the native Julia replacements. (I hope @simonbyrne or @andreasnoack or @ararslan will weigh in on this)

I completely agree with @stevengj that the way to implement many of these functions is to have someone read the libRmath or SciPy or … code and describe the approach so that someone else can implement it. I can take on the first role for some cases but i don’t think I am a good choice for the second role.

simonbyrne · March 13, 2018, 6:16pm

I’m glad this is getting some attention! This issue has been around a long time.

While we can’t use the GPL’d Rmath code, most implementations (including Rmath itself) descend from the “NSWC Mathematics Subroutine Library”. Unfortunately, its licence status is a little unclear, but I think we ended up coming to the conclusion that it was a work of a US Government employee, and thus in the public domain. The other issue is that it was written for a very old architectures (pre-IEEE754 floating point), so the code is a bit odd. @andreasnoack has translated a large chunk of it, but it probably needs some more review, cleanup and checking.

Some other useful resources:

The documentation from the NSWC code (warning: 20MB pdf)
The R help includes references in its documents. There has been some subsequent publications since, so it is worth chasing down forward references.
Caltech/JPL have released an updated version of the NSWC code in C under a more standard licence.
The Boost C++ library seems fairly solid, and uses some different techniques (notable Lanczos approximations for the gamma functions). It also has fairly good documentation.

Nosferican · March 13, 2018, 9:23pm

I filed issue StatsFuns#37 to keep track based on @dmbates suggested course of action.

Nosferican · March 13, 2018, 10:10pm

What about porting from the Boost C++ Libraries? Their license is compatible with MIT’s.

dmbates · March 14, 2018, 5:45pm

I’m not very enthused about the Boost implementation. I know how libRmath was created and there was a great deal of attention paid to corner cases, accuracy even for absurd parameter values, etc. (Most of the work was done by Martin Maechler who is a German-speaking Swiss. I figure that if you looking for a person who will be meticulous and precise that’s a good choice.) I don’t see the same level of care in the Boost code.

For example, it helps to evaluate probabilities on the log-probability scale then convert back to the probability scale if desired. The Boost code just evaluates pdf and cdf and quantile.

One of the first cases I check is Poisson because the formulas are relatively simple but the corner cases abound. The Boost code for the pdf includes

      // Special cases:
      if (mean == 0)
      { // Probability for any k is zero.
        return 0;
      }
      if (k == 0)
      { // return pdf(dist, static_cast<RealType>(0));
        // but mean (and k) have already been checked,
        // so this avoids unnecessary repeated checks.
       return exp(-mean);
      }

which is wrong. You should reverse those checks because a Poisson with λ = 0 is a point mass at 0. The pdf should be 1. at 0 and zero elsewhere. This code returns 0. at 0.

Ralph_Smith · March 15, 2018, 1:52pm

Another relevant project is the Stan math library (BSD license). It includes extensive support for computing derivatives, FWIW. They implement their own log-densities etc., but do use Boost for some special functions (which appear more solid than Boost stats).

By the way, has Maechler addressed the notable list of problems that the Boost team found with Rmath implementations?

dmbates · March 15, 2018, 4:17pm

I’m not sure where I should look in the the Stan math library. It seems to me that the functions related to distributions are all from Boost but I didn’t look in every directory. Are there some functions written by the Stan authors that I should check?

Ralph_Smith · March 16, 2018, 3:54am

Yes, they hid them pretty well. The basics are here: https://github.com/stan-dev/math/tree/develop/stan/math/prim/scal/prob (warning: templated C++ written with efficiency as a primary goal - maybe second to accuracy?).

They have some notes on their extensive test suite which start out by admitting a lack of documentation. I can’t easily tell whether the kinds of edge cases you mentioned are checked. (The tests I looked at had precisely zero comments about what they verify.)

This brings up an important point: Many of us are ready, willing, and able to code up algorithms, but real domain expertise is needed to make sure that the test suite is adequate. Are the Rmath tests suitable for description as a guide? (I won’t look at R source so that I may help on the Julia side.)

dmbates · March 16, 2018, 4:56pm

Thanks for the pointer on the Stan sources.

As an aside, there is nothing like reading heavily templated C++,code such as in Stan or Boost, to make you appreciate the design of Julia! I am so glad to be able to use Julia and no longer feel the need to write that sort of gibberish for performance.

The sources for the tests in R are in https://github.com/wch/r-source/blob/trunk/tests/d-p-q-r-tests.R However, I don’t know the copyright status of that file.

In a comment in https://github.com/JuliaStats/StatsFuns.jl/issues/37 I suggested that there be two types of tests for the density, cumulative, etc. functions. The tests on coverage, accuracy etc. are, and should continue to be, in the Distributions package. I think it might be worthwhile having a separate package to check consistency with R or SciPy or Boost. For R or SciPy it could use Rcall and PyCall or ccalls to functions in libRmath.

stevengj · March 16, 2018, 5:01pm

Technically, it should be possible to have GPLed test code for an MIT-licensed package. The test code depends on the module, not the other way around, so other packages using the module would not be affected by the license of the test code.

oschulz · March 25, 2018, 7:16pm

A pure Julia replacement will hopefully also enable us to choose the RNG on call to the various functions/distributions, instead of being tied to the internal RNG of a third-party library. This is especially relevant for parallel and/or distributed applications that need to control RNG seeding/distribution.

Statsman · February 13, 2022, 8:02pm

I realize that this is a 4-year old thread, but I am new, and have not found a more recent discussion of these issues.

While Boost may have boundary issues, it seems to be generally more accurate than R. AFAIK, the best double precision probability calculations (not previously mentioned in this thread) is open source VBA available at https://iandjmsmith.wordpress.com/ In 2019, Smith and Boost gave generally comparable accuracy, that was usually better than R, except where all three gave essentially machine accuracy (JSM 2019 Online Program).

dmbates · February 14, 2022, 2:33pm

I would be perfectly happy discussing it on this discourse thread for a while. It could also become a discussion item on the github site of whatever package is felt to be most appropriate for coding up such a replacement. Perhaps StatsFuns.jl? I am not up-to-date on exactly what the plans are for repositories owned by the JuliaStats group so others may have better suggestions.

I took a look at the VBA code by Ian Smith that you linked and I didn’t see a license. Do you know if it is covered by a license? That’s the issue with the Rmath library - the C code is licensed under LGPL so we can’t base a package covered by an MIT license on that code. If Ian Smith doesn’t put a license on his code I don’t think we can base an MIT-licensed Julia package on his code either.

Topic		Replies	Views
Random variables in Julia (working list) Statistics distributions	36	6107	November 27, 2022
Doubts about language General Usage	26	5342	August 23, 2017
Choosing a numerical programming language for economic research: Julia, Community blog , blog-post	135	6311	May 30, 2023
Questions about contributing to Distributions.jl Statistics question	41	2712	June 19, 2020
Replace Rmath RNGs Statistics distributions	1	213	March 17, 2023

Where to discuss replacement of Rmath?

Related topics