Where to discuss replacement of Rmath?


#1

A long-standing objective related to the Distributions and StatsFuns packages is to provide native Julia versions of various pdf (probability density or mass) and cdf (cumulative distribution) functions. The Rmath package is used to evaluate many of these but it has a couple of disadvantages: its license (GPLv2) and its lack of flexibility with respect to types.

Where is a good place to discuss replacing these functions? I would have suggested creating an issue in StatsFuns but recently there were additions to very old issue in Distributions related to this?

A general question is to what extent can a native Julia implementation be based on the Rmath code without being in violation of the GPL? I realize that a Julia developer reading the Rmath code then implementing a similar function in Julia would be a violation of the license. Can a third-party read the code and provide a textual description of the method?

If there are references to papers they can sometimes be used but some journals have rather bizarre conditions. I believe ACM Trans. Math. Software was the one of the worst for imposing conditions on algorithms that it published.

I would like to see native Julia implementations because of the flexibility and maintainability but I don’t want to taint such an effort by participating if there would be a problem with my being involved in some of the original code in R.


#2

Yes indeed, that works perfectly well within the terms of the licenses and I believe is a fairly common way to get around GPL poisoning.

I don’t think that would be a problem so long as the Julia code is not derived from the R code.


#3

I’ve also been frustrated with Rmath sometimes. I think the way to go is to create a package with the specifications (list of functions with a short description and references if available, that would be quite similar to Rmath.jl) and tests, then we can start fill in the functions. Once everything works it could be swapped in Distributions.

Maybe have of a more Julian naming convention would also be nice (the original names could be kept for compatibility).

Perhaps there’s some other implementations (in Python or others) that are less problematic licence-wise for reference ?


#4

If I am not mistaken RMath is written in C, no? One of the biggest challenges in R (and Python) was that the implementations were obscure (e.g., in C). I would be quite happy with a Julian implementation that sacrifices some efficiency for transparency and accessibility (one of the biggest gains from having it in Julia) at least in its beginnings. I am familiar with the API as an R user and of Distributions.jl, so if there is some organized structure I would be happy to take on some of the work.


#5

A good option for doing tricky integrals may be TaylorIntegration.jl

cc @PerezHz, @lbenet


#6

I seriously doubt it.

Generic quadrature schemes are usually at least 1000x slower than specialized implementations for special functions (using standard techniques like continued fraction expansions, taylor/asymptotic series, and rational approximants), in my experience.

TaylorIntegration can also compute high-order Taylor series approximations, but these are usually employed only in a small region of the parameter space (usually near zeros). And the Taylor series around zeros of special functions are usually only a Google search away, anyway.

ApproxFun can compute Chebyshev approximants (but not minimax or rational approximants last I checked), which can be useful on intervals of the domain of single-variable functions, but usually this is at best a small part of the tricks used in a good special-function implementation, and is not particularly useful for two (or more) variable functions (you can use a product of series, but there are usually much better methods for multivariate special functions).

Implementing special functions well is a matter of finding good approximations for the functions in various portions of their domain, and figuring out the cutoff points at which to stitch together different approximations, and figuring out the corner cases where you need special care. I don’t think it’s been successfully automated except for approximating single-variable functions on finite intervals.

Fortunately, this kind of mathematical knowledge is not copyrightable, so a clean-room re-implementation of libRmath should be possible. If someone just goes through the source code and pulls out a description of which approximants are used in which portions of the domain for each function, along with any mathematical references that are cited, that solves the hardest part of the problem.


#7

AFAIK the uses of libRmath functions in the Distributions package go through the StatsFuns package. Instead of calling, say, the libRmath function dpois directly one would call poispdf or poislogpdf from StatsFuns. The replacement should take place in the StatsFuns package. For example, the libRmath functions for the Normal distribution are no longer used. The Julia functions for the Normal distribution are in StatsFuns/src/distrs/norm.jl.

I believe that the intention is to include more native Julia code in StatsFuns/src/distrs/*.jl to the point where libRmath is no longer needed. I’m not sure if the long term plan is to keep the Julia code in this package or to migrate it to the Distributions package. The current recommendation is not to use the distribution-related functions in this package directly. That is, instead of

julia> StatsFuns.norminvcdf(0.975)
1.9599639845400583

use

julia> Distributions.quantile(Normal(), 0.975)
1.9599639845400583

#8

My vision of how things would work is that there be one or more issues in the StatsFuns package related to the native Julia replacements. (I hope @simonbyrne or @andreasnoack or @ararslan will weigh in on this)

I completely agree with @stevengj that the way to implement many of these functions is to have someone read the libRmath or SciPy or … code and describe the approach so that someone else can implement it. I can take on the first role for some cases but i don’t think I am a good choice for the second role.


#10

I’m glad this is getting some attention! This issue has been around a long time.

While we can’t use the GPL’d Rmath code, most implementations (including Rmath itself) descend from the “NSWC Mathematics Subroutine Library”. Unfortunately, its licence status is a little unclear, but I think we ended up coming to the conclusion that it was a work of a US Government employee, and thus in the public domain. The other issue is that it was written for a very old architectures (pre-IEEE754 floating point), so the code is a bit odd. @andreasnoack has translated a large chunk of it, but it probably needs some more review, cleanup and checking.

Some other useful resources:


#11

I filed issue StatsFuns#37 to keep track based on @dmbates suggested course of action.


#12

What about porting from the Boost C++ Libraries? Their license is compatible with MIT’s.


#13

I’m not very enthused about the Boost implementation. I know how libRmath was created and there was a great deal of attention paid to corner cases, accuracy even for absurd parameter values, etc. (Most of the work was done by Martin Maechler who is a German-speaking Swiss. I figure that if you looking for a person who will be meticulous and precise that’s a good choice.) I don’t see the same level of care in the Boost code.

For example, it helps to evaluate probabilities on the log-probability scale then convert back to the probability scale if desired. The Boost code just evaluates pdf and cdf and quantile.

One of the first cases I check is Poisson because the formulas are relatively simple but the corner cases abound. The Boost code for the pdf includes

      // Special cases:
      if (mean == 0)
      { // Probability for any k is zero.
        return 0;
      }
      if (k == 0)
      { // return pdf(dist, static_cast<RealType>(0));
        // but mean (and k) have already been checked,
        // so this avoids unnecessary repeated checks.
       return exp(-mean);
      }

which is wrong. You should reverse those checks because a Poisson with λ = 0 is a point mass at 0. The pdf should be 1. at 0 and zero elsewhere. This code returns 0. at 0.


#14

Another relevant project is the Stan math library (BSD license). It includes extensive support for computing derivatives, FWIW. They implement their own log-densities etc., but do use Boost for some special functions (which appear more solid than Boost stats).

By the way, has Maechler addressed the notable list of problems that the Boost team found with Rmath implementations?


#15

I’m not sure where I should look in the the Stan math library. It seems to me that the functions related to distributions are all from Boost but I didn’t look in every directory. Are there some functions written by the Stan authors that I should check?


#16

Yes, they hid them pretty well. The basics are here: https://github.com/stan-dev/math/tree/develop/stan/math/prim/scal/prob (warning: templated C++ written with efficiency as a primary goal - maybe second to accuracy?).

They have some notes on their extensive test suite which start out by admitting a lack of documentation. I can’t easily tell whether the kinds of edge cases you mentioned are checked. (The tests I looked at had precisely zero comments about what they verify.)

This brings up an important point: Many of us are ready, willing, and able to code up algorithms, but real domain expertise is needed to make sure that the test suite is adequate. Are the Rmath tests suitable for description as a guide? (I won’t look at R source so that I may help on the Julia side.)


#17

Thanks for the pointer on the Stan sources.

As an aside, there is nothing like reading heavily templated C++,code such as in Stan or Boost, to make you appreciate the design of Julia! I am so glad to be able to use Julia and no longer feel the need to write that sort of gibberish for performance.

The sources for the tests in R are in https://github.com/wch/r-source/blob/trunk/tests/d-p-q-r-tests.R However, I don’t know the copyright status of that file.

In a comment in https://github.com/JuliaStats/StatsFuns.jl/issues/37 I suggested that there be two types of tests for the density, cumulative, etc. functions. The tests on coverage, accuracy etc. are, and should continue to be, in the Distributions package. I think it might be worthwhile having a separate package to check consistency with R or SciPy or Boost. For R or SciPy it could use Rcall and PyCall or ccalls to functions in libRmath.


#18

Technically, it should be possible to have GPLed test code for an MIT-licensed package. The test code depends on the module, not the other way around, so other packages using the module would not be affected by the license of the test code.


#19

A pure Julia replacement will hopefully also enable us to choose the RNG on call to the various functions/distributions, instead of being tied to the internal RNG of a third-party library. This is especially relevant for parallel and/or distributed applications that need to control RNG seeding/distribution.