On beauty of Distributions.jl pdf interface

thisismygitrepo · September 8, 2021, 7:58pm

Julia is often touted for its syntax that resembles math notation. I have a kind of objection here.

Getting values of this density function with this syntax is awkward to say the least. Shouldn’t this be simply p(0:0.01:1) ? In fact, this is the case in almost all other languages.

Paulo_Jabardo · September 8, 2021, 8:18pm

This really is a matter of taste. The issue is that you are considering p a density function. I see it as a distribution where different things can be done with it, including, but not limited to, calculating the density function or generating random numbers from this distribution.

Now, if you really want to use as a density function, you could resort to type piracy and define the function yourself:

(p::Distribution)(x) = pdf(p, x)

And now,

julia> p = Beta(2,3)
Beta{Float64}(α=2.0, β=3.0)

julia> p.(x) == pdf.(p, x)
true

But I wouldn’t recommend it. It will work with the simpler distributions.

mbauman · September 8, 2021, 8:41pm

This is a choice a package (Distributions.jl) made for its API; it very well could have chosen to use p(x) for this.

Since that syntax is available for packages to use, I’ve moved this out of #dev and into #usage and I’ve made the title a bit more focused.

Oscar_Smith · September 8, 2021, 9:10pm

No. The other API clearly could work (since you can emulate it with 1 line of code). The question is just what it should mean to call a distribution. IMO, sampling would be the more natural behavior of a function call, and the reason Distributions chooses to make it explicit is that there isn’t a clear right answer.

thisismygitrepo · September 8, 2021, 9:16pm

Thanks for the comment. But there is a way of resolving this, right? We aim at making this look like math notiation. In statistics, p, a distribution, is nothing but a function. p(0) is unanimously agreed on to represent density at x=0. Sampling would be represented by something like x ~ p. Back to the point, pdf(p, 2) is not remotely close to what we write in math notation.

jzr · September 8, 2021, 9:28pm

There are various notations in statistics. For example, f_X(5) is pretty similar to pdf(x, 5). It could be better if we had a nicer way to make partial functions like pdf(x::Distribution). That is a Julia syntax issue.

jlperla · September 8, 2021, 9:34pm

So why wouldn’t p() be the proper interface for sampling? Or maybe p(0.5) should be the logpdf instead (as is often used by bayesians)? A distribution is a mathematical object which you can “do stuff” to. And in Julia, “stuff” usually means functions with multiple-dispatch.

Personally, I think the p(x) etc. should only be used for function-like use, but there are probably some DSLs where that is worth deviating from.

FWIW, I often write F \sim N(0,1) etc. and then \int g(x) d F(x) so to me the CDF is more natural than the PDF?

Oscar_Smith · September 8, 2021, 9:35pm

not really a Julia syntax issue, all you need to do is define

pdf(d::Distribution) = p(x)->pdf(d, x)
p = pdf(Beta(2,3))
p.(x)

jzr · September 8, 2021, 9:36pm

Fair enough. Though Distributions.jl didn’t do that, and we can’t do it without piracy.

DNF · September 8, 2021, 10:42pm

Which languages are they? Certainly neither Python, R, Matlab or Mathematica, to mention the most comparable.

That’s not what I learned in my statistics courses. There are a number of functions that can describe a probability distribution, such as the pdf or cdf, but they aren’t the distribution itself.

I don’t think I’ve ever heard this before. The density function of a distribution has this property, not the distribution itself.

amrods · September 8, 2021, 10:52pm

Here’s a illustrative table prepared by @Albert_Zevelev :

DNF · September 8, 2021, 10:52pm

If Normal() just returned the pdf function of the standard normal distribution, it would be far less powerful. Calculating the cdf from this function would mean having to numerically integrate the pdf. Even just calculating the mean or other moments would entail numerical integration, because you wouldn’t have access to the distribution parameters, only the function itself.

amrods · September 8, 2021, 11:07pm

I really like that characterization. It’s more elegant, and more flexible.

thisismygitrepo · September 8, 2021, 11:29pm

Thanks for that. I personally see that scipy is the most elegant, least amount of composition and brackets.

ptoche · September 8, 2021, 11:36pm

I don’t find your arguments for making the pdf representation of a distribution the fundamental object particularly compelling. Every probability distribution on a subset of the reals has a cumulative distribution function, which uniquely defines the distribution. By contrast, a probability density function exists only for (absolutely) continuous probability distributions. It’s a matter of taste, but I find the API of Distributions.jl well inspired.

amrods · September 8, 2021, 11:38pm

Suppose you want to switch the distribution in some model. In Julia (and Mathematica) it does not take much. Simply switch out the distribution in say cdf(Normal(4, 10), x) to cdf(TDist(r), x). In other frameworks you have to worry about how many parameters the distribution has, and then order, etc, no?
eg:

const dist = Normal(4, 10)

# compute probability -3 <= x <= 3:
prob = cdf(dist, 3) - cdf(dist, -3)

Simply redefine the distribution at the top and nothing else has to change in the rest of the script.

DNF · September 9, 2021, 7:46am

I think that blog post actually gives scipy a bit of an unfair advantage.

# scipy
norm.cdf(x)

vs

# Distributions.jl
cdf(Normal(0,1),x)

as it assumes default values being used for scipy but not for Distributions.jl. A more fair comparison would be:

scipy:

import scipy.stats as st
st.norm(2, 3).cdf(0.9)
# with default values
st.norm.cdf(0.9)

I could perhaps do

from scipy.stats import norm

But that isn’t normally idiomatic python as far as I know.

Distributions.jl

using Distributions
cdf(Normal(2, 3), 0.9)
# with default values
cdf(Normal(), 0.9)

There’s no less punctuation in the scipy version. Actually, scipy and Distributions.jl are not that different, in that you have a distribution object that you call various functions/methods on. This is very nice compared to the Matlab approach:

>> normcdf(0.9, 2, 3) 
# with default values
>> normcdf(0.9)

It is terse, but awkward, since you need to know the name of the distribution to get its various properties.

A brief comment on the idea that the pdf contains all the information about a distribution:
This may be true in principle (for most distributions, at least), but the information is very hard to retrieve. Let’s do a thought experiment: I give you a variable, called f. I tell you that it holds the pdf of some statistical distribution. How would you go about finding the mean of the distribution described by f? I would contend that, in general, this is impossible, or would take infinite time. If you are lucky, it is centered around zero, and you could start doing some indefinite integration, searching for the outer limits of the pdf. But, what if it is centered around -10^64 instead, and is extremely narrow? And maybe it’s a mixed distribution? And maybe it’s multi-dimensional?

So what do you do?

thisismygitrepo · September 9, 2021, 9:02pm

Thanks for the perspective. This restored my belief in Julia.
Hopefully the accident was not caused by Julia.

thisismygitrepo · September 9, 2021, 9:04pm

Thank you for the detailed response. This has certainly dissuaded my position.

TheLateKronos · September 14, 2021, 6:52am

I just wrote some code last week, because I also felt that the pdf(X, 1) syntax was tedious.
My premise is that the f_X syntax means the pdf of the distribution X. I also assume that I will mostly define my distributions as X. As a final thing, it seemed like 0-based indexing was beneficial, so I constructed the return value with an OffsetArray with default offset = -1:

using OffsetArrays
using Distributions

f_X(; offset = -1) = OffsetArray((pdf(eval(:X))), offset)
f_X(ind::Real;           offset = -1) = OffsetArray(pdf(eval(:X)), offset)[ind]
f_X(inds::AbstractArray; offset = -1) = f_X.(inds; offset)

It is possible that this only makes sense for discrete distributions - my lecture on continuos ones are today

This allows

julia> X = Binomial(1000, 0.9)
Binomial{Float64}(n=1000, p=0.9)
julia> f_X() == pdf(X)  # Acutally f_X(offset=0) == pdf(X)
true

Now if I redefine X and call the function again, it evaluates to whatever X is currently defined as through metaprogramming:

julia> X = DiscreteUniform()
DiscreteUniform(a=0, b=1)
julia> f_X() == pdf(X)  # Acutally f_X(offset=0) == pdf(X)
true

Note that the offsets are set to 0 for comparison - otherwise I prefer 0-based indexing for this technical domain.

Part of the reason for all this is that I am currently more interested in indexing the pdf than evaluating it at a given value. I was expecting the second argument to pdf to be the index, but it is instead the input value. So what I want is

julia> pdf(X)[1:3]
3-element Vector{Float64}:
 0.34867844010000004
 0.387420489
 0.19371024449999996

This can be acieved with only one set of brackets and zero-based indexing with my function:

julia> f_X(0:2) .== pdf(X)[1:3]
3-element BitVector:
 1
 1
 1

In case you should find that useful

Topic		Replies	Views
Where is the `pdf` function in Distribution.jl implemented? General Usage	6	5138	February 28, 2023
Random variables in Julia (working list) Statistics distributions	36	6106	November 27, 2022
User-defined distribution function example Statistics question	14	4885	March 21, 2019
Reworking Distributions.jl Statistics proposal	51	5363	April 27, 2020
Questions about contributing to Distributions.jl Statistics question	41	2712	June 19, 2020

On beauty of Distributions.jl pdf interface

Related topics