Is there any package supporting multivariate distributions and basic operations on them like sampling and log pdf, but working on GPU?
Or, what would it take to get Distributions.jl working on the GPU?
The issue might be that Distributions.jl still uses some C libraries (Rmath, openspecfun). There was some discussion about porting them to Julia, but I'm not sure what the status of that is.
Sometimes I just translate scipy code for probability distributions when I need automatic differentiation.
`Distributions` is indeed made with CPUs and `Array`s in mind. It would be great to modify the code to work more generally, even if the support on GPUs would initially be sparser than on CPUs. As mentioned, we use `Rmath` for many of the functions in `Distributions`, but quite a few of them already have fallbacks written in Julia. However, since the number types on the GPU are the same as the CPU number types, it would be a problem to dispatch to the Julia implementation. This is a general problem.
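A small illustration of that dispatch problem (a sketch with made-up function names, not actual Distributions.jl internals): two implementations of the same scalar function cannot be told apart by the argument types alone, because a `Float64` pulled out of a `CuArray{Float64}` inside a kernel is just a `Float64`:

```julia
# Stand-ins: one method representing a ccall into Rmath/openspecfun
# (CPU-only), one representing a pure-Julia, GPU-compatible fallback.
special_c(x::Float64) = 2x       # imagine: ccall into a C library
special_julia(x::Real) = 2x      # imagine: the Julia fallback

# Both methods apply to the very same scalar type, so inside a GPU kernel
# nothing in the element's type can route the call to special_julia.
special_c(1.0), special_julia(1.0)
```

This is why the fallbacks would have to be selected at the container level (or the C-backed methods replaced outright), rather than by scalar dispatch.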
We might want to consider a split of `Distributions` into a package with all the basic API definitions, say `DistributionsBase`, and then make that a dependency of `Distributions`. In that way, it would be possible for `CuArray`s to use (at least most of) the same API as `Distributions` without depending on all of `Distributions`.
I considered modifying Distributions.jl to work on the GPU, but it seems easier to go the other way: start with some basic GPU-compatible implementation and gradually move to the existing API of Distributions.jl.
I would also consider only multivariate distributions, since putting scalars on the GPU looks quite meaningless.
> …but it seems easier to go the other way: start with some basic GPU-compatible implementation and gradually move to the existing API of Distributions.jl.
True, but it would be a shame to develop a separate API for the GPU. It would be nice to be able to use e.g. `Normal(μ, σ)` and `Binomial(n, p)` on the GPU.
> I also think only multivariate distributions, since putting scalars on the GPU looks quite meaningless.
I don't understand why you wouldn't like to work with scalar-valued distributions on the GPU, e.g. for things like `rand!(Normal(), dA)` to also work on the GPU.
> I don't understand why you wouldn't like to work with scalar-valued distributions on the GPU, e.g. for things like […]
Ah, when I mentioned univariate distributions I was thinking more about generating individual numbers (e.g. something like `rand(Normal())`); generating a vector of such numbers on the GPU makes perfect sense to me.
But in the case of `rand!(Normal(), dA)` it seems like we don't have a problem with dispatching, since we can always dispatch on the type of `dA`. Or am I missing something?
I think the specific example might be fine, but I suspect that something like `pdf.(Normal(), dA)` could be a problem, because the broadcast machinery would end up calling e.g. `pdf(Normal(), dA[i])` and the elements would be, say, `Float32`s, so `pdf` would dispatch to the wrong method.
Indexing of individual elements in `CuArray`s is inefficient and disabled by default, if I remember correctly, so dispatching is the least of the problems, I would say. On the other hand, in Julia 1.0 broadcasting is relatively easy to override.
I was probably too brief. The call to `pdf(Normal(), dA[i])` would happen on the GPU device, where indexing is, of course, not disabled. This is already what `CuArray` broadcasting does, so the problem can be illustrated with
```julia
julia> cdf.(Normal(), cu(rand(3)))
ERROR: InvalidIRError: compiling #25(CuArrays.CuKernelState, CUDAnative.CuDeviceArray{Float64,1,CUDAnative.AS.Global}, Base.Broadcast.Broadcasted{Nothing,Tuple{Base.OneTo{Int64}},typeof(cdf),Tuple{CuArrays.CuRefValue{Normal{Float64}},Base.Broadcast.Extruded{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global},Tuple{Bool},Tuple{Int64}}}}) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
Stacktrace:
 [1] erfc at /home/andreasnoack/.julia/packages/SpecialFunctions/fvheQ/src/erf.jl:8
 [2] normcdf at /home/andreasnoack/.julia/packages/StatsFuns/0W2sM/src/distrs/norm.jl:15
 [3] normcdf at /home/andreasnoack/.julia/packages/StatsFuns/0W2sM/src/distrs/norm.jl:16
 [4] cdf at /home/andreasnoack/.julia/packages/Distributions/WHjOk/src/univariates.jl:528
 [5] _broadcast_getindex_evalf at broadcast.jl:574
 [6] _broadcast_getindex at broadcast.jl:547
 [7] getindex at broadcast.jl:507
 [8] macro expansion at /home/andreasnoack/.julia/packages/GPUArrays/hzyWn/src/abstract_gpu_interface.jl:69
 [9] #25 at /home/andreasnoack/.julia/packages/GPUArrays/hzyWn/src/broadcast.jl:58
Reason: unsupported call to the Julia runtime (call to jl_type_error_rt)
Stacktrace:
 [1] erfc at /home/andreasnoack/.julia/packages/SpecialFunctions/fvheQ/src/erf.jl:8
 [2] normcdf at /home/andreasnoack/.julia/packages/StatsFuns/0W2sM/src/distrs/norm.jl:15
 [3] normcdf at /home/andreasnoack/.julia/packages/StatsFuns/0W2sM/src/distrs/norm.jl:16
 [4] cdf at /home/andreasnoack/.julia/packages/Distributions/WHjOk/src/univariates.jl:528
 [5] _broadcast_getindex_evalf at broadcast.jl:574
 [6] _broadcast_getindex at broadcast.jl:547
 [7] getindex at broadcast.jl:507
 [8] macro expansion at /home/andreasnoack/.julia/packages/GPUArrays/hzyWn/src/abstract_gpu_interface.jl:69
 [9] #25 at /home/andreasnoack/.julia/packages/GPUArrays/hzyWn/src/broadcast.jl:58
```
It took me some time to learn the internals of Distributions.jl, its dependencies, and the theory behind them. I tried to imagine what it would take to make a single distribution, namely `MvNormal`, work on `CuArray`s (or any other form of GPU arrays). By "working" I mean:

- parameters are represented as `CuArray`s or wrappers (e.g. `Diagonal{T, CuArray{T,1}}`)
- `logpdf` or a similar method is implemented
The most acute problem I've encountered is that under the hood Distributions.jl heavily uses LinearAlgebra types such as `UpperTriangular` matrices and the `Cholesky` factorization. These types, however, use the concrete `Array` instead of `AbstractArray`. I didn't dive deeper, but I believe using concrete types there is intentional and would be hard to change in Base. Also, the Cholesky factorization of `CuArray`s is better done using cuSOLVER.

Unless I'm missing some important detail, this means we would have to reimplement half of the LinearAlgebra types, which sounds like a terrible idea. I'll continue thinking about it; fresh ideas are welcome too.
It seems like I misinterpreted what I saw in the REPL while inspecting types in LinearAlgebra; at least the Cholesky factorization turns out to work perfectly well:

```julia
using CuArrays, LinearAlgebra

A = cu(rand(3, 3))
A = A * A'     # make it positive-definite
cholesky(A)
```
I'd argue that the right place to start for this effort is not the multivariate distributions but the univariate ones. The multivariate distributions are more complicated and would probably require some work before they work for a broader range of array types.
I'm not so sure about this. As I see it:

1. You can't represent a single number on the GPU; the closest you can get is a single-element `CuArray`. Even if you could, it only makes sense as part of processing a larger batch of data, either through broadcasting or through batched operations like matrix multiplication.
2. Broadcasting over a univariate distribution type, e.g. `cdf.(Normal(), cu(rand(3)))`, is hard since you need to pass a custom Julia struct (like `Normal`) to the GPU, which is poorly documented in C and seems to be completely missing from CUDAnative. So working towards this approach means improving the CUDAnative runtime itself, which is, well, quite low-level. An alternative approach is to overload broadcasting and rewrite such calls into e.g. `cdfnormal(mu, sigma, cu(rand(3)))`.
3. Multivariate distributions, on the other hand, can operate directly on high-level `CuArray`s. In the simplest case, you just use `CuArray`s for the distribution type's fields (e.g. `mu` and `sigma` for `MvNormal`) and apply generic functions to them. Perhaps most of the work in this case goes into reimplementing CPU-only and third-party code, which sounds to me like an easier starting point than digging into the internals of CUDAnative.

Perhaps you have another vision, so I'd be happy to learn about it!
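To make the broadcast-rewrite alternative a bit more concrete, it could look roughly like this. This is only a sketch: `MyNormal`, `mycdf`, and `cdfnormal` are hypothetical names standing in for the Distributions.jl types and generics, and it assumes a scalar `erf` that CUDAnative can actually compile:

```julia
using SpecialFunctions: erf  # would need a GPU-compatible erf in practice

# Hypothetical stand-ins for Distributions.Normal and Distributions.cdf.
struct MyNormal{T}
    mu::T
    sigma::T
end
mycdf(d::MyNormal, x) = cdfnormal(d.mu, d.sigma, x)

# Plain scalar function of the parameters: only isbits values are involved.
cdfnormal(mu, sigma, x) = (1 + erf((x - mu) / (sigma * sqrt(oftype(x, 2))))) / 2

# Host-side rewrite: a fused broadcast over the struct becomes a broadcast
# over its fields, so no custom struct has to reach the GPU kernel.
Base.Broadcast.broadcasted(::typeof(mycdf), d::MyNormal, A::AbstractArray) =
    Base.Broadcast.broadcasted(cdfnormal, d.mu, d.sigma, A)
```

With a `CuArray`-specific method instead of the `AbstractArray` one, `mycdf.(MyNormal(0f0, 1f0), cu(rand(3)))` would then launch a kernel over two `Float32` parameters and the array, rather than over a distribution struct.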
> Broadcasting over a univariate distribution type, e.g. `cdf.(Normal(), cu(rand(3)))`, is hard since you need to pass a custom Julia struct (like `Normal`) to the GPU, which is poorly documented in C and seems to be completely missing from CUDAnative.
I think you might be overthinking this. What I'm talking about would be something like
```julia
julia> using CuArrays

julia> struct Normal{T}
           μ::T
           σ::T
       end

julia> pdf(D::Normal{T}, x::T) where T = CuArrays.CUDAnative.exp(-((x - D.μ)/D.σ)^2/2)/(sqrt(2*T(π))*D.σ)
pdf (generic function with 1 method)

julia> X = cu(rand(10))
10-element CuArray{Float32,1}:
 0.49907735
 0.0075366762
 0.4265667
 0.044595595
 0.6838826
 0.24200504
 0.95713586
 0.8394181
 0.8209742
 0.46985078

julia> Base.Broadcast.broadcastable(D::Normal) = Ref(D)

julia> pdf.(Normal(0.3f0, 0.9f0), X)
10-element CuArray{Float32,1}:
 0.43255657
 0.42047203
 0.4389076
 0.42577493
 0.40472618
 0.44234982
 0.3395486
 0.37039313
 0.3748919
 0.43544522
```
CUDAnative can do codegen with broadcast. You can make and return any Julia struct. You can internally use mutable structs as long as they are optimized away. This makes pure-Julia code "just work" on the GPU, as long as it is fully inferred and returns isbits types.
The issue you will run into with CuArrays is that you need to use the RNG from CUDA if you want to do `rand.(dist)`, and without a way to pass an RNG, that won't work. But the other functions should be fine.
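In the meantime, one way to sidestep the device RNG for specific distributions (a sketch; it assumes CuArrays' CURAND-backed array-level `CuArrays.randn`, and uses a plain location-scale transform rather than any Distributions.jl API):

```julia
using CuArrays

# Generate Normal(μ, σ) samples without calling rand.(dist) in a kernel:
# draw standard normals on the device with CURAND, then transform them.
μ, σ = 0.3f0, 0.9f0
z = CuArrays.randn(Float32, 10)   # standard normals, generated on the GPU
samples = μ .+ σ .* z             # location-scale transform, stays on the GPU
```

The transform is an ordinary fused broadcast, so the only RNG involved is CURAND's own, on the array level rather than per element.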