Inverse transform sampling (discrete distributions sampling)?

riegel_gestr · March 1, 2021, 3:23pm

I am currently working on inverse transform sampling (see the Wikipedia article "Inverse transform sampling", and also http://dept.stat.lsa.umich.edu/~jasoneg/Stat406/lab5.pdf).
I have implemented the following version of the algorithm:

using Random
function very_bad_its(intervals::Array{Float64,1})
    j = rand()
    idx = findfirst(x-> j <= x,intervals)
    return idx
end

where intervals holds the upper endpoints of the cdf intervals. I store only the upper endpoints in the hope that this (maybe) gives a speedup.
Let's consider an example of a discrete probability distribution to make clear what I am referring to:

p1  = 0.4, p2 = 0.2, p3 = 0.3, p4 = 0.1

so in this case intervals is the following array:

[0.4,0.6,0.9,1.0]
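
The array above is just the running sum of the probabilities, so it can be built with `cumsum`. A minimal sketch, assuming the probabilities sum to 1 (the helper name `cdf_edges` is made up for illustration; the floating-point sums may differ from the exact values in the last digit):

```julia
# Hypothetical helper: upper edges of the cdf intervals for a probability vector.
function cdf_edges(p::AbstractVector{<:Real})
    edges = cumsum(Float64.(p))
    edges[end] = 1.0   # pin the last edge so any rand() in [0, 1) lands in some interval
    return edges
end

edges = cdf_edges([0.4, 0.2, 0.3, 0.1])
```

Pinning the last edge is a design choice for this sketch: without it, rounding could leave the final sum slightly below 1 and `findfirst` could return `nothing` for a draw very close to 1.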

The algorithm per se has interesting ways to be sped up, for example by sorting the probabilities in descending order. I have to sample from the distribution many times, on the order of 10^6.
So the first question is whether there is already an implementation of this algorithm in Julia that is clearly faster than the one above.
The second question is whether I can modify part of the function above to speed it up.
If the answer to the second question is no, can I speed it up by parallelizing the code?
I am asking because I have actually tried @distributed, storing the results in an array (I need the samples later in the code), but the results were not worth it. The assumption is that the parallel version is faster, and that may well be false.
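
The descending-order idea mentioned above can be sketched as follows: sort once, remember the permutation, scan the sorted cumulative sums, and map the hit back to the original index. This is only an illustration, assuming the probabilities sum to 1; `SortedScanSampler`, `pick` and `draw` are hypothetical names, not part of any package.

```julia
# Hypothetical sketch: linear-scan inverse transform sampling that visits
# the categories in order of decreasing probability.
struct SortedScanSampler
    perm::Vector{Int}       # perm[k] = original index of the k-th most likely category
    edges::Vector{Float64}  # cumulative sums of the sorted probabilities
end

function SortedScanSampler(p::AbstractVector{<:Real})
    perm = sortperm(p; rev=true)
    edges = cumsum(Float64.(p[perm]))
    edges[end] = 1.0        # assumes p sums to 1
    return SortedScanSampler(perm, edges)
end

# Map a uniform draw u in [0, 1) to an original category index.
function pick(s::SortedScanSampler, u::Real)
    k = 1
    while u > s.edges[k]    # ends at the last edge at the latest, because u < 1.0
        k += 1
    end
    return s.perm[k]
end

draw(s::SortedScanSampler) = pick(s, rand())

s = SortedScanSampler([0.1, 0.4, 0.2, 0.3])
```

With the most likely categories first, the scan tends to stop after fewer comparisons on average; whether that is measurable for only four categories is a separate question.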

Thanks in advance for your time

Tamas_Papp · March 1, 2021, 4:19pm

Possibly, but if your distributions are discrete, Distributions.jl has an implementation of the alias method, which is AFAIK more efficient. See

github.com/JuliaStats/Distributions.jl/blob/master/src/samplers/aliastable.jl

struct AliasTable <: Sampleable{Univariate,Discrete}
    accept::Vector{Float64}
    alias::Vector{Int}
end
ncategories(s::AliasTable) = length(s.alias)

function AliasTable(probs::AbstractVector)
    n = length(probs)
    n > 0 || throw(ArgumentError("The input probability vector is empty."))
    accp = Vector{Float64}(undef, n)
    alias = Vector{Int}(undef, n)
    StatsBase.make_alias_table!(probs, 1.0, accp, alias)
    AliasTable(accp, alias)
end

function rand(rng::AbstractRNG, s::AliasTable)
    i = rand(rng, 1:length(s.alias)) % Int
    u = rand(rng)
    @inbounds r = u < s.accept[i] ? i : s.alias[i]
    r

This file has been truncated.
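
To show what the `accept` and `alias` vectors represent, here is a self-contained sketch of the alias method using Vose's construction. It is an illustration written for this thread, not the Distributions.jl or StatsBase code; `SimpleAlias`, `pick` and `draw` are hypothetical names.

```julia
# Hypothetical, self-contained alias table (Vose's construction).
struct SimpleAlias
    accept::Vector{Float64}  # probability of keeping column i
    alias::Vector{Int}       # fallback category for column i
end

function SimpleAlias(p::AbstractVector{<:Real})
    n = length(p)
    n > 0 || throw(ArgumentError("empty probability vector"))
    scaled = Float64.(p) .* (n / sum(p))   # rescale so the average column height is 1
    accept = ones(n)
    alias = collect(1:n)
    small = [i for i in 1:n if scaled[i] < 1.0]
    large = [i for i in 1:n if scaled[i] >= 1.0]
    while !isempty(small) && !isempty(large)
        s = pop!(small)
        l = pop!(large)
        accept[s] = scaled[s]          # column s keeps its own mass ...
        alias[s] = l                   # ... and is topped up from category l
        scaled[l] -= 1.0 - scaled[s]   # l gave away the top-up amount
        push!(scaled[l] < 1.0 ? small : large, l)
    end
    # Any leftover columns differ from height 1 only by rounding; they keep accept = 1.
    return SimpleAlias(accept, alias)
end

# Deterministic core: column i is uniform in 1:n, u is uniform in [0, 1).
pick(t::SimpleAlias, i::Integer, u::Real) = u < t.accept[i] ? i : t.alias[i]

draw(t::SimpleAlias) = pick(t, rand(1:length(t.alias)), rand())

t = SimpleAlias([0.4, 0.2, 0.3, 0.1])
```

Each draw costs one random column, one uniform number and one comparison, independent of the number of categories, which is why the method is attractive when there are many categories.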

riegel_gestr · March 1, 2021, 4:47pm

I didn't know about it, thanks!
I have used Distributions.jl in the past, but only DiscreteDistributions, not AliasTable (which I didn't know about).

I am comparing the two implementations with the following code:

using BenchmarkTools
using Distributions
function very_bad_its(intervals::Array{Float64,1})
    j = rand()
    idx = findfirst(x-> j <= x,intervals)
    return idx
end
function create_intervals_cdf(weight::Array{Float64,1})
    neighbor_weight = [0.0]
    neighbor_weight = vcat(neighbor_weight,weight)
    intervals = [[sum(neighbor_weight[1:i]),sum(neighbor_weight[1:i+1])] for i in 1:length(neighbor_weight)-1]
    res_intervals = [intervals[i][2] for i in 1:length(intervals)]
    return res_intervals
end
function create_test_probs()
    return [0.1,0.4,0.2,0.3]
end
function dist_its()
    probs = create_test_probs()
    aliastable = Distributions.AliasTable(probs)
    N = 10^6
    results = rand(aliastable,N)
end
function my_its()
    probs = create_test_probs()
    intervals = create_intervals_cdf(probs)
    N = 10^6
    results = Array{Int64}(undef,N)
    for x in 1:N
        results[x] = very_bad_its(intervals)
    end
end
@btime my_its()
@btime dist_its()

The results:

  20.713 ms (43 allocations: 7.63 MiB)
  19.661 ms (8 allocations: 7.63 MiB)

dpsanders · March 1, 2021, 5:08pm

You should use the searchsortedfirst function. It uses bisection to find the right entry.
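
A minimal sketch of that suggestion, assuming the edges are sorted in ascending order (which cumulative sums of non-negative probabilities are); `bisect_its` is a hypothetical name:

```julia
# Hypothetical helper: bisection-based lookup on the sorted cdf edges.
function bisect_its(edges::AbstractVector{<:Real}, u::Real = rand())
    idx = searchsortedfirst(edges, u)   # first index with edges[idx] >= u
    return min(idx, length(edges))      # guard in case rounding leaves edges[end] < u
end

edges = [0.4, 0.6, 0.9, 1.0]
bisect_its(edges, 0.5)   # 0.5 lies in the second interval (0.4, 0.6]
```

For a given `u` this returns the same index as `findfirst(x -> u <= x, edges)`, but in O(log n) comparisons instead of O(n).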

riegel_gestr · March 1, 2021, 6:29pm

I don't know if there is an error in the following code, because searchsortedfirst is actually what I was looking for (in some tests with plain arrays it is faster than findfirst). Maybe it's a problem with N.

using BenchmarkTools
using Distributions
function create_intervals_cdf(weight::Array{Float64,1})
    neighbor_weight = [0.0]
    neighbor_weight = vcat(neighbor_weight,weight)
    intervals = [[sum(neighbor_weight[1:i]),sum(neighbor_weight[1:i+1])] for i in 1:length(neighbor_weight)-1]
    res_intervals = [intervals[i][2] for i in 1:length(intervals)]
    return res_intervals
end
function create_test_probs()
    return [0.1,0.4,0.2,0.3]
end
function my_bad_its(intervals::Array{Float64,1})
    j = rand()
    idx = findfirst(x-> j <= x,intervals)
    return idx
end
function new_its(intervals::Array{Float64,1})
    j = rand()
    idx = searchsortedfirst(intervals,j)
    return idx
end
function init_dist_its()
    probs = create_test_probs()
    aliastable = Distributions.AliasTable(probs)
    return aliastable
end
function init_my_its()
    probs = create_test_probs()
    intervals = create_intervals_cdf(probs)
    return intervals
end
function test_distributions(aliastable::Distributions.AliasTable)
    N = 10^6
    results = rand(aliastable,N)
    return results
end
function test_new_its(intervals::Array{Float64,1})
    N = 10^6
    results = Array{Int64}(undef,N)
    for x in 1:N
        results[x] = new_its(intervals)
    end
    return results
end
function test_my_bad_its(intervals::Array{Float64,1})
    N = 10^6
    results = Array{Int64}(undef,N)
    for x in 1:N
        results[x] = my_bad_its(intervals)
    end
    return results
end
intervals = init_my_its()
aliastable = init_dist_its()
@btime test_my_bad_its(intervals)
@btime test_distributions(aliastable)
@btime test_new_its(intervals)

The results are:

  23.026 ms (2 allocations: 7.63 MiB)
  23.054 ms (2 allocations: 7.63 MiB)
  27.182 ms (2 allocations: 7.63 MiB)

I think that I am making a stupid mistake…
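
One possible explanation, stated as an assumption rather than a measured result: with only four categories, a linear scan makes at most four comparisons and bisection about two, so the lookup is probably a small share of the per-sample cost next to the call to `rand()` and the store into `results`. A difference between the methods would then only be expected with many categories. The sketch below uses only Base to compare the two lookups on identical inputs; all function names are hypothetical and no timings are claimed here.

```julia
# Hypothetical Base-only comparison of the two lookups on identical inputs.
linear_lookup(edges, u) = findfirst(x -> u <= x, edges)
bisect_lookup(edges, u) = searchsortedfirst(edges, u)

# Time `lookup` over all draws; the index sum doubles as a checksum and
# keeps the compiler from discarding the loop.
function timed_total(lookup, edges, us)
    total = 0
    t = @elapsed for u in us
        total += lookup(edges, u)
    end
    return t, total
end

function compare_lookups(n::Integer; draws::Integer = 10^5)
    w = rand(n)
    edges = cumsum(w ./ sum(w))
    edges[end] = 1.0                             # every u in [0, 1) now has a match
    us = rand(draws)
    timed_total(linear_lookup, edges, us[1:1])   # warm-up, so compilation is not timed
    timed_total(bisect_lookup, edges, us[1:1])
    t_lin, sum_lin = timed_total(linear_lookup, edges, us)
    t_bis, sum_bis = timed_total(bisect_lookup, edges, us)
    return (linear = t_lin, bisect = t_bis, agree = sum_lin == sum_bis)
end

compare_lookups(4)        # the four-category case from the benchmark
compare_lookups(10_000)   # many categories
```

When benchmarking with BenchmarkTools, interpolating the global arguments, as in `@btime test_new_its($intervals)`, also avoids timing the access to a non-constant global.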
