How to create sample random numbers from a custom distribution?

Shayan · November 24, 2022, 9:45am

I want to create a series of random numbers from a custom distribution. Suppose I have a data set named iris:

julia> using RDatasets

julia> iris = dataset("datasets", "iris");

Now I want to generate 100 random numbers sampled from the distribution of iris.SepalLength with a presumption that no one knows about the distribution of this vector. So in this case I should literally create a new distribution first; For this, there is an option by using Distributions.jl here. But I couldn’t follow the steps correctly. The doc says:

To implement a univariate sampler, one can define a subtype (say Spl) of Sampleable{Univariate,S} (where S can be Discrete or Continuous), and provide a rand method, as

function rand(rng::AbstractRNG, s::Spl)
    # ... generate a single sample from s
end

But I don’t know how to provide the s. I tried:

julia> using Distributions

julia> mysamp = Sampleable(iris.SepalLength)
ERROR: MethodError: no method matching Sampleable(::Vector{Float64})
Stacktrace:
 [1] top-level scope
   @ REPL[20]:1

Any help would be appreciated. I want to create a vector of random samples from a custom distribution. The custom distribution should be created from specific data.

DNF · November 24, 2022, 10:19am

You presumably have to decide on which distribution you think is the right one for that dataset, is it a gaussian distribution, for example?

Then you fit that distribution to your data, to get the right parameters (e.g. mean and variance): Distribution Fitting · Distributions.jl

Finally you can sample from the fitted distribution:

using Random, Distributions
julia> r = randn(10)./5 .+ 2;  # starting data

julia> d = fit(Normal, r) 
Normal{Float64}(μ=2.091694546378728, σ=0.09721707142006106)

julia> rand(d, 5)
5-element Vector{Float64}:
 2.077566588816171
 1.9793350650939816
 2.319975404778217
 2.1013889428372607
 1.8952925440083732

If you have no idea which distribution to start with, then I guess you should try various, but the Normal distribution should be a good starting point for your case.

Shayan · November 24, 2022, 10:21am

Ah, thanks.
What if none of the available distributions were able to explain the dataset? Actually, this is my point.

DNF · November 24, 2022, 10:56am

Well, you can manipulate distributions, truncate them, mix them. If none of them are a good fit, do you have some idea what it could be?

Maybe you can conjure up a distribution based on the histogram of the samples? (I don’t know how to do this, BTW).

PharmCat · November 24, 2022, 11:53am

Sampleable is an abstract type. You should make you own realization.

Try:

using Distributions, Random
v = [1,2,5,4,3,6,7,5,4,4,2,3,4,5]

struct Spl <: Sampleable{Univariate , Discrete } 
    vec
end

samp =  Spl(v)
function Base.rand(spl::Spl)
    v[rand(1:length(v))]
end

rand(samp)

So also you can use Categorical distribution.

Shayan · November 24, 2022, 12:00pm

Sorry but this picks some values from the v at last. But I didn’t mean it. I want to generate random numbers that follow the v’s distribution!
As I said:

PharmCat · November 24, 2022, 12:23pm

This follow univariate discrete distribution v.

Dan · November 24, 2022, 12:27pm

Perhaps the EmpiricalDistributions package will help:

using EmpiricalDistributions, StatsBase, RDatasets

iris = dataset("datasets", "iris")
sepalhist = fit(Histogram, iris.SepalLength)
sepaldist = UvBinnedDist(sepalhist)

and now:

julia> rand(sepaldist, 5)
5-element Vector{Float64}:
 5.959300677381709
 6.667052727022748
 5.049564685000477
 5.067785085943993
 5.55095464638566

julia> mean(sepaldist)
5.8933333333333335

Didn’t check the actual fit of the distribution, but the name of the package sounds promising enough to reproduce sepal-like values.

Links:
https://oschulz.github.io/EmpiricalDistributions.jl/dev/
https://github.com/oschulz/EmpiricalDistributions.jl

DNF · November 24, 2022, 1:27pm

But v isn’t a distribution, it’s just a bunch of samples from the underlying distribution, which is almost certainly continuous.

This is the same as rand(v), (except if v isn’t a one-indexed array).

tchebycheff · November 24, 2022, 4:03pm

I would avoid downloading mega bytes of packages for simple tasks. Use standard library functions as much as possible.

For example:

function rand_empirical(x, m)
    n = length(x)
    # kde bandwidth h ≈ 1.06*σ*n^(-1/5)
    h = 1.06*sqrt(n*sum(x .* x) - sum(x)^2)/n/n^(0.2)
    # generate from categorical and add variates from kernel density
    # to get variates from empirical distribution, for example,
    # for gaussian kernel with width h
    r1 = x[rand(1:n, m)]
    r2 = h*randn(m)
    r = r1 .+ r2
    return r
end

kernel density estimation

PharmCat · November 24, 2022, 4:34pm

If I correctly understand task:

Now I want to generate 100 random numbers sampled from the distribution of iris.SepalLength with a presumption that no one knows about the distribution of this vector.

If no one knows about the distribution of this vector - better way to bootstrap 100 samples from this vector. Yes, v - not a distribution, but if you take random element from you will follow this unknown distribution.

Than you can make some assumption, for example - distribution is continuous, find appropriate pdf/cdf functions. Or you can thought that this distribution is discrete, and fit Categorical distribution.

But you know nothing, you have only vector.

DNF · November 24, 2022, 6:57pm

The task is to find the underlying distribution. Then sample from that.

_stla · December 2, 2022, 9:59am

You seem to be lost

There are two possibilities:

you can sample from the empirical distribution of the vector; this is achieved by picking at random some elements of the vector, with replacement
you can fit a continuous distribution to the vector; for example a parametric distribution such as the normal distribution, and then you sample from this distribution, or a non-parametric distribution, but I don’t know whether there exists a Julia package to do that

Topic		Replies	Views
Rand and Sampler interface for multiple samples General Usage distributions , random	2	534	August 21, 2021
Distributions.jl using a custom random number generator Statistics	2	1747	July 29, 2017
Sampling from a normal distribution Statistics	4	16162	April 1, 2020
Basic question about Distributions, drawing random sample and related documentaion New to Julia question	2	1164	January 19, 2020
Custom Sampler with multiple parameters General Usage question , random	4	240	November 23, 2023

How to create sample random numbers from a custom distribution?

Related topics