How to create sample random numbers from a custom distribution?

I want to create a series of random numbers from a custom distribution. Suppose I have a data set named iris:

julia> using RDatasets

julia> iris = dataset("datasets", "iris");

Now I want to generate 100 random numbers sampled from the distribution of iris.SepalLength with a presumption that no one knows about the distribution of this vector. So in this case I should literally create a new distribution first; For this, there is an option by using Distributions.jl here. But I couldn’t follow the steps correctly. The doc says:

To implement a univariate sampler, one can define a subtype (say Spl) of Sampleable{Univariate,S} (where S can be Discrete or Continuous), and provide a rand method, as

function rand(rng::AbstractRNG, s::Spl)
    # ... generate a single sample from s
end

But I don’t know how to provide the s. I tried:

julia> using Distributions

julia> mysamp = Sampleable(iris.SepalLength)
ERROR: MethodError: no method matching Sampleable(::Vector{Float64})
Stacktrace:
 [1] top-level scope
   @ REPL[20]:1

Any help would be appreciated. I want to create a vector of random samples from a custom distribution. The custom distribution should be created from specific data.

You presumably have to decide on which distribution you think is the right one for that dataset, is it a gaussian distribution, for example?

Then you fit that distribution to your data, to get the right parameters (e.g. mean and variance): Distribution Fitting · Distributions.jl

Finally you can sample from the fitted distribution:

using Random, Distributions
julia> r = randn(10)./5 .+ 2;  # starting data

julia> d = fit(Normal, r) 
Normal{Float64}(μ=2.091694546378728, σ=0.09721707142006106)

julia> rand(d, 5)
5-element Vector{Float64}:
 2.077566588816171
 1.9793350650939816
 2.319975404778217
 2.1013889428372607
 1.8952925440083732

If you have no idea which distribution to start with, then I guess you should try various, but the Normal distribution should be a good starting point for your case.

1 Like

Ah, thanks.
What if none of the available distributions were able to explain the dataset? Actually, this is my point.

Well, you can manipulate distributions, truncate them, mix them. If none of them are a good fit, do you have some idea what it could be?

Maybe you can conjure up a distribution based on the histogram of the samples? (I don’t know how to do this, BTW).

Sampleable is an abstract type. You should make you own realization.

Try:

using Distributions, Random
v = [1,2,5,4,3,6,7,5,4,4,2,3,4,5]

struct Spl <: Sampleable{Univariate , Discrete } 
    vec
end

samp =  Spl(v)
function Base.rand(spl::Spl)
    v[rand(1:length(v))]
end

rand(samp)

So also you can use Categorical distribution.

Sorry but this picks some values from the v at last. But I didn’t mean it. I want to generate random numbers that follow the v’s distribution!
As I said:

This follow univariate discrete distribution v.

Perhaps the EmpiricalDistributions package will help:

using EmpiricalDistributions, StatsBase, RDatasets

iris = dataset("datasets", "iris")
sepalhist = fit(Histogram, iris.SepalLength)
sepaldist = UvBinnedDist(sepalhist)

and now:

julia> rand(sepaldist, 5)
5-element Vector{Float64}:
 5.959300677381709
 6.667052727022748
 5.049564685000477
 5.067785085943993
 5.55095464638566

julia> mean(sepaldist)
5.8933333333333335

Didn’t check the actual fit of the distribution, but the name of the package sounds promising enough to reproduce sepal-like values.

Links:
https://oschulz.github.io/EmpiricalDistributions.jl/dev/
https://github.com/oschulz/EmpiricalDistributions.jl

But v isn’t a distribution, it’s just a bunch of samples from the underlying distribution, which is almost certainly continuous.

This is the same as rand(v), (except if v isn’t a one-indexed array).

I would avoid downloading mega bytes of packages for simple tasks. Use standard library functions as much as possible.

For example:

function rand_empirical(x, m)
    n = length(x)
    # kde bandwidth h ≈ 1.06*σ*n^(-1/5)
    h = 1.06*sqrt(n*sum(x .* x) - sum(x)^2)/n/n^(0.2)
    # generate from categorical and add variates from kernel density
    # to get variates from empirical distribution, for example,
    # for gaussian kernel with width h
    r1 = x[rand(1:n, m)]
    r2 = h*randn(m)
    r = r1 .+ r2
    return r
end

kernel density estimation

1 Like

If I correctly understand task:

Now I want to generate 100 random numbers sampled from the distribution of iris.SepalLength with a presumption that no one knows about the distribution of this vector.

If no one knows about the distribution of this vector - better way to bootstrap 100 samples from this vector. Yes, v - not a distribution, but if you take random element from you will follow this unknown distribution.

Than you can make some assumption, for example - distribution is continuous, find appropriate pdf/cdf functions. Or you can thought that this distribution is discrete, and fit Categorical distribution.

But you know nothing, you have only vector.

1 Like

The task is to find the underlying distribution. Then sample from that.

1 Like