How to use `ordered` keyword equals to `true` with `StatsBase.sample()` function when sampling 2 or more dimensional array from a population?

I was using sample() function from StatsBase module for sampling with replacement from a population a and sampling probabilities weight vector wv as below:

a = [31, 12, 29, 9, 15]
wv = weights([0.1, 0.4, 0.2, 0.1, 0.2])
sample(a, wv, (3, 4), ordered=false)

It provides a random matrix that looks like:

3×4 Array{Int64,2}:
 31  12  29  29    
 15   9   9  31    
 31   9  15  15

But when I write the code as:

sample(a, wv, (3, 4), ordered=true)

instead of providing any ordered (by row or column) random matrix, it throws the error:

ERROR: UndefKeywordError: keyword argument dims not assigned

How to use the keyword argument ordered=true here for sampling a 2 or more dimensional array from a population array?

1 Like

That error message could definitely be made better, but I think the reason why your example doesn’t work is because there isn’t a clear way to order the values of a matrix. For example, if your sampled values are (1, 2, 3, 4), is the correct order for a 2x2 matrix this,

1 2
3 4

or this,

1 3
2 4

or even this,

4 3
2 1

What you could do is sample a vector and then reshape it into a matrix:

a = [31, 12, 29, 9, 15]
wv = weights([0.1, 0.4, 0.2, 0.1, 0.2])
x = sample(a, wv, 12, ordered=true)
julia> reshape(x, 3, 4)
3×4 Array{Int64,2}:
 12  12  15  29
 12  15  15  29
 12  15  15  29
2 Likes

I opened a Github issue to improve the error message:

2 Likes

Thanks for your so quick reply and sharing your thoughts.

Similar thing I also tried. This is also not a very straightforward way. If we write:

sort(sample(a, wv, (3, 4)), dims=2)   # For ordering column-wise.
3×4 Array{Int64,2}:
 12  15  15  29
  9   9  12  31
 12  12  15  15

sort(sample(a, wv, (3, 4)), dims=1)   # For ordering row-wise.
3×4 Array{Int64,2}:
 12   9  12  12
 12  15  12  15
 29  29  29  31

Obviously, in these ways we are not getting the desired output directly from the sample() function.

A basic issue here is that there is a bug in sample(a, ordered=true, replace=true): incorrect ordering for `sample(replace=true, ordered=true)` · Issue #675 · JuliaStats/StatsBase.jl · GitHub

You’ll see that if you sample a 1d output it gives a non-sequential order (i.e. not matching the ordering in a):

julia> sample(a, wv, 3, ordered=true)
3-element Array{Int64,1}:
  9
 12
 29

The natural way in Julia would be iteration order, i.e. column-major order.

Except for the abovementioned bug you could do this via reshape:

julia> reshape(sample(a, wv, 12, ordered=true), 3,4)
3×4 Array{Int64,2}:
  9  12  15  15
 12  12  15  15
 12  12  15  29

A workaround to get a correct ordering would be to sample indices:

julia> reshape(a[sample(1:length(a), wv, 12, ordered=true)], 3,4)
3×4 Array{Int64,2}:
 31  12  29  15
 31  29   9  15
 12  29  15  15
2 Likes

(This is not the same thing as ordered/sequential sampling. Which is fine, if that’s what you want.)

Ah, I see what you mean. I just assumed that the ordered keyword argument meant that the output samples should be sorted (e.g., lexicographically). However, from a quick google search I can’t find a reference for the type of sorting you refer to, where the output order respects the order in the original vector. The closest I’ve found is this:

which describes the case where you sample and then sort the samples (for the naive algorithm).

For example, Vitter (1984) defines “sequential” sampling this way, and the StatsBase docs define “ordered” as equivalent to “sequential”.

Moreover, StatsBase.sample uses this definition for the replace=false case, so it makes no sense to use a completely different definition of “ordered” for sampling with replacement.

Finally, sequential sampling as Vitter defined it is strictly more general. If you want sorted results from sample(a, x, ordered=true), you can simply sort!(a) first in any ordering you want. (And the definition you are using fails completely if the elements of a do not have an isless ordering.)

4 Likes

Yes, that’s true. What I tried, is not the same thing as we see in sequential sampling.