Sampling an index from a boolean array

Hello,
I often need to sample an index from a boolean array. For example:

julia> A = rand(Bool,3,3)
3×3 Matrix{Bool}:
 1  1  1
 1  0  0
 1  0  0
julia> idx = rand(findall(A))
CartesianIndex(2, 1)

Can I somehow sample without findall, in some non-allocating way?
Thanks

in order to “sample” a collection, you need to hold that collection somewhere;

sure you can optimize it a bit by:

  1. first count how many true there is
  2. rand(1:num_of_true)
  3. go find which location that true is at.

but idk if it’s worth it

Another way would be to sample from all the indices, and reject when the answer is false:

julia> A = rand(Bool, 3, 30);

julia> @btime rand(findall($A))
  min 132.477 ns, mean 147.641 ns (1 allocation, 816 bytes)
CartesianIndex(3, 5)

julia> function _myrand(A)
         i = rand(eachindex(A))
         A[i] ? i : _myrand(A)
       end;
       myrand(A) = count(A)==0 ? nothing : _myrand(A) 
myrand (generic function with 1 method)

julia> @btime myrand($A)
  min 36.809 ns, mean 40.934 ns (0 allocations)
35

I inserted a check for the case of all-false A, but perhaps this still behaves badly when almost all entries are false.

1 Like

How large are your arrays typically? What kind of distribution do you have between falses and trues? Do you need to sample once from a given array or repeatedly? Is speed or memory more important?

1 Like

The distribution of trues and falses is close enough to uniform.Typically the array is of the size 20x600x20. I then sample 1000 times per slice of this array, taking 10000 different slices of this array according to some symmetry. Sampling from slices also make it less efficient. A simplified code would be like this:

using MaskedArrays: mask
A = rand(Bool, 20, 600, 20) 
function fun1(A)
    for i in size(A,1), j in size(A,2)
        input_clusters = A[i, j, :]
        possible_nodes = mask(A[:,:,i], input_clusters, :) |> findall
        for _ in 1:1000
            node = rand(possible_nodes) # this is what I actually use
        end
    end
end

julia> @btime fun1($A)
  585.338 μs (12011 allocations: 316.83 KiB)

In fact, using masked arrays, although seeming to me to be the appropriate thing to do, is not efficient

function fun2(A)
    for i in size(A,1), j in size(A,2)
        input_clusters = A[i, j, :] |> findall
        possible_nodes = A[input_clusters, :, i] |> findall
        for _ in 1:1000
            row_idx, col = Tuple(rand(possible_nodes))
            row = input_clusters[row_idx]
            node = CartesianIndex(row, col) # this is what I actually use
        end
    end
end

julia> @btime fun2($A)
  22.545 μs (5 allocations: 26.66 KiB)

This does not at all do what I think you think it does😉

size(A, 1) just returns a single number, not a range. You probably want

for i in 1:size(A,1), j in 1:size(A,2)

or better:

for i in axes(A,1), j in axes(A,2)

I can’t say I really understand how A and input_clusters interact but I assume it has something to do with the symmetries you mentioned. Amortizing findall over 1000 samples doesn’t sound unreasonable but if you also have an outer loop you could consider writing your own findall loop which writes into a preexisting vector and reuses it throughout the loop. Actually writing your own loop would probably be a good idea anyway since you could more directly incorporate input_clusters without making temporary arrays.

If you have guarantees that your trues density doesn’t fall too low it may very well be worth trying the already proposed rejection sampling.

1 Like

You’re correct of course, now the running time is much more realistic as well. Can’t edit anymore, so I’ll repost here:

using MaskedArrays: mask
A = rand(Bool, 20, 600, 20) 
function fun1(A)
    for i in axes(A,1), j in axes(A,2)
        input_clusters = A[i, j, :]
        possible_nodes = mask(A[:,:,i], input_clusters, :) |> findall
        for _ in 1:1000
            node = rand(possible_nodes) # this is what I actually use
        end
    end
end

julia> @btime fun1($A)
  7.155 s (199689892 allocations: 4.93 GiB)

function fun2(A)
    for i in axes(A,1), j in axes(A,2)
        input_clusters = A[i, j, :] |> findall
        possible_nodes = A[input_clusters, :, i] |> findall
        for _ in 1:1000
            row_idx, col = Tuple(rand(possible_nodes))
            row = input_clusters[row_idx]
            node = CartesianIndex(row, col) # this is what I actually use
        end
    end
end

julia> @btime fun2($A)
  396.924 ms (49892 allocations: 431.63 MiB)