Sampling an index from a boolean array

yuvalw · September 28, 2022, 1:30pm

Hello,
I often need to sample an index from a boolean array. For example:

julia> A = rand(Bool,3,3)
3×3 Matrix{Bool}:
 1  1  1
 1  0  0
 1  0  0
julia> idx = rand(findall(A))
CartesianIndex(2, 1)

Can I somehow sample without findall, in some non-allocating way?
Thanks

jling · September 28, 2022, 1:38pm

in order to “sample” a collection, you need to hold that collection somewhere;

sure you can optimize it a bit by:

first count how many true there is
rand(1:num_of_true)
go find which location that true is at.

but idk if it’s worth it

mcabbott · September 28, 2022, 2:33pm

Another way would be to sample from all the indices, and reject when the answer is false:

julia> A = rand(Bool, 3, 30);

julia> @btime rand(findall($A))
  min 132.477 ns, mean 147.641 ns (1 allocation, 816 bytes)
CartesianIndex(3, 5)

julia> function _myrand(A)
         i = rand(eachindex(A))
         A[i] ? i : _myrand(A)
       end;
       myrand(A) = count(A)==0 ? nothing : _myrand(A) 
myrand (generic function with 1 method)

julia> @btime myrand($A)
  min 36.809 ns, mean 40.934 ns (0 allocations)
35

I inserted a check for the case of all-false A, but perhaps this still behaves badly when almost all entries are false.

GunnarFarneback · September 28, 2022, 3:07pm

How large are your arrays typically? What kind of distribution do you have between falses and trues? Do you need to sample once from a given array or repeatedly? Is speed or memory more important?

yuvalw · September 29, 2022, 1:34pm

The distribution of trues and falses is close enough to uniform.Typically the array is of the size 20x600x20. I then sample 1000 times per slice of this array, taking 10000 different slices of this array according to some symmetry. Sampling from slices also make it less efficient. A simplified code would be like this:

using MaskedArrays: mask
A = rand(Bool, 20, 600, 20) 
function fun1(A)
    for i in size(A,1), j in size(A,2)
        input_clusters = A[i, j, :]
        possible_nodes = mask(A[:,:,i], input_clusters, :) |> findall
        for _ in 1:1000
            node = rand(possible_nodes) # this is what I actually use
        end
    end
end

julia> @btime fun1($A)
  585.338 μs (12011 allocations: 316.83 KiB)

In fact, using masked arrays, although seeming to me to be the appropriate thing to do, is not efficient

function fun2(A)
    for i in size(A,1), j in size(A,2)
        input_clusters = A[i, j, :] |> findall
        possible_nodes = A[input_clusters, :, i] |> findall
        for _ in 1:1000
            row_idx, col = Tuple(rand(possible_nodes))
            row = input_clusters[row_idx]
            node = CartesianIndex(row, col) # this is what I actually use
        end
    end
end

julia> @btime fun2($A)
  22.545 μs (5 allocations: 26.66 KiB)

DNF · September 29, 2022, 3:01pm

This does not at all do what I think you think it does😉

size(A, 1) just returns a single number, not a range. You probably want

for i in 1:size(A,1), j in 1:size(A,2)

or better:

for i in axes(A,1), j in axes(A,2)

GunnarFarneback · September 29, 2022, 7:45pm

I can’t say I really understand how A and input_clusters interact but I assume it has something to do with the symmetries you mentioned. Amortizing findall over 1000 samples doesn’t sound unreasonable but if you also have an outer loop you could consider writing your own findall loop which writes into a preexisting vector and reuses it throughout the loop. Actually writing your own loop would probably be a good idea anyway since you could more directly incorporate input_clusters without making temporary arrays.

If you have guarantees that your trues density doesn’t fall too low it may very well be worth trying the already proposed rejection sampling.

yuvalw · October 3, 2022, 11:15am

You’re correct of course, now the running time is much more realistic as well. Can’t edit anymore, so I’ll repost here:

using MaskedArrays: mask
A = rand(Bool, 20, 600, 20) 
function fun1(A)
    for i in axes(A,1), j in axes(A,2)
        input_clusters = A[i, j, :]
        possible_nodes = mask(A[:,:,i], input_clusters, :) |> findall
        for _ in 1:1000
            node = rand(possible_nodes) # this is what I actually use
        end
    end
end

julia> @btime fun1($A)
  7.155 s (199689892 allocations: 4.93 GiB)

function fun2(A)
    for i in axes(A,1), j in axes(A,2)
        input_clusters = A[i, j, :] |> findall
        possible_nodes = A[input_clusters, :, i] |> findall
        for _ in 1:1000
            row_idx, col = Tuple(rand(possible_nodes))
            row = input_clusters[row_idx]
            node = CartesianIndex(row, col) # this is what I actually use
        end
    end
end

julia> @btime fun2($A)
  396.924 ms (49892 allocations: 431.63 MiB)

Topic		Replies	Views
Efficient way to get indices of true elements from BitArray New to Julia indexing	5	1777	February 3, 2022
Boolean Indexing in dataframes Data question , dataframes	3	521	December 14, 2023
Sampling from a list of integers without repetition New to Julia	3	6841	June 15, 2020
Sample without replacement.. and without StatsBase or shuffle General Usage	12	332	May 3, 2024
Randomly select x% of elements in a array/matrix New to Julia arrays , random	11	2759	July 19, 2022

Sampling an index from a boolean array

Related topics