I have a 2D BitArray and I want to count how many times each distinct column occurs in that array, and I want to do it fast.
I guess that countmap (from StatsBase) is the most efficient way to do this kind of counting. However, when given a matrix it counts the individual elements of the entire array, not its columns.
To work around that, I convert the 2D array into a 1D array of arrays (one per column) and then call countmap on that 1D array.
However, the conversion is not very efficient. I would like to perform this operation many times on larger arrays, so it needs to be fast.
Is there a more efficient way to do it?
(BTW - I saw this post suggesting to use DataStructures.jl, but I did not understand how to use it for my specific purpose).
Here is an MWE (each @time command was run twice, so that the second timing excludes the initial compilation):
julia> using StatsBase
julia> my_arr = rand(Bool, 4, 1000)
4×1000 Array{Bool,2}:
1 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 0 1 1 0 0 … 1 0 1 1 0 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 1
0 1 0 1 0 1 1 1 1 0 1 1 1 0 0 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 1 0 1
1 1 1 1 1 0 1 1 0 0 0 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 0 1 0 1 0
0 0 1 0 0 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 1 1 0 0 1 1 0 1 0 0 1 0 1 1 1 1 0 1 0 0 1 1 1
julia> @time countmap(my_arr) # This is not what I want!
0.104725 seconds (240.05 k allocations: 11.587 MiB)
Dict{Bool,Int64} with 2 entries:
false => 1998
true => 2002
julia> function slicematrix(A::AbstractMatrix) # a function to get a 1d array of arrays
           return [A[:, i] for i in 1:size(A, 2)]
       end
slicematrix (generic function with 1 method)
julia> @time my_arr_1d = slicematrix(my_arr) ;
0.066759 seconds (177.00 k allocations: 8.931 MiB)
julia> @time my_arr_1d = slicematrix(my_arr) ;
0.000102 seconds (1.00 k allocations: 101.734 KiB)
julia> @time countmap(my_arr_1d);
0.160801 seconds (394.11 k allocations: 19.167 MiB, 8.69% gc time)
julia> @time countmap(my_arr_1d) # this is what I want!
0.000205 seconds (7 allocations: 1.969 KiB)
Dict{Array{Bool,1},Int64} with 16 entries:
Bool[1, 1, 1, 1] => 63
Bool[0, 0, 1, 0] => 60
Bool[0, 1, 1, 1] => 68
Bool[0, 1, 0, 0] => 69
Bool[0, 1, 1, 0] => 55
Bool[1, 1, 0, 1] => 56
Bool[0, 0, 0, 1] => 63
Bool[0, 1, 0, 1] => 61
Bool[0, 0, 0, 0] => 65
Bool[1, 0, 0, 0] => 48
Bool[0, 0, 1, 1] => 71
Bool[1, 0, 1, 0] => 51
Bool[1, 0, 0, 1] => 78
Bool[1, 1, 0, 0] => 70
Bool[1, 0, 1, 1] => 65
Bool[1, 1, 1, 0] => 57
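For comparison, here is a minimal sketch of one direction that skips the array-of-arrays conversion: pack each Boolean column into a single integer and count those integers in a Dict. The function name `count_columns` is hypothetical, the sketch assumes the matrix has at most 64 rows so that a column fits in a `UInt64`, and it has not been benchmarked against the countmap approach above.

```julia
# Hypothetical helper (not part of StatsBase): counts columns of a Boolean
# matrix by encoding each column as an integer key.
# Assumption: size(A, 1) <= 64, so one column fits into a UInt64.
function count_columns(A::AbstractMatrix{Bool})
    nrows, ncols = size(A)
    nrows <= 64 || throw(ArgumentError("at most 64 rows are supported"))
    counts = Dict{UInt64,Int}()
    for j in 1:ncols
        key = zero(UInt64)
        for i in 1:nrows
            # Shift in one bit per row; the first row ends up as the
            # most significant of the nrows bits.
            key = (key << 1) | UInt64(A[i, j])
        end
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

my_arr = rand(Bool, 4, 1000)
col_counts = count_columns(my_arr)
```

Note that the keys here are integers, not Bool vectors: with 4 rows, the column `Bool[1, 0, 1, 1]` corresponds to the key `0b1011`. If the result must be keyed by the columns themselves, the keys would have to be decoded again afterwards.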