 # Count occurrences of columns in a 2d array (using countmap)

I have a 2d BitArray and I want to count the number of distinct columns in that array.
And I want to do it fast.

I guess that `countmap` (of StatsBase) is the most efficient way for such operations. However it only works for the entire array.

To solve that I am converting the 2d array into 1d array of arrays.
Then I `countmap` that 1d array.
However, the problem is that the conversion is not very efficient (since I would like to perform this operation many times on larger sets I need it to be fast).

Is there a more efficient way to do it?
(BTW - I saw this post suggesting to use DataStructures.jl, but I did not understand how to use it for my specific purpose).

Here is a MWE (the `@time` commands were ran twice for the initial compilation):

``````julia> using StatsBase

julia> my_arr = rand(Bool, 4, 1000)
4×1000 Array{Bool,2}:
1  0  1  1  0  0  0  1  1  1  0  0  0  1  0  0  1  0  1  1  0  0  …  1  0  1  1  0  0  1  0  1  0  0  1  1  0  1  0  1  1  0  1  1
0  1  0  1  0  1  1  1  1  0  1  1  1  0  0  0  1  0  0  1  1  0     1  0  1  0  0  1  1  0  0  1  0  1  0  1  0  1  0  0  1  0  1
1  1  1  1  1  0  1  1  0  0  0  1  1  1  0  0  1  1  1  1  0  1     0  0  1  0  1  1  1  1  1  1  0  0  0  0  1  1  0  1  0  1  0
0  0  1  0  0  1  0  1  0  0  1  1  1  1  0  1  0  1  0  0  1  1     0  0  1  1  0  1  0  0  1  0  1  1  1  1  0  1  0  0  1  1  1

julia> @time countmap(my_arr) # This is not what I want!
0.104725 seconds (240.05 k allocations: 11.587 MiB)
Dict{Bool,Int64} with 2 entries:
false => 1998
true  => 2002

julia> function slicematrix(A::AbstractMatrix) # a function to get a 1d array of arrays
return [A[:, i] for i in 1:size(A,2)]
end
slicematrix (generic function with 1 method)

julia> @time my_arr_1d = slicematrix(my_arr) ;
0.066759 seconds (177.00 k allocations: 8.931 MiB)

julia> @time my_arr_1d = slicematrix(my_arr) ;
0.000102 seconds (1.00 k allocations: 101.734 KiB)

julia> @time countmap(my_arr_1d);
0.160801 seconds (394.11 k allocations: 19.167 MiB, 8.69% gc time)

julia> @time countmap(my_arr_1d) # this is what I want!
0.000205 seconds (7 allocations: 1.969 KiB)
Dict{Array{Bool,1},Int64} with 16 entries:
Bool[1, 1, 1, 1] => 63
Bool[0, 0, 1, 0] => 60
Bool[0, 1, 1, 1] => 68
Bool[0, 1, 0, 0] => 69
Bool[0, 1, 1, 0] => 55
Bool[1, 1, 0, 1] => 56
Bool[0, 0, 0, 1] => 63
Bool[0, 1, 0, 1] => 61
Bool[0, 0, 0, 0] => 65
Bool[1, 0, 0, 0] => 48
Bool[0, 0, 1, 1] => 71
Bool[1, 0, 1, 0] => 51
Bool[1, 0, 0, 1] => 78
Bool[1, 1, 0, 0] => 70
Bool[1, 0, 1, 1] => 65
Bool[1, 1, 1, 0] => 57

``````

Note that in your MWE, you don’t, you have an `Array{Bool}`.

I would suggest something like

``````countmap(collect(eachcol(my_arr)))
``````

You need `collect` because of

otherwise it would be even faster.

1 Like

Right… I use sometimes
`my_arr = convert(BitArray{}, my_arr)`
But I am not sure if this is needed at all…

This is nice! saves me the slicing of the array.

It depends on the trade-offs between storage and speed — maybe you want to benchmark both options. Also, cf `Random.bitrand` which will generate it directly.

Nice. Thanks!
Actually, I am sampling the bits from a non-uniform distribution. So I use

``````    strings = Array{Int16,2}(undef, n, M);
for i = 1:n
strings[i, :] = sample([Bool(0), Bool(1)], Weights([1-ps[i], ps[i]]), M)
end
strings = convert(BitArray{}, strings)
``````

where `ps` is the probability array, `M` is the sample size, `n` is the bit-string length.

I tried to find some direct way to sample these bits, but couldn’t find any…

I would recommend something like

``````strings = [rand(Float64) <= p[i] for i in 1:n, j in 1:M] # untested
``````

It is unclear to me why you allocate an array of `Int16`s.

That’s a typo This gives roughly factor 10 worse performance regarding time:

``````0.217236 seconds (4.09 M allocations: 67.357 MiB, 10.04% gc time)
``````

as opposed to

which gives

``````0.042701 seconds (206 allocations: 4.062 MiB)
``````

How are you benchmarking it? It seems highly unlikely that the loop should be faster than the comprehension in this case.

``````function foo(ps, M)
n = length(ps)
strings = Array{Int16,2}(undef, n, M);
for i = 1:n
strings[i, :] = sample([Bool(0), Bool(1)], Weights([1-ps[i], ps[i]]), M)
end
strings = convert(BitArray, strings)
return strings
end

bar(ps, M) = BitArray([rand() <= p for p in ps, j in 1:M])
``````

Benchmarking:

``````julia> using BenchmarkTools

julia> ps = 0.1:0.1:0.9
0.1:0.1:0.9

julia> @benchmark foo(\$ps, 1000)
BenchmarkTools.Trial:
memory estimate:  30.78 KiB
allocs estimate:  58
--------------
minimum time:     177.022 μs (0.00% GC)
median time:      184.739 μs (0.00% GC)
mean time:        193.568 μs (0.79% GC)
maximum time:     3.064 ms (91.51% GC)
--------------
samples:          10000
evals/sample:     1

julia> @benchmark bar(\$ps, 1000)
BenchmarkTools.Trial:
memory estimate:  10.31 KiB
allocs estimate:  5
--------------
minimum time:     94.637 μs (0.00% GC)
median time:      96.440 μs (0.00% GC)
mean time:        102.460 μs (0.83% GC)
maximum time:     4.643 ms (97.27% GC)
--------------
samples:          10000
evals/sample:     1
``````

I’m actually surprised that the difference isn’t bigger.

BTW, you can write `[false, true]` instead of `[Bool(0), Bool(1)]`

2 Likes

If you write

``````bar(ps, M) = BitArray(rand() <= p for p in ps, j in 1:M)
``````

(without brackets `[]`) you can save allocations, but it doesn’t seem to be faster.

I was checking it with `@time` but I guess it is better to check using `BenchmarkTools`.
Thanks for that comment!

Hi, I propose two versions to make a sample.

``````function sam1(sM,pS,nS)
@inbounds @simd for j in axes(sM,2)
for i in eachindex(pS)
sM[i,j] = rand() <= pS[i];
end
end
end

function sam2(sM,pS,nS)
@avx for j in axes(sM,2)
for i in eachindex(pS)
sM[i,j] = rand() <= pS[i]
end
end
end
``````

Below are the computing times with my computer.

``````using BenchmarkTools, LoopVectorization
pS = collect(0.1:0.1:0.9)
nS = 1000
sM = trues(length(pS),nS)
@btime sam1(\$sM,\$pS, \$nS)
46.100 μs (0 allocations: 0 bytes)
@btime sam2(\$sM,\$pS, \$nS);
1.090 μs (0 allocations: 0 bytes)
``````

Thanks!
It seems that `sam2` does not really work - it does not provide a random sample according to `ps`… I am not sure what the problem is.