Count occurrences of columns in a 2d array (using countmap)

I have a 2d BitArray and I want to count the number of distinct columns in that array.
And I want to do it fast.

I guess that countmap (of StatsBase) is the most efficient way for such operations. However it only works for the entire array.

To solve that I am converting the 2d array into 1d array of arrays.
Then I countmap that 1d array.
However, the problem is that the conversion is not very efficient (since I would like to perform this operation many times on larger sets I need it to be fast).

Is there a more efficient way to do it?
(BTW - I saw this post suggesting to use DataStructures.jl, but I did not understand how to use it for my specific purpose).

Here is a MWE (the @time commands were ran twice for the initial compilation):

julia> using StatsBase

julia> my_arr = rand(Bool, 4, 1000)
4×1000 Array{Bool,2}:
 1  0  1  1  0  0  0  1  1  1  0  0  0  1  0  0  1  0  1  1  0  0  …  1  0  1  1  0  0  1  0  1  0  0  1  1  0  1  0  1  1  0  1  1
 0  1  0  1  0  1  1  1  1  0  1  1  1  0  0  0  1  0  0  1  1  0     1  0  1  0  0  1  1  0  0  1  0  1  0  1  0  1  0  0  1  0  1
 1  1  1  1  1  0  1  1  0  0  0  1  1  1  0  0  1  1  1  1  0  1     0  0  1  0  1  1  1  1  1  1  0  0  0  0  1  1  0  1  0  1  0
 0  0  1  0  0  1  0  1  0  0  1  1  1  1  0  1  0  1  0  0  1  1     0  0  1  1  0  1  0  0  1  0  1  1  1  1  0  1  0  0  1  1  1

julia> @time countmap(my_arr) # This is not what I want!
  0.104725 seconds (240.05 k allocations: 11.587 MiB)
Dict{Bool,Int64} with 2 entries:
  false => 1998
  true  => 2002

julia> function slicematrix(A::AbstractMatrix) # a function to get a 1d array of arrays
           return [A[:, i] for i in 1:size(A,2)]
       end
slicematrix (generic function with 1 method)

julia> @time my_arr_1d = slicematrix(my_arr) ;
  0.066759 seconds (177.00 k allocations: 8.931 MiB)

julia> @time my_arr_1d = slicematrix(my_arr) ;
  0.000102 seconds (1.00 k allocations: 101.734 KiB)

julia> @time countmap(my_arr_1d);
  0.160801 seconds (394.11 k allocations: 19.167 MiB, 8.69% gc time)

julia> @time countmap(my_arr_1d) # this is what I want!
  0.000205 seconds (7 allocations: 1.969 KiB)
Dict{Array{Bool,1},Int64} with 16 entries:
  Bool[1, 1, 1, 1] => 63
  Bool[0, 0, 1, 0] => 60
  Bool[0, 1, 1, 1] => 68
  Bool[0, 1, 0, 0] => 69
  Bool[0, 1, 1, 0] => 55
  Bool[1, 1, 0, 1] => 56
  Bool[0, 0, 0, 1] => 63
  Bool[0, 1, 0, 1] => 61
  Bool[0, 0, 0, 0] => 65
  Bool[1, 0, 0, 0] => 48
  Bool[0, 0, 1, 1] => 71
  Bool[1, 0, 1, 0] => 51
  Bool[1, 0, 0, 1] => 78
  Bool[1, 1, 0, 0] => 70
  Bool[1, 0, 1, 1] => 65
  Bool[1, 1, 1, 0] => 57

Note that in your MWE, you don’t, you have an Array{Bool}.

I would suggest something like

countmap(collect(eachcol(my_arr)))

You need collect because of

otherwise it would be even faster.

1 Like

Right… I use sometimes
my_arr = convert(BitArray{}, my_arr)
But I am not sure if this is needed at all…

This is nice! saves me the slicing of the array.

It depends on the trade-offs between storage and speed — maybe you want to benchmark both options. Also, cf Random.bitrand which will generate it directly.

Nice. Thanks!
Actually, I am sampling the bits from a non-uniform distribution. So I use

    strings = Array{Int16,2}(undef, n, M);
    for i = 1:n
        strings[i, :] = sample([Bool(0), Bool(1)], Weights([1-ps[i], ps[i]]), M)
    end
    strings = convert(BitArray{}, strings)

where ps is the probability array, M is the sample size, n is the bit-string length.

I tried to find some direct way to sample these bits, but couldn’t find any…

I would recommend something like

strings = [rand(Float64) <= p[i] for i in 1:n, j in 1:M] # untested

It is unclear to me why you allocate an array of Int16s.

That’s a typo :confused:

This gives roughly factor 10 worse performance regarding time:

0.217236 seconds (4.09 M allocations: 67.357 MiB, 10.04% gc time)

as opposed to

which gives

0.042701 seconds (206 allocations: 4.062 MiB)

How are you benchmarking it? It seems highly unlikely that the loop should be faster than the comprehension in this case.

function foo(ps, M)
    n = length(ps)
    strings = Array{Int16,2}(undef, n, M);
    for i = 1:n
        strings[i, :] = sample([Bool(0), Bool(1)], Weights([1-ps[i], ps[i]]), M)
    end
    strings = convert(BitArray, strings)
    return strings
end

bar(ps, M) = BitArray([rand() <= p for p in ps, j in 1:M])

Benchmarking:

julia> using BenchmarkTools

julia> ps = 0.1:0.1:0.9
0.1:0.1:0.9

julia> @benchmark foo($ps, 1000)
BenchmarkTools.Trial: 
  memory estimate:  30.78 KiB
  allocs estimate:  58
  --------------
  minimum time:     177.022 μs (0.00% GC)
  median time:      184.739 μs (0.00% GC)
  mean time:        193.568 μs (0.79% GC)
  maximum time:     3.064 ms (91.51% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark bar($ps, 1000)
BenchmarkTools.Trial: 
  memory estimate:  10.31 KiB
  allocs estimate:  5
  --------------
  minimum time:     94.637 μs (0.00% GC)
  median time:      96.440 μs (0.00% GC)
  mean time:        102.460 μs (0.83% GC)
  maximum time:     4.643 ms (97.27% GC)
  --------------
  samples:          10000
  evals/sample:     1

I’m actually surprised that the difference isn’t bigger.

BTW, you can write [false, true] instead of [Bool(0), Bool(1)]

2 Likes

If you write

bar(ps, M) = BitArray(rand() <= p for p in ps, j in 1:M)

(without brackets []) you can save allocations, but it doesn’t seem to be faster.

I was checking it with @time but I guess it is better to check using BenchmarkTools.
Thanks for that comment!

Hi, I propose two versions to make a sample.

function sam1(sM,pS,nS)
  @inbounds @simd for j in axes(sM,2)
    for i in eachindex(pS)
      sM[i,j] = rand() <= pS[i];
    end
  end    
end

function sam2(sM,pS,nS)
  @avx for j in axes(sM,2)
    for i in eachindex(pS)
      sM[i,j] = rand() <= pS[i]
    end
  end    
end

Below are the computing times with my computer.

using BenchmarkTools, LoopVectorization
pS = collect(0.1:0.1:0.9)
nS = 1000
sM = trues(length(pS),nS)
@btime sam1($sM,$pS, $nS)
46.100 μs (0 allocations: 0 bytes)
@btime sam2($sM,$pS, $nS);
1.090 μs (0 allocations: 0 bytes)

Thanks!
It seems that sam2 does not really work - it does not provide a random sample according to ps… I am not sure what the problem is.

Would there be a way to return the most common occurring bit column wise? For example:
arr = [0 1 0; 0 1 1; 1 1 1; 0 1 1]

4×3 Matrix{Int64}:
0 1 0
0 1 1
1 1 1
0 1 1

Return 011

One way:

using StatsBase             
arr = [0 1 0; 0 1 1; 1 1 1; 0 1 1];
mode.(eachcol(arr))

 0
 1
 1
3 Likes