How to efficiently find columns of the matrix which are the same?

Hi all,

I was trying to optimize my naive code I wrote for an algorithm I’m working on, and I’ve found this part takes up the most runtime:

num_of_the_same_columns = 0
num_of_columns = size(matrix, 2)
for column_idx in 1:num_of_columns
    for next_column_idx in (column_idx + 1):num_of_columns
        if all(matrix[:, column_idx] .== matrix[:, next_column_idx])
            num_of_the_same_columns += 1
        end
    end
end

What it is trying to do is count the columns of a matrix that have the same values in their corresponding cells, e.g. here:

[1, 1, 1, 1, 0, 0]
[1, 1, 1, 0, 1, 0]
[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]

columns 1, 2, 3 are “the same”, so num_of_the_same_columns should add up to 3.

Is there a more efficient way to search for something like this in Julia than these nasty nested loops?

Depending on the data, it may be faster to compute a hash value for each column, which then only needs a lookup of duplicate hashes.
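As a minimal sketch of that idea (my illustration, not code from the thread): hash each column once, then count duplicated hashes. A robust version must still compare the actual columns whenever two hashes match, since distinct columns can collide.

```julia
# Sketch of the hash-first strategy: one hash per column, then a duplicate count.
# Caveat: distinct columns can share a hash, so a correct implementation must
# verify equality of the underlying columns on a hash match.
matrix = [1 1 0
          2 2 0
          3 3 1]
col_hashes = hash.(eachcol(matrix))                               # one UInt per column
num_duplicates = length(col_hashes) - length(unique(col_hashes))  # 1 here
```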

How big are your matrices? length(unique(eachcol(matrix))) is almost 500x faster than your version on my machine for a random 1000x1000 matrix (i.e. all unequal columns). (And that’s if I put your code in a function; in global scope your code is even worse.) This is the number of unique columns — I’m not entirely clear on what you are trying to compute, but the number of columns which are duplicates is size(matrix, 2) - number_unique_cols.

There’s nothing wrong with loops in performance-critical Julia code, it’s just that you are using a suboptimal algorithm. The basic reason is that unique constructs a Set of columns, which uses a hash table so it avoids comparing whole columns in the common case of unequal columns with distinct hashes. So, the time is roughly linear in the number of columns rather than quadratic as in your algorithm. (Also, it uses column views rather than copied slices, and it avoids allocating unnecessary intermediate arrays like the result of your .== operation.)

You can also use size(unique(matrix, dims=2), 2), but that is slower because it allocates a whole new matrix to store the result of unique, rather than an array of views as in unique(eachcol(matrix)).
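A concrete comparison on a small made-up matrix:

```julia
A = [1 1 0
     1 1 0
     0 0 1]
length(unique(eachcol(A)))               # 2 unique columns, stored as views
size(unique(A, dims=2), 2)               # also 2, but allocates a new matrix
size(A, 2) - length(unique(eachcol(A)))  # 1 column is a duplicate
```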

unique does this for you. (If you implement this strategy yourself, you have to be wary of accidental hash collisions.)

See also Finding unique columns of a matrix

PS. Note that you can compare two columns of a matrix for equality with A[:,i] == A[:,j] rather than using all(A[:,i] .== A[:,j]) which allocates an array of booleans first (from .==) and then checks that they are all true. And put @views in front (or better yet, on a whole block of code like a whole function) to make slices return views rather than copies.
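For instance (my example):

```julia
A = [1 1
     2 2]
all(A[:, 1] .== A[:, 2])   # true, but .== allocates a temporary Bool array
A[:, 1] == A[:, 2]         # true; no Bool array, though the slices still copy
@views A[:, 1] == A[:, 2]  # true; compares two views, no copies at all
```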

length(unique(hash.(eachcol(matrix))))

Is this expression missing something that would make it equivalent (in terms of the result obtained) to this one?

length(unique(eachcol(matrix)))

Yes, you are assuming that hashes for distinct columns never collide.

When you store columns (or rather, column views) in a Set, it uses hashes only as an optimization — if two hashes are different, it knows the columns are different and does no further comparison. If two hashes are the same, however, it checks for collisions by comparing the whole columns.
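This can be illustrated with a contrived type whose hash always collides (my example; AlwaysCollides is a made-up name):

```julia
# A type whose hash is the same for every value: worst-case collisions.
struct AlwaysCollides
    x::Int
end
Base.hash(::AlwaysCollides, h::UInt) = h   # deliberately collide everything

# The Set still distinguishes the values, because on a hash match it falls
# back to comparing the values themselves (via isequal/==).
length(Set([AlwaysCollides(1), AlwaysCollides(2), AlwaysCollides(1)]))  # 2

# Counting unique *hashes* alone would wrongly report 1:
length(unique(hash.([AlwaysCollides(1), AlwaysCollides(2)])))           # 1
```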


Your code indeed returns 3 for this matrix. However, if I make a matrix with 10 identical columns, e.g. matrix = zeros(10,10), then your code returns 45. Is this the intended result?

To get the same results as your code, but with a Set-like algorithm similar to unique, one would need a multiset data structure that keeps track of the multiplicity of each column, e.g. implemented by a Dict mapping columns to multiplicities.

function do_count(A)
    cols = eachcol(A)
    counts = Dict{eltype(cols), Int}()
    num = 0
    for col in cols
        count = get(counts, col, 0) # current multiplicity
        num += count # number of previous columns equal to col
        counts[col] = count + 1
    end
    return num
end

It’s about 250x faster than your algorithm for a 1000x1000 matrix. (This is still suboptimal because it hashes each column twice; this is a longstanding issue #24454 in Julia where no one has yet agreed on a better API for updating dictionaries :cry:.)
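One workaround for the double hashing (my sketch, not from the thread) is to store a mutable Ref as the dictionary value, so that a single get! lookup per column both finds and initializes the counter:

```julia
# Variant of do_count that hashes each column only once per iteration:
# get! finds-or-inserts in one lookup, and the Ref counter is mutated in place.
function do_count_refs(A)
    cols = eachcol(A)
    counts = Dict{eltype(cols), Base.RefValue{Int}}()
    num = 0
    for col in cols
        r = get!(() -> Ref(0), counts, col)  # one hash lookup per column
        num += r[]      # number of previous columns equal to col
        r[] += 1
    end
    return num
end

do_count_refs(zeros(10, 10))  # 45, matching the quadratic pair count
```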


The question is a little vague on the desired output, but I think the following will succinctly include the OP’s desired information
(I just noticed this is pretty similar to stevengj’s answer):

julia> using StatsBase

julia> M = permutedims(reshape([[1, 1, 1, 1, 0, 0]
       [1, 1, 1, 0, 1, 0]
       [0, 0, 0, 0, 0, 0]
       [0, 0, 0, 0, 0, 0]
       [0, 0, 0, 0, 0, 0]
       [0, 0, 0, 0, 0, 0]], (6,6)))
6×6 Matrix{Int64}:
 1  1  1  1  0  0
 1  1  1  0  1  0
 0  0  0  0  0  0
 0  0  0  0  0  0
 0  0  0  0  0  0
 0  0  0  0  0  0
julia> countmap(eachcol(M))
Dict{SubArray{Int64, 1, Matrix{Int64}, Tuple{...} with 4 entries:
  [1, 1, 0, 0, 0, 0] => 3
  [1, 0, 0, 0, 0, 0] => 1
  [0, 1, 0, 0, 0, 0] => 1
  [0, 0, 0, 0, 0, 0] => 1
julia> sort(values(countmap(eachcol(M))); rev = true)
4-element Vector{Int64}:
 3
 1
 1
 1

The last expression’s first element is the count of the most popular column in the matrix. If several columns appear multiple times, the vector will reflect this.
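If the quantity the OP’s loops compute is wanted (the number of equal column pairs), it can be derived from the column multiplicities. Here is a Base-only sketch of that derivation (countmap from StatsBase would produce the same multiplicities):

```julia
M = [1 1 1 1 0 0
     1 1 1 0 1 0
     0 0 0 0 0 0
     0 0 0 0 0 0
     0 0 0 0 0 0
     0 0 0 0 0 0]

# count how many times each column value occurs
multiplicities = Dict{Vector{Int}, Int}()
for col in eachcol(M)
    v = collect(col)
    multiplicities[v] = get(multiplicities, v, 0) + 1
end

# a column value appearing c times contributes c*(c-1)/2 equal pairs
num_pairs = sum(c * (c - 1) ÷ 2 for c in values(multiplicities))  # 3
```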

julia> m=rand(1:3,10,1000)
10×1000 Matrix{Int64}:
 2  2  2  1  3  1  3  3  3  …  3  1  2  1  3  3  2  1   
 2  1  1  2  3  3  2  3  3     1  1  2  1  3  3  1  2   
....

julia> @btime length(unique(eachcol(m)))
  69.700 μs (29 allocations: 327.73 KiB)
997

julia> @btime length(unique(hash.(eachcol(m))))
  37.500 μs (28 allocations: 79.05 KiB)
997

Then the following expression is the correct one, and it seems very fast too:

julia> @btime length(Set(eachcol(m)))
  31.900 μs (9 allocations: 83.16 KiB)
997

Thank you for your comprehensive answer.

By no means is it the intended result. My initial code worked for the few examples I considered, but now I see that what I’m trying to do is a bit more complicated than that. However, your answer gives good insight that I’ll have to take time to analyze.
Btw, the matrices I would work with could probably get as large as 100x100, no more.


How does countmap run twice as fast as colmap?

function colmap(m)
    dd = Dict{Array{Int64}, Int64}()
    for c in eachcol(m)
        dd[c] = get(dd, c, 0) + 1
    end
    dd
end

countmap uses Dict internals to avoid hashing each column twice (working around julia#24454), and in this problem hashing the columns is the dominant cost for a sufficiently large matrix.

(Indeed, since hashing is the dominant cost, you could probably speed things up further by wrapping the column views in another type that implements a custom hash function. Standard numeric hashing in Julia pays an extra cost so that different types of the same value, e.g. 1.0 vs 1 vs 1.0f0, yield the same hash. For this application you don’t need that, assuming that all of the matrix elements have the same type, i.e. the eltype is concrete. But this kind of voodoo might not be worth the trouble here.)
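A hedged sketch of that wrapper idea (FastCol is a made-up name; this assumes a concrete Int64 eltype, per the caveat above):

```julia
# Hypothetical wrapper: hash the raw Int64 bits directly, skipping the
# cross-type numeric hashing that Base does to make 1, 1.0, and 1.0f0 agree.
struct FastCol{V<:AbstractVector{Int64}}
    v::V
end
Base.:(==)(a::FastCol, b::FastCol) = a.v == b.v
function Base.hash(c::FastCol, h::UInt)
    for x in c.v
        h = hash(reinterpret(UInt64, x), h)  # bit-level hash of each element
    end
    return h
end

m = rand(1:3, 10, 1000)
# Same unique-column count as Set(eachcol(m)), computed through the wrapper:
length(Set(FastCol.(eachcol(m))))
```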