How to efficiently find columns of the matrix which are the same?

dgsob · November 6, 2023, 6:56pm

Hi all,

I was trying to optimize my naive code I wrote for an algorithm I’m working on, and I’ve found this part takes up the most runtime:

num_of_the_same_columns = 0
 num_of_columns = size(matrix, 2)
    for column_idx in 1:num_of_columns
        for next_column_idx in (column_idx + 1):num_of_columns
            if all(matrix[:, column_idx] .== matrix[:, next_column_idx])
                num_of_the_same_columns += 1
            end
        end
    end

What it is trying to do is to count columns in a matrix, which have the same values in corresponding cells, e.g. here:

[1, 1, 1, 1, 0, 0]
[1, 1, 1, 0, 1, 0]
[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0]

columns 1, 2, 3 are “the same” so the num_of_the_same_columns should add up to 3.

Is there a more efficient way to search for something like this in Julia than these nasty nested loops?

jstrube · November 6, 2023, 7:11pm

Depending on the data, it may be faster to compute a hash value for each column, which then only needs a lookup of duplicate hashes.

stevengj · November 6, 2023, 7:27pm

How big are your matrices? length(unique(eachcol(matrix))) is almost 500x faster than your version on my machine for a random 1000x1000 matrix (i.e. all unequal columns). (And that’s if I put your code in a function; in global scope your code is even worse.) This is the number of unique columns — I’m not entirely clear on what you are trying to compute, but the number of columns which are duplicates is size(matrix, 2) - number_unique_cols.

There’s nothing wrong with loops in performance-critical Julia code, it’s just that you are using a suboptimal algorithm. The basic reason is that unique constructs a Set of columns, which uses a hash table so it avoids comparing whole columns in the common case of unequal columns with distinct hashes. So, the time is roughly linear in the number of columns rather than quadratic as in your algorithm. (Also, it uses column views rather than copied slices, and it avoids allocating unnecessary intermediate arrays like the result of your .== operation.)

You can also use size(unique(matrix, dims=2), 2), but that is slower because it allocates a whole new matrix to store the result of unique, rather than an array of views as in unique(eachcol(matrix)).

unique does this for you. (If you implement this strategy yourself, you have to be wary of accidental hash collisions.)

See also Finding unique columns of a matrix

PS. Note that you can compare two columns of a matrix for equality with A[:,i] == A[:,j] rather than using all(A[:,i] .== A[:,j]) which allocates an array of booleans first (from .==) and then checks that they are all true. And put @views in front (or better yet, on a whole block of code like a whole function) to make slices return views rather than copies.

rocco_sprmnt21 · November 7, 2023, 12:22pm

length(unique(hash.(eachcol(matrix))))

Is there something missing from this expression to be equivalent (in terms of the result obtained) to this other one?

length(unique(eachcol(matrix)))

stevengj · November 7, 2023, 12:24pm

Yes, you are assuming that hashes for distinct columns never collide.

When you store columns (or rather, column views) in a Set, it uses hashes only as an optimization — if two hashes are different, it knows the columns are different and does no further comparison. If two hashes are the same, however, it checks for collisions by comparing the whole columns.

stevengj · November 7, 2023, 12:30pm

Your code indeed returns 3 for this matrix. However, if I make a matrix with 10 identical columns, e.g. matrix = zeros(10,10), then your code returns 45. Is this the intended result?

To get the same results as your code, but with a Set-like algorithm similar to unique, one would need a multiset data structure that keeps track of the multiplicity of each column, e.g. implemented by a Dict mapping columns to multiplicities.

using DataStructures
function do_count(A)
    cols = eachcol(A)
    counts = Dict{eltype(cols), Int}()
    num = 0
    for col in cols
        count = get(counts, col, 0) # current multiplicity
        num += count # number of previous columns equal to col
        counts[col] = count + 1
    end
    return num
end

It’s about 250x faster than your algorithm for a 1000x1000 matrix. (This is still suboptimal because it hashes each column twice; this is a longstanding issue#24454 in Julia where no one has yet agreed on a better API for updating dictionaries .)

Dan · November 7, 2023, 1:51pm

The question is a little vague on the desired output, but I think the following will succintly include the OP’s desired information
(just noticed this is pretty similar to stevengj’s answer):

julia> using StatsBase

julia> M = permutedims(reshape([[1, 1, 1, 1, 0, 0]
       [1, 1, 1, 0, 1, 0]
       [0, 0, 0, 0, 0, 0]
       [0, 0, 0, 0, 0, 0]
       [0, 0, 0, 0, 0, 0]
       [0, 0, 0, 0, 0, 0]], (6,6)))
6×6 Matrix{Int64}:
 1  1  1  1  0  0
 1  1  1  0  1  0
 0  0  0  0  0  0
 0  0  0  0  0  0
 0  0  0  0  0  0
 0  0  0  0  0  0

julia> countmap(eachcol(M))
Dict{SubArray{Int64, 1, Matrix{Int64}, Tuple{...} with 4 entries:
  [1, 1, 0, 0, 0, 0] => 3
  [1, 0, 0, 0, 0, 0] => 1
  [0, 1, 0, 0, 0, 0] => 1
  [0, 0, 0, 0, 0, 0] => 1

julia> sort(values(countmap(eachcol(M))); rev = true)
4-element Vector{Int64}:
 3
 1
 1
 1

The last expression’s first element is the count of the most popular column in the matrix. If several columns appear multiple times, the vector will reflect this.

rocco_sprmnt21 · November 7, 2023, 3:22pm

julia> m=rand(1:3,10,1000)
10×1000 Matrix{Int64}:
 2  2  2  1  3  1  3  3  3  …  3  1  2  1  3  3  2  1   
 2  1  1  2  3  3  2  3  3     1  1  2  1  3  3  1  2   
....

julia> @btime length(unique(eachcol(m)))
  69.700 μs (29 allocations: 327.73 KiB)
997

julia> @btime length(unique(hash.(eachcol(m))))
  37.500 μs (28 allocations: 79.05 KiB)
997

then the following expression It’s the correct one and it seems very fast too

julia> @btime length(Set(eachcol(m)))
  31.900 μs (9 allocations: 83.16 KiB)
997

dgsob · November 7, 2023, 5:54pm

Thank you for your comprehensive answer.

By no means is it the intended result. My initial code worked for few examples I considered, but now I see what I’m trying to do is a bit more complicated than that. However, you answer gives a good insight I have to take time to analyze.
Btw, the matrices I would work with could probably get as large as 100x100, no more.

rocco_sprmnt21 · November 7, 2023, 9:09pm

how does countmap run twice as fast as colmap?

function colmap(m)
    dd=Dict{Array{Int64}, Int64}()
    for  c in eachcol(m)
        dd[c]=get(dd,c,0)+1
    end
    dd
end

stevengj · November 7, 2023, 9:39pm

countmap uses Dict internals to avoid hashing each column twice (working around julia#24454), and in this problem hashing the columns is the dominant cost for a sufficiently large matrix.

(Indeed, since hashing is the dominant cost, you could probably speed things up further by wrapping the column views in another type that implements a custom hash function. Standard numeric hashing in Julia pays an extra cost so that different types of the same value, e.g. 1.0 vs 1 vs 1.0f0, yield the same hash. For this application you don’t need that, assuming that all of the matrix elements have the same type, i.e. the eltype is concrete. But this kind of voodoo might not be worth the trouble here.)

Topic		Replies	Views
Count occurances for matrix rows (where column order does not matter) General Usage question , count	30	858	December 13, 2022
Finding unique columns of a matrix New to Julia	3	744	March 18, 2021
Choosing only different vectors from a matrix New to Julia	8	534	September 1, 2019
Fastest way possible to find index of value equals 1 across a matrix column Performance question , performance , speed-optimization	7	1015	October 6, 2023
How do I get the indexes for matrix columns whose elements are all the same? General Usage	12	603	April 13, 2023

How to efficiently find columns of the matrix which are the same?

Related topics