How to efficiently find columns of the matrix which are the same?

How big are your matrices? length(unique(eachcol(matrix))) is almost 500x faster than your version on my machine for a random 1000x1000 matrix (i.e. all unequal columns). (And that’s if I put your code in a function; in global scope your code is even worse.) This is the number of unique columns — I’m not entirely clear on what you are trying to compute, but the number of columns which are duplicates is size(matrix, 2) - number_unique_cols.
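
For concreteness, here is a minimal sketch of that approach (the `rand(1000, 1000)` is just a stand-in for your data):

```julia
matrix = rand(1000, 1000)    # stand-in for your data

number_unique_cols = length(unique(eachcol(matrix)))

# columns that duplicate some other column:
number_duplicate_cols = size(matrix, 2) - number_unique_cols
```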

There’s nothing wrong with loops in performance-critical Julia code; it’s just that you are using a suboptimal algorithm. The basic reason is that unique constructs a Set of columns, which uses a hash table, so it avoids comparing whole columns in the common case of unequal columns with distinct hashes. As a result, the time is roughly linear in the number of columns, rather than quadratic as in your algorithm. (Also, it uses column views rather than copied slices, and it avoids allocating unnecessary intermediate arrays like the result of your .== operation.)
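
A rough sketch of the same idea (not Base’s actual unique implementation) is:

```julia
# A Set hashes each column view once, and whole-column comparisons only
# happen when two columns land in the same hash bucket, so the expected
# time is roughly proportional to the number of columns.
count_unique_cols(A::AbstractMatrix) = length(Set(eachcol(A)))
```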

You can also use size(unique(matrix, dims=2), 2), but that is slower because it allocates a whole new matrix to store the result of unique, rather than an array of views as in unique(eachcol(matrix)).
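
If you want to compare the two variants on your own matrices, something like this should do it (assuming you have BenchmarkTools installed; exact timings will differ):

```julia
using BenchmarkTools

A = rand(1000, 1000)

@btime length(unique(eachcol($A)))   # stores an array of column views
@btime size(unique($A, dims=2), 2)   # allocates a whole new matrix
```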

unique does this for you. (If you implement this strategy yourself, you have to be wary of accidental hash collisions.)
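
For illustration only, a hypothetical hand-rolled version of that strategy might look like the sketch below: columns with equal hashes are compared in full, so an accidental collision is never miscounted as a duplicate. Again, unique already handles this for you.

```julia
function count_unique_cols_by_hash(A::AbstractMatrix)
    buckets = Dict{UInt, Vector{Int}}()   # column hash => indices seen so far
    nunique = 0
    for (j, col) in enumerate(eachcol(A))
        idxs = get!(buckets, hash(col), Int[])
        if !any(i -> view(A, :, i) == col, idxs)
            nunique += 1                  # no earlier column really matches
        end
        push!(idxs, j)
    end
    return nunique
end
```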

See also Finding unique columns of a matrix

PS. Note that you can compare two columns of a matrix for equality with A[:,i] == A[:,j] rather than using all(A[:,i] .== A[:,j]), which allocates an array of booleans (from .==) and then checks that they are all true. And put @views in front (or better yet, on a whole block of code, like a whole function) to make slices return views rather than copies.
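
Putting those two tips together in a toy example (the cols_equal name is just for illustration):

```julia
A = rand(5, 4)
A[:, 3] = A[:, 1]            # make columns 1 and 3 equal (toy example)

A[:, 1] == A[:, 3]           # true, but each A[:, i] slice copies its column
@views A[:, 1] == A[:, 3]    # same comparison on views, no copies

# or annotate a whole function so every slice inside it is a view:
@views function cols_equal(A, i, j)
    return A[:, i] == A[:, j]
end
```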
