I have a ~1000x800,000 sparse matrix, and I would like to calculate pairwise distance metrics for the columns. The vast majority of these will have distance 1, so I thought that I might store them as similarity (1 - distance) in a sparse matrix to make the memory footprint feasible.
Unfortunately, the pairwise() function from Distances.jl gives me an out-of-memory error (it’s generating a dense matrix that’s mostly 1s), and the serial version (where I calculate the distances one at a time and add them to a sparse matrix) is painfully slow.
using SparseArrays
using Distances
using Combinatorics

function sparse_distance(mat, df=jaccard)
    n = size(mat, 2)
    amat = spzeros(n, n)
    for (i, j) in Combinatorics.combinations(1:n, 2)
        j % 100_000 == 0 && @info "progress" i j
        d = df(mat[:, i], mat[:, j])
        d == 1.0 && continue      # distance 1 means similarity 0, nothing to store
        amat[i, j] = 1.0 - d      # inserting into a CSC matrix one entry at a time is slow
    end
    amat
end
Normally in situations like this, I would do one of the following, but I don’t think either will work:
- Use threads to make things a bit more parallel - this won’t work because updating the sparse matrix isn’t thread-safe
- Use a generator to take some advantage of SIMD - I can’t figure out a way to do this that doesn’t either store every entry or calculate each distance twice, e.g.
[(jaccard(mat[:,i], mat[:,j]), i, j) for (i,j) in combinations(1:size(mat,2), 2) if jaccard(mat[:,i], mat[:,j]) < 1]
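The closest I’ve come to combining those two ideas is the sketch below (untested, and I’m not sure it’s actually faster): each thread pushes (i, j, value) triples into its own buffer, so nothing shared is mutated, each distance is computed exactly once, and the sparse matrix is built in a single `sparse(I, J, V)` call at the end instead of by incremental insertion.

```julia
using SparseArrays, Distances

# Untested sketch: one (i, j, value) buffer per thread, so no shared state is
# mutated, and each distance is computed exactly once. :static scheduling keeps
# threadid() stable for the duration of each iteration.
function sparse_distance_threaded(mat, df=jaccard)
    n = size(mat, 2)
    bufs = [Tuple{Int,Int,Float64}[] for _ in 1:Threads.nthreads()]
    Threads.@threads :static for j in 2:n
        buf = bufs[Threads.threadid()]
        for i in 1:j-1
            d = df(mat[:, i], mat[:, j])
            d == 1.0 && continue                 # similarity 0: skip
            push!(buf, (i, j, 1.0 - d))
        end
    end
    triples = reduce(vcat, bufs)
    sparse(getindex.(triples, 1), getindex.(triples, 2),
           getindex.(triples, 3), n, n)          # build the CSC in one shot
end

mat = Float64.(sprand(50, 200, 0.3) .> 0)        # toy stand-in data
amat = sparse_distance_threaded(mat)
```

(The sizes here are toy stand-ins for the real ~1000×800,000 matrix, and the `sparse_distance_threaded` name is just mine.)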
Maybe I need to just accept the slowness and/or partition the calculations and store them in files or something, but if anyone has any fancy tricks, I would love to hear them!
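For reference, by “partition the calculations” I mean something like the rough sketch below: run pairwise() on column blocks so only a blocksize×blocksize dense chunk exists at any one time, then keep just the pairs with distance < 1. The block size and the `blocked_sparse_distance` name are made up, and I haven’t benchmarked this.

```julia
using SparseArrays, Distances

# Rough sketch of the "partition the calculations" idea: call pairwise() on
# column blocks so only a small dense chunk is materialized at a time,
# then keep just the pairs with distance < 1. Block size is arbitrary.
function blocked_sparse_distance(mat, metric=Jaccard(); blocksize=100)
    n = size(mat, 2)
    I, J, V = Int[], Int[], Float64[]
    for a in 1:blocksize:n, b in a:blocksize:n
        ca = a:min(a + blocksize - 1, n)
        cb = b:min(b + blocksize - 1, n)
        # pairwise wants dense input; only this small chunk is densified
        D = pairwise(metric, Matrix(mat[:, ca]), Matrix(mat[:, cb]), dims=2)
        for (jj, j) in enumerate(cb), (ii, i) in enumerate(ca)
            i < j || continue                    # upper triangle only
            D[ii, jj] == 1.0 && continue
            push!(I, i); push!(J, j); push!(V, 1.0 - D[ii, jj])
        end
    end
    sparse(I, J, V, n, n)
end

mat = Float64.(sprand(30, 120, 0.5) .> 0)        # toy stand-in data
amat = blocked_sparse_distance(mat; blocksize=50)
```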