I have a ~1000x800,000 sparse matrix, and I would like to calculate pairwise distance metrics for the columns. The vast majority of these will have distance 1, so I thought that I might store them as similarity (1 - distance) in a sparse matrix to make the memory footprint feasible.
Unfortunately, the pairwise() function from Distances.jl gives me an out-of-memory error (it’s generating a dense matrix with lots of 1s), and the serial version (where I calculate them one-at-a-time and add them to a sparse matrix) is painfully slow.
using SparseArrays
using Distances
using Combinatorics
function sparse_distance(mat, df=jaccard)
    amat = spzeros(size(mat, 2), size(mat, 2))
    for (i, j) in Combinatorics.combinations(1:size(mat, 2), 2)
        j % 100_000 == 0 && @info i, j
        d = df(mat[:, i], mat[:, j])
        d == 1. && continue
        amat[i, j] = 1. - d
    end
    amat
end
Normally in situations like this, I would do one of the following, but I don't think they'll work here:
Use threads to make things a bit more parallel; this won't work because updating the sparse matrix isn't thread-safe.
Use a generator to take some advantage of SIMD; I can't figure out a way to do this that doesn't either store every entry or calculate each distance twice, e.g. [(jaccard(mat[:,i], mat[:,j]), i, j) for (i,j) in combinations(1:size(mat,2), 2) if jaccard(mat[:,i], mat[:,j]) < 1]
Maybe I need to just accept the slowness and/or partition the calculations and store them in files or something, but if anyone has any fancy tricks, I would love to hear them.
using SparseArrays
using Distances
using Combinatorics
using ChunkSplitters
using Base.Threads
@views function sparse_distance0(mat, df=jaccard)
    amat_total = spzeros(size(mat, 2), size(mat, 2))
    lk = ReentrantLock()
    @sync for ic in chunks(1:size(mat, 2); n=Threads.nthreads())
        @spawn begin
            amat_local = spzeros(size(mat, 2), size(mat, 2))
            for i in ic
                for j in axes(mat, 2)
                    i >= j && continue
                    j % 100_000 == 0 && @info i, j
                    d = df(mat[:, i], mat[:, j])
                    d == 1. && continue
                    amat_local[i, j] = 1. - d
                end
            end
            @lock lk amat_total += amat_local
        end
    end
    amat_total
end
I get:
julia> s0 = sparse_distance(mat); @b sparse_distance(mat) # original with @views
7.961 s (999084 allocs: 71.928 MiB, 0.03% gc time, without a warmup)
julia> println(sparse_distance0(mat) ≈ s0); @b sparse_distance0(mat)
true
1.645 s (328 allocs: 42.364 MiB, 0.08% gc time, without a warmup)
julia> Threads.nthreads()
4
PS: I don't think you get much more from SIMD there, because the computation of the distance probably already uses it.
The chunks function splits the column indices into nthreads groups, e.g. 1:100, 101:200, … (if I had 1000 columns and 10 threads).
Then the loop calculates a block that's N/nthreads x N, where N is the total number of columns. Each of these blocks can be computed in parallel because of @spawn.
The @lock lk ... allows updating the shared sparse matrix as each block finishes, by allowing only one of these updates to happen at a time?
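To check my understanding, this is roughly what chunks gives back (a quick sketch; the exact element and container types may differ between ChunkSplitters versions):

julia> using ChunkSplitters

julia> collect(chunks(1:1000; n=10))
# 10 contiguous index ranges covering 1:1000, e.g. 1:100, 101:200, …, 901:1000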
I just tried this with 10k columns, and it took a few minutes, which is definitely a huge improvement, but I’m a bit wary of scaling that up to the full size all in-memory. I think I will modify this to do smaller chunks and then save them in CSVs as i,j,v.
Do you need all distances or only nearest neighbors? Because e.g. NearestNeighbors.jl could be used to reduce the computational complexity a lot in that case.
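As a very rough sketch of what that could look like (BallTree wants a dense dims x npoints matrix, which may already be too large at 1000 x 800,000, and building the tree with Jaccard over that many points could itself be slow, so treat this as an idea rather than a drop-in solution):

using NearestNeighbors, Distances

data = Matrix(mat)                               # dense copy of the columns
tree = BallTree(data, Jaccard())                 # ball tree with the Jaccard metric
neighbors = inrange(tree, data, prevfloat(1.0))  # per column: all neighbors with distance < 1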
Wow, I didn’t expect going to files to also provide a huge speedup
using CSV   # needed for CSV.write below

@views function sparse_distance_files(mat, dist_func=jaccard; outdir="dmats")
    lk = ReentrantLock()
    @sync for ic in chunks(1:size(mat, 2); n=1000)
        filename = joinpath(outdir, "chunk_$ic.csv")
        #isfile(filename) && continue
        @spawn begin
            amat_local = spzeros(size(mat, 2), size(mat, 2))
            for i in ic
                for j in axes(mat, 2)
                    i >= j && continue
                    d = dist_func(mat[:, i], mat[:, j])
                    d == 1. && continue
                    amat_local[i, j] = 1. - d
                end
            end
            df = [(; i, j, v=amat_local[i, j]) for i in ic for j in axes(mat, 2) if amat_local[i, j] > 0]
            @lock lk CSV.write(filename, df)
        end
    end
    return nothing
end
julia> @btime sparse_distance0(cmat[:,1:5000]); # @Imiq's version
2.675 s (1329 allocations: 352.15 MiB)
julia> @btime sparse_distance_files(cmat[:,1:5000]);
872.644 ms (95241 allocations: 4.22 GiB)
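To stitch the chunk files back into a single sparse similarity matrix later, something like this should work (untested sketch; assemble_similarity and its ncols argument are just placeholder names, and it assumes each CSV has the i, j, v columns written above):

using CSV, DataFrames, SparseArrays

function assemble_similarity(outdir, ncols)
    I, J, V = Int[], Int[], Float64[]
    for file in readdir(outdir; join=true)
        endswith(file, ".csv") || continue
        chunk = CSV.read(file, DataFrame)
        append!(I, chunk.i); append!(J, chunk.j); append!(V, chunk.v)
    end
    sparse(I, J, V, ncols, ncols)
end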
I don't think you'll have memory problems with that, but you can check how it scales with the number of columns before going to the full size. The loop itself doesn't accumulate memory as the number of columns grows.
Assigning one element at a time is an incredibly expensive way to construct a sparse matrix in CSC format (what Julia uses).
It should be much more efficient to accumulate a vector of (i, j, 1-d) triplets (or three separate vectors I, J, V) — probably one for each parallel chunk and then concatenated afterwards — and then pass this to the sparse constructor at the end to get a sparse CSC matrix.
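Something along these lines, as an untested sketch that keeps the chunked threading from above (sparse_distance_triplets is just an illustrative name):

using SparseArrays, Distances, ChunkSplitters, Base.Threads

@views function sparse_distance_triplets(mat, df=jaccard)
    n = size(mat, 2)
    # one (I, J, V) triplet collector per task, so no locking is needed
    results = [(Int[], Int[], Float64[]) for _ in 1:nthreads()]
    @sync for (t, ic) in enumerate(chunks(1:n; n=nthreads()))
        @spawn begin
            Ic, Jc, Vc = results[t]   # this task's own collectors
            for i in ic, j in (i+1):n
                d = df(mat[:, i], mat[:, j])
                d == 1 && continue
                push!(Ic, i); push!(Jc, j); push!(Vc, 1 - d)
            end
        end
    end
    # concatenate the per-task triplets and build the CSC matrix in one call
    I = reduce(vcat, [r[1] for r in results])
    J = reduce(vcat, [r[2] for r in results])
    V = reduce(vcat, [r[3] for r in results])
    return sparse(I, J, V, n, n)
end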
Exact equality between floats is usually a bad idea. I would not be surprised if the condition d == 1. is never true. It's probably better to allow some small tolerance, or use isapprox (which defaults to a relative tolerance of sqrt(eps) for the element type).
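For example:

julia> 1 - 1e-9 == 1
false

julia> 1 - 1e-9 ≈ 1   # isapprox: default rtol is sqrt(eps(Float64)) ≈ 1.5e-8
true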
The 1. - d leads to very bad accuracy if d is close to 1. So if you care about very small similarities, then you should rethink your current approach and probably come up with a more accurate way of computing 1 - dist_func(x, y) to avoid the cancellation.
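For Jaccard specifically, one way around it (a rough sketch, assuming the columns are 0/1 indicator vectors extracted as SparseVectors, e.g. mat[:, i] without @views; jaccard_similarity is just an illustrative name) is to compute the similarity directly from the stored indices instead of via 1 - jaccard(x, y):

using SparseArrays

# Jaccard similarity |x ∩ y| / |x ∪ y| computed directly from the nonzero index sets,
# so tiny similarities are not lost to cancellation in 1 - d
function jaccard_similarity(x::SparseVector, y::SparseVector)
    ix, _ = findnz(x)
    iy, _ = findnz(y)
    inter = length(intersect(ix, iy))
    uni = length(ix) + length(iy) - inter
    return uni == 0 ? 0.0 : inter / uni
end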
In Julia you generally can and should just write 1; the conversion to float is handled by the compiler. It is usually worse to write 1. because that enforces Float64 and thus reduces generality: if someone calls the function with Float32 or Rational inputs, the Float64 constant would convert the whole calculation to Float64 (floats are said to be contagious). So using integer constants is preferred.
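A quick illustration of the difference:

julia> f(x) = 1 - x; g(x) = 1. - x;

julia> typeof(f(0.5f0)), typeof(g(0.5f0))   # integer literal preserves Float32; 1. promotes to Float64
(Float32, Float64)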
Your file-based implementation does not require the lock, because writes to different files don't interfere.