Need memory and/or parallel optimization for pairwise distance metrics

I have a ~1000x800,000 sparse matrix, and I would like to calculate pairwise distance metrics for the columns. The vast majority of these will have distance 1, so I thought that I might store them as similarity (1 - distance) in a sparse matrix to make the memory footprint feasible.

Unfortunately, the pairwise() function from Distances.jl gives me an out-of-memory error (it’s generating a dense matrix with lots of 1s), and the serial version (where I calculate them one-at-a-time and add them to a sparse matrix) is painfully slow.

using SparseArrays
using Distances
using Combinatorics

function sparse_distance(mat, df=jaccard)
  amat = spzeros(size(mat, 2), size(mat, 2))

  for (i,j) in Combinatorics.combinations(1:size(mat,2), 2)
    j % 100_000 == 0 && @info i,j
    d = df(mat[:,i], mat[:,j])
    d == 1. && continue
    amat[i,j] = 1. - d
  end
  amat
end

Normally in situations like this, I would do one of the following, but I don’t think they’ll work:

  1. Use threads to make things a bit more parallel - this won’t work because updating the sparse matrix isn’t threadsafe
  2. Use a generator to take some advantage of SIMD - I can’t figure out a way to do this that doesn’t either store every entry or calculate the distance twice (e.g. [(jaccard(mat[:,i], mat[:,j]), i, j) for (i,j) in combinations(1:size(mat,2), 2) if jaccard(mat[:,i], mat[:,j]) < 1])

Maybe I need to just accept the slowness and / or partition the calculations and store them in files or something, but if anyone has any fancy tricks, I would love to hear them :slight_smile:

The first thing to do here is @views d = df(mat[:,i], mat[:,j]). Otherwise you’re copying a Vector out every single time.
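For reference, that change applied to the inner loop looks roughly like this:

for (i, j) in Combinatorics.combinations(1:size(mat, 2), 2)
    @views d = df(mat[:, i], mat[:, j])  # views avoid copying both columns on every iteration
    d == 1. && continue
    amat[i, j] = 1. - d
end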

3 Likes

This seems to work:

using SparseArrays
using Distances
using Combinatorics
using ChunkSplitters
using Base.Threads

@views function sparse_distance0(mat, df=jaccard)
    amat_total = spzeros(size(mat, 2), size(mat, 2))
    lk = ReentrantLock()
    @sync for ic in chunks(1:size(mat,2); n=Threads.nthreads()) 
        @spawn begin
            amat_local = spzeros(size(mat, 2), size(mat, 2))
            for i in ic
                for j in axes(mat, 2)
                    i >= j && continue
                    j % 100_000 == 0 && @info i, j
                    d = df(mat[:, i], mat[:, j])
                    d == 1. && continue
                    amat_local[i, j] = 1. - d
                end
            end
            @lock lk amat_total += amat_local
        end
    end
    amat_total
end

I get:

julia> s0 = sparse_distance(mat); @b sparse_distance(mat) # original with @views
7.961 s (999084 allocs: 71.928 MiB, 0.03% gc time, without a warmup)

julia> println(sparse_distance0(mat) ≈ s0); @b sparse_distance0(mat)
true
1.645 s (328 allocs: 42.364 MiB, 0.08% gc time, without a warmup)

julia> Threads.nthreads()
4

ps: I don’t think you get much more from SIMD there, because the computation of the distance is probably already using it.

ps2: I tested that with a rand(1000,1000) matrix.

1 Like

Neat! I’m inferring a bit here - are the following correct?

  1. the chunks function splits the column indices into nthreads groups, e.g. 1:100, 101:200, ... (if I had 1000 columns and 10 threads) - see the short example after this list
  2. then the loop calculates a block that’s N/threads x N, where N is the total number of columns. Each of these can be done in parallel because of @spawn
  3. The @lock lk ... allows updating the sparse matrix as the loops complete, by allowing only one of these to happen at a time?
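Regarding point 1, a quick way to see what chunks produces (illustrative only; the exact element types depend on the ChunkSplitters version):

using ChunkSplitters

# 1000 column indices split into 10 consecutive chunks,
# i.e. 1:100, 101:200, ..., 901:1000 - one chunk per task
for ic in chunks(1:1000; n=10)
    @show ic
end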

I just tried this with 10k columns, and it took a few minutes, which is definitely a huge improvement, but I’m a bit wary of scaling that up to the full size all in-memory. I think I will modify this to do smaller chunks and then save them in CSVs as i,j,v.

1 Like

Do you need all distances or only nearest neighbors? Because e.g. NearestNeighbors.jl could be used to reduce the computational complexity a lot in that case.
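For anyone unfamiliar with it, a toy sketch of the kind of queries NearestNeighbors.jl supports (tiny dense data just for illustration; I’m assuming Jaccard() is accepted as the tree metric, since it is a Distances.Metric):

using NearestNeighbors, Distances

# toy data, far smaller than the real problem: 50-dimensional 0/1 columns
X = Float64.(rand(50, 2_000) .< 0.2)

tree = BallTree(X, Jaccard())           # BallTree works with metrics from Distances.jl

idxs, dists = knn(tree, X[:, 1:10], 5)  # 5 nearest columns for the first 10 query columns
near = inrange(tree, X[:, 1:10], 0.99)  # all columns within Jaccard distance 0.99 of each query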

1 Like

Wow, I didn’t expect going to files to also provide a huge speedup

using CSV  # needed for CSV.write below

@views function sparse_distance_files(mat, dist_func=jaccard; outdir="dmats")
    mkpath(outdir)  # make sure the output directory exists
    lk = ReentrantLock()
    @sync for ic in chunks(1:size(mat,2); n=1000) 
        filename = joinpath(outdir, "chunk_$ic.csv")
        #isfile(filename) && continue
        @spawn begin
            amat_local = spzeros(size(mat, 2), size(mat, 2))
            for i in ic
                for j in axes(mat, 2)
                    i >= j && continue
                    d = dist_func(mat[:, i], mat[:, j])
                    d == 1. && continue
                    amat_local[i, j] = 1. - d
                end
            end
            df = [(; i, j, v=amat_local[i,j]) for i in ic for j in axes(mat,2) if amat_local[i,j] > 0]
            @lock lk CSV.write(filename, df)
        end
    end
    return nothing
end

julia> @btime sparse_distance0(cmat[:,1:5000]); # @Imiq's version
  2.675 s (1329 allocations: 352.15 MiB)

julia> @btime sparse_distance_files(cmat[:,1:5000]);
  872.644 ms (95241 allocations: 4.22 GiB)

Hmm, that’s a good thought. I’m not sure TBH, but good to know about this!

You also may get some additional speedup by using a different chunking strategy: Multithreading · ChunkSplitters.jl
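In case it’s useful, a hedged sketch of what that could look like here: the i < j loop is triangular, so consecutive chunks give the task holding the first columns much more work than the last one, and an interleaved split evens that out (the keyword value is version-dependent - older ChunkSplitters releases use split=:scatter, newer ones split=RoundRobin()):

# interleave column indices across tasks instead of handing out consecutive blocks
for ic in chunks(1:size(mat, 2); n=Threads.nthreads(), split=:scatter)
    # ... same body as in sparse_distance0
end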

I don’t think you’ll have memory problems with that. But you can check how it scales with the number of columns beforehand. The loop itself does not scale.

Assigning one element at a time is an incredibly expensive way to construct a sparse matrix in CSC format (what Julia uses).

It should be much more efficient to accumulate a vector of (i, j, 1-d) triplets (or three separate vectors I, J, V) — probably one for each parallel chunk and then concatenated afterwards — and then pass this to the sparse constructor at the end to get a sparse CSC matrix.
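A minimal sketch of that approach layered on the threaded version above (the function name and chunking here are illustrative and untested at the full problem size):

using SparseArrays, Distances, ChunkSplitters
using Base.Threads

@views function sparse_distance_ijv(mat, df=jaccard)
    n = size(mat, 2)
    # one task per chunk of columns; each task collects its own (i, j, v) triplets
    tasks = map(chunks(1:n; n=nthreads())) do ic
        @spawn begin
            I, J, V = Int[], Int[], Float64[]
            for i in ic, j in (i + 1):n
                d = df(mat[:, i], mat[:, j])
                d == 1 && continue
                push!(I, i); push!(J, j); push!(V, 1 - d)
            end
            (I, J, V)
        end
    end
    results = fetch.(tasks)
    # concatenate the per-task triplets and build the CSC matrix in one go
    sparse(reduce(vcat, getindex.(results, 1)),
           reduce(vcat, getindex.(results, 2)),
           reduce(vcat, getindex.(results, 3)), n, n)
end

Each task only pushes into its own vectors, so no lock is needed, and the single sparse(...) call at the end avoids the per-element insertion cost into the CSC structure.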

2 Likes

Just a few thoughts:

  • exact equality between floats is usually a bad idea. I would not be surprised if the condition d == 1. is never true. It’s probably better to allow some small tolerance or use isapprox (which uses sqrt(eps()) as its default relative tolerance) - see the short example after this list
  • the 1. - d leads to very bad accuracy if d is close to 1. So if you care about very small similarities, you should rethink your current approach and probably come up with a more accurate way of computing 1 - dist_func(x, y) to avoid the cancellation.
  • in Julia you generally can and should just write 1 and let the compiler handle the conversion to float. Writing 1. instead is usually worse because it forces Float64 and thus reduces generality: if someone calls the function with Float32 or Rational arguments, the Float64 constant converts the whole calculation to Float64 (floats are “contagious”). So integer constants are preferred.
  • your file-based implementation does not require the lock because writes to different files don’t interfere
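To illustrate the first two points, a couple of quick REPL checks (generic float behaviour, nothing specific to this code):

julia> isapprox(1 - eps(), 1)    # within isapprox's default sqrt(eps()) relative tolerance
true

julia> 1 - 0.9999999999999999    # catastrophic cancellation: only the tiny difference survives
1.1102230246251565e-16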