Need memory and/or parallel optimization for pairwise distance metrics

In this case I’d recommend using a vector of BitSet as input, where each BitSet contains the indices that are set. These use only 1 bit per entry and are very fast to compare for dissimilarity (just a bitwise xor followed by count_ones, which is a single assembly instruction).

The dissimilarity would be just:

dissimilarity(set1, set2) = length(union(set1, set2)) - length(intersection(set1, set2))

That probably doesn’t quite compile down to the aforementioned xor, but it should be rather fast nonetheless. Total input size would be 1k × 800k bits = 100 MB.
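A minimal sketch of that idea (toy data; the dissimilarity helper and sizes are illustrative, not from this thread):

using Random

# one BitSet per column, holding the indices of the set bits
cols = [BitSet(randsubseq(1:1_000, 0.05)) for _ in 1:10]

# |union| - |intersection| equals |symmetric difference|, and BitSet
# computes symdiff chunk-wise with bitwise ops (essentially xor + popcount)
dissimilarity(a::BitSet, b::BitSet) = length(symdiff(a, b))

dissimilarity(cols[1], cols[2])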


Try counting the nonzero entries first.

That is, show us the output of

using SparseArrays, Distances
using ChunkSplitters: chunks
using Base.Threads: @spawn

@views function sparse_distance_count(mat, df=jaccard)
    lk = ReentrantLock()
    counter_all = Ref(0)
    @sync for ic in chunks(1:size(mat, 2); n=Threads.nthreads())
        @spawn begin
            counter = 0  # per-task counter, merged under the lock below
            for i in ic
                for j in axes(mat, 2)
                    i >= j && continue           # upper triangle only
                    j % 100_000 == 0 && @info i, j
                    d = df(mat[:, i], mat[:, j])
                    d == 1 && continue           # distance 1 means disjoint support
                    counter += 1
                end
            end
            lock(lk) do
                counter_all[] += counter
            end
        end
    end
    return nnz(mat), counter_all[]
end

together with whether the timing is acceptable to you and how much RAM you have on your laptop.
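For example, one might sanity-check it on a small random stand-in first (sizes here are made up; the real matrix is about 1k × 800k):

using SparseArrays

mat = sprand(Bool, 500, 1_000, 0.02)  # toy stand-in for the real data
@time sparse_distance_count(mat)      # (nnz, number of pairs with distance != 1)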

You can definitely stream the output to disk. To do that, you simply need to write a file once you have enough data for it. For example, you could do

using CSV
using ChunkSplitters: chunks
using Base.Threads: @spawn

@views function sparse_distance(mat, dist_func=jaccard)
    mkpath("dmats")  # make sure the output directory exists
    @sync for ic in chunks(1:size(mat, 2); n=cld(size(mat, 2), 1000))  # ~1000 columns per chunk
        suffix = 1
        # note the $(...): plain "$ic_" would look up a variable named `ic_`
        filename = "dmats/chunk_$(first(ic))_$(suffix).csv"

        isfile(filename) && continue  # restartable: skip chunks already on disk
        @spawn begin
            is = Int[]; js = Int[]; vs = Float64[]
            for i in ic
                for j in axes(mat, 2)
                    i >= j && continue
                    j % 100_000 == 0 && @info i, j
                    d = dist_func(mat[:, i], mat[:, j])
                    d == 1.0 && continue  # skip pairs with disjoint support
                    push!(is, i)
                    push!(js, j)
                    push!(vs, 1.0 - d)
                    if length(is) > 10_000_000  # flush a batch to disk
                        CSV.write(filename, (; i=is, j=js, v=vs))
                        empty!(is); empty!(js); empty!(vs)
                        suffix += 1
                        filename = "dmats/chunk_$(first(ic))_$(suffix).csv"
                    end
                end
            end
            isempty(is) || CSV.write(filename, (; i=is, j=js, v=vs))
        end
    end
    return nothing
end

Even better would be to use a streaming interface of CSV.jl, but I don’t think that exists.
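For what it’s worth, CSV.write does take an append=true keyword, so appending batches to one file can at least approximate streaming. A minimal sketch with made-up batches:

using CSV

CSV.write("out.csv", (; i=Int[], j=Int[], v=Float64[]))  # write the header once
for batch in ((; i=[1, 2], j=[3, 4], v=[0.5, 0.25]),)    # stand-in batches
    CSV.write("out.csv", batch; append=true)             # append rows, no header
end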

Not quite; per the Distances.jl README, it’s supposed to compute 1 - sum(min(x, y)) / sum(max(x, y)), and it actually does compute that if all entries are nonnegative (they are nonnegative, right?).

Jaccard should be exactly one if and only if the two vectors have disjoint support. By my superficial reading of Distances.jl, that should also hold for floating point, and the compiler has no leeway to break that property by re-associating; so exact comparison to 1 should be correct, unless somebody sneaked in a dirty @simd or @fastmath or something like that. (Floating point does guarantee that x - x == 0.0 and x + 0 == x exactly, and that’s all we need.)
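A quick toy check of that disjoint-support property:

using Distances

x = [0.0, 1.5, 2.0, 0.0]
y = [3.0, 0.0, 0.0, 0.5]  # disjoint support: no index is nonzero in both
jaccard(x, y) == 1.0      # true; sum(min.(x, y)) is exactly 0.0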

Alright, thanks for the help everyone! Here’s the final version, which kept the memory footprint from exploding. I did a little file optimization too, since the outputs were pretty large: I removed the x vector (which also saves some allocations, I guess), since each file now corresponds to a single index that can be recovered from the filename, and I rounded the floats to 4 digits to save some bytes. This is still 242 GB of files :grimacing:

using CSV
using ChunkSplitters: chunks
using Base.Threads: @spawn

@views function sparse_distance(mat, dist_func=jaccard)
    lk = ReentrantLock()
    for i in axes(mat, 2)
        filename = "dmats/chunk_$i.csv"
        isfile(filename) && continue  # restartable: skip columns already on disk
        i % 100 == 0 && @info "Column $i"
        # create file for this i; Int32/Float32 halve the size in memory
        yi, zi = Int32[], Float32[]
        @sync for jc in chunks(axes(mat, 2); n=Threads.nthreads())
            @spawn begin
                y, z = Int32[], Float32[]
                for j in jc
                    i >= j && continue
                    d = dist_func(mat[:, i], mat[:, j])
                    push!(y, j)
                    push!(z, round(1 - d; digits=4))
                end
                lock(lk) do
                    # chunks may finish in any order, so j is not sorted
                    append!(yi, y)
                    append!(zi, z)
                end
            end
        end
        CSV.write(filename, (; j=yi, v=zi))
    end
    return nothing
end
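To read a chunk back later, a hypothetical loader (the helper name is made up) can recover the column index i from the filename:

using CSV, DataFrames

# hypothetical helper: the column index i is encoded in the filename
function load_chunk(i::Integer)
    tbl = CSV.read("dmats/chunk_$i.csv", DataFrame)
    return i, tbl.j, tbl.v
end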

If you don’t need to have them ‘human-readable’, it is probably easier to use some binary storage format. JLD2.jl makes that straightforward. Just replace

CSV.write(filename, (; j=yi, v=zi))

with

JLD2.save(filename; j=yi, v=zi)

Then you don’t need to round :slight_smile:

To load the data back, you can do

j, v = JLD2.load(filename, "j", "v")

Heh, good thinking! I have a student working on the outputs who’s using Python for the downstream stuff, so this will have to do for now. We’re just working on a prototype model, but this is worth considering for when/if it works and we need to do more “production” models.

Actually, JLD2 is HDF5-compatible, so you could read the files from Python as well :slight_smile: It should take just a few minutes to figure out exactly how to extract the info using h5py.
