Fast hamming distance function in Julia that returns a distance matrix

giac · February 21, 2021, 1:34pm

I am trying to compute the hamming distance on a very large dataset. I need to get back a distance matrix between rows in order to run further analysis on this matrix.

For my purposes, it is useful that the data are stored in a DataFrame type.

using DataFrames

The data looks something like this

a = [1 0 1 0 ; 1 1 1 1; 0 0 0 0; 0 0 0 0 ; 0 0 0 1]

df = convert(DataFrame, a);

nrows = size(df, 1)
ncols = size(df, 2)

I made a function in Julia that returns a distance matrix

function hamjulia(df)

    nrows = size(df, 1)
    ncols = size(df, 2)
    m, n = nrows, nrows
    A = fill(0, (m, n))

    for i in 1:nrows
    for k in 1:nrows
        
        v = 0
        for j in 1:ncols
            if df[i,j] != df[k,j]
            v += 1
            end
        end
        A[i,k] = v
    end
    end
    return A
end

p = hamjulia(df)
p

My issue is that this code is slow compared to some R packages. For instance, when I compared this function to the rdist package, rdist(df, metric = 'hamming'), R is faster.

How could I make this code really efficient? Especially that I would need to run it on a very large dataframe. I tried the package Distances but the documentation is too scarce.

Also, does anyone know if there is an efficient code for the Needleman–Wunsch algorithm?

Thanks.

stevengj · February 21, 2021, 1:56pm

I would first convert your dataset into an array of BitVectors. Then you can compute the Hamming distance between two BitVectors very cheaply using bitwise operations — for example, see the code posted here. Since conversion is O(n) and computing the distance matrix is O(n^2) where n is the size of your dataset, it is worth it to pay the conversion cost in order to speed up the distance computation.

(Note also that you can trivially double the speed of your code since A is symmetric — you only need to compute the Hamming distance once for each pair.)

(How large is your dataset? It can’t have that many rows or you wouldn’t be able to store the distance matrix. If you have ≤ 64 columns you can convert your data to an array of UInt64 bitstrings, which will be even faster than BitVectors.)

giac · February 21, 2021, 2:02pm

Thanks for your reply. I will try this conversion.
What do you mean by unable to store the distance matrix? It has about a million rows.

stevengj · February 21, 2021, 2:03pm

If you have a million rows, then your distance matrix is 10^6 \times 10^6, which will require 8 terabytes of memory. Probably you need to re-think your algorithms.

giac · February 21, 2021, 2:09pm

Thanks. What do you think would be a reasonable size then with this code? 10,000? 100,000?

stevengj · February 21, 2021, 3:56pm

The memory usage for an n\times n matrix of 64-bit integers is 8n^2 bytes. If you can afford to use 16GB of memory, that comes out to about n = 44,000 rows.

As I said, you really need to re-think your algorithms if you are relying on computing this matrix explicitly for huge datasets. Think about how you are going to use this matrix (you haven’t said), e.g whether you can compute the entries as needed, and in general how the computational and memory costs grow with n. If you are working with large data, you need to think like a computer scientist.

Also, does anyone know if there is an efficient code for the Needleman–Wunsch algorithm?

Maybe look at GitHub - BioJulia/BioAlignments.jl: Sequence alignment tools

juliohm · February 21, 2021, 4:19pm

As an additional comment, please consider searching packages with this functionality already before re-inventing the wheel. For example, the widely used Distances.jl package has a Hamming distance that produces the distance as you want: GitHub - JuliaStats/Distances.jl: A Julia package for evaluating distances (metrics) between vectors.

giac · February 21, 2021, 4:37pm

I did try to use this package, but I couldn’t find in the documentation on how to retrieve a distance matrix. The documentation of this package is sparse. But thanks.

juliohm · February 21, 2021, 4:49pm

pairwise(Hamming(…), dataframe, dims=1) should do it. Maybe you need to convert the DataFrame to Matrix(dataframe). The dims=1 specifies the calculation over rows

stevengj · February 21, 2021, 5:00pm

(It doesn’t seem to have an optimized Hamming method for BitArrays; this might make a useful PR.)

cgeoga · February 21, 2021, 5:33pm

If you can’t store the whole matrix, there’s also a few convenient options for an extra abstraction that might make working with the matrix feel the same. The package LazyArrays.jl implements something like this, I’m pretty sure, and KernelMatrices.jl (disclaimer: this is my project) provides exactly this abstraction because I am constantly working with matrices that are way too large to store in RAM but whose elements I need to access in a consistent way with regular Matrix objects:

using KernelMatrices

# This is the optimmized hamming distance that stevengj referenced.
function hamming(A::BitArray, B::BitArray)
    #size(A) == size(B) || throw(DimensionMismatch("sizes of A and B must match"))
    Ac,Bc = A.chunks, B.chunks
    W = 0
    for i = 1:(length(Ac)-1)
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    W += count_ones(Ac[end] ⊻ Bc[end] & Base._msk_end(A))
    return W
end

example_points = [BitArray(rand(Bool, 8)) for _ in 1:1000]
K = KernelMatrix(example_points, example_points, hamming)

# K[1:10, 1:10], K[3,6], K[:,10], etc works as expected. 
# But K is really just a struct that stores the function and the points, 
# not a big  memory buffer for the matrix elements.

As a disclaimer, KM is GPL licensed, not officially registered, and probably not as optimized as it could be. So maybe LazyArrays or some other similar package would be a better choice. I just am not familiar with those, so maybe take the KM example here primarily as a demonstration of this particular abstraction.

EDIT: if you think you might use this, let me know and I’ll do a pass over the source and plop in whatever low-hanging optimizations aren’t in there currently that would apply to what you’re doing. Of course I welcome contributions from anybody and would love that even more, but to some degree I mostly make improvements to the base functionality when I have an excuse to, so that’s always invited.

lungben · February 21, 2021, 6:13pm

I do not know if this is the most efficient way for matrices, but for pair-wise Hamming distance this way was the fastest I could find:

giac · February 21, 2021, 6:20pm

Thank you for this. My next step is to run clustering analysis on the distance matrix. Is this approach helpful for this? Sorry I am not a computer scientist, so I don’t think I am able to contribute to your project, but it seems really interesting.