I am trying to compute the Hamming distance on a very large dataset. I need to get back a distance matrix between rows in order to run further analysis on it.
For my purposes, it is useful that the data are stored in a DataFrame type.
I wrote a function in Julia that returns a distance matrix:
function hamjulia(df)
    nrows = size(df, 1)
    ncols = size(df, 2)
    m, n = nrows, nrows
    A = fill(0, (m, n))
    for i in 1:nrows
        for k in 1:nrows
            v = 0
            for j in 1:ncols
                if df[i, j] != df[k, j]
                    v += 1
                end
            end
            A[i, k] = v
        end
    end
    return A
end
p = hamjulia(df)
p
My issue is that this code is slow compared to some R packages. For instance, when I compared this function to rdist(df, metric = 'hamming') from the rdist package, R was faster.
How could I make this code really efficient, especially since I need to run it on a very large DataFrame? I tried the Distances package, but its documentation is too sparse for me.
Also, does anyone know if there is an efficient code for the Needleman–Wunsch algorithm?
I would first convert your dataset into an array of BitVectors. Then you can compute the Hamming distance between two BitVectors very cheaply using bitwise operations — for example, see the code posted here. Since conversion is O(n) and computing the distance matrix is O(n^2) where n is the size of your dataset, it is worth it to pay the conversion cost in order to speed up the distance computation.
(Note also that you can trivially double the speed of your code since A is symmetric — you only need to compute the Hamming distance once for each pair.)
(How large is your dataset? It can’t have that many rows or you wouldn’t be able to store the distance matrix. If you have ≤ 64 columns you can convert your data to an array of UInt64 bitstrings, which will be even faster than BitVectors.)
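To make this concrete, here is a minimal sketch of what I mean (assuming your DataFrame contains only two-valued columns, e.g. Bool or 0/1; the hamming helper below is just for illustration):

using DataFrames

# Convert each row to a BitVector once, up front (the O(n) conversion step).
rows = [BitVector(Bool.(collect(df[i, :]))) for i in 1:nrow(df)]

# Hamming distance via xor + popcount on the underlying 64-bit chunks.
# (Equal-length BitVectors keep their unused trailing bits zeroed, so no masking is needed here.)
hamming(a::BitVector, b::BitVector) =
    sum(count_ones(ac ⊻ bc) for (ac, bc) in zip(a.chunks, b.chunks))

# Fill only the upper triangle and mirror it, since the matrix is symmetric.
n = length(rows)
D = zeros(Int, n, n)
for i in 1:n, k in (i+1):n
    D[i, k] = D[k, i] = hamming(rows[i], rows[k])
end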
If you have a million rows, then your distance matrix is 10^6 × 10^6, which will require 8 terabytes of memory. Probably you need to re-think your algorithms.
The memory usage for an n × n matrix of 64-bit integers is 8n^2 bytes. If you can afford to use 16 GB of memory, that comes out to about n = 44,000 rows.
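For example, solving 8n^2 = 16×10^9 for n:

sqrt(16e9 / 8)   # ≈ 44_721, i.e. roughly 44,000 rows fit in 16 GB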
As I said, you really need to re-think your algorithms if you are relying on computing this matrix explicitly for huge datasets. Think about how you are going to use this matrix (you haven't said), e.g. whether you can compute the entries as needed, and in general how the computational and memory costs grow with n. If you are working with large data, you need to think like a computer scientist.
I did try to use this package, but I couldn't find in its documentation how to retrieve a distance matrix. The documentation of this package is sparse. But thanks.
pairwise(Hamming(…), dataframe, dims=1) should do it. You may need to convert the DataFrame with Matrix(dataframe) first. The dims=1 specifies that the calculation runs over rows.
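Something like this should work (assuming your DataFrame columns are all the same numeric or Bool type, so it converts cleanly to a matrix):

using Distances, DataFrames

X = Matrix(df)                      # observations stored in rows
D = pairwise(Hamming(), X, dims=1)  # n×n matrix of row-wise Hamming distances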
If you can’t store the whole matrix, there are also a few convenient options for an extra abstraction that can make working with the matrix feel the same. The package LazyArrays.jl implements something like this, I’m pretty sure, and KernelMatrices.jl (disclaimer: this is my project) provides exactly this abstraction, because I am constantly working with matrices that are way too large to store in RAM but whose elements I need to access in the same way as regular Matrix objects:
using KernelMatrices

# This is the optimized Hamming distance that stevengj referenced.
function hamming(A::BitArray, B::BitArray)
    # size(A) == size(B) || throw(DimensionMismatch("sizes of A and B must match"))
    Ac, Bc = A.chunks, B.chunks
    W = 0
    # xor + popcount over all full 64-bit chunks...
    for i = 1:(length(Ac)-1)
        W += count_ones(Ac[i] ⊻ Bc[i])
    end
    # ...then the last chunk, masking off its unused bits.
    W += count_ones((Ac[end] ⊻ Bc[end]) & Base._msk_end(A))
    return W
end

example_points = [BitArray(rand(Bool, 8)) for _ in 1:1000]
K = KernelMatrix(example_points, example_points, hamming)

# K[1:10, 1:10], K[3,6], K[:,10], etc. work as expected.
# But K is really just a struct that stores the function and the points,
# not a big memory buffer for the matrix elements.
As a disclaimer, KM is GPL licensed, not officially registered, and probably not as optimized as it could be. So maybe LazyArrays or some other similar package would be a better choice. I’m just not familiar with those, so take the KM example here primarily as a demonstration of this particular abstraction.
EDIT: if you think you might use this, let me know and I’ll do a pass over the source and plop in whatever low-hanging optimizations aren’t in there currently that would apply to what you’re doing. Of course I welcome contributions from anybody and would love that even more, but to some degree I mostly make improvements to the base functionality when I have an excuse to, so that’s always invited.
Thank you for this. My next step is to run clustering analysis on the distance matrix. Is this approach helpful for this? Sorry I am not a computer scientist, so I don’t think I am able to contribute to your project, but it seems really interesting.