I am trying to compute the Hamming distance on a very large dataset. I need a distance matrix between rows so that I can run further analysis on it.
For my purposes, it is useful that the data are stored in a DataFrame type.
The data look something like this:
    using DataFrames

    a = [1 0 1 0; 1 1 1 1; 0 0 0 0; 0 0 0 0; 0 0 0 1]
    df = convert(DataFrame, a)
    nrows = size(df, 1)
    ncols = size(df, 2)
I wrote a function in Julia that returns a distance matrix:
    function hamjulia(df)
        nrows = size(df, 1)
        ncols = size(df, 2)
        m, n = nrows, nrows
        A = fill(0, (m, n))
        for i in 1:nrows
            for k in 1:nrows
                v = 0
                for j in 1:ncols
                    if df[i,j] != df[k,j]
                        v += 1
                    end
                end
                A[i,k] = v
            end
        end
        return A
    end

    p = hamjulia(df)
    p
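For comparison, here is a minimal sketch of the same computation on a plain matrix instead of a DataFrame. It assumes the data can be converted to a matrix first (for example with Matrix(df)), and hamming_rows is a hypothetical name, not a library function. It fills only the upper triangle and mirrors it, since the distance is symmetric, and it works on the transposed data so that the inner loop reads contiguous memory. This is an untimed sketch, not a benchmarked solution.

```julia
# Hypothetical helper: Hamming distance between every pair of rows of a plain matrix.
function hamming_rows(X::AbstractMatrix)
    n, p = size(X)
    # Julia arrays are column-major, so store each original row as a column.
    Xt = permutedims(X)
    D = zeros(Int, n, n)
    # Only visit pairs with i < k; the diagonal stays zero.
    @inbounds for i in 1:n-1, k in i+1:n
        v = 0
        for j in 1:p
            v += Xt[j, i] != Xt[j, k]   # a Bool adds as 0 or 1
        end
        D[i, k] = v
        D[k, i] = v                      # mirror into the lower triangle
    end
    return D
end

a = [1 0 1 0; 1 1 1 1; 0 0 0 0; 0 0 0 0; 0 0 0 1]
D = hamming_rows(a)
```

The result should match what hamjulia returns for the same data, while doing roughly half the comparisons.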
My issue is that this code is slow compared to some R packages. For instance, when I compared this function with rdist(df, metric = 'hamming') in R, R was faster.
How can I make this code more efficient, especially since I need to run it on a very large DataFrame? I tried the Distances package, but its documentation is too sparse.
Also, does anyone know if there is an efficient code for the