I am trying to compute the Hamming distance on a very large dataset. I need to get back a distance matrix between rows so that I can run further analyses on it.

For my purposes, it is useful that the data are stored in a DataFrame type.

```
using DataFrames
```

The data looks something like this

```
a = [1 0 1 0 ; 1 1 1 1; 0 0 0 0; 0 0 0 0 ; 0 0 0 1]
df = DataFrame(a, :auto);
nrows = size(df, 1)
ncols = size(df, 2)
```

I wrote a function in `Julia` that returns a distance matrix:
```
function hamjulia(df)
    nrows, ncols = size(df)
    A = fill(0, (nrows, nrows))
    for i in 1:nrows
        for k in 1:nrows
            v = 0
            for j in 1:ncols
                if df[i, j] != df[k, j]
                    v += 1
                end
            end
            A[i, k] = v
        end
    end
    return A
end
p = hamjulia(df)
```
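A large part of the slowness here is likely the cell-by-cell indexing of the `DataFrame`, which is type-unstable. As a sketch of one possible fix (the function name `hamjulia_fast` is mine, not from any package): convert to a concrete `Matrix` once, and only fill the upper triangle since the matrix is symmetric.

```julia
using DataFrames

# Sketch: convert the DataFrame to a typed Matrix once, then fill
# only the upper triangle and mirror it (Hamming distance is symmetric).
function hamjulia_fast(df::DataFrame)
    X = Matrix(df)              # concrete element type, fast indexing
    n, m = size(X)
    A = zeros(Int, n, n)
    @inbounds for i in 1:n, k in (i+1):n
        v = 0
        for j in 1:m
            v += X[i, j] != X[k, j]   # Bool adds as 0 or 1
        end
        A[i, k] = v
        A[k, i] = v             # mirror into the lower triangle
    end
    return A
end
```

The diagonal stays zero, and roughly half the comparisons are skipped compared to the double loop over all `(i, k)` pairs.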

My issue is that this code is slow compared to some R packages. For instance, `rdist(df, metric = 'hamming')` from the R `rdist` package is faster.

How can I make this code really efficient? I need to run it on a very large DataFrame. I tried the `Distances` package, but the documentation is too scarce.
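For what it is worth, `Distances.jl` does expose a `Hamming` metric through `pairwise`; a minimal sketch (assuming `dims=1` treats each row as one observation, and that the data is first materialized as a plain matrix):

```julia
using Distances

# Pairwise Hamming distances between the rows of a matrix.
# dims=1 means each row is one observation.
X = [1 0 1 0; 1 1 1 1; 0 0 0 0; 0 0 0 0; 0 0 0 1]
D = pairwise(Hamming(), X, dims=1)
```

With a DataFrame this would be `pairwise(Hamming(), Matrix(df), dims=1)`.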

Also, does anyone know of an efficient implementation of the `Needleman–Wunsch algorithm`?
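In case it helps frame the question, here is a minimal sketch of the Needleman–Wunsch global-alignment score via dynamic programming (the function name `nw_score` and the match/mismatch/gap values are illustrative assumptions, not from any package):

```julia
# Needleman–Wunsch global alignment score (dynamic programming).
# The scoring parameters below are illustrative defaults.
function nw_score(s::AbstractString, t::AbstractString;
                  match=1, mismatch=-1, gap=-1)
    n, m = length(s), length(t)
    F = zeros(Int, n + 1, m + 1)
    for i in 1:n; F[i + 1, 1] = i * gap; end   # align s against leading gaps
    for j in 1:m; F[1, j + 1] = j * gap; end   # align t against leading gaps
    for i in 1:n, j in 1:m
        diag = F[i, j] + (s[i] == t[j] ? match : mismatch)
        F[i + 1, j + 1] = max(diag, F[i, j + 1] + gap, F[i + 1, j] + gap)
    end
    return F[n + 1, m + 1]
end
```

A full aligner would also trace back through `F` to recover the alignment itself; this sketch only returns the optimal score.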

Thanks.