I have a simple example in Python that computes user similarities based on the items they have interacted with:
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances
item=["a", "b", "c", "a", "b", "a","b","c"]; user=[1,1,1,2,2,3,3,3]; value=[0,1,1,0,1,0,0,0]
df = pd.DataFrame({"item":item, "user":user, "value":value})
uimat = df.pivot(index="user", columns="item", values="value").fillna(0)
uiusers = uimat.index
print(pd.DataFrame(euclidean_distances(uimat), index=uiusers, columns=uiusers))
My equivalent Julia code feels very clunky in comparison, especially around handling missing values:
import Pkg; Pkg.activate("dataframes")
using DataFrames, Distances, AxisArrays
item=["a", "b", "c", "a", "b", "a","b","c"]; user=[1,1,1,2,2,3,3,3]; value=[0,1,1,0,1, 0,0,0]
df = DataFrame(item =item, user=user, value=value)
ui = unstack(df, :user, :item, :value)
[replace!(ui[!, col], missing => 0) for col in names(ui)];
disallowmissing!(ui);
uiusers = ui.user
uiitems = names(ui)[2:end]
uiarray = AxisArray(Matrix(ui[!,2:end]), uiusers, uiitems)
AxisArray(pairwise(Euclidean(), uiarray, dims = 1), uiusers, uiusers)
Is there something I can improve here? In particular, is there an alternative to switching from DataFrames to AxisArrays (my understanding is that DataFrames does not support row names)?
Thanks!