Simple user similarity in Julia vs Python

I have a simple example in Python to get user similarities based on items they have interacted with.

import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances
item=["a", "b", "c", "a", "b", "a","b","c"]; user=[1,1,1,2,2,3,3,3]; value=[0,1,1,0,1,0,0,0]
df = pd.DataFrame({"item":item, "user":user, "value":value})
uimat = df.pivot(index="user", columns="item", values="value").fillna(0)  # keyword arguments: pandas >= 2.0 rejects positional ones here
uiusers = uimat.index
print(pd.DataFrame(euclidean_distances(uimat), index=uiusers, columns=uiusers))
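
For readers who want to see exactly what the pandas snippet above computes, here is a minimal dependency-free sketch of the same two steps: build the user-item matrix with unobserved cells defaulting to 0, then take the pairwise Euclidean distance between user rows. The names `observed`, `uimat`, and `dist` are illustrative choices, not part of any library, and `math.dist` assumes Python 3.8 or later.

```python
import math
from collections import defaultdict

item = ["a", "b", "c", "a", "b", "a", "b", "c"]
user = [1, 1, 1, 2, 2, 3, 3, 3]
value = [0, 1, 1, 0, 1, 0, 0, 0]

# Collect the observed (user, item) -> value pairs.
observed = defaultdict(dict)
for u, i, val in zip(user, item, value):
    observed[u][i] = val

users = sorted(observed)
items = sorted(set(item))

# User-item matrix as a dict of rows; combinations that were never
# observed default to 0, which mirrors the fillna(0) step.
uimat = {u: [observed[u].get(i, 0) for i in items] for u in users}

# Pairwise Euclidean distance between every pair of user rows.
dist = {
    (u1, u2): math.dist(uimat[u1], uimat[u2])
    for u1 in users
    for u2 in users
}

# Print one row of the distance matrix per user.
for u1 in users:
    print(u1, [round(dist[(u1, u2)], 3) for u2 in users])
```

In this data, user 2 never interacted with item c, so that is the only cell filled by the default.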

My equivalent Julia code feels very clunky compared with the pandas version, especially around handling missing values:

import Pkg; Pkg.activate("dataframes")
using DataFrames, Distances, AxisArrays
item=["a", "b", "c", "a", "b", "a","b","c"]; user=[1,1,1,2,2,3,3,3]; value=[0,1,1,0,1, 0,0,0]
df = DataFrame(item =item, user=user, value=value)
ui = unstack(df, :user, :item, :value)
[replace!(ui[!, col], missing => 0) for col in names(ui)];
uiusers = ui.user
uiitems = names(ui)[2:end]
uiarray = AxisArray(Matrix(ui[!,2:end]), uiusers, uiitems)
AxisArray(pairwise(Euclidean(), uiarray, dims = 1), uiusers, uiusers)

Is there something I can improve here? Maybe an alternative to switching from DataFrames to AxisArrays (my understanding is that DataFrames does not support row names)?

@bkamins: Maybe something you can help simplify? (hope you don’t mind the ping!)


There are two issues here:

  1. Replacing missing with 0 in unstack: allowing a sentinel to be specified in unstack is something I plan to add in the future. For now, if you want something shorter, you could do: df .= coalesce.(df, 0)
  2. Regarding the pairwise part: @nalimilan was recently working on improving it, but I do not remember where the PR is. He can probably comment on the status of this work.

My column-name-aware pairwise implementation currently lives in FreqTables. Before it can move, we first need to add the name-unaware version to StatsBase, find a lightweight interface package in which to define pairwise so that Distances can override it without depending on StatsBase, and then decide whether FreqTables should use a more general name.


Thank you! coalesce is definitely nicer. I will keep track of the issues with pairwise.
