I have a situation where I’m keeping counts of certain visits, and in some scenarios the person visits multiple places during one trip. However, I can’t have this as part of my data, and need it to be redistributed. Let me get to the example so this doesn’t sound like complete jargon!
What I would like to do is scan all the columns that have a delimiter in it, and then redistribute those columns among existing columns in the dataframe (if the existing column exists). Here would be the desired output.
That’s a good solution, and I don’t have time to come up with a full one myself, but this seems to me like a case that might benefit from a wide-to-long transformation as it’s often painful to have relevant info encoded in column names like this. So I might start with:
julia> stack(df, names(df[:, Not(:person)]))
9×3 DataFrame
│ Row │ person │ variable │ value │
│ │ String │ Categorical… │ Int64 │
├─────┼────────┼──────────────┼───────┤
│ 1 │ bob │ london │ 1 │
│ 2 │ phil │ london │ 1 │
│ 3 │ nick │ london │ 0 │
│ 4 │ bob │ spain │ 1 │
│ 5 │ phil │ spain │ 0 │
│ 6 │ nick │ spain │ 0 │
│ 7 │ bob │ london,spain │ 1 │
│ 8 │ phil │ london,spain │ 1 │
│ 9 │ nick │ london,spain │ 1 │
function row_spread(row)
dict = Dict{String, Int}()
for (colname, val) in zip(keys(row), values(row))
if colname == :person
continue
end
for country in split(string(colname), ",")
dict[country] = get(dict, country, 0) + val
end
end
new_row = hcat(DataFrame(person = row.person), DataFrame(dict))
new_row
end
new_df = reduce(vcat, row_spread(row) for row in eachrow(df))