How would I remove a column from a dataframe by distributing its values among existing columns?

Hi how’s it going?

I have a situation where I’m keeping counts of certain visits, and in some scenarios the person visits multiple places during one trip. However, I can’t have this as part of my data, and need it to be redistributed. Let me get to the example so this doesn’t sound like complete jargon!

df = DataFrame(:person=>["bob","phil","nick"],:london=>[1,1,0],:spain=>[1,0,0],Symbol("london,spain")=>[1,1,1])

Screenshot from 2020-08-04 11-09-10

What I would like to do is scan all the columns that have a delimiter in it, and then redistribute those columns among existing columns in the dataframe (if the existing column exists). Here would be the desired output.

Screenshot from 2020-08-04 11-10-58

Thank you!

This is a tough question! Here is one solution

julia> df = DataFrame(:person=>["bob","phil","nick"],:london=>[1,1,0],:spain=>[1,0,0],Symbol("london,spain")=>[1,1,1]);

julia> function fix_name(df)
       for name in names(df)
           if occursin(",", name)
               vals = df[!, name]
               inner_names = split(name, ",")
                   for n in inner_names
                      df[!, n] = df[!, n] .+ vals
                   end
               select!(df, Not(name))
               return(df)
           end
       end
        return nothing
       end;

julia> let df = df
       while df != nothing
       df = fix_name(df)
       end
       end

julia> df
3×3 DataFrame
│ Row │ person │ london │ spain │
│     │ String │ Int64  │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ bob    │ 2      │ 2     │
│ 2   │ phil   │ 2      │ 1     │
│ 3   │ nick   │ 1      │ 1     │

That’s a good solution, and I don’t have time to come up with a full one myself, but this seems to me like a case that might benefit from a wide-to-long transformation as it’s often painful to have relevant info encoded in column names like this. So I might start with:

julia> stack(df, names(df[:, Not(:person)]))
9×3 DataFrame
│ Row │ person │ variable     │ value │
│     │ String │ Categorical… │ Int64 │
├─────┼────────┼──────────────┼───────┤
│ 1   │ bob    │ london       │ 1     │
│ 2   │ phil   │ london       │ 1     │
│ 3   │ nick   │ london       │ 0     │
│ 4   │ bob    │ spain        │ 1     │
│ 5   │ phil   │ spain        │ 0     │
│ 6   │ nick   │ spain        │ 0     │
│ 7   │ bob    │ london,spain │ 1     │
│ 8   │ phil   │ london,spain │ 1     │
│ 9   │ nick   │ london,spain │ 1     │

and go from there

2 Likes
function row_spread(row)
    dict = Dict{String, Int}()
    for (colname, val) in zip(keys(row), values(row))
        if colname == :person
            continue
        end
        for country in split(string(colname), ",")
            dict[country] = get(dict, country, 0) + val
        end
    end
    new_row = hcat(DataFrame(person = row.person), DataFrame(dict))
    new_row
end

new_df = reduce(vcat, row_spread(row) for row in eachrow(df))

Thanks a lot for the replies guys.