DataFrame transformation question

What I want to do is something like “unstack” but I do not know how to do it.

A simple example:

l = DataFrame(a = ["a", "a", "b", "b"], c = ["ZA", "ZM", "BW", "ZA"]) 
4×2 DataFrame
│ Row │ a      │ c      │
│     │ String │ String │
├─────┼────────┼────────┤
│ 1   │ a      │ ZA     │
│ 2   │ a      │ ZM     │
│ 3   │ b      │ BW     │
│ 4   │ b      │ ZA     │

I want to convert l to a dataframe that looks something like this:

| Row | x1 | x2           |
|-----+----+--------------|
|   1 | a  | ["ZA", "ZM"] |
|   2 | b  | ["BW", "ZA"] |

In the end I want to work with l[:x2].
In the real world the array in x2 will be of unpredictable length.

At the moment I am doing something very inefficient like this on a DataFrame with more than 400000 rows where z is the dataframe and values an array with the unique values that would be l[:a] in the above example and combinations would be the equivalent of l[:x2] in the example above.

 combinations = []
 
@time for v in values
         push!(combinations, Set(filter(row -> row[:v] == v, z)[:code]))
     end

There must be a more efficient way of doing this.

by(l, :a, c_values = :c => x -> [unique(x)])
2×2 DataFrame
│ Row │ a      │ c_values     │
│     │ String │ Array…       │
├─────┼────────┼──────────────┤
│ 1   │ a      │ ["ZA", "ZM"] │
│ 2   │ b      │ ["BW", "ZA"] │

Will do the trick.

5 Likes

A slightly more concise version with Query.jl would be this:

l |> @groupby(_.a) |> @map({a=key(_), cv=unique(_.c)}) |> DataFrame
2 Likes