JuliaDB flatten multiple columns

I want to “flatten” a table with multiple cells that contain vectors (these have equal lengths per table-row):

t = table((group = [1,1,2,2], x = [1,2,3,4]))
t2 = groupby(t, :group, usekey = true) do k, r
    (x2 = r.x .+ k.group, y2 = rand(length(r)))
end

If I flatten it on say column x2, column y2 doesn’t get “flattened”:

julia> flatten(t2, :x2)
Table with 4 rows, 3 columns:
group  x2  y2
group  x2  y2
─────────────────────────────────
1      2   [0.154278, 0.31287]
1      3   [0.154278, 0.31287]
2      5   [0.0977201, 0.0372307]
2      6   [0.0977201, 0.0372307]

What I need is to “ungroup” table t2 by group

OK, this seems to work:

julia> merge(table.(rows(select(t2, Not(:group))))...)
Table with 4 rows, 2 columns:
x2  y2
─────────────
2   0.154278
3   0.31287
5   0.0977201
6   0.0372307

But alas, this works only if there were just two groups, cause merge works on just two tables.

OK, with foldl this actually works for any number of groups:

julia> t = table((group = [1,1,2,2,3,3], x = [1,2,3,4,5,6]))
Table with 6 rows, 2 columns:
group  x
────────
1      1
1      2
2      3
2      4
3      5
3      6

julia> t2 = groupby(t, :group, usekey = true) do k, r
           (x2 = r.x .+ k.group, y2 = rand(length(r)))
           end
Table with 3 rows, 3 columns:
group  x2      y2
────────────────────────────────────
1      [2, 3]  [0.0581229, 0.019784]
2      [5, 6]  [0.63852, 0.855233]
3      [8, 9]  [0.023636, 0.783005]

julia> foldl(merge, table.(rows(select(t2, Not(:group)))))
Table with 6 rows, 2 columns:
x2  y2
─────────────
2   0.0581229
3   0.019784
5   0.63852
6   0.855233
8   0.023636
9   0.783005

Or let your groupby produce only a single column which is a Tuple of values:

t = table((group = [1,1,2,2], x = [1,2,3,4]))
t2 = groupby(t, :group, usekey = true, flatten = true) do k, r
    (z = [(r.x[i] + k.group, rand()) for i in 1:length(r)])
end

yields:

Table with 4 rows, 3 columns:
1  2  3
──────────────
1  2  0.798469
1  3  0.301879
2  5  0.883682
2  6  0.965717

As proposed above you can return an object that iterates NamedTuples. The easiest is to use Columns, so this:

t = table((group = [1,1,2,2], x = [1,2,3,4]))
t2 = groupby(t, :group, usekey = true) do k, r
    (x2 = r.x .+ k.group, y2 = rand(length(r)))
end

Would be replaced by

t = table((group = [1,1,2,2], x = [1,2,3,4]))
t2 = groupby(t, :group, usekey = true, flatten = true) do k, r
    Columns(x2 = r.x .+ k.group, y2 = rand(length(r)))
end

That is the correct solution (for me). Awesome, thanks @piever!

BTW, why not rows, that iterates named tuples as well, no?

They return the same object: rows((x=col1, y=col2)) would also work. Either case actually returns a StructVector (a vector of structs stored as a struct of vectors). When porting IndexedTables to StructArrays, we kept the Columns name for backward compatibility (it is now an alias for StructVector). OTOH it is a bit confusing that Columns(t::NamedTuple)is the same asrows(t::NamedTuple)`, not sure what’s the correct way forward in terms of API.

1 Like