DataFrames.jl vcat column data from different rows

I have a number of network packets containing sampled data that is fragmented, and I was wondering whether de-fragmentation can be done with DataFrames or whether I have to pre-process the data first.

sort!(df, [:frame, :subframe, :seq, :sampleno])

467×6 DataFrame
 Row │   frame  subframe  seq     sampleno   nsamples   samples
     │   UInt8  UInt16    UInt16  UInt32     UInt32     Array{Int8,
─────┼─────────────────────────────────────────────────────────────
   1 │       4         7      13          0      330    [-1, 0, 144… 
   2 │       4         7      13        330       16    [0, 0, 0, 0… 
   3 │       4         8       0          0      112    [-1, 0, 144… 
   4 │       4         8       0        112      234    [16, -118, … 
   5 │       4         8       1          0      346    [-1, 0, 144… 

Samples with the same (frame, subframe, seq) are fragmented across several rows (packets).

So I want each group of rows with the same (frame, subframe, seq) to be collapsed into one row, with the data in the samples column concatenated.

I have read the docs and watched some tutorials, but could not find much about vector-valued column data. So I tried this:

gd = groupby(df,  [:frame, :subframe, :seq] )
dfm = combine(gd, :samples => vcat, :nsamples => sum)

But I get the same number of rows in the resulting DataFrame, and the samples in samples_vcat are not concatenated. The only thing that worked as I expected is the nsamples_sum column, which seems to contain the sum of all nsamples in each group in gd.

Any help is much appreciated.

julia> df = DataFrame(x = ["a", "a", "b", "b"], y = [[1,2],[3,4],[5,6],[7,8]])
4×2 DataFrame
 Row │ x       y      
     │ String  Array… 
─────┼────────────────
   1 │ a       [1, 2]
   2 │ a       [3, 4]
   3 │ b       [5, 6]
   4 │ b       [7, 8]

julia> combine(groupby(df, :x), :y => Ref ∘ (x -> reduce(vcat, x)) => :y)
2×2 DataFrame
 Row │ x       y            
     │ String  Array…       
─────┼──────────────────────
   1 │ a       [1, 2, 3, 4]
   2 │ b       [5, 6, 7, 8]

Many thanks @nilshg, worked like a charm! Could you elaborate on the Ref ∘ (...)? Is this a standard trick for vector-valued columns?

Thanks in advance!

Ref protects the result from being spread across multiple rows. Another way to do it is to wrap the output in [ ]:

julia> combine(groupby(df, :x), :y => (x -> [reduce(vcat, x)]) => :y)
2×2 DataFrame
 Row │ x       y
     │ String  Array…
─────┼──────────────────────
   1 │ a       [1, 2, 3, 4]
   2 │ b       [5, 6, 7, 8]

(In this case the outer vector is unwrapped, but its only element is another vector, and unwrapping is not recursive.)
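
To make the difference concrete, here is a minimal sketch using the toy df from above. It compares a plain vcat, an unwrapped reduce(vcat, ...), and the two wrapped variants. The variable names are only for illustration, and sort=true is passed to groupby just to make the group order predictable.

```julia
using DataFrames

df = DataFrame(x = ["a", "a", "b", "b"], y = [[1, 2], [3, 4], [5, 6], [7, 8]])
gd = groupby(df, :x; sort=true)

# vcat called with a single argument returns the group's vector of vectors
# as it is, so every inner vector still ends up in its own row.
unchanged = combine(gd, :y => vcat => :y)

# reduce(vcat, ...) does concatenate, but the returned vector is then
# spread across rows, one element per row.
spread = combine(gd, :y => (v -> reduce(vcat, v)) => :y)

# Wrapping the result in Ref or in a one-element vector keeps it in one row.
wrapped_ref = combine(gd, :y => Ref ∘ (v -> reduce(vcat, v)) => :y)
wrapped_vec = combine(gd, :y => (v -> [reduce(vcat, v)]) => :y)

println(nrow(unchanged), " ", nrow(spread), " ", nrow(wrapped_ref))
```

The first variant mirrors the :samples => vcat attempt in the original question, which is why the number of rows did not change there.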


Thanks @bkamins for your explanation, makes perfect sense!

I just encountered another issue with my de-fragmentation. I need each group in gd to be sorted according to sampleno (which indicates the start sample number in each packet), so that the resulting samples are properly ordered when concatenated.

I solved this by doing:

gd = groupby(df,  [:frame, :subframe, :seq] )

for g in gd
    sort!(g, :sampleno)
end

Is there a better/simpler way?

Unless you need to preserve the order in the parent df, you could just sort that?


Ah, ok so the order is guaranteed to be preserved. Great! Thanks.

I think so, although @bkamins might correct me - I often get confused as to which operations are allowed to reorder things, but I think the within-group ordering of things should not get scrambled (otherwise things like calculating diffs, lags, or leads within group wouldn’t work).


At least it seems so:

# df = ...  (the unsorted DataFrame)
gd = groupby(df,  [:frame, :subframe, :seq] )
all(issorted(g, :sampleno) for g in gd)
# false

sort!(df, [:frame, :subframe, :seq, :sampleno])
gd = groupby(df,  [:frame, :subframe, :seq] )
all(issorted(g, :sampleno) for g in gd)
# true
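
Putting the pieces together, the whole de-fragmentation could look roughly like the sketch below. The table is a made-up miniature of the packet data (the values are hypothetical, not taken from the real capture), and the nsamples column is left out for brevity.

```julia
using DataFrames

# Hypothetical, deliberately shuffled fragments; the values are invented for illustration.
df = DataFrame(frame    = UInt8[4, 4, 4, 4],
               subframe = UInt16[8, 7, 7, 8],
               seq      = UInt16[0, 13, 13, 0],
               sampleno = UInt32[2, 3, 0, 0],
               samples  = [Int8[30, 40], Int8[4], Int8[1, 2, 3], Int8[10, 20]])

# Sort the parent table once; the groups created afterwards see their rows in this order.
sort!(df, [:frame, :subframe, :seq, :sampleno])

gd = groupby(df, [:frame, :subframe, :seq]; sort=true)

# Concatenate the fragments of each group into a single row.
out = combine(gd, :samples => Ref ∘ (v -> reduce(vcat, v)) => :samples)
```

Sorting the parent once avoids the explicit loop over the groups.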

Yes, within-group ordering is always preserved.

Currently, only two things are not guaranteed in DataFrames.jl with respect to row order:

  • in groupby, if you DO NOT pass the sort kwarg, the group order is undefined (i.e. DataFrames.jl picks the fastest algorithm it has available and may or may not sort the groups); use sort=true to sort groups and sort=false to preserve order of appearance
  • in joins (except leftjoin!), row order is undefined; we will likely add an option to define row order in future releases
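
As a small illustration of the sort kwarg described above, here is a sketch on a made-up table. It assumes a DataFrames.jl version in which sort=false preserves order of appearance, as stated in the list.

```julia
using DataFrames

df = DataFrame(x = ["b", "a", "b", "a"], y = 1:4)

# sort=true: groups are ordered by key, so "a" comes before "b".
by_key = combine(groupby(df, :x; sort=true), :y => (v -> [collect(v)]) => :ys)

# sort=false: groups are kept in order of first appearance, so "b" comes first.
by_appearance = combine(groupby(df, :x; sort=false), :y => (v -> [collect(v)]) => :ys)

# In both cases the rows inside each group keep their order from df.
println(by_key)
println(by_appearance)
```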

You could get the same result by reversing (in a sense) the order of operations:

combine(groupby(flatten(df,:y), :x), :y=>Ref=>:y)

Perhaps it is easier to understand in this form:

combine(groupby(flatten(df,:y), :x), :y=>(x->[x])=>:y)

or, in a less usual form:

combine(x->[flatten(x,:y).y],  groupby(df, :x))

I wonder if it is possible with this last method

combine(f::Base.Callable, gd::GroupedDataFrame; args...)

to rename the column(s) involved, as is possible with the usual mini-language syntax.
And if it is not available, what are the possibilities and/or drawbacks of implementing it?
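
One option that, as far as I know, already works with the function-first form is to return a NamedTuple of vectors instead of a bare vector; the field names then become the column names. A minimal sketch on the toy df (the variable names are only for illustration):

```julia
using DataFrames

df = DataFrame(x = ["a", "a", "b", "b"], y = [[1, 2], [3, 4], [5, 6], [7, 8]])
gd = groupby(df, :x; sort=true)

# A bare one-element vector gives a single row per group,
# but the new column gets an auto-generated name.
auto_named = combine(sdf -> [reduce(vcat, sdf.y)], gd)

# A NamedTuple of vectors names the column explicitly (here: y).
named = combine(sdf -> (y = [reduce(vcat, sdf.y)],), gd)

# The same thing written with a do-block.
named_do = combine(gd) do sdf
    (y = [reduce(vcat, sdf.y)],)
end

println(names(named))
```

Note that every field of the returned NamedTuple has to be a vector here, because the concatenated result is wrapped in a one-element vector to keep it in a single row.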