DataFrames.jl vcat column data from different rows

ethomag · August 16, 2022, 6:08am

I have a number of network packets with sampled data that is fragmented, and I was wondering whether de-fragmentation can be done with DataFrames or whether I have to pre-process the data first?

sort!(df, [:frame, :subframe, :seq, :sampleno])

467×6 DataFrame
 Row │   frame  subframe  seq     sampleno   nsamples   samples
     │   UInt8  UInt16    UInt16  UInt32     UInt32     Array{Int8,
─────┼─────────────────────────────────────────────────────────────
   1 │       4         7      13          0      330    [-1, 0, 144… 
   2 │       4         7      13        330       16    [0, 0, 0, 0… 
   3 │       4         8       0          0      112    [-1, 0, 144… 
   4 │       4         8       0        112      234    [16, -118, … 
   5 │       4         8       1          0      346    [-1, 0, 144…

Data with the same (frame, subframe, seq) is fragmented across several rows (packets).

So I want each group of rows with the same (frame, subframe, seq) to be collapsed into one row, with the data in the samples column concatenated.

I have read the docs and watched some tutorials, but could not find much on vector-valued column data. So I tried this:

gd = groupby(df,  [:frame, :subframe, :seq] )
dfm = combine(gd, :samples => vcat, :nsamples => sum)

But I get the same number of rows in the resulting DataFrame and the samples in samples_vcat are not concatenated. The only thing that worked as I expected is the nsamples_sum column, which seems to contain the sum of all nsamples in each group in gd.

Any help is much appreciated.

nilshg · August 16, 2022, 6:12am

julia> df = DataFrame(x = ["a", "a", "b", "b"], y = [[1,2],[3,4],[5,6],[7,8]])
4×2 DataFrame
 Row │ x       y      
     │ String  Array… 
─────┼────────────────
   1 │ a       [1, 2]
   2 │ a       [3, 4]
   3 │ b       [5, 6]
   4 │ b       [7, 8]

julia> combine(groupby(df, :x), :y => Ref ∘ (x -> reduce(vcat, x)) => :y)
2×2 DataFrame
 Row │ x       y            
     │ String  Array…       
─────┼──────────────────────
   1 │ a       [1, 2, 3, 4]
   2 │ b       [5, 6, 7, 8]

ethomag · August 16, 2022, 6:24am

Many thanks @nilshg, worked like a charm! Could you elaborate on the Ref ∘ ( ... )? Is this a standard trick for vector-valued columns?

Thanks in advance!

bkamins · August 16, 2022, 6:29am

Ref protects the result from being spread across multiple rows. Another way to do it is to wrap the output in [ ]:

julia> combine(groupby(df, :x), :y => (x -> [reduce(vcat, x)]) => :y)
2×2 DataFrame
 Row │ x       y
     │ String  Array…
─────┼──────────────────────
   1 │ a       [1, 2, 3, 4]
   2 │ b       [5, 6, 7, 8]

(in this case the vector is unwrapped, but its only element is another vector and unwrapping is not recursive)
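
To make the difference concrete, here is a minimal sketch that reuses the toy df from above. It shows why a bare vcat leaves the rows unchanged and how Ref or [ ] keeps the concatenated vector in one row. The names spread, with_ref and with_vec are only illustrative.

```julia
using DataFrames

df = DataFrame(x = ["a", "a", "b", "b"], y = [[1, 2], [3, 4], [5, 6], [7, 8]])
gd = groupby(df, :x)

# vcat receives the group's whole column as ONE argument, so it returns that
# vector of vectors unchanged, and combine spreads it over one row per element.
spread = combine(gd, :y => vcat => :y)

# reduce(vcat, ...) concatenates the inner vectors; Ref or [ ] then keeps the
# concatenated vector in a single row per group.
with_ref = combine(gd, :y => (v -> Ref(reduce(vcat, v))) => :y)
with_vec = combine(gd, :y => (v -> [reduce(vcat, v)]) => :y)
```

Here spread still has four rows, while with_ref and with_vec have one row per group.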

ethomag · August 16, 2022, 7:20am

Thanks @bkamins for your explanation, makes perfect sense!

I just encountered another issue with my de-fragmentation. I need each group in gd to be sorted according to sampleno (which indicates the start sample number in each packet), so that the resulting samples are properly ordered when concatenated.

I solved this by doing:

gd = groupby(df,  [:frame, :subframe, :seq] )

for g in gd
    sort!(g, :sampleno)
end

Is there a better or simpler way?

nilshg · August 16, 2022, 7:22am

Unless you need to preserve the order in the parent df, you could just sort that?

ethomag · August 16, 2022, 7:29am

Ah, ok so the order is guaranteed to be preserved. Great! Thanks.

nilshg · August 16, 2022, 7:31am

I think so, although @bkamins might correct me - I often get confused as to which operations are allowed to reorder things, but I think the within-group ordering of things should not get scrambled (otherwise things like calculating diffs, lags, or leads within group wouldn’t work).

ethomag · August 16, 2022, 7:49am

At least it seems so:

# df = ... (the unsorted packet table)
gd = groupby(df,  [:frame, :subframe, :seq] )
all(issorted(g, :sampleno) for g in gd)
# false

sort!(df, [:frame, :subframe, :seq, :sampleno])
gd = groupby(df,  [:frame, :subframe, :seq] )
all(issorted(g, :sampleno) for g in gd)
# true
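
Putting the pieces of this thread together, the whole de-fragmentation could be sketched as below. The column names follow the original table, but the data is a made-up miniature, and dfm is just the name used in the first post.

```julia
using DataFrames

# Hypothetical miniature of the packet table; the values are invented.
df = DataFrame(
    frame    = UInt8[4, 4, 4],
    subframe = UInt16[7, 8, 7],
    seq      = UInt16[13, 0, 13],
    sampleno = UInt32[2, 0, 0],
    nsamples = UInt32[1, 3, 2],
    samples  = [Int8[30], Int8[1, 2, 3], Int8[10, 20]],
)

# Sort the parent so that fragments inside each group are ordered by sampleno.
sort!(df, [:frame, :subframe, :seq, :sampleno])

# sort = true makes the order of the groups themselves explicit.
gd = groupby(df, [:frame, :subframe, :seq], sort = true)

# One row per (frame, subframe, seq): concatenated samples and total count.
dfm = combine(gd,
    :samples  => (s -> [reduce(vcat, s)]) => :samples,
    :nsamples => sum => :nsamples)
```

This relies on within-group row order following the order of the sorted parent, which is what the check above observes.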

bkamins · August 16, 2022, 7:20pm

Yes, within-group ordering is always preserved.

Currently, DataFrames.jl leaves row order undefined in only two cases:

1. In groupby, if you do not pass the sort kwarg, the group order is undefined (DataFrames.jl picks the fastest algorithm available, which may or may not sort the groups). Use sort=true to sort groups and sort=false to preserve order of appearance.
2. In joins (except leftjoin!), row order is undefined. We will likely add an option to define row order in future releases.
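
As a small illustration of the sort kwarg, here is a sketch on a made-up table (the column names g and k, and the result names by_key and by_first, are only for this example):

```julia
using DataFrames

df = DataFrame(g = [2, 1, 2, 1], k = [20, 11, 10, 12])

# sort = true orders the groups by key.
by_key = combine(groupby(df, :g, sort = true), :k => (v -> [collect(v)]) => :k)

# sort = false keeps the groups in order of first appearance.
by_first = combine(groupby(df, :g, sort = false), :k => (v -> [collect(v)]) => :k)
```

In both results the values inside each group keep the order they had in df (20 before 10 for g = 2); only the order of the groups differs.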

rocco_sprmnt21 · August 18, 2022, 5:30pm

You could get the same result by reversing (in a sense) the order of operations:

combine(groupby(flatten(df,:y), :x), :y=>Ref=>:y)

Perhaps it is easier to understand in this form:

combine(groupby(flatten(df,:y), :x), :y=>(x->[x])=>:y)

Or, in a less usual form:

combine(x->[flatten(x,:y).y],  groupby(df, :x))

rocco_sprmnt21 · August 19, 2022, 5:57am

I wonder whether it is possible, with this last method,

combine(f::Base.Callable, gd::GroupedDataFrame; args...)

to rename the column(s) involved, as is possible with the standard mini-language.
If it is not available, what are the options for, or the arguments against, implementing it?
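
One possible approach, sketched here rather than taken from an answer in this thread: when the function passed to combine returns a NamedTuple, its field names become the column names. The names unnamed and named are illustrative only.

```julia
using DataFrames

df = DataFrame(x = ["a", "a", "b", "b"], y = [[1, 2], [3, 4], [5, 6], [7, 8]])
gd = groupby(df, :x)

# A bare return value gets the auto-generated column name x1.
unnamed = combine(sdf -> [reduce(vcat, sdf.y)], gd)

# Returning a NamedTuple supplies the column name. The one-element outer
# vector keeps the concatenated data in a single row per group.
named = combine(gd) do sdf
    (; y = [reduce(vcat, sdf.y)])
end
```

Calling rename! on the result of the first form would be another way to get the same column name.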
