How to insert columns in sub-tables for a grouped dataframe in `DataFrames.jl`?

anon56682678 · February 4, 2024, 9:42pm

What I want to do is:

Group a data-frame by column (“col1”) in DataFrames.jl, call it grouped_df
Send the sub-tables from the grouped_df to a user defined function
- The user-defined function will transform an existing column and should add a new column with relevant data

I want to transform an old column and add a new column in the same function because both action depend on basically same compute that is recoreded slightly differently.

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    
    current_scan::Float64 = calculate_current_scan(df)
    
    scan_difference::Float64 = current_scan - ref_scan
    
    df.scan = df.scan .- scan_difference

    insertcols!(df, ncol(df)+1, :shift .= scan_difference ) # `scan_difference` has already been calculated and being able to record it here would be efficient,right`
    
    return df::AbstractDataFrame
end

As you can see scan_difference is calculated for each group and is applied to the old column scan but I also want to record it for each group. Mentally, it feels like being able to do this in the same function would be the more effiecient way to do it.

But when I combine(sdf -> apply_scan_different(sdf, ref_scan), grouped_df) I get this error: MethodError: no method matching ndims(::Type{Symbol})

I tried changing the function to

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    
    current_scan::Float64 = calculate_current_scan(df)
    
    scan_difference::Float64 = current_scan - ref_scan
    
    df.scan = df.scan .- scan_difference

    insertcols!(df, ncol(df)+1, :shift => fill(scan_difference, nrow(df))) # `scan_difference` has already been calculated and being able to record it here would be efficient,right`
    
    return df::AbstractDataFrame
end

And called the function as transform(sdf -> apply_scan_different(sdf, ref_scan), grouped_df) but that produces this error

ArgumentError: Column shift is already present in the data frame which is not allowed when makeunique=true and only applies the new column data to one group (looks like the first group).

I do not fully understand what it means to set makeunique=false`.

What is the right way to do this in Julia DataFrames.jl or does this goes against the grain and is not advisable?

bkamins · February 4, 2024, 10:29pm

What should work is:

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    current_scan::Float64 = calculate_current_scan(df)
    scan_difference::Float64 = current_scan - ref_scan
    res_df = copy(df) # to dealias a data frame
    res_df.scan = df.scan .- scan_difference
    res_df.shift .= scan_difference # this assumes :shift column is not present in res_df yet
    return res_df
end

(cannot check it because your code is not runnable

anon56682678 · February 4, 2024, 10:50pm

Apologies for not including a real MWE! Working under a NDA with that I still learning to navigate and am not quite there with being able to generate good relevant dummy data.

Good news is that this works!

What is somewhat surprising to me is that this is easily 20 times faster than something I thought was similar, I tried



function apply_scan_difference_old(df::AbstractDataFrame,ref_scan::Float64)
    
    current_scan::Float64 = calculate_current_scan(df)
    
    scan_difference::Float64 = current_scan - ref_scan
    
    df.scan = df.scan .- scan_difference

    df.shfit .= scan_difference 
    
    return df::AbstractDataFrame
end

@time begin
    combine(groupby(data, :File)) do df
        apply_scan_difference_old(df,ref_scan)
    end
end

Outcome : 3.047295 seconds (406.54 k allocations: 14.865 GiB, 39.50% gc time, 1.80% compilation time)

Whereas your approach with

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    current_scan::Float64 = calculate_current_scan(df)
    scan_difference::Float64 = current_scan - ref_scan
    res_df = copy(df) # to dealias a data frame
    res_df.scan = df.scan .- scan_difference
    res_df.shift .= scan_difference # this assumes :shift column is not present in res_df yet
    return res_df
end

@time begin
    transform(sdf -> apply_scan_difference(sdf, ref_scan), grouped_df)
end

Outcome: 0.153589 seconds (309.57 k allocations: 190.494 MiB, 20.23% compilation time)

I dd not know that working with copies inside a function is faster that much faster than working with copies outside of it.

Or perhaps copy is a special implementation inside DataFrames.jl?

bkamins · February 5, 2024, 7:01am

without copy you technically operate on a whole (much bigger) data frame. This means that: 1) it can take longer, 2) there is a risk that the outcome is just incorrect (as you might mutate parts of data frame you might not have intended to)

rocco_sprmnt21 · February 5, 2024, 6:49pm

try this form (non mutating insertcols)


function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    current_scan::Float64 = calculate_current_scan(df)
    scan_difference::Float64 = current_scan - ref_scan
    df.scan = df.scan .- scan_difference
    insertcols(df, :shift => scan_difference) # `scan_difference` has already been calculated and being able to record it here would be efficient,right`
end
combine(sdf -> apply_scan_difference(sdf, ref_scan), groupby(data, :File))

use these expression to compare the different scripts (It seems surprising that adding a copy speeds up the operation, everything else being equal)


@benchmark combine(sdf -> apply_scan_difference(sdf, ref_scan), groupby(DF, :File))  setup=(DF=copy($data))

Topic		Replies	Views
Most efficient way to add new columns in each SubDataFrame of a GroupDataFrame Performance question , dataframes	6	676	October 27, 2022
Complex DataFrames.jl transform by row Data question , dataframes	5	279	April 23, 2024
Easier way to split-apply-combine in DataFrames.jl General Usage dataframes	5	1099	December 14, 2020
Group DataFrames by a function of a column Data package	4	1196	December 11, 2019
Transform in DataFrames General Usage dataframes	13	395	January 21, 2024

How to insert columns in sub-tables for a grouped dataframe in `DataFrames.jl`?

Related topics