How to insert columns in sub-tables for a grouped dataframe in `DataFrames.jl`?

What I want to do is:

  • Group a data-frame by column (“col1”) in DataFrames.jl, call it grouped_df

  • Send the sub-tables from the grouped_df to a user defined function

    • The user-defined function will transform an existing column and should add a new column with relevant data

I want to transform an old column and add a new column in the same function because both action depend on basically same compute that is recoreded slightly differently.

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    
    current_scan::Float64 = calculate_current_scan(df)
    
    scan_difference::Float64 = current_scan - ref_scan
    
    df.scan = df.scan .- scan_difference

    insertcols!(df, ncol(df)+1, :shift .= scan_difference ) # `scan_difference` has already been calculated and being able to record it here would be efficient,right`
    
    return df::AbstractDataFrame
end

As you can see scan_difference is calculated for each group and is applied to the old column scan but I also want to record it for each group. Mentally, it feels like being able to do this in the same function would be the more effiecient way to do it.

But when I combine(sdf -> apply_scan_different(sdf, ref_scan), grouped_df) I get this error: MethodError: no method matching ndims(::Type{Symbol})

I tried changing the function to

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    
    current_scan::Float64 = calculate_current_scan(df)
    
    scan_difference::Float64 = current_scan - ref_scan
    
    df.scan = df.scan .- scan_difference

    insertcols!(df, ncol(df)+1, :shift => fill(scan_difference, nrow(df))) # `scan_difference` has already been calculated and being able to record it here would be efficient,right`
    
    return df::AbstractDataFrame
end

And called the function as transform(sdf -> apply_scan_different(sdf, ref_scan), grouped_df) but that produces this error

ArgumentError: Column shift is already present in the data frame which is not allowed when makeunique=true and only applies the new column data to one group (looks like the first group).

I do not fully understand what it means to set makeunique=false`.

What is the right way to do this in Julia DataFrames.jl or does this goes against the grain and is not advisable?

What should work is:

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    current_scan::Float64 = calculate_current_scan(df)
    scan_difference::Float64 = current_scan - ref_scan
    res_df = copy(df) # to dealias a data frame
    res_df.scan = df.scan .- scan_difference
    res_df.shift .= scan_difference # this assumes :shift column is not present in res_df yet
    return res_df
end

(cannot check it because your code is not runnable

1 Like

Apologies for not including a real MWE! Working under a NDA with that I still learning to navigate and am not quite there with being able to generate good relevant dummy data.

Good news is that this works!

What is somewhat surprising to me is that this is easily 20 times faster than something I thought was similar, I tried



function apply_scan_difference_old(df::AbstractDataFrame,ref_scan::Float64)
    
    current_scan::Float64 = calculate_current_scan(df)
    
    scan_difference::Float64 = current_scan - ref_scan
    
    df.scan = df.scan .- scan_difference

    df.shfit .= scan_difference 
    
    return df::AbstractDataFrame
end

@time begin
    combine(groupby(data, :File)) do df
        apply_scan_difference_old(df,ref_scan)
    end
end

Outcome : 3.047295 seconds (406.54 k allocations: 14.865 GiB, 39.50% gc time, 1.80% compilation time)

Whereas your approach with

function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    current_scan::Float64 = calculate_current_scan(df)
    scan_difference::Float64 = current_scan - ref_scan
    res_df = copy(df) # to dealias a data frame
    res_df.scan = df.scan .- scan_difference
    res_df.shift .= scan_difference # this assumes :shift column is not present in res_df yet
    return res_df
end

@time begin
    transform(sdf -> apply_scan_difference(sdf, ref_scan), grouped_df)
end

Outcome: 0.153589 seconds (309.57 k allocations: 190.494 MiB, 20.23% compilation time)

I dd not know that working with copies inside a function is faster that much faster than working with copies outside of it.

Or perhaps copy is a special implementation inside DataFrames.jl?

without copy you technically operate on a whole (much bigger) data frame. This means that: 1) it can take longer, 2) there is a risk that the outcome is just incorrect (as you might mutate parts of data frame you might not have intended to)

1 Like

try this form (non mutating insertcols)


function apply_scan_difference(df::AbstractDataFrame,ref_scan::Float64)
    current_scan::Float64 = calculate_current_scan(df)
    scan_difference::Float64 = current_scan - ref_scan
    df.scan = df.scan .- scan_difference
    insertcols(df, :shift => scan_difference) # `scan_difference` has already been calculated and being able to record it here would be efficient,right`
end
combine(sdf -> apply_scan_difference(sdf, ref_scan), groupby(data, :File))

use these expression to compare the different scripts (It seems surprising that adding a copy speeds up the operation, everything else being equal)


@benchmark combine(sdf -> apply_scan_difference(sdf, ref_scan), groupby(DF, :File))  setup=(DF=copy($data))