Efficient way to add a value of column to all rows conditioned on another column by group

My data has 3 columns (id, time, conc). I am looking for a concise way to add the column (using Chain.jl and DataFramesMeta.jl) as described in the title. Example code is below:

df = DataFrame(id = sort(repeat([1, 2, 3], 10)),
                time = repeat(0:10:90, 3),
                conc = rand(30))

@chain df begin
    @aside concs_20 = @chain _ begin
        @rsubset :time == 20
        @select :id :conc_20hr = :conc
    end
    leftjoin(_, concs_20, on=[:id])
end

For example, in R, I can do the task concisely (using the library data.table) as follows:

df = data.table(id = sort(rep(c(1, 2, 3), 10)),
               time = rep(seq(0,90,10), 3),
               conc = rnorm(30))
df[, ":="(conc_20hr = conc[time == 20]), by=.(id)]

This is an area of active development in DataFramesMeta, and I wish there were a better way of doing it.

The way to do the conditional transformation is

julia> @chain df begin 
           @rtransform :conc_20hr = :time == 20 ? :conc : missing
       end

but it looks like you also want to “spread” the result within the group

julia> @chain df begin 
           groupby(:id)
           @transform :conc_20hr = first(:conc[:time .== 20])
       end
1 Like

The last example matches what OP has written in data.table. I would just make a small substitution of first to only as it is safer (you are sure there is only one match for :time .== 20)

1 Like

Actually your leftjoin solution might be more efficient (if you don’t (want to) change the transform part):

df = DataFrame(id = sort(repeat(1:300000, 10)),
                  time = repeat(0:10:90, 300000),
                  conc = rand(3000000))

 @btime transform(groupby(df,1), [:time, :conc]=> (x,y)->first(y[x .== 20]))
  75.705 ms (900817 allocations: 302.45 MiB)

 @btime leftjoin(df, df[df.time .== 20, [1,3]], on = [:id], makeunique = true)
  59.999 ms (345 allocations: 175.26 MiB)
1 Like