Row-wise median for Julia DataFrames

I think it’s worth concluding this thread with a few observations.

  1. Defining safemedian as @aplavin did and installing Skipper.jl from source, I get the following result:
julia> using Statistics, Skipper
julia> safemedian(y) = all(isnan, y) ? NaN : median(skip(isnan, y))
julia> @btime select(df, AsTable(:) => ByRow(safemedian) => "median")
  1.442 ms (30173 allocations: 4.66 MiB)

The overhead from using DataFrames doesn’t seem to be massive in this case, as the following two tests show. That's a win for DataFrames.jl in my mind.

julia> m = Matrix(df)
julia> @btime mapslices(safemedian, m, dims=2)
  1.456 ms (30019 allocations: 4.65 MiB)

julia> tbl = rowtable(df) 
julia> @btime map(safemedian, tbl)
  1.353 ms (30003 allocations: 4.65 MiB)
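
For anyone who wants to try the NaN-skipping idea without installing Skipper.jl, here is a minimal sketch that uses only the Statistics standard library. The name `nanmedian_sketch` is my own (hypothetical), and the generator-based filter is a stand-in for `skip(isnan, y)`, so I would not expect it to reproduce the timings above.

```julia
using Statistics

# Hypothetical helper, not from the thread: returns NaN for an all-NaN row,
# otherwise the median of the non-NaN values.
nanmedian_sketch(y) = all(isnan, y) ? NaN : median(x for x in y if !isnan(x))

m = [1.0 2.0 3.0;
     NaN 4.0 6.0;
     NaN NaN NaN]

# Row-wise over a matrix, same kind of call as the mapslices test above
rowmedians = vec(mapslices(nanmedian_sketch, m, dims=2))

# Also works on a NamedTuple, which is what ByRow receives with AsTable(:)
ntmedian = nanmedian_sketch((x1 = 1.0, x2 = NaN, x3 = 3.0))
```

The all-NaN guard comes first because `median` would otherwise be called on an empty collection.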
  2. I thought it was worth exploring the missing approach suggested by @pdeffebach since it feels the most natural (no need to define any new functions or install extra packages). This is the result I got:
df = DataFrame(rand(10000, 10), :auto)
allowmissing!(df)
df[1, :x3] = missing
df[20, [:x3, :x6]] .= missing
df[5, :] .= missing

@btime select(df, AsTable(:) => ByRow(emptymissing(median) ∘ skipmissing) => :median)
  3.568 ms (210174 allocations: 21.23 MiB)

I was actually a bit surprised about this one.
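
In case the composed function looks opaque: as I understand it, `emptymissing(median) ∘ skipmissing` drops the missings from each row and returns `missing` instead of erroring when nothing is left. A rough stand-in using only Base and Statistics is below. `missmedian_sketch` is a made-up name, and it is an assumption on my part that this mirrors what Missings.jl does.

```julia
using Statistics

# Hypothetical stand-in for emptymissing(median) ∘ skipmissing
function missmedian_sketch(row)
    vals = skipmissing(row)                  # lazy iterator over non-missing values
    isempty(vals) ? missing : median(vals)   # median collects the values into a vector
end

m = [1.0     missing 3.0;
     4.0     5.0     missing;
     missing missing missing]

# One result per row; the all-missing row yields missing
rowmedians = map(missmedian_sketch, eachrow(m))
```

This runs on a plain matrix rather than a DataFrame, so it says nothing about the allocation numbers above. It only shows the per-row logic.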

  3. @Dan's approach came in at:
julia> @btime select(df, AsTable(:) => ByRow(safemedian4) => :median)
  3.191 ms (190181 allocations: 18.09 MiB)

Probably because we’re using an O(n log n) algorithm?
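
I haven't reproduced `safemedian4` here, so the following is only an illustration of the complexity point, assuming a full sort is what sits behind it. Both function names are hypothetical. The first sorts the whole row, which is O(n log n). The second uses `partialsort!`, which only puts the middle element(s) in place. With rows of 10 values, the allocation counts above (190181 vs 30173) suggest allocations may matter at least as much as the algorithm.

```julia
using Statistics

# Full-sort median: sorts a copy of the whole vector (O(n log n)).
# Assumes a non-empty vector.
function sortmedian_sketch(v::AbstractVector{<:Real})
    s = sort(v)
    n = length(s)
    isodd(n) ? float(s[(n + 1) ÷ 2]) : (s[n ÷ 2] + s[n ÷ 2 + 1]) / 2
end

# Selection-based median: only the middle position(s) are put in sorted order.
# Assumes a non-empty vector.
function selectmedian_sketch(v::AbstractVector{<:Real})
    w = copy(v)                       # partialsort! mutates its argument
    n = length(w)
    if isodd(n)
        float(partialsort!(w, (n + 1) ÷ 2))
    else
        lo, hi = partialsort!(w, (n ÷ 2):(n ÷ 2 + 1))
        (lo + hi) / 2
    end
end

v = [5.0, 1.0, 4.0, 2.0, 3.0]
same = sortmedian_sketch(v) == selectmedian_sketch(v) == median(v)
```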

@aplavin Thanks for extending Skipper.jl to support tuples. This is great.
@pdeffebach I learnt a few things about handling missings from your code. Thanks. It’s a tricky choice because some functions like Impute.nocb work with missings and some are more performant with NaNs. I do believe missings are the future. @aplavin comments on the performance between NaNs and missings here. I have to say I was surprised by this result, just as I was above: so much more memory is used in the ByRow(emptymissing(median) ∘ skipmissing) solution.

What an awesome community! Thanks all

PS. All of these solutions outperform pandas by a country mile :slight_smile:
