Row wise median for julia dataframes

AUK1939 · November 30, 2023, 10:40pm

I think it’s worth concluding this thread with a few observations.

Defining safemedian as @aplavin did and installing Skipper.jl from source, I get the following result:

julia> using DataFrames, Statistics, Skipper, BenchmarkTools
julia> safemedian(y) = all(isnan, y) ? NaN : median(skip(isnan, y))
julia> @btime select(df, AsTable(:) => ByRow(safemedian) => "median")
  1.442 ms (30173 allocations: 4.66 MiB)
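
For anyone who would rather not install Skipper.jl, the same NaN-skipping logic can be sketched with Base and Statistics only. The name safemedian_base is made up for this illustration, and Iterators.filter stands in for Skipper's skip, so the timing above does not apply to it.

```julia
using Statistics

# Hypothetical stand-in for safemedian: Iterators.filter replaces Skipper's skip.
# Returns NaN when every value is NaN, otherwise the median of the non-NaN values.
safemedian_base(y) = all(isnan, y) ? NaN : median(Iterators.filter(!isnan, y))

# ByRow combined with AsTable hands each row over as a NamedTuple
row = (x1 = 1.0, x2 = NaN, x3 = 3.0)
safemedian_base(row)                    # median of 1.0 and 3.0
safemedian_base((x1 = NaN, x2 = NaN))   # every value is NaN, so the result is NaN
```

It should drop into the same select(df, AsTable(:) => ByRow(...)) call, though I have not benchmarked it against the Skipper version.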

The overhead from using DataFrames doesn’t seem to be massive in this case, as the following two tests show - a win for DataFrames.jl in my mind.

julia> m = Matrix(df)
julia> @btime mapslices(safemedian, m, dims=2)
  1.456 ms (30019 allocations: 4.65 MiB)

julia> using Tables  # rowtable comes from Tables.jl
julia> tbl = rowtable(df)
julia> @btime map(safemedian, tbl)
  1.353 ms (30003 allocations: 4.65 MiB)

I thought it was worth exploring the missing approach suggested by @pdeffebach since it feels the most natural (no need to define any new functions or install extra packages). This is the result I got:

df = DataFrame(rand(10000, 10), :auto)
allowmissing!(df)
df[1, :x3] = missing
df[20, [:x3, :x6]] .= missing
df[5, :] .= missing

@btime select(df, AsTable(:) => ByRow(emptymissing(median) ∘ skipmissing) => :median)
  3.568 ms (210174 allocations: 21.23 MiB)
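
To spell out what the composed function does: skipmissing drops the missing entries of a row, and emptymissing(median) returns missing instead of throwing when nothing is left. Here is a Base-only sketch of the same behaviour, using the made-up name rowmedian (it is not part of Missings.jl or DataFrames.jl):

```julia
using Statistics

# Hypothetical helper mirroring emptymissing(median) ∘ skipmissing:
# return missing for an all-missing row, otherwise the median of the non-missing values.
rowmedian(row) = all(ismissing, row) ? missing : median(skipmissing(row))

rowmedian((x1 = 1.0, x2 = missing, x3 = 3.0))   # median of 1.0 and 3.0
rowmedian((x1 = missing, x2 = missing))          # missing, like row 5 above
```

This only shows the semantics; I have not measured whether it allocates less than the composed version.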

I was actually a bit surprised about this one.

@Dan's approach came in at

julia> @btime select(df, AsTable(:) => ByRow(safemedian4) => :median)
  3.191 ms (190181 allocations: 18.09 MiB)

Probably because we’re using an O(n log n) algorithm?
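
safemedian4 itself isn't reproduced in this post, so the following is only a hypothetical illustration of what a sort-based, O(n log n) median looks like. The name sortmedian is made up. As far as I know, Statistics.median uses a partial sort internally, which is cheaper on long inputs, although with only 10 values per row the difference is probably small.

```julia
using Statistics

# Hypothetical sort-based median: the full sort costs O(n log n).
# Assumes a non-empty input.
function sortmedian(v)
    s = sort(collect(v))      # sorted copy, the input is left untouched
    n = length(s)
    mid = (n + 1) ÷ 2
    # odd length: the middle element; even length: mean of the two middle elements
    return isodd(n) ? float(s[mid]) : (s[mid] + s[mid + 1]) / 2
end

# Compare against Statistics.median on the same data
sortmedian([4.0, 1.0, 3.0, 2.0]) == median([4.0, 1.0, 3.0, 2.0])
```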

@aplavin Thanks for extending Skipper.jl to support tuples. This is great.
@pdeffebach I learnt a few things about handling missings from your code. Thanks. It’s a tricky choice because some functions like Impute.nocb work with missings and some are more performant with NaNs. I do believe missings are the future. @aplavin comments on the performance difference between NaNs and missings here. I have to say I was surprised by this result, just as I was above: so much more memory is used in the ByRow(emptymissing(median) ∘ skipmissing) solution.

What an awesome community! Thanks all

PS. Any of these solutions performs way better than pandas, by a country mile.
