I think it’s worth concluding this thread with a few observations:
- Defining safemedian as @aplavin has and installing Skipper.jl from source, I get the following result:
julia> safemedian(y) = all(isnan, y) ? NaN : median(skip(isnan, y))
julia> @btime select(df, AsTable(:) => ByRow(safemedian) => "median")
1.442 ms (30173 allocations: 4.66 MiB)
The overhead from using DataFrames doesn’t seem to be massive in this case, as the following two tests show, which is a win for DataFrames.jl in my mind.
julia> m = Matrix(df)
julia> @btime mapslices(safemedian, m, dims=2)
1.456 ms (30019 allocations: 4.65 MiB)
julia> tbl = rowtable(df)
julia> @btime map(safemedian, tbl)
1.353 ms (30003 allocations: 4.65 MiB)
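For anyone who wants to try the NaN-skipping row median without installing Skipper.jl, here is a minimal sketch using only Base and the Statistics standard library. The name safemedian_base and the sample rows are made up for illustration. It swaps Skipper's skip for Iterators.filter, so it is not the function benchmarked above and the timings will not carry over.

```julia
using Statistics

# Hypothetical stand-in for safemedian: return NaN when every value is NaN,
# otherwise take the median of the non-NaN values.
safemedian_base(y) = all(isnan, y) ? NaN : median(Iterators.filter(!isnan, y))

# Rows shaped like the NamedTuples that AsTable(:) => ByRow(...) passes in.
rows = [
    (x1 = 1.0, x2 = NaN, x3 = 3.0),   # one NaN to skip
    (x1 = NaN, x2 = NaN, x3 = NaN),   # all NaN
]

medians = map(safemedian_base, rows)
```

The all(isnan, y) guard matters because median throws an error on an empty collection, which is what an all-NaN row becomes once the NaNs are filtered out.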
- I thought it was worth exploring the missing approach suggested by @pdeffebach since it feels the most natural (no need to define any new functions or install extra packages). This is the result I achieved:
df = DataFrame(rand(10000, 10), :auto)
allowmissing!(df)
df[1, :x3] = missing
df[20, [:x3, :x6]] .= missing
df[5, :] .= missing
@btime select(df, AsTable(:) => ByRow(emptymissing(median) ∘ skipmissing) => :median)
3.568 ms (210174 allocations: 21.23 MiB)
I was actually a bit surprised about this one.
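To make it clearer what the emptymissing(median) ∘ skipmissing composition does per row, here is a rough sketch written with only Base and Statistics. The helper median_or_missing is hypothetical and assumes, as I understand it, that emptymissing returns missing when the skipped collection is empty and otherwise calls the wrapped function. It is not the code from Missings.jl.

```julia
using Statistics

# Hypothetical equivalent of emptymissing(median) ∘ skipmissing for one row:
# drop the missings, then return missing if nothing is left.
function median_or_missing(row)
    vals = skipmissing(row)
    return isempty(vals) ? missing : median(vals)
end

# Rows shaped like the NamedTuples that AsTable(:) => ByRow(...) passes in.
rows = [
    (x1 = 1.0, x2 = missing, x3 = 3.0),        # one missing to skip
    (x1 = missing, x2 = missing, x3 = missing), # all missing
]

results = map(median_or_missing, rows)
```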
- @Dan’s approach came in at
julia> @btime select(df, AsTable(:) => ByRow(safemedian4) => :median)
3.191 ms (190181 allocations: 18.09 MiB)
probably because we’re using an O(n log n) algorithm?
@aplavin Thanks for extending Skipper.jl to support tuples. This is great.
@pdeffebach I learnt a few things about handling missings from your code. Thanks. It’s a tricky choice because some functions, like Impute.nocb, work with missings while others are more performant with NaNs. I do believe missings are the future. @aplavin comments on the performance difference between NaNs and missings here. I have to say I was surprised by this result, just as I was above: so much more memory is used in the ByRow(emptymissing(median) ∘ skipmissing) solution.
What an awesome community! Thanks all
PS. Every one of these solutions outperforms pandas by a country mile.