Row wise median for julia dataframes

I’ve been advised to ask this here vs stackoverflow

I want to compute the median values of all rows in a dataframe. Some columns contain NaN values. Some rows even have all NaN values. The problem with median is

  1. if there’s any NaN values in a vector it returns NaN. In this case I would like to skip NaNs (like in Pandas).
  2. it is undefined for empty vectors (throws an error). In this case I want to return a NaN (like in Pandas)

I came up with the following solution:

df = DataFrame(rand(100, 10), :auto) 
df[1, :x3] = NaN
df[20, [:x3, :x6]] .= NaN
df[5, :] .= NaN

safemedian(y) = all(isnan.(y)) ? NaN : median(filter(!isnan, y))
x = select(df, AsTable(:) => ByRow(safemedian∘collect) => "median")

This works however it’s rather slow.

Question 1) Is there a way to speed this up?

I think the collect method is causing the sluggish performance. But I need to use the collect method otherwise I get an error:

safemedian(y) = all(isnan.(y)) ? NaN : median(filter(!isnan, y))
x = select(df, AsTable(:) => ByRow(safemedian) => "median")

# results in ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved

This is because AsTable(:) passes each row a named tuple.

Question 2) Is there a way to pass rows as vectors instead?

This way I could pass the row to any function that expects a vector (for example the nanmedian function from the NaNStatistics.jl Package). Note I would not need to use the collect method if the AsVector(:) method was implemented (see here). Unfortunately it didn’t get the go ahead and I’m not sure what the alternative is.

Question 3) This one is more philisophical. Coming from Python/Pandas some operations in Julia are hard to figure out. Pandas for example handles NaNs/None values seemlessly (for better or worse). In Julia I artificially replace the missing values in my dataframe using mapcols!(x -> coalesce.(x, NaN), df). This is because many package functions (and functions I’ve written) are implemented for AbstractArray{T} where {T<:Real} and not AbstractArray{Union{T, Missing}} where {T<:Real} (ie. they don’t propagate missings). But since there is no skipnan yet and there is a skipmissing function in Julia, I’m thinking I’ve got it all wrong. Is the idiomatic way to keep missing values in Julia and handle them where appropriate? Or is it ok to use NaN’s (and keep the type fixed as say Float64)?

Welcome to Julia!

I would suggest using missing instead of NaN. Julia provides lots of tools to help you navigate missing.

That said, what functions are throwing errors with missing? There is the skipmissing function to help you navigate those situations. You would write f(skipmissing(x)) if f doesn’t work when x contains missing values.

skipmissing doesn’t work in all instances, of course, and we still have more work to do to make it easy to use.

There is also passmissing for returning missing for functions that don’t accept missing as an input. passmissing(f)(x, missing) will return missing.

For your particular question, there is also emptymissing, which has the behavior you request, which is to return missing when all the values are missing.

No collect required. Though it is pretty clunky.

julia> df = DataFrame(a = [1, missing], b = [missing, missing])
2×2 DataFrame
 Row │ a        b
     │ Int64?   Missing
─────┼──────────────────
   1 │       1  missing
   2 │ missing  missing

julia> transform(df, AsTable(:) => ByRow(t -> emptymissing(median)(skipmissing(t))) => :m)
2×3 DataFrame
 Row │ a        b        m
     │ Int64?   Missing  Float64?
─────┼─────────────────────────────
   1 │       1  missing        1.0
   2 │ missing  missing  missing

DataFramesMeta.jl might make the syntax look a little nicer

julia> @rtransform df :m = emptymissing(median)(skipmissing(AsTable(:)))
2×3 DataFrame
 Row │ a        b        m
     │ Int64?   Missing  Float64?
─────┼─────────────────────────────
   1 │       1  missing        1.0
   2 │ missing  missing  missing

This is a scenario where we could definitely provide better syntax.

4 Likes

Yes to both!
It is idiomatic to use missing, and they are fundamentally more general and uniform. However, NaNs have clear advantages as well: better interoperability with other Julia code (this you mentioned also) or external code; better performance, differences can be huge aside from the simplest situations.

There’s the Skipper.jl package (and there is dedicated SkipNaN.jl as well):

using Skipper

skip(isnan, X)  # like skipmissing(X), but skips NaNs

Even minimal changes (all without broadcasting + skip instead of filter) should make the function faster:

safemedian(y) = all(isnan, y)) ? NaN : median(skip(isnan, y))
1 Like

This avoids explicit collect and speeds up a bit

julia> safemedian1(y) = all(isnan, y) ? NaN : median(filter(!isnan, collect(y)))
safemedian1 (generic function with 1 method)       


# times on my laptop:

julia> @btime x = select(df, AsTable(:) => ByRow(safemedian∘collect) => "median");
  43.900 μs (768 allocations: 63.70 KiB)

julia> @btime x = select(df, AsTable(:) => x->map(safemedian1,Tables.namedtupleiterator(x)));
  31.000 μs (442 allocations: 50.31 KiB)



julia> safemedian2(y) = all(isnan, y) ? NaN : median(v for v in y if !isnan(v))
safemedian2 (generic function with 1 method)     

julia> @btime x = select(df, AsTable(:) => x->map(safemedian2,Tables.namedtupleiterator(x)));     
  40.900 μs (540 allocations: 60.84 KiB)

or

mdf=Matrix(df)

safemedian2(y) = all(isnan, y) ? NaN : median(filter(!isnan, y))

julia> @btime safemedian2.(eachrow(mdf))
  15.800 μs (301 allocations: 42.67 KiB)
100-element Vector{Float64}:
   0.4007199069216456
   0.5676756235683315
   0.31085566888050437
   0.7303039504838529
 NaN
   0.49296025156614276
...
1 Like

Thanks. This is fast (incidentally way faster than the pandas equivalent). To make it slightly less verbose I wrote

transform(df, AsTable(:) => ByRow(emptymissing(median) ∘ skipmissing) => :m)

Given @aplavin 's advice I think i’ll stick with NaNs for now. Purely out of laziness on my side to have to rewrite my code (in particular not being fully aware of what the benefit might be?) and because I don’t know how to proceed with certain package functions such as nanmedian (the following will not work)

transform(df, AsTable(:) => ByRow(emptymissing(nanmedian) ∘ skipmissing) => :m)

That’s just wierd to me median seems to work fine with NamedTuples but nanmedian does not. I guess median’s been written for more general iterators.

1 Like

Thanks for sharing your package. Definitely worth knowing about this one.

Also appreciate your advice about both being idiomatic approaches! Good to have the flexibility

Unfortunately even with the help of skipna - I don’t see how we can avoid the collect method in this case

Thanks - funny enough I found the alternative approaches become more competitive as the dataframe grows. I guess we get penalized by the matrix copy constructor.

It’s a shame we have to copy to call any version of safemedian (whether through list comprehension or collect). Apologies if this statement is incorrect and I didn’t understand your code!

To summarize so far - some great solutions here. However I can’t help but think it would nice to be able to call such a simple vectorized operation on the DataFrame rows without the need to copy each row (through collect or list comprehension) like in the solutions provided so far in this thread.

The exception is @pdeffebach 's approach which works for missings but not NaN’s.

It seems like AsVector(:) would have been the right tool for the job. @bkamins Am I right to think that?

If you are accepting larger initial compilation time then this should be faster on the original input:

safemedian3(y...) = all(isnan, y) ? NaN : median(x for x in y if !isnan(x))
select(df, All() => ByRow(safemedian3) => "median")
1 Like

In my findings this is actually the slowest solution

I guess my intuition is off. I’m the only one that seems to think that there should be a way to do row wise operations like this without copying (because we need to deal with NamedTuples which aren’t compatible with many vectorized functions).

Like when computing the row wise mean no copying occurs and it is blazingly fast

julia> df = DataFrame(rand(10000, 10), :auto)
julia> @btime select(df, AsTable(:) => ByRow(mean) => "mean")
34.000 μs (166 allocations: 87.16 KiB)

Both are O(n) operations so I would have expected similar performance. When I compute the median (with out any nan handling) it is much slower (but hey I don’t really know how median is implemented it could be O(nlogn) for all I know).

julia> @btime select(df, AsTable(:) => ByRow(median) => "median")
 957.000 μs (10181 allocations: 1.46 MiB)

Anyway there’s a big hit for the additional copying if we use any of the safemedian functions. here is one provided by @rocco_sprmnt21

julia> safemedian2(y) = all(isnan, y) ? NaN : median((x for x in y if !isnan(x)))
julia> @btime select(df, AsTable(:) => ByRow(safemedian2) => "median")
 1.604 ms (40173 allocations: 5.27 MiB)

And here’s the version with safemedian3

julia> safemedian3(y...) = all(isnan, y) ? NaN : median(x for x in y if !isnan(x))
julia> @btime select(df, All() => ByRow(safemedian3) => "median")
 4.003 ms (180678 allocations: 9.73 MiB)

I benchmarked it and it was 2x faster than your original solution. Can you please show your benchmarks?

edited the original post with my findings

Ah - I benchmarked against your original code. Yes - what @rocco_sprmnt21 proposed is faster. Regarding allocations median internally does 2 allocations on each call (independent of DataFrames.jl), so you are going to have allocations anyway.

median calculation is especially hard to do without allocation (it is similar to sorting). Here is a variation which reduced allocations on my machine:

function safemedian4(t)
    st = sort(t)
    st[1] == NaN && return NaN
    i = length(t)
    while st[i] == NaN ; i-=1 ; end
    iseven(i) ? st[(i÷2)] : (st[i÷2]+st[1+(i÷2)])*0.5
end

and to get the desired column:

select(df, AsTable(All()) => ByRow(safemedian4) => "median")

It benchmarked faster than some previous solutions on my machine (the exact ordering may depend on df size).

Yes, this is why I recommend using missing instead of NaN. There are a lot of convenience functions to deal with missing but not NaN. But yes there may be performance benefits in rare instances (it doesn’t look like you are benchmarking NaN vs missing at the moment though).

1 Like

for very large tables (>10^5 col) with the use of the collect function.

using DataStructures    
mid(nn)=isodd(nn) ? div(10-nn,2)+1 : div(10-nn,2):div(10-nn,2)+1 
function safemed(r)
    nn=count(e->isequal(e,NaN),r)
    nn==length(r)&&return NaN
    middle(nsmallest(div(10-nn,2)+1,r)[mid(nn)]...)
end

select(df, AsTable(:) => ByRow(safemed∘collect) => "median")

Otherwise

ERROR: ArgumentError: input table too wide (100000 columns) to convert to `NamedTuple` of `AbstractVector`s
Stacktrace:

I also started to rebuild a mediaskipnan, using sortperm though.
Thank you. This way I save myself the trouble of completing it :grin:

PS
sortperm however sends the NaNs to the bottom(not to the top of the list)

Well, there’s no fundamental reason to need collect here – you can do

safemedian(y) = all(isnan, y) ? NaN : median(skip(isnan, y))

... => ByRow(safemedian) => ...

(need recent Skipper version, earlier only supported arrays and not tuples).
Then the only allocations in safemedian are those in median itself: it does need to copy the data.

Also note that a significant part of the runtime in your tests is just the overhead from DataFrames. You get less/none overhead if you use basic Julian tables:

using Tables
tbl = rowtable(df)  # or create from scratch, it's just an array of namedtuples - no dependencies needed at all

x = map(safemedian, tbl)  # also shorter btw :)

(it doesn’t mean that all operations are slower in dataframes and faster with regular tables)

1 Like

Think it’s worth concluding this thread with a few observations

  1. Defining safemedian as @aplavin has and installing Skipper.jl from source I get the following result
julia>  safemedian(y) = all(isnan, y) ? NaN : median(skip(isnan, y))
julia> @btime select(df, AsTable(:) => ByRow(safemedian) => "median")
  1.442 ms (30173 allocations: 4.66 MiB)

The overhead from using DataFrames dosn’t seem to be massive in this case as the following 2 tests show - a win for DataFrames.jl in my mind.

julia> m = Matrix(df)
julia> @btime mapslices(safemedian, m, dims=2)
  1.456 ms (30019 allocations: 4.65 MiB)

julia> tbl = rowtable(df) 
julia> @btime map(safemedian, tbl)
  1.353 ms (30003 allocations: 4.65 MiB)
  1. I thought it was worth exploring the missing approach suggested by @pdeffebach since it feels the most natural (don’t need to define any new functions or install extra packages). This is the result I achieved
df = DataFrame(rand(10000, 10), :auto)
allowmissing!(df)
df[1, :x3] = missing
df[20, [:x3, :x6]] .= missing
df[5, :] .= missing

@btime select(df, AsTable(:) => ByRow(emptymissing(median) ∘ skipmissing) => :median)
  3.568 ms (210174 allocations: 21.23 MiB)

I was actually a bit surprised about this one.

  1. @Dan 's approach came in at
julia> @btime select(df, AsTable(:) => ByRow(safemedian4) => :median)
  3.191 ms (190181 allocations: 18.09 MiB)

probably because we’re using an O(nlogn) algo?

@aplavin Thanks for extending Skipper.jl to support tuples. This is great.
@pdeffebach I learnt a few things about handling missing’s from your code. Thanks. It’s a tricky choice because some functions like Impute.nocb work with missings and some are more performant with NaN’s. I do believe missings is the future. @aplavin comments on the performance between NaNs and missings here. Have to say I was suprised by this result - just as I was above, so much more memory used in the ByRow(emptymissing(median) ∘ skipmissing) solution.

What an awesome community! Thanks all

PS. any of the solutions perform way better than pandas by a country mile :slight_smile:

2 Likes