Get the maximum value by row in a DataFrame containing missing values

I want to aggregate data over rows, which works well if the row contains some values.
When the data row is empty maximum throws an error.

Maybe someone has an idea how to make this faster.

My table has usually about rows 200.000 and I have to aggregate over ~ 100 columns by row for different column groups to obtain statistical data.(Max, min, mad, median, mean, std, …)

using DataFrames
df = DataFrame(idx=1:3,b=2:4,a1=[1.0,missing,missing],a2=[2.0,3.0,missing])

# fast but breaks on empty row
transform(df,AsTable(r"a.*") => ByRow(maximum∘skipmissing))

# slow but works
maximum_safe(x) = isempty(x) ? NaN : maximum(x)
transform(df,AsTable(r"a.*") => ByRow(maximum_safe∘skipmissing))

I did not benchmark my function but you could try:

mymax(x)=maximum(x;init = missing)
transform(df,AsTable(r"a.*") => ByRow(mymax∘skipmissing))

This always produces missing, I think.

Sry, you are right. Of course, you could initialize it with a known number that is lower than every number you have in your df, but this sounds a bit hacky.
Like

mymax(x)=maximum(x;init=-Inf)

I made some benchmarks:

The wrapper, however, does not produce the same result. I think I am missing something there.

Code
using DataFrames
using BenchmarkTools
import Base:maximum,isless

ncols = 100
df = DataFrame(rand(1000,ncols),["a$i" for i in 1:ncols])
allowmissing!(df)
df[:,rand(1:100)].=missing
df_missing = copy(df)
df_missing[end,:] .= missing

suite = BenchmarkGroup()

suite["base"] = @benchmarkable transform($df,AsTable(r"a.*") => ByRow(maximum∘skipmissing))

max_isempty_NaN(x) = isempty(x) ? NaN : maximum(x) 
suite["isempty NaN"] = @benchmarkable transform($df_missing,AsTable(r"a.*") => ByRow(max_isempty_NaN∘skipmissing))

max_isempty_missing(x) = isempty(x) ? missing : maximum(x) 
suite["isempty missing"] = @benchmarkable transform($df_missing,AsTable(r"a.*") => ByRow(max_isempty_missing∘skipmissing))

max_init_low(x) = maximum(x;init=-Inf) 
suite["init low"] = @benchmarkable transform($df_missing,AsTable(r"a.*") => ByRow(max_init_low∘skipmissing))

suite["emptymissing"] = @benchmarkable transform($df_missing,AsTable(r"a.*") => ByRow(emptymissing(maximum)∘skipmissing))

struct Wrapper{T}
    value::T
end
isless(x1::Wrapper,x2::Wrapper) = isless(x1.value,x2.value)
maximum(x::AbstractArray{Wrapper}) = isempty(x) ? missing : maximum(x)
df_wrapped = Wrapper.(df_missing)
suite["maximum wrapper missing"] = @benchmarkable transform($df_wrapped,AsTable(r"a.*") => ByRow(maximum∘skipmissing))

struct Wrapper2{T}
    value::T
end
isless(x1::Wrapper2,x2::Wrapper2) = isless(x1.value,x2.value)
maximum(x::AbstractArray{Wrapper2}) = isempty(x) ? NaN : maximum(x)
df_wrapped2 = Wrapper2.(df_missing)
suite["maximum wrapper NaN"] = @benchmarkable transform($df_wrapped2,AsTable(r"a.*") => ByRow(maximum∘skipmissing))

tune!(suite);

results = run(suite)

DataFrame(results,["Name","Time"])




 Row │ Name                     Time               
─────┼────────────────────────────────────────────
   1 │ isempty missing          Trial(10.472 ms)
   2 │ base                     Trial(276.139 μs)
   3 │ maximum wrapper missing  Trial(1.749 ms)
   4 │ maximum wrapper NaN      Trial(1.519 ms)
   5 │ init low                 Trial(10.765 ms)
   6 │ emptymissing             Trial(10.664 ms)
   7 │ isempty NaN              Trial(10.703 ms)

That’s a bit odd/unfortunate as using a matrix seems to be much faster, i.e., adding the following to your suite

suite["base matrix"] = @benchmarkable let yo = copy($df); yo[!, :max] = (maximum∘skipmissing).(eachrow(Matrix(yo))) end
suite["emptymissing matrix"] = @benchmarkable let yo = copy($df_missing); yo[!, :max] = (emptymissing(maximum)∘skipmissing).(eachrow(Matrix(yo))) end

I get

 Row │ Name                 Time              
     │ String               Trial             
─────┼────────────────────────────────────────
   1 │ base                 Trial(990.527 μs)
   2 │ emptymissing matrix  Trial(1.383 ms)
   3 │ emptymissing         Trial(47.702 ms)
   4 │ base matrix          Trial(1.084 ms)

I added a method that only applies the base method to a subset of the data. This seems to be faster than filtering the empty rows in the original transform.

function subset_transform!(df::DataFrame,transformation)
    selector = transformation.first
    f = transformation.second.first
    out = transformation.second.second

    df[:,out] .= missing
    sdf = subset(df,selector=>ByRow((!isempty)∘skipmissing),view=true)
    transform!(sdf,selector => f => out)
    df
end

suite["subset"] = @benchmarkable subset_transform!($df_missing,AsTable(r"a.*") => ByRow(maximum∘skipmissing) => :out)
 Row │ Name                 Time                       
─────┼────────────────────────────────────────
   1 │ isempty missing      Trial(100.148 ms)
   2 │ base                 Trial(1.957 ms)
   3 │ emptymissing matrix  Trial(3.536 ms)
   4 │ init low             Trial(94.001 ms)
   5 │ subset               Trial(6.586 ms)
   6 │ emptymissing         Trial(101.437 ms)

My proposal (haven’t compared relative to all the other results)

using DataFrames
using BenchmarkTools
import Base: maximum, isless

ncols = 100
df    = DataFrame(rand(1000, ncols), ["a$i" for i in 1:ncols])
allowmissing!(df)
df[:, rand(1:100)] .= missing
df_missing          = copy(df)
df_missing[end, :] .= missing


# previous solution
max_isempty_NaN(x) = isempty(x) ? NaN : maximum(x)
@btime transform($df_missing, AsTable(r"a.*") => ByRow(max_isempty_NaN ∘ skipmissing) => :output1)
# 17.265 ms (197572 allocations: 156.51 MiB)


# solution proposed
maximum_nomiss = (maximum ∘ skipmissing)
selection(x)   = isempty(x) || all(ismissing,x)

function foo(df_missing)
    df_missing.output2 .= missing
    temp                = view(eachrow(df_missing),:)
    temp1               = view(df_missing, (!selection).(temp), :)
    temp1.output2      .= maximum_nomiss.(eachrow(temp1))
end

@btime foo($df_missing) 
# 8.910 ms (555108 allocations: 10.03 MiB)


# same result
transform!(df_missing, AsTable(r"a.*") => ByRow(max_isempty_NaN ∘ skipmissing) => :output1)
foo(df_missing)
isequal(df_missing.output1, df_missing.output2) # true

updated benchmark:

  • added more fully missing lines
  • increased rows
Code
using DataFrames
using BenchmarkTools
import Base:maximum,max

ncols = 100
nrows = 10000
df = DataFrame(rand(nrows,ncols),["a$i" for i in 1:ncols])
allowmissing!(df)
df[:,rand(1:100)].=missing
df_missing = copy(df)
df_missing[rand(1:nrows,100),:] .= missing

suite = BenchmarkGroup()

suite["base"] = @benchmarkable transform($df,AsTable(r"a.*") => ByRow(maximum∘skipmissing))

max_isempty_missing(x) = isempty(x) ? missing : maximum(x) 
suite["isempty missing"] = @benchmarkable transform($df_missing,AsTable(r"a.*") => ByRow(max_isempty_missing∘skipmissing))

max_init_low(x) = maximum(x;init=-Inf) 
suite["init low"] = @benchmarkable transform($df_missing,AsTable(r"a.*") => ByRow(max_init_low∘skipmissing))

suite["emptymissing"] = @benchmarkable transform($df_missing,AsTable(r"a.*") => ByRow(emptymissing(maximum)∘skipmissing))

function subset_transform!(df::DataFrame,transformation)
    selector = transformation.first
    f = transformation.second.first
    out = transformation.second.second

    df[!,out] .= missing
    sdf = subset(df,selector=>ByRow((!isempty)∘skipmissing),view=true)
    transform!(sdf,selector => f => out)
    df
end

suite["subset"] = @benchmarkable let yo = copy($df_missing);subset_transform!(yo,AsTable(r"a.*") => ByRow(maximum∘skipmissing) => :out) end

suite["emptymissing matrix"] = @benchmarkable let yo = copy($df_missing); yo[!, :max] = (emptymissing(maximum)∘skipmissing).(eachrow(Matrix(yo))) end

maximum_nomiss = (maximum ∘ skipmissing)
selection(x)   = isempty(x) || all(ismissing,x)

function foo(df_missing)
    df_missing.output2 .= missing
    temp                = view(eachrow(df_missing),:)
    temp1               = view(df_missing, (!selection).(temp), :)
    temp1.output2      .= maximum_nomiss.(eachrow(temp1))
    df_missing
end

suite["view"] = @benchmarkable foo($df_missing)

tune!(suite);

results = run(suite)

display(DataFrame(results,["Name","Time"]))
 Row │ Name                 Time                     
─────┼────────────────────────────────────────
   1 │ isempty missing      Trial(108.453 ms)
   2 │ base                 Trial(1.954 ms)
   3 │ emptymissing matrix  Trial(3.727 ms)
   4 │ init low             Trial(113.381 ms)
   5 │ subset               Trial(7.584 ms)
   6 │ emptymissing         Trial(118.192 ms)
   7 │ view                 Trial(109.994 ms)

I think the correct answer here is the emptymissing function from Missings.jl., which checks if an iterator is empty and returns missing if it’s empty. It’s a function wrapper like passmissing.

julia> x = [1, 2, missing];

julia> x_miss = [missing, missing, missing];

julia> emptymissing(maximum)(skipmissing(x))
2

julia> emptymissing(maximum)(skipmissing(x_miss))
missing

EDIT: Sorry, I didn’t realize emptymissing was already tested in this thread. Still, it’s the solution I recommend.

It seems there is only a special implementation for ByRow(maximum ∘skipmissing) and not ByRow(emptymissing(maximum)∘skipmissing.

Also only a few functions can be used instead of maximum.