I want to aggregate data over rows, which works well as long as each row contains some values.
When a row is empty (all values missing), maximum throws an error.
Maybe someone has an idea how to make this faster.
My table usually has about 200,000 rows, and I have to aggregate row-wise over ~100 columns for different column groups to obtain statistics (max, min, mad, median, mean, std, …).
using DataFrames
df = DataFrame(idx=1:3,b=2:4,a1=[1.0,missing,missing],a2=[2.0,3.0,missing])
# fast but breaks on empty row
transform(df,AsTable(r"a.*") => ByRow(maximum∘skipmissing))
# slow but works
maximum_safe(x) = isempty(x) ? NaN : maximum(x)
transform(df,AsTable(r"a.*") => ByRow(maximum_safe∘skipmissing))
Sorry, you are right. Of course, you could initialize the reduction with a known number that is lower than every number in your df, but this sounds a bit hacky.
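A minimal sketch of that initialization idea, using only Base. The helper name `rowmax_init` is hypothetical, and `-Inf` as the start value is an assumption that only makes sense for real-valued data:

```julia
# Hypothetical helper: row maximum with a sentinel start value.
# foldl accepts `init`, so a row with no non-missing values returns
# the sentinel instead of throwing.
rowmax_init(row; init=-Inf) = foldl(max, skipmissing(row); init=init)

rowmax_init((a1=1.0, a2=2.0))          # 2.0
rowmax_init((a1=missing, a2=3.0))      # 3.0
rowmax_init((a1=missing, a2=missing))  # -Inf, the sentinel
```

Since ByRow with AsTable passes each row as a NamedTuple, a helper like this could presumably be used as `ByRow(rowmax_init)`. The downside is that empty rows then carry the sentinel rather than missing, which is what makes the approach feel hacky.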
That’s a bit odd/unfortunate, as using a matrix seems to be much faster, i.e., adding the following to your suite:
suite["base matrix"] = @benchmarkable let yo = copy($df); yo[!, :max] = (maximum∘skipmissing).(eachrow(Matrix(yo))) end
suite["emptymissing matrix"] = @benchmarkable let yo = copy($df_missing); yo[!, :max] = (emptymissing(maximum)∘skipmissing).(eachrow(Matrix(yo))) end
I get
 Row │ Name                 Time
     │ String               Trial
─────┼────────────────────────────────────────
   1 │ base                 Trial(990.527 μs)
   2 │ emptymissing matrix  Trial(1.383 ms)
   3 │ emptymissing         Trial(47.702 ms)
   4 │ base matrix          Trial(1.084 ms)
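The matrix variant can be sketched without DataFrames as follows. The matrix `M` is a small stand-in for the `a` columns of the example table, and `rowmax_or_missing` is a hypothetical helper, not part of any package:

```julia
# Small stand-in for the `a` columns of the example table.
M = Union{Missing,Float64}[1.0 2.0; missing 3.0; missing missing]

# Hypothetical helper: maximum of the non-missing entries of a row,
# or `missing` if the row has no non-missing entries.
function rowmax_or_missing(row)
    vals = skipmissing(row)
    return isempty(vals) ? missing : maximum(vals)
end

# One result per matrix row.
maxes = map(rowmax_or_missing, eachrow(M))
```

Note that `Matrix(df)` converts every column, so for the original example table (which also has `idx` and `b`) one would probably select the relevant columns first, e.g. `Matrix(df[:, r"a.*"])`.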
I added a function that applies the base method only to the subset of rows that are not empty. This seems to be faster than filtering the empty rows in the original transform.
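function subset_transform!(df::DataFrame,transformation)
    # transformation has the form selector => (f => out)
    selector = transformation.first
    f = transformation.second.first
    out = transformation.second.second
    # fill the output column with missing for all rows
    df[:,out] .= missing
    # view of the rows that have at least one non-missing value
    sdf = subset(df,selector=>ByRow((!isempty)∘skipmissing),view=true)
    transform!(sdf,selector => f => out)
    df
end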
suite["subset"] = @benchmarkable subset_transform!($df_missing,AsTable(r"a.*") => ByRow(maximum∘skipmissing) => :out)
I think the correct answer here is the emptymissing function from Missings.jl, which checks whether an iterator is empty and returns missing if it is. It’s a function wrapper like passmissing.
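To illustrate the wrapper pattern, here is a rough Base-only sketch. `empty_to_missing` is a hypothetical stand-in written for this example; it is not the Missings.jl source, and the real emptymissing may differ in details:

```julia
# Hypothetical stand-in for the wrapper pattern: returns a function that
# yields `missing` for an empty input and applies `f` otherwise.
empty_to_missing(f) = x -> isempty(x) ? missing : f(x)

# Compose with skipmissing, as in the benchmark lines above.
safe_max = empty_to_missing(maximum) ∘ skipmissing

safe_max([1.0, missing])           # 1.0
safe_max([missing, missing])       # missing
safe_max((a1=missing, a2=3.0))     # 3.0, NamedTuple rows work too
```

With Missings.jl loaded, the corresponding call in a transform would be `ByRow(emptymissing(maximum)∘skipmissing)`, the same composition used in the matrix benchmark earlier in the thread.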