I’m jumping in without having read the whole topic but wanted to comment on that. Maybe a grid where NaN indicates an absent value is simply the wrong data structure? Perhaps a sparse matrix is more appropriate.
I get that some would like to “ignore” NaNs, others would like to see IEEE 754-compliant functions. Maybe a solution would be to avoid this situation in the first place (with a precondition “arguments != NaN”). In my experience developing numerical codes, NaNs pop up when numerical functions are evaluated outside of their domains. In this case, you probably want to catch the error early and handle the case differently.
Surely there’s a better solution to model the absence of data than NaNs.
In Julia what’s not stored is defined to be 0.0 (and storing -0.0, but NaN is stored, taking up space). I assume this goes for all alternative spars formats, and see e.g. also for:
The scipy sparse matrix format(s) stores non-zero values. The rest are 0’s. Period, full stop! I doubt if you’ll find “missing” in any of the scipy.sparse documentation.
Julia does have missing but it’s awkward for dense (not as space-efficient as NaN to indicate absence, for Float64 arrays), and worse for sparse arrays, doesn’t seem supported.
Actually I actually prefer performance over composability, I don’t want to end up like at python where lot of shiny things and a little wrong edge case and you get a 3-10-100x slowdown.
But of course the skip composability is beautiful so definitely worth to keep if someone want beautiful composability and readable codes.
Really nice thread all in all, just shows how easy and beautiful Julia is!
In its simplest form a 2D regular grid is a MxN dense matrix accompanied with the definition of the origin (x0,y0) plus the increments in x and y (dx, dy). With this we can find the coordinates of any point without needing to store 2 extra matrices with the coordinates.
When you open GoogleEarth and look at the oceans, that blue image was created with a grid like this. Now, given the product we don’t want to see holes in the ocean, although a very large part of the ocean’s depth was never measured with a ship echo-sounder. To fill those gaps grid interpolations and use of predicted bathymetry was used but that’s not the point here. The point is that for other variables we cannot simply fill the gaps. Imagine that you were seeing the Sea Surface Temperature (SST). What would you display as temperatures over land?
The answer is that for grids with data gaps, we must have nodata-values flags. Those can be any number, and they were many grids that used -99999(?) (the number of 9’s varied), or 0’s but these are ofc poor cohices (what would think if seeing an SST of 0 over land?). So whatever those flag numbers are, they must be filtered in many of its usage.
Now imagine we want to convert temperature in Celsius to Kelvin. I Julia (and may other languages) and if nodata is represented by a NaN we simply do
Tkelvin = Tcel + 273.15
but if the temperature grids were using any other flag number you’d have to write a loop that checked every node in the grid for that flag number. NaN propagation is ideal here, where it’s not is to compute the grid’s min/max like apparently the IEEE rules require (I mean the no propagation in min and max).
Note that this is not my personal opinion (but not that I disagree with it) but I’m describing what has been used during decades of production, compressing and distribution of earth science data.
I usually get frustrated when trying to apply these skip functions over a certain dimension of an array, which is very common in geo-applications. Here I think StefanKarpinski’s approach of defining a custom min/max is still the best, because it simply lets you use reduce with the dims keyword.
nanmin(x,y) = isnan(x) ? y : isnan(y) ? x : min(x,y)
nanmax(x,y) = isnan(x) ? y : isnan(y) ? x : min(x,y)
x = rand(100,10);
x[rand(1:length(x),100)] .= NaN;
reduce(nanmin,x,dims=1,init=Inf)
It would be best if eachslice(skip(...), ...) worked, and I think the only thing required for that is to remove ::AbstractArray restriction from methods in Base
[1] _eachslice(A::AbstractArray{T, N}, dims::Tuple{Vararg{Integer, M}}, drop::Bool) where {T, N, M}
@ slicearray.jl:54
[2] _eachslice(A::AbstractArray, dim::Integer, drop::Bool)
@ slicearray.jl:72
The problem is that your version is efficient for dims=1 but can be very cache-inefficient for dimensions with larger strides, here is a slightly modified example:
using BenchmarkTools, Skipper
nanmin(x,y) = isnan(x) ? y : isnan(y) ? x : min(x,y)
nanmax(x,y) = isnan(x) ? y : isnan(y) ? x : min(x,y)
x = rand(10000,10);
x[rand(1:length(x),1000)] .= NaN;
@benchmark reduce(nanmin,$x,dims=2,init=Inf)
@benchmark minimum.(skip.(isnan, eachslice($x, dims=2)))
takes on my system 50µs for the first and 250µs for the second version.
Yeah, currently neither skip nor skipmissing (from base julia) support stuff like minimum(skip(...), dims=...).
The skipmissing PR (https://github.com/JuliaLang/julia/pull/28027) has stalled for years, I guess nobody is that interested in the functionality. Maybe, the implementation can be made simpler given julia progress since 2018?
When/if this is added for skipmissing, should be very easy to copy for skip. Then, minimum(skip(isnan, x), dims=2) would work and should be efficient.