The first question is about filtering out missing data. There seems to be a major (roughly 500x) performance difference between Iterators.filter(!ismissing, v) and skipmissing, in favor of the latter. I would like to understand why that is, whether it can be fixed, or whether the recommendation is simply to stick with skipmissing.
julia> using BenchmarkTools, Missings  # provide @benchmark and missings
julia> v = missings(Float64, 1000000);
julia> for i in 2:2:1000000
           v[i] = 1.0
       end
julia> @benchmark collect(skipmissing($v))
BenchmarkTools.Trial:
  memory estimate:  5.00 MiB
  allocs estimate:  20
  --------------
  minimum time:     5.857 ms (0.00% GC)
  median time:      6.155 ms (0.00% GC)
  mean time:        6.198 ms (0.66% GC)
  maximum time:     13.369 ms (53.76% GC)
  --------------
  samples:          798
  evals/sample:     1
julia> @benchmark collect(Iterators.filter(!ismissing, $v))
BenchmarkTools.Trial:
  memory estimate:  97.17 MiB
  allocs estimate:  3999764
  --------------
  minimum time:     2.963 s (0.17% GC)
  median time:      3.056 s (0.73% GC)
  mean time:        3.056 s (0.73% GC)
  maximum time:     3.150 s (1.27% GC)
  --------------
  samples:          2
  evals/sample:     1
Also, the filter version returns a Vector{Union{Missing, Float64}}, and I wonder whether the element type should be tightened when collecting a filtered iterator over an Array of unions, or whether it should stay the same as the original type.
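To make the element-type difference concrete, here is a minimal sketch on a small vector. It assumes Julia 1.0 or later, where the vector can be built with the Array{T}(missing, n) constructor instead of missings. The last line is only one possible workaround (passing the target element type to collect), not necessarily the recommended one.

```julia
# Small stand-in for the benchmark vector: every second entry is set to 1.0.
v = Vector{Union{Missing, Float64}}(missing, 10)
v[2:2:end] .= 1.0

# skipmissing narrows the element type on its own.
a = collect(skipmissing(v))

# The generic filter iterator keeps the element type of the underlying array.
b = collect(Iterators.filter(!ismissing, v))

# Workaround sketch: ask collect for the narrower element type explicitly.
c = collect(Float64, Iterators.filter(!ismissing, v))

println(typeof(a))
println(typeof(b))
println(typeof(c))
```

All three results hold the same five values; only the declared element type differs.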
The second question is about checking if there are any values in a filtered iterator (before applying some function). The naive way would be to check if it is empty and then apply the function. For example:
julia> v = Iterators.filter(i -> (println(i); isfinite(i)), [Inf, Inf, 1.5])
Base.Iterators.Filter{getfield(, Symbol("##17#18")),Array{Float64,1}}(getfield(, Symbol("##17#18"))(), [Inf, Inf, 1.5])
julia> isempty(v) || println("mean is $(mean(v))")
Inf
Inf
1.5
Inf
Inf
1.5
mean is 1.5
What I wanted to know is the following: is there a way to avoid going through the iterator twice? Could some variant of isempty give me an iterator that wouldn’t check the filtering function on the initial part of the array again?
I tried playing a bit with start, next, and done but got a bit lost. I think I want to precompute s = start(v) (which advances to the first accepted element) and check done(v, s). If that returns true, the iterator is empty and I’m done. Otherwise, I’d like to return an iterator where I already know the value of start (without going through the first part again). Is there a way to do that, i.e. “resuming” iteration through an iterator? As a brute-force approach, one could create a custom iterator type with the same next and done as the initial iterator but accepting a custom start value, but maybe somebody has done that already.
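For reference, here is a rough sketch of what “resuming” could look like. It assumes Julia 1.0 or later, where start, next, and done were replaced by a single iterate function that returns nothing for an exhausted iterator, and it uses Iterators.rest to continue from a saved state. The names pred, calls, and mean_or_nothing are made up for illustration; the counter is only there to confirm that each element is tested once.

```julia
# Count predicate calls so we can see that no element is tested twice.
calls = Ref(0)
pred(x) = (calls[] += 1; isfinite(x))
v = Iterators.filter(pred, [Inf, Inf, 1.5, 2.5])

# Hypothetical helper: returns the mean, or nothing if the iterator is empty.
function mean_or_nothing(itr)
    step = iterate(itr)                  # advances to the first accepted element
    step === nothing && return nothing   # nothing accepted: iterator is empty
    x, state = step
    total, n = x, 1
    # Resume from the saved state instead of restarting from the beginning.
    for y in Iterators.rest(itr, state)
        total += y
        n += 1
    end
    return total / n
end

m = mean_or_nothing(v)
println("mean is $m after $(calls[]) predicate calls")
```

The emptiness check and the first element come out of the same iterate call, so the leading rejected elements are only filtered once.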