I haven’t gone through the details of _var, but my guess is that filtering values one by one prevents SIMD vectorization of the reduction. Using transducers may help a little, but I don’t think you can eliminate the main slowdown unless you filter the array first and then call var on the dense result.
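I don’t see a 20x slowdown on my machine (the lazy filter is roughly 5.7x slower than plain var in the timings below), but a gap that large may be possible on AVX-512 hardware.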
julia> x = rand(10^6);
julia> x[x.<0.5] .= NaN;
julia> @btime var($x)
832.718 μs (0 allocations: 0 bytes)
NaN
julia> @btime var(Iterators.filter(!isnan, $x))
4.725 ms (0 allocations: 0 bytes)
0.020807221377812737
julia> @btime var(filter(!isnan, $x))
2.342 ms (3 allocations: 7.63 MiB)
0.020807221377812827
julia> @btime var(filter!(!isnan, y)) setup=(y=copy(x)) evals=1
1.732 ms (2 allocations: 16 bytes)
0.020833079313851994
julia> versioninfo()
Julia Version 1.6.0-DEV.464
Commit 29826c2c08 (2020-07-14 22:58 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: AMD Ryzen 7 4700U with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, btver1)
Environment:
JULIA_NUM_THREADS = 8
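
To make the "filter the array first" suggestion concrete, here is a minimal sketch of two ways to do it. The helper names nanvar_copy and nanvar_buffered! are made up for illustration and are not part of Statistics. The second one assumes you can keep a scratch buffer around between calls, which avoids the allocation that filter makes. I have not benchmarked either of them.

```julia
using Statistics

# Hypothetical helper: copy the non-NaN values into a fresh dense array,
# then let `var` run over contiguous memory.
nanvar_copy(x::AbstractVector{<:AbstractFloat}) = var(filter(!isnan, x))

# Hypothetical helper: same idea, but compact the non-NaN values into a
# caller-provided buffer so repeated calls do not allocate a new array.
function nanvar_buffered!(buf::Vector{T}, x::AbstractVector{T}) where {T<:AbstractFloat}
    resize!(buf, length(x))      # make room for the worst case (no NaNs)
    n = 0
    for v in x
        if !isnan(v)
            n += 1
            buf[n] = v
        end
    end
    resize!(buf, n)              # shrink to the number of values kept
    return var(buf)
end

x = rand(10^6)
x[x .< 0.5] .= NaN
buf = similar(x)

# Both should agree with the lazy version up to floating-point rounding.
reference = var(Iterators.filter(!isnan, x))
println(nanvar_copy(x) ≈ reference)
println(nanvar_buffered!(buf, x) ≈ reference)
```

The results can differ in the last few digits between variants, as in the timings above, because the values are summed in a different order.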