Given a vector ww, I wish to obtain all its nonzero values and their corresponding positions (with a specified index type). I currently solve this with:
using BenchmarkTools, Test
function _find0(::tp, ww::AbstractVector{tw})::Tuple{Vector{tp},Vector{tw}} where {tp<:Integer, tw}
    pp = findall(!iszero, ww)
    return tp.(pp), ww[pp]
end;
O=UInt32(0); ww=rand(0.0:1.0,10^5); @btime _find0($O,$ww);
154.408 μs (13 allocations: 989.91 KiB)
I'd like to squeeze every last drop of performance out of this, since I execute it in a long loop and it is a bottleneck for my application. Inspired by the source of findall in Julia Base, my attempt at an improvement is:
function _find1(::tp, ww::AbstractVector{tw})::Tuple{Vector{tp},Vector{tw}} where {tp<:Integer, tw}
    pp = collect(tp, (first(p) for p in pairs(ww) if !iszero(last(p))))
    return pp, ww[pp]
end;
O=UInt32(0); ww=rand(0.0:1.0,10^5); @btime _find1($O,$ww);
698.595 μs (19 allocations: 797.38 KiB)
Why such a performance dip? Both functions seem type-stable: @inferred _find0(O,ww); @inferred _find1(O,ww);
Even stranger, if I copy-paste findall from Julia Base into my script as _findall and use it in my _find0, I get the same bad performance. What is going on here?
Paying attention to the comments in the findall method you linked:
# Broadcasting is much faster for small testf, and computing
# integer indices from logical index using findall has a negligible cost
findall(testf::F, A::AbstractArray) where {F<:Function} = findall(testf.(A))
Maybe that's where the difference in execution times lies.
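For comparison, the generic fallback method (the one that applies when the AbstractArray specialization above is not hit) is essentially the same generator pattern as your _find1. From memory it looks roughly like this; the exact definition may vary between Julia versions:
# Generic fallback in Base (approximate): collect indices from a generator,
# which is the same pattern _find1 uses.
findall(testf::Function, A) = collect(first(p) for p in pairs(A) if testf(last(p)))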
In my case, the following version, which follows the hint in that comment, offers a slight improvement. That is what makes me think this is the source of the performance difference you are seeing.
using BenchmarkTools
function _find0(::tp, ww::AbstractVector{tw})::Tuple{Vector{tp},Vector{tw}} where {tp<:Integer, tw}
    pp = findall(!iszero, ww)
    return tp.(pp), ww[pp]
end;
function _find2(tp, ww)
    typep = typeof(tp)
    x = broadcast(!iszero, ww)             # logical mask, as in Base's AbstractArray specialization
    res = ww[x]                            # nonzero values via logical indexing
    pos = (typep(1):typep(length(ww)))[x]  # positions, already in the requested integer type
    return (pos, res)
end
julia> O=UInt32(0); ww=rand(0.0:1.0,10^5);
julia> res0 = _find0(O,ww);
julia> res2 = _find2(O,ww);
julia> res0 == res2
true
julia> typeof(res0)
Tuple{Vector{UInt32}, Vector{Float64}}
julia> typeof(res2)
Tuple{Vector{UInt32}, Vector{Float64}}
julia> @benchmark _find0($O, $ww)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  110.400 μs … 1.509 ms  ┊ GC (min … max): 0.00% … 81.08%
 Time  (median):     120.000 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   132.035 μs ± 99.278 μs ┊ GC (mean ± σ):  6.04% ± 7.40%
 [histogram omitted: frequency by time, 110 μs … 337 μs]
 Memory estimate: 993.48 KiB, allocs estimate: 9.
julia> @benchmark _find2($O, $ww)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  90.200 μs … 1.300 ms  ┊ GC (min … max): 0.00% … 89.55%
 Time  (median):     94.800 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   100.735 μs ± 76.972 μs ┊ GC (mean ± σ): 4.95% ± 5.97%
 [histogram omitted: frequency by time, 90.2 μs … 112 μs]
 Memory estimate: 602.75 KiB, allocs estimate: 7.
Hmm, it's super strange. Yesterday, when playing around with _find1, I achieved a 2× faster time and fewer allocations than with _find0, so I know your solution could be improved. But I don't know what exactly I changed (if anything), and I can't recover that fast solution. :(
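Maybe the lost variant was a single-pass version along these lines (just a guess and a sketch on my part, not the verified code, and I haven't re-benchmarked it):
# Hypothetical single-pass variant: build positions and values together,
# reserving capacity up front to limit reallocations.
function _find3(::tp, ww::AbstractVector{tw})::Tuple{Vector{tp},Vector{tw}} where {tp<:Integer, tw}
    pp = tp[]; vv = tw[]
    sizehint!(pp, length(ww)); sizehint!(vv, length(ww))
    @inbounds for i in eachindex(ww)
        w = ww[i]
        if !iszero(w)
            push!(pp, tp(i)); push!(vv, w)
        end
    end
    return pp, vv
end;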