Understanding `Base.findall` and customizing it

Given a vector `ww`, I wish to obtain all its nonzero values together with their positions, with the positions converted to a specified integer type. I solve this with:

```julia
using BenchmarkTools, Test

function _find0(::tp, ww::AbstractVector{tw})::Tuple{Vector{tp},Vector{tw}} where {tp<:Integer, tw}
    pp = findall(!iszero, ww)
    return tp.(pp), ww[pp]
end

O = UInt32(0);   ww = rand(0.0:1.0, 10^5);   @btime _find0($O, $ww);
  154.408 ΞΌs (13 allocations: 989.91 KiB)
```

I’d like to squeeze every last drop of performance out of this, since I execute it in a long loop and it is a bottleneck for my application. Inspired by

my attempt at improvement is:

```julia
function _find1(::tp, ww::AbstractVector{tw})::Tuple{Vector{tp},Vector{tw}} where {tp<:Integer, tw}
    pp = collect(tp, (first(p) for p ∈ pairs(ww) if !iszero(last(p))))
    return pp, ww[pp]
end

O = UInt32(0);   ww = rand(0.0:1.0, 10^5);   @btime _find1($O, $ww);
  698.595 ΞΌs (19 allocations: 797.38 KiB)
```

Why such a performance dip? Both functions seem type-stable:

```julia
@inferred _find0(O, ww);
@inferred _find1(O, ww);
```

Even stranger: if I copy-paste `findall` from Julia Base into my script as `_findall` and use it in my `_find0`, I get the same bad performance. What is going on here?

Paying attention to the comments in the `findall` method you linked:

```julia
# Broadcasting is much faster for small testf, and computing
# integer indices from logical index using findall has a negligible cost
findall(testf::F, A::AbstractArray) where {F<:Function} = findall(testf.(A))
```

Maybe that’s where the difference in execution times lies.
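To make that path concrete, here is a minimal check (my own sketch, using the same `ww` setup as above) of what the quoted method does: it first broadcasts the predicate into a `BitVector`, and only then converts the logical mask into integer indices, which is the cheap part.

```julia
ww = rand(0.0:1.0, 10^5)

# findall(testf, A) for an AbstractArray broadcasts the predicate first,
# producing a BitVector mask, ...
mask = (!iszero).(ww)   # BitVector, one bit per element

# ... and only then converts the logical mask into integer indices.
idx = findall(mask)

# This matches the direct predicate call:
@assert idx == findall(!iszero, ww)
```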

In my case, the following version, which applies the hint from that comment directly, offers a slight improvement. That is what makes me think this is where the difference in performance comes from.

```julia
using BenchmarkTools

function _find0(::tp, ww::AbstractVector{tw})::Tuple{Vector{tp},Vector{tw}} where {tp<:Integer, tw}
    pp = findall(!iszero, ww)
    return tp.(pp), ww[pp]
end

function _find2(tp, ww)
    typep = typeof(tp)
    x = broadcast(!iszero, ww)              # logical mask, as in Base's findall
    res = ww[x]                             # the nonzero values
    pos = (typep(1):typep(length(ww)))[x]   # positions, already in the target type
    return (pos, res)
end
```

```julia
julia> O = UInt32(0);   ww = rand(0.0:1.0, 10^5);

julia> res0 = _find0(O, ww);

julia> res2 = _find2(O, ww);

julia> res0 == res2
true

julia> typeof(res0)
Tuple{Vector{UInt32}, Vector{Float64}}

julia> typeof(res2)
Tuple{Vector{UInt32}, Vector{Float64}}
```

```julia
julia> @benchmark _find0($O, $ww)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  110.400 ΞΌs …  1.509 ms  β”Š GC (min … max): 0.00% … 81.08%
 Time  (median):     120.000 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   132.035 ΞΌs Β± 99.278 ΞΌs  β”Š GC (mean Β± Οƒ):  6.04% Β±  7.40%

   β–ˆβ–†β–ƒ
  β–„β–ˆβ–ˆβ–ˆβ–†β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–β–‚β–‚β–β–β–‚β–‚β–β–‚β–‚β–β–β–β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–β–‚β–‚β–‚β–‚β–‚β–‚ β–‚
  110 ΞΌs          Histogram: frequency by time          337 ΞΌs <

 Memory estimate: 993.48 KiB, allocs estimate: 9.

julia> @benchmark _find2($O, $ww)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   90.200 ΞΌs …  1.300 ms  β”Š GC (min … max): 0.00% … 89.55%
 Time  (median):      94.800 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   100.735 ΞΌs Β± 76.972 ΞΌs  β”Š GC (mean Β± Οƒ):  4.95% Β±  5.97%

       β–ƒβ–β–†β–ˆβ–„β–ˆβ–‡β–‚β–„β–‚
  β–β–‚β–‚β–ƒβ–†β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–†β–‡β–‡β–†β–‡β–†β–…β–…β–„β–ƒβ–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β– β–ƒ
  90.2 ΞΌs         Histogram: frequency by time          112 ΞΌs <

 Memory estimate: 602.75 KiB, allocs estimate: 7.
```
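If you really want to squeeze out every last drop, one more variant worth trying is a single hand-written pass that fills both output vectors at once and skips the intermediate mask entirely. This is only a sketch of mine (the `_find3` name and the `sizehint!` sizes are my own choices, not anything from Base), and whether it wins will depend on the density of nonzeros in `ww`:

```julia
function _find3(::tp, ww::AbstractVector{tw}) where {tp<:Integer, tw}
    pp = tp[]                   # positions of nonzero entries
    vv = tw[]                   # the nonzero values themselves
    sizehint!(pp, length(ww))   # avoid repeated reallocation in dense cases
    sizehint!(vv, length(ww))
    @inbounds for i in eachindex(ww)
        if !iszero(ww[i])
            push!(pp, tp(i))
            push!(vv, ww[i])
        end
    end
    return pp, vv
end
```

Called as `_find3(UInt32(0), ww)`, it returns the same `Tuple{Vector{UInt32}, Vector{Float64}}` as `_find0` and `_find2`.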

Hmm, it’s super strange. Yesterday, when playing around with `_find1`, I achieved a 2× faster time and fewer allocations than with `_find0`, so I know your solution can be improved further. But I don’t know what exactly I changed (if anything), and I can’t recover that fast solution. :(