Performance optimization of a custom binning function

materializes a vector of Bools of length(x), hence the slowdown.

But, there is still a bit more to optimize, as:

still searches the vector x from the start each time, and not from the last position - this is a O(n^2) time algo and when x grows it will get worse.

On a different matter, looking at the bin counts in the results:

julia> ans2 = bin2(x, y, xbin_edges)
([0.05250221420254687, 0.15445935346995493, 0.2509019085421154, 0.34946848820087534, 0.4489906779469514, 0.5480592397942313, 0.6459265647762795, 0.7482272000306962, 0.8492190264532677, 0.0], [0.479559946078053; 0.5409834483815402; … ; 0.49189476192543485; 0.0;;], [100, 95, 95, 106, 107, 112, 99, 91, 87, 0])

julia> sum([100, 95, 95, 106, 107, 112, 99, 91, 87, 0])
892

892 isn’t a good sum, as it is, eh… less than 1000. This is a correctness bug.

Here is another version, which hopefully addresses this bug (and adds some others probs):

function bin3(
    x::Vector{<:Number},
    y::Array{<:Number},
    xbin_edges::Union{Vector{<:Number},StepRangeLen{}};
    method::F = mean,
) where {F<:Function}
    issorted(xbin_edges; lt = <) || error("xbin_edges not sorted")
    ndims(y) > 2 && error("y must be a vector or matrix")
    size(x, 1) == size(y, 1) || error("x and y must have the same number of rows")
    # sort along x for faster binning
    if !issorted(x)
        p = sortperm(x)
        x = x[p]
        y = y[p]
    end
    # initialize outputs
    n_bin = length(xbin_edges) - 1
    ind = zeros(Int, n_bin + 1)
    i, bi = 1, 1
    for be in xbin_edges
        ff = findfirst(β‰₯(be), @view x[i:end])
        if isnothing(ff)
            ind[bi] = length(x)
            break
        end
        ind[bi] = ff + i - 1
        bi += 1
        i += ff
    end
    last_bin = bi - 1
    x_binned = zeros(eltype(x), n_bin)
    y_binned = zeros(eltype(y), n_bin, size(y, 2))
    bin_count = zeros(Int, size(x_binned))
    # find binned values
    for i = 1:last_bin
        bin_count[i] = ind[i+1] - ind[i]
        x_binned[i] = method(@view x[ind[i]:(ind[i+1]-1)])
        y_binned[i, :] = method((@view y[ind[i]:(ind[i+1]-1), :]), dims = 1)
    end

    return x_binned, y_binned, bin_count
end

Edit: As PeterSimon pointed out, using findnext would be more readable (should have remembered this).

2 Likes

Also, [findfirst(x .>= be for be in xbin_edges for be in xbin_edges] creates a new true/false vector for each entry in xbin_edges. Better to pass a function as the first argument of findfirst and avoid all the allocations.

Edit: Crossed paths with Dan’s post, which pointed out the same thing. Yes, one could use findnext iteratively to avoid the inefficiency of multiple findfirst calls.

2 Likes

findfirst actually gives an O(n*m) algorithm with n = length(x) and m = length(xbin_edges). As an improvement one might use searchsortedfirst(x, be) instead which should be able to exploit the fact that x is sorted. It does not address the correctness bug, but should presumably run in O(m * log n), i.e., even better than O(n) if m << n.

3 Likes

searchsortedfirst should be a O(log n) binary search, so theoretically faster than O(n), but specific instance and hardware performance issues will determine faster approach.

Interestingly, a searchsortedguess version could be made which assumes some sort of β€œuniform” distribution of x and tries to binary search with bias towards a guessed location of the next bin cutoff. This might be going on a tangent though.

Edit: Again, cross-posting with bertschi about the log(N)-ess of sorted search.

1 Like

In this case this is intentional as we only want bins to include data within bin limits… even if the bin limits do not cover the full range of data

julia> xbin_edges
0.0:0.1:1.0

julia> extrema(x)
(0.0006307304436002914, 0.9988458402204999)

The bins include all the data in this test case.

Good catch!

Thanks all for taking the time… I learned a ton! Very much appreciated.

1 Like

Found a couple of more mistakes… this is what I settled on, posting for anyone that comes next:

function bin(
    x::Vector{<:Number},
    y::Array{<:Number},
    xbin_edges::Union{Vector{<:Number},StepRangeLen{}};
    method::F = mean,
) where {F<:Function}
    
    issorted(xbin_edges; lt = <) || error("xbin_edges not sorted")
    ndims(y) > 2 && error("y must be a vector or matrix")
    size(x, 1) == size(y, 1) || error("x and y must have the same number of rows")
    
    # sort along x for faster binning
    if !issorted(x)
        p = sortperm(x)
        x = x[p]
        y = y isa Vector ? y[p] : y[p,:]
    end

    # initialize outputs
    n_bin = length(xbin_edges) - 1
    ind = zeros(Int, n_bin + 1)
    i, bi = 1, 1
    for be in xbin_edges
        ff = searchsortedfirst((@view x[i:end]), be)
        if isnothing(ff)
            ind[bi] = length(x) + 1
            break
        end
        ind[bi] = ff + i - 1
        bi += 1
        i += ff - 1 
    end
    
    last_bin = bi - 2
    x_binned = zeros(eltype(x), n_bin)
    y_binned = zeros(eltype(y), n_bin, size(y, 2))
    bin_count = zeros(Int, size(x_binned))

    # find binned values
    for i = 1:last_bin
        bin_count[i] = ind[i+1] - ind[i]
        if bin_count[i] > 0
            x_binned[i] = method(@view x[ind[i]:(ind[i+1]-1)])
            y_binned[i, :] = method((@view y[ind[i]:(ind[i+1]-1), :]), dims = 1)
        end
    end

    return x_binned, y_binned, bin_count
end
1 Like

Just another few remarks on the searchsortedfirst piece:

  1. According to the docs searchsortedfirst already returns lastindex(x) + 1 if all values of x are greater than be, i.e., the if isnothing(ff) block will be unreachable.

  2. Changing the start index i in searchsortedfirst((@view x[i:end]), be) seems unnecessary as it will only cutoff a few steps in the binary search compared to searchsortedfirst(x, be) – the search is O(log n) after all – at the expense of additional complexity of keeping track of the indices in your code and some overhead from @view x[i:end] which also needs to map sub-indices to the parent array on each access.

In any case, the best algorithm surely depends on your use case, i.e., the sizes of x and xbin_edges. I.e., for short vectors x an O(n) search exploiting that both x and xbin_edges are sorted could be faster than the binary search of searchsortedfirst due to memory locality (binary search does random access in any case). Have not benchmarked this though …

2 Likes

Since performance juice has been squeezed from this function. Here is another version that taps some succinctness juice:

using StatsBase, BlockedArrays

x = rand(1000); y = rand(1000); xbin_edges = 0:0.1:1; method = :mean

function bin4(x, y, b, f = mean)
    p = sortperm(mortar([x,b]))
    bins = findall(p.>length(x))
    bin_count = diff(bins).-1
    x_binned = f.(x[p[bins[i]+1:bins[i+1]-1]] for i in 1:length(b)-1)
    y_binned = f.(y[p[bins[i]+1:bins[i+1]-1],:] for i in 1:length(b)-1)
    return x_binned, y_binned, bin_count
end

Gives similar results:

julia> ans4 = bin4(x, y, xbin_edges)
([0.05121915383957467, 0.14802010669692384, 0.25479136196099045, 0.3534202855299341, 0.44822152654780295, 0.5520721732892945, 0.6471681681318727, 0.7462173706421277, 0.8528325708490744, 0.9481594827553909], 
 [0.4935288649339929, 0.4978985328343402, 0.5233712241230352, 0.5127680432385109, 0.5095357581410285, 0.5337653237252364, 0.5257035473226342, 0.508290731656027, 0.5080514760215488, 0.4612049530088362],
 [69, 108, 107, 97, 109, 104, 99, 101, 99, 107])

This version can also be optimized a bit for performance (but it isn’t too bad).

Taking into account @bertschi comments and simplifying gives a highly performant and readable solution:

function bin5(
    x::Vector{<:Number},
    y::Array{<:Number},
    xbin_edges::Union{Vector{<:Number},StepRangeLen{}};
    method::F = mean,
) where {F<:Function}
    
    issorted(xbin_edges; lt = <) || error("xbin_edges not sorted")
    ndims(y) > 2 && error("y must be a vector or matrix")
    size(x, 1) == size(y, 1) || error("x and y must have the same number of rows")
    
    # sort along x for faster binning
    if !issorted(x)
        p = sortperm(x)
        x = x[p]
        y = y isa Vector ? y[p] : y[p,:]
    end

    # initialize outputs
    ind = Int[searchsortedfirst(x, be) for be in xbin_edges]
    n_bin = length(xbin_edges) - 1
    x_binned = zeros(eltype(x), n_bin)
    y_binned = zeros(eltype(y), n_bin, size(y, 2))
    bin_count = zeros(Int, size(x_binned))
 
    # find binned values
    for i = 1:n_bin
        bin_count[i] = ind[i+1] - ind[i]
        if bin_count[i] > 0
            x_binned[i] = method(@view x[ind[i]:(ind[i+1]-1)])
            y_binned[i, :] = method((@view y[ind[i]:(ind[i+1]-1), :]), dims = 1)
        end
    end

    return x_binned, y_binned, bin_count
end
using Statistics, BenchmarkTools
n = 1000;
x = rand(n); y = rand(n); xbin_edges = 0:0.1:1;
@benchmark bin5(x, y, xbin_edges; method=mean)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  31.718 ΞΌs …   9.859 ms  β”Š GC (min … max): 0.00% … 99.45%
 Time  (median):     34.044 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   41.035 ΞΌs Β± 179.824 ΞΌs  β”Š GC (mean Β± Οƒ):  8.72% Β±  1.99%

   β–ˆβ–†                                                           
  β–…β–ˆβ–ˆβ–†β–„β–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–ƒβ–†β–†β–„β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–β–‚β–‚ β–ƒ
  31.7 ΞΌs         Histogram: frequency by time         81.6 ΞΌs <

Very compact solution that I’m sure could be optimized but as is runs about 100 times slower

Turns out, I had only to give up on using clever BlockedArrays, and the compact function actually benchmarked faster than bin5:

function bin4(x, y, b, f = mean)
    p = sortperm(vcat(x,b))
    bins = findall(p.>length(x))
    x_binned = [f(x[p[k]] for k in bins[i]+1:bins[i+1]-1)
                  for i in 1:length(b)-1]
    y_binned = [f(y[p[k],t] for k in bins[i]+1:bins[i+1]-1)
                  for i in 1:length(b)-1, t in 1:size(y,2)]
    return x_binned, y_binned, diff(bins).-1
end

(and it doesn’t need sorted bin_edges)

julia> @btime bin4($x, $y, $xbin_edges);
  41.895 ΞΌs (14 allocations: 21.52 KiB)

julia> @btime bin5($x, $y, $xbin_edges);
  46.254 ΞΌs (118 allocations: 40.41 KiB)

Edit: Sacrificing clarity, the function can be made to fit into the 7-line category of this thread https://discourse.julialang.org/t/seven-lines-of-julia-examples-sought/50416/1 :

function bin6line(x, y, b, f = mean)
    p = sortperm(vcat(x,b))
    bins = findall(p.>length(x))
    aggr = @views [f(z[p[bins[i]+1:bins[i+1]-1]]) for i in 1:length(b)-1, z in (x, eachcol(y)...) ]
    return @view(aggr[:,1]), @view(aggr[:,2:end]), diff(bins).-1
end
2 Likes

Bravo… that’s impressive. I get about a 7% improvement in speed with this one… and in so few lines. Though I like more descriptive variable names for readability.

Should be

bins = findall(>(length(x)), p) 

to avoid unnecessary temporary array.

@DNF: it has crossed my mind, but the other form is cleaner and benchmarking showed no difference. In fact, on a side note, why is there no difference in the following?

julia> vv  = rand(1:100,10_000);

julia> @btime findall($vv.>99);
  6.475 ΞΌs (4 allocations: 6.42 KiB)

julia> @btime findall(>(99), $vv);
  6.531 ΞΌs (4 allocations: 6.42 KiB)

(Julia 1.8.3)

The more interesting question is how to speed up the next line aggr = @view… ?

@alex-s-gardner : To get better naming, returning a named-tuple seems nice, especially since elements of tuple have different types. Something like:

(x_binned = @view(aggr[:,1]), y_binned = @view(aggr[:,2:end]), bin_count = diff(bins).-1)

as last bin6 line (other versions can be adapted similarly). Doing so allows destructuring into different vars, or accessing fields namefullly.

TL-CP (too long, just let me copy-paste please):

Code
using Statistics, BenchmarkTools

function bin5(
    x::Vector{<:Number},
    y::Array{<:Number},
    xbin_edges::Union{Vector{<:Number},StepRangeLen{}};
    method::F = mean,
) where {F<:Function}
    
    issorted(xbin_edges; lt = <) || error("xbin_edges not sorted")
    ndims(y) > 2 && error("y must be a vector or matrix")
    size(x, 1) == size(y, 1) || error("x and y must have the same number of rows")
    
    # sort along x for faster binning
    if !issorted(x)
        p = sortperm(x)
        x = x[p]
        y = y isa Vector ? y[p] : y[p,:]
    end

    # initialize outputs
    ind = Int[searchsortedfirst(x, be) for be in xbin_edges]
    n_bin = length(xbin_edges) - 1
    x_binned = zeros(eltype(x), n_bin)
    y_binned = zeros(eltype(y), n_bin, size(y, 2))
    bin_count = zeros(Int, size(x_binned))
 
    # find binned values
    for i = 1:n_bin
        bin_count[i] = ind[i+1] - ind[i]
        if bin_count[i] > 0
            x_binned[i] = method(@view x[ind[i]:(ind[i+1]-1)])
            y_binned[i, :] = method((@view y[ind[i]:(ind[i+1]-1), :]), dims = 1)
        end
    end

    return x_binned, y_binned, bin_count
end

function bin4(x, y, b, f = mean)
    p = sortperm(vcat(x,b))
    bins = findall(p.>length(x))
    x_binned = [f(x[p[k]] for k in bins[i]+1:bins[i+1]-1)
                  for i in 1:length(b)-1]
    y_binned = [f(y[p[k],t] for k in bins[i]+1:bins[i+1]-1)
                  for i in 1:length(b)-1, t in 1:size(y,2)]
    return x_binned, y_binned, diff(bins).-1
end

function bin6(x, y, b, f = mean)
    p = sortperm(vcat(x,b))
    bins = findall(>(length(x)), p)
    aggr = @views [f(z[p[bins[i]+1:bins[i+1]-1]]) for i in 1:length(b)-1, z in (x, eachcol(y)...) ]
    (x_binned = @view(aggr[:,1]), y_binned = @view(aggr[:,2:end]), bin_count = diff(bins).-1)
end

n = 1000;
x = rand(n); y = rand(n); xbin_edges = 0:0.1:1;

display(@benchmark bin5(x, y, xbin_edges; method=mean));
display(@benchmark bin4(x, y, xbin_edges));
display(@benchmark bin6(x, y, xbin_edges));
; # that's all

ADDENDUM: This calculation can also be done with DataFrames (or any DB-like interface) as it is a GroupBy like operation. So:

using DataFrames

dfbin(x, y, b, f = mean) =
  combine(
    groupby(transform(
      DataFrame(x=x,y=collect(eachrow(y)); copycols=false), 
      :x => ByRow(x -> searchsortedfirst(b,x)) => :bin
    ), :bin; sort=true), 
    Not(:bin) .=> (reduced(c) = f(c,dims=1)), :bin => length => :bin_count;
    keepkeys=false
  )

and we have:

julia> dfbin(x,y,xbin_edges)
10Γ—3 DataFrame
 Row β”‚ x_reduced  y_reduced   bin_count 
     β”‚ Float64    Array…      Int64     
─────┼──────────────────────────────────
   1 β”‚ 0.0434871  [0.541275]         92
   2 β”‚ 0.151771   [0.521445]         85
   3 β”‚ 0.250651   [0.5025]          113
   4 β”‚ 0.346103   [0.440693]        107
   5 β”‚ 0.446809   [0.523061]        111
   6 β”‚ 0.555471   [0.491154]         89
   7 β”‚ 0.649694   [0.483566]         91
   8 β”‚ 0.751556   [0.44833]         102
   9 β”‚ 0.850171   [0.555502]        105
  10 β”‚ 0.948096   [0.535546]        105

It is slower (around 10x), but maybe optimization possible, especially if data comes from DataFrame already.

1 Like

Yeah, I find that odd as well, particularly that the allocations are the exactly the same every time.

In fact, this is really odd:

julia> vv = rand(1:100, 10_000);

julia> @btime findall(>(101), $vv);
  4.900 ΞΌs (4 allocations: 5.61 KiB)

julia> findall(>(101), vv)
Int64[]

Looked in the source:

findall(testf::F, A::AbstractArray) where {F<:Function} = findall(testf.(A))

Huh…

2 Likes

These two lines are so elegant

@Dan Note that we lost out check for empty bins, need something like:

# find bin breakpoints
    p = sortperm(vcat(x, xbin_edges))
    bins = findall(>(length(x)), p)
    bin_count = diff(bins).-1
    
    # calculate binned metrics
    x_binned = [(bin_count[i] > 0) ? method(x[p[bins[i]+1:bins[i+1]-1]]) : 0 for i in 1:length(xbin_edges)-1]
    y_binned = [(bin_count[i] > 0) ? method(y[p[bins[i]+1:bins[i+1]-1]]) : 0 for i in 1:length(xbin_edges)-1]