Performance optimization of a custom binning function

I’m looking for feedback on how I might be able to further optimize the below bin function. There are a couple of coding patterns that I’m curious to get feedback on, specifically my approach to internal array initialization and method switching. These are coding patterns that I use often but I have a suspicion that they are not performant in Julia.

using Statistics
x = rand(100); y = rand(100); xbin_edges = 0:0.1:1; method = :mean
bin(x, y, xbin_edges; method)

"""
bin(x::Vector{<:Number}, y::Array{<:Number}, xbin_edges::Vector{<:Number}; method::Symbol = :mean)

Fast binning of `y` as a funciton `x`. returns `x_binned`, `y_binned`, `bin_count`. `method` specifies approached used for aggregating binned values.
"""
function bin(x::Vector{<:Number}, y::Array{<:Number}, xbin_edges::Union{Vector{<:Number}, StepRangeLen{}}; method::Symbol = :mean)

# make sure that xbin_edges are from smallest to largests
if !issorted(xbin_edges; lt = <)
error("xbin_edges not sorted")
end

if ndims(y) > 2
error("y must be a vector or 2D array")
end

if size(x,1) != size(y,1)
error("x and y must have the same number of rows")
end

# sort along x for faster binning
if !issorted(x)
p = sortperm(x)
x = x[p]
y = y[p]
end

# initialize outputs
ind = [findfirst(x .>= be) for be in xbin_edges]
n_bin = length(xbin_edges)-1;
x_binned = zeros(typeof(x[1]), n_bin)
y_binned = zeros(typeof(y[1]), n_bin, size(y,2))
bin_count = zeros(size(x_binned))

# find binned values
for i = 1:n_bin
if !isnothing(ind[i]) && !isnothing(ind[i+1]) && (ind[i] < ind[i+1])
bin_count[i] = ind[i+1] - ind[i];

# add new methods here
if method == :mean
x_binned[i] = mean(x[ind[i]:(ind[i+1]-1),:])
y_binned[i] = mean(y[ind[i]:(ind[i+1]-1),:])
elseif method == :median
x_binned[i] = median(x[ind[i]:(ind[i+1]-1),:])
y_binned[i] = median(y[ind[i]:(ind[i+1]-1),:])
end
end
end

return x_binned, y_binned, bin_count
end

The lowest hanging fruit Is that these slices allocate new arrays. Use @view(x[...]) or add @views function... to the whole function instead.

Thanks for the suggestion but it seems adding @view(x[...]) or @views function has no impact on performance. I think this is because indexing into x is handled as a view by default, similar to Python.

n = 1000;

x = rand(n); y = rand(n); xbin_edges = 0:0.1:1; method = :mean

@benchmark bin(x, y, xbin_edges; method)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  53.861 μs …  10.949 ms  ┊ GC (min … max):  0.00% … 99.19%
Time  (median):     59.354 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   75.399 μs ± 290.890 μs  ┊ GC (mean ± σ):  13.72% ±  3.56%

▄▆█▇▆▅▄▅▅▅▄▂▁▁▂▁▁▁▁▁                                         ▂
█████████████████████▇▇▆█▇▆▇▇▇▇▇▇▆▆▅▅▄▆▅▄▅▅▄▆▅▅▆▆▅▅▅▂▄▅▅▄▃▂▅ █
53.9 μs       Histogram: log(frequency) by time       148 μs <

Memory estimate: 80.45 KiB, allocs estimate: 291.

@benchmark bin_views(x, y, xbin_edges; method)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  55.590 μs …  10.326 ms  ┊ GC (min … max):  0.00% … 98.77%
Time  (median):     62.511 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   79.364 μs ± 311.441 μs  ┊ GC (mean ± σ):  14.89% ±  3.82%

▂▅▇█▇▆▄▃▄▃▄▅▄▁▁▁▁▁▁                                          ▂
██████████████████████▇▇▆▆▆▆▇▆▇▇▆▇▇▆▆▆▆▆▅▆▅▆▅▇▆▆▆▅▅▅▅▆▄▄▄▄▄▅ █
55.6 μs       Histogram: log(frequency) by time       148 μs <

Memory estimate: 93.39 KiB, allocs estimate: 255.

No, it is not. It copies by default. At least one would expect that allocations went down. Can you post exactly what you benchmarked?

1 Like

Here’s the bin_view version that I benchmarked:

function bin_views(x::Vector{<:Number}, y::Array{<:Number}, xbin_edges::Union{Vector{<:Number}, StepRangeLen{}}; method::Symbol = :mean)

# make sure that xbin_edges are from smallest to largests
if !issorted(xbin_edges; lt = <)
error("xbin_edges not sorted")
end

if ndims(y) > 2
error("y must be a vector or 2D array")
end

if size(x,1) != size(y,1)
error("x and y must have the same number of rows")
end

# sort along x for faster binning
if !issorted(x)
p = sortperm(x)
x = x[p]
y = y[p]
end

# initialize outputs
ind = [findfirst(x .>= be) for be in xbin_edges]
n_bin = length(xbin_edges)-1;
x_binned = zeros(typeof(x[1]), n_bin)
y_binned = zeros(typeof(y[1]), n_bin, size(y,2))
bin_count = zeros(size(x_binned))

# find binned values
for i = 1:n_bin
if !isnothing(ind[i]) && !isnothing(ind[i+1]) && (ind[i] < ind[i+1])
bin_count[i] = ind[i+1] - ind[i];

# add new methods here
if method == :mean
x_binned[i] = mean(@views(x[ind[i]:(ind[i+1]-1),:]))
y_binned[i] = mean(@views(y[ind[i]:(ind[i+1]-1),:]))
elseif method == :median
x_binned[i] =  median(@views(x[ind[i]:(ind[i+1]-1),:]))
y_binned[i] =  median(@views(y[ind[i]:(ind[i+1]-1),:]))
end
end
end

return x_binned, y_binned, bin_count
end

I concur with @lmiq. In addition, there are a couple of things that pop out at me. One of your assignment statements is

x_binned[i] = mean(x[ind[i]:(ind[i+1]-1),:])

but x is a vector, so the second index : is unneeded. Also, you write

y_binned[i] = median(y[ind[i]:(ind[i+1]-1),:])

where you index y_binned with a single integer index, but y_binned may be a matrix with more than a single column. This indexing is legal, but if there is more than one column, your code does not assign values to the second and subsequent columns. They are left at zero due to construction using zeros. It’s hard to imagine that is intended behavior.

Its @view(x[...]) here (without the s).

@views with the s is to use views in the complete function.

(I wonder if there was a way to throw some warning in such situations, but probably that’s hard, as that could make sense depending on what follows the macro).

Edit: now I tested it here it you can use both @views or @view there, its the same.

There is a small reduction in the amount of memory allocated, but the speedup is not very important in this case:

julia> @btime bin_views(\$x, \$y, \$xbin_edges);
50.068 μs (276 allocations: 80.22 KiB)

julia> @btime bin(\$x, \$y, \$xbin_edges);
51.724 μs (240 allocations: 93.22 KiB)

Thus, to actually work on accelerating that function, I think a profiling would be good, to see what is taking most of the time.

1 Like

Thanks… that was just a blunder on my part… it should have been:

x_binned[i] = mean(x[ind[i]:(ind[i+1]-1)])
y_binned[i,:] = mean(y[ind[i]:(ind[i+1]-1),:], dims=1)

OK, it looks like using @view instead (good catch… I would have never seen that) gives a small bump in performance and a 20% reduction in memory

@benchmark bin(x, y, xbin_edges; method)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  53.735 μs …   9.672 ms  ┊ GC (min … max):  0.00% … 98.93%
Time  (median):     59.098 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   76.599 μs ± 320.107 μs  ┊ GC (mean ± σ):  16.49% ±  3.95%

▃▅▇█▇▄▃▂▂▃▃▅▄▃▃▂▁▂▂▁ ▁▁▁                                     ▂
████████████████████████▇▇▆▇▇▇▇▇▆▇▇▇▇▇▅▆▆▆▄▅▆▄▄▅▄▄▅▆▅▄▅▅▄▁▆▆ █
53.7 μs       Histogram: log(frequency) by time       135 μs <

Memory estimate: 97.66 KiB, allocs estimate: 278.

@benchmark bin_views(x, y, xbin_edges; method)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  52.191 μs …  11.915 ms  ┊ GC (min … max):  0.00% … 99.02%
Time  (median):     57.584 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   74.385 μs ± 316.155 μs  ┊ GC (mean ± σ):  15.04% ±  3.55%

▃▆██▆▄▄▃▄▄▅▆▄▂▂▂▂▁▁                                          ▂
████████████████████▇▇▇▆▆▇▇▇▆▇▆▇▇▇▆▇▇▅▅▆▅▆▅▅▄▆▆▆▆▆▅▄▅▅▄▄▆▄▄▄ █
52.2 μs       Histogram: log(frequency) by time       135 μs <

Memory estimate: 80.25 KiB, allocs estimate: 278.

That benchmark is consistent with what I get, but using @views or @view does not make any difference for me.

Be careful in interpolating the variables in the benchmark (add \$ in @benchmark bin(\$x, \$y, \$xbin_edges), for example. In this case it is not making much of a difference, but in some cases it does, and it is the correct thing to do (because of the internals of the benchmarking macros).

1 Like

@view applies to a single a[b] statement while @views applies the substitution to an entire code block or statement. This block can be function, for, if, let, begin, etc, or a single statement. So @views was totally legal where used, it just happens to operate identically to @view in that situation.

I would limit uses of things like @views or @inbounds to only single lines or blocks of maybe <10 lines at a time. Otherwise you risk accidentally applying it in places where you’ve forgotten you were using it. Also, someone reading your code may not notice it if it’s far away from the effect site.

2 Likes

Seems that checking if x is sorted is not worth the effort. Commenting that and just calling sortperm always:

# sort along x for faster binning
#if !issorted(x)
p = sortperm(x)
x = x[p]
y = y[p]
#end

gives a 40% speedup.

I’m not seeing any improvement on my end:

@benchmark bin(x, y, xbin_edges; method)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  54.385 μs …   9.381 ms  ┊ GC (min … max):  0.00% … 98.89%
Time  (median):     60.545 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   77.374 μs ± 283.877 μs  ┊ GC (mean ± σ):  14.39% ±  3.94%

▂▅▇█▆▄▄▄▅▄▄▄▃▂▁▁▁▁ ▁                                         ▂
███████████████████████▇▆▇▇▆▇▅▆▇▆▆▅▅▅▆▅▆▄▅▅▆▆▅▅▆▄▅▆▅▄▄▄▄▂▄▅▅ █
54.4 μs       Histogram: log(frequency) by time       146 μs <

Memory estimate: 97.88 KiB, allocs estimate: 278.

@benchmark bin_views(x, y, xbin_edges; method)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  55.784 μs …  12.647 ms  ┊ GC (min … max):  0.00% … 98.98%
Time  (median):     65.115 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   85.777 μs ± 384.235 μs  ┊ GC (mean ± σ):  16.53% ±  3.69%

▂▅▇█▆▄▅▅▆▅▄▂▁▁▂▃▃▂▂▂▂▁▁▁                                    ▂
███████████████████████████████████▇▆▆▆▆▅▇▇▇▇▆▆▆▆▆▃▃▅▅▅▅▄▆▅▆ █
55.8 μs       Histogram: log(frequency) by time       149 μs <

Memory estimate: 85.59 KiB, allocs estimate: 332.

@benchmark bin_views_nosortcheck(x, y, xbin_edges; method)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  54.628 μs …  11.040 ms  ┊ GC (min … max):  0.00% … 98.84%
Time  (median):     61.635 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   79.633 μs ± 345.502 μs  ┊ GC (mean ± σ):  15.96% ±  3.69%

▃▆█▇▇▅▄▅▅▅▃▂▁▁▁▁▁                                            ▂
████████████████████▇▆▇▆▇▇▇██▇▇▆▆▆▇▆▇▇▇▆▆▆▆▄▆▄▅▆▆▆▆▆▅▅▆▄▅▅▄▅ █
54.6 μs       Histogram: log(frequency) by time       156 μs <

Memory estimate: 85.59 KiB, allocs estimate: 332.

I also don’t see any significant speed improvement from eliding the issorted test.

Here is my improved version:

function bin2(x::Vector{<:Number}, y::VecOrMat{<:Number}, xbin_edges::Union{Vector{<:Number}, StepRangeLen{}}; method::F=mean) where F <: Function

issorted(xbin_edges; lt = <) || error("xbin_edges not sorted")
size(x,1) == size(y,1) ||  error("x and y must have the same number of rows")

# sort along x for faster binning
if !issorted(x)
p = sortperm(x)
x = x[p]
y = y isa Vector ? y[p] : y[p,:]
end

# initialize outputs
ind = Int[x[end] ≥ be ? findfirst(≥(be), x) : 0 for be in xbin_edges]
n_bin = length(xbin_edges) - 1
x_binned = zeros(eltype(x), n_bin)
y_binned = zeros(eltype(y), n_bin, size(y,2))
bin_count = zeros(Int, size(x_binned))

# find binned values
for i = 1:n_bin
if !iszero(ind[i] * ind[i+1]) && (ind[i] < ind[i+1])
bin_count[i] = ind[i+1] - ind[i]
x_binned[i] = method(@view x[ind[i]:(ind[i+1]-1)])
y_binned[i,:] = method((@view y[ind[i]:(ind[i+1]-1), :]), dims=1)
end
end

return x_binned, y_binned, bin_count
end

x = rand(100); y = rand(100); xbin_edges = 0:0.1:1; method = :mean

ans1 = bin(x, y, xbin_edges)
ans2 = bin2(x, y, xbin_edges)

using LinearAlgebra: norm
@show maximum((norm(ans1[i] - ans2[i]) for i in 1:3)) # 0.0

using BenchmarkTools
@btime bin(\$x, \$y, \$xbin_edges); # 17.600 μs (215 allocations: 15.11 KiB)

@btime bin2(\$x, \$y, \$xbin_edges); # 5.383 μs (125 allocations: 11.22 KiB)

@btime bin(\$x, \$y, \$xbin_edges, method=:median); # 45.700 μs (539 allocations: 28.80 KiB)

@btime bin2(\$x, \$y, \$xbin_edges, method=median); # 29.700 μs (440 allocations: 24.77 KiB)

Note that I’m comparing to the corrected bin function as per the OP’s corrections.

Edit: Corrected type instability reported by @mikmoore , and eliminated a test by declaring y to be VecOrMat

1 Like

As written, the line y = y[p] drops every column of y after the first. You probably didn’t mean to do that.

It also changes the type of y from Array{T,N} to Array{T,1}. Since this change happens in a conditional block, this will cause a type instability. Since it’s a union of concrete types this instability probably isn’t too detrimental to performance, but this still probably wasn’t what was intended.

To fix the stability, you’ll need to ensure that y retains its type throughout this transformation. It isn’t pretty, but something like y = y[ntuple(i -> i==1 ? p : (:), ndims(y))...] would probably accomplish this. I’m sure there’s a better way (maybe using permute), but haven’t bothered to work it out. You could also make several if statements based on ndims(y).

1 Like

After @PeterSimon pointed this out I changed it to:

y = y[p,:]

is that sub-optimal?

Yes, it makes y 2-dimensional if it was initially a vector. See my edited correction.

1 Like

Wow… that incredible… 3X improvement. Any chance you could explain to me what’s going on when you state:

method::F=mean) where F <: Function

@benchmark bin(\$x, \$y, \$xbin_edges; method=:mean)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  18.003 μs …  5.894 ms  ┊ GC (min … max): 0.00% … 99.21%
Time  (median):     19.663 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   22.538 μs ± 81.777 μs  ┊ GC (mean ± σ):  5.10% ±  1.40%

█▇▅
▅███▇▃▂▂▂▂▂▄▅▄▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
18 μs           Histogram: frequency by time        47.2 μs <

Memory estimate: 15.20 KiB, allocs estimate: 218.

@benchmark bin_views(\$x, \$y, \$xbin_edges; method=:mean)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  18.434 μs …  6.070 ms  ┊ GC (min … max): 0.00% … 99.14%
Time  (median):     20.383 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   22.878 μs ± 84.880 μs  ┊ GC (mean ± σ):  5.22% ±  1.40%

█▂█▆▂
▆█████▇▄▄▄▃▃▄▆▇▇▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▂▂▂ ▃
18.4 μs         Histogram: frequency by time        42.7 μs <

Memory estimate: 16.42 KiB, allocs estimate: 272.

@benchmark bin_views_nosortcheck(\$x, \$y, \$xbin_edges; method=:mean)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max):  18.508 μs …  7.606 ms  ┊ GC (min … max): 0.00% … 98.95%
Time  (median):     19.770 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   22.540 μs ± 76.107 μs  ┊ GC (mean ± σ):  3.34% ±  0.99%

▆█▇▅▃▂▁▁▃▄▄▃▂            ▁                                  ▂
██████████████▆▆▅▃▄▄▁▄▁▄███▇▆▇▆▆▅▇▇▇▇▆▇▆▆▆▅▆▆▆▆▅▅▅▄▆▆▅▅▅▆▄▆ █
18.5 μs      Histogram: log(frequency) by time      54.5 μs <

Memory estimate: 16.42 KiB, allocs estimate: 272.

@benchmark bin2(\$x, \$y, \$xbin_edges; method=mean)

BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max):  5.499 μs … 905.726 μs  ┊ GC (min … max):  0.00% … 98.73%
Time  (median):     6.875 μs               ┊ GC (median):     0.00%
Time  (mean ± σ):   7.674 μs ±  27.809 μs  ┊ GC (mean ± σ):  11.93% ±  3.27%

▂▄             ▄█▄▁
▂▃▅██▇▅▄▃▃▃▃▃▃▃▃▃▅████▇▆▅▅▄▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
5.5 μs          Histogram: frequency by time        9.82 μs <

Memory estimate: 11.22 KiB, allocs estimate: 125.

It looks like nearly all of the performance gain came from replacing:

ind = [findfirst(x .>= be) for be in xbin_edges]
with
ind = Int[x[end] ≥ be ? findfirst(≥(be), x) : 0 for be in xbin_edges]

1 Like

Yes, your original vector ind had to be of type Any because it could contain both Int and nothing. My change allows ind to be a Vector{Int} which can be accessed much more efficiently.

Re your earlier question, see the performance tip for specialization on function arguments.

2 Likes