Design of efficient lazy broadcastable operator

Vitor_Patricio_Canta · April 14, 2025, 8:12am

Hi,

I am designing a bunch of operators I would like to evaluate lazily on sets of points (grid cells). The problem is that my design pattern still allocates a bit more when broadcasting, and involves me extending the broadcasting behaviour, which I am not very familiar with. Ideally, in the end result I just want to loop once through the grid cells to apply these operations. I am testing the for loop and broadcasting versions and they are worse than a simple function.

The operators are something like that:

import Base.Broadcast: broadcasted
import Base
using BenchmarkTools

struct InsideSheetOp
    xmin::Real
    xmax::Real
function InsideSheetOp(xmin::Real, xmax::Real)
    # do initialisation stuff here in more complicated cases
    new(xmin, xmax)
end
end

function (op::InsideSheetOp)(x::Real)
    if (x < op.xmin) || (x > op.xmax)
        return false
    end
    return true
end

Base.broadcastable(op::InsideSheetOp) = Ref(op)

# Define broadcasting behavior
function Base.Broadcast.broadcasted(op::InsideSheetOp, x)
    #Probably I want in-place operations but I am testing it this way for now
    mask = similar(x, Bool)
    @. mask = !((x < op.xmin || x > op.xmax))
    return mask
end

for example, comparing with a simple implementation that I broadcast:

function in_sheet(x::Real, xmin::Real, xmax::Real)
    if (x < xmin || x > xmax)
        return false
    else
        return true
    end
end

I get for a somewhat large grid:

x = collect(0:0.1:100000)
opsheet = InsideSheetOp(0, 1000)
@benchmark opsheet.($x) #1.2 ms 18 allocs (~1Mb)
@benchmark in_sheet.($x, Ref(0), Ref(1000)) # 296 μs 4 allocs (~100kb)

And a for loop allocates also:

@benchmark @inbounds for i in eachindex($x)
    $mask[i] = in_sheet($x[i], 0, 1000) #  0 allocs
end
mask2 = similar(x, Bool)
@benchmark @inbounds for i in eachindex($x)
    $mask2[i] = opsheet($x[i]) # lots of allocs
end

Am I overcomplicating things, here? Can anyone suggest improvements or a different approach? Why my functions allocate so much like this?

Thank you.

JADekker · April 14, 2025, 8:46am

I can’t comment on the specifics of your problem, but may I point out that InsideSheetOp is an abstract container type? It may be possible to increase your performance by using a parametric type instead. See Performance Tips · The Julia Language

On second thought: perhaps this is actually related to your problem, as ‘in_sheet’ will get specialized methods depending on the input types

Vitor_Patricio_Canta · April 14, 2025, 9:11am

Thank you!

I need to take a closer look at those docs, thanks!

Do you mean doing something like this?:

struct InsideSheetOp{T<:Real}
    xmin::T
    xmax::T
function InsideSheetOp(xmin::T, xmax::T) where T<:Real
    # do initialisation stuff here in more complicated cases
    new{T}(xmin, xmax)
end
end

I get less allocs when broadcasting, but still the allocation size and the execution time are the same.

JADekker · April 14, 2025, 9:19am

Yes (although explicitly defining the constructor function is only necessary if you are doing explicit initialisation stuff there). You could even have xmin and xmax have different subtypes of Real here.

JADekker · April 14, 2025, 9:28am

I’ve wrapped everything in functions to make sure that everything is properly compiled and benchmarks are properly ran:

import Base.Broadcast: broadcasted
import Base
using BenchmarkTools

struct InsideSheetOp{T<:Real}
    xmin::T
    xmax::T
function InsideSheetOp(xmin::T, xmax::T) where T<:Real
    # do initialisation stuff here in more complicated cases
    new{T}(xmin, xmax)
end
end

function (op::InsideSheetOp)(x::Real)
    if (x < op.xmin) || (x > op.xmax)
        return false
    end
    return true
end

Base.broadcastable(op::InsideSheetOp) = Ref(op)

# Define broadcasting behavior
function Base.Broadcast.broadcasted(op::InsideSheetOp, x)
    #Probably I want in-place operations but I am testing it this way for now
    mask = similar(x, Bool)
    @. mask = !((x < op.xmin || x > op.xmax))
    return mask
end

function in_sheet(x::Real, xmin::Real, xmax::Real)
    if (x < xmin || x > xmax)
        return false
    else
        return true
    end
end

function in_loop(mask, x)
    @inbounds for i in eachindex(x)
        mask[i] = in_sheet(x[i], 0, 1000)
    end
    return mask
end

function op_loop(mask, x, opsheet)
    @inbounds for i in eachindex(x)
        mask[i] = opsheet(x[i])
    end
    return mask
end
function test()
    x = collect(0:0.1:100000)
    opsheet = InsideSheetOp(0, 1000)
    display(@benchmark $opsheet.($x)) 
    display(@benchmark in_sheet.($x, Ref(0), Ref(1000)))

    mask1 = similar(x, Bool)
    display(@benchmark in_loop($mask1, $x))
    mask2 = similar(x, Bool)
    display(@benchmark op_loop($mask2, $x, $opsheet))
    
end

test()

gives

BenchmarkTools.Trial: 4908 samples with 1 evaluation per sample.
 Range (min … max):  929.959 μs …   3.925 ms  ┊ GC (min … max): 0.00% … 69.72%
 Time  (median):     974.959 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.018 ms ± 188.800 μs  ┊ GC (mean ± σ):  2.37% ±  7.52%

  ▅▆█▆▄▃▃▃▂▁▁                                                   ▁
  █████████████▇▇▇▆▅▅▄▄▅▃▄▁▃▄▄▁▃▁▁▃▃▅▁▃▄▃▃▃▅▅▃▅▁▅▄▅▃▅▅▃▅▅▅▄▅▅▄▅ █
  930 μs        Histogram: log(frequency) by time       2.04 ms <

 Memory estimate: 976.67 KiB, allocs estimate: 2.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  267.500 μs …  3.217 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     274.333 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   284.764 μs ± 98.014 μs  ┊ GC (mean ± σ):  1.33% ± 4.33%

  ▅ ▄█▆▆▆▄▄▃▃▂▂▂▂▁▁▁▁▁▁ ▁▁                                     ▂
  █▇██████████████████████████████▇▇▇▇▇▇▆▇▆▇▆▆▆▆▆▅▆▅▅▅▄▅▅▄▄▄▅▆ █
  268 μs        Histogram: log(frequency) by time       350 μs <

 Memory estimate: 126.45 KiB, allocs estimate: 4.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  214.458 μs …  10.128 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     214.750 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   221.703 μs ± 106.146 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁ ▅▂▂▁ ▁▁▁ ▁▂                                                ▁
  ██████████████████▇▇▇▇▇▇▇▇▇▇▇▆▇▇▇▆▇▇▇▆▆▆▆▆▆▇▆▆▆▆▆▆▆▆▅▆▅▅▅▅▅▄▄ █
  214 μs        Histogram: log(frequency) by time        271 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  303.666 μs …  3.494 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     304.166 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   314.105 μs ± 69.193 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▆▃▂▂▂▁▂▂   ▁▁ ▁                                            ▁
  ██████████████████▇████▇▇▇▇▇▇▇█▇▇▇▇▇▇▇▆▆▇▇▅▅▅▄▅▆▆▆▄▄▅▅▄▆▄▅▄▄ █
  304 μs        Histogram: log(frequency) by time       383 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Edit: Replacing Base.Broadcast.broadcasted by the following reducing the memory estimate at the cost of slightly more allocations:

function Base.Broadcast.broadcasted(op::InsideSheetOp, x)
    #Probably I want in-place operations but I am testing it this way for now
    mask = .!((x .< op.xmin .|| x .> op.xmax))
    return mask
end

gives

BenchmarkTools.Trial: 4752 samples with 1 evaluation per sample.
 Range (min … max):  1.019 ms …   6.999 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.031 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.051 ms ± 154.924 μs  ┊ GC (mean ± σ):  0.37% ± 3.00%

  ▅▆██▇▆▅▅▄▃▃▂▂▁▂▂▁▁▁▁ ▁        ▁  ▁                          ▂
  █████████████████████████▆█▇█▇████▇▇█▇▇▆▇▅▆▆▄▆▆▅▅▅▄▅▃▃▅▄▃▅▄ █
  1.02 ms      Histogram: log(frequency) by time      1.19 ms <

 Memory estimate: 126.45 KiB, allocs estimate: 4.

Vitor_Patricio_Canta · April 14, 2025, 9:38am

Yes, that´s the plan, to actually have an initialisation procedure there later, that’s why I am keeping it.

Thank you! I am getting similar results! I was also forgetting to capture ($) the opsheet. Regarding speed, it is still slower than the in_sheet function. Do you think I can still improve on that?

Vitor_Patricio_Canta · April 14, 2025, 9:45am

I also get type instability with the broadcast, which I guess also contributes for this large speed difference.

JADekker · April 14, 2025, 9:48am

Yeah I’m not sure how good function (op::InsideSheetOp)(x::Real) is for type stability.

If you write

function do_sheet(x::Real, op::InsideSheetOp)
    if (x < op.xmin || x > op.xmax)
        return false
    else
        return true
    end
end

Then display(@benchmark do_sheet.($x, $opsheet)) gives

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  356.792 μs …  1.766 ms  ┊ GC (min … max): 0.00% … 78.94%
 Time  (median):     365.000 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   372.982 μs ± 57.514 μs  ┊ GC (mean ± σ):  0.77% ±  3.82%

  ▆▁▅█▆▇▅▅▄▄▆▄▅▃▃▂▂▃▂▂▂▁ ▁  ▁                                  ▂
  ███████████████████████████████▇▇▇▇▇▆▇▇▆▆▆▅▆▅▆▅▅▅▅▆▆▆▆▅▇▅▇▄▅ █
  357 μs        Histogram: log(frequency) by time       444 μs <

 Memory estimate: 126.45 KiB, allocs estimate: 4.

JADekker · April 14, 2025, 9:52am

Lowering the code indicates the differences:

    display(@code_lowered opsheet(x[1]))
    display(@code_lowered in_sheet(x[1], 0, 1000))
    display(@code_lowered do_sheet(x[1], opsheet))

gives

CodeInfo(
1 ─ %1 = Base.getproperty(op, :xmin)
│   %2 = x < %1
└──      goto #3 if not %2
2 ─      goto #4
3 ─ %5 = Base.getproperty(op, :xmax)
│   %6 = x > %5
└──      goto #5 if not %6
4 ┄      return false
5 ─      return true
)
CodeInfo(
1 ─ %1 = x < xmin
└──      goto #3 if not %1
2 ─      goto #4
3 ─ %4 = x > xmax
└──      goto #5 if not %4
4 ┄      return false
5 ─      return true
)
CodeInfo(
1 ─ %1 = Base.getproperty(op, :xmin)
│   %2 = x < %1
└──      goto #3 if not %2
2 ─      goto #4
3 ─ %5 = Base.getproperty(op, :xmax)
│   %6 = x > %5
└──      goto #5 if not %6
4 ┄      return false
5 ─      return true
)

which means that in_sheet doesn’t need to do Base.getproperty twice.

Now, running one level up (to see the broadcasting behaviour)

    display(@code_lowered opsheet.(x))
    display(@code_lowered in_sheet.(x, 0, 1000))
    display(@code_lowered do_sheet.(x, opsheet))

gives

CodeInfo(
1 ─ %1 = Core.getfield(#self#, :opsheet)
│   %2 = Base.broadcasted(%1, x1)
│   %3 = Base.materialize(%2)
└──      return %3
)
CodeInfo(
1 ─ %1 = Base.broadcasted(Main.in_sheet, x1, x2, x3)
│   %2 = Base.materialize(%1)
└──      return %2
)
CodeInfo(
1 ─ %1 = Base.broadcasted(Main.do_sheet, x1, x2)
│   %2 = Base.materialize(%1)
└──      return %2
)

which I believe shows why do_sheet and in_sheet work better, they don’t have a Core.getfield(#self#, :opsheet) part.

Perhaps someone more versed on Julia internals can verify/complement my brief analysis here?

My preferred way of doing things is via this do_sheet approach, especially if you’ll sometimes be using other operations that InsideSheetOp.

Vitor_Patricio_Canta · April 14, 2025, 10:23am

Thanks for the analysis. Yes I agree that your do_sheet is better, and it is easier to code and broadcast. It would be great if the struct properties could go as function arguments to avoid this getproperties call

DNF · April 14, 2025, 12:05pm

Just as a matter of style (though possibly also performance), it is more idiomatic to avoid the branch, and simply return the result of the check, like this:

function do_sheet(x::Real, op::InsideSheetOp)
    return (op.min <= x <= op.max)
end

In general, stylistically, one should avoid this pattern

if false
    return false
if true
    return true
end

Just return the boolean value itself.

JADekker · April 14, 2025, 12:29pm

Yes, my bad, I just lazily copied the function as much as possible.

In general, I typically like to use multiple dispatch rather than broadcasting for functions like do_sheet, especially if you can avoid some repeated operations/computations, e.g.

function do_sheet(x::Vector{<:Real}, op::InsideSheetOp)
     return (x .< op.xmin .|| x .> op.xmax)
end

If InsideSheetOp is a subtype of some abstract type OpType, you can also define a fallback to use broadcasting if no special vector version has been implemented (but whether this works in your context will depend on the problem at hand, of course)

function do_sheet(x::Real, op::OpType)
    # do stuff
end

function do_sheet(x::Vector{<:Real}, op::OpType)
     return do_sheet.(x, op)
end

Topic		Replies	Views
Lazy broadcasting speed issues Performance broadcasting , lazy-evaluation , lazy , lazyarray	13	727	March 12, 2023
Confusion on performance when using the broadcasting macro @. vs explicit . operators Performance	7	167	March 27, 2025
Why does this broadcast operation require specialization for optimal performance? Performance broadcast , array	7	467	February 15, 2023
Performance of simple broadcasting operations with many arguments Performance performance , broadcast	15	1592	November 29, 2021
Blog post: Loop fusion and vectorization in Julia 0.6 Internals & Design announcement , broadcast	28	8417	May 4, 2017

Design of efficient lazy broadcastable operator

Related topics