Design of efficient lazy broadcastable operator

Hi,

I am designing a bunch of operators I would like to evaluate lazily on sets of points (grid cells). The problem is that my design pattern still allocates a bit more when broadcasting, and involves me extending the broadcasting behaviour, which I am not very familiar with. Ideally, in the end result I just want to loop once through the grid cells to apply these operations. I am testing the for loop and broadcasting versions and they are worse than a simple function.

The operators are something like that:

import Base.Broadcast: broadcasted
import Base
using BenchmarkTools

struct InsideSheetOp
    xmin::Real
    xmax::Real
function InsideSheetOp(xmin::Real, xmax::Real)
    # do initialisation stuff here in more complicated cases
    new(xmin, xmax)
end
end

function (op::InsideSheetOp)(x::Real)
    if (x < op.xmin) || (x > op.xmax)
        return false
    end
    return true
end

Base.broadcastable(op::InsideSheetOp) = Ref(op)

# Define broadcasting behavior
function Base.Broadcast.broadcasted(op::InsideSheetOp, x)
    #Probably I want in-place operations but I am testing it this way for now
    mask = similar(x, Bool)
    @. mask = !((x < op.xmin || x > op.xmax))
    return mask
end

for example, comparing with a simple implementation that I broadcast:

function in_sheet(x::Real, xmin::Real, xmax::Real)
    if (x < xmin || x > xmax)
        return false
    else
        return true
    end
end

I get for a somewhat large grid:

x = collect(0:0.1:100000)
opsheet = InsideSheetOp(0, 1000)
@benchmark opsheet.($x) #1.2 ms 18 allocs (~1Mb)
@benchmark in_sheet.($x, Ref(0), Ref(1000)) # 296 ΞΌs 4 allocs (~100kb)

And a for loop allocates also:

@benchmark @inbounds for i in eachindex($x)
    $mask[i] = in_sheet($x[i], 0, 1000) #  0 allocs
end
mask2 = similar(x, Bool)
@benchmark @inbounds for i in eachindex($x)
    $mask2[i] = opsheet($x[i]) # lots of allocs
end

Am I overcomplicating things, here? Can anyone suggest improvements or a different approach? Why my functions allocate so much like this?

Thank you.

1 Like

I can’t comment on the specifics of your problem, but may I point out that InsideSheetOp is an abstract container type? It may be possible to increase your performance by using a parametric type instead. See Performance Tips Β· The Julia Language

On second thought: perhaps this is actually related to your problem, as β€˜in_sheet’ will get specialized methods depending on the input types

1 Like

Thank you!

I need to take a closer look at those docs, thanks!

Do you mean doing something like this?:

struct InsideSheetOp{T<:Real}
    xmin::T
    xmax::T
function InsideSheetOp(xmin::T, xmax::T) where T<:Real
    # do initialisation stuff here in more complicated cases
    new{T}(xmin, xmax)
end
end

I get less allocs when broadcasting, but still the allocation size and the execution time are the same.

Yes (although explicitly defining the constructor function is only necessary if you are doing explicit initialisation stuff there). You could even have xmin and xmax have different subtypes of Real here.

I’ve wrapped everything in functions to make sure that everything is properly compiled and benchmarks are properly ran:

import Base.Broadcast: broadcasted
import Base
using BenchmarkTools

struct InsideSheetOp{T<:Real}
    xmin::T
    xmax::T
function InsideSheetOp(xmin::T, xmax::T) where T<:Real
    # do initialisation stuff here in more complicated cases
    new{T}(xmin, xmax)
end
end

function (op::InsideSheetOp)(x::Real)
    if (x < op.xmin) || (x > op.xmax)
        return false
    end
    return true
end

Base.broadcastable(op::InsideSheetOp) = Ref(op)

# Define broadcasting behavior
function Base.Broadcast.broadcasted(op::InsideSheetOp, x)
    #Probably I want in-place operations but I am testing it this way for now
    mask = similar(x, Bool)
    @. mask = !((x < op.xmin || x > op.xmax))
    return mask
end

function in_sheet(x::Real, xmin::Real, xmax::Real)
    if (x < xmin || x > xmax)
        return false
    else
        return true
    end
end

function in_loop(mask, x)
    @inbounds for i in eachindex(x)
        mask[i] = in_sheet(x[i], 0, 1000)
    end
    return mask
end

function op_loop(mask, x, opsheet)
    @inbounds for i in eachindex(x)
        mask[i] = opsheet(x[i])
    end
    return mask
end
function test()
    x = collect(0:0.1:100000)
    opsheet = InsideSheetOp(0, 1000)
    display(@benchmark $opsheet.($x)) 
    display(@benchmark in_sheet.($x, Ref(0), Ref(1000)))

    mask1 = similar(x, Bool)
    display(@benchmark in_loop($mask1, $x))
    mask2 = similar(x, Bool)
    display(@benchmark op_loop($mask2, $x, $opsheet))
    
end

test()

gives

BenchmarkTools.Trial: 4908 samples with 1 evaluation per sample.
 Range (min … max):  929.959 ΞΌs …   3.925 ms  β”Š GC (min … max): 0.00% … 69.72%
 Time  (median):     974.959 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):     1.018 ms Β± 188.800 ΞΌs  β”Š GC (mean Β± Οƒ):  2.37% Β±  7.52%

  β–…β–†β–ˆβ–†β–„β–ƒβ–ƒβ–ƒβ–‚β–β–                                                   ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–…β–…β–„β–„β–…β–ƒβ–„β–β–ƒβ–„β–„β–β–ƒβ–β–β–ƒβ–ƒβ–…β–β–ƒβ–„β–ƒβ–ƒβ–ƒβ–…β–…β–ƒβ–…β–β–…β–„β–…β–ƒβ–…β–…β–ƒβ–…β–…β–…β–„β–…β–…β–„β–… β–ˆ
  930 ΞΌs        Histogram: log(frequency) by time       2.04 ms <

 Memory estimate: 976.67 KiB, allocs estimate: 2.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  267.500 ΞΌs …  3.217 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     274.333 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   284.764 ΞΌs Β± 98.014 ΞΌs  β”Š GC (mean Β± Οƒ):  1.33% Β± 4.33%

  β–… β–„β–ˆβ–†β–†β–†β–„β–„β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–β–β–β–β–β– ▁▁                                     β–‚
  β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–†β–‡β–†β–‡β–†β–†β–†β–†β–†β–…β–†β–…β–…β–…β–„β–…β–…β–„β–„β–„β–…β–† β–ˆ
  268 ΞΌs        Histogram: log(frequency) by time       350 ΞΌs <

 Memory estimate: 126.45 KiB, allocs estimate: 4.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  214.458 ΞΌs …  10.128 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     214.750 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   221.703 ΞΌs Β± 106.146 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ– ▅▂▂▁ ▁▁▁ ▁▂                                                ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–‡β–†β–‡β–‡β–‡β–†β–‡β–‡β–‡β–†β–†β–†β–†β–†β–†β–‡β–†β–†β–†β–†β–†β–†β–†β–†β–…β–†β–…β–…β–…β–…β–…β–„β–„ β–ˆ
  214 ΞΌs        Histogram: log(frequency) by time        271 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  303.666 ΞΌs …  3.494 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     304.166 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   314.105 ΞΌs Β± 69.193 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–‚β–†β–ƒβ–‚β–‚β–‚β–β–‚β–‚   ▁▁ ▁                                            ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–‡β–‡β–†β–†β–‡β–‡β–…β–…β–…β–„β–…β–†β–†β–†β–„β–„β–…β–…β–„β–†β–„β–…β–„β–„ β–ˆ
  304 ΞΌs        Histogram: log(frequency) by time       383 ΞΌs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Edit: Replacing Base.Broadcast.broadcasted by the following reducing the memory estimate at the cost of slightly more allocations:

function Base.Broadcast.broadcasted(op::InsideSheetOp, x)
    #Probably I want in-place operations but I am testing it this way for now
    mask = .!((x .< op.xmin .|| x .> op.xmax))
    return mask
end

gives

BenchmarkTools.Trial: 4752 samples with 1 evaluation per sample.
 Range (min … max):  1.019 ms …   6.999 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     1.031 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   1.051 ms Β± 154.924 ΞΌs  β”Š GC (mean Β± Οƒ):  0.37% Β± 3.00%

  β–…β–†β–ˆβ–ˆβ–‡β–†β–…β–…β–„β–ƒβ–ƒβ–‚β–‚β–β–‚β–‚β–β–β–β– ▁        ▁  ▁                          β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–†β–ˆβ–‡β–ˆβ–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–ˆβ–‡β–‡β–†β–‡β–…β–†β–†β–„β–†β–†β–…β–…β–…β–„β–…β–ƒβ–ƒβ–…β–„β–ƒβ–…β–„ β–ˆ
  1.02 ms      Histogram: log(frequency) by time      1.19 ms <

 Memory estimate: 126.45 KiB, allocs estimate: 4.

Yes, thatΒ΄s the plan, to actually have an initialisation procedure there later, that’s why I am keeping it.

Thank you! I am getting similar results! I was also forgetting to capture ($) the opsheet. Regarding speed, it is still slower than the in_sheet function. Do you think I can still improve on that?

I also get type instability with the broadcast, which I guess also contributes for this large speed difference.

Yeah I’m not sure how good function (op::InsideSheetOp)(x::Real) is for type stability.

If you write

function do_sheet(x::Real, op::InsideSheetOp)
    if (x < op.xmin || x > op.xmax)
        return false
    else
        return true
    end
end

Then display(@benchmark do_sheet.($x, $opsheet)) gives

BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  356.792 ΞΌs …  1.766 ms  β”Š GC (min … max): 0.00% … 78.94%
 Time  (median):     365.000 ΞΌs              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   372.982 ΞΌs Β± 57.514 ΞΌs  β”Š GC (mean Β± Οƒ):  0.77% Β±  3.82%

  β–†β–β–…β–ˆβ–†β–‡β–…β–…β–„β–„β–†β–„β–…β–ƒβ–ƒβ–‚β–‚β–ƒβ–‚β–‚β–‚β– ▁  ▁                                  β–‚
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–‡β–†β–‡β–‡β–†β–†β–†β–…β–†β–…β–†β–…β–…β–…β–…β–†β–†β–†β–†β–…β–‡β–…β–‡β–„β–… β–ˆ
  357 ΞΌs        Histogram: log(frequency) by time       444 ΞΌs <

 Memory estimate: 126.45 KiB, allocs estimate: 4.

Lowering the code indicates the differences:

    display(@code_lowered opsheet(x[1]))
    display(@code_lowered in_sheet(x[1], 0, 1000))
    display(@code_lowered do_sheet(x[1], opsheet))

gives

CodeInfo(
1 ─ %1 = Base.getproperty(op, :xmin)
β”‚   %2 = x < %1
└──      goto #3 if not %2
2 ─      goto #4
3 ─ %5 = Base.getproperty(op, :xmax)
β”‚   %6 = x > %5
└──      goto #5 if not %6
4 β”„      return false
5 ─      return true
)
CodeInfo(
1 ─ %1 = x < xmin
└──      goto #3 if not %1
2 ─      goto #4
3 ─ %4 = x > xmax
└──      goto #5 if not %4
4 β”„      return false
5 ─      return true
)
CodeInfo(
1 ─ %1 = Base.getproperty(op, :xmin)
β”‚   %2 = x < %1
└──      goto #3 if not %2
2 ─      goto #4
3 ─ %5 = Base.getproperty(op, :xmax)
β”‚   %6 = x > %5
└──      goto #5 if not %6
4 β”„      return false
5 ─      return true
)

which means that in_sheet doesn’t need to do Base.getproperty twice.

Now, running one level up (to see the broadcasting behaviour)

    display(@code_lowered opsheet.(x))
    display(@code_lowered in_sheet.(x, 0, 1000))
    display(@code_lowered do_sheet.(x, opsheet))

gives

CodeInfo(
1 ─ %1 = Core.getfield(#self#, :opsheet)
β”‚   %2 = Base.broadcasted(%1, x1)
β”‚   %3 = Base.materialize(%2)
└──      return %3
)
CodeInfo(
1 ─ %1 = Base.broadcasted(Main.in_sheet, x1, x2, x3)
β”‚   %2 = Base.materialize(%1)
└──      return %2
)
CodeInfo(
1 ─ %1 = Base.broadcasted(Main.do_sheet, x1, x2)
β”‚   %2 = Base.materialize(%1)
└──      return %2
)

which I believe shows why do_sheet and in_sheet work better, they don’t have a Core.getfield(#self#, :opsheet) part.

Perhaps someone more versed on Julia internals can verify/complement my brief analysis here?

My preferred way of doing things is via this do_sheet approach, especially if you’ll sometimes be using other operations that InsideSheetOp.

Thanks for the analysis. Yes I agree that your do_sheet is better, and it is easier to code and broadcast. It would be great if the struct properties could go as function arguments to avoid this getproperties call

Just as a matter of style (though possibly also performance), it is more idiomatic to avoid the branch, and simply return the result of the check, like this:

function do_sheet(x::Real, op::InsideSheetOp)
    return (op.min <= x <= op.max)
end

In general, stylistically, one should avoid this pattern

if false
    return false
if true
    return true
end

Just return the boolean value itself.

Yes, my bad, I just lazily copied the function as much as possible.

In general, I typically like to use multiple dispatch rather than broadcasting for functions like do_sheet, especially if you can avoid some repeated operations/computations, e.g.

function do_sheet(x::Vector{<:Real}, op::InsideSheetOp)
     return (x .< op.xmin .|| x .> op.xmax)
end

If InsideSheetOp is a subtype of some abstract type OpType, you can also define a fallback to use broadcasting if no special vector version has been implemented (but whether this works in your context will depend on the problem at hand, of course)

function do_sheet(x::Real, op::OpType)
    # do stuff
end

function do_sheet(x::Vector{<:Real}, op::OpType)
     return do_sheet.(x, op)
end