How is it possible that parallelized code causes fewer allocations?

This case baffles me:

using StructArrays
n, s = 10^5, 100;
llp = [rand(1:n,s) for _=1:n];
llw = [rand(-1:0.001:1,s) for _=1:n]

function sort_vec!(pp::Vector{tp}, ww::Vector{tw}, by::Function=first) ::Nothing  where {tp<:Integer, tw}
    sort!(StructArray((pp,ww)), by=by, alg=QuickSort)
return nothing end;

function sort_prl0!(llp::Vector{Vector{tp}}, llw::Vector{Vector{tw}}, by::Function=first) ::Nothing  where {tp<:Integer, tw}
    @inbounds for j=1:length(llp)  sort_vec!(llp[j], llw[j], by) end; 
return nothing end;

function sort_prl1!(llp::Vector{Vector{tp}}, llw::Vector{Vector{tw}}, by::Function=first) ::Nothing  where {tp<:Integer, tw}
    @inbounds Threads.@threads for j=1:length(llp)  sort_vec!(llp[j], llw[j], by) end;   
return nothing end;

@time sort_vec!(llp[1], llw[1])
  0.000011 seconds
@time sort_prl0!(llp, llw)
  0.074360 seconds (100.00 k allocations: 3.052 MiB)
@time sort_prl1!(llp, llw)
  0.017991 seconds (52 allocations: 5.297 KiB)

If sort_vec! causes 0 allocations, so should sort_prl0!, yet it makes 100k allocations. How can I find the cause of that, and how can I fix it so that it causes 0 allocations?

I also used @btime and @code_warntype:

julia> @btime sort_prl0!($llp, $llw)
  76.797 ms (100000 allocations: 3.05 MiB)
julia> @code_warntype sort_prl0!(llp, llw)
MethodInstance for sort_prl0!(::Vector{Vector{Int64}}, ::Vector{Vector{Float64}})
  from sort_prl0!(llp::Array{Vector{tp}, 1}, llw::Array{Vector{tw}, 1}) where {tp<:Integer, tw} @ Main REPL[79]:1
Static Parameters
  tp = Int64
  tw = Float64
Arguments
  #self#::Core.Const(Main.sort_prl0!)
  llp::Vector{Vector{Int64}}
  llw::Vector{Vector{Float64}}
Body::Nothing
1 ─ %1 = (#self#)(llp, llw, Main.first)::Core.Const(nothing)
└──      return %1

Not sure how to use this.


I suspect this to be caused by inlining happening in the threaded case but not the serial case. Try adding @inline in front of the definition of sort_vec!.

I tried putting @inline before sort! inside sort_vec! and before sort_vec! inside sort_prl0!, but it did not help.

@threads implicitly creates a closure, which probably serves to introduce a function barrier. This probably also allows the function to specialize on the exact type of by, which would likely permit automatic inlining. The reason this likely doesn’t happen in the serial case is because julia doesn’t specialize on Function arguments when they’re just passed through to another function, instead of being called directly.

Be aware that @code_warntype for example always shows specialized code, even if Julia otherwise would not specialize a given call.
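
One way to check what Julia actually specialized (as opposed to what @code_warntype shows you) is to list the method's compiled specializations; this is a sketch, and the exact display varies between Julia versions:

julia> [s for m in methods(sort_prl0!) for s in m.specializations if !isnothing(s)]

Roughly, if by only ever appears as ::Function rather than a concrete ::typeof(first) in these MethodInstances, the call was not specialized on it.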


Hmm, so how should I restructure sort_prl0! to cause 0 allocs?

If it’s the specialization heuristic, then the official workaround is to introduce a type parameter, say by::F=first and add F or F<:Function to the where clause.


Wonderful!

using StructArrays, Polyester
n, s = 10^5, 100;
llp, llw = [rand(1:n,s) for _=1:n], [rand(-1:0.001:1,s) for _=1:n];

function sort_vec!(pp::Vector{tp}, ww::Vector{tw}, by::F=first) ::Nothing  where {tp<:Integer,tw,F<:Function}
    @inline sort!(StructArray((pp,ww)), by=by, alg=QuickSort)
return nothing end;

function sort_prl0!(llp::Vector{Vector{tp}}, llw::Vector{Vector{tw}}, by::F=first) ::Nothing  where {tp<:Integer,tw,F<:Function}
    @inbounds for j=1:length(llp)  @inline sort_vec!(llp[j], llw[j], by) end; 
return nothing end;

function sort_prl1!(llp::Vector{Vector{tp}}, llw::Vector{Vector{tw}}, by::F=first) ::Nothing  where {tp<:Integer,tw,F<:Function}
    @inbounds Threads.@threads for j=1:length(llp)  sort_vec!(llp[j], llw[j], by) end;   
return nothing end;
function sort_prl2!(llp::Vector{Vector{tp}}, llw::Vector{Vector{tw}}, by::F=first) ::Nothing  where {tp<:Integer,tw,F<:Function}
    @inbounds Polyester.@batch for j=1:length(llp)  sort_vec!(llp[j], llw[j], by) end;   
return nothing end;
@time sort_vec!(llp[1], llw[1])
  0.000012 seconds
@time sort_prl0!(llp, llw)
  0.046306 seconds
@time sort_prl1!(llp, llw)
  0.015499 seconds (52 allocations: 5.297 KiB)
@time sort_prl2!(llp, llw)
  0.017102 seconds (1 allocation: 32 bytes)

So the answers of Sukera and Benny solved my issue.

Any idea why there are these additional allocations when multithreading?

There’s a little bit of overhead for creating the Tasks that are run on multiple threads, as well as some data structures for managing those until they’re done.
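
You can see that overhead in isolation: even a threaded loop with an empty body allocates (a rough check; exact counts depend on the number of threads and the Julia version):

@time Threads.@threads for _ in 1:10 end   # allocates one Task per thread plus scheduling bookkeeping, despite doing no work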


One more question regarding performance, @Sukera @Benny. Will
function f(x::AbstractVector{T}) where {T}
have the same performance as
function f(x::Vector{T}) where {T}
?

I’m asking mainly because I need my function to work on Vectors, views, StrideArraysCore.PtrArrays and don’t want to write too many methods of the same function with the same body. I was hoping that julia would automatically specialize on each of these 3.

Is there any overhead if in tight loops I run f(view(x, i:j)) instead of f(x, i, j)?

This is the essence of how Julia works. The annotations on the function parameters are only to control dispatch and have no effect on performance.
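
For example (a quick sketch with hypothetical names f_abs and f_con), both definitions compile to identical specialized code when called with a concrete Vector:

f_abs(x::AbstractVector{T}) where {T} = sum(x)
f_con(x::Vector{T}) where {T} = sum(x)
# @code_native f_abs(rand(100)) and @code_native f_con(rand(100)) print the same
# body: each call specializes on the concrete argument type Vector{Float64}.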

Tight loops over views are fine if the view is of a contiguous batch of memory. Otherwise you may have some overhead. But for vectors that’s, essentially, always the case.
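
A rough way to check this yourself (a sketch; timings will vary by machine):

using BenchmarkTools
x = rand(10^6);
@btime sum($x);                  # baseline over the whole vector
@btime sum(view($x, 1:10^6));    # contiguous view of the same memory: essentially the same time
@btime sum(view($x, 1:2:10^6));  # strided view: half the elements, but non-contiguous access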


Is this style of programming good for performance:

function _f1(x::Vector{T}) ::Int  where {T}  ... end
function _f2(x::Vector{T}) ::Int  where {T}  ... end
function _g(f::F, x::Vector{T})  where {F<:Function, T} ... f(x) ... end
function h(x::Vector{T}; mtd::Int=1)  where {T} 
   @assert mtd in (1,2)
   f = (_f1, _f2)[mtd] 
   _g(f, x) end

?

My internal functions _f1 and _f2 compute the same thing in 2 different ways (one sacrifices space to be faster, the other vice-versa). The internal _g uses either _f1 or _f2; since I pass f as F<:Function, each call _g(_f1, x) and _g(_f2, x) should compile separately (i.e. specialize). In the end, the top non-internal function h just decides which method/algorithm to use with the switch mtd::Int.

I want the end result h to use mtd, because all my top functions are structured this way, as it is a unified framework (mtd = method of computation or algorithm, prl = method of parallelization, etc).

Is this a good approach? Will h(x; mtd=1) and h(x; mtd=2) compile to 2 different method instances?

No, it won’t, because dispatch won’t occur for the “values” of mtd.

I think you’d be better off with an if... clause, because the way you now choose the function may cause a type instability in f, although in this case the compiler may be smart enough to avoid it.

For really compiling two different h methods you need to provide the information on which method to use in the type of the input of h. This is done by using Val types or creating auxiliary types for dispatch, like struct Method1 end to dispatch on ::Type{Method1}.
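
For example, with Val (a minimal sketch; _f1, _f2 and _g are the functions from your post):

function h(x::Vector{T}; mtd::Int=1)  where {T}
    return h(Val(mtd), x)             # lift the runtime value into a type
end
h(::Val{1}, x) = _g(_f1, x)           # each Val gets its own method instance
h(::Val{2}, x) = _g(_f2, x)
h(::Val{M}, x) where {M} = error("mtd=$M is not supported.")

Note that Val(mtd) with a runtime mtd still costs one dynamic dispatch at the boundary; everything inside each Val method is then fully specialized.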

Here are a couple of points for you to ponder:

  • Check the docstring of @assert - you’re relying on this to catch erroneous calls like h([1,2]; mtd=3), but that may not work. @assert is not guaranteed to run, and may be optimized out.
  • You’re basically implementing manual dispatch, but why not expose _g as the external API directly? That way, you don’t need a mtd parameter at all, since passing the function/method of computation directly is just as good performance-wise. If you’re worried about external callers passing in “unwanted” functions, you can use a f in (func1, func2) || throw(ArgumentError("Only func1 or func2 allowed!")) check to guard against that (see the sketch after this list).
  • As for prl - using an if and branching accordingly is likely going to be easier to maintain in the long run, rather than shoe-horning this into dispatch. Since the value will be user-supplied, that dispatch is likely going to be dynamic either way, so an if saves you a few calls into the runtime.
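
A minimal sketch of the guard from the second point (using the _f1/_f2/_g names from your post):

function g(f::F, x::Vector{T})  where {F<:Function, T}
    f in (_f1, _f2) || throw(ArgumentError("Only _f1 or _f2 allowed!"))
    return _g(f, x)
end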

Hmm, wrapping each of my many internal functions into a struct just to trigger specialization seems tedious.

And calling h(x, _my_very_long_internal_function_for_h_that_I_forget_every_time) instead of h(x, mtd=2) is undesirable, as I have hundreds of functions, and remembering the exact name of each internal function is unnecessary baggage. I have switches mtd, prl, chk, cnv, srt, ord which have the same meaning but are used in many top functions. Having these abbreviations makes it much more manageable to use the top-level API and observe performance, debug, improve, verify. If on top of everything else I also have to think about the exact names of internal functions, it will slow me down and be unpleasant to use in other applications.

@assert is not guaranteed to run, and may be optimized out.

help?> @assert
  @assert cond [text]

  Throw an AssertionError if cond is false. This is the preferred syntax for writing assertions, which are conditions that are assumed to be true, but that the user might decide to check anyways, as an aid to debugging if they fail. The optional message text is displayed upon assertion failure.

  │ Warning
  │
  │  An assert might be disabled at some optimization levels. Assert should therefore only be used as a debugging tool and not used for authentication verification (e.g., verifying passwords or checking array bounds). The code
  │  must not rely on the side effects of running cond for the correct behavior of a function.

Hmm, it is unclear to me when @assert might be switched off. Is it just when I manually decide to switch it off to observe unhindered performance, or can it happen unintentionally?

In any case, how do I use mtd to have efficient code? Is this better?

function h(x::Vector{T}; mtd::Int=1)  where {T} 
   if     mtd==1  f = _f1
   elseif mtd==2  f = _f2
   else error("mtd=$mtd is not supported.") end
   _g(f, x) 
end

Or do I have to call

function h(x::Vector{T}; mtd::Int=1)  where {T} 
   if     mtd==1  _g(_f1, x)
   elseif mtd==2  _g(_f2, x)
   else error("mtd=$mtd is not supported.") end
end

I am asking because if I use f in many places inside h, the second option would increase the code quantity by a lot, so I’m trying to avoid it somehow.

Is there another way to use mtd to force specialization based on its values?

If I create global empty structs

abstract type Mtd end
struct mtd0 <: Mtd end
struct mtd1 <: Mtd end

and then structure my code as

function _f1(x::Vector{T}) ::Int  where {T}  ... end
function _f2(x::Vector{T}) ::Int  where {T}  ... end
function _g(::Type{mtd0}, x::Vector{T})  where {T} ... _f1(x) ... end
function _g(::Type{mtd1}, x::Vector{T})  where {T} ... _f2(x) ... end
function h(mtd::Type{<:Mtd}, x::Vector{T})  where {T} 
   _g(mtd, x) end

Would this have any performance benefits (type stability) over my initial proposal with mtd::Int?

The annoying thing is that if the body of _g is big, it must be duplicated, even though the only difference is just using _f1 or _f2.

That’s what I meant.

You can still use the version that passes the function around, just changing the call to h to take one of these types instead of the integer method identifier.

(You can even keep the original if…else structure; now that the function used is determined by dispatch, the compiler will almost surely remove the branches that are never reached.)

Now, note that the simple if… branch with the original integer method identifier would be quite fast, and the (performance) advantages of controlling this by dispatch will probably be relevant only if these functions are called from within a very tight loop where operations can be vectorized, thus only if the inner functions are very simple.
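
Concretely, a sketch that combines both ideas (_pick is a hypothetical helper; mtd0, mtd1, _f1, _f2 and _g are from your post): dispatch on the type at the boundary, but keep a single _g body by passing the function through:

_pick(::Type{mtd0}) = _f1
_pick(::Type{mtd1}) = _f2
h(mtd::Type{<:Mtd}, x::Vector) = _g(_pick(mtd), x)   # one _g body, specialized per inner function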


You don’t need to wrap functions in structs to get specializations; each function in julia has its own type, and (apart from a few exceptions) julia will already specialize your code based on that.
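
For example:

typeof(sum)                   # typeof(sum), a singleton subtype of Function
typeof(sum) === typeof(prod)  # false: every function has its own concrete type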

Maybe a more concrete example would be more helpful? If you really do have hundreds of such internal functions that you don’t want to pass around explicitly, how can you remember the (presumably) hundreds of possible parameters for mtd? Wouldn’t a clearer, more descriptive name be easier to recall?

Presumably this large number of internal functions stems from the combinatorial explosion between your many parameters, and not from a large number of distinct operations.

Purely guessing, but do these shorthands correspond to

  • mtd → method
  • prl → parallel (Boolean?)
  • chk → chunk (size, Integer?)
  • cnv → No clue
  • srt → sort (sorting the output, Boolean?)
  • ord → order

? These all seem extremely generic, so I’m wondering why the implementations of these features (e.g. parallelization) need to be bespoke to each different mtd. This seems like a good candidate for employing something like ThreadsX.jl and ChunkSplitters.jl, as well as just relying on Base.sort in your high-level API. That would probably decrease the number of internal functions substantially, by reusing/sharing more code.

Moreover, you don’t have to create a method for all combinations of these - you could for example have a high-level function that takes just the kernel function (sort_vec! in your original example) that you want to compare/benchmark/test instead of a mtd parameter, internally switch to run_parallel/run_chunked/etc. based on those high-level switches, and pass the kernel function to that, as sketched below. Julia will specialize just fine on that.
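
To sketch what that could look like (all names here are hypothetical except sort_vec!):

run_serial!(kernel!, llp, llw) = foreach(kernel!, llp, llw)
function run_threaded!(kernel!, llp, llw)
    Threads.@threads for j in eachindex(llp)
        kernel!(llp[j], llw[j])
    end
end
# One high-level entry point: branch on the user-facing switch, pass the kernel through.
function apply!(kernel!::F, llp, llw; prl::Bool=false)  where {F<:Function}
    return prl ? run_threaded!(kernel!, llp, llw) : run_serial!(kernel!, llp, llw)
end
# Usage: apply!(sort_vec!, llp, llw; prl=true)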

The point I’m trying to make is that @assert doesn’t protect the array indexing further down like you think it does. The compiler doesn’t explicitly take it into consideration when removing the automatic bounds checks, for example, so adding it doesn’t improve performance. It’s only a debugging tool.

If you’re relying on that @assert for correctness of your program, you may end up with out-of-bounds reads, segfaults or other nastiness in case it does get removed (either by you or a user of your code). See Do something about `@assert` · Issue #51729 · JuliaLang/julia · GitHub and RFC: implement proper debug mode for packages (with support for re-precompilation) by KristofferC · Pull Request #37874 · JuliaLang/julia · GitHub for more information.


If you really do have hundreds of such internal functions that you don’t want to pass around explicitly, how can you remember the (presumably) hundreds of possible parameters for mtd? Wouldn’t a clearer, more descriptive name be easier to recall?

I’m attempting to implement sparse matrix types that are not in SparseArrays.jl. I’m also interested in parallelizing operations, like +, *, permuting and slicing columns/rows, etc.

do these shorthands correspond

Interesting, you guessed almost all of them :). It seems 3 letters are descriptive enough, and just to my liking, as I find overly long names tedious.

  • mtd = method of computation, e.g. either precompute the necessary RAM and allocate it, or allocate more RAM than necessary and then resize! to a smaller length.
  • prl = method of parallelization, e.g. 0 = no parallelization, 1 = Base (@threads, @spawn), 2 = ThreadsX, 3 = Folds, 4 = FLoops, 5 = Polyester.
  • srt = sorting of entries (w.r.t. positions or weights) in each column/row.
  • ord = order (permute) rows/cols w.r.t. some criterion.
  • chk = check correctness (specify how much verification to carry out, 0 means no checking and computation at full speed). This is similar to @assert, which has only 2 levels (on or off).
  • cnv = when and how to convert matrix to a different type.

For instance, in a product of sparse matrices, there are several ways to approach just the non-parallel computation. Even in the official implementation SparseArrays.jl/src/linalg.jl at 4fd3aad5735e3b80eefe7b068f3407d7dd0c0924 · JuliaSparse/SparseArrays.jl · GitHub,
the comment for prefer_sort in line 553 says

Determine if sort! shall be used or the whole column be scanned based on empirical data on i7-3610QM CPU. Measuring runtimes of the scanning and sorting loops of the algorithm. The parameters 6 and 3 might be modified for different architectures.

Ideally, these hidden internal settings would be the default values of a kwarg, that could, however, be easily changed to observe how that impacts the performance. Also, I’d want the product to have the option of using mtd=1 (sorting and aggregating the entries), mtd=2 (adding entries into a dense vector and extracting nonzero ones), or mtd=3 (smart combination where at the right threshold I switch from 1 to 2, like it is in the official implementation). If I had such mtds, I could plot the runtime for 1 and for 2 while the density of the input matrices increases, to accurately choose the right threshold on my computer.

Regarding @assert, I never mistook it for some performance enhancement; I was merely using it as I use assert in Python. Could you share some MWE that demonstrates how it fails to protect me from passing an index that is out of bounds? For instance, how could this fail?

function tmp(k::Int) ::Char
    @assert k in (1,2,3)
    return ('a', 'b', 'c')[k] end

Regarding specialization (the main question of this post), what if I refactor like this:

using StructArrays
abstract type Mtd end
struct mtd0 <: Mtd end
struct mtd1 <: Mtd end
abstract type Prl end
struct prl0 <: Prl end
struct prl1 <: Prl end
function _f!(pp::Vector{tp}, ww::Vector{tw}, ::Type{mtd0}) ::Nothing  where {tp,tw}  
    sort!(StructArray((pp,ww)), by=first, alg=QuickSort);  return nothing end
function _f!(pp::Vector{tp}, ww::Vector{tw}, ::Type{mtd1}) ::Nothing  where {tp,tw}  
    sort!(StructArray((pp,ww)), by=last,  alg=QuickSort);  return nothing end
function g!(llp::Vector{Vector{tp}}, llw::Vector{Vector{tw}}, mtd::Type{<:Mtd}, ::Type{prl0}) ::Nothing  where {tp,tw}  
    @inbounds for j=1:length(llp) _f!(llp[j], llw[j], mtd) end end
function g!(llp::Vector{Vector{tp}}, llw::Vector{Vector{tw}}, mtd::Type{<:Mtd}, ::Type{prl1}) ::Nothing  where {tp,tw}  
    @inbounds Threads.@threads for j=1:length(llp) _f!(llp[j], llw[j], mtd) end end
m, n, s = 10^4, 10^5, 5;                            # mxn sparse matrix with ≤s entries in each column
llp = [rand(1:m,rand(0:s)) for _=1:n];              # list of lists of positions 
llw = [rand(-1:0.001:1,length(pp)) for pp in llp];  # list of lists of weights (values) 
@time g!(llp, llw, mtd0, prl0)
@time g!(llp, llw, mtd1, prl0)
@time g!(llp, llw, mtd0, prl1)
@time g!(llp, llw, mtd1, prl1)

julia> methods(g!)
# 2 methods for generic function "g!" from Main:
 [1] g!(llp::Array{Vector{tp}, 1}, llw::Array{Vector{tw}, 1}, mtd::Type{<:Mtd}, ::Type{prl1}) where {tp, tw}
     @ REPL[20]:1
 [2] g!(llp::Array{Vector{tp}, 1}, llw::Array{Vector{tw}, 1}, mtd::Type{<:Mtd}, ::Type{prl0}) where {tp, tw}
     @ REPL[19]:1

julia> [s for m in methods(g!) for s in m.specializations if !isnothing(s)]
4-element Vector{Core.MethodInstance}:
 MethodInstance for g!(::Vector{Vector{Int64}}, ::Vector{Vector{Float64}}, ::Type{mtd0}, ::Type{prl1})
 MethodInstance for g!(::Vector{Vector{Int64}}, ::Vector{Vector{Float64}}, ::Type{mtd1}, ::Type{prl1})
 MethodInstance for g!(::Vector{Vector{Int64}}, ::Vector{Vector{Float64}}, ::Type{mtd0}, ::Type{prl0})
 MethodInstance for g!(::Vector{Vector{Int64}}, ::Vector{Vector{Float64}}, ::Type{mtd1}, ::Type{prl0})

I guess this is just what I wanted: each combination of mtd and prl gives its own compiled function, right?

@Sukera @Imiq If _f! were some simple function that is executed many, many times, does having the ::Type{mtd1} argument, as opposed to not having it and just using one method, give the same performance?

Have you seen Finch.jl?


@abraemer Very interesting, will give it a try! TY