Poor performance of `lock(l) do` closures, caused by poor performance of capturing closures

NHDaly · June 25, 2020, 8:51pm

Multithreading is hard.

Correctly and safely locking and unlocking can be hard.

For these reasons, we prefer abstractions that help us do things safely and correctly. One such abstraction is the lock() method that takes a function argument:

lock(f::Function, lock)
Acquire the lock, execute f with the lock held, and release the lock when f returns. If the lock is already locked by a different task/thread, wait for it to become available.

When this function returns, the lock has been released, so the caller should not attempt to unlock it.

Used like this:

lock(l) do
    ...
end

This guarantees that we always correctly unlock the lock whenever we exit the computation by any means, whether we return or an exception is thrown. Which is great.

However, as discussed before (here: Performance of closures, and here: JuliaLang/julia#15276), it’s easy for this to perform poorly. From what i can tell, this introduces allocations if the lambda captures any non-isbits objects.

This is unfortunate and unexpected because I would have expected the lambda to be completely compiled away - it’s anonymous, and only ever passed to the one function where it’s invoked, so I would have expected this all to inline away. However, instead it causes allocations:

EDIT: See the thread below for a better example with nontrivial allocations

julia> const l = ReentrantLock()
ReentrantLock(nothing, Base.GenericCondition{Base.Threads.SpinLock}(Base.InvasiveLinkedList{Task}(nothing, nothing), Base.Threads.SpinLock(0)), 0)

julia> function f_locked(x::Vector)
           lock(l) do
               push!(x, 2)
           end
       end
f_locked (generic function with 1 method)

julia> function f(x::Vector)
           push!(x, 1)
       end
f (generic function with 1 method)

julia> @time f([])
  0.000002 seconds (2 allocations: 128 bytes)
1-element Array{Any,1}:
 1

julia> @time f_locked([])
  0.000008 seconds (3 allocations: 144 bytes)
1-element Array{Any,1}:
 2

That seems small, but locks tend to show up in very performance critical inner-loop type code, and if we lock and unlock several times in the same function, this adds up quite quickly.

But if the lock/unlock is written manually, there’s no overhead:

julia> function f_locked_manually(x::Vector)
           lock(l)
           try
               push!(x, 2)
           finally
               unlock(l)
           end
       end
f_locked_manually (generic function with 1 method)

julia> @time f_locked_manually([])
  0.000005 seconds (2 allocations: 128 bytes)
1-element Array{Any,1}:
 2

# So, now my questions:

Is this a known thing? Has it always been like this, or was there a regression at some point where this stopped performing as well?
What is the recommended workaround?

# Proposed Workaround: A `@lock` macro

Regarding workarounds, at RelationalAI we've started using this macro, instead, which works perfectly well since in fact this is an entirely syntactic transformation:

macro lock(l, ex)
    quote
        local lk = $(esc(l))
        lock(lk)
        try
            $(esc(ex))
        finally
            unlock(lk)
        end
    end
end

@lock l begin
    ...
end

Is there anything we’re missing why this isn’t a good idea? If it seems okay, is this something we should spread more widely, and everyone should just always prefer to do this? Should we put this in base, and recommend it instead of the closure form? (other reasonable names include @withlock or @locked, etc)

Thanks all!
Cheers,
~Nathan

Mason · June 25, 2020, 9:19pm

First of all, is there a reason you’re not using BenchmarkTools.jl for this? I’m not sure such quickly running tests are reliable with @time. On Julia 1.5-beta1, and using BenchmarkTools.jl, I still see a significant overhead to f_locked_manually:

for f ∈ (f, f_locked, f_locked_manually)
    println("Benchmarking $f")
    @btime $f(v) setup=(v=[])
end
println("")
versioninfo()

#+RESULTS:
Benchmarking f
  7.858 ns (0 allocations: 16 bytes)
Benchmarking f_locked
  47.520 ns (0 allocations: 16 bytes)
Benchmarking f_locked_manually
  48.987 ns (0 allocations: 16 bytes)

Julia Version 1.5.0-beta1.0
Commit 6443f6c95a* (2020-05-28 17:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600 Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver1)
Environment:
  JULIA_NUM_THREADS = 6

I don’t think the performance of captured variables problem is relevant here because v isn’t being redefined or changed, it’s being mutated which shouldn’t have any performance overhead.

NHDaly · June 25, 2020, 9:28pm

@Mason

Sorry, no i was just being lazy. I didn’t use BenchmarkTools because this was such a simple benchmark.

But i’m not sure what’s going on in your example. I still see the same allocations difference with @btime:

julia> @btime f($([]));
  11.406 ns (0 allocations: 0 bytes)

julia> @btime f_locked($([]));
  75.739 ns (1 allocation: 16 bytes)

julia> @btime f_locked_manually($([]));
  66.458 ns (0 allocations: 0 bytes)

NHDaly · June 25, 2020, 9:29pm

Hmm, though it could be that i’m on an old version of julia…:

julia> versioninfo()
Julia Version 1.4.1
Commit efc7879aae (2020-05-09 02:48 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8
  JULIA_DEPOT_PATH = /Users/daly/Documents/work/rai/raicode-clean/../jl_depots/raicode-clean

EDIT:
Oh indeed! Well, there you have it. This is solved in later versions of julia:


julia> @btime f($([]));
  11.878 ns (0 allocations: 0 bytes)

julia> @btime f_locked($([]));
  65.186 ns (0 allocations: 0 bytes)

julia> @btime f_locked_manually($([]));
  65.341 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 1.6.0-DEV.59
Commit 84dd4dda5f* (2020-05-18 13:29 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

– okay, but imagine I am doing things like re-assigning references or something?

Mason · June 25, 2020, 9:32pm

For an example that does invoke the performance of captured variables problem, here’s a MWE:

function g(x)
    x = x + 1
end

function g_locked(x)
    lock(l) do
        x = x + 1
    end
end

function g_locked_manually(x)
    lock(l)
    try
        x = x + 1
    finally
        unlock(l)
    end
end

for f ∈ (g, g_locked, g_locked_manually)
    println("Benchmarking $f")
    x = Ref(1)
    @btime $f($x[])
end

#+RESULTS:
: Benchmarking g
:   1.329 ns (0 allocations: 0 bytes)
: Benchmarking g_locked
:   72.267 ns (1 allocation: 16 bytes)
: Benchmarking g_locked_manually
:   38.689 ns (0 allocations: 0 bytes)

so we see that the overhead is a bit less than the overhead of the try block.

One way to get rid of most of this additional overhead is to use Refs so we’re doing mutation rather than redefinition inside the closure:

function g_locked_ref(x)
    Rx = Ref(x)
    lock(l) do     
        Rx[] = Rx[] + 1
    end
end

for f ∈ (g, g_locked, g_locked_manually, g_locked_ref)
    println("Benchmarking $f")
    x = Ref(1)
    @btime $f($x[])
end

#+RESULTS:
: Benchmarking g
:   1.329 ns (0 allocations: 0 bytes)
: Benchmarking g_locked
:   72.267 ns (1 allocation: 16 bytes)
: Benchmarking g_locked_manually
:   38.689 ns (0 allocations: 0 bytes)
: Benchmarking g_locked_ref
:   42.787 ns (1 allocation: 16 bytes)

So yeah, there could be some real value in your @lock macro for tight loops and such.

NHDaly · June 25, 2020, 9:34pm

Thanks @Mason. Indeed, yours is a better MWE.

In this case, I’d rather avoid tricks like using a Ref, since that’s difficult to educate users about, and would prefer a solution that is safe to use all the time, as the @lock macro is.

Are there any downsides I’m missing? Thanks for engaging with me on this!

NHDaly · June 25, 2020, 9:36pm

Adding the macro version to your examples:

julia> function g_locked_macro(x)
           @lock l begin
               x = x + 1
           end
       end
g_locked_macro (generic function with 1 method)

julia> for f ∈ (g, g_locked, g_locked_manually, g_locked_macro)
           println("Benchmarking $f")
           x = Ref(1)
           @btime $f($x[])
       end
Benchmarking g
  1.893 ns (0 allocations: 0 bytes)
Benchmarking g_locked
  91.522 ns (1 allocation: 16 bytes)
Benchmarking g_locked_manually
  53.697 ns (0 allocations: 0 bytes)
Benchmarking g_locked_macro
  57.073 ns (0 allocations: 0 bytes)

Mason · June 25, 2020, 9:51pm

Yeah, I think there’s definitely value in your macro. This seems like something that doesn’t really need to be a closure and a macro seems more appropriate, but I guess one advantage of the closure approach is multiple dispatch (no idea if dispatching on lock(::Function, ::MyLock) is ever done to do something other than what you show)

Just to spitball some ideas, I think a bit of a rougher, solution, though more generally extensible outside of simple locks, would be to make a macro to automatically do the Refing on specified variables:

using MacroTools: postwalk 
macro refd(args...)
    vars = args[1:end-1]
    d = Dict(var => gensym(var) for var in vars)
    ex = args[end]
    ex′ = postwalk(ex) do x
        if x ∈ keys(d)
            :($(d[x])[])
        else
            x
        end
    end
    refdefs = map(vars) do var
        :($(d[var]) = Ref($var))
    end
    esc(Expr(:block, refdefs..., ex′))
end

function g_locked_lock_macro(x)
    @lock l begin
        x = x + 1
    end
end

function g_locked_ref_macro(x)
    @refd x lock(l) do     
        x = x + 1
    end
end

for f ∈ (g, g_locked, g_locked_manually, g_locked_ref, g_locked_lock_macro, g_locked_ref_macro)
    println("Benchmarking $f")
    x = Ref(1)
    @btime $f($x[])
end

#+RESULTS:
Benchmarking g
  1.329 ns (0 allocations: 0 bytes)
Benchmarking g_locked
  70.358 ns (1 allocation: 16 bytes)
Benchmarking g_locked_manually
  37.126 ns (0 allocations: 0 bytes)
Benchmarking g_locked_ref
  43.262 ns (1 allocation: 16 bytes)
Benchmarking g_locked_lock_macro
  38.971 ns (0 allocations: 0 bytes)
Benchmarking g_locked_ref_macro
  42.948 ns (1 allocation: 16 bytes)

So there’s a bit of a tradeoff between our two macros, yours is slightly faster for this and doesn’t require that you declare which variables are being Refd, but I think mine can be extended as more a general solution to the performance problem with captured variables. That is, it isn’t specialized for lock.

One extension one might want for @refd would be to declare @refd x::Union{Int, Float64} and have it generate Ref{Union{Int, Float64}}(x) for instance if you think you might need to change the type of x inside the closure.

NHDaly · June 25, 2020, 9:56pm

Ooh, neat! I like that idea as something to discuss (separately?) as a generic workaround for pref problems with lambdas that capture.

But yeah, my main goal is that we should provide a dead-simple lock function that’s guaranteed to be safe and fast. I don’t want people to be screwing around trying to be clever to get performance when it might mean they forget to unlock a lock.

Mason · June 25, 2020, 9:59pm

Do you know of any cases out in the wild where people overload lock(f, l) for some lock type to do something other than what your macro does? That would be my only naive concern about the macro.

NHDaly · June 25, 2020, 10:05pm

Hmmm, yeah, interesting. Good question! Yeah cause the macro can’t dispatch… Good question.

Doesn’t look like it: JuliaHub

Everything there looks the same, from what i can tell.
For example:
https://github.com/invenia/LibPQ.jl/blob/0c821b8f868b20ffa0a5d293e942967370e1e16b/src/connections.jl#L334-L341

NHDaly · June 26, 2020, 2:45am

Wellllllllllllllllllll this is embarrasing.

This macro already exists … in Base! … and it’s exactly as we’ve written it above!

Haha sorry for all this drama. I don’t know how I missed that.
https://github.com/JuliaLang/julia/blob/6c9bc51b80a626a656aeca9758556495c1676c2b/base/lock.jl#L178-L188

Many thanks to @jameson for pointing this out.

BUT It seems like it’s still good we’re talking about this! Shouldn’t everyone always use this instead, like, all the time?

In fact it looks like almost no one uses it:
https://juliahub.com/ui/CodeSearch?q=lock&u=use&t=macro
Only CuArrays is using it right now.

Is there any reason everyone shouldn’t be using this?
Some thoughts and concerns:

It’s not currently exported, so not very discoverable:

julia> @lock ReentrantLock() 2
ERROR: LoadError: UndefVarError: @lock not defined
in expression starting at REPL[1]:1

It’s not mentioned in any of the docstrings for locks, so again not very discoverable
It doesn’t have its own docstring, so not very discoverable
It seems like this should be preferred over the existing lock(f, lock). Can we deprecate that in favor of this?

~I’ll open~ an issue to document and export the macro at least.

Okay i’ve opened an issue here :
https://github.com/JuliaLang/julia/issues/36441

Topic		Replies	Views
Immediately executing closure as function barrier General Usage closure	16	853	September 30, 2022
Can someone explain closures to me Internals & Design performance , type-stability , closure	13	3032	November 2, 2023
Passing a function as an argument via closure Performance question , package , performance , numerics , closure	9	1729	August 28, 2019
--track-allocation and @threads General Usage question , multithreading	10	765	August 1, 2022
Function definition inside let-block impacting performance? Performance question	42	1495	February 14, 2024

Poor performance of `lock(l) do` closures, caused by poor performance of capturing closures

# So, now my questions:

# Proposed Workaround: A `@lock` macro

Related topics