I’ve tried to define essentially the same closure in two different ways (Julia 0.5, macOS):
```julia
function define()
    result = 0
    global function compute(k::Int)
        s = 2
        for i in 1:1000
            s += k
        end
        result = s
        return
    end
    function local_compute(k::Int)
        s = 2
        for i in 1:1000
            s += k
        end
        result = s
        return
    end
    local_compute
end

const compute2 = define()
```
Timing `compute(4)` yielded 15 ns on average, and timing `compute2(4)` yielded 40 ns, so the global function definition is at least twice as fast. `@code_llvm` also showed that the code generated for the two methods is different. I checked the methods:
Seemingly there is no type uncertainty or instability.
Question: could someone explain the difference, and how to make `local_compute` compile and perform as well as its global counterpart? Thank you.
I think `result` actually needs to be a `Ref`, i.e. `const result = Ref{Int}(0)`; then they are the same:
```julia
julia> function define()
           const result = Ref{Int}(0)
           global function compute(k::Int)
               s = 2
               for i in 1:1000
                   s += k
               end
               result[] = s
               return
           end
           function local_compute(k::Int)
               s = 2
               for i in 1:1000
                   s += k
               end
               result[] = s
               return
           end
           local_compute
       end
define (generic function with 1 method)

julia> const compute2 = define()
(::local_compute) (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime compute(20)
  3.947 ns (0 allocations: 0 bytes)

julia> @btime compute2(20)
  3.947 ns (0 allocations: 0 bytes)
```
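Why the `Ref` helps can also be checked directly. The sketch below (hypothetical names, written for a recent Julia version, not 0.5) contrasts the two capture patterns: a variable that is *rebound* inside a closure is stored in a `Core.Box`, so its type is lost, while a `Ref` that is only mutated (never rebound) stays concretely typed in the closure's field.

```julia
# Captured variable is rebound inside the closure -> boxed field.
function make_boxed()
    result = 0
    inner(k::Int) = (result = k; nothing)   # rebinds `result`
    inner
end

# Captured Ref is mutated but never rebound -> concretely typed field.
function make_ref()
    result = Ref{Int}(0)
    inner(k::Int) = (result[] = k; nothing) # mutates the Ref's contents
    inner
end

fieldtype(typeof(make_boxed()), :result)  # Core.Box in current Julia
fieldtype(typeof(make_ref()), :result)    # Base.RefValue{Int}
```

The boxed field is exactly the `::Any` access that makes the slow version slow; the `Ref` version gives the compiler a concrete type to work with.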
@musm, thank you for your suggestion. Julia performed heavy optimization in this case (that’s why you got identical execution times), but looking at the LLVM IR, I would again say that the functions are not identical.
The global one:
```llvm
define void @julia_compute_71902(i64) #0 {
top:
  %1 = mul i64 %0, 1000
  %2 = or i64 %1, 2
  store i64 %2, i64* inttoptr (i64 4659960896 to i64*), align 64
  ret void
}
```
The “local” one:
```llvm
define void @julia_compute_71958(%jl_value_t*, i64) #0 {
top:
  %2 = mul i64 %1, 1000
  %3 = or i64 %2, 2
  %4 = bitcast %jl_value_t* %0 to i64**
  %5 = load i64*, i64** %4, align 8
  store i64 %3, i64* %5, align 16
  ret void
}
```
As one can see, the local one takes an additional `%jl_value_t*` argument, which is bitcast to obtain the pointer for the final store. Is this the “closure lowering issue” you mentioned, @yuyichao? If so, how do I avoid it and make the second function do what the “normal”, global one does?
Also, I’m not an expert here, but why does Julia add this extra argument when I clearly specify the type of the argument? It is as if the argument were still of type `Any`.
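For what it’s worth, the extra `%jl_value_t*` argument is not about the `Int` argument at all: a local closure is lowered to a callable struct whose fields carry the captured variables, and that struct is passed as a hidden first argument. A self-contained sketch (hypothetical names, recent Julia) mirroring the thread’s `define`:

```julia
# The returned closure is a callable struct; the captured Ref
# lives in one of its fields rather than in a global.
function define_sketch()
    result = Ref{Int}(0)
    local_compute(k::Int) = (result[] = 1000k + 2; nothing)
    local_compute
end

f = define_sketch()
f(4)
fieldnames(typeof(f))   # the captured binding appears as a field
f.result[]              # the struct, not a global, holds the state
```

The global `compute`, by contrast, closes over a `const` global whose address is baked into the generated code, which is why its IR stores straight to a constant address.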
Do NOT use `code_llvm` as your first tool to identify performance issues. I really don’t know who started the tradition of using overkill tools to confuse oneself (certainly not you), but it has caused too many people to look at a low-level representation, try to guess what it means, and confuse more people in the process (and I expect you are one of the victims too).
In particular, do not use `code_llvm` if you have not used LLVM IR before, and do not suggest it to anyone unless you can understand it yourself, or at least identify the issue in a particular IR. It’s much better than `code_native` (which is almost never useful for performance issues), but we still have much better tools in most cases.
In almost all cases, the right starting point is `code_warntype`. In this case,
I don’t expect all users to identify that this is related to closure lowering, but it should be very clear what the difference is in the IR (the last expression), and the slower version also warns you about an `::Any` type.
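A sketch of that workflow (hypothetical names, on a recent Julia): reproduce the boxed closure and let `@code_warntype` point at the problem directly, instead of reading LLVM IR.

```julia
using InteractiveUtils   # provides @code_warntype outside the REPL

function define_boxed()
    result = 0
    compute(k::Int) = (result = 1000k + 2; nothing)   # rebinds `result`
    compute
end

f = define_boxed()
# Prints the type-annotated IR; the captured `result` shows up as a
# Core.Box / ::Any access, flagging the closure-lowering issue.
@code_warntype f(4)
```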
Out of curiosity: is there enough information preserved in the transformation from surface syntax to the lowered, type-inferred AST to support building tools that show information from `code_warntype` in the context of the code as the user typed it?
That’s what the `# line ...:` annotations are for, though they are currently shown in a pretty confusing way. In the piece below, the code between `# line 6:` and `# line 8:` comes from an expression starting on line 6.
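For illustration (the exact rendering has changed across versions; the 0.5-era `# line 6:` comments later became `@ file:line` markers), line information is attached to expressions at parse time and survives lowering, which is what source-mapping tools can build on:

```julia
# A quote block records a LineNumberNode before each statement.
ex = quote
    x = 1
    y = x + 2
end

# Lowering keeps that information interleaved with the lowered
# statements, so results can be mapped back to source lines.
Meta.lower(Main, ex)
```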