Different compilation results for global and "local" functions

wrgr · May 19, 2017, 8:50am

I’ve tried to define essentially the same closure in two different ways (Julia 0.5, macOS):

function define()
    result = 0
    global function compute(k::Int)
        s = 2
        for i in 1:1000
            s += k
        end
        result = s
        return
    end
    function local_compute(k::Int)
        s = 2
        for i in 1:1000
            s += k
        end
        result = s
        return
    end
    local_compute
end

const compute2 = define()

Timing compute(4) gave about 15 ns on average, while compute2(4) gave about 40 ns, so the globally defined function is more than twice as fast. @code_llvm also showed that the code generated for the two methods is different. I checked the methods:

methods(compute2)
# (::#local_compute#4)(k::Int64)
methods(compute)
# compute(k::Int64)

And seemingly there is no type uncertainty or instability.
Question: could someone explain the difference, and how to make local_compute compile and perform as well as its global counterpart? Thank you.

ScottPJones · May 19, 2017, 9:44am

@yuyichao should comment, the difference between the two versions seems to be the overhead of calling jl_get_ptls_states_fast.

wrgr · May 19, 2017, 4:12pm

Agreed, @yuyichao is always very helpful.

musm · May 19, 2017, 4:34pm

I think result needs to actually be a Ref
i.e. const result = Ref{Int}(0)

then they are the same

julia> function define()
           const result = Ref{Int}(0)
           global function compute(k::Int)
               s = 2
               for i in 1:1000
                   s += k
               end
               result[] = s
               return
           end
           function local_compute(k::Int)
               s = 2
               for i in 1:1000
                   s += k
               end
               result[] = s
               return
           end
           local_compute
       end
define (generic function with 1 method)

julia> const compute2 = define()
(::local_compute) (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime compute(20)
  3.947 ns (0 allocations: 0 bytes)

julia> @btime compute2(20)
  3.947 ns (0 allocations: 0 bytes)
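
For readers on newer Julia versions: as far as I know, `const` on a local variable is rejected from Julia 0.7 onward, so the definition above would need adjusting there. Below is a minimal sketch of the same idea without `const`, assuming Julia 1.x. The names `make_ref_closure`, `compute_ref` and `result_ref` are hypothetical and not from this thread. Because the binding `result` is never reassigned (only the contents of the `Ref` change), the closure should capture it with a concrete type.

```julia
# Sketch: capture a Ref instead of reassigning a captured variable.
function make_ref_closure()
    result = Ref{Int}(0)          # assigned exactly once, never rebound
    function local_compute(k::Int)
        s = 2
        for i in 1:1000
            s += k
        end
        result[] = s              # mutates the Ref's contents, not the binding
        return
    end
    return local_compute, result
end

compute_ref, result_ref = make_ref_closure()
compute_ref(4)
println(result_ref[])                              # value stored by the closure
println(fieldtype(typeof(compute_ref), :result))   # type of the captured field
```

Returning the `Ref` alongside the closure is only there so the stored value can be inspected from outside.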

yuyichao · May 19, 2017, 5:08pm

No, the ptls overhead is much smaller. It’s a closure lowering issue.
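
A minimal sketch of what the lowering issue looks like from the outside, assuming Julia 1.x (the thread itself used 0.5, where the printed names may differ). `make_boxed` and `boxed` are hypothetical names. A captured variable that is reassigned inside the closure is stored in a `Core.Box`, whose contents are untyped:

```julia
# Sketch: a captured variable that is reassigned inside the closure gets boxed.
function make_boxed()
    result = 0
    function local_compute(k::Int)
        result = 1000 * k + 2     # rebinding the captured variable
        return
    end
    return local_compute
end

boxed = make_boxed()
boxed(4)
println(fieldtype(typeof(boxed), :result))   # storage type chosen for the capture
println(fieldtype(Core.Box, :contents))      # declared type of the box's contents
println(boxed.result.contents)               # value written by the call above
```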

wrgr · May 20, 2017, 10:34am

@musm, thank you for your suggestion. Julia optimized heavily in this case (that’s why you got identical execution times), but looking at the LLVM IR, I would again say that the functions are not identical.
The global one:

define void @julia_compute_71902(i64) #0 {
top:
  %1 = mul i64 %0, 1000
  %2 = or i64 %1, 2
  store i64 %2, i64* inttoptr (i64 4659960896 to i64*), align 64
  ret void
}

The “local” one:

define void @julia_compute_71958(%jl_value_t*, i64) #0 {
top:
  %2 = mul i64 %1, 1000
  %3 = or i64 %2, 2
  %4 = bitcast %jl_value_t* %0 to i64**
  %5 = load i64*, i64** %4, align 8
  store i64 %3, i64* %5, align 16
  ret void
}

As one can see, the local one gets an additional argument of type jl_value_t*, which is bitcast and then used to store the final result. Is this the “closure lowering issue” you mentioned, @yuyichao? If so, how can I avoid it and make the second function do what the “normal” global one does?
Also, I’m not an expert here, but why does Julia add this type information when I clearly specify the type of the argument? It is as if the argument were still of type Any?
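
One way to picture the extra `%jl_value_t*` argument: as I understand it, Julia lowers a closure to an object whose fields are the captured variables, and that object is passed as the first argument of the call. The following is a hand-written approximation using a callable struct, assuming Julia 1.x. `LocalCompute` and `lc` are hypothetical names, and the compiler-generated type will differ in detail.

```julia
# Sketch: a hand-written callable struct that mimics a closure capturing `result`.
struct LocalCompute
    result::Base.RefValue{Int}    # the captured variable becomes a field
end

# The object itself (`self`) plays the role of the hidden first argument.
function (self::LocalCompute)(k::Int)
    s = 2
    for i in 1:1000
        s += k
    end
    self.result[] = s             # load the field from `self`, then store through it
    return
end

lc = LocalCompute(Ref(0))
lc(4)
println(lc.result[])
```

Seen this way, the extra argument is not type information about `k`; it is the closure object that carries `result`.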

yuyichao · May 20, 2017, 9:57pm

Do NOT use code_llvm as your first tool to identify performance issues. I really don’t know who started the tradition of using overkill tools and confusing themselves (certainly not you), but it has caused too many people to look at a low-level representation, try to guess what it means, and confuse more people in the process (and I expect you are one of the victims too).

In particular, do not use code_llvm if you have not used LLVM IR before, and do not suggest it to anyone unless you can understand it yourself or can at least identify the issue in a particular piece of IR. It’s much better than code_native (which is almost never useful for performance issues), but we still have much better tools in most cases.

In almost all cases, the right starting point is code_warntype. In this case,

julia> @code_warntype compute(1)
Variables:
  #self#::#compute
  k::Int64
  i::Int64
  #temp#::Int64
  s::Int64                                                                                          

Body:
  begin
      s::Int64 = 2 # line 5:
      SSAValue(3) = (Base.select_value)((Base.sle_int)(1, 1000)::Bool, 1000, (Base.sub_int)(1, 1)::Int64)::Int64
      #temp#::Int64 = 1
      5:
      unless (Base.not_int)((#temp#::Int64 === (Base.add_int)(SSAValue(3), 1)::Int64)::Bool)::Bool goto 14
      SSAValue(4) = #temp#::Int64
      SSAValue(5) = (Base.add_int)(#temp#::Int64, 1)::Int64
      #temp#::Int64 = SSAValue(5) # line 6:
      s::Int64 = (Base.add_int)(s::Int64, k::Int64)::Int64
      12:
      goto 5
      14:  # line 8:
      SSAValue(2) = s::Int64
      (Core.setfield!)(Core.Box(0), :contents, SSAValue(2))::Int64 # line 9:
      return
  end::Void

julia> @code_warntype compute2(1)
Variables:
  #self#::#local_compute#1
  k::Int64
  i::Int64
  #temp#::Int64
  s::Int64

Body:
  begin
      s::Int64 = 2 # line 13:
      SSAValue(3) = (Base.select_value)((Base.sle_int)(1, 1000)::Bool, 1000, (Base.sub_int)(1, 1)::Int64)::Int64
      #temp#::Int64 = 1
      5:
      unless (Base.not_int)((#temp#::Int64 === (Base.add_int)(SSAValue(3), 1)::Int64)::Bool)::Bool goto 14
      SSAValue(4) = #temp#::Int64
      SSAValue(5) = (Base.add_int)(#temp#::Int64, 1)::Int64
      #temp#::Int64 = SSAValue(5) # line 14:
      s::Int64 = (Base.add_int)(s::Int64, k::Int64)::Int64
      12:
      goto 5
      14:  # line 16:
      SSAValue(2) = s::Int64
      (Core.setfield!)((Core.getfield)(#self#::#local_compute#1, :result)::Any, :contents, SSAValue(2))::Int64 # line 17:
      return
  end::Void

I don’t expect all users to recognize that this is related to closure lowering, but it should be very clear what the difference in the IR is (the last expression), and the slower version also warns you about an ::Any type.
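
When the captured variable really does have to be reassigned, the Julia manual's performance tips suggest declaring its type so that reads from the box are at least type-asserted. A minimal sketch under that assumption, for Julia 1.x; `make_annotated`, `compute_annotated` and `get_result` are hypothetical names. The variable is most likely still boxed here, so this is expected to help inference of code that reads `result` rather than remove the indirection.

```julia
# Sketch: annotate the captured variable so every read and write is typed as Int.
function make_annotated()
    result::Int = 0               # declared type applies to all assignments
    function local_compute(k::Int)
        s = 2
        for i in 1:1000
            s += k
        end
        result = s                # converted to Int on assignment
        return
    end
    get_result() = result         # reads are asserted to be Int
    return local_compute, get_result
end

compute_annotated, get_result = make_annotated()
compute_annotated(4)
println(get_result())
```

Running `@code_warntype` on the returned functions is a reasonable way to check whether the `::Any` warning is gone on a given Julia version.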

yurivish · May 20, 2017, 10:42pm

Out of curiosity, is there enough information preserved in the transformation from surface syntax to the lowered + type-inferred AST to support the building of tools that show information from code_warntype in the context of the code as it was typed in by the user?

yuyichao · May 20, 2017, 10:54pm

That’s what the # line ...: comments are for, although they are currently shown in a pretty confusing way. In the compute output above, the code between # line 6: and # line 8: comes from an expression starting on line 6.

wrgr · May 21, 2017, 12:57pm

Thank you for the @code_warntype suggestion, an awesome tool!
