For loop with variable range so slow

Recently I found that my Julia program takes 3.7 s per iteration while the equivalent C++ takes only 0.022 s. After some testing, it seems that a for loop with a variable range makes the program slow.

using SQLite, DataFrames, Statistics, Serialization, Dates, Printf

n_T = 100
for i = 1:2
    @time begin
        n = 1
        for j = 1:100
            n += j
        end
    end
end

Result:

  0.000000 seconds
  0.000000 seconds

while with a variable range:

using SQLite, DataFrames, Statistics, Serialization, Dates, Printf

n_T = 100
for i = 1:2
    @time begin
        n = 1
        for j = 1:n_T
            n += j
        end
    end
end

Result:

  0.000059 seconds (180 allocations: 4.688 KiB)
  0.000012 seconds (170 allocations: 4.234 KiB)

The compiler seems not smart enough to eliminate such a simple loop.

Classic case of non-constant global variables being slow.

Put n_T in a local scope, and it's fast again:

julia> let
       n_T = 100
       for i = 1:2
           @time begin
               n = 1
               for j = 1:n_T
                   n += j
               end
           end
       end
       end
  0.000000 seconds
  0.000000 seconds

Or make n_T constant:

julia> const n_T = 100;

julia> for i = 1:2
           @time begin
               n = 1
               for j = 1:n_T
                   n += j
               end
           end
       end
  0.000000 seconds
  0.000000 seconds

In general, accessing global variables may require allocations, making the code slower.
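To illustrate this (a minimal sketch; `untyped_n`, `const_n`, and the two functions are illustrative names, not from the thread), `@allocated` shows the difference once both functions have been compiled:

```julia
# Compare an untyped global with a const global. Reading the untyped one
# forces boxed values, hence allocations inside the loop.
untyped_n = 100
const const_n = 100

function sum_untyped()
    n = 1
    for j = 1:untyped_n   # type of untyped_n is unknown to the compiler
        n += j
    end
    return n
end

function sum_const()
    n = 1
    for j = 1:const_n     # const_n has a known, fixed type
        n += j
    end
    return n
end

sum_untyped(); sum_const()       # warm up so compilation is not measured
@show @allocated(sum_untyped())  # nonzero: boxed iteration state
@show @allocated(sum_const())    # 0: the loop compiles away entirely
```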

Why didn’t the compiler optimize it? I thought this was a very basic optimization in C/C++.

Good question.

In C/C++, every program starts from main(). To make the comparison fair, you should also put your Julia code inside a function. Then you will get the expected performance.

function main()
    n_T = 100
    for i = 1:2
        @time begin
            n = 1
            for j = 1:n_T
                n += j
            end
        end
    end
end

main()

So there is more than one way to fix this:

  1. As ForceBru mentioned, use let to introduce a local scope.
  2. As ForceBru also mentioned, mark the global variable as const. This is a recommended approach.[1]
  3. Put performance-critical code inside a function. This is also a recommended approach.[2]

[1] https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-untyped-global-variables
[2] https://docs.julialang.org/en/v1/manual/performance-tips/#Performance-critical-code-should-be-inside-a-function

Yet another option is to make the global variable typed:

julia> n_T::Int = 100 # assign type!
100

julia> for i = 1:2
           @time begin
               n = 1
               for j = 1:n_T
                   n += j
               end
           end
       end
  0.000000 seconds
  0.000000 seconds

You can subsequently assign another integer to n_T and it’ll remain fast.
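For example (a sketch for Julia 1.8+, using the illustrative name m_T so it doesn’t clash with a const n_T from an earlier session), reassignment keeps working as long as the value can be converted to Int:

```julia
# A typed global converts on assignment and rejects incompatible values.
m_T::Int = 100

m_T = 200              # fine: still an Int
m_T = 3.0              # converted via convert(Int, 3.0)
@assert m_T === 3

try
    global m_T = "hello"   # convert(Int, "hello") has no method: throws
catch err
    @assert err isa MethodError
end
@assert m_T === 3          # unchanged by the failed assignment
```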

As discussed here (Why is there a performance difference between global and local variables in non-interactive Julia programs?):

A global variable might have its value, and therefore its type, change at any point. This makes it difficult for the compiler to optimize code using global variables. Variables should be local, or passed as arguments to functions, whenever possible.

However, if you fix the type (like n_T::Int = 100), this is no longer a problem. This was introduced in Julia 1.8 (Types · The Julia Language):

As of Julia 1.8, type declarations can now be used in global scope i.e. type annotations can be added to global variables to make accessing them type stable.

In general, things that must run fast should be put inside functions. The unit of compilation in Julia is the function. Things typed directly at the julia prompt may not be compiled properly, and certainly not if they use non-const globals.

function sumit(n_t)
    n = 1
    for j = 1:n_t
        n += j
    end
    return n
end

julia> @time sumit(100)
  0.000000 seconds
5051

Indeed, the loop has been optimized away; just a multiplication and a shift are left, i.e. just
1 + n_t*(n_t+1)/2:

julia> @code_native sumit(100)
...
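As a quick sanity check (a sketch; sumit_loop and sumit_closed are illustrative names), the closed form agrees with the loop for a few inputs:

```julia
# The loop as written in the thread.
function sumit_loop(n_t)
    n = 1
    for j = 1:n_t
        n += j
    end
    return n
end

# The closed form: note the parentheses, since >> binds tighter than * in Julia.
sumit_closed(n_t) = 1 + (n_t * (n_t + 1)) >> 1

for n_t in (0, 1, 10, 100, 12345)
    @assert sumit_loop(n_t) == sumit_closed(n_t)
end
```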

The compiler indeed failed to optimize the code because the global variable was not typed.

Full technical details:

I included the suggestion of @karei and looked into the runtime behavior using CodeGlass; this is what I noticed:

Full Example code:

using CodeGlass

function one()
	for i = 1:2
		n = 1
		for j = 1:100
			n += j
		end
	end
end

function two()
	n_T_local = 100
	for i = 1:2
		n = 1
		for j = 1:n_T_local
			n += j
		end
	end
end

n_T = 100
function three()
	for i = 1:2
		n = 1
		for j = 1:n_T
			n += j
		end
	end
end


@cgprofile one()
@cgprofile two()
@cgprofile three()

CodeGlass reported that functions one() and two() were fully optimized and inlined by the compiler, with no allocations or calls to other methods.

However, function three() reported the following:

# Reconstructed runtime executed code, showing methods and allocations:
function three()
	for i = 1:2
		n = 1
		# Allocation of 2 UnitRange{Int64} for 32 bytes each iteration of i, total 64 bytes.
		n_T_iter::UnitRange{Int64} = n_T
		j_next::Union{Nothing, Tuple{Int64, Int64}} = (dynamic dispatch call) Main.Base.iterate(n_T_iter::UnitRange{Int64}) in range.jl:917
		while j_next !== nothing #  Iterated 100 times each iteration of i, 200 iterations in total.
			# Allocation of 200 Tuple{Int64,Int64} for 32 bytes each iteration of j_next, total 6400 bytes
    		(item, state)::Tuple{Int64,Int64} = j_next
			n = (dynamic dispatch call) Main.Base.+(n::Int64, item::Int64)::Int64 in int.jl:87
			# Allocation of 138 Int64 for 16 bytes each iteration after iteration 31 of j_next, total of 2208 bytes
			?::Int64;
			j_next = (dynamic dispatch call) Main.Base.iterate(n_T_iter::UnitRange{Int64}, item::Int64)::Union{Nothing, Tuple{Int64, Int64}} in range.jl:919
		end
	end
end

This clearly shows that the compiler was unable to determine the type of n_T at compile time, and thus had to insert dynamic dispatch calls that are resolved at runtime.

I also ran @ForceBru’s and @sgaure’s suggestions:


n_T_typed::Int = 100
function four()
	for i = 1:2
		n = 1
		for j = 1:n_T_typed
			n += j
		end
	end
end


function sumit(n_t)
    n = 1
    for j = 1:n_t
        n += j
    end
    return n
end

@cgprofile four()
@cgprofile sumit(100)

Functions four() and sumit() were also fully optimized and inlined by the compiler, with no allocations and no calls to other methods.

Small note on sumit(n_t): the compiler specialized on Int64, which is expected, since on 64-bit platforms Int is simply an alias for Int64:

sumit(n_t::Int64)::Int64

Just to explain what happens when an untyped global is used (a const is always typed). The loop for j in 1:n_t contains the range 1:n_t. If the compiler is not sure about the type of n_t (which it can’t be with an untyped global n_t), it cannot infer the type of 1:n_t either, so the range and the values it iterates over have to be treated as Any, i.e. values the compiler knows nothing about.

The n += j is rewritten as n = n + j, but the compiler knows nothing about the type of j. It has to generate code for looking up which + to use for every iteration in the loop. Since the compiler doesn’t know which + to use, it doesn’t know the type of the result either, so it knows nothing about n and nothing about j inside the loop. Such lookup takes a lot of time.

On the other hand, if the compiler knows the type of n_t as an Int, it will create the range UnitRange{Int}(1, n_t), which iterates over Ints. It knows n starts out as an Int, the n + j is compiled to a single instruction, and it knows the result is an Int, so n will continue to be an Int. Moreover, the compiler is then actually able to recognize this sum, and replace it with 1 + (n_t*(n_t+1))>>1 or similar, optimizing the entire loop away.
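This effect can be sketched without globals by hiding the value behind a Ref{Any} (sum_over, any_ref, and int_ref are illustrative names). The same generic function specializes for each Ref type, so only the element type differs:

```julia
function sum_over(r)        # specializes on the concrete type of r
    n = 1
    for j = 1:r[]
        n += j
    end
    return n
end

any_ref = Ref{Any}(100)     # element type hidden from the compiler
int_ref = Ref(100)          # Ref{Int}: element type known

sum_over(any_ref); sum_over(int_ref)  # warm up (compile both specializations)
@show @allocated(sum_over(any_ref))   # nonzero: boxed values, dynamic dispatch
@show @allocated(sum_over(int_ref))   # 0: tight machine-code loop
```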
