For loop with variable range so slow

Recently I found that my Julia program takes 3.7 s per iteration while the equivalent C++ takes only 0.022 s. After some testing, a for loop with a variable range seems to be what makes the program slow.

using SQLite, DataFrames, Statistics, Serialization, Dates, Printf



n_T = 100
for i = 1:2
    @time begin
        n = 1
        for j = 1:100
            n += j
        end
    end
end

Result:

  0.000000 seconds
  0.000000 seconds

whereas for:

using SQLite, DataFrames, Statistics, Serialization, Dates, Printf



n_T = 100
for i = 1:2
    @time begin
        n = 1
        for j = 1:n_T
            n += j
        end
    end
end

Result:

  0.000059 seconds (180 allocations: 4.688 KiB)
  0.000012 seconds (170 allocations: 4.234 KiB)

The compiler seems not smart enough to eliminate such a simple loop.

Classic case of non-constant global variables being slow.

Put n_T in a local scope — and it’s fast again:

julia> let
       n_T = 100
       for i = 1:2
           @time begin
               n = 1
               for j = 1:n_T
                   n += j
               end
           end
       end
       end
  0.000000 seconds
  0.000000 seconds

Or make n_T constant:

julia> const n_T = 100;

julia> for i = 1:2
           @time begin
               n = 1
               for j = 1:n_T
                   n += j
               end
           end
       end
  0.000000 seconds
  0.000000 seconds

In general, accessing global variables may require allocations, making the code slower.
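To see this concretely, here is a minimal sketch (the names `sum_untyped`/`sum_const` are made up for illustration) using `@allocated` to compare an untyped global with a `const` one:

```julia
n_untyped = 100       # untyped global: its type may change at any time
const n_const = 100   # const global: value and type are fixed

function sum_untyped()
    n = 1
    for j = 1:n_untyped   # every access is dynamically typed
        n += j
    end
    return n
end

function sum_const()
    n = 1
    for j = 1:n_const     # type is known at compile time
        n += j
    end
    return n
end

sum_untyped(); sum_const()       # warm up so compilation isn't measured
@show @allocated sum_untyped()   # nonzero: intermediate Ints get boxed
@show @allocated sum_const()     # 0
```

After warm-up, the `const` version reports zero allocations, while the untyped version allocates because the boxed intermediate results go on the heap.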

5 Likes

Why didn’t the compiler optimize it? I thought this was a very basic optimization in C/C++.

Good question.

In C/C++, every program starts from main(). To make the comparison fair, you should also put your Julia code inside a function. Then you will get the expected performance.

function main()
    n_T = 100
    for i = 1:2
        @time begin
            n = 1
            for j = 1:n_T
                n += j
            end
        end
    end
end

main()

So there is more than one way to fix this:

  1. As ForceBru mentioned, use let to introduce a local scope.
  2. As ForceBru also mentioned, mark the global variable as const. This is a recommended approach.[1]
  3. Put performance-critical code inside a function. This is also a recommended approach.[2]

[1] https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-untyped-global-variables
[2] https://docs.julialang.org/en/v1/manual/performance-tips/#Performance-critical-code-should-be-inside-a-function

9 Likes

Yet another option is to make the global variable typed:

julia> n_T::Int = 100 # assign type!
100

julia> for i = 1:2
                  @time begin
                      n = 1
                      for j = 1:n_T
                          n += j
                      end
                  end
              end
  0.000000 seconds
  0.000000 seconds

You can subsequently assign another integer to n_T and it’ll remain fast.
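For example, here is a small sketch of how a typed global behaves in a fresh session (Julia ≥ 1.8 assumed): assignments are converted to the declared type, and values that can’t be converted are rejected:

```julia
n_T::Int = 100     # typed global (Julia ≥ 1.8)
n_T = 200          # fine: an Int
n_T = 3.0          # also fine: converted via convert(Int, 3.0)
@show n_T          # n_T = 3

try
    global n_T = 2.5   # convert(Int, 2.5) fails
catch err
    @show typeof(err)  # InexactError
end
```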

As discussed here (Why is there a performance difference between global and local variables in non-interactive Julia programs?):

A global variable might have its value, and therefore its type, change at any point. This makes it difficult for the compiler to optimize code using global variables. Variables should be local, or passed as arguments to functions, whenever possible.

However, if you fix the type (like n_T::Int = 100), this is no longer a problem. This was introduced in Julia 1.8 (Types · The Julia Language):

As of Julia 1.8, type declarations can now be used in global scope i.e. type annotations can be added to global variables to make accessing them type stable.

4 Likes

In general, things that must run fast should be put inside functions. The unit of compilation in Julia is the function. Code written directly at the Julia prompt may not be compiled properly, and certainly not if it uses non-const globals.

function sumit(n_t)
    n = 1
    for j = 1:n_t
        n += j
    end
    return n
end
julia> @time sumit(100)
  0.000000 seconds
5051

Indeed, the loop has been optimized away; only a multiplication and a shift are left, i.e. just
1 + n_t*(n_t+1)/2:

julia> @code_native sumit(100)
...
6 Likes

The compiler indeed failed to optimize the code because the global variable was not typed.

Full technical details:

I included the suggestion of @karei and looked into the runtime behavior, this is what I noticed:

Full Example code:


function one()
	for i = 1:2
		n = 1
		for j = 1:100
			n += j
		end
	end
end

function two()
	n_T_local = 100
	for i = 1:2
		n = 1
		for j = 1:n_T_local
			n += j
		end
	end
end

n_T = 100
function three()
	for i = 1:2
		n = 1
		for j = 1:n_T
			n += j
		end
	end
end


@cgprofile one()
@cgprofile two()
@cgprofile three()

It reported that functions one() and two() were fully optimized and inlined by the compiler, with no allocations and no calls to other methods.

However, function three() reported the following:

# Reconstructed runtime executed code, showing methods and allocations:
function three()
	for i = 1:2
		n = 1
		# Allocation of 2 UnitRange{Int64} for 32 bytes each iteration of i, total 64 bytes.
		n_T_iter::UnitRange{Int64} = n_T
		j_next::Union{Nothing, Tuple{Int64, Int64}} = (dynamic dispatch call) Main.Base.iterate(n_T_iter::UnitRange{Int64}) in range.jl:917
		while j_next !== nothing #  Iterated 100 times each iteration of i, 200 iterations in total.
			# Allocation of 200 Tuple{Int64,Int64} for 32 bytes each iteration of j_next, total 6400 bytes
    		(item, state)::Tuple{Int64,Int64} = j_next
			n = (dynamic dispatch call) Main.Base.+(n::Int64, item::Int64)::Int64 in int.jl:87
			# Allocation of 138 Int64 for 16 bytes each iteration after iteration 31 of j_next, total of 2208 bytes
			?::Int64;
			j_next = (dynamic dispatch call) Main.Base.iterate(n_T_iter::UnitRange{Int64}, item::Int64)::Union{Nothing, Tuple{Int64, Int64}} in range.jl:919
		end
	end
end

This shows that it was unable to determine n_T at compile time, thus having to insert dynamic dispatch calls that needed to be resolved during runtime.
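The same instability can be seen without a profiler. Here is a minimal sketch (the names are made up for illustration) using `Base.return_types`, an internal but commonly used introspection tool: inference cannot narrow a function that reads an untyped global, whereas a const global infers cleanly:

```julia
n_T_demo = 100        # untyped global
const n_T_fix = 100   # const global

reads_untyped() = n_T_demo + 1
reads_const()   = n_T_fix + 1

@show Base.return_types(reads_untyped, ())  # typically Any: cannot be narrowed
@show Base.return_types(reads_const, ())    # [Int64] on 64-bit
```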

I also ran @ForceBru’s and @sgaure’s suggestions:


n_T_typed::Int = 100
function four()
	for i = 1:2
		n = 1
		for j = 1:n_T_typed
			n += j
		end
	end
end


function sumit(n_t)
    n = 1
    for j = 1:n_t
        n += j
    end
    return n
end

@cgprofile four()
@cgprofile sumit(100)

Functions four() and sumit() were also fully optimized and inlined by the compiler, with no allocations and no calls to other methods.

3 Likes

Just to explain what happens when an untyped global is used (a const is always typed). The loop for j in 1:n_t contains the range 1:n_t. If the compiler is not sure about the type of n_t (which it can’t be with an untyped global n_t), then 1:n_t will create a UnitRange{Any}(1, n_t), i.e. a range which iterates over values the compiler knows nothing about (Any), or just a dynamic lookup call for :.

The n += j is rewritten as n = n + j, but the compiler knows nothing about the type of j. It has to generate code for looking up which + to use for every iteration in the loop. Since the compiler doesn’t know which + to use, it doesn’t know the type of the result either, so it knows nothing about n and nothing about j inside the loop. Such lookup takes a lot of time.

On the other hand, if the compiler knows the type of n_t as an Int, it will create the range UnitRange{Int}(1, n_t), which iterates over Ints. It knows n starts out as an Int, the n + j is compiled to a single instruction, and it knows the result is an Int, so n will continue to be an Int. Moreover, the compiler is then actually able to recognize this sum and replace it with 1 + (n_t*(n_t+1))>>1 or similar, optimizing the entire loop away.
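That closed form can be checked against the loop directly (a small sketch reusing sumit from above; the name closed is made up):

```julia
function sumit(n_t)
    n = 1
    for j = 1:n_t
        n += j
    end
    return n
end

# The arithmetic-series formula the compiler derives.
# Note the parentheses: in Julia, >> binds tighter than *,
# so n_t*(n_t+1)>>1 would parse as n_t*((n_t+1)>>1).
closed(n_t) = 1 + ((n_t * (n_t + 1)) >> 1)

@assert all(sumit(k) == closed(k) for k in 0:1000)
@show sumit(100)   # 5051
```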

2 Likes

If you are using a 64-bit Julia, those are the same thing.

julia> Int64 === Int
true
2 Likes

Oh haha thanks, somehow never noticed that before :slight_smile:

x should be large enough to show a difference:

using BenchmarkTools

function loop_sum(x::Int)
    n = 1
    for j = 1:x
        n += j
    end
    return n
end

function comp_sum(x::Int)
    return 1 + sum(j for j = 1:x)
end

@btime loop_sum(1000000)
@btime comp_sum(1000000)

1.666 ns (0 allocations: 0 bytes)
  0.916 ns (0 allocations: 0 bytes)

Comprehensions are to be preferred over for loops.

Nope.

  1. You don’t have a comprehension, you have a generator. (A comprehension allocates an array, whereas a generator is “lazy”.)
  2. sum is implemented by loops written in Julia too … for loops, written properly, are not slow!
  3. Your benchmark is broken because the compiler is optimizing away the whole loop, so you are basically measuring noise.

In particular, you’ll notice that @btime loop_sum(10) reports exactly the same time as @btime loop_sum(10^6), which should give you a clue: the compiler is smart enough to realize that this sum is an arithmetic series and just replaces the whole loop with the arithmetic-series formula.

A more interesting benchmark, which at least doesn’t get optimized away to nothing, is:

function loop_sum2(n)
    s = 0.0
    for j = 1:n
        s += 1/j^2
    end
    return s
end

comp_sum2(n) = sum(1/j^2 for j = 1:n)

which gives (on my M1 max):

julia> @btime loop_sum2(10000)
  9.750 ÎĽs (0 allocations: 0 bytes)
1.6448340718480652

julia> @btime comp_sum2(10000)
  9.750 ÎĽs (0 allocations: 0 bytes)
1.6448340718480652

However, you can make loop_sum2 faster by changing for to @simd for (which gives the compiler permission to change the answer slightly by re-ordering the sum to use SIMD), and this gives:

julia> @btime loop_sum2(10000)
  3.865 ÎĽs (0 allocations: 0 bytes)
1.6448340718480614

If you use the functional form of sum, it is able to insert @simd for you (because this form, like reduce, is documented to allow re-ordering of the summands), and gets almost the same speed, with a slight overhead because it uses pairwise summation to get a more accurate sum:

func_sum2(n) = sum(j -> 1/j^2, 1:n)

gives

julia> @btime func_sum2(10000)
  3.979 ÎĽs (0 allocations: 0 bytes)
1.6448340718480614
10 Likes

Thanks for bringing me up short. Now that I’m more @btime-aware, it will be a help:

using BenchmarkTools

function loop_sum(x::Int)
    n = 1
    for j = 1:x
        n += j
    end
    return n
end

function comp_sum(x::Int)
    return 1 + sum(j for j = 1:x)
end

println("loop_sum")
@btime loop_sum($1_000_000)
println("comp_sum")
@btime comp_sum($1_000_000)
function loop_sum2(n::Float64)

    s = 0.0
    for j = 1:n
        s += 1 / j^2
    end
    return s
end
comp_sum2(n::Float64) = sum(1 / j^2 for j = 1:n)

println("comp_sum2")
@btime comp_sum2($1e6)
println("loop_sum2")
@btime loop_sum2($1e6)

loop_sum
  1.666 ns (0 allocations: 0 bytes)
comp_sum
  1.708 ns (0 allocations: 0 bytes)
comp_sum2
  1.089 ms (0 allocations: 0 bytes)
loop_sum2
  1.089 ms (0 allocations: 0 bytes)
1.64493306684877

And generators! I had totally forgotten about that pattern.

Did you mistype ms as ns?

It’s still optimizing away the loop, so this benchmark is meaningless.

Using $ interpolation in @btime is only useful for non-constant globals. It does nothing for numeric literals. And it has no effect on the compiler’s ability to replace the loop with an arithmetic-series formula.

Note also that declaring the argument types has no effect on performance here. You might as well leave them out (as in my post). This is a common point of confusion — see “argument-type declarations” in the manual.

And using a floating point n is allowed but not typical for loops over 1:n.
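A quick sketch of that last point: a float endpoint changes the range’s element type, so the loop iterates over Float64s rather than Ints:

```julia
@show typeof(1:1_000_000)   # UnitRange{Int64} on 64-bit
@show eltype(1:1e6)         # Float64: the loop now iterates over floats
@show last(1:1e6)           # 1.0e6, not the Int 1000000
```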

2 Likes

I slightly modified my original code and found it costs 0.028 s per iteration. Indeed on par with C++. Impressive. Now I’m going to use Julia as my main daily programming language.

3 Likes

Since it hasn’t been mentioned yet, the manual’s performance section (Performance Tips · The Julia Language) is a good place to look when performance is lower than expected.

2 Likes

Thank you, it’s useful. I found my problem is covered in the very first section of that doc.