The big difference in speed of remainder calculations

I found a curious behavior in Julia. When I ran the codes below, big difference in speed was observed.

Code 1

function main()
    a = 0
    p = 2^5
    N = 100000000
    for i in 1:N
        a %= p
    end
    println(a)
end

@time main()

Result 1

0
  0.839353 seconds (15.62 k allocations: 751.056 KiB)

Code 2

function main()
    a = 0
    p = 2^4
    N = 100000000
    for i in 1:N
        a %= p
    end
    println(a)
end

@time main()

Result 2

0
  0.020099 seconds (15.62 k allocations: 751.056 KiB)

At first I supposed that it was because 2^5 is bigger than 2^4, but with several trials, I realized that the magnitude of numbers had nothing to do with it and the power index was important. I summarized the result of my trials in the figures below. Specifically, if the power index is five or six, the calculation takes much time whatever the base number is.

Then, why does this phenomenon occur? I suppose that it is important whether β€˜a’ is interpreted as a constant number or not because remainder calculations are performed faster when the divisor is constant, but I have no idea except for it. My Julia version is 1.4.0.

I suspect this is just measuring the time of pre-compilation. One should use https://github.com/JuliaCI/BenchmarkTools.jl to benchmark their programs.

Upon doing so, I found the difference vanish. Indeed, the loop is not being ran in either case when inspecting with @code_native. It is true that the first version generates more code as 2^5 doesn’t become a constant of 32 but 2^4 does get inferred to be 16, but since either way the loop isn’t being ran, they both take nanoseconds.

5 Likes

Thank you for your fast reply! I thought that your hypothesis that it was only the difference of the time of pre-compilation looked so likely, but when I measured the time again with @benchmark of BenchmarkTools.jl, the difference did not disappear. My usage of BenchmarkTools may be wrong?

2^5

BenchmarkTools.Trial:
  memory estimate:  176 bytes
  allocs estimate:  6
  --------------
  minimum time:     628.993 ms (0.00% GC)
  median time:      657.863 ms (0.00% GC)
  mean time:        654.278 ms (0.00% GC)
  maximum time:     688.478 ms (0.00% GC)
  --------------
  samples:          8
  evals/sample:     1

2^4

BenchmarkTools.Trial:
  memory estimate:  176 bytes
  allocs estimate:  6
  --------------
  minimum time:     83.401 ΞΌs (0.00% GC)
  median time:      91.750 ΞΌs (0.00% GC)
  mean time:        195.334 ΞΌs (0.00% GC)
  maximum time:     21.021 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Also, when I changed N, there seemed to be a linear relationship between N and the time.

In any case, I will check it with @code_native next.

Oh I seem to have made a mistake. It seems that I was testing without the println so it led it to optimize away the loop. If keeping the println it seems like it is failing to realise that p is a constant and allows it to optimize away the loop. I believe changing p = 32 will show that it is back to a small amount of time.

1 Like

What if you return a rather than print it?

2 Likes

If you want to benchmark remainder calculations you should do just that, and not mix in printing, which will tend to dominate the runtime.

Here’s an example comparing rem with 2^5 and 2^4:

julia> @benchmark a % p setup=(a=rand(1:100_000_000); p=2^5)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.695 ns (0.00% GC)
  median time:      4.761 ns (0.00% GC)
  mean time:        4.959 ns (0.00% GC)
  maximum time:     24.656 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark a % p setup=(a=rand(1:100_000_000); p=2^4)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.612 ns (0.00% GC)
  median time:      4.758 ns (0.00% GC)
  mean time:        5.009 ns (0.00% GC)
  maximum time:     46.481 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
2 Likes

Oh, without println(a), the loop seems not to be ran. I tried it and got the same result too. a is not recognized as a constant number when the power index is 5 or 6… It is interesting.

I tried it. Almost the same result as that of println(a) was observed. With the return code, the loop seems to be ran.

Thank you! I think your code is better than mine to evaluate the time of remainder calculations genuinely.
But also I’m interested in why Julia fails to interpret a as a constant number or does something special when the power index is 5 or 6 now.

Yes, I’m not sure why this happens, but the compiler somehow is not able to run the exponentiation that far for constant propagation. Take a look at the llvm code:

julia> foo() = 2^4
foo (generic function with 3 methods)

julia> @code_llvm foo()

;  @ REPL[169]:1 within `foo'
define i64 @julia_foo_18883() {
top:
  ret i64 16
}
julia> foo() = 2^5
foo (generic function with 3 methods)

julia> @code_llvm foo()

;  @ REPL[171]:1 within `foo'
define i64 @julia_foo_18884() {
L21.lr.ph:
; β”Œ @ none within `literal_pow'
; β”‚β”Œ @ none within `macro expansion'
; β”‚β”‚β”Œ @ intfuncs.jl:238 within `^'
; β”‚β”‚β”‚β”Œ @ intfuncs.jl:225 within `power_by_squaring'
      br label %L21

L21:                                              ; preds = %L21.lr.ph, %L21
      %0 = phi i64 [ 1, %L21.lr.ph ], [ %2, %L21 ]
      %value_phi35 = phi i64 [ 2, %L21.lr.ph ], [ %1, %L21 ]
; β”‚β”‚β”‚β”‚ @ intfuncs.jl:226 within `power_by_squaring'
; β”‚β”‚β”‚β”‚β”Œ @ int.jl:54 within `*'
       %1 = mul i64 %value_phi35, %value_phi35
; β”‚β”‚β”‚β”‚β””
; β”‚β”‚β”‚β”‚ @ intfuncs.jl:225 within `power_by_squaring'
; β”‚β”‚β”‚β”‚β”Œ @ int.jl:52 within `-'
       %2 = add nsw i64 %0, -1
; β”‚β”‚β”‚β”‚β””
; β”‚β”‚β”‚β”‚β”Œ @ operators.jl:341 within `>='
; β”‚β”‚β”‚β”‚β”‚β”Œ @ int.jl:410 within `<='
        %3 = icmp slt i64 %0, 1
; β”‚β”‚β”‚β”‚β””β””
      br i1 %3, label %L23, label %L21

L23:                                              ; preds = %L21
      %phitmp = shl i64 %1, 1
; β””β””β””β””
  ret i64 %phitmp
}

1 Like

I inspected the compiled code with @code_llvm and @code_native, and found that the interpretation of a was more complicated if the power index was 5 or 6. It is still a riddle why Julia compiler does so, but this is instructive information.

Without knowing, I would guess that calculating higher powers is more work, and the compiler will only do so much before giving up.

1 Like

I think that this may have to do with literal powers being specialised on using the literal_pow function?

And the fact that the code generated by Base.power_by_squaring is quite complicated for higher powers, as you can see in your last code segment, so presumably LLVM is not able to simplify it.

2 Likes

Thank you! In Julia, power calculations seem to be implemented in power_by_squaring funcion. Now I think this is the reason. With the power index such as 1 = 0b1, 2 = 0b10, 3 = 0b11, 4 = 0b100, 7 = 0b111 and 8 = 0b1000, this algorithm is enough simple for the precompiler to transform p into a constant number, but with 5 = 0b101 and 6 = 0b110 it’s not, because it requires a rather complicated process. Moreover, I tried with the power index 14 = 0b1110, 15 = 0b1111, 16 = 0b10000, 30 = 0b11110, 31 = 0b11111 and 32 = 0b100000, then p was interpretted as a constant number if the power index was 15, 16, 31 or 32. It reinforced the hypothesis.