Inconsistent performance Julia 1.6 vs 1.5 on Linux

I am trying to switch to Julia from Matlab+Fortran, and I am a bit frustrated with the performance inconsistency across versions. I used the profiler to isolate the lines in my code that take longer than others, and arrived at the example code below. It computes a very common equation in economics, typically evaluated inside a loop as part of a contraction mapping. Of course, this is just an example; the actual loop contains a lot of other code.

Anyhow, when I time it, it takes 0.20 seconds on Julia 1.5.x and 1.4.x, but 0.90 seconds on 1.6.x and 1.7.x on various Ubuntu Linux flavors (20.04, 21.04, etc.). That is more than 4 times slower with the newer versions of Julia on Linux. The same code takes about 0.40 seconds consistently across versions on Windows.

Just for comparison, it takes about 0.20 seconds in Matlab (Windows). Julia is just as fast as Matlab with this vectorized operation, but only with versions up to 1.5.x, and only on Linux. I am using the same computer (AMD, 8 cores, 48 GB RAM), I do not time the first run so that compilation time is not included, etc.

I am a bit hesitant to switch because I probably can’t keep using version 1.5.x forever.

I would appreciate any thoughts. Thanks a lot.

Example code:


function rtest4()
    N1 = 62
    N2 = 100
    theta = 5.0
    A = rand(Float64, N2, N1)
    B = rand(Float64, N2, N1)

    for ii = 1:2000
        C = (sum((B .^ theta) .* A, dims = 2)) .^ (1.0 / theta)
    end
end
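For reference, a minimal way to time this while excluding compilation (a sketch, assuming BenchmarkTools.jl is installed):

using BenchmarkTools

@btime rtest4()   # reports the minimum over many samples, so compile time is excluded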

I can reproduce this, with an even larger performance difference. Hopefully someone more informed can chime in.

Here is a more streamlined function showing the performance difference:

julia> using Random, BenchmarkTools  # MersenneTwister needs Random; @btime below needs BenchmarkTools

julia> function rtest4()
           rng = MersenneTwister(1234)
           N1 = 62
           N2 = 100
           theta = 5.0
           A = rand(rng, N2, N1)
           B = rand(rng, N2, N1)
           C = (sum((B .^ theta) .* A, dims = 2)) .^ (1.0 / theta)
       end;

On 1.6:

julia> @btime rtest4();
  601.963 μs (24 allocations: 166.95 KiB)

On 1.5:

julia> @btime rtest4();
  150.233 μs (26 allocations: 167.03 KiB)

It’s definitely the C = ... line; otherwise the difference in times wouldn’t scale the way it does.

1.5 version info:

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

1.6 version info:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

I’m seeing:

julia> @btime rtest4()
  423.483 ms (16004 allocations: 98.36 MiB)

julia> versioninfo()
Julia Version 1.7.0-beta3.0

and

julia> @btime rtest4()
1.086 s (16004 allocations: 98.42 MiB)

julia> versioninfo()
Julia Version 1.5.3

on Windows 10, Intel i7.


Can you run the following and report results?

using Random  # for MersenneTwister

function rtest5()
    rng = MersenneTwister(1234)
    A = rand(rng, 1024)
    C = A .^ 3.456532
end

If this is that regression, the problem is a known issue with scalar pow caused by an accidental change of libm version (which is why this isn’t reproducible on all OSes).
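To see whether scalar pow alone shows the slowdown, one can also benchmark a single call in isolation (a sketch, assuming BenchmarkTools.jl; the setup clause keeps the inputs from being constant-folded):

using BenchmarkTools

# Time one scalar pow call; run this on both Julia versions and compare.
@btime x^y setup=(x = 1.0 + rand(); y = 3.456532)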


The performance difference exists for me with that function.


@Erhan sorry about the regression. This is why we try not to rely on libm for things. The bad news is that I didn’t have time to fix this for 1.7. The good news is that, if you can accept a small amount of inaccuracy, you can work around it for now with Base.:^(x, y) = exp2(y * log2(x)).


Thanks a lot, yes, same problem here. 1.5.x is about 4 to 5 times faster than 1.6.x.

Thanks a lot for pointing out the problem, and the solution. Indeed, if I use theta=1.0, there is no difference in performance, which is consistent with scalar pow being the culprit. It is great to know that you are aware of the problem and that it will be fixed in a future release. There are a lot of things I like about Julia.

I might be off here, but when I see code like yours, with sum(..., dims = 2) and elementwise products, I suspect there is a Tullio/LoopVectorization solution with potentially large speedups around the corner, so it might be worth looking into if this is really the bottleneck in your code.
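For what it’s worth, here is a sketch of what that might look like (assuming Tullio.jl is installed; rtest4_tullio is just an illustrative name, and the scalar pow calls themselves are unchanged):

using Tullio

function rtest4_tullio(A, B, theta)
    # Fuse the power, the product, and the reduction over the second
    # dimension, avoiding the intermediate N2-by-N1 temporaries.
    @tullio C[i] := B[i, j]^theta * A[i, j]
    C .= C .^ (1 / theta)
    return C
end

Loading LoopVectorization.jl before defining the function typically lets Tullio generate faster kernels.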


Thanks for the tip. Yes, it is probably not the most efficient way to do it. It was just copied and pasted from Matlab with minor syntax modifications.

Just in case some newbie like myself tries to implement the solution without giving it enough thought: I think the override works better if we declare the types of the inputs explicitly, since each function can have multiple methods for different input types, like this: Base.:^(x::Float64, y::Float64) = exp2(y * log2(x)). I also like that we do not need to define a separate broadcasting version, since any function can be broadcast with x .^ y or, like a regular function, with (^).(x, y).
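Putting that together, a minimal sketch of the typed override in use (the array shape mirrors the earlier example; note the exp2/log2 trick assumes x > 0 and costs a few ulps of accuracy):

# Workaround: replace Base's Float64^Float64 method, trading a small
# amount of accuracy for speed (valid only for positive x).
Base.:^(x::Float64, y::Float64) = exp2(y * log2(x))

A = rand(100, 62)
C1 = A .^ 5.0       # broadcasting picks up the overridden scalar method
C2 = (^).(A, 5.0)   # equivalent: any function can be broadcast with f.(args...)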
