Performance slowdown on AMD with Windows 10 compared to Intel with Linux (Arch)

Update, TL;DR: the problem is the AMD processor, or the combination of AMD + Windows, not Windows itself: on the same laptop, with the same version of Julia (1.7.3), performance is similar across the two operating systems.

Hi all,

I have two setups, as the title suggests:

  • A laptop with an i7-8565U CPU, Arch Linux, Julia 1.7.3 installed from pacman, using OpenBLAS.

  • A workstation with a Ryzen 9 5950X, Windows 10, Julia 1.7.3 installed from the website, using OpenBLAS.

I’m iteratively running this piece of code:

norm(A*bitpermutation - EMG)

where A and EMG are constant and bitpermutation changes at each iteration.
In particular, A is a 1013 × 19 matrix, bitpermutation is a 19 × 1 column vector, and EMG is 1013 × 1.

My laptop takes 0.000032 seconds on average, while the workstation takes 0.000075 seconds. Both have been measured with @time and @btime, and the results are consistent.
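Something along these lines reproduces the measurement (a sketch with random data standing in for my real A and EMG):

using LinearAlgebra, BenchmarkTools

A = randn(1013, 19)              # stand-in for the real, constant A
EMG = randn(1013, 1)             # stand-in for the real, constant EMG
bitpermutation = rand(0:1, 19)   # one example permutation

# Interpolating with $ makes @btime measure the computation itself,
# not access to the global variables.
@btime norm($A * $bitpermutation - $EMG)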

Now, my question is: why is my slow laptop faster than the fast workstation? Is it a Windows problem? Something related to SIMD/vector instructions (AVX, maybe)? Something else?

I can’t install Linux on the workstation and I’d like to avoid installing Windows on my laptop, so I can’t do this check directly.

Side question for Arch users: why is Julia installed from pacman much faster than the “julia-bin” package from the AUR?

I can’t tell you the answer to why you see such a stark timing difference; it’s hard to say without an MWE or access to e.g. @code_native output on those machines.

That definitely shouldn’t be the case: usually it’s the pacman install that causes problems, because the Arch repo has had numerous issues in the past, mostly related to using the system LLVM, while Julia requires certain patches that take a while to be upstreamed…

If you have a reproducible example, I’d like to try it. I’m on Arch as well, but I usually compile from source since I make PRs and the like. In my experience, the AUR or a self-compiled build has been consistently faster (though not by much).

Well, this is the MWE:

using LinearAlgebra

tempmin = Inf
musize = 19
A = randn(1013, musize)
EMG = randn(1013, 1)
for case in 0:2^(musize)-1
    global tempmin, solution  # needed when running this as a script
    # binary vector for this case, e.g. 5 -> [1, 0, 1, 0, 0, ...]
    bitpermutation = digits(Int32, case; base=2, pad=musize)
    tempval = @time norm(A*bitpermutation - EMG)
    if tempval < tempmin
        tempmin = tempval
        solution = bitpermutation
    end
end

With the AUR build, this snippet was about 35% slower; I really don’t know why.

Well, for one, I’d recommend not benchmarking in global scope.

Performance critical code should be inside a function

Any code that is performance critical should be inside a function. Code inside functions tends to run much faster than top level code, due to how Julia’s compiler works.
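Concretely, I mean something along these lines (a sketch; bestcase is a made-up name, and I’ve replaced undef with a concrete initial value):

using LinearAlgebra, BenchmarkTools

function bestcase(A, EMG, musize)
    tempmin = Inf
    solution = zeros(Int32, musize)  # concrete initial value instead of undef
    for case in 0:2^musize - 1
        bitpermutation = digits(Int32, case; base=2, pad=musize)
        tempval = norm(A * bitpermutation - EMG)
        if tempval < tempmin
            tempmin = tempval
            solution = bitpermutation
        end
    end
    return solution, tempmin
end

A = randn(1013, 19)
EMG = randn(1013, 1)
@btime bestcase($A, $EMG, 19)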

Your code also allocates a number of (I think unnecessary) intermediate arrays, and allocation is usually a little more expensive on Windows, partly because syscalls there cost more. It’s hard to say for sure though, since the timings are relatively close.

This doesn’t explain the difference you observe, of course, but it’s quite possible that allocation and the internal behavior of malloc on Windows make a difference here.
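If the allocations do turn out to matter, they can be hoisted out of the hot loop with preallocated buffers and in-place operations; a sketch (residualnorm! and buf are made-up names):

using LinearAlgebra

function residualnorm!(buf, A, b, e)
    mul!(buf, A, b)   # writes A*b into buf without allocating
    buf .-= e         # in-place subtraction: buf = A*b - e
    return norm(buf)
end

A = randn(1013, 19)
b = rand(0:1, 19)
e = randn(1013)
buf = similar(e)      # one reusable 1013-element buffer
residualnorm!(buf, A, b, e)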

Yes, the code is inside a function. The seemingly useless parts, like the “solution” variable, are used after that piece of code; in particular, “solution” is inserted into a matrix and returned. Anyway, the point where I see significant differences is the line that computes the norm. In particular, the timing doesn’t change with or without the norm call.

The full function code is:

function naiveminerr(A::Matrix{Float32}, EMG::Matrix{Float32})::Matrix{Float32}
    totaltime = size(EMG, 2)
    musize = size(A, 2)
    res = Matrix{Int32}(undef, musize, totaltime)
    for instant in 1:totaltime
        println(instant/totaltime*100)  # progress in percent
        tempmin = Inf
        solution = undef  # placeholder; reassigned to a Vector{Int32} below
        for case in 0:2^(musize)-1
            # try every binary activation pattern of length musize
            bitpermutation = digits(Int32, case; base=2, pad=musize)
            tempval = @time norm(A*bitpermutation - EMG[:, instant])
            if tempval < tempmin
                tempmin = tempval
                solution = bitpermutation
            end
        end
        res[:, instant] = solution
    end
    return res
end

Anyway, the performance slowdown is the same; the previous example was just simplified to isolate the problematic line.
I’m now computing that matrix in a completely different way, but I was curious to understand why there is a slowdown on the more performant CPU. I think I’ll try running the same code on my laptop under Windows to gather more information. I’ve also seen the same problem on other dot products in the same codebase, so I suspect something isn’t working properly at a low level.

To be clear, when you said that the significant differences show up on the line that computes the norm, which measurement are you talking about? An individual call to norm(...) or the whole function? I’m asking because on my Linux machine the call to norm takes just 0.000014 seconds, while the whole computation takes ~10 s, which makes me think it’s not the main culprit of the slowdown. This is my machine:

julia> versioninfo()
Julia Version 1.9.0-DEV.614
Commit 0488abc625* (2022-05-24 15:28 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4

The function you’ve posted is also partly type unstable (for example, solution is initialized to undef and later assigned a Vector{Int32}), so that may have an influence as well. Another factor may be a difference in single-core performance between the two CPUs, though I have to admit I’m not up to date on which manufacturer is generally better there. I believe that used to favor Intel (which may play a part), but I’m not sure.
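If you want to see it, @code_warntype shows the instability (a sketch, assuming the naiveminerr from above is defined):

using InteractiveUtils, LinearAlgebra

A = randn(Float32, 1013, 19)
EMG = randn(Float32, 1013, 4)
# Union- or Any-typed slots in the output (shown in red in the REPL)
# indicate the instability, e.g.
# solution::Union{UndefInitializer, Vector{Int32}}.
@code_warntype naiveminerr(A, EMG)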

Have you considered using dot(x, A, y) to avoid a potentially expensive matrix product? Then again, both A and EMG are matrices, not vectors, so I’m not sure what you mean by the dot products there.
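For reference, dot(x, A, y) computes the bilinear form x' * A * y without materializing the intermediate A * y; a minimal example with made-up data:

using LinearAlgebra

x = randn(1013)
A = randn(1013, 19)
y = randn(19)

# Same value as dot(x, A * y), but without allocating A * y first.
dot(x, A, y) ≈ dot(x, A * y)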

First of all thanks for your support!

Yes, I’m talking about the average computation time of norm(...). The whole inner for loop takes ~7 s.
The problem is that totaltime is ~65,000, so the full computation would take roughly five days (65,000 × 7 s ≈ 455,000 s).

About the type instability: where is the function type unstable? In the product A*bitpermutation? I already checked that part: nothing changes if bitpermutation is a Float. About the CPUs, I really don’t know.
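Concretely, the Float variant I tried was along these lines (a sketch for one case):

A = randn(Float32, 1013, 19)
case = 12345
# Matching the eltypes should let the matrix-vector product dispatch
# to BLAS sgemv instead of the generic fallback used for mixed
# Int32/Float32 inputs; the timing didn't change for me either way.
bitpermutation = Float32.(digits(case; base=2, pad=19))
A * bitpermutation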

About dot(x, A, y): all the dimensions are right, since EMG has the same number of rows as A, and the number of columns of A matches the length of the column vector bitpermutation.

One additional piece of information: the Windows build of Julia uses libLLVM 12, while the Arch one uses libLLVM 13. I don’t know what role that might play.
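These are easy to check on both machines, by the way (the values in the comments are just illustrative):

using InteractiveUtils, LinearAlgebra

versioninfo()                    # OS, CPU, LLVM and BLAS info
Base.libllvm_version             # LLVM version Julia was built against, e.g. v"12.0.1"
Sys.CPU_NAME                     # LLVM's CPU target, e.g. "znver3" on a Ryzen 9 5950X
LinearAlgebra.BLAS.get_config()  # which BLAS backend is loaded (Julia >= 1.7)
BLAS.get_num_threads()           # thread count can differ between installs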

@Sukera, I just checked on the laptop after switching to Windows 10: it is slightly slower than on Arch, about 10% on average. That seems close enough and makes sense to me. So… it is probably a problem with AMD processors, or with that generation of processors. Unfortunately I really can’t install Arch on the workstation to check fully.