Possible threading bug on M1 Max (only ARM build)

I got a new MacBook Pro 16" M1 Max with 64 GB of memory and installed both the 1.7.2 ARM build and the 1.6.5 macOS (x86, so non-ARM) build. The simple Threads.@threads example from the docs behaves as expected on the non-ARM build but not on the ARM build.

Here is the expected output, from the 1.6.5 non-ARM build:

~ % julia6 -t 8
...
julia> Threads.nthreads()
8

julia> a = zeros(Int, 10);

julia> Threads.@threads for i in 1:length(a)
         a[i] = Threads.threadid()
       end

julia> a'
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  2  2  3  4  5  6  7  8

But on the 1.7.2 ARM build, most of the array never receives a thread id:

~ % julia7 -t 8
...
julia> Threads.nthreads()
8

julia> a = zeros(Int, 10);

julia> Threads.@threads for i in 1:length(a)
         a[i] = Threads.threadid()
       end

julia> a'
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  0  0  0  0  0  0  0  0

I tried starting with different numbers of threads, but it makes no difference. Some of the array elements (at the tail end) are always left at zero. I’ll install and check the non-ARM 1.7.2 build next.

Has anyone else seen similar problems?
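
One way to tell whether the zeros come from loop iterations that never ran is to count the iterations separately from the array writes. This is only a minimal sketch: the counter name `hits` is made up for illustration, and it assumes nothing beyond `Base.Threads`.

```julia
using Base.Threads

a = zeros(Int, 10)
hits = Atomic{Int}(0)   # hypothetical counter, incremented once per iteration

@threads for i in eachindex(a)
    a[i] = threadid()
    atomic_add!(hits, 1)   # atomic, so concurrent increments are not lost
end

# If hits[] is smaller than length(a), some iterations never executed.
println("iterations run: ", hits[], " of ", length(a))
println("elements still zero: ", count(iszero, a))
```

On a healthy setup both numbers should agree with the array length and no element should remain zero.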

You should try the 1.8 beta. The 1.7.x ARM builds are very experimental.

Julia Version 1.8.0-DEV.1434
Commit 4abf26eec8 (2022-01-30 20:04 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.2.0)
  CPU: Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, cyclone)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
mumuse4.jl
@show Threads.nthreads()
a = zeros(Int, 10);
Threads.@threads for i in 1:length(a)
    a[i] = Threads.threadid()
end

a'
julia> include("mumuse4.jl")
Threads.nthreads() = 8
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  2  2  3  4  5  6  7  8

Thanks, Oscar and Laurent. It is better on 1.8.0-beta1, but only thread 1 is used:

Julia Version 1.8.0-beta1
Commit 7b711ce699 (2022-02-23 15:09 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.2.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_COPY_STACKS = 1
  JULIA_NUM_THREADS = 8

Still not what I would expect:

julia> include("mumuse4.jl")
Threads.nthreads() = 8
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  1  1  1  1  1  1  1

Even worse, the buggy behaviour also shows up on the 1.7.2 x86 build:

Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, westmere)
Environment:
  JULIA_COPY_STACKS = 1
  JULIA_NUM_THREADS = 8

The tail-end values are still left without thread ids:

julia> include("mumuse4.jl")
Threads.nthreads() = 8
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  0  0  0  0  0  0  0  0

This might be a weird variant of this issue: Darwin/ARM64: Julia freezes on nested `@threads` loops · Issue #41820 · JuliaLang/julia · GitHub.

I can’t reproduce the behaviour you see with 1 1 1 1 as the output.

julia> a = zeros(Int, 10);

julia> Threads.@threads for i in 1:length(a)
         a[i] = Threads.threadid()
       end
julia> a'
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  2  2  2  3  3  4  4

julia> versioninfo()
Julia Version 1.9.0-DEV.165
Commit 8076517c97* (2022-03-09 19:46 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_PRECOMPILE_TASKS = 4
  JULIA_NUM_THREADS = 4

Thanks @gbaraldi, I’ll check that thread. In the meantime I also built and tested the latest Julia master branch, but I see the same buggy behaviour:

julia> versioninfo()
Julia Version 1.9.0-DEV.174
Commit 258ddc07d4 (2022-03-12 08:01 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_COPY_STACKS = 1
  JULIA_NUM_THREADS = 8

julia> include("mumuse4.jl")
Threads.nthreads() = 8
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  1  1  1  1  1  1  1

Try @threads :static for ...

I also tried with :static, but it makes no difference.
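
For reference, a minimal sketch of the `:static` variant of the same loop, assuming only `Base.Threads`. With the static schedule each thread is given one contiguous chunk of the iteration range, so on a working setup the recorded ids should come out non-decreasing.

```julia
using Base.Threads

a = zeros(Int, 10)

# :static splits 1:length(a) into contiguous chunks, one per thread.
@threads :static for i in eachindex(a)
    a[i] = threadid()
end

println(a')
```

If some entries stay at zero, or every entry is 1 despite several threads being available, the setup shows the problem described in this thread.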

To summarise: when running the above code (with or without :static) on my M1 Max MacBook Pro (latest macOS), I get the following.

Expected behaviour on: Julia 1.6.5 x86

1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  2  2  3  4  5  6  7  8

Buggy behaviour on Julia 1.7.2 x86 and Julia 1.7.2 arm:

1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  0  0  0  0  0  0  0  0

“Non-buggy” but only using threadid 1 on Julia 1.8.0-beta1 and on Julia 1.9.0-DEV.174:

1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  1  1  1  1  1  1  1  1

The behaviour is the same if started with 2, 4, or 8 threads (with julia -t).
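
The three outcomes in this summary can be told apart mechanically. The helper below is a hypothetical sketch (the name `classify_ids` and its return symbols are made up for illustration), fed with the exact arrays reported above.

```julia
# Classify the thread-id array produced by the test loop.
# nthreads is the number of threads Julia was started with.
function classify_ids(ids::AbstractVector{<:Integer}, nthreads::Integer)
    if any(iszero, ids)
        return :unwritten       # some elements were never assigned
    elseif nthreads > 1 && length(ids) > 1 && all(==(1), ids)
        return :single_thread   # everything ran on thread 1
    else
        return :spread          # work was distributed over threads
    end
end

reported = [
    "1.6.5 x86"       => [1, 1, 2, 2, 3, 4, 5, 6, 7, 8],
    "1.7.2 x86/arm"   => [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "1.8.0-beta1/1.9" => ones(Int, 10),
]

for (label, ids) in reported
    println(rpad(label, 18), classify_ids(ids, 8))
end
```

An all-ones result is only treated as suspicious here because more than one thread was requested; with a single thread it is the expected outcome.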

It is strange that I could not reproduce your results on my M1 Max. I will try to find some time to switch Julia versions later this weekend.

Adding these results from running Base.runtests(["threads"]; ncores = 8) on different versions since it might be relevant:

  • No errors on Julia 1.6.5 x86
  • Freezes on Julia 1.7.2 x86
  • Co-schedule error on Julia 1.7.2 arm and 1.8.0-beta1 and 1.9.0-DEV.174

The Co-schedule error is shown below. It is the same on 1.7.2 arm, 1.8.0-beta1 arm, and 1.9.0-DEV.174 arm, although the line numbers differ, most likely due to changes in the test code:

Test Failed at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/test/threads_exec.jl:848
  Expression: (current_task()).sticky == true
   Evaluated: false == true
Co-schedule: Error During Test at /Applications/Julia-1.7.app/Contents/Resources/julia/share/julia/test/threads_exec.jl:844

Still unable to reproduce your bug on the latest master… I have no idea why.

Julia Version 1.9.0-DEV.174
Commit 258ddc07d4 (2022-03-12 08:01 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
julia> include("mumuse4.jl")
Threads.nthreads() = 8
a' = [2 2 1 1 5 3 7 4 8 6]
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 2  2  1  1  5  3  7  4  8  6

julia> include("mumuse4.jl")
Threads.nthreads() = 8
a' = [1 1 2 2 3 5 7 4 8 6]
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  2  2  3  5  7  4  8  6

julia> include("mumuse4.jl")
Threads.nthreads() = 8
a' = [2 2 1 1 3 7 5 4 8 6]
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 2  2  1  1  3  7  5  4  8  6

I can reproduce by setting the ENV variable JULIA_COPY_STACKS=1.
If I do not set it, I get the expected behavior.

This is on a regular (4/4) M1.
I can also reproduce this on an Intel CPU.
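
A quick check for this condition could look like the sketch below. The function name `copy_stacks_enabled` is hypothetical, and the check only looks for the literal value "1", since that is the only setting this thread shows to trigger the problem.

```julia
# Returns true when JULIA_COPY_STACKS is set to "1" in the given environment.
# Assumption: only the value "1" is tested, because that is the value seen in this thread.
function copy_stacks_enabled(env = ENV)
    return get(env, "JULIA_COPY_STACKS", "") == "1"
end

if copy_stacks_enabled() && Threads.nthreads() > 1
    @warn "JULIA_COPY_STACKS=1 is set; Threads.@threads may not distribute work across threads."
end
```

The variable shows up in the Environment section of versioninfo(), which is how the difference between the working and failing setups in this thread can be spotted.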


Thanks Chris.

That seems to be it. If I set JULIA_COPY_STACKS=0, the behaviour is as expected:

julia> include("mumuse4_static.jl")
Julia Version 1.9.0-DEV.174
Commit 258ddc07d4 (2022-03-12 08:01 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_COPY_STACKS = 0
  JULIA_NUM_THREADS = 8
versioninfo() = nothing
Threads.nthreads() = 8
1×10 adjoint(::Vector{Int64}) with eltype Int64:
 1  1  2  2  3  4  5  6  7  8

Correct behaviour also on 1.8.0-beta1 arm, 1.7.2 arm, and 1.7.2 x86.

Seems the docs might need to mention this.

Anyway, thanks to all of you who got involved.

I checked my .bashrc and found that I had originally set JULIA_COPY_STACKS=1 for the Taro.jl package. I don’t remember why, but I have now deleted it. Anyway, thanks again.

Taro.jl relies on JavaCall.jl, which requires JULIA_COPY_STACKS=1.

EDIT:
I’ve filed an issue: `@threads` does not work with `JULIA_COPY_STACKS=1` · Issue #44589 · JuliaLang/julia · GitHub


Well, there it is then. JavaCall.jl still asks people to set JULIA_COPY_STACKS to 1, which will then not work on M1 Macs, I guess…

It does not work on Intel Linux either.

EDIT: Nor does it work on AMD Linux, not that I was expecting it to. Pretty sure it’s just broken for all platforms.


Ok wow, good we found it then.