Apple silicon full power

First of all, I would like to thank everyone for the efforts put into v1.7, which supports the Apple silicon CPU :100: :bowing_man:

Just two years or so ago, nobody understood why Apple was moving away from standard GPU libraries like OpenCL. Now we can appreciate how powerful Apple’s SoC is and why it needs a new, dedicated Metal layer.

That said, could we expect Julia to be able to leverage the full power of Apple silicon beyond the CPU, i.e. the Neural Engine and the GPU? If this comes true, many linear algebra operations like matrix multiplication could see a 10x gain in performance!

If the past 10 years of improvements to their Axx chips tell us anything, it is that Apple could dominate and revolutionize computing hardware in the very near future, if not already. There are already a lot of benchmark tests showing that the M1 Pro/M1 Max is way more powerful than any top-configured PC.

It would be really nice for Julia, a new language that emphasizes performance, to work natively with Apple silicon’s full power. That would be a dream for scientific computing!

7 Likes

Not if clusters/supercomputers don’t start using Apple silicon.

2 Likes

IMHO:
The first step has already been taken: reverse engineering.

Apple Matrix coprocessor - Reverse Engineering

...
# AMX: Apple Matrix coprocessor
#
# This is an undocumented arm64 ISA extension present on the Apple M1. These
# instructions have been reversed from Accelerate (vImage, libBLAS, libBNNS,
# libvDSP and libLAPACK all use them), and by experimenting with their
# behaviour on the M1. Apple has not published a compiler, assembler, or
# disassembler, but by calling into the public Accelerate framework
# APIs you can get the performance benefits (fast multiplication of big
# matrices). This is separate from the Apple Neural Engine.
#
# Warning: This is a work in progress, some of this is going to be incorrect.
#
# This may actually be very similar to Intel Advanced Matrix Extension (AMX),
# making the name collision even more confusing, but it's not a bad place to
# look for some idea of what's probably going on.
...

Apple M1 Neural Engine - Reverse Engineering

Apple M1 GPU - Reverse Engineering

And as usual, adding “reverse engineering” to your search keywords … lets you check the latest status.

Related thread:

2 Likes

Is reverse engineering required to take full advantage of the chips? What are developers outside Apple expected to do? Is Apple’s idea to support a compiler?

It’s already a thing: it’s called Apple Clang.

2 Likes

Do you mean laptops or some special use cases, where they built special hardware support?

Because multi-core applications especially still seem to be dominated by AMD and Intel by far, and for single-threaded applications “top-configured PCs” seem to be roughly on par, from what I can tell… E.g. my one-year-old Ryzen 5800X scores 28,480 points on cpubenchmarks, while the M1 Pro 10-core scores 23,800 points, and my CPU isn’t really a top configuration anymore…

Still super impressive considering it’s a new processor doing this at ~40% of the power consumption in a fanless laptop, but that seems quite far from “way more powerful than any top-configured PCs”.

I’m not 100% up to date on special purpose benchmarks, so if you have any serious benchmarks that back up that claim (not those ominous ones from apple^^), I’d be pretty interested to see them! :slight_smile:

4 Likes

Using only the high-level (private) API ≠ “taking full advantage”.

For low-level assembly tuning, good documentation is required.

1 Like

It is a bit early for M1 Pro/Max benchmarks, but there is a large fraction (IMO a large majority) of scientific simulations that are bound by memory bandwidth. Nearly every large CFD, mechanics, or wave-propagation code (all mesh-based PDE solvers) is in this situation. So when a laptop is supposed to bring 400 GB/s to CPU computation (20 times more than usual laptops) at super low power, I can’t help thinking that this could be the most significant hardware step since GPGPU or multi-core processors. Although I am not quite sure yet that these figures will effectively be converted into large accelerations for simulation codes: for example, I wonder about a potential latency increase.

8 Likes

There is an undocumented matrix coprocessor in some M1 chips. Outside developers are apparently expected to use Apple’s libraries (e.g. Apple vecLib BLAS) in order to take advantage of it.
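As a sketch of what “use Apple’s libraries” could look like from Julia, here is one possible route — assuming the community package AppleAccelerate.jl, which on Julia 1.7+ can forward BLAS calls to Apple’s Accelerate framework through libblastrampoline (the package name and this swapping behavior are assumptions on my part, not something stated in this thread):

```julia
# Sketch: route Julia's dense linear algebra through Apple's Accelerate
# (vecLib) BLAS, which internally uses the AMX coprocessor on M1 chips.
# Assumes the AppleAccelerate.jl package is installed: ] add AppleAccelerate
using LinearAlgebra

BLAS.get_config()      # by default lists the bundled OpenBLAS

using AppleAccelerate  # loading it swaps the BLAS backend via libblastrampoline
BLAS.get_config()      # should now list Accelerate instead

A = randn(2000, 2000); B = randn(2000, 2000)
C = A * B              # this dgemm is now dispatched to Accelerate/AMX
```

Note that only the standard BLAS/LAPACK entry points get accelerated this way; the Neural Engine and the GPU would need separate bindings (e.g. via Metal).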

5 Likes

tests like this are flooding the web

1 Like

I don’t think we need to reverse engineer the chips. Instead, it would be quite good enough for Julia to be able to call Metal and Accelerate.

1 Like

And clearly that’s a laptop. Worse, it’s Apple’s own laptop, which had serious thermal throttling issues.

Julia should be able to use the Apple libraries, right? (Whether someone spends the effort to enable that is of course a relevant question.)

1 Like

I have a new Macbook Pro in the mail, and would really prefer to continue building ML / AI apps in Julia, so I hope this will come.

and this one is massive

5 sec. into the video: “they smoked high-end RTX graphics Windows machines”. Really?

3:30 in:

1440P Aztec Ruins Offscreen
(FPS higher is better)

310 FPS for M1 Max vs 205 FPS for $15,000 Mac Pro Vega II

No, higher FPS (than 205) isn’t automatically better. I’m sure it is better there, all else equal. RTX cards, however, have ray tracing, which is what I would want if I cared about graphics. Graphics are not just about speed/FPS: 60 FPS should be OK assuming it’s a constant rather than an average (and with temporal and spatial anti-aliasing it might not be there). I think people go for 100+ FPS for temporal anti-aliasing; there might be more clever ways to do it with less than 100 FPS. The old benchmarks seem meaningless. I’m sure you can find programs, related to graphics or not, where the M1 RAM limitation is a problem.

And will those new Macs still rotten the cables once per year :smiling_imp:

Can someone explain what this sentence below means?
And will those new Macs still rotten the cables once per year

I tried to read it ten times and I am still confused. English is not my first language.

OK, sorry. It was just a silly joke.

I had 3 generations of Mac laptops, and over those ~10 years the Apple cables (power supply) just rotted and I had to buy new ones, as well as new power supplies themselves. Unbelievable “Apple quality”, but that is my 10 years of experience with Apple.

FYI: I’m jumping the gun a bit, but note still not tier 1 (for 1.7 RCs or 1.8 master): blog post: Julia 1.7 Highlights by KristofferC · Pull Request #1419 · JuliaLang/www.julialang.org · GitHub

While we are now able to provide pre-built Julia binaries for this platform, its support is currently considered [tier 3]

I believe everything should work with Rosetta, and actually most things with native M1 binaries:

Remember that the x86-64 (Intel) binaries of Julia can also run on these machines, thanks to the Rosetta 2 compatibility layer, albeit with reduced performance.
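A quick way to check which build you are actually running (a small sketch; `Sys.ARCH` and `versioninfo` are standard Julia, nothing Mac-specific is assumed here):

```julia
using InteractiveUtils  # provides versioninfo() outside the REPL

# The native build reports :aarch64; the Intel build running under
# Rosetta 2 reports :x86_64 even on an M1 machine.
@show Sys.ARCH

versioninfo()  # prints the Julia version, OS, and CPU details
```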

There are still open issues for M1:

2 Likes