Different `@code_llvm` output on macos and x86

LaurentPlagne · December 8, 2023, 5:08pm

Hi,
I wonder about the different outputs I obtain from @code_llvm with the same Julia version (1.10.0-rc2) on different architectures (ARM vs x86). The following script:


f(a,b) = a .+ b
f (generic function with 1 method)

@code_llvm debuginfo=:none f((1.,2.,3.,4.),(5.,6.,7.,8.))

returns this on x86 (Intel 13900K)

define void @julia_f_117([4 x double]* noalias nocapture noundef nonnull sret([4 x double]) align 8 dereferenceable(32) %0, [4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %1, [4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %2) #0 {
top:
  %3 = bitcast [4 x double]* %1 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = bitcast [4 x double]* %2 to <4 x double>*
  %6 = load <4 x double>, <4 x double>* %5, align 8
  %7 = fadd <4 x double> %4, %6
  %8 = bitcast [4 x double]* %0 to <4 x double>*
  store <4 x double> %7, <4 x double>* %8, align 8
  ret void
}

and that on arm (apple m1 max)

define void @julia_f_142([4 x double]* noalias nocapture noundef nonnull sret([4 x double]) align 8 dereferenceable(32) %0, [4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %1, [4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %2) #0 {
top:
  %3 = getelementptr inbounds [4 x double], [4 x double]* %1, i64 0, i64 2
  %4 = getelementptr inbounds [4 x double], [4 x double]* %2, i64 0, i64 2
  %5 = bitcast [4 x double]* %1 to <2 x double>*
  %6 = load <2 x double>, <2 x double>* %5, align 8
  %7 = bitcast [4 x double]* %2 to <2 x double>*
  %8 = load <2 x double>, <2 x double>* %7, align 8
  %9 = fadd <2 x double> %6, %8
  %10 = bitcast [4 x double]* %0 to <2 x double>*
  store <2 x double> %9, <2 x double>* %10, align 8
  %newstruct.sroa.3.0..sroa_idx9 = getelementptr inbounds [4 x double], [4 x double]* %0, i64 0, i64 2
  %11 = bitcast double* %3 to <2 x double>*
  %12 = load <2 x double>, <2 x double>* %11, align 8
  %13 = bitcast double* %4 to <2 x double>*
  %14 = load <2 x double>, <2 x double>* %13, align 8
  %15 = fadd <2 x double> %12, %14
  %16 = bitcast double* %newstruct.sroa.3.0..sroa_idx9 to <2 x double>*
  store <2 x double> %15, <2 x double>* %16, align 8
  ret void
}

The versioninfo() outputs on both machines are given below

versioninfo() output on x86

julia> versioninfo()
Julia Version 1.10.0-rc2
Commit dbb9c46795b (2023-12-03 15:25 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
  Threads: 1 on 32 virtual cores

versioninfo() output on apple silicon

Julia Version 1.10.0-rc2
Commit dbb9c46795b (2023-12-03 15:25 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores

I was expecting to see the same output from @code_llvm and a difference only with @code_native.
Is it because Julia assumes a 128-bit SIMD width on M1 chips?

P.S. I got the example from the nice video https://youtu.be/W1hXttRmuks?si=49UMwwkVqPSFird_

Oscar_Smith · December 8, 2023, 5:19pm

You should see the same output for @code_llvm optimize=false. What you are seeing is that LLVM uses AVX2 (256-bit vectors) for your Intel chip and only 128-bit vectors for the M1, since it optimizes the IR knowing which native target it will eventually be lowered to.
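A minimal sketch of how one might check this on a given machine, using the `f` from the first post. The helper names `llvm_ir` and `fadd_widths` are made up for illustration; `code_llvm`, `Sys.ARCH` and `Sys.CPU_NAME` are the standard Julia facilities. It captures the IR with and without LLVM optimization and lists the vector widths used by `fadd`:

```julia
using InteractiveUtils  # provides code_llvm (loaded automatically in the REPL)

f(a, b) = a .+ b

# Argument types matching f((1.,2.,3.,4.), (5.,6.,7.,8.))
const ARGTYPES = Tuple{NTuple{4,Float64}, NTuple{4,Float64}}

# Hypothetical helper: capture the IR as a string instead of printing it.
function llvm_ir(optimize::Bool)
    return sprint() do io
        code_llvm(io, f, ARGTYPES; optimize=optimize, debuginfo=:none)
    end
end

# Hypothetical helper: vector widths of all `fadd <N x double>` instructions.
function fadd_widths(ir::AbstractString)
    return Int[parse(Int, m[1]) for m in eachmatch(r"fadd <(\d+) x double>", ir)]
end

optimized   = llvm_ir(true)
unoptimized = llvm_ir(false)

println("Target: ", Sys.ARCH, " / ", Sys.CPU_NAME)
println("fadd vector widths, optimized IR:   ", fadd_widths(optimized))
println("fadd vector widths, unoptimized IR: ", fadd_widths(unoptimized))
```

Going by the IR posted above, the optimized widths would be `[4]` on the x86 machine and `[2, 2]` on the M1. The unoptimized IR is expected to look the same on both, since the target-specific vectorization has not run yet.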

LaurentPlagne · December 8, 2023, 5:24pm

Thanks !
It makes sense.

I wonder how the M1 (Max) is so fast with only 128-bit SIMD instructions.

Oscar_Smith · December 8, 2023, 7:02pm

A lot of the answer is that vector size is only part of the story. You also care about how many vectors can be processed per clock. According to Firestorm SIMD and FP Instructions, the M1 can do floating-point addition on 4 of its execution units, and can issue 1 add per execution unit per cycle, while Raptor Lake (Intel 13th gen) has one port (aka execution unit) that can do adds, but that unit can issue 2 adds per cycle.
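As a rough back-of-the-envelope sketch of that argument: peak adds per cycle is lanes per vector times number of units times issues per unit. The function name `peak_adds_per_cycle` is made up for illustration, and the figures plugged in are simply the ones quoted in this post, not independently verified. It also ignores clock speed, memory bandwidth and everything else that matters in practice.

```julia
# Peak Float64 adds per cycle = lanes per vector * units * issues per unit.
# Hypothetical helper; the inputs below are the numbers quoted in the post.
function peak_adds_per_cycle(; vector_bits, units, issues_per_unit, elem_bits=64)
    lanes = vector_bits ÷ elem_bits   # e.g. 128-bit vector holds 2 Float64
    return lanes * units * issues_per_unit
end

# M1 as described above: 128-bit vectors, 4 units, 1 add per unit per cycle
m1 = peak_adds_per_cycle(vector_bits=128, units=4, issues_per_unit=1)

# Raptor Lake as described above: 256-bit vectors, 1 port, 2 adds per cycle
raptor_lake = peak_adds_per_cycle(vector_bits=256, units=1, issues_per_unit=2)

println("M1:          ", m1, " Float64 adds/cycle")
println("Raptor Lake: ", raptor_lake, " Float64 adds/cycle")
```

With these numbers both come out at 8 Float64 adds per cycle, so the narrower vectors on the M1 are offset by the larger number of units that can do adds.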

Palli · December 8, 2023, 7:17pm

I can think of two reasons.

A. It has an enormous L1 cache (enormous for an L1 cache, that is). From Apple M1 - Wikipedia:

192+128 KB per core

[Does anyone know why it is shown that way?]

Intel likes to emphasize:
https://www.intel.com/content/www/us/en/products/sku/230496/intel-core-i913900k-processor-36m-cache-up-to-5-80-ghz/specifications.html

36 MB Intel® Smart Cache

which is nice, but it is not L1 cache; it’s L2 and/or last-level cache, I’m not sure which.

The big numbers sell, and L1 is never big, but I think it’s the most important number, for latency. You are always working on it (and on registers).

L1 is much larger than the architected register file (or I suppose rename registers too).

When you’re working on L2, or L3 etc. then it’s because stuff didn’t fit in L1. Then it helps for throughput.

B. The memory for the M1 is also much closer to the CPU, i.e. glued on, I believe, so latency is likely lower (it’s limited by the speed of light, over a distance counted in mm rather than cm).

Apple’s chips do not have as much RAM, however, and it’s not expandable. It is also unified (which can be a good thing but also a bad thing, since you must compare to RAM plus GPU RAM to get a fair size comparison).

You’re stuck if the RAM is not enough, and the largest sizes for Apple are very expensive. You can swap to flash. [Maybe we’ll see swapping to a second RAM tier, further away from the CPU, at some point.]

I’m thinking huge RAM sizes are getting outdated, and flash may do, except for training neural networks. Even just using the largest Falcon model needs at least 400 GB (but 1-bit networks are going to help, and 2-bit ones are now in use).

[To train the latest Gemini AI, i.e. beating OpenAI GPT-4, took more than one Google datacenter…]
