Different `@code_llvm` output on macOS (ARM) and x86

Hi,
I wonder about the different outputs I obtain from `@code_llvm` with the same Julia version (1.10.0-rc2) on different architectures (ARM vs x86). The following script:


julia> f(a,b) = a .+ b
f (generic function with 1 method)

julia> @code_llvm debuginfo=:none f((1.,2.,3.,4.),(5.,6.,7.,8.))

returns this on x86 (Intel 13900K)

define void @julia_f_117([4 x double]* noalias nocapture noundef nonnull sret([4 x double]) align 8 dereferenceable(32) %0, [4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %1, [4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %2) #0 {
top:
  %3 = bitcast [4 x double]* %1 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = bitcast [4 x double]* %2 to <4 x double>*
  %6 = load <4 x double>, <4 x double>* %5, align 8
  %7 = fadd <4 x double> %4, %6
  %8 = bitcast [4 x double]* %0 to <4 x double>*
  store <4 x double> %7, <4 x double>* %8, align 8
  ret void
}

and this on ARM (Apple M1 Max)

define void @julia_f_142([4 x double]* noalias nocapture noundef nonnull sret([4 x double]) align 8 dereferenceable(32) %0, [4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %1, [4 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(32) %2) #0 {
top:
  %3 = getelementptr inbounds [4 x double], [4 x double]* %1, i64 0, i64 2
  %4 = getelementptr inbounds [4 x double], [4 x double]* %2, i64 0, i64 2
  %5 = bitcast [4 x double]* %1 to <2 x double>*
  %6 = load <2 x double>, <2 x double>* %5, align 8
  %7 = bitcast [4 x double]* %2 to <2 x double>*
  %8 = load <2 x double>, <2 x double>* %7, align 8
  %9 = fadd <2 x double> %6, %8
  %10 = bitcast [4 x double]* %0 to <2 x double>*
  store <2 x double> %9, <2 x double>* %10, align 8
  %newstruct.sroa.3.0..sroa_idx9 = getelementptr inbounds [4 x double], [4 x double]* %0, i64 0, i64 2
  %11 = bitcast double* %3 to <2 x double>*
  %12 = load <2 x double>, <2 x double>* %11, align 8
  %13 = bitcast double* %4 to <2 x double>*
  %14 = load <2 x double>, <2 x double>* %13, align 8
  %15 = fadd <2 x double> %12, %14
  %16 = bitcast double* %newstruct.sroa.3.0..sroa_idx9 to <2 x double>*
  store <2 x double> %15, <2 x double>* %16, align 8
  ret void
}

The `versioninfo()` output for both machines is given below.

versioninfo() output on x86
julia> versioninfo()
Julia Version 1.10.0-rc2
Commit dbb9c46795b (2023-12-03 15:25 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, goldmont)
  Threads: 1 on 32 virtual cores
versioninfo() output on Apple Silicon
Julia Version 1.10.0-rc2
Commit dbb9c46795b (2023-12-03 15:25 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores

I was expecting to see the same output, with the differences only showing up in `@code_native`.
Is it because Julia assumes a 128-bit SIMD width on M1 chips?

P.S. I got the example from this nice video: https://youtu.be/W1hXttRmuks?si=49UMwwkVqPSFird_

You should see the same output with `@code_llvm optimize=false`. What you are seeing is that LLVM uses AVX2 (256-bit vectors) for your Intel chip and only 128-bit vectors for the M1, since it knows it will eventually be lowering to native code.
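For example, something like this (just a sketch, reusing the same `f` as above) should give essentially matching IR on both machines, while the native code will differ:

# Unoptimized IR, before LLVM's target-aware passes run; this should look
# the same on x86 and on the M1 (up to value numbering).
@code_llvm debuginfo=:none optimize=false f((1.,2.,3.,4.),(5.,6.,7.,8.))

# The target-specific vector width shows up in the optimized IR and in the
# native code (256-bit ymm registers on AVX2 vs 128-bit v registers on AArch64).
@code_native debuginfo=:none f((1.,2.,3.,4.),(5.,6.,7.,8.))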


Thanks!
That makes sense.

I wonder how the M1 (Max) is so fast with only 128-bit SIMD instructions.

A lot of the answer is that vector size is only part of the story. You also care about how many vectors can be processed per clock. According to Firestorm SIMD and FP Instructions, the M1 can do floating-point addition on 4 of its execution units and can issue one add per execution unit per cycle, while Raptor Lake (Intel 13th gen) has one port (aka execution unit) that can do adds, but that unit can issue 2 adds per cycle.
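If you want to see the effect on your own machines, here is a rough sketch (assuming the BenchmarkTools package is installed); with arrays small enough to stay in L1, the timing is dominated by FP-add throughput rather than memory:

using BenchmarkTools    # assumed available: ] add BenchmarkTools

a = rand(1024); b = rand(1024); c = similar(a)

# In-place broadcast, no allocation; LLVM vectorizes this loop with whatever
# SIMD width and however many add ports the target provides.
@btime $c .= $a .+ $b;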

I can think of two reasons.

A. It has an enormous (for an L1) cache: Apple M1 - Wikipedia

192+128 KB per core

[Does anyone know why it is listed that way?]

Intel likes to emphasize:
https://www.intel.com/content/www/us/en/products/sku/230496/intel-core-i913900k-processor-36m-cache-up-to-5-80-ghz/specifications.html

36 MB Intel® Smart Cache

which is nice, but it is not L1 cache; it's L2 and/or last-level cache (I'm not sure exactly what it refers to).

Big numbers sell, and L1 is never big, but I think it's the most important number for latency. You are always working out of it (and the registers).

L1 is much larger than the architected register file (or, I suppose, the rename registers too).

When you're working out of L2, L3, etc., it's because the data didn't fit in L1; those levels mainly help with throughput.
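A quick way to see those cache levels (again assuming BenchmarkTools; the sizes where the cliffs appear will differ between the two chips):

using BenchmarkTools, Printf

# Sum arrays of growing size; the per-element cost jumps once the working
# set spills out of L1, then L2, then the last-level cache.
for n in (2^10, 2^13, 2^16, 2^19, 2^22)      # 8 KB up to 32 MB of Float64
    x = rand(n)
    t = @belapsed sum($x)
    @printf("%8d elements (%8.1f KB): %6.3f ns/element\n", n, 8n / 1024, 1e9 * t / n)
end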

B. The memory on the M1 is also much closer to the CPU, i.e. on the same package I believe, so the latency is likely lower (it's limited by the speed of light, with the distance measured in mm rather than cm).

Apple's chips do not support as much RAM, however; it is not expandable, and it is unified (which can be a good thing but also a bad thing, since you have to compare against RAM plus GPU RAM to get a fair size comparison).

You're stuck if the RAM is not enough, and the largest configurations from Apple are very expensive. You can swap to flash. [Maybe we'll see swapping to a second RAM tier, further away from the CPU, at some point.]

I'm thinking huge RAM sizes are becoming less important and flash may do, except for training neural networks. Even just running the largest Falcon model needs at least 400 GB (though 1-bit networks are going to help, and 2-bit quantization is already in use).

[Training the latest Gemini AI, i.e. the one beating OpenAI's GPT-4, took more than one Google datacenter…]