Runtime (memory) on M1 Macbooks: something is not right

I was following this thread: Trying to understand memory usage

If I do this (see link above) on my MacBook 13″ with an M1 processor, 8 GB RAM and an SSD (Julia under Rosetta, Version 1.7.1 (2021-12-22)):

julia> using LinearAlgebra
julia> using BenchmarkTools

julia> function func1()
           a = rand(100, 1000)
           return norm(a, 2)
       end

julia> @benchmark begin
           for i in 1:10000
               func1()
           end
       end

BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 92.445 s (0.04% GC) to evaluate,
with a memory estimate of 7.45 GiB, over 20000 allocations.

Something is not right here: a runtime of 92 seconds, while the poster in the original thread (see link above) got it back in about 4 seconds.

My Python version takes (time python test.py): 5.41s user 0.05s system 99% cpu 5.493 total

Even if Python does not allocate a new array every time func1() is called in the loop (which I don't know, just speculation), there is no reason why Julia should be this slow (note: I haven't checked whether the Python code does exactly the same as the Julia code, e.g. regarding 1-based indexing):

import numpy as np

def func1():
    a = np.random.rand(100, 1000)
    # return np.linalg.norm(a, axis=0)
    return np.linalg.norm(a, axis=1)

def benchmark():
    for i in range(0, 10000):
        tmp = func1()
        # print statement won't make a difference in terms of timing
        # print('i ', i, np.sum(tmp))
    return 'okay'

print(benchmark())

Edit: My Python is Python 3.8.12, [Clang 10.0.0 ] :: Anaconda, Inc. on darwin
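As a side check on the allocation question: Julia's built-in @allocated macro reports the bytes allocated by a call. A minimal sketch (exact byte counts may vary slightly):

julia> using LinearAlgebra

julia> function func1()
           a = rand(100, 1000)  # allocates a fresh 100×1000 Matrix{Float64} on every call
           return norm(a, 2)
       end;

julia> @allocated func1()  # roughly 100 * 1000 * 8 bytes = 800 kB per call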

I can't say much about any M1/Julia issues, but one general thing that catches my eye is that you don't have to write the loop yourself for benchmarks. To use @benchmark in Julia you just do

@benchmark func1()

You can set the number of samples to take for benchmarking with

BenchmarkTools.DEFAULT_PARAMETERS.samples = 10000

but 10000 is already the default value.
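For completeness, the sample count is only an upper bound; the time budget usually binds first. A minimal sketch of the two relevant fields (these are the BenchmarkTools defaults):

julia> using BenchmarkTools

julia> BenchmarkTools.DEFAULT_PARAMETERS.samples  # upper bound on the number of samples
10000

julia> BenchmarkTools.DEFAULT_PARAMETERS.seconds  # total time budget per benchmark
5.0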

This is what I get:

julia> @benchmark func1()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   97.700 μs …   7.997 ms  ┊ GC (min … max):  0.00% …  0.00%
 Time  (median):     324.900 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   384.368 μs ± 419.740 μs  ┊ GC (mean ± σ):  12.67% ± 11.68%

Comparing it with a proper Python benchmark:

import numpy as np

def func1():
    a = np.random.rand(100,1000)
    return np.linalg.norm(a, axis=1)

import timeit
num_runs = 10000
duration = timeit.Timer(func1).timeit(number = num_runs)
avg_duration = duration/num_runs
print(f'On average it took {avg_duration} seconds')

On average it took 0.0010858759800000002 seconds

This makes Julia roughly 3x faster than Python/NumPy. In any case, these numbers seem much more comparable; perhaps you can try this on your M1 system?

Also, your benchmark is vastly dominated by the generation of random numbers. If you're interested in benchmarking the norm function, just do that:

julia> using BenchmarkTools, LinearAlgebra

julia> function func1()
           a = rand(100,1000)
           return norm(a,2)
       end
func1 (generic function with 1 method)

julia> @benchmark func1()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  138.996 μs … 848.082 μs  ┊ GC (min … max): 0.00% … 53.99%
 Time  (median):     140.990 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   151.041 μs ±  44.092 μs  ┊ GC (mean ± σ):  3.20% ±  8.53%

  █▅▄▂▂▁                                                        ▁
  ████████▇▇▇▇▇▇██▇▆▆▅▅▃▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▇▆▇▇▇ █
  139 μs        Histogram: log(frequency) by time        412 μs <

 Memory estimate: 781.30 KiB, allocs estimate: 2.

julia> @benchmark norm(a, 2) setup=(a = rand(100, 1000))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  34.553 μs … 105.522 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.873 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.324 μs ±   4.416 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▇██▆▅▅▄▃▃▃▂▂▂▁▁▁                                            ▂
  ██████████████████████▇█▇▇▇▇▆▆▅▆▆▇▄▆▆▅▅▅▆▆▅▆▆▅▅▅▅▅▆▆▅▆▆▅▅▅▅▅▄ █
  34.6 μs       Histogram: log(frequency) by time        59 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
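To see how much of func1 is just random number generation, one could benchmark rand on its own; judging from the medians above (~141 μs for func1 vs ~36 μs for norm alone), roughly 100 μs per call goes to rand plus allocation. A sketch:

julia> @benchmark rand(100, 1000)  # isolates the RNG + allocation cost from the norm itself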

It seems both of these answers, while providing sound advice, are slightly missing OP's point. They are running a benchmark which, irrespective of the merits of its exact implementation, should take on the order of 5 seconds (indeed, on my system I see min 4.3 s, median 4.7 s), but it takes twenty times as long on their M1 Rosetta system.

It would be helpful if anyone could try this on M1 Rosetta to see if it is a problem specific to that setup. (@giordano I was under the impression you had an M1?)

Yes, as a way to time norm this is awful, but as some kind of memory stress test… I guess it runs into some limitation of Rosetta? No, see below. On an M1 with 16 GB of memory:

julia> @time for i in 1:10_000
           func1()
       end
  4.307090 seconds (20.00 k allocations: 7.451 GiB, 13.03% gc time)  # native
150.006328 seconds (30.00 k allocations: 7.451 GiB, 0.09% gc time)  # rosetta

Maybe one run isn't so different?

julia> @btime func1();
  min 313.833 ΞΌs, mean 397.210 ΞΌs (2 allocations, 781.30 KiB)  # native
  min 14.915 ms, mean 15.115 ms (2 allocations, 781.30 KiB)  # rosetta

julia> 15.115 / 0.397
38.07304785894206

julia> function func2()
       a = fill(pi/2,100,1000)  # instead of random numbers
       return norm(a,2)
       end
func2 (generic function with 1 method)

julia> @btime func2();
  min 218.042 ΞΌs, mean 331.412 ΞΌs (2 allocations, 781.30 KiB)
  min 13.020 ms, mean 13.299 ms (2 allocations, 781.30 KiB)

I'll have more time later, but even after reducing the loop count from 1e4 to 1e3 it still takes far too long:

@benchmark begin
    for i in 1:1000
        func1()
    end
end

BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 9.309 s (0.12% GC) to evaluate,
with a memory estimate of 762.99 MiB, over 2000 allocations.

Actually, I am not interested in micro-benchmarks, but the above doesn't seem right. I really wonder if something is interfering with my Julia installation. Watching Activity Monitor while the above little program runs, I cannot see Julia consuming a lot of memory or bringing the system to a standstill on my 8 GB RAM and SSD powered little MacBook.

My OSX: macOS Big Sur 11.6.1
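One quick sanity check is which architecture the running Julia binary reports. A minimal sketch (Sys.ARCH is :x86_64 under Rosetta even on M1 hardware, :aarch64 for a native build):

julia> Sys.ARCH  # :x86_64 would confirm this session runs under Rosetta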

x86-64 Julia 1.6:

julia> using LinearAlgebra, BenchmarkTools

julia> @benchmark norm(a, 2) setup=(a = rand(100, 100))
BenchmarkTools.Trial: 5057 samples with 1 evaluation.
 Range (min … max):  910.791 μs …   5.149 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     946.209 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   969.911 μs ± 142.010 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▅▅▆▆▅▅▅▄▃▂▂▂▁▁▁▁▁                                           ▂
  ███████████████████████▇▆▆▇▇▆▆▅▅▆▆▆▅▅▅▆▅▇▅▅▁▃▄▅▃▄▄▅▁▅▁▃▃▄▄▁▄▅▃ █
  911 μs        Histogram: log(frequency) by time       1.44 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.6.4
Commit 35f0c911f4 (2021-11-19 03:54 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, westmere)

aarch64 Julia 1.7:

julia> using LinearAlgebra, BenchmarkTools

julia> @benchmark norm(a, 2) setup=(a = rand(100, 100))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.375 μs … 52.667 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.667 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.038 μs ±  1.833 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▄  ▆▆▄▁▁▅▄▂▂  ▁ ▁ ▁                                       ▂
  ███▃▇████████████████▇▆▄▁▁▄▁▃▁▃▁▃▁▁▁▃▁▁▁▃▃▁▁▁▄▁▁▃▃▄▃▃▁▃▄▃▁▃ █
  12.4 μs      Histogram: log(frequency) by time      18.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.7.0
Commit 3bf9d17731 (2021-11-30 12:12 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.1.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)

When you benchmark only norm and not the generation of random numbers, the ratio is closer to 75 than to 40. Yes, norm seems to be particularly badly affected under Rosetta (sorry, I don't have the same version of Julia for both architectures).
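The arithmetic behind that ratio, from the minimum times above:

julia> 910.791 / 12.375  # min time ratio, x86-64 (Rosetta) over aarch64 (native): ≈ 73.6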

The generic fallback one is less affected than the BLAS version:

julia> @btime norm(a, 2) setup=(a = rand(100, 100));
  min 20.333 ΞΌs, mean 20.716 ΞΌs (0 allocations)  # native, 1.8
  min 1.483 ms, mean 1.491 ms (0 allocations)  # rosetta, 1.7 + openblas

julia> @btime LinearAlgebra.generic_norm2(a) setup=(a = rand(100, 100));
  min 25.708 ΞΌs, mean 26.305 ΞΌs (0 allocations)  # native
  min 71.167 ΞΌs, mean 71.660 ΞΌs (0 allocations)  # rosetta
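For reference, the BLAS routine behind the fast path is nrm2. A minimal sketch showing that both paths compute the same quantity (note generic_norm2 is internal, non-exported API):

julia> using LinearAlgebra

julia> a = rand(100, 100);

julia> BLAS.nrm2(vec(a)) ≈ LinearAlgebra.generic_norm2(a)  # BLAS path vs generic fallback
true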

If I use norm2 instead, the time for 1e4 iterations of the loop drops significantly:

julia> @benchmark begin
           for i in 1:10000
               func1()
           end
       end
BenchmarkTools.Trial: 1 sample with 1 evaluation.
Single result which took 5.356 s (0.95% GC) to evaluate,
with a memory estimate of 7.45 GiB, over 20000 allocations.

Just for info, in case someone knows the answer: why is 'norm2' different from 'norm'?

norm2 doesn't accept the axis option: ERROR: MethodError: no method matching generic_norm2(::Matrix{Float64}, ::Int64)

Maybe I know the answer: norm2 is meant to calculate the norm for axis=2.

Thanks

What axis argument? The second argument of norm is the order of the norm (2 by default, BTW).
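A quick illustration of the order argument (a sketch):

julia> using LinearAlgebra

julia> a = rand(100, 1000);

julia> norm(a) == norm(a, 2)           # order 2 is the default
true

julia> norm(a, 1) ≈ sum(abs, a)        # order 1: sum of the absolute values of all entries
true

julia> norm(a, Inf) ≈ maximum(abs, a)  # order Inf: largest absolute entry
true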

I thought it was the axis argument (Python's np.linalg.norm also lets you specify the axis over which you want the norm).

But in my original code changing it to order 2:

def func1():
    a = np.random.rand(100, 1000)
    return np.linalg.norm(a, 2)

also increases the runtime a lot to:

python test.py 555.32s user 102.05s system 725% cpu 1:30.58 total

This is the problem with those implicit default arguments when converting code between languages.

Can anyone please confirm that np.linalg.norm(a, 2) does the same as norm2(a), or norm(a, 2) for that matter?
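For what it's worth, one can check against the explicit definitions in Julia (a sketch; opnorm is Julia's operator norm):

julia> using LinearAlgebra

julia> a = rand(100, 1000);

julia> norm(a, 2) ≈ sqrt(sum(abs2, a))  # norm treats the matrix as a vector of entries (Frobenius-style)
true

julia> opnorm(a, 2) ≈ svdvals(a)[1]     # operator 2-norm: the largest singular value
true

If I read the NumPy docs correctly, np.linalg.norm(a, 2) on a 2-D array is the largest singular value, i.e. Julia's opnorm(a, 2), not norm(a, 2); the NumPy equivalent of Julia's norm(a, 2) is np.linalg.norm(a) with no ord argument (Frobenius norm). That would also explain the long Python runtime above: ord=2 requires computing singular values on every iteration.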