Arm64 Julia/Docker on AWS EC2: CPU target recommendations

I’m just looking for some additional info here from people who have more experience.

Can anyone recommend JULIA_CPU_TARGET settings for running Julia inside a Linux docker image on arm64 Amazon EC2?

And can anyone shed some light on what’s actually happening in the failures described below?

We’ve migrated a Julia-based microservice from an x64 docker image to aarch64/arm64/armv8, as we are moving to an all-arm64 cluster on AWS EC2.

On x64 we used the Debian-based official Julia image from Docker Hub.

On aarch64 we started getting precompilation errors, e.g. “illegal instruction”, when instantiating the service project during the docker build.

We tinkered with setting JULIA_CPU_TARGET, but in the end the problems seemed to go away without setting it at all, simply by switching to an amazonlinux2 docker image instead. In that image I install Juliaup and get official Julia binaries that way. At least, that worked temporarily.
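Roughly, the Amazon Linux 2 image is set up something like the sketch below (simplified; the /root/.juliaup install location is Juliaup’s default as far as I know, and the channel pinning is just how we happen to do it):

FROM amazonlinux:2

# Install Juliaup non-interactively, then pin the Julia channel we build against.
RUN yum install -y curl tar gzip && \
    curl -fsSL https://install.julialang.org | sh -s -- --yes && \
    /root/.juliaup/bin/juliaup add 1.10 && \
    /root/.juliaup/bin/juliaup default 1.10

# Make julia visible to later build steps.
ENV PATH="/root/.juliaup/bin:${PATH}"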

Then a couple of days ago I added some debug info to the docker build, so that it gets logged in case of future breakage, and this seems to have broken the build again 🙂

In particular, versioninfo(verbose = true) now runs into “illegal instruction”. It dies at the step where it tries to report the load average: I get the message Load Avg: Invalid instruction at followed by some memory addresses.

versioninfo(verbose = false) works, but then I get precompilation failures in some packages.

Anyway, it seems like sometimes the docker build goes through, with or without precompilation failures, and sometimes it doesn’t. When it goes through I’m able to run the microservice. At service startup it does some further precompilation, and so far those have succeeded.

Previously I did try instantiating the project with --pkgimages=no but then got fatal precompilation failures during service startup.
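For reference, that attempt looked roughly like this (the /app project path is just illustrative):

RUN julia --project=/app --pkgimages=no -e 'using Pkg; Pkg.instantiate()'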

My main concern now is that even if we get to a state where everything looks good, maybe it isn’t. Maybe we’ll run into illegal instructions while running the service, or build failures a few days later after merging some seemingly unrelated code change.

Here’s some info reported by Julia from within the docker image as it is being built. The first 5 lines are Sys constants and the rest is from versioninfo().

MACHINE: aarch64-linux-gnu
ARCH: aarch64
KERNEL: Linux
CPU_NAME: neoverse-v1
CPU_THREADS: 8
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
 Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 8 × unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, neoverse-v1)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

So it looks like we certainly need to set JULIA_CPU_TARGET, and in that case, we may as well switch back to the official debian-based Julia docker image.

From Julia itself (Base Julia ISA targets) I get this possibility for aarch64:

JULIA_CPU_TARGET="generic;cortex-a57;thunderx2t99;carmel,clone_all;apple-m1,base(3);neoverse-512tvb,base(3)"

Should I modify this? Use as is? Use something else?

Any knowledge you can share would be appreciated.

1 Like

This is not a direct solution, but I had some similar issues with Julia on ARM EC2, and managed to solve them by installing Julia following the instructions in this Dockerfile (everything from lines 9-83): julia/1.10/bookworm/Dockerfile at master · docker-library/julia · GitHub. I also found more success using Fargate than plain EC2; I have no idea why, but the same docker image and settings would get segfaults on EC2.

After switching to Fargate and installing Julia via the instructions above, I did not end up needing to set JULIA_CPU_TARGET or any other environment variables.

1 Like

I also found more success using Fargate than plain EC2

Thanks for that suggestion. I will check with our devops guys since we do use Fargate for some things.

The Dockerfile you’ve linked is essentially what we were using before (Debian bookworm, Julia 1.10), via FROM julia:1.10 in our own Dockerfile.

I notice that our current Amazon Linux 2 build reports a 4.14 Linux kernel. (The kernel comes from the host the container runs on, not from FROM amazonlinux:2.) I thought AWS used something more recent than that - I’ve seen they’re using kernel 5.10 now. Something else to check with devops.

You either identify a common minimum ISA for all the systems you’re going to run the container on (and ARM ISAs vary wildly in general), or you use the same CPU target as the one you shared, where generic is the safe common fallback. You may drop targets like apple-m1 and neoverse-512tvb if you happen to know you aren’t going to run the container on those systems.
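For example, if you happen to know the container will only ever land on Graviton 2 and 3 instances, something much shorter along these lines might already cover you (just an illustration of the idea, not a tested recommendation):

JULIA_CPU_TARGET="generic;neoverse-n1,clone_all;neoverse-v1,clone_all"

You’d add a neoverse-v2 entry if Graviton 4 is in the picture.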

1 Like

Thanks for the replies.

After some experimentation here’s what we’re going to try for now.

JULIA_CPU_TARGET="generic;cortex-a57;thunderx2t99;carmel,clone_all;apple-m1,base(3);neoverse-n1,clone_all;neoverse-512tvb,clone_all;neoverse-v1,base(6);neoverse-v2,base(6)"

Up until apple-m1,base(3) I’ve stuck with what Julia itself uses. After that the Arm64 processors listed are as follows:

  • neoverse-n1: used in Graviton 2 EC2 instances
  • neoverse-v1: used in Graviton 3 and 3E EC2 instances
  • neoverse-v2: used in Graviton 4 EC2 instances

(from EC2 Instance Types)

neoverse-v1 and neoverse-v2 both use neoverse-512tvb as a base (base(6) refers to the zero-based position of neoverse-512tvb in the list above) because

neoverse-512tvb is special in that it does not refer to a specific core, but instead refers to all Neoverse cores that (a) implement SVE and (b) have a total vector bandwidth of 512 bits a cycle. Unless overridden by -march, -mcpu=neoverse-512tvb generates code that can run on a Neoverse V1 core, since Neoverse V1 is the first Neoverse core with these properties.

(from the GCC manual)

I haven’t given neoverse-n1 the neoverse-512tvb base because it predates neoverse-v1 (and doesn’t implement SVE), so I set it to clone_all instead.

There are EC2 Arm64 instances based on Apple M1 and M2, but I figure the apple-m1 entry Julia itself uses should be sufficient for those. We’ll find out if we ever use such an instance.

Clearly this setting for JULIA_CPU_TARGET could be improved, but so far so good: precompilation passes without failures and the service runs inside the Docker container on one of our EC2 Arm64 instances.
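In case it helps anyone, the relevant part of the Dockerfile is wired roughly like this (paths and the exact instantiate step are illustrative); the important bit is that JULIA_CPU_TARGET is set before anything precompiles, so the pkgimages get built for all the listed targets:

ENV JULIA_CPU_TARGET="generic;cortex-a57;thunderx2t99;carmel,clone_all;apple-m1,base(3);neoverse-n1,clone_all;neoverse-512tvb,clone_all;neoverse-v1,base(6);neoverse-v2,base(6)"

COPY Project.toml Manifest.toml /app/
RUN julia --project=/app -e 'using Pkg; Pkg.instantiate(); Pkg.precompile()'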

1 Like

There is one remaining problem. It doesn’t affect building the Docker image or running the service inside the container so far. Maybe I’m needlessly concerned.

While testing the Docker image build I included this line in the Dockerfile:

RUN julia -e "using InteractiveUtils; versioninfo(verbose = true)"

to get system info as Julia saw it inside the container. As I noted in my original post this would give Load Avg: Invalid instruction. (With verbose = false there’s no problem.)

In the test runs, Julia detects the core as Sys.CPU_NAME == "neoverse-v1". I also tried running the command with and without the command-line option --cpu-target=neoverse-v1, but in both cases I got the same result. (Stack trace included below.)

I wonder whether this failure occurs

  1. because of a bug in Julia
  2. because running inside a Docker image makes it difficult for Julia to detect some processor features
  3. because the Linux kernel is 4.14 (this is what we get on EC2 at present)

I mention #3 only because neoverse-v1 uses SVE instructions and kernel support for SVE was not introduced until 4.15. But of course Amazon may have backported that support to their version of 4.14 (4.14.343-261.564.amzn2.aarch64), so that’s probably not the cause of the failure. However, there may be other subtle problems with an older Linux; 4.14 LTS reached end of life in January 2024.
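For what it’s worth, one quick way to see whether the kernel is exposing SVE inside the container is to look at the Features line in /proc/cpuinfo (assuming the kernel lists it there, which I believe it does on arm64), for example with a build step like:

RUN grep -m1 Features /proc/cpuinfo   # "sve" should appear in the flag list if SVE is usable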

Some additional circumstantial evidence that #2 could be an issue: when I build the official Julia Docker image on my M1 Mac, Julia running inside the container reports Sys.CPU_NAME == "generic", whereas running Julia directly on my machine I correctly get Sys.CPU_NAME == "apple-m1".
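The quick check on the Mac, for anyone who wants to reproduce it (using the official image; I believe its entrypoint passes the command straight through to julia):

docker run --rm julia:1.10 julia -e 'println(Sys.CPU_NAME)'   # prints "generic" for me
julia -e 'println(Sys.CPU_NAME)'                              # prints "apple-m1" when run natively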

Anyway, I’m not sure where to report this issue. Or maybe I should first see whether we can get an instance running a later kernel version, since those are available, and test there.

[11/11] RUN julia --cpu-target=neoverse-v1  -e "using InteractiveUtils; versioninfo(verbose = true)"
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  uname: Linux 4.14.343-261.564.amzn2.aarch64 #1 SMP Tue May 7 02:23:29 UTC 2024 aarch64 unknown
  CPU: unknown: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz        451 s          1 s         23 s       4097 s          0 s
       #2     0 MHz        365 s          3 s         25 s       4385 s          0 s
       #3     0 MHz        315 s          1 s         24 s       4458 s          0 s
       #4     0 MHz        276 s          1 s         19 s       4496 s          0 s
       #5     0 MHz        838 s          3 s         56 s       3660 s          0 s
       #6     0 MHz        225 s          2 s         24 s       4498 s          0 s
       #7     0 MHz        370 s          2 s         28 s       4388 s          0 s
       #8     0 MHz        333 s          2 s         22 s       4437 s          0 s
  Memory: 15.439155578613281 GB (14914.6328125 MB free)
  Uptime: 481.96 sec
  Load Avg: Invalid instruction at 0xffff91757148: 0x04a0e3ea

[7] signal (4.1): Illegal instruction
in expression starting at none:1
iterate at ./range.jl:901 [inlined]
vcat at ./range.jl:1375 [inlined]
_print_matrix at ./arrayshow.jl:205
print_matrix at ./arrayshow.jl:171
print_matrix at ./arrayshow.jl:171 [inlined]
#versioninfo#65 at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/InteractiveUtils/src/InteractiveUtils.jl:158
versioninfo at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/InteractiveUtils/src/InteractiveUtils.jl:97
unknown function (ip: 0xffff916ec04f)
_jl_invoke at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/gf.c:3077
versioninfo at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/InteractiveUtils/src/InteractiveUtils.jl:97
unknown function (ip: 0xffff921470eb)
_jl_invoke at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/interpreter.c:635
jl_interpret_toplevel_thunk at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/toplevel.c:877
jl_toplevel_eval_flex at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
exec_options at ./client.jl:291
_start at ./client.jl:552
jfptr__start_82654 at /usr/local/julia/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-armageddon-0/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at julia (unknown line)
unknown function (ip: 0xffff9217777f)
__libc_start_main at /lib/aarch64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x400a1b)
unknown function (ip: 0x400a1b)
Allocations: 627086 (Pool: 626483; Big: 603); GC: 1
Illegal instruction (core dumped)
ERROR: process "/bin/sh -c julia --cpu-target=neoverse-v1  -e \"using InteractiveUtils; versioninfo(verbose = true)\"" did not complete successfully: exit code: 132

OK I think I’ve got it.

Apparently we use 6g EC2 instances for development, and a mix of 6g and 7g for production.

The 6g instances e.g. r6g are based on Graviton 2 (neoverse-n1), and the 7g instances on Graviton 3 (neoverse-v1).

But in the docker container running on the r6g instance, Julia reports Sys.CPU_NAME as neoverse-v1.

And running lscpu at a shell prompt in the container, I get

Architecture:                       aarch64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
CPU(s):                             8
On-line CPU(s) list:                0-7
Vendor ID:                          ARM
Model name:                         Neoverse-V1
Model:                              1

So the docker container itself says it is running on neoverse-v1, when actually the host system is neoverse-n1.
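As a sanity check that’s independent of what the container decodes from the CPU registers, the EC2 instance metadata service will tell you the actual instance type (this assumes IMDSv1 is reachable from the container; with IMDSv2 enforced you’d need to fetch a session token first):

curl -s http://169.254.169.254/latest/meta-data/instance-type   # e.g. r6g.xlarge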

Now when I run with --cpu-target=neoverse-n1, versioninfo(verbose = true) completes:

[11/11] RUN julia --cpu-target=neoverse-n1 -e "using InteractiveUtils; versioninfo(verbose = true)"
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  uname: Linux 4.14.343-261.564.amzn2.aarch64 #1 SMP Tue May 7 02:23:29 UTC 2024 aarch64 unknown
  CPU: unknown: 
              speed         user         nice          sys         idle          irq
       #1     0 MHz        433 s          1 s         28 s       1722 s          0 s
       #2     0 MHz        706 s          2 s         35 s       1749 s          0 s
       #3     0 MHz        256 s          2 s         23 s       2256 s          0 s
       #4     0 MHz        414 s          0 s         25 s       2108 s          0 s
       #5     0 MHz        240 s          4 s         42 s       1954 s          0 s
       #6     0 MHz        517 s          1 s         32 s       1940 s          0 s
       #7     0 MHz        204 s          1 s         28 s       2320 s          0 s
       #8     0 MHz        372 s          0 s         27 s       2130 s          0 s
  Memory: 15.439155578613281 GB (14915.63671875 MB free)
  Uptime: 257.28 sec
  Load Avg:  2.45  1.55  0.67
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, neoverse-v1)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
...

So it looks like #3 from my previous message is not an issue (since the core is actually not neoverse-v1 but the earlier neoverse-n1), and the likely culprit is #2: docker making it difficult to get the right CPU info.

I’ll report this as a docker issue.

In the meantime, it is probably a good idea to check whether your docker container is running on a Graviton 2 instance and, if so, to pass the Julia command-line option --cpu-target=neoverse-n1.
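A minimal entrypoint sketch of that idea, assuming the metadata service is reachable from the container and that the service script lives at /app/start_service.jl (both are assumptions; this is untested):

#!/bin/sh
# Ask the EC2 metadata service for the instance type and pass a conservative
# --cpu-target when we land on a Graviton 2 (6g family) instance.
INSTANCE_TYPE=$(curl -s --max-time 2 http://169.254.169.254/latest/meta-data/instance-type)
case "$INSTANCE_TYPE" in
  *6g*) exec julia --cpu-target=neoverse-n1 --project=/app /app/start_service.jl "$@" ;;
  *)    exec julia --project=/app /app/start_service.jl "$@" ;;
esac

Of course, if your deployment already knows which instance family it is targeting, simply passing the flag from the deployment config is simpler.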

4 Likes

One more update.

We build our docker image for EC2 with AWS CodeBuild, and apparently that build itself runs in a docker container, on an instance we have no control over. So in fact we’re unsure whether the docker build is ultimately running on Graviton 2, Graviton 3, or something else. In the AWS CodeBuild container lscpu reports a Neoverse-V1; inside the containers we run during the docker build within AWS CodeBuild we also see Neoverse-V1; and within those, Julia sees Neoverse-V1 but crashes with “illegal instruction” in versioninfo(verbose=true) at the point of computing the load average, unless passed the command-line switch --cpu-target=neoverse-n1.

As suggested elsewhere, an old library or binary could be misinterpreting/misrepresenting the CPU, perhaps on the host that AWS CodeBuild runs on.

It could still also be that docker itself is interfering with determining the correct CPU. As I noted earlier, on my machine Julia reports Sys.CPU_NAME == "generic" when run from a container within the official Julia docker image, but correctly reports “apple-m1” when run directly on the machine, which is an M1 Pro Mac.

Anyway, we’re going with the JULIA_CPU_TARGET environment variable listed earlier and running our service with the --cpu-target=neoverse-n1 switch for now, when we know it is running on a 6g instance. So far things have been ok with that setup.

1 Like