[ANN] NumaAllocators.jl: Non-Uniform Memory Access extension of ArrayAllocators.jl for HPC


NumaAllocators.jl extends ArrayAllocators.jl with a mechanism for allocating arrays on specific Non-Uniform Memory Access (NUMA) nodes. NumaAllocators.jl is a subdirectory package within the ArrayAllocators.jl repository.

Non-Uniform Memory Access (NUMA)

NUMA (Non-Uniform Memory Access) is a processor architecture in which memory access time depends on where the memory is located relative to the processor. It typically applies to high-end workstations and high-performance computing clusters whose motherboards have multiple processor sockets.

NUMA-aware applications can allocate memory on specific NUMA nodes, and processes and threads can be affinitized to specific NUMA nodes. This package only deals with the memory allocation portion, though it provides utilities to determine how many NUMA nodes exist and to which NUMA node the current processor is attached.
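Those utilities can be combined with the allocator itself. The sketch below is a hypothetical usage pattern, not code from the package: it queries the topology and allocates on the node the current processor is attached to, so that subsequent accesses stay node-local.

```julia
using NumaAllocators

# Query the topology (utility functions provided by NumaAllocators)
nodes = NumaAllocators.highest_numa_node() + 1  # number of NUMA nodes
node  = NumaAllocators.current_numa_node()      # node of the current processor

# Allocate on that node so accesses from this processor stay local
A = Array{Float64}(numa(node), 1024, 1024)
fill!(A, 0.0)
```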

Example Usage

julia> using NumaAllocators

julia> NumaAllocators.highest_numa_node()

julia> NumaAllocators.current_numa_node()

julia> @time A_node_0 = Array{Int}(numa(0), 1024, 1024);
  0.000030 seconds (9 allocations: 240 bytes)

julia> @time B_node_1 = Array{Int}(numa(1), 1024, 1024);
  0.000023 seconds (9 allocations: 240 bytes)

julia> @time C_node_0 = Array{Int}(numa(0), 1024, 1024);
  0.000018 seconds (9 allocations: 240 bytes)

julia> @time D_node_1 = Array{Int}(numa(1), 1024, 1024);
  0.000023 seconds (9 allocations: 240 bytes)

julia> @time fill!(A_node_0, 0);
  0.002841 seconds

julia> @time fill!(B_node_1, 1);
  0.002420 seconds

julia> @time fill!(C_node_0, 2);
  0.003195 seconds

julia> @time fill!(D_node_1, 3);
  0.002378 seconds

Non-Intuitive Results on Intel Skylake Processors

julia> @time copyto!(C_node_0, A_node_0);
  0.002604 seconds

julia> @time copyto!(C_node_0, B_node_1);
  0.002112 seconds

julia> @time copyto!(D_node_1, A_node_0);
  0.004126 seconds

julia> @time copyto!(D_node_1, B_node_1);
  0.003600 seconds

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, cascadelake)

Naively, I would have thought that copying between memory allocations on the local NUMA node would be faster than any copy involving a remote NUMA node. Initializing memory on the local NUMA node (1) with fill! (which uses C's memset) is indeed slightly faster than initializing memory on the remote NUMA node (0).

When copying memory, however, there may be advantages to copying between NUMA nodes. Specifically, the fastest case is using copyto! (which uses C's memmove) to copy memory from the local NUMA node (1) to the remote NUMA node (0). This takes less than 60% of the time required to copy between memory allocations on the local NUMA node (1).

In discussions with @carstenbauer on Slack, it seems that this asymmetry between cross-node and local-node copying speeds may also exist on some recent AMD processors, though perhaps to a lesser degree. Regardless, I highly encourage you to profile on your own hardware to figure out how to apply this package.
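Since the effect is hardware-specific, a quick measurement on your own machine is the safest guide. Here is a minimal benchmarking sketch, assuming at least two NUMA nodes and BenchmarkTools.jl (this reproduces the kind of comparison shown above, not a definitive methodology):

```julia
using NumaAllocators, BenchmarkTools

# Compare copy times for a local vs. a cross-node destination.
# Assumes the benchmarking thread runs on node 0 and node 1 exists.
src        = Array{Int}(numa(0), 1024, 1024)
dst_local  = Array{Int}(numa(0), 1024, 1024)
dst_remote = Array{Int}(numa(1), 1024, 1024)

@btime copyto!($dst_local,  $src)   # node 0 -> node 0
@btime copyto!($dst_remote, $src)   # node 0 -> node 1
```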

ServerFault Report

A post on ServerFault reports a similar phenomenon on Skylake processors and provides benchmarking code. It also suggests that removing one of the processors from a two-socket system may result in higher memory performance than when both sockets are occupied. I have personally observed this when removing and replacing a processor on a two-processor NUMA system.

Discussions with the OEM and then Intel suggest that this behavior is expected.

Implementation Notes

On Linux, NUMA allocation is implemented using NUMA_jll.jl which packages numactl.

On Windows, this package uses VirtualAllocExNuma.
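For illustration, the Linux path boils down to libnuma's numa_alloc_onnode. The following is a hedged sketch of what such a wrapper might look like via ccall; it is not the package's actual internal code, and it assumes a system-installed libnuma rather than NUMA_jll:

```julia
# Hypothetical sketch of a libnuma-backed allocation via ccall.
# NumaAllocators itself uses NUMA_jll; "libnuma" here assumes a system install.
function alloc_on_node(nbytes::Integer, node::Integer)
    ptr = ccall((:numa_alloc_onnode, "libnuma"), Ptr{Cvoid},
                (Csize_t, Cint), nbytes, node)
    ptr == C_NULL && error("numa_alloc_onnode failed")
    return ptr
end

# libnuma requires the size at free time as well
free_on_node(ptr::Ptr{Cvoid}, nbytes::Integer) =
    ccall((:numa_free, "libnuma"), Cvoid, (Ptr{Cvoid}, Csize_t), ptr, nbytes)
```

On Windows, the analogous call is VirtualAllocExNuma, which takes the preferred NUMA node as its final parameter.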


In summary, I have extended ArrayAllocators.jl with Non-Uniform Memory Access (NUMA) support through the NumaAllocators.jl package for Linux and Windows. Allocating memory on specific NUMA nodes can yield surprising, hardware-specific optimizations for some applications.

Twitter Discussion

A similar announcement was posted on Twitter:

