Performance deterioration as array size increases

An MWE, in the REPL, Julia 1.9.4:

julia> using BenchmarkTools

julia> function myset!(y, x, n, ind)
           @views for i = 1 : n
               x[ind[i], :, :] .= y[:, :, i]
           end
       end
myset! (generic function with 1 method)

julia> cachex = zeros(ComplexF64, 96, 48, 120); cachey = zeros(ComplexF64, 48, 120, 49);

julia> cachex2 = zeros(ComplexF64, 96, 48, 15); cachey2 = zeros(ComplexF64, 48, 15, 49);

julia> ind = [i+20 for i = 1 : 49];

julia> @btime myset!($cachey, $cachex, 49, $ind)
  1.162 ms (0 allocations: 0 bytes)

julia> @btime myset!($cachey2, $cachex2, 49, $ind)
  74.148 μs (0 allocations: 0 bytes)

The array size increased by 8Ɨ, while the time increased by 15.7Ɨ. What's the reason?


I believe this has a similar cause to my previous problem, in that an increase in computation cost results in a more-than-linear increase in computation time. The current problem, however, is much simpler and has nothing to do with BLAS threads, function inlining, and so on. It also happens even when the CPU usage on the node is very low.

This is just data locality. The bigger the array is, the less of it fits in your L1/L2 cache, so the slower the accesses are.
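As a quick back-of-the-envelope check (a sketch; ComplexF64 elements are 16 bytes each, and sizeof reports the raw data size of an Array), you can compare the working-set sizes against the cache sizes lscpu reports:

julia> sizeof(cachex) / 1e6, sizeof(cachey) / 1e6      # large case: ≈ 8.8 MB and ≈ 4.5 MB

julia> sizeof(cachex2) / 1e6, sizeof(cachey2) / 1e6    # small case: ≈ 1.1 MB and ≈ 0.56 MB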


I then tried a batched version:

julia> function myset!(y, x, n, ind, bsize, nb)
           @views for i = 1 : n
               for ib = 1 : nb
                   bs = bsize * (ib-1) + 1   # first index of block ib in the batched dimension
                   be = bsize * ib           # last index of block ib
                   x[ind[i], :, bs:be] .= y[:, bs:be, i]
               end
           end
       end
myset! (generic function with 2 methods)

julia> @btime myset!($cachey, $cachex, 49, $ind, 15, 8)
  1.158 ms (0 allocations: 0 bytes)

But this shows no improvement. Am I doing it wrong?


Another, even more involved version, using raw pointers and BLAS.blascopy!:

julia> using LinearAlgebra: BLAS   # blascopy! lives in LinearAlgebra.BLAS

julia> px = Vector{Ptr{ComplexF64}}(); py = Vector{Ptr{ComplexF64}}();
julia> for i = 1 : 49, ib = 1 : 8
           push!(px, pointer(cachex, ind[i] + 15 * (ib-1) * 96 * 48))
           push!(py, pointer(cachey, 1 + (i-1) * 48 * 120 + 15 * (ib-1) * 48))
       end

julia> function myset2!(py, px, n, nb)
           for ii = 1 : n * nb
               # copy 48*15 elements from y (stride 1) into x (stride 96 = size(x, 1))
               BLAS.blascopy!(48 * 15, py[ii], 1, px[ii], 96)
           end
       end
myset2! (generic function with 1 method)

julia> @btime myset2!($py, $px, 49, 8)
  1.359 ms (0 allocations: 0 bytes)

You aren't doing anything wrong. It's just fundamentally going to be nonlinearly slower due to the CPU architecture.


My rough understanding is that there are two possibilities. One is that the cache misses happen within each x[ind[i], :, :] .= y[:, :, i]. If that is the case, the batched version for the large array should be faster, since its inner loops handle blocks of the same size as the small array and should incur nearly the same cache misses. The other possibility is that the cache misses happen between different x[ind[i], :, :] .= y[:, :, i] calls; batching would have little effect then. Does the fact that batching makes little difference hint that the second possibility is the real cause?
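One way to separate the two cases (a sketch I have not run here; 21 is just ind[1] for the ind defined above) would be to time a single slice copy in isolation:

julia> @btime view($cachex, 21, :, :) .= view($cachey, :, :, 1);

julia> @btime view($cachex2, 21, :, :) .= view($cachey2, :, :, 1);

If the single-copy times scale roughly with the slice sizes (about 8Ɨ), the extra cost of the full loop comes from what happens between iterations; if the large-array copy is already disproportionately slow on its own, the misses happen inside each copy.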

I've also checked with lscpu: the L1d and L1i caches are 32K, and the L2 cache is 1024K. The size of cachex2 is 16 * 96 * 48 * 15 / 1024 = 1080K, but only roughly half of it is actually used. Is it then the case that the small array can fit entirely into the cache in one go, which explains the fewer cache misses?

You can measure the cache and memory performance with either LIKWID.jl or LinuxPerf.jl
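For example, with LinuxPerf.jl something along these lines should report cache counters for a call (a sketch; the exact event names available depend on your CPU and perf setup):

julia> using LinuxPerf

julia> @pstats "cpu-cycles,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses" myset!(cachey, cachex, 49, ind)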

You're almost surely hitting cache sizes. Just post the output of lscpu. On my machine I get:

$ lscpu 
...
  Model name:             Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
...
Caches (sum of all):      
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     2 MiB (8 instances)
  L3:                     16 MiB (1 instance)
...

Hence 2 MB / 8 = 0.25 MB of L2 per core. The total data that needs to fit into cache is:

julia> (length(cachey)+length(cachex))*16 *1e-6
13.363199999999999

julia> (length(cachey2)+length(cachex2))*16 *1e-6
1.6703999999999999

So both datasets are mostly benchmarking L3 bandwidth.
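To put a number on that, you can divide the bytes explicitly read and written per call by the timings from the first post (a rough sketch that ignores write-allocate traffic and uses the original machine's timings):

julia> 2 * 49 * 48 * 120 * 16 / 1.162e-3 / 1e9    # large case: ≈ 7.8 GB/s effective

julia> 2 * 49 * 48 * 15 * 16 / 74.148e-6 / 1e9    # small case: ≈ 15.2 GB/s effective

So on that machine the small case sustains roughly twice the effective bandwidth, which is what you would expect if more of its working set stays in the faster cache levels.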

Another consideration is that your ind values are consecutive. Hence, you have a good chance that the relevant cache line for the next i in cachex is still present. This effect goes away if you shuffle! ind. The sizes are:

julia> 48* 120*16 * 1e-3
92.16
julia> 48* 15*16 * 1e-3
11.52

This tells you that most of the writes will hit L2 for the large dataset, and will hit L1 in the small dataset.

Indeed, the scaling is pretty linear on my machine:

julia> @btime myset!($cachey, $cachex, 49, $ind)
  604.290 μs (0 allocations: 0 bytes)

julia> @btime myset!($cachey2, $cachex2, 49, $ind)
  79.131 μs (0 allocations: 0 bytes)

If I shuffle ind, then

julia> @btime myset!($cachey, $cachex, 49, $ind)
  698.001 μs (0 allocations: 0 bytes)

julia> @btime myset!($cachey2, $cachex2, 49, $ind)
  80.715 μs (0 allocations: 0 bytes)
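For reference, the shuffle is just the in-place shuffle from the Random standard library, which makes the rows written in cachex non-consecutive:

julia> using Random

julia> shuffle!(ind);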

The standard model of "bandwidth" scaling by datasize is a step function: First you have L1, then L2, then L3, then main memory, then (optionally) swap. The real graph won't be quite as nice due to various effects, but that's a good starting point (e.g.: You may have the occasional L2 hit inside the L3 regime; also, L3 may be logically shared, but different slices may be colocated with different cores, and obviously less distance means faster access).


The relevant lines of the lscpu output are:

CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Model name:            Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-23,48-71
NUMA node1 CPU(s):     24-47,72-95

The output does not contain the number of instances.
I'm not sure whether I understand your explanation correctly. My current understanding is:

  1. Most of the time is spent copying data from memory into the L3 cache and vice versa.
  2. On your machine, both datasets fit into the L3 cache (but not into L2), so only one round of loading is needed in both cases.
  3. Thus, the time needed for this copy is linear in the data size.
  4. If the L3 cache were smaller and the large dataset could not fit into it, the CPU could only fill the L3 cache, perform part of the copy, then evict that data and load the rest. In that case, the performance would be worse than linear. If the L3 cache is large enough, however, it should still be linear. On my machine the reported L3 size is larger, but since there are many more cores, I'm not sure whether the L3 actually available is larger or smaller.
  5. Each "sliced plane" handled in one loop iteration is 11.5 KB and 92.2 KB, respectively, while L1 and L2 per core are 32 KB and 256 KB. As a result, for the small dataset the input and output slices can sit in L1 simultaneously, so the copy can be done within L1. For the large dataset, L1 can only hold part of a slice, so it is effectively a copy between L1 and L2. Can the batched version fix this problem? In any case, this is a small factor compared to the L3-to-memory transfers, so you still get good linearity on your machine.

Is my understanding correct?

Yes.

L3 is shared, so the relevant figure is 36M on your machine vs 16M on my machine.

However, you have a 2-CPU system and this whole NUMA business going on. I could imagine that this is the issue: maybe the relevant data ends up in the L3 of the wrong CPU / NUMA node, due to some misconfiguration? The specs suggest 36M per chip, so you should have 72M of L3 in total; no idea what's going on here.

Unfortunately, this is where my knowledge taps out. I'm not in HPC, so my experience with such machines is limited. Somebody else needs to chime in on how to debug this :frowning:

I think it should start with perf-style profiling, in order to verify that this is indeed L3-bandwidth bound and to see whether you get NUMA misses. Also, you should vary the size parameter and plot the benchmark results against size.
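A minimal sketch of such a sweep (assuming the myset! from the first post is defined; only the slice count is varied here, and ind stays the same):

julia> for nslice in (15, 30, 60, 120, 240)
           x = zeros(ComplexF64, 96, 48, nslice)
           y = zeros(ComplexF64, 48, nslice, 49)
           t = @belapsed myset!($y, $x, 49, $ind)
           println(nslice, "\t", round(t * 1e6, digits = 1), " μs\t",
                   round((sizeof(x) + sizeof(y)) / 1e6, digits = 1), " MB")
       end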

Thanks, I'll try it next week.


You can control the NUMA behaviour with the Linux command numactl, e.g. start Julia with numactl --membind=0 --cpunodebind 0 julia to use only NUMA node 0. The default is normally to interleave memory between the NUMA nodes. You can also ensure that Julia's threads don't move between cores by setting the environment variable JULIA_EXCLUSIVE=1 before you start Julia.
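To verify the pinning from within Julia, you can read the standard Linux /proc interface (a sketch, Linux-only; these fields show which CPUs and NUMA memory nodes the process is currently allowed to use):

julia> for line in eachline("/proc/self/status")
           (startswith(line, "Cpus_allowed_list") || startswith(line, "Mems_allowed_list")) && println(line)
       end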