Performance deterioration as array size increases

An MWE, in the REPL, Julia 1.9.4:

julia> using BenchmarkTools

julia> function myset!(y, x, n, ind)
           @views for i = 1 : n
               x[ind[i], :, :] .= y[:, :, i]
           end
       end
myset! (generic function with 1 method)

julia> cachex = zeros(ComplexF64, 96, 48, 120); cachey = zeros(ComplexF64, 48, 120, 49);

julia> cachex2 = zeros(ComplexF64, 96, 48, 15); cachey2 = zeros(ComplexF64, 48, 15, 49);

julia> ind = [i+20 for i = 1 : 49];

julia> @btime myset!($cachey, $cachex, 49, $ind)
  1.162 ms (0 allocations: 0 bytes)

julia> @btime myset!($cachey2, $cachex2, 49, $ind)
  74.148 μs (0 allocations: 0 bytes)

The array size increased by 8Ɨ, while the time increased by 15.7Ɨ. What's the reason?


I believe this has a similar cause to my previous problem, in that an increase in computation cost results in a more-than-linear increase in computation time. The current problem, however, is much simpler and has nothing to do with BLAS threads, function inlining, and so on. It also happens even when the CPU usage on the node is very low.

This is just data locality. The bigger the array is, the less of it fits in your L1/L2 cache, so the slower the accesses are.
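As a quick back-of-the-envelope check (a sketch; ComplexF64 elements are 16 bytes each, and sizeof reports the raw data size of an Array), you can compare the working-set sizes against the cache sizes lscpu reports:

julia> sizeof(cachex) / 1e6, sizeof(cachey) / 1e6      # large case: ≈ 8.8 MB and ≈ 4.5 MB

julia> sizeof(cachex2) / 1e6, sizeof(cachey2) / 1e6    # small case: ≈ 1.1 MB and ≈ 0.56 MB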


I then tried a batched version:

julia> function myset!(y, x, n, ind, bsize, nb)
           @views for i = 1 : n
               for ib = 1 : nb
                   bs = bsize * (ib-1) + 1   # first index of block ib in the batched dimension
                   be = bsize * ib           # last index of block ib
                   x[ind[i], :, bs:be] .= y[:, bs:be, i]
               end
           end
       end
myset! (generic function with 2 methods)

julia> @btime myset!($cachey, $cachex, 49, $ind, 15, 8)
  1.158 ms (0 allocations: 0 bytes)

But this shows no improvement. Am I doing it wrong?


Another, even more involved version, using raw pointers and BLAS.blascopy!:

julia> using LinearAlgebra: BLAS   # blascopy! lives in LinearAlgebra.BLAS

julia> px = Vector{Ptr{ComplexF64}}(); py = Vector{Ptr{ComplexF64}}();
julia> for i = 1 : 49, ib = 1 : 8
           push!(px, pointer(cachex, ind[i] + 15 * (ib-1) * 96 * 48))
           push!(py, pointer(cachey, 1 + (i-1) * 48 * 120 + 15 * (ib-1) * 48))
       end

julia> function myset2!(py, px, n, nb)
           for ii = 1 : n * nb
               # copy 48*15 elements from y (stride 1) into x (stride 96 = size(x, 1))
               BLAS.blascopy!(48 * 15, py[ii], 1, px[ii], 96)
           end
       end
myset2! (generic function with 1 method)

julia> @btime myset2!($py, $px, 49, 8)
  1.359 ms (0 allocations: 0 bytes)

You aren't doing anything wrong. It's just fundamentally going to be nonlinearly slower due to the CPU architecture.


My rough understanding is that there are two possibilities. One is that the cache misses happen within each x[ind[i], :, :] .= y[:, :, i]. If that is the case, the batched version for the large array should be faster, since its inner loops handle blocks of the same size as the small array and should incur nearly the same cache misses. The other possibility is that the cache misses happen between different x[ind[i], :, :] .= y[:, :, i] calls; batching would have little effect then. Does the fact that batching makes little difference hint that the second possibility is the real cause?
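One way to separate the two cases (a sketch I have not run here; 21 is just ind[1] for the ind defined above) would be to time a single slice copy in isolation:

julia> @btime view($cachex, 21, :, :) .= view($cachey, :, :, 1);

julia> @btime view($cachex2, 21, :, :) .= view($cachey2, :, :, 1);

If the single-copy times scale roughly with the slice sizes (about 8Ɨ), the extra cost of the full loop comes from what happens between iterations; if the large-array copy is already disproportionately slow on its own, the misses happen inside each copy.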

I've also checked with lscpu: the L1d and L1i caches are 32K, and the L2 cache is 1024K. The size of cachex2 is 16 * 96 * 48 * 15 / 1024 = 1080K, but only roughly half of it is actually used. Is it then the case that the small array can fit entirely into the cache in one go, which explains the fewer cache misses?

You can measure the cache and memory performance with either LIKWID.jl or LinuxPerf.jl
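For example, with LinuxPerf.jl something along these lines should report cache counters for a call (a sketch; the exact event names available depend on your CPU and perf setup):

julia> using LinuxPerf

julia> @pstats "cpu-cycles,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses" myset!(cachey, cachex, 49, ind)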

You're almost surely hitting cache sizes. Just post the output of lscpu. On my machine I get:

$ lscpu 
...
  Model name:             Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
...
Caches (sum of all):      
  L1d:                    256 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     2 MiB (8 instances)
  L3:                     16 MiB (1 instance)
...

Hence 2 MB / 8 = 0.25 MB of L2 per core. The total data that needs to fit into cache is:

julia> (length(cachey)+length(cachex))*16 *1e-6
13.363199999999999

julia> (length(cachey2)+length(cachex2))*16 *1e-6
1.6703999999999999

So both datasets are mostly benchmarking L3 bandwidth.
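To put a number on that, you can divide the bytes explicitly read and written per call by the timings from the first post (a rough sketch that ignores write-allocate traffic and uses the original machine's timings):

julia> 2 * 49 * 48 * 120 * 16 / 1.162e-3 / 1e9    # large case: ≈ 7.8 GB/s effective

julia> 2 * 49 * 48 * 15 * 16 / 74.148e-6 / 1e9    # small case: ≈ 15.2 GB/s effective

So on that machine the small case sustains roughly twice the effective bandwidth, which is what you would expect if more of its working set stays in the faster cache levels.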

Another consideration is that your ind values are consecutive. Hence, you have a good chance that the relevant cache line for the next i in cachex is still present. This effect goes away if you shuffle! ind. The sizes are:

julia> 48* 120*16 * 1e-3
92.16
julia> 48* 15*16 * 1e-3
11.52

This tells you that most of the writes will hit L2 for the large dataset, and will hit L1 in the small dataset.

Indeed, the scaling is pretty linear on my machine:

julia> @btime myset!($cachey, $cachex, 49, $ind)
  604.290 μs (0 allocations: 0 bytes)

julia> @btime myset!($cachey2, $cachex2, 49, $ind)
  79.131 μs (0 allocations: 0 bytes)

If I shuffle ind, then

julia> @btime myset!($cachey, $cachex, 49, $ind)
  698.001 μs (0 allocations: 0 bytes)

julia> @btime myset!($cachey2, $cachex2, 49, $ind)
  80.715 μs (0 allocations: 0 bytes)
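For reference, the shuffle is just the in-place shuffle from the Random standard library, which makes the rows written in cachex non-consecutive:

julia> using Random

julia> shuffle!(ind);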

The standard model of "bandwidth" scaling by datasize is a step function: First you have L1, then L2, then L3, then main memory, then (optionally) swap. The real graph won't be quite as nice due to various effects, but that's a good starting point (e.g.: You may have the occasional L2 hit inside the L3 regime; also, L3 may be logically shared, but different slices may be colocated with different cores, and obviously less distance means faster access).


The relevant lines of the lscpu output are:

CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Model name:            Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-23,48-71
NUMA node1 CPU(s):     24-47,72-95

The output does not contain the number of instances.
I'm not sure whether I understand your explanation correctly. My current understanding is:

  1. Most of the time is spent copying data from memory into the L3 cache and vice versa.
  2. On your machine, both datasets fit into the L3 cache (but not into L2), so only one round of loading is needed in both cases.
  3. Thus, the time needed for this copy is linear in the data size.
  4. If the L3 cache were smaller and the large dataset could not fit into it, the CPU could only fill the L3 cache, perform part of the copy, then evict that data and load the rest. In that case, the performance would be worse than linear. If the L3 cache is large enough, however, it should still be linear. On my machine the reported L3 size is larger, but since there are many more cores, I'm not sure whether the L3 actually available is larger or smaller.
  5. Each "sliced plane" handled in one loop iteration is 11.5 KB and 92.2 KB, respectively, while L1 and L2 per core are 32 KB and 256 KB. As a result, for the small dataset the input and output slices can sit in L1 simultaneously, so the copy can be done within L1. For the large dataset, L1 can only hold part of a slice, so it is effectively a copy between L1 and L2. Can the batched version fix this problem? In any case, this is a small factor compared to the L3-to-memory transfers, so you still get good linearity on your machine.

Is my understanding correct?

Yes.

L3 is shared, so the relevant figure is 36M on your machine vs 16M on my machine.

However, you have a 2-CPU system and this whole NUMA business going on. I could imagine that this is the issue: maybe the relevant data ends up in the L3 of the wrong CPU / NUMA node, due to some misconfiguration? The specs suggest 36M per chip, so you should have 72M of L3 in total; no idea what's going on here.

Unfortunately, this is where my knowledge taps out. I'm not in HPC, so my experience with such machines is limited. Somebody else needs to chime in on how to debug this :frowning:

I think it should start with perf-style profiling, in order to verify that this is indeed L3-bandwidth bound and to see whether you get NUMA misses. Also, you should vary the size parameter and plot the benchmark results against size.
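A minimal sketch of such a sweep (assuming the myset! from the first post is defined; only the slice count is varied here, and ind stays the same):

julia> for nslice in (15, 30, 60, 120, 240)
           x = zeros(ComplexF64, 96, 48, nslice)
           y = zeros(ComplexF64, 48, nslice, 49)
           t = @belapsed myset!($y, $x, 49, $ind)
           println(nslice, "\t", round(t * 1e6, digits = 1), " μs\t",
                   round((sizeof(x) + sizeof(y)) / 1e6, digits = 1), " MB")
       end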

Thanks, I'll try it next week.


You can control the NUMA behaviour with the Linux command numactl, e.g. start Julia with numactl --membind=0 --cpunodebind 0 julia to use only NUMA node 0. The default is normally to interleave memory between the NUMA nodes. You can also ensure that Julia's threads don't move between cores by setting the environment variable JULIA_EXCLUSIVE=1 before you start Julia.
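To verify the pinning from within Julia, you can read the standard Linux /proc interface (a sketch, Linux-only; these fields show which CPUs and NUMA memory nodes the process is currently allowed to use):

julia> for line in eachline("/proc/self/status")
           (startswith(line, "Cpus_allowed_list") || startswith(line, "Mems_allowed_list")) && println(line)
       end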