Loop performance issue for a specific data size

Hello everyone!

I’m testing a population-based algorithm and have encountered performance issues, particularly when looping with large populations (i.e., using large data matrices).

Everything works fine with matrix sizes of 15000x1500, 20000x2000, and smaller. However, when I increase the matrix size to 25000x2500, I see a significant performance drop around iteration 810 (I’m using 1000 max iterations). From that point on, each iteration takes more than 10 times longer, and the slowdown persists into the second run of the same algorithm.

At first, I suspected suboptimal code on my part, perhaps putting pressure on the GC. I even tried calling GC.gc() before starting the second run, but it had no effect on performance.

Strangely, when running the algorithm with even larger populations of 30000x3000 and 50000x5000, I observed no performance drop throughout the run.

It does not seem to be a lack of memory, as my system has about 50% of its RAM free during the computation.

I’m using Julia 1.9.4, and I’ve observed this behavior both on my desktop with an AMD CPU and on my laptop with an Intel CPU.

Below, I attach a minimal reproducible example and some logs from tests done on my AMD system.

MRE code
using Dates

function alg(f::Function, f_min::Number, f_max::Number, n_var::Int, n_pop::Int, n_iter::Int)
    start = time()
    pop = Array{Float64}( undef, n_pop, n_var )
    # initialize each variable uniformly in [f_min, f_max]
    for i = 1:n_var
        pop[:, i] = f_min .+ rand( n_pop ) .* ( f_max - f_min )
    end
    # evaluate the cost of every individual
    pop_c = Array{Float64}( undef, n_pop )
    for i = 1:n_pop
        pop_c[i] = f( view( pop, i, : ) )
    end
    _time = now()
    i = 1   
    while i <= n_iter
        if i % 10 == 0
            println( "\t # $(now())\t- ITER: $(i)\t- MAIN LOOP...  \t 10x iter: $(now()-_time)" )
            _time = now()
        end
        xb = pop[ argmin( pop_c ), : ]   # current best individual
        xw = pop[ argmax( pop_c ), : ]   # current worst individual
        for p in 1:n_pop
            xn = pop[p, :] + rand( n_var ) .* ( xb - abs.( pop[p, :] ) ) - rand( n_var ) .* ( xw - abs.( pop[p, :] ) )
            xn = max.( min.( xn, f_max ), f_min )
            xn_c = f( xn )
            if xn_c < pop_c[p]
                pop[p, :] = xn                 
                pop_c[p] = xn_c      
            end 
        end
        i += 1
    end
    return minimum( pop_c ), string( time() - start )
end

# test objective: sum of absolute residuals of pairwise coupled equations
function fun(x)
    n = length(x)
    z = zeros(n)
    for i in 1:n÷2
        z[2*i-1] = 10000 * x[2*i-1] * x[2*i] - 1
        z[2*i] = exp( -x[2*i-1] ) + exp( -x[2*i] ) - 1.0001
    end
    return sum( abs.( z ) )
end

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
dim = 2500
pop = dim*10
n_iter = 1000

for run in 1:3
    println("[$(dim), $(pop), $(n_iter)] - Run: $(run)" )
    alg( fun, -10, 10, dim , pop, n_iter )
end
Log - Run with 25000x2500 (with performance issues)
[2500, 25000, 1000] - Run: 1
 # 2023-11-30T17:54:42.845      - ITER: 10      - MAIN LOOP...     10x iter: 31029 milliseconds
 # 2023-11-30T17:55:13.549      - ITER: 20      - MAIN LOOP...     10x iter: 30639 milliseconds
 # 2023-11-30T17:55:42.978      - ITER: 30      - MAIN LOOP...     10x iter: 29000 milliseconds
 # 2023-11-30T17:56:10.945      - ITER: 40      - MAIN LOOP...     10x iter: 28013 milliseconds
 # 2023-11-30T17:56:37.901      - ITER: 50      - MAIN LOOP...     10x iter: 26171 milliseconds
 # 2023-11-30T17:57:03.347      - ITER: 60      - MAIN LOOP...     10x iter: 24827 milliseconds
 # 2023-11-30T17:57:27.636      - ITER: 70      - MAIN LOOP...     10x iter: 24071 milliseconds
 # 2023-11-30T17:57:51.104      - ITER: 80      - MAIN LOOP...     10x iter: 22444 milliseconds
 # 2023-11-30T17:58:13.315      - ITER: 90      - MAIN LOOP...     10x iter: 20687 milliseconds
 # 2023-11-30T17:58:33.598      - ITER: 100     - MAIN LOOP...     10x iter: 19461 milliseconds
(...)
 # 2023-11-30T18:01:31.810      - ITER: 200     - MAIN LOOP...     10x iter: 16608 milliseconds
(...)
 # 2023-11-30T18:04:17.420      - ITER: 300     - MAIN LOOP...     10x iter: 16126 milliseconds
(...)
 # 2023-11-30T18:06:57.865      - ITER: 400     - MAIN LOOP...     10x iter: 16575 milliseconds
(...)
 # 2023-11-30T18:09:42.826      - ITER: 500     - MAIN LOOP...     10x iter: 16739 milliseconds
(...)
 # 2023-11-30T18:12:26.032      - ITER: 600     - MAIN LOOP...     10x iter: 16195 milliseconds
(...)
 # 2023-11-30T18:15:05.542      - ITER: 700     - MAIN LOOP...     10x iter: 16136 milliseconds
(...)
 # 2023-11-30T18:17:49.975      - ITER: 800     - MAIN LOOP...     10x iter: 16133 milliseconds
(...)
 ## 2023-11-30T18:17:59.621      - ITER: 806     - MAIN LOOP...      1x iter: 1616 milliseconds
 ## 2023-11-30T18:18:01.157      - ITER: 807     - MAIN LOOP...      1x iter: 1536 milliseconds
 ## 2023-11-30T18:18:02.557      - ITER: 808     - MAIN LOOP...      1x iter: 1400 milliseconds
 ## 2023-11-30T18:18:03.824      - ITER: 809     - MAIN LOOP...      1x iter: 1266 milliseconds
 ## 2023-11-30T18:18:06.362      - ITER: 810     - MAIN LOOP...      1x iter: 2537 milliseconds
 ## 2023-11-30T18:18:09.024      - ITER: 811     - MAIN LOOP...      1x iter: 2662 milliseconds
 ## 2023-11-30T18:18:11.141      - ITER: 812     - MAIN LOOP...      1x iter: 2117 milliseconds
 ## 2023-11-30T18:18:12.445      - ITER: 813     - MAIN LOOP...      1x iter: 1304 milliseconds
 ## 2023-11-30T18:18:28.505      - ITER: 814     - MAIN LOOP...      1x iter: 16060 milliseconds
 ## 2023-11-30T18:18:55.896      - ITER: 815     - MAIN LOOP...      1x iter: 27391 milliseconds
 ## 2023-11-30T18:19:23.303      - ITER: 816     - MAIN LOOP...      1x iter: 27407 milliseconds
 ## 2023-11-30T18:19:50.742      - ITER: 817     - MAIN LOOP...      1x iter: 27439 milliseconds
(...)
 # 2023-11-30T19:00:30.412      - ITER: 900     - MAIN LOOP...     10x iter: 263298 milliseconds
(...)
 # 2023-11-30T19:46:04.056      - ITER: 1000    - MAIN LOOP...     10x iter: 271844 milliseconds

[2500, 25000, 1000] - Run: 2
 # 2023-11-30T19:50:37.485      - ITER: 10      - MAIN LOOP...     10x iter: 273319 milliseconds
 # 2023-11-30T19:55:13.279      - ITER: 20      - MAIN LOOP...     10x iter: 275794 milliseconds
(...)
Log - Run with 30000x3000 (OK)
[3000, 30000, 1000] - Run: 1
 # 2023-12-01T18:51:34.073      - ITER: 10      - MAIN LOOP...     10x iter: 43813 milliseconds
 # 2023-12-01T18:52:16.072      - ITER: 20      - MAIN LOOP...     10x iter: 41987 milliseconds
 # 2023-12-01T18:52:53.990      - ITER: 30      - MAIN LOOP...     10x iter: 37918 milliseconds
 # 2023-12-01T18:53:22.260      - ITER: 40      - MAIN LOOP...     10x iter: 28270 milliseconds
 # 2023-12-01T18:53:49.793      - ITER: 50      - MAIN LOOP...     10x iter: 27533 milliseconds
 # 2023-12-01T18:54:17.857      - ITER: 60      - MAIN LOOP...     10x iter: 28062 milliseconds
 # 2023-12-01T18:54:45.207      - ITER: 70      - MAIN LOOP...     10x iter: 27350 milliseconds
 # 2023-12-01T18:55:12.597      - ITER: 80      - MAIN LOOP...     10x iter: 27390 milliseconds
 # 2023-12-01T18:55:40.331      - ITER: 90      - MAIN LOOP...     10x iter: 27734 milliseconds
 # 2023-12-01T18:56:08.854      - ITER: 100     - MAIN LOOP...     10x iter: 28523 milliseconds
(...)
 # 2023-12-01T18:59:46.940      - ITER: 200     - MAIN LOOP...     10x iter: 21731 milliseconds
(...)
 # 2023-12-01T19:03:22.487      - ITER: 300     - MAIN LOOP...     10x iter: 21225 milliseconds
(...)
 # 2023-12-01T19:06:57.561      - ITER: 400     - MAIN LOOP...     10x iter: 21493 milliseconds
(...)
 # 2023-12-01T19:10:30.105      - ITER: 500     - MAIN LOOP...     10x iter: 21296 milliseconds
(...)
 # 2023-12-01T19:14:04.754      - ITER: 600     - MAIN LOOP...     10x iter: 21295 milliseconds
(...)
 # 2023-12-01T19:17:46.691      - ITER: 700     - MAIN LOOP...     10x iter: 22236 milliseconds
(...)
 # 2023-12-01T19:21:27.534      - ITER: 800     - MAIN LOOP...     10x iter: 21383 milliseconds
(...)
 # 2023-12-01T19:25:03.962      - ITER: 900     - MAIN LOOP...     10x iter: 21615 milliseconds
 (...)
 # 2023-12-01T19:28:42.516      - ITER: 1000    - MAIN LOOP...     10x iter: 21844 milliseconds

[3000, 30000, 1000] - Run: 2
 # 2023-12-01T19:29:05.372      - ITER: 10      - MAIN LOOP...     10x iter: 19463 milliseconds
 # 2023-12-01T19:29:27.061      - ITER: 20      - MAIN LOOP...     10x iter: 21689 milliseconds
 # 2023-12-01T19:29:48.748      - ITER: 30      - MAIN LOOP...     10x iter: 21686 milliseconds
(...)
Log - Run with 50000x5000 (OK)
[5000, 50000, 1000] - Run: 1
 # 2023-11-30T20:29:34.222      - ITER: 10      - MAIN LOOP...     10x iter: 91541 milliseconds
 # 2023-11-30T20:31:02.215      - ITER: 20      - MAIN LOOP...     10x iter: 87992 milliseconds
 # 2023-11-30T20:32:23.846      - ITER: 30      - MAIN LOOP...     10x iter: 81628 milliseconds
 # 2023-11-30T20:33:36.234      - ITER: 40      - MAIN LOOP...     10x iter: 72385 milliseconds
 # 2023-11-30T20:34:49.438      - ITER: 50      - MAIN LOOP...     10x iter: 73202 milliseconds
 # 2023-11-30T20:36:02.304      - ITER: 60      - MAIN LOOP...     10x iter: 72862 milliseconds
 # 2023-11-30T20:37:16.345      - ITER: 70      - MAIN LOOP...     10x iter: 74035 milliseconds
 # 2023-11-30T20:38:31.253      - ITER: 80      - MAIN LOOP...     10x iter: 74902 milliseconds
 # 2023-11-30T20:39:45.898      - ITER: 90      - MAIN LOOP...     10x iter: 74642 milliseconds
 # 2023-11-30T20:41:03.194      - ITER: 100     - MAIN LOOP...     10x iter: 77291 milliseconds
(...)
 # 2023-11-30T20:53:31.399      - ITER: 200     - MAIN LOOP...     10x iter: 77181 milliseconds
(...)
 # 2023-11-30T21:05:56.724      - ITER: 300     - MAIN LOOP...     10x iter: 76036 milliseconds
(...)
 # 2023-11-30T21:19:39.973      - ITER: 400     - MAIN LOOP...     10x iter: 82673 milliseconds
(...)
 # 2023-11-30T21:32:59.258      - ITER: 500     - MAIN LOOP...     10x iter: 78104 milliseconds
(...)
 # 2023-11-30T21:45:59.476      - ITER: 600     - MAIN LOOP...     10x iter: 78104 milliseconds
(...)
 # 2023-11-30T21:59:11.549      - ITER: 700     - MAIN LOOP...     10x iter: 79619 milliseconds
(...)
 # 2023-11-30T22:12:12.002      - ITER: 800     - MAIN LOOP...     10x iter: 77864 milliseconds
(...)
 # 2023-11-30T22:25:21.053      - ITER: 900     - MAIN LOOP...     10x iter: 72679 milliseconds
(...)
 # 2023-11-30T22:38:42.306      - ITER: 1000    - MAIN LOOP...     10x iter: 73025 milliseconds

[5000, 50000, 1000] - Run: 2
 # 2023-11-30T22:40:03.960      - ITER: 10      - MAIN LOOP...     10x iter: 80654  milliseconds
(...)

There are lots of things to make faster here. Your inner loop is doing zillions of allocations, your loops are in the wrong order for locality, and you aren’t using views for slices.

For example,

xn = pop[p, :] + rand( n_var ) .* ( xb - abs.( pop[p, :] ) ) - rand( n_var ) .* ( xw - abs.( pop[p, :] ) )
xn = max.( min.( xn, f_max ), f_min )

could be done in place in a single pass over the array if you (1) pre-allocate xn outside the loop (since it is always the same size and type of array), (2) use views (e.g. via the @views macro), and (3) use broadcasting more carefully. Assuming xn is pre-allocated and you put @views in front of the function declaration to make all slices (in the whole function at once!) use views, change the code to:

@. xn = clamp(pop[p, :] + rand() * (xb - abs(pop[p, :])) - rand() * (xw - abs(pop[p, :])),
              f_min, f_max)

Note also that @. changes + and - to .+ and .-, respectively, which is important for performance: even though + and - are defined for two arrays (being valid vector operations), they aren’t fused with adjacent “dot” operations into a single loop, whereas .+ and .- are. Note also that rand() gets turned into rand.(), which calls rand() once for each element (it is “fused” with the other loops) rather than allocating a whole temporary array like rand(n_var). Read this blog post. As a bonus, it’s arguably more readable this way.
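To make the fusion point concrete, here is a small standalone comparison (the names x, xb, and xn are placeholders, not variables from the code above):

x, xb = rand(1000), rand(1000)
xn = similar(x)

# Non-fused: rand(1000), abs.(x), and each +, -, and .* allocate their own
# temporary array before the result is bound to xn.
xn = x + rand(1000) .* (xb - abs.(x))

# Fused: @. turns every call into its dotted form, so this runs as a single
# loop that writes into the existing xn and calls rand() once per element,
# with no temporary arrays.
@. xn = x + rand() * (xb - abs(x))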

(In the rewritten expression above I’m also using clamp instead of combining min and max, but that is just for convenience.)
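Putting those pieces together, one pass of the inner loop could look roughly like the sketch below. The helper name step! and the explicit view calls are mine, not from the original post, and I haven’t benchmarked this; it just illustrates the pre-allocated buffer and the fused, in-place update:

# Sketch only: one pass of the main loop, keeping the MRE's semantics but
# reusing a pre-allocated buffer xn and avoiding row copies.
function step!(f, pop, pop_c, xn, f_min, f_max)
    n_pop = length(pop_c)
    xb = pop[argmin(pop_c), :]    # copies, as in the original code
    xw = pop[argmax(pop_c), :]
    for p in 1:n_pop
        xp = view(pop, p, :)      # no-copy view of row p
        @. xn = clamp(xp + rand() * (xb - abs(xp)) - rand() * (xw - abs(xp)),
                      f_min, f_max)
        xn_c = f(xn)
        if xn_c < pop_c[p]
            xp .= xn              # write the accepted candidate back in place
            pop_c[p] = xn_c
        end
    end
end

This would be called once per iteration of the while loop, replacing the existing for p in 1:n_pop block.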

Again, the slices here are in the wrong order for locality. Since it is your code and you can choose your data structures, I would transpose the dimensions of your pop array so that you are slicing it as pop[:, p] instead of pop[p, :], since a column pop[:, p] is contiguous in Julia.
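For illustration, here is a sketch of the transposed layout (the dimensions are picked arbitrarily; in the MRE the array would be built this way inside alg):

n_var, n_pop = 2500, 25000
f_min, f_max = -10.0, 10.0

# Store the population as n_var x n_pop, so each individual is one column.
pop = f_min .+ rand(n_var, n_pop) .* (f_max - f_min)

# An individual is then a contiguous, cache-friendly view:
p  = 1
xp = view(pop, :, p)    # column view: no copy, unit stride

All the row indexing in the loop (pop[p, :], pop[p, :] = xn, and so on) would then become column indexing (pop[:, p]).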


Thank you @stevengj for your detailed response!

I will carefully consider your suggestions to enhance the code and rerun the tests.

However, could you clarify why the performance issue emerges specifically after the 810th iteration in the 25000x2500 run, but not in the runs with smaller or larger dimensions, even though the code is the same?