Performance vs. number of allocations: unintuitive behavior

Hi, new to Julia here. I needed to develop a tool (mainly operations on 200x200 vectors and matrices).
I wrote very messy code at first, with tons of allocations. Measured with @btime (I ran everything once beforehand to let it precompile):

117.465 s (14679695 allocations: 21.74 GiB)

When the tool needed to evolve, I decided to refactor it and tried to eliminate all allocations by defining workspace buffers in the structs that hold the data and are passed into the functions.
Now, with very similar code and no additional functionality, I get:

148.709 s (305 allocations: 2.10 GiB)

So, basically, despite drastically reducing allocations, the code runs slower. And that is even with @fastmath added to a couple of operations compared with the previous messy code.

I tried different inputs to validate the result, and the timings are consistently around those values.

I also used the profiling tools to check for issues or type inconsistencies, but everything was perfectly type stable, with no globals, and the time was spent almost entirely on the operations that actually needed to be done. The previous messy code, by contrast, was completely type unstable, with globals being accessed all the time.
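
(Concretely, this is the kind of check I ran; my_hot_function, data, and workspace are stand-ins for my actual functions and arguments:)

using InteractiveUtils     # provides @code_warntype outside the REPL

# Anything printed in red (Any, Union{...}) marks a type instability;
# in the refactored code, every inferred type came out concrete.
@code_warntype my_hot_function(data, workspace)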

So my only remaining thought is that, by reducing allocations, I traded GC time for memory load/store time and came out worse. Much worse, considering that everything else improved.
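
(I'm not sure how to verify that tradeoff directly. Would something like this, reading the per-sample GC times that BenchmarkTools records, give me the GC fraction? Untested sketch; old_version and inputs are placeholders:)

using BenchmarkTools, Statistics

trial = @benchmark old_version($inputs)
# a Trial stores per-sample wall times and GC times (in nanoseconds):
println("GC fraction: ", mean(trial.gctimes) / mean(trial.times))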

Any idea why I’m getting these results?
Any suggestions for a new Julia dev?

E.

Hi, Eventine. Could you share some code snippets, including the ones before and after the allocation reduction?

It is possible for code to be slow without allocating. So maybe your new code is just slower but allocates less. Also see Performance Tips · The Julia Language.

But it really is impossible to say much without having some runnable code to look at.

One mechanism by which code that allocates less can be slower is when you allocate a new contiguous array from a noncontiguous view into another array. Sometimes you only get auto-vectorization and fast memory access on the contiguous array and not the view even though the view saves an allocation.
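
A toy illustration of that effect (not your code, and the exact timings are machine-dependent):

using BenchmarkTools

X = rand(1000, 1000)
row_view = @view X[1, :]   # noncontiguous: consecutive elements are 1000 Float64s apart
row_copy = X[1, :]         # allocates, but the result is a contiguous Vector

@btime sum($row_view)      # strided loads, poor cache behavior
@btime sum($row_copy)      # contiguous loads, can auto-vectorize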

Sadly, I cannot share the code; I was mainly looking for some insights and advice.
I can try to squeeze the relevant parts of the code into this (and please tell me if there is anything egregious from a Julia point of view, as I’m trying to learn!):

using LinearAlgebra, BenchmarkTools, Random

# --- Setup Dimensions ---
# Test 1: Similar execution
const M, K, N = 200, 180, 200 # M=rows of A, B, C; K=cols of A, rows of VAL1; N=cols of C, VAL1

# Test 2: worse execution
#const M, K, N = 2000, 1800, 2000 # M=rows of A, B, C; K=cols of A, rows of VAL1; N=cols of C, VAL1


const VAL1 = rand(Float64, K, N) 


struct MyWorkspace{T}
    result::Matrix{T}
    tmp1::Matrix{T}
    tmp2::Matrix{T}
end

struct MyStruct{T}
    myval::Matrix{T} 
    workspace::MyWorkspace{T}
end

# --- Functions ---
function my_old_function(A::Matrix, B::Matrix, C::Matrix)
    # A (M x K) * VAL1 (K x N) -> tmp1 (M x N)
    tmp1 = A * VAL1 
    # tmp1 (M x N) .^ B (M x N) -> tmp2 (M x N)
    tmp2 = tmp1 .^ B

    return C * tmp2
end

function my_new_function!(s::MyStruct{T}, A::Matrix{T}, B::Matrix{T}, C::Matrix{T}) where T
    mul!(s.workspace.tmp1, A, s.myval)
    @fastmath @. s.workspace.tmp2 = s.workspace.tmp1 ^ B
    mul!(s.workspace.result, C, s.workspace.tmp2)
    return nothing
end

# --- Benchmarking Setup ---

A = rand(Float64, M, K) # 200 x 180 
B = rand(Float64, M, N) # 200 x 200 
C = rand(Float64, M, M) # 200 x 200 

# Initialize the workspace and struct for the new function
workspace = MyWorkspace{Float64}(
    rand(Float64, M, N), # result (200 x 200)
    rand(Float64, M, N), # tmp1 (200 x 200)
    rand(Float64, M, N)  # tmp2 (200 x 200)
)

myval = copy(VAL1)
s = MyStruct{Float64}(myval, workspace)

ITER = 10_000

# --- Benchmark Calls ---

println("\n--- Benchmarking my_old_function (Allocating) ---")
@btime my_old_function($A, $B, $C)

println("--- Benchmarking my_new_function! (In-place) ---")
@btime my_new_function!($s, $A, $B, $C)

println("\n--- Benchmarking my_old_function (Allocating) ---")
@btime for i in 1:ITER my_old_function($A, $B, $C); end
println("--- Benchmarking my_new_function! (In-place) ---")
@btime for i in 1:ITER my_new_function!($s, $A, $B, $C); end

Test 1:

--- Benchmarking my_old_function (Allocating) ---
  1.085 ms (9 allocations: 937.73 KiB)
--- Benchmarking my_new_function! (In-place) ---
  1.065 ms (0 allocations: 0 bytes)

--- Benchmarking my_old_function (Allocating) ---
  19.456 s (118979 allocations: 8.94 GiB)
--- Benchmarking my_new_function! (In-place) ---
  18.470 s (28979 allocations: 609.06 KiB)

Test 2:

--- Benchmarking my_old_function (Allocating) ---
  408.595 ms (9 allocations: 91.55 MiB)
--- Benchmarking my_new_function! (In-place) ---
  420.609 ms (0 allocations: 0 bytes)

# I couldn't @btime the loop benchmarks for Test 2 on my PC, they were taking too long!
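
(One thing I'm unsure about while writing this up: ITER is a plain, non-const global, so the loop benchmarks iterate over an untyped range, and I suspect that alone accounts for the ~29k allocations reported for the "0 allocations" in-place version. Would interpolating it be the right fix? E.g.:)

# Interpolating ITER (or declaring const ITER = 10_000) keeps the
# benchmarked loop from paying for an untyped global on every iteration:
@btime for i in 1:$ITER my_new_function!($s, $A, $B, $C) end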

Very interesting!
I checked, and I don’t do “strictly” noncontiguous indexing with @view, but I do take several @views of rows (which are non-contiguous, since Julia stores matrices column-major, right?).

If that is the culprit, how would you suggest I approach it? Should I transpose the matrices or just materialize the view?
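
For instance, would one of these two patterns be the right idea? (A hypothetical, untested sketch with made-up names:)

X = rand(200, 200); i = 1            # stand-ins for my real data

# Option 1: store the data transposed once, so a former row access
# becomes a contiguous column view:
Xt = permutedims(X)
row_as_col = @view Xt[:, i]          # stride-1, SIMD-friendly

# Option 2: keep the layout, but copy the row into a preallocated
# buffer, paying the strided pass once and then reusing contiguous memory:
buf = Vector{Float64}(undef, size(X, 2))
copyto!(buf, @view(X[i, :]))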

E.