Performance vs. number of allocations: unintuitive behavior

Hi, new to Julia here. I needed to develop a tool (mainly operations on 200x200 vectors and matrices).
I wrote very messy code at first, with tons of allocations. Measured with @btime (I ran everything once beforehand to let it precompile):

117.465 s (14679695 allocations: 21.74 GiB)

When the tool needed to evolve, I decided to refactor it and tried to eliminate all allocations by defining workspace buffers in the structs that hold the data and are passed into the functions.
Now, with very similar code and no additional functionality, I get:

148.709 s (305 allocations: 2.10 GiB)

So, basically, despite drastically reducing allocations, the code runs slower. And this is even with @fastmath added on a couple of operations compared to the previous messy code.

I tried different inputs to validate the result; it is consistent across those values.

I also used the profiling tool to see if I had any issues or type inconsistencies, but everything was perfectly type stable, with no globals, and minimal computation time beyond the operations that needed to be done. The previous messy code, by contrast, was totally type unstable, with globals being accessed all the time.

So, my only remaining thought is that, by reducing allocations, I traded GC time for memory load/store time and got a worse result. A much worse result, considering everything else.

Any idea on why I’m getting these results?
Any suggestion for a new Julia dev?

E.

Hi, Eventine. Could you share some code snippets, including the ones before and after the allocation reduction?

1 Like

It is possible for code to be slow without allocating. So maybe your new code is just slower but allocates less. Also see Performance Tips · The Julia Language.

But it really is impossible to say much without having some runnable code to look at.

8 Likes

One mechanism by which code that allocates less can be slower is when you allocate a new contiguous array from a noncontiguous view into another array. Sometimes you only get auto-vectorization and fast memory access on the contiguous array and not the view even though the view saves an allocation.
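
A minimal sketch of the effect (the arrays and names here are just for illustration, not from any real code):

using BenchmarkTools
X = rand(1000, 1000)
rowview = @view X[1, :]   # noncontiguous: stride of 1000 elements between entries
rowcopy = X[1, :]         # materialized: one allocation, but contiguous memory
@btime sum($rowview)      # strided access, poor cache use and vectorization
@btime sum($rowcopy)      # contiguous access, typically faster per call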

5 Likes

Sadly I cannot share the code; I was mainly looking for some insights and advice.
I can try to squeeze the relevant parts of the code into this (and please tell me if there are egregious things from a Julia PoV, as I’m trying to learn!):

using LinearAlgebra, BenchmarkTools, Random

# --- Setup Dimensions ---
# Test 1: Similar execution
const M, K, N = 200, 180, 200 # M=rows of A, B, C; K=cols of A, rows of VAL1; N=cols of C, VAL1

# Test 2: worse execution
#const M, K, N = 2000, 1800, 2000 # M=rows of A, B, C; K=cols of A, rows of VAL1; N=cols of C, VAL1


const VAL1 = rand(Float64, K, N) 


struct MyWorkspace{T}
    result::Matrix{T}
    tmp1::Matrix{T}
    tmp2::Matrix{T}
end

struct MyStruct{T}
    myval::Matrix{T} 
    workspace::MyWorkspace{T}
end

# --- Functions ---
function my_old_function(A::Matrix, B::Matrix, C::Matrix)
    # A (M x K) * VAL1 (K x N) -> tmp1 (M x N)
    tmp1 = A * VAL1 
    # tmp1 (M x N) .^ B (M x N) -> tmp2 (M x N)
    tmp2 = tmp1 .^ B

    return C * tmp2
end

function my_new_function!(s::MyStruct{T}, A::Matrix{T}, B::Matrix{T}, C::Matrix{T}) where T
    # A (M x K) * myval (K x N) -> tmp1 (M x N), in place
    mul!(s.workspace.tmp1, A, s.myval)
    # tmp1 (M x N) .^ B (M x N) -> tmp2 (M x N), in place
    @fastmath @. s.workspace.tmp2 = s.workspace.tmp1 ^ B
    # C (M x M) * tmp2 (M x N) -> result (M x N), in place
    mul!(s.workspace.result, C, s.workspace.tmp2)
    return nothing
end

# --- Benchmarking Setup ---

A = rand(Float64, M, K) # 200 x 180 
B = rand(Float64, M, N) # 200 x 200 
C = rand(Float64, M, M) # 200 x 200 

# Initialize the workspace and struct for the new function
workspace = MyWorkspace{Float64}(
    rand(Float64, M, N), # result (200 x 200)
    rand(Float64, M, N), # tmp1 (200 x 200)
    rand(Float64, M, N)  # tmp2 (200 x 200)
)

myval = copy(VAL1)
s = MyStruct{Float64}(myval, workspace)

ITER = 10_000

# --- Benchmark Calls ---

println("\n--- Benchmarking my_old_function (Allocating) ---")
@btime my_old_function($A, $B, $C)

println("--- Benchmarking my_new_function! (In-place) ---")
@btime my_new_function!($s, $A, $B, $C)

println("\n--- Benchmarking my_old_function (Allocating) ---")
@btime for i in 1:ITER my_old_function($A, $B, $C); end
println("--- Benchmarking my_new_function! (In-place) ---")
@btime for i in 1:ITER my_new_function!($s, $A, $B, $C); end

Test 1:

--- Benchmarking my_old_function (Allocating) ---
  1.085 ms (9 allocations: 937.73 KiB)
--- Benchmarking my_new_function! (In-place) ---
  1.065 ms (0 allocations: 0 bytes)

--- Benchmarking my_old_function (Allocating) ---
  19.456 s (118979 allocations: 8.94 GiB)
--- Benchmarking my_new_function! (In-place) ---
  18.470 s (28979 allocations: 609.06 KiB)

Test 2:

--- Benchmarking my_old_function (Allocating) ---
  408.595 ms (9 allocations: 91.55 MiB)
--- Benchmarking my_new_function! (In-place) ---
  420.609 ms (0 allocations: 0 bytes)

# I couldn't @btime the looped version of Test 2 on my PC; it takes too long!

Very interesting!
I checked, and I don’t do “strictly” noncontiguous indexing with @view, but I do take several @views of rows, which are noncontiguous since Julia arrays are column-major.

If that is the culprit, how do you suggest I approach it? Should I transpose the matrices, or just materialize the view?

E.

1 Like

As always, benchmarking is just a metric. I would look into profiling your code to see where it’s actually spending a lot of time.
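
For instance, with the snippet you posted, something along these lines (a sketch) would show where the time goes:

using Profile
Profile.clear()
@profile for _ in 1:100 my_new_function!(s, A, B, C) end
Profile.print(format=:flat, sortedby=:count)   # lists functions by sample count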

Yeah, that’s likely your culprit.

I’d just try doing a materialized getindex for the rows, and see if that recovers the lost performance or not.

Depends. I’d try both the transposed approach and materializing the view, and see which is better. If you need both column and row access, then you’ll probably want to materialize the view.
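
To make the options concrete, a sketch (the matrix and index are stand-ins for illustration):

A = rand(200, 180); i = 7   # stand-ins, not your actual data
rowview = @view A[i, :]     # no allocation, but strided access
rowcopy = A[i, :]           # one allocation, contiguous access
At = permutedims(A)         # pay for the transpose once...
colview = @view At[:, i]    # ...then each former row is a contiguous column view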

1 Like

At the scale of matrices with hundreds of rows and columns, matrix multiplication (O(M·K·N) operations for an O(M·N) result) is expensive enough that the overhead of an allocation (and subsequent GC) is pretty negligible. Scalar exponentiation is cheaper, but still expensive enough that I would only expect a modest improvement.

If you no longer care about the contents of tmp1, you can write lines like
tmp2 = tmp1 .^ B as tmp1 .= tmp1 .^ B to overwrite the input with the result as you go. The (small) performance advantage is that the processor only needs to access two arrays in memory rather than three, which can help if memory bandwidth was what slowed you down.
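
Applied to your posted my_new_function!, that could look something like this (a sketch with a hypothetical name; s.workspace.tmp2 then goes unused):

function my_overwriting_function!(s::MyStruct{T}, A::Matrix{T}, B::Matrix{T}, C::Matrix{T}) where T
    mul!(s.workspace.tmp1, A, s.myval)
    @fastmath @. s.workspace.tmp1 = s.workspace.tmp1 ^ B   # overwrite tmp1 in place
    mul!(s.workspace.result, C, s.workspace.tmp1)
    return nothing
end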

Slicing rows is not necessarily awful. Copying them into contiguous memory might help or it might do almost nothing. It depends on exactly how it is used.

1 Like

Ok, thanks!
But assuming they take exactly the same time to execute, what would be the most Julian way to write that kind of code (assume it runs in a hot loop)? The “old way” with temporary arrays, or the new way with the pre-allocated workspace?

(I’m asking to understand whether I wasted my time, or at least wrote decent code.)

It doesn’t really matter, since the code here is not representative of your actual problem (with row views / copies) and most of the time is spent in the big matrix operations anyway. But note that the reason you have so many allocations here is that ITER is a non-const global: you forgot to either make it a const or interpolate it:

julia> @btime for i in 1:ITER my_new_function!($s, $A, $B, $C); end
  11.925 s (28979 allocations: 609.06 KiB)

julia> @btime for i in 1:$ITER my_new_function!($s, $A, $B, $C); end
  12.093 s (0 allocations: 0 bytes)

1 Like

The code you posted (mutating or not) looks reasonable. If you want something a tad more Julian, I’d replace your ::Matrix function input type restrictions with ::AbstractMatrix to make the functions more flexible (but keep your struct fields strictly typed as ::Matrix).
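
For example, your mutating function with only the signature relaxed:

function my_new_function!(s::MyStruct{T}, A::AbstractMatrix{T}, B::AbstractMatrix{T},
                          C::AbstractMatrix{T}) where T
    mul!(s.workspace.tmp1, A, s.myval)
    @fastmath @. s.workspace.tmp2 = s.workspace.tmp1 ^ B
    mul!(s.workspace.result, C, s.workspace.tmp2)
    return nothing
end

This version also accepts views, transposes, and other AbstractMatrix subtypes, with no runtime cost since Julia specializes on the concrete argument types anyway.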

@btime already runs the code multiple times (if the code runs in less than a few seconds), so there isn’t a need to write a loop within @btime to do repetitions. If you want to see extra info, try replacing it with @benchmark.
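
For example:

julia> @benchmark my_new_function!($s, $A, $B, $C)   # reports the timing distribution, allocations, and GC share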

2 Likes