Poor performance when passing data structures as function arguments

I have a compute-heavy kernel function that I am trying to use inside another function. The kernel takes several arrays and other data as arguments, and I was trying to make the code cleaner by passing a single object as the argument, but performance drops markedly when I do so. Here is some quick code for demonstration.

# Passing all data as explicit arguments:
function compute1(nx, ny, nz, arr1, arr2)
    for k in 1:nz, j in 1:ny, i in 1:nx
        tmp = arr1[i, j, k] * arr2[i, j, k]
        tmp2 = tmp + arr1[i, j, k]
    end
end

function start_compute1()
    nx = 10
    ny = 10
    nz = 200
    arr1 = zeros(Float64, (nx, ny, nz))
    arr2 = ones(Float64, (nx, ny, nz))
    compute1(nx, ny, nz, arr1, arr2)
end

@benchmark start_compute1()
BenchmarkTools.Trial: 
  memory estimate:  312.66 KiB
  allocs estimate:  4
  --------------
  minimum time:     39.613 μs (0.00% GC)
  median time:      157.446 μs (0.00% GC)
  mean time:        192.640 μs (20.39% GC)
  maximum time:     6.807 ms (97.69% GC)
  --------------
  samples:          10000
  evals/sample:     1

This is the version that performs well. My first attempt at passing an object with the data was with a dictionary:

function compute2(data)
    nx, ny, nz = data[:nx], data[:ny], data[:nz]
    arr1 = data[:arr1]
    arr2 = data[:arr2]
    
    for k in 1:nz, j in 1:ny, i in 1:nx
        tmp = arr1[i, j, k] * arr2[i, j, k]
        tmp2 = tmp + arr1[i, j, k]
    end
end

function start_compute2()
    nx = 10
    ny = 10
    nz = 200
    arr1 = zeros(Float64, (nx, ny, nz))
    arr2 = ones(Float64, (nx, ny, nz))
    data = Dict()
    data[:nx], data[:ny], data[:nz] = nx, ny, nz
    data[:arr1] = arr1
    data[:arr2] = arr2
    compute2(data)
end

@benchmark start_compute2()
BenchmarkTools.Trial: 
  memory estimate:  2.58 MiB
  allocs estimate:  124409
  --------------
  minimum time:     2.178 ms (0.00% GC)
  median time:      2.327 ms (0.00% GC)
  mean time:        2.593 ms (10.94% GC)
  maximum time:     8.116 ms (70.25% GC)
  --------------
  samples:          1928
  evals/sample:     1

It seems the problem is that the compiler is not optimising based on what is inside the Dict. I have also tried using a struct with a well-defined data type instead of a Dict, or including type annotations for arr1 and arr2 (e.g. arr1 = data[:arr1]::Array{Float64, 3}), but the problem persists.

Is there a way to recover the performance without having to spell out all the individual arguments to the compute2 function? Perhaps there is an obvious solution, but I'm not finding it.

Any suggestions would be appreciated.

Your Dict is a Dict{Any,Any}.
Thus the compiler can't see the types of the contents, and since the contents are a mixture of different types (some Int and some Array{Float64, 3}), you can't narrow the value type to anything more specific than an abstract type like Any or a Union (and I don't think a Union would optimize well in this circumstance anyway).
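To make that concrete, here is a small illustration (not from the original post; the variable name is just for demonstration):

data = Dict()                 # constructs a Dict{Any,Any}
data[:nx] = 10                # an Int value
data[:arr1] = zeros(2, 2, 2)  # an Array{Float64, 3} value
typeof(data)                  # Dict{Any, Any}
# Every lookup such as data[:arr1] therefore infers as Any,
# so the loop body cannot be specialized on the array type.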
Don’t use dictionaries for this kind of thing.
Use NamedTuples instead.
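Here is a minimal sketch of the NamedTuple version (the names compute2_nt and start_compute2_nt are just for illustration): the field types are part of the NamedTuple's type, so the compiler sees concrete types for the arrays and the loop specializes just like in compute1.

function compute2_nt(data)
    nx, ny, nz = data.nx, data.ny, data.nz
    arr1 = data.arr1
    arr2 = data.arr2
    for k in 1:nz, j in 1:ny, i in 1:nx
        tmp = arr1[i, j, k] * arr2[i, j, k]
        tmp2 = tmp + arr1[i, j, k]
    end
end

function start_compute2_nt()
    nx, ny, nz = 10, 10, 200
    arr1 = zeros(Float64, (nx, ny, nz))
    arr2 = ones(Float64, (nx, ny, nz))
    data = (nx = nx, ny = ny, nz = nz, arr1 = arr1, arr2 = arr2)
    compute2_nt(data)
end

NamedTuples can also be indexed with Symbols (data[:nx]), so your existing compute2 should work unchanged if you simply build data as a NamedTuple instead of a Dict.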

Beyond the type issue, there are other reasons not to use a dictionary just to hold a collection of variables.
I did a writeup on this a while back: JuliaLang Antipatterns

I have also tried to use a struct with a well-defined data type instead of Dict,

That should have worked. Can you show us how you defined the type?


Actually, using a struct with a well-defined data type does solve the problem. I hadn't tested it on this simple example, only on the more complicated version. Here it is:

struct MyData{T <: AbstractFloat}
    nx::Int
    ny::Int
    nz::Int
    arr1::Array{T, 3}
    arr2::Array{T, 3}
end

function compute3(data::MyData)
    nx, ny, nz = data.nx, data.ny, data.nz
    for k in 1:nz, j in 1:ny, i in 1:nx
        tmp = data.arr1[i, j, k] * data.arr2[i, j, k]
        tmp2 = tmp + data.arr1[i, j, k]
    end
end

function start_compute3()
    nx = 10
    ny = 10
    nz = 200
    arr1 = zeros(Float64, (nx, ny, nz))
    arr2 = ones(Float64, (nx, ny, nz))
    data = MyData(nx, ny, nz, arr1, arr2)
    compute3(data)
end

@benchmark start_compute3()
BenchmarkTools.Trial: 
  memory estimate:  312.66 KiB
  allocs estimate:  4
  --------------
  minimum time:     35.819 μs (0.00% GC)
  median time:      166.580 μs (0.00% GC)
  mean time:        216.611 μs (22.51% GC)
  maximum time:     12.367 ms (97.94% GC)
  --------------
  samples:          10000
  evals/sample:     1

Now my problem is how to make the struct more specific with some Unitful types, but that is another question…
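In case it helps with that follow-up, here is a hedged sketch of one direction (none of this is from the thread, and the struct and variable names are made up): parameterizing the struct on the full array type keeps every field concrete even when the element type is a Unitful quantity.

struct MyUnitData{A <: AbstractArray{<:Number, 3}}
    nx::Int
    ny::Int
    nz::Int
    arr1::A
    arr2::A
end

using Unitful
arr1 = zeros(Float64, (10, 10, 200)) * u"m"   # Array of Quantity{Float64} values
arr2 = ones(Float64, (10, 10, 200)) * u"m"
data = MyUnitData(10, 10, 200, arr1, arr2)    # A is deduced from the array arguments

A compute function written against this struct (or against MyData with a similarly widened parameter) should specialize the same way compute3 does.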


Minimal working examples like this are a bit confusing, because I cannot tell which parts are the important ones. I think you have simplified it so much that it doesn’t make sense anymore.

Firstly, if you only use nx, ny, nz to carry around the sizes of the arrays, you simply should not. The arrays carry information about their own sizes, which you can query using the size function, or axes, or just eachindex.

Secondly, your compute functions actually don’t do anything. They return nothing (literally), and they do not modify anything, so in principle the compiler doesn’t have to do anything at all, and could skip the entire computation.

So this is a toy example, presumably, but I think you have stripped away too much, including any meaningful computation, while leaving obvious improvements on the table. That makes it hard to see what to improve except for the type instability. So right now, my suggestion for speeding up your code would be to write:

function start_compute4()
    return nothing
end

It produces the same result as your code, and is significantly faster. But is this really what you want to do?
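For what it's worth, here is a hedged sketch folding the two points above together (compute5 is just a made-up name, and the sum is only a stand-in for whatever the real kernel should produce): the arrays describe their own shape via eachindex, and returning a value means the compiler cannot discard the work.

function compute5(arr1, arr2)
    acc = zero(eltype(arr1))
    for I in eachindex(arr1, arr2)   # throws if the two arrays' shapes don't agree
        tmp = arr1[I] * arr2[I]
        acc += tmp + arr1[I]
    end
    return acc
end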


Sorry for the confusing example. I will be more explicit in my next question.

No need to apologize, and I hope I didn't sound too critical. Minimal working examples (MWEs) are hard to create. Actually, I wanted to know if there was anything more you wanted help with in your example. That's the trouble with MWEs: it can sometimes be hard to tell exactly what the trouble is.
