Should Julia be able to optimize away small temporary arrays?

The following code comes from a real-world example of a beginner Julia programmer calculating the position vector for a scalar field. The second version, with the array literal replaced by a tuple, is about 10x faster: in each loop iteration the array literal [i,j,k] causes a heap allocation, which the tuple avoids.

function run_it_array()
    nx, ny, nz = 100, 200, 300

    x = Array{Float64}(undef, 3)

    Δx = [3., 5., 2.]

    for i=1:nx, j=1:ny, k=1:nz
        @. x = [i,j,k] * Δx   # [i,j,k] allocates a new Vector on every iteration
    end
end

function run_it_tuple()
    nx, ny, nz = 100, 200, 300

    x = Array{Float64}(undef, 3)

    Δx = [3., 5., 2.]

    for i=1:nx, j=1:ny, k=1:nz
        @. x = (i,j,k) * Δx   # (i,j,k) is just a tuple, no heap allocation needed
    end
end

The second is about 10x faster:

julia> @time run_it_array()
  0.133853 seconds (6.00 M allocations: 457.764 MiB, 2.62% gc time)

julia> @time run_it_tuple()
  0.014502 seconds (2 allocations: 160 bytes)

From a non-computer-science perspective this feels nuts… “This one weird trick will speed up your code 10x…” :stuck_out_tongue_winking_eye: (Sorry for the sarcasm; I’m just translating the real-world experience. It gives off vibes of the 90s, when people were sharing their weirdest C pointer magic.)

So my question: Should Julia be able to optimize the array version? Going one step further, should Julia even be able to pre-allocate x on its own without requiring the programmer to do it?

Also note:

julia> using StaticArrays

julia> function run_it_static_array()
           nx, ny, nz = 100, 200, 300

           x = Array{Float64}(undef, 3)

           Δx = [3., 5., 2.]

           for i=1:nx, j=1:ny, k=1:nz
               @. x = SA[i,j,k] * Δx
           end
       end
run_it_static_array (generic function with 1 method)

julia> @btime run_it_static_array()
  20.409 ms (2 allocations: 160 bytes)

Thanks, I am aware of StaticArrays. However, it feels like an even weirder trick… (and it is not needed in this case).

The TL;DR is that it should, and there is ongoing work in the compiler on escape analysis that facilitates this kind of optimization.

If we think about the semantics of the two versions, I don’t think the difference is weird or like C pointer magic at all. The array version asks for a block of mutable memory of (possibly) changing size, while the tuple version cannot be written to or change its size, because it is immutable. If mutability did not have a cost, that would feel much weirder to me!
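
As a minimal illustration of that semantic difference (using a throwaway vector v, unrelated to the code above):

v = [3., 5., 2.]   # a Vector: mutable, and its length can even change
v[1] = 10.0        # fine
push!(v, 7.0)      # fine, the array grows

t = (3., 5., 2.)   # a Tuple: immutable and of fixed length
t[1] = 10.0        # ERROR: tuples are immutable (no setindex! method)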

On the other hand, if a user writes an array literal filled with numeric literals which clearly doesn’t change size or escape the function context, the compiler should have enough information to treat it as static without needing to reach for an external package. That work is in progress (as noted above), but I can understand OP’s surprise that Julia’s mature & sophisticated lowering/compilation toolchain doesn’t yet do so.

That’s the thing: Julia’s lowering/compilation isn’t that sophisticated. The most sophisticated parts are type inference and multiple dispatch. We don’t do that much optimization at the Julia level yet; we just generate reasonable LLVM IR and LLVM does the rest.
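
If you want to poke at this yourself, one option (shown here without output) is to compare what Julia’s own lowering produces with the LLVM IR that LLVM then optimizes; both macros live in InteractiveUtils, which the REPL loads by default:

julia> @code_lowered run_it_tuple()

julia> @code_llvm debuginfo=:none run_it_tuple()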

We’re getting there. 1.8 added the effect system, and 1.9 will likely have some form of escape analysis.
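
As an illustrative sketch of how the 1.8 effect system is exposed to users (mypow is a made-up function; the pattern follows the Base.@assume_effects docs): you can annotate a method with effects the compiler cannot prove on its own, which can unlock optimizations such as constant folding.

# Promise the compiler that this while-loop terminates; combined with the
# effects it infers on its own, that can let calls with constant arguments
# be constant-folded.
Base.@assume_effects :terminates_locally function mypow(x::Int, n::Int)
    res = 1
    while n > 0
        res *= x
        n -= 1
    end
    return res
end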

Sure, and that’s great. But, imho, then we’re talking about serious optimization magic, much more so than in the current situation, where different data structures have different performance characteristics that are easily explainable and understandable.

So, imho, this is not a weird performance trick at all; different data structures are good for different cases.

Is this actually feasible to detect for vectors? I can easily imagine heap-to-stack optimizations for mutable structs because the size is fixed, and while I’m not as certain, I think >=2-dimensional arrays also can’t change their size. But vectors definitely have dynamic sizes, and while it’s simple enough to check for push!, popat!, etc. in the method, these calls could also happen in very nested method calls. Is the compiler capable of ruling out resizing in all nested non-inlined calls, or would it just give up even when the programmer knows none of those methods resize the array?
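
To make that concrete, a hypothetical sketch (both helper names are made up): from the call site alone the compiler can’t tell these two apart unless it can inline them or infer their effects, so it has to assume the vector might be resized.

# only writes existing elements; the length never changes
function fill_xyz!(x, i, j, k, Δx)
    @. x = (i, j, k) * Δx
    return x
end

# looks the same from the caller's side, but this one resizes
function grow!(x)
    push!(x, 0.0)
    return x
end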

That said, mutable StaticArrays could end up on the stack more readily than they already do.

No, it can’t do that, due (in part) to arrays being implemented in C right now. More or less every interaction with arrays does a ccall, which you can’t easily optimize away, even with effect analysis, since you can’t really attach blanket effects to ccall (it may do anything, after all).

To sort of prove my point, here are the timings of a second run on my machine, on a relatively recent version of master that (as far as I’m aware) already includes a big chunk of that effect analysis:

julia> @time run_it_array()
  0.255716 seconds (6.00 M allocations: 457.764 MiB, 7.29% gc time)

julia> @time run_it_tuple()
  0.025800 seconds (2 allocations: 160 bytes)

julia> versioninfo()
Julia Version 1.9.0-DEV.991
Commit 707f59b980 (2022-07-17 18:14 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4

The escape analysis we do have on master definitely does lots of stuff for Julia-native structs though, be they mutable or not, which makes e.g. MArray from StaticArrays work like a charm, provided you then return an SArray of the same data.
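
For reference, a sketch of that MArray-then-SArray pattern on the same toy workload (run_it_marray is a made-up name; whether the MVector actually stays off the heap depends on the Julia version and how well escape analysis handles it):

using StaticArrays

function run_it_marray()
    nx, ny, nz = 100, 200, 300
    Δx = SA[3., 5., 2.]
    x = SA[0., 0., 0.]
    for i = 1:nx, j = 1:ny, k = 1:nz
        m = MVector{3,Float64}(undef)   # mutable scratch buffer
        @. m = (i, j, k) * Δx
        x = SVector(m)                  # hand back an immutable SVector
    end
    return x
end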
