Should Julia be able to optimize away small temporary arrays?

The following code comes from a real-world case of a beginner Julia programmer calculating the position vector for a scalar field. The second version, with the temporary array replaced by a tuple, is about 10x faster: the array literal triggers a heap allocation in every loop iteration, which the tuple avoids.

function run_it_array()
    nx, ny, nz = 100, 200, 300

    x = Array{Float64}(undef, 3)

    Δx = [3., 5., 2.]

    for i=1:nx, j=1:ny, k=1:nz
        @. x = [i,j,k] * Δx
    end
end

function run_it_tuple()
    nx, ny, nz = 100, 200, 300

    x = Array{Float64}(undef, 3)

    Δx = [3., 5., 2.]

    for i=1:nx, j=1:ny, k=1:nz
        @. x = (i,j,k) * Δx
    end
end

The second is about 10x faster:

julia> @time run_it_array()
  0.133853 seconds (6.00 M allocations: 457.764 MiB, 2.62% gc time)

julia> @time run_it_tuple()
  0.014502 seconds (2 allocations: 160 bytes)
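
For what it's worth, here is a third variant (my sketch, not from the post above) that makes explicit where the allocations come from: the `[i,j,k]` literal builds a fresh heap array on every pass (6M iterations, 6M allocations), so writing the products straight into the pre-allocated `x` removes the temporary entirely:

```julia
# Sketch: same computation, but with the per-iteration temporary removed by
# writing each component directly into the pre-allocated `x`.
function run_it_inplace()
    nx, ny, nz = 100, 200, 300

    x = Array{Float64}(undef, 3)

    Δx = [3., 5., 2.]

    for i = 1:nx, j = 1:ny, k = 1:nz
        x[1] = i * Δx[1]   # no temporary array is built here,
        x[2] = j * Δx[2]   # unlike `[i,j,k] * Δx`
        x[3] = k * Δx[3]
    end
    return x
end
```

Timing this should show only the two allocations for `x` and `Δx`, like the tuple version.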

From a non-computer-science perspective this feels nuts… “This one weird trick will speed up your code 10x…” :stuck_out_tongue_winking_eye: (Sorry for the sarcasm; I’m just relaying the real-world experience. It gives off vibes of the 90s, when people were sharing their weirdest C pointer magic.)

So my question: Should Julia be able to optimize the array version? Going one step further, should Julia even be able to pre-allocate x on its own without requiring the programmer to do it?

3 Likes

Also note:

julia> using StaticArrays

julia> function run_it_static_array()
           nx, ny, nz = 100, 200, 300

           x = Array{Float64}(undef, 3)

           Δx = [3., 5., 2.]

           for i=1:nx, j=1:ny, k=1:nz
               @. x = SA[i,j,k] * Δx
           end
       end
run_it_static_array (generic function with 1 method)

julia> @btime run_it_static_array()
  20.409 ms (2 allocations: 160 bytes)
1 Like

Thanks, I am aware of StaticArrays. However, it feels like an even weirder trick… (and it is not needed in this case).

1 Like

The TL;DR is that it should, and there is ongoing work on escape analysis in the compiler that facilitates this type of optimization.

14 Likes

If we think about the semantics of the two versions, I don’t think the difference is weird or like C pointer magic at all. The array version asks for a block of mutable memory of (possibly) changing size, while the tuple version, being immutable, cannot be written to or change its size. If mutability had no cost, that would feel much weirder to me!
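
As a minimal illustration of that semantic difference (my sketch, not from the thread): arrays support in-place mutation and resizing, while tuples support neither, which is exactly why a fresh tuple per iteration needs no heap allocation:

```julia
# Arrays are mutable and resizable:
a = [1, 2, 3]
a[1] = 10          # in-place mutation is fine
push!(a, 4)        # growing is fine

# Tuples are immutable; there is no mutating setindex! for them:
t = (1, 2, 3)
immutable_err = try
    setindex!(t, 10, 1)   # what `t[1] = 10` would lower to
    false
catch e
    e isa MethodError
end
```

Here `immutable_err` ends up `true`: the attempted write throws a `MethodError`.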

2 Likes

On the other hand, if a user writes an array literal filled with numeric literals which clearly doesn’t change size or escape the function context, the compiler should have enough information to treat it as static without needing to reach for an external package. That work is in progress (as noted above), but I can understand OP’s surprise that Julia’s mature & sophisticated lowering/compilation toolchain doesn’t yet do so.

3 Likes

That’s the thing, Julia’s lowering/compilation isn’t that sophisticated. The most sophisticated things about it are type inference and multiple dispatch. We don’t do that much optimization at the Julia level yet. We just generate reasonable LLVM IR and let LLVM do the rest.

4 Likes

We’re getting there. 1.8 added the effect system, and 1.9 will likely have some form of escape analysis.

4 Likes

Sure, and that’s great. But then, imho, we’re talking about serious optimization magic, much more so than in the current situation, where different data structures have different performance characteristics that are easily explainable and understandable.

So, imho, this is not a weird performance trick at all, different data structures are good for different cases.

3 Likes

Is this actually feasible to detect for vectors? I can easily imagine heap-to-stack optimizations for mutable structs because the size is fixed, and while I’m not as certain, I think >=2-dimensional arrays also can’t change their size. But vectors definitely have dynamic sizes, and while it’s simple enough to check for push!, popat!, etc. in the method, these calls could also happen in very nested method calls. Is the compiler capable of ruling out resizing in all nested non-inlined calls, or would it just give up even when the programmer knows none of those methods resize the array?
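
To make the concern concrete, here is a hypothetical sketch (the names `grow!` and `maybe_resizes` are mine): the resize can hide behind an innocuous-looking helper, so proving "this vector never changes size" means reasoning about every non-inlined callee, not just scanning the method body for `push!`:

```julia
# The resizing call is buried one level deep:
grow!(v) = push!(v, 0.0)

function maybe_resizes()
    v = [1.0, 2.0, 3.0]
    grow!(v)           # the compiler must see through this call to know
                       # whether `v` keeps a fixed size
    return length(v)
end
```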

That said, mutable StaticArrays could end up on the stack more readily than they already do.

1 Like

No, it can’t do that, due (in part) to arrays being implemented in C right now. More or less every interaction with arrays does a ccall, which you can’t easily optimize away, even with effect analysis, since you can’t really add blanket effects to ccall (it may do anything, after all).

To sort of prove my point, here are the timings of a second run on my machine, which runs a relatively recent master that (as far as I’m aware) already includes a big chunk of that effect analysis:

julia> @time run_it_array()
  0.255716 seconds (6.00 M allocations: 457.764 MiB, 7.29% gc time)

julia> @time run_it_tuple()
  0.025800 seconds (2 allocations: 160 bytes)

julia> versioninfo()
Julia Version 1.9.0-DEV.991
Commit 707f59b980 (2022-07-17 18:14 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.5 (ORCJIT, skylake)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_NUM_THREADS = 4

The escape analysis we do have on master definitely does lots of stuff for julia native structs though, be they mutable or not, which makes e.g. MArray from StaticArrays work like a charm, provided you then return an SArray of the same data.
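
A sketch of the pattern described above, assuming StaticArrays is installed (the function name `scaled` is mine, not from the thread): mutate an `MArray` locally, then freeze it to an `SArray` before returning, so escape analysis can keep the mutable buffer off the heap:

```julia
using StaticArrays

function scaled(i, j, k, Δx::SVector{3,Float64})
    x = MVector{3,Float64}(undef)   # mutable, fixed-size buffer
    @. x = (i, j, k) * Δx           # write into it in place
    return SVector{3}(x)            # return an immutable copy of the data
end
```

Because the `MVector` never escapes in mutable form, the compiler is free to stack-allocate it.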

4 Likes