Slower with threads

  1. You want static arrays wherever you create lots of them in a loop. They won’t incur allocations because their size is known so they’re put on the stack.
  2. You can’t modify static arrays, but usually you can store a completely new static array where some values are different than in the previous version, and this is usually fast because, again, no allocation is needed. The compiler is pretty smart about making such things fast. You could look at GitHub - JuliaObjects/Accessors.jl: Update immutable data for example which helps with that.
  3. Sparse arrays are something very different. There the optimization is that most elements are zero, so only the non-zero elements are saved with their positions in the array.
  4. MMatrix is slower than SMatrix, the mutability prohibits some performance optimizations.

When I switch to the main job, the optimization stops working! I think my optimization is not optimizing!

What you want to do is avoid creating new arrays, allocating memory, on the heap. If you do need to allocate memory, you should see if you can reuse them. This is what the bang functions are for such as mul!. Another method is to use broadcast assignment .=.

julia> @allocated A = rand(1:10, 1024, 1024) # Allocates 8 MiB

julia> @allocated B = rand(1:10, 1024, 1024) # Allocates 8 MiB

julia> @allocated C = zeros(1024, 1024) # Allocates 8 MiB

julia> @allocated A .+ B # Allocates 8 MiB to store result

julia> @allocated C .= A .+ B # Allocates only 64 bytes, avoids 8 MiB allocation

julia> using LinearAlgebra

julia> @allocated A * B # allocations due to compilation

julia> @allocated A * B # > 8 MiB allocated to store result

julia> @allocated mul!(C, A, B) # allocation due to compilation

julia> @allocated mul!(C, A, B) # ~31 KiB needed for multiplication, avoid allocating 8 MiB

julia> @allocated C .= A .* B # elementwise multiplication avoids allocating 8 MiB

StaticArrays.jl uses an optimization for small amounts of memory. The larger strategy to control memory allocation tightly.

@allocated A = rand(1:10, 1024, 1024) is different from A = rand(1:10, 1024, 1024)?

help?> @allocated

  A macro to evaluate an expression, discarding the resulting value, instead
  returning the total number of bytes allocated during evaluation of the

  See also @time, @timev, @timed, and @elapsed.

  julia> @allocated rand(10^6)

No. I’m just measuring allocations

Got it. I measured, each thread should allocate around 50 KiB. I love learning Julia, but for now I’ll have to program the hard way: just writing physics equations, until I learn more. But I will use C .= A .+ B, in everything possible. Thanks mkitti.

Note that in Julia 1.8 we now have an allocation profiler: