A nice explanation of memory stack vs. heap

This is from the Rust Book:

Both the stack and the heap are parts of memory that are available to your code to use at runtime, but they are structured in different ways. The stack stores values in the order it gets them and removes the values in the opposite order. This is referred to as last in, first out. Think of a stack of plates: when you add more plates, you put them on top of the pile, and when you need a plate, you take one off the top. Adding or removing plates from the middle or bottom wouldn’t work as well! Adding data is called pushing onto the stack, and removing data is called popping off the stack.

All data stored on the stack must have a known, fixed size. Data with an unknown size at compile time or a size that might change must be stored on the heap instead. The heap is less organized: when you put data on the heap, you request a certain amount of space. The memory allocator finds an empty spot in the heap that is big enough, marks it as being in use, and returns a pointer, which is the address of that location. This process is called allocating on the heap and is sometimes abbreviated as just allocating. Pushing values onto the stack is not considered allocating. Because the pointer is a known, fixed size, you can store the pointer on the stack, but when you want the actual data, you must follow the pointer.

Think of being seated at a restaurant. When you enter, you state the number of people in your group, and the staff finds an empty table that fits everyone and leads you there. If someone in your group comes late, they can ask where you’ve been seated to find you.

Pushing to the stack is faster than allocating on the heap because the allocator never has to search for a place to store new data; that location is always at the top of the stack. Comparatively, allocating space on the heap requires more work, because the allocator must first find a big enough space to hold the data and then perform bookkeeping to prepare for the next allocation.

Accessing data in the heap is slower than accessing data on the stack because you have to follow a pointer to get there. Contemporary processors are faster if they jump around less in memory. Continuing the analogy, consider a server at a restaurant taking orders from many tables. It’s most efficient to get all the orders at one table before moving on to the next table. Taking an order from table A, then an order from table B, then one from A again, and then one from B again would be a much slower process. By the same token, a processor can do its job better if it works on data that’s close to other data (as it is on the stack) rather than farther away (as it can be on the heap). Allocating a large amount of space on the heap can also take time.
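
To make the pointer-following point concrete in Julia (the language of this thread), here is a minimal sketch; pointer and Base.summarysize are real Base functions, but the printed address and byte count are illustrative and will vary from run to run:

# A Vector's elements live on the heap; the local binding is a small,
# fixed-size header whose data field points at that heap block.
v = [1, 2, 3]                 # requests space from the heap allocator
p = pointer(v)                # the heap address the binding refers to
println(p)                    # e.g. Ptr{Int64} @0x00007f… (varies per run)
println(Base.summarysize(v))  # total reachable bytes: header plus elements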

Ok, but can we “choose” which memory to use, or is it done “automatically”? And in the latter case, based on what?

It’s automatic. But roughly:

  • Arrays, Dicts, and mutable structs are on the heap
  • StaticArrays, Tuples, NamedTuples, and immutable structs are on the stack (but sometimes never exist at all); a rough sketch follows below
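
A rough way to check this from the REPL is Base.@allocated, which reports how many heap bytes an expression allocates. The Point/MPoint names below are made up for illustration, and the exact byte counts depend on your Julia version; a clever enough optimizer may even elide some of these allocations:

struct Point            # immutable: size known and fixed at compile time
    x::Float64
    y::Float64
end

mutable struct MPoint   # mutable: needs a stable (heap) address
    x::Float64
    y::Float64
end

make_point() = Point(1.0, 2.0)
make_mpoint() = MPoint(1.0, 2.0)
make_vector() = [1.0, 2.0]

make_point(); make_mpoint(); make_vector()  # warm up so compilation isn't counted

@allocated make_point()   # typically 0: stack, registers, or nothing at all
@allocated make_mpoint()  # typically nonzero: heap-allocated
@allocated make_vector()  # typically nonzero: heap-allocated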

Importantly, the compiler is free to change this in either direction if it can prove that it can safely (unobservably) do so. For example, if it knows that a mutable structure doesn’t outlive the current function then it can stack allocate it. Or in the other direction, if an immutable structure is large and it would be better to heap allocate it than to pass it around on the stack then it can also do that.
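
Here is a hedged sketch of the first direction (Counter is a made-up type; whether the allocation is really elided depends on your Julia version and on how simple the function is, so treat the 0 as a best case, not a guarantee):

mutable struct Counter
    n::Int
end

function count_up(m)
    c = Counter(0)       # mutable, but never leaves this function
    for i in 1:m
        c.n += i
    end
    return c.n           # only the Int escapes; c itself can be optimized away
end

count_up(10)             # warm up first so compilation isn't measured

@allocated count_up(10)  # 0 when the compiler proves c doesn't escape;
                         # may be nonzero on older Julia versions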

You can choose what kinds of data structures to use in your code. Take the following simple example:

using BenchmarkTools
using StaticArrays

function normal_array()
    arr = [1, 2, 3]    # ordinary Vector: heap-allocated
    return arr.^2      # broadcasting allocates a fresh Vector for the result
end

function static_array()
    arr = SVector(1, 2, 3)  # fixed-size, immutable: no heap allocation
    return arr.^2           # result is another SVector, also allocation-free
end

julia> @btime normal_array()
  40.583 ns (2 allocations: 224 bytes)
3-element Array{Int64,1}:
 1
 4
 9

julia> @btime static_array()
  0.001 ns (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Int64,1,3} with indices SOneTo(3):
 1
 4
 9

Since we don’t need to change the length of the vector arr, we can use a StaticArray and obtain the massive performance gain that results (see how there are 0 allocations? a sub-nanosecond timing like 0.001 ns also means the compiler constant-folded the whole computation away). However, if you need to change the length of the array, the StaticArray type won’t work:

function normal_array2()
    arr = [1,2,3]
    return push!(arr, arr.^2...)   # Vectors can grow in place
end

function static_array2()
    arr = SVector(1,2,3)
    return push!(arr, arr.^2...)   # SVectors are fixed-size, so this fails
end

julia> normal_array2()
6-element Array{Int64,1}:
 1
 2
 3
 1
 4
 9

julia> static_array2()
ERROR: MethodError: no method matching resize!(::SArray{Tuple{3},Int64,1,3}, ::Int64)
Closest candidates are:
  resize!(::Array{T,1} where T, ::Integer) at array.jl:1082
  resize!(::BitArray{1}, ::Integer) at bitarray.jl:799
  resize!(::VSCodeServer.JSON.Parser.PushVector, ::Integer) at c:\Users\mthel\.vscode\extensions\julialang.language-julia-1.0.10\scripts\packages\JSON\src\pushvector.jl:29
Stacktrace:
 [1] _append!(::SArray{Tuple{3},Int64,1,3}, ::Base.HasLength, ::Tuple{Int64,Int64,Int64}) at .\array.jl:987
 [2] append!(::SArray{Tuple{3},Int64,1,3}, ::Tuple{Int64,Int64,Int64}) at .\array.jl:981
 [3] push!(::SArray{Tuple{3},Int64,1,3}, ::Int64, ::Int64, ::Int64) at .\array.jl:982
 [4] static_array2() at .\REPL[20]:3
 [5] top-level scope at REPL[23]:1

Because “normal” arrays are resizable, they have to be allocated somewhere on the heap; but since StaticArrays are not resizable, they can be stored on the stack (or maybe never exist at all, as @mbauman notes above).
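
Another way to see the distinction: the size of a StaticArray is part of its type, so the compiler knows it at compile time, whereas a Vector’s length is purely a runtime property. A quick, easily verified check:

using StaticArrays

isbitstype(SVector{3,Int})  # true: the size is baked into the type
isbitstype(Vector{Int})     # false: the length (hence size) can change at runtime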

Uh, I am experiencing something strange… I wanted to understand how small is small, so I tried benchmarking your example for different sizes of the array. The problem is that, once I reached 10000, it suddenly stopped working (it has been stuck for several tens of minutes now).
I have 16GB RAM and top reports only a small portion of it in use…

julia> using BenchmarkTools
julia> using StaticArrays
julia> function normal_array(n)
           arr = collect(1:n)
           return arr.^2
       end
normal_array (generic function with 1 method)
julia> function static_array(n)
           arr = SVector{n}(1:n)
           return arr.^2
       end
static_array (generic function with 1 method)

julia> @btime normal_array(5);
  60.526 ns (2 allocations: 256 bytes)
julia> @btime static_array(5);
  0.027 ns (0 allocations: 0 bytes)

julia> @btime normal_array(100);
  159.017 ns (2 allocations: 1.75 KiB)
julia> @btime static_array(100);
  10.585 ns (0 allocations: 0 bytes)

julia> @btime normal_array(1000);
  1.106 μs (2 allocations: 15.88 KiB)
julia> @btime static_array(1000);
  125.751 ns (0 allocations: 0 bytes)

julia> @btime normal_array(10000);
  12.117 μs (4 allocations: 156.41 KiB)
julia> @btime static_array(10000); # stuck here !!

Note that in the current implementation, working with large StaticArrays puts a lot of stress on the compiler, and becomes slower than Base.Array as the size increases. A very rough rule of thumb is that you should consider using a normal Array for arrays larger than 100 elements.

From StaticArrays documentation.
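
A hedged sketch of why this happens: the length of an SVector is a type parameter, so every new length is a brand-new type, and StaticArrays generates fully unrolled code specialized for each one. At length 10000 the compiler is effectively asked to emit a 10000-element unrolled kernel, which is where the time goes:

using StaticArrays

typeof(SVector{3}(1:3))   # the length 3 appears in the type itself
typeof(SVector{4}(1:4))   # length 4 is a different type entirely, so every
                          # function touching it gets compiled all over again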

Ok… still, I find it counterintuitive that there is such a nonlinearity… at size 1000 StaticArrays is still 10 times faster than a normal array, while at 2000 it gets so stuck that you need to kill the terminal…

Thank you

I don’t know enough to explain why it doesn’t scale, but, at least for now, it’s just something you have to consider when using StaticArrays in your code. As an aside, I’ve been learning Rust for a couple of weeks now. Rust forces you to think about nearly every minute detail of your code and the outcomes it could produce, which is proving very helpful in my Julia projects. I actually don’t think I’ll use Rust for very much, but I’m going to continue practicing it because it’s making me a better Julia programmer. The nice thing in this case is that you can just use regular arrays and not worry about it… which is what you should probably be doing, IMO, until your code is mature and unlikely to change much, at which point you can start identifying places to squeeze out every last drop of performance.

While that may be a limitation that Rust decided to have, it is not at all true for libc, and so it does not necessarily have to be true for Julia, depending on how it’s implemented. (See man alloca for details.) The reason one does not pile willy-nilly onto the stack is that it is small: typically it is configured as one megabyte per thread. And unfortunately there is no graceful error handling for when you run out.

Back in the day, when you ran out of stack weird things happened as the stack started to trample on code or other data. Since the advent of virtual memory managers, when you run out of stack space it triggers a virtual memory exception which may blow up your program but it won’t cause anomalous behavior in other programs.
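
In Julia specifically, plain runaway recursion is one of the more graceful cases: the guard-page fault gets translated into a catchable StackOverflowError. A minimal sketch (overflow is a made-up name):

overflow(n) = overflow(n + 1) + 1   # no base case: every call pushes a frame

try
    overflow(1)
catch e
    println(typeof(e))              # StackOverflowError
end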

And for very small data used only locally within a function, e.g. a ComplexF64 value or a 3-component StaticArray, it can avoid both the heap and the stack and store the data directly in registers.
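
You can peek at this with Julia’s introspection macros. A sketch (sumsq is a made-up name, and the exact output depends on your CPU and Julia version):

using StaticArrays

sumsq(v) = sum(v .* v)   # small, immutable, fixed-size input

@code_llvm sumsq(SVector(1.0, 2.0, 3.0))
# typically just a few loads plus fmul/fadd instructions: no allocation
# calls, and once inlined into a caller the values can live in registers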

Moreover, because modern CPUs typically have more physical registers than are exposed in the instruction set, with register renaming my understanding is that the CPU itself can take a nominally stack-allocated value and store it in a register instead, e.g. to eliminate a register spill.
