# Local arrays in for loops

Didi you see the `temp3` version above that does not allocate anything? That one solves the problem, the trick is to parameterize the input arrays and use the sizes inside the function. It might be my eyes, but the result seemed pretty to me.

I would still need a vector of vectors:

``````edges = @SVector [vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength];
``````

Where `elementLength = size(elements[1],1)`. I use these later for various purposes. You got around this by not storing the edges for later use. How could I keep the edges while avoiding a global variable?
Thanks a lot!

The best I could come by is this:

``````function temp1(vertexList::AbstractVector{SVector{N,T1}},
elements::AbstractVector{SVector{N,T2}};
edges = zero(MVector{N,SVector{N,T1}})) where {N,T1,T2}
acc = zero(SVector{N,T1})
@inbounds for k in 1:size(elements,1)
vertices = vertexList[elements[k]];
for i in 1:N
edges[i] = vertices[i%N+1]-vertices[i]
end
edgesNorm = norm.(edges)
acc += edgesNorm
# edges available to do more stuff
end
return acc
end

``````

Here, `edges` is an optional parameter and the only possible allocation in the function. If you don’t pass it, you get:

``````julia> @btime temp1(\$vertexList,\$elements)
4.953 μs (1 allocation: 80 bytes)

``````

If you preallocate it and pass it, you get:

``````julia> edges = zero(MVector{3,SVector{3,Float64}});

julia> @btime temp1(\$vertexList,\$elements,edges=\$edges)
4.938 μs (0 allocations: 0 bytes)

``````

Thanks for the help, but this would break loop parallelization. I can’t come up with a better solution than the global var either.
It will do for now.
Thanks again!

If you can, please post here the final solution you have. I could not avoid all allocations in the case of the global variables. I do not know what I am missing there.

I don’t really know what’s happening, but I can’t get it down to zero anymore. I can get it down to one by doing something really weird.

``````using StaticArrays
using LinearAlgebra
N=1000;
vertexList = Vector{StaticArrays.SVector{3,Float64}}(undef,N);
elements = Vector{StaticArrays.SVector{3,Int}}(undef,N);
for i in 1:1000
vertexList[i] = SA_F64[rand(),rand(),rand()];
elements[i] = SA[rand(1:N),rand(1:N),rand(1:N)];
end
elementLength = length(elements[1]);

function temp1(vertexList,elements)
elementLength = length(elements[1]);
acc = SA_F64[0,0,0];
for k in 1:size(elements,1)
vertices = vertexList[elements[k]];
edges = @SVector [vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength];
edgesNorm = norm.(edges);
acc += edgesNorm;
end
return acc;
end
@time temp1(vertexList,elements);
``````

Note that it only works if `elementLength` is defined twice. Otherwise, I get 18k allocations.
Is someone is able to explain this?

Well the part of the two definitions of the length of the vector is because, as I mentioned, the `@SVector` macro is evaluated at parse time. The `elementLength` used to define the length of the vector has nothing to do with the `elementLength` inside the function. Note this:

``````julia> using StaticArrays

julia> f(N) = (println(N), @SVector [0 for i in 1:N])
ERROR: LoadError: UndefVarError: N not defined
Stacktrace:
[1] top-level scope at none:1
[2] eval(::Module, ::Any) at ./boot.jl:331
[3] @SVector(::LineNumberNode, ::Module, ::Any) at /home/leandro/.julia/packages/StaticArrays/l7lu2/src/SVector.jl:63
in expression starting at REPL[2]:1

julia> N = 3
3

julia> f(N) = (println(N), @SVector [0 for i in 1:N])
f (generic function with 1 method)

julia> f(1)
1
(nothing, [0, 0, 0])

``````

Note that in the final call, which worked, the vector generated had still length 3. The function had already been defined when it was first parsed with `N=3`. That is why you need to define that length outside the function, but that breaks completely the generality of the function.

Besides that, my opinion is that the `edges` array you are willing to store and mutate is just not immutable enough to be stack allocated. If you want to have it mutable and flexible you will need to allocate it on the heap. But that may have some solution I am not seeing.

1 Like

I understood why I have to define it outside the function. What I meant to say in my previous post is that I also have to define it inside. That is the weird part. If I don’t, I get a lot of allocations.

Yes, that is weird.

Maybe THIS `elementLength` becomes type unstable and that causes the allocations. Not related anymore to the size of the static array.

Edit: yes, that is it:

``````julia> @code_warntype temp1(vertexList,elements)
Variables
#self#::Core.Compiler.Const(temp1, false)
vertexList::Array{SArray{Tuple{3},Float64,1,3},1}
elements::Array{SArray{Tuple{3},Int64,1,3},1}
acc::Any
@_5::Union{Nothing, Tuple{Int64,Int64}}
k::Int64
#7::var"#7#8"{SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}}
vertices::SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}
256::var"#7#8"{SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}}
edges::SArray{Tuple{3},_A,1,3} where _A
edgesNorm::Any

Body::Any
1 ─       (acc = Base.getindex(Main.SA_F64, 0, 0, 0))
│   %2  = Main.size(elements, 1)::Int64
│   %3  = (1:%2)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_5 = Base.iterate(%3))
│   %5  = (@_5 === nothing)::Bool
│   %6  = Base.not_int(%5)::Bool
└──       goto #4 if not %6
2 ┄ %8  = @_5::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (k = Core.getfield(%8, 1))
│   %10 = Core.getfield(%8, 2)::Int64
│   %11 = Base.getindex(elements, k)::SArray{Tuple{3},Int64,1,3}
│         (vertices = Base.getindex(vertexList, %11))
│   %13 = Main.:(var"#7#8")::Core.Compiler.Const(var"#7#8", false)
│   %14 = Core.typeof(vertices)::Core.Compiler.Const(SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}, false)
│   %15 = Core.apply_type(%13, %14)::Core.Compiler.Const(var"#7#8"{SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}}, false)
│         (#7 = %new(%15, vertices))
│         (256 = #7)
│   %18 = Core.apply_type(Main.SVector, 3)::Core.Compiler.Const(SArray{Tuple{3},T,1,3} where T, false)
│   %19 = (256)(1)::Any
│   %20 = (256)(2)::Any
│   %21 = (256)(3)::Any
│   %22 = Core.tuple(%19, %20, %21)::Tuple{Any,Any,Any}
│         (edges = (%18)(%22))
A} where _A<:Tuple
│         (edgesNorm = Base.materialize(%24))
│         (acc = acc + edgesNorm)
│         (@_5 = Base.iterate(%3, %10))
│   %28 = (@_5 === nothing)::Bool
│   %29 = Base.not_int(%28)::Bool
└──       goto #4 if not %29
3 ─       goto #2
4 ┄       return acc

``````

The problem of course goes away if you set `elementLength` as const:

``````const elementLength = length(elements[1]);

``````

I was finally able to get rid of all allocations!

``````
function temp4(vertexList::AbstractVector{SVector{N,T}},elements) where {N,T}
acc = zero(MVector{N,T})
for k in 1:size(elements,1)
vertices = vertexList[elements[k]]
edges = fill(zero(SVector{N,T}),MVector{N,SVector{N,T}})
for i in 1:N
edges[i] = vertices[i%N+1] - vertices[i]
acc[i] += norm(edges[i])
end
end
return SVector{N,T}(acc)
end

using BenchmarkTools
N=1000
vertexList = [ rand(SVector{3,Float64}) for i in 1:N ];
elements = [ SVector(rand(1:N),rand(1:N),rand(1:N)) for i in 1:N ];

julia> @btime temp4(\$vertexList,\$elements)
4.237 μs (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
670.809218674305
672.9989019068689
678.1027515172907

``````

What did the trick was to use `fill` to create the the `edges ` mutable array. For some reason `zero` allocated the array, but fill does not.

It was a good exercise. I apologize for perhaps taking too much of your time and perhaps discourage other people to answer your questions.

If I may leave a suggestion to get better advice from every one, always post your complete runnable code, with the output you are seeing, so people can readily copy and paste and reproduce the results, at each step. That helps a lot everyone trying to help.

edit:

Actually using `zero` also works. I don’t know why I didn’t get the solution before. I thought I had tried. Probably I got something wrong somewhere else:

``````function temp4(vertexList::AbstractVector{SVector{N,T}},elements) where {N,T}
acc = zero(MVector{N,T})
for k in 1:size(elements,1)
vertices = vertexList[elements[k]]
edges = zero(MVector{N,SVector{N,T}})
for i in 1:N
edges[i] = vertices[i%N+1] - vertices[i]
acc[i] += norm(edges[i])
end
# edges available to do something else
end
return SVector{N,T}(acc)
end

using BenchmarkTools
N=1000
vertexList = [ rand(SVector{3,Float64}) for i in 1:N ];
elements = [ SVector(rand(1:N),rand(1:N),rand(1:N)) for i in 1:N ];

julia> @btime temp4(\$vertexList,\$elements)
4.231 μs (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
678.0587466021068
685.3610238961863
673.6792450131684

``````
2 Likes

Thanks a lot for your time and definitely don’t apologize after helping out so much!
Your example works and I could modify it a bit to make it more like my real code and it still worked. However, my real code does not work. For whatever reason, this method led to a lot of allocations and incremental changes from `@SVector [..., globalVar]` to using `zero(MVector...)` seemed to give me numbers of allocations that I could not explain at all (small changes would make the number jump by a factor of 35, which I don’t understand).
I spent several hours on this and whatever I do to your example to make it more like my code still leads to zero allocations, while all sorts of changes to my real code still led to hundreds of thousands of allocations.
I need to stop chasing this rabbit for a while. The `@SVector` version was only 10% faster than the version with tons of allocations, while the latter doesn’t require a global var. So I guess I’ll stick to that one for now.
Thanks again and I will post code from the start next time (as I said before, I thought this was a common issue, maybe with an easy solution!).

2 Likes

Just backing up a little bit, I started with your `temp1` from Local arrays in for loops - #8 by Ribeiro and made only two very small changes:

1. Removed all global variables
2. Made `v` an `SVector`

Here’s your `temp1` and my `temp1b`:

``````v = SVector((1:elementLength)...)
vNext = v.%elementLength .+ 1;

function temp1(vertexList,elements,N)
for k in 1:N
vertices = vertexList[elements[k]];
edges = vertices[vNext]-vertices[v];
edgesNorm = norm.(edges);
end
end

function temp1b(vertexList, elements, v, vNext)
for k in 1:length(vertexList)
vertices = vertexList[elements[k]];
edges = vertices[vNext]-vertices[v];
edgesNorm = norm.(edges);
end
end

@btime temp1(\$vertexList, \$elements, \$N)
@btime temp1b(\$vertexList, \$elements, \$v, \$vNext)
``````

and the difference is pretty striking:

``````julia> @btime temp1(\$vertexList, \$elements, \$N)
278.985 μs (9000 allocations: 656.25 KiB)

julia> @btime temp1b(\$vertexList, \$elements, \$v, \$vNext)
2.401 μs (0 allocations: 0 bytes)
``````

More than 100x faster, with no allocations, and no need to hard-code the number of elements.

The moral here is that the guidance against using global variables is incredibly important (that’s why it’s the first performance tip).

4 Likes

You see, @Ribeiro , that is why I was apologizing

At least there are some things in favour of what I have proposed. In the original problem your function was not returning anything, thus that was not really comparable, afterwards you added the computation of the `acc` (acceleration, I guess). Adding that to @rdeits `temp1b` gives:

``````function temp1b(vertexList, elements, v, vNext)
acc = zero(SVector{3,Float64})
for k in 1:length(vertexList)
vertices = vertexList[elements[k]];
edges = vertices[vNext]-vertices[v];
edgesNorm = norm.(edges);
acc += edgesNorm
end
acc
end
``````
``````julia> @btime temp1b(\$vertexList, \$elements, \$v, \$vNext)
7.639 μs (0 allocations: 0 bytes)

``````

Still no allocations, but about 2x slower than my `temp4`.

Also, if we include the allocation of the auxiliary arrays `v` and `vNext`, we get:

``````function temp1b_alloc(vertexList,elements)
elementLength = length(elements[1])
v = SVector((1:elementLength)...)
vNext = v.%elementLength .+ 1;
temp1b(vertexList, elements, v, vNext)
end
``````
``````julia> @btime temp1b_alloc(\$vertexList, \$elements)
9.019 μs (11 allocations: 416 bytes)
``````

Of course than it allocates and it is somewhat slower still.***

This kind of thing is probably associated to type instabilities. Sometimes they are not easy to track. Whenever you can, post more of your code here (or in other thread, to get s fresh start).

***As shown below, these allocations can be solved by setting `v` with an `ntuple`.

Ok, but then why does this:

``````function temp1c(vertexList,elements)
elementLength=size(elements[1],1)
v=SVector((1:elementLength)...)
vNext = v .% elementLength .+ 1
for k in 1:size(elements,1)
vertices = vertexList[elements[k]];
edges = vertices[vNext]-vertices[v];
edgesNorm = norm.(edges);
end
end
``````

perform 9011 allocations?

1 Like

`SVector((1:elementLength)...)` is not type-stable. Use `ntuple(i->i, elementLength)` instead because it can be inlined by the compiler.

``````function temp1c(vertexList, elements)
elementLength=size(elements[1],1)
v=SVector(ntuple(i->i, elementLength))
vNext = v .% elementLength .+ 1
for k in 1:size(elements,1)
vertices = vertexList[elements[k]];
edges = vertices[vNext]-vertices[v];
edgesNorm = norm.(edges);
end
end
``````

I’d also annotate the type of the second argument to avoid relying on the array being non-empty.

2 Likes

I’m pretty confused about type-stability. The documentation makes it sound like sometimes it’s good to tell the compiler what type variables are, but that you shouldn’t do it all the time.
All my functions and variables are type-fixed. So why should I or shouldn’t I specify the type of every variable and argument in my code?
Thanks a lot!

1 Like

I have to admit that you reached a case where solving those instabilities relied on subtle and non-intuitive changes for me. The use `ntuple` there to solve that part is something hard to understand (although it appears here from time to time).

The general answer to your question here is that “specifying” or, annotating, types, does not solve necessarily the problem of type instability. The instability is a property of the values the variables assume at run time.

Annotating the types required for functions can make your code not run by raising method errors if the types are not what you where expecting. That can help to track errors and indicate that something was not created with the expected type, but where the type stability has to be guaranteed is at the moment of creating the each value for each variable.

For example, the `temp1c` function (for which the compiler was able to infer types and non-allocate):

``````function temp1c(vertexList,elements)
elementLength = length(elements[1])
v = SVector(ntuple(i->i,elementLength))
vNext = v.%elementLength .+ 1;
temp1b(vertexList, elements, v, vNext)
end
``````

You could defined it as:

``````function temp1c(vertexList::Vector{SVector{N,Float64}},
elements::Vector{SVector{N,Int}}) where N
``````

which conforms the type of variable you are feeding to the function. That will raise an error if the variables are not of those types, but it has no performance gain at all (annotating or not, the compiler will know what the types of variables are).

Yet, that can be useful to detect type-instabilities. For example, if you generated the `elements` vector in a type-unstable form, the function becomes horribly slow:

``````julia> elements = Any[ SVector(rand(1:N),rand(1:N),rand(1:N)) for i in 1:N ];

julia> @btime temp1c(\$vertexList, \$elements)
314.053 μs (11006 allocations: 672.13 KiB)

``````

But if you had used the annotated version of the function, you would just get an error:

``````julia> @btime temp1c(\$vertexList, \$elements)
ERROR: MethodError: no method matching temp1c(::Array{SArray{Tuple{3},Float64,1,3},1}, ::Array{Any,1})

``````

which might help you to find where the instability happened (if it happened before that function call).

One thing: We could have figured out that in your version that used `v = SVector((1:elementLength)...)` there was a type-instability, by using the `@code_warntype` macro, which is always your friend:

``````julia> @code_warntype temp1c(vertexList,elements)
Variables
#self#::Core.Compiler.Const(temp1c, false)
vertexList::Array{SArray{Tuple{3},Float64,1,3},1}
elements::Array{SArray{Tuple{3},Int64,1,3},1}
elementLength::Int64
v::Any
vNext::Any

Body::SArray{Tuple{3},Float64,1,3}
1 ─ %1 = Base.getindex(elements, 1)::SArray{Tuple{3},Int64,1,3}
│        (elementLength = Main.length(%1))
│   %3 = (1:elementLength::Core.Compiler.Const(3, false))::Core.Compiler.Const(1:3, false)
│        (v = Core._apply_iterate(Base.iterate, Main.SVector, %3))
│   %5 = Base.broadcasted(Main.:%, v, elementLength::Core.Compiler.Const(3, false))::Any
│   %6 = Base.broadcasted(Main.:+, %5, 1)::Any
│        (vNext = Base.materialize(%6))
│   %8 = Main.temp1b(vertexList, elements, v, vNext)::SArray{Tuple{3},Float64,1,3}
└──      return %8

``````

Note that `v` and `vNext` are inferred to be of type `Any`, which is bad (and red in the REPL). Those are solved changing that to the `ntuple` generator (to figure out how to solve that type instability would be another problem, but probably one could ask a much more specific question in the forum if we had reached that point).

2 Likes

“Statically-sized” types like `SVector`s or tuples require some care to ensure type stability.

`Vector` from the standard library does not include its length in the type signature, so `collect(1:n)` is always a `Vector` regardless of the value of `n`.

On the other hand, calling `SVector(1:n...)` produces different types for different values of `n`, because the length of a `StaticArray` is included in the type.
Ideally, if `n` is a compile time constant, it should be propagated, and the type of `SVector(1:n...)` should be inferable, but the Julia compiler currently is unable to do that
`ntuple(f, n)` can be inlined for small `n` if `n` is known at compile time.
Julia compiler can detect that `elementLength = length(elements[1])` is a compile time constant if `elements[1]` is a static array (with length <=16) and, hence, infer `ntuple` return type statically.

Now, to specify types or not depends on the specific use case. Here, we heavily rely on `vertexList` and `elements` containing static vectors to get rid of allocations, so specifying types is justified.
You may also write a type-generic variant for other cases which might be less performant.

3 Likes

This is complicated =P.
Ok, so how can I add this to the code and avoid allocations?

``````function temp1c(vertexList,elements)
elementLength=size(elements[1],1)
v=SVector(ntuple(i->i,elementLength))
vNext = v .% elementLength .+ 1
randvec = SA_F64[1,2,3]
for k in 1:size(elements,1)
vertices = vertexList[elements[k]];
edges = vertices[vNext]-vertices[v];
edgesNorm = norm.(edges);
foo = SVector(ntuple(i->dot(edges[i],randvec),elementLength))
end
end
``````
1 Like