Local arrays in for loops

Didi you see the temp3 version above that does not allocate anything? That one solves the problem, the trick is to parameterize the input arrays and use the sizes inside the function. It might be my eyes, but the result seemed pretty to me.

I would still need a vector of vectors:

edges = @SVector [vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength];

Where elementLength = size(elements[1],1). I use these later for various purposes. You got around this by not storing the edges for later use. How could I keep the edges while avoiding a global variable?
Thanks a lot!

The best I could come by is this:

function temp1(vertexList::AbstractVector{SVector{N,T1}},
               elements::AbstractVector{SVector{N,T2}};
               edges = zero(MVector{N,SVector{N,T1}})) where {N,T1,T2}
    acc = zero(SVector{N,T1})
    @inbounds for k in 1:size(elements,1)
        vertices = vertexList[elements[k]];
        for i in 1:N
           edges[i] = vertices[i%N+1]-vertices[i]
        end
        edgesNorm = norm.(edges)
        acc += edgesNorm
        # edges available to do more stuff
    end
    return acc
end

Here, edges is an optional parameter and the only possible allocation in the function. If you don’t pass it, you get:

julia> @btime temp1($vertexList,$elements)
  4.953 μs (1 allocation: 80 bytes)

If you preallocate it and pass it, you get:

julia> edges = zero(MVector{3,SVector{3,Float64}});

julia> @btime temp1($vertexList,$elements,edges=$edges)
  4.938 μs (0 allocations: 0 bytes)

Thanks for the help, but this would break loop parallelization. I can’t come up with a better solution than the global var either.
It will do for now.
Thanks again!

If you can, please post here the final solution you have. I could not avoid all allocations in the case of the global variables. I do not know what I am missing there.

I don’t really know what’s happening, but I can’t get it down to zero anymore. I can get it down to one by doing something really weird.

using StaticArrays
using LinearAlgebra
N=1000;
vertexList = Vector{StaticArrays.SVector{3,Float64}}(undef,N);
elements = Vector{StaticArrays.SVector{3,Int}}(undef,N);
for i in 1:1000
    vertexList[i] = SA_F64[rand(),rand(),rand()];
    elements[i] = SA[rand(1:N),rand(1:N),rand(1:N)];
end
elementLength = length(elements[1]);

function temp1(vertexList,elements)
    elementLength = length(elements[1]);
    acc = SA_F64[0,0,0];
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]];
        edges = @SVector [vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength];
        edgesNorm = norm.(edges);
        acc += edgesNorm;
    end
    return acc;
end
@time temp1(vertexList,elements);

Note that it only works if elementLength is defined twice. Otherwise, I get 18k allocations.
Is someone is able to explain this?

Well the part of the two definitions of the length of the vector is because, as I mentioned, the @SVector macro is evaluated at parse time. The elementLength used to define the length of the vector has nothing to do with the elementLength inside the function. Note this:

julia> using StaticArrays

julia> f(N) = (println(N), @SVector [0 for i in 1:N])
ERROR: LoadError: UndefVarError: N not defined
Stacktrace:
 [1] top-level scope at none:1
 [2] eval(::Module, ::Any) at ./boot.jl:331
 [3] @SVector(::LineNumberNode, ::Module, ::Any) at /home/leandro/.julia/packages/StaticArrays/l7lu2/src/SVector.jl:63
in expression starting at REPL[2]:1

julia> N = 3
3

julia> f(N) = (println(N), @SVector [0 for i in 1:N])
f (generic function with 1 method)

julia> f(1)
1
(nothing, [0, 0, 0])


Note that in the final call, which worked, the vector generated had still length 3. The function had already been defined when it was first parsed with N=3. That is why you need to define that length outside the function, but that breaks completely the generality of the function.

Besides that, my opinion is that the edges array you are willing to store and mutate is just not immutable enough to be stack allocated. If you want to have it mutable and flexible you will need to allocate it on the heap. But that may have some solution I am not seeing.

1 Like

I understood why I have to define it outside the function. What I meant to say in my previous post is that I also have to define it inside. That is the weird part. If I don’t, I get a lot of allocations.

Yes, that is weird.

Maybe THIS elementLength becomes type unstable and that causes the allocations. Not related anymore to the size of the static array.

Edit: yes, that is it:

julia> @code_warntype temp1(vertexList,elements)
Variables
  #self#::Core.Compiler.Const(temp1, false)
  vertexList::Array{SArray{Tuple{3},Float64,1,3},1}
  elements::Array{SArray{Tuple{3},Int64,1,3},1}
  acc::Any
  @_5::Union{Nothing, Tuple{Int64,Int64}}
  k::Int64
  #7::var"#7#8"{SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}}
  vertices::SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}
  256::var"#7#8"{SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}}
  edges::SArray{Tuple{3},_A,1,3} where _A
  edgesNorm::Any

Body::Any
1 ─       (acc = Base.getindex(Main.SA_F64, 0, 0, 0))
│   %2  = Main.size(elements, 1)::Int64
│   %3  = (1:%2)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_5 = Base.iterate(%3))
│   %5  = (@_5 === nothing)::Bool
│   %6  = Base.not_int(%5)::Bool
└──       goto #4 if not %6
2 ┄ %8  = @_5::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (k = Core.getfield(%8, 1))
│   %10 = Core.getfield(%8, 2)::Int64
│   %11 = Base.getindex(elements, k)::SArray{Tuple{3},Int64,1,3}
│         (vertices = Base.getindex(vertexList, %11))
│   %13 = Main.:(var"#7#8")::Core.Compiler.Const(var"#7#8", false)
│   %14 = Core.typeof(vertices)::Core.Compiler.Const(SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}, false)
│   %15 = Core.apply_type(%13, %14)::Core.Compiler.Const(var"#7#8"{SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}}, false)
│         (#7 = %new(%15, vertices))
│         (256 = #7)
│   %18 = Core.apply_type(Main.SVector, 3)::Core.Compiler.Const(SArray{Tuple{3},T,1,3} where T, false)
│   %19 = (256)(1)::Any
│   %20 = (256)(2)::Any
│   %21 = (256)(3)::Any
│   %22 = Core.tuple(%19, %20, %21)::Tuple{Any,Any,Any}
│         (edges = (%18)(%22))
│   %24 = Base.broadcasted(Main.norm, edges)::Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(norm),_
A} where _A<:Tuple
│         (edgesNorm = Base.materialize(%24))
│         (acc = acc + edgesNorm)
│         (@_5 = Base.iterate(%3, %10))
│   %28 = (@_5 === nothing)::Bool
│   %29 = Base.not_int(%28)::Bool
└──       goto #4 if not %29
3 ─       goto #2
4 ┄       return acc


The problem of course goes away if you set elementLength as const:

const elementLength = length(elements[1]);

I was finally able to get rid of all allocations!


function temp4(vertexList::AbstractVector{SVector{N,T}},elements) where {N,T}
    acc = zero(MVector{N,T})
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]]
        edges = fill(zero(SVector{N,T}),MVector{N,SVector{N,T}})
        for i in 1:N
            edges[i] = vertices[i%N+1] - vertices[i]
            acc[i] += norm(edges[i])
        end
    end
    return SVector{N,T}(acc)
end

using BenchmarkTools
N=1000
vertexList = [ rand(SVector{3,Float64}) for i in 1:N ];
elements = [ SVector(rand(1:N),rand(1:N),rand(1:N)) for i in 1:N ];

julia> @btime temp4($vertexList,$elements)
  4.237 μs (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
 670.809218674305
 672.9989019068689
 678.1027515172907

What did the trick was to use fill to create the the edges mutable array. For some reason zero allocated the array, but fill does not.

It was a good exercise. I apologize for perhaps taking too much of your time and perhaps discourage other people to answer your questions.

If I may leave a suggestion to get better advice from every one, always post your complete runnable code, with the output you are seeing, so people can readily copy and paste and reproduce the results, at each step. That helps a lot everyone trying to help.

edit:

Actually using zero also works. I don’t know why I didn’t get the solution before. I thought I had tried. Probably I got something wrong somewhere else:

function temp4(vertexList::AbstractVector{SVector{N,T}},elements) where {N,T}
    acc = zero(MVector{N,T})
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]]
        edges = zero(MVector{N,SVector{N,T}})
        for i in 1:N
            edges[i] = vertices[i%N+1] - vertices[i]
            acc[i] += norm(edges[i])
        end
        # edges available to do something else
    end
    return SVector{N,T}(acc)
end

using BenchmarkTools
N=1000
vertexList = [ rand(SVector{3,Float64}) for i in 1:N ];
elements = [ SVector(rand(1:N),rand(1:N),rand(1:N)) for i in 1:N ];

julia> @btime temp4($vertexList,$elements)
  4.231 μs (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
 678.0587466021068
 685.3610238961863
 673.6792450131684

2 Likes

Thanks a lot for your time and definitely don’t apologize after helping out so much!
Your example works and I could modify it a bit to make it more like my real code and it still worked. However, my real code does not work. For whatever reason, this method led to a lot of allocations and incremental changes from @SVector [..., globalVar] to using zero(MVector...) seemed to give me numbers of allocations that I could not explain at all (small changes would make the number jump by a factor of 35, which I don’t understand).
I spent several hours on this and whatever I do to your example to make it more like my code still leads to zero allocations, while all sorts of changes to my real code still led to hundreds of thousands of allocations.
I need to stop chasing this rabbit for a while. The @SVector version was only 10% faster than the version with tons of allocations, while the latter doesn’t require a global var. So I guess I’ll stick to that one for now.
Thanks again and I will post code from the start next time (as I said before, I thought this was a common issue, maybe with an easy solution!).

2 Likes

Just backing up a little bit, I started with your temp1 from Local arrays in for loops - #8 by Ribeiro and made only two very small changes:

  1. Removed all global variables
  2. Made v an SVector

Here’s your temp1 and my temp1b:

v = SVector((1:elementLength)...)
vNext = v.%elementLength .+ 1;

function temp1(vertexList,elements,N)
    for k in 1:N
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
    end
end

function temp1b(vertexList, elements, v, vNext)
    for k in 1:length(vertexList)
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
    end
end

@btime temp1($vertexList, $elements, $N)
@btime temp1b($vertexList, $elements, $v, $vNext)

and the difference is pretty striking:

julia> @btime temp1($vertexList, $elements, $N)
  278.985 μs (9000 allocations: 656.25 KiB)

julia> @btime temp1b($vertexList, $elements, $v, $vNext)
  2.401 μs (0 allocations: 0 bytes)

More than 100x faster, with no allocations, and no need to hard-code the number of elements.

The moral here is that the guidance against using global variables is incredibly important (that’s why it’s the first performance tip).

4 Likes

You see, @Ribeiro , that is why I was apologizing :smile:

At least there are some things in favour of what I have proposed. In the original problem your function was not returning anything, thus that was not really comparable, afterwards you added the computation of the acc (acceleration, I guess). Adding that to @rdeits temp1b gives:

function temp1b(vertexList, elements, v, vNext)
    acc = zero(SVector{3,Float64})
    for k in 1:length(vertexList)
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
        acc += edgesNorm
    end
    acc
end
julia> @btime temp1b($vertexList, $elements, $v, $vNext)
  7.639 μs (0 allocations: 0 bytes)

Still no allocations, but about 2x slower than my temp4.

Also, if we include the allocation of the auxiliary arrays v and vNext, we get:

function temp1b_alloc(vertexList,elements)
  elementLength = length(elements[1])
  v = SVector((1:elementLength)...)
  vNext = v.%elementLength .+ 1;
  temp1b(vertexList, elements, v, vNext)
end
julia> @btime temp1b_alloc($vertexList, $elements)
  9.019 μs (11 allocations: 416 bytes)

Of course than it allocates and it is somewhat slower still.***

This kind of thing is probably associated to type instabilities. Sometimes they are not easy to track. Whenever you can, post more of your code here (or in other thread, to get s fresh start).

***As shown below, these allocations can be solved by setting v with an ntuple.

Ok, but then why does this:

function temp1c(vertexList,elements)
    elementLength=size(elements[1],1)
    v=SVector((1:elementLength)...)
    vNext = v .% elementLength .+ 1
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
    end
end

perform 9011 allocations?

1 Like

SVector((1:elementLength)...) is not type-stable. Use ntuple(i->i, elementLength) instead because it can be inlined by the compiler.

function temp1c(vertexList, elements)
    elementLength=size(elements[1],1)
    v=SVector(ntuple(i->i, elementLength))
    vNext = v .% elementLength .+ 1
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
    end
end

I’d also annotate the type of the second argument to avoid relying on the array being non-empty.

2 Likes

I’m pretty confused about type-stability. The documentation makes it sound like sometimes it’s good to tell the compiler what type variables are, but that you shouldn’t do it all the time.
All my functions and variables are type-fixed. So why should I or shouldn’t I specify the type of every variable and argument in my code?
Thanks a lot!

1 Like

I have to admit that you reached a case where solving those instabilities relied on subtle and non-intuitive changes for me. The use ntuple there to solve that part is something hard to understand (although it appears here from time to time).

The general answer to your question here is that “specifying” or, annotating, types, does not solve necessarily the problem of type instability. The instability is a property of the values the variables assume at run time.

Annotating the types required for functions can make your code not run by raising method errors if the types are not what you where expecting. That can help to track errors and indicate that something was not created with the expected type, but where the type stability has to be guaranteed is at the moment of creating the each value for each variable.

For example, the temp1c function (for which the compiler was able to infer types and non-allocate):

function temp1c(vertexList,elements)
  elementLength = length(elements[1])
  v = SVector(ntuple(i->i,elementLength))
  vNext = v.%elementLength .+ 1;
  temp1b(vertexList, elements, v, vNext)
end

You could defined it as:

function temp1c(vertexList::Vector{SVector{N,Float64}},
                elements::Vector{SVector{N,Int}}) where N

which conforms the type of variable you are feeding to the function. That will raise an error if the variables are not of those types, but it has no performance gain at all (annotating or not, the compiler will know what the types of variables are).

Yet, that can be useful to detect type-instabilities. For example, if you generated the elements vector in a type-unstable form, the function becomes horribly slow:

julia> elements = Any[ SVector(rand(1:N),rand(1:N),rand(1:N)) for i in 1:N ];

julia> @btime temp1c($vertexList, $elements)
  314.053 μs (11006 allocations: 672.13 KiB)

But if you had used the annotated version of the function, you would just get an error:

julia> @btime temp1c($vertexList, $elements)
ERROR: MethodError: no method matching temp1c(::Array{SArray{Tuple{3},Float64,1,3},1}, ::Array{Any,1})

which might help you to find where the instability happened (if it happened before that function call).

One thing: We could have figured out that in your version that used v = SVector((1:elementLength)...) there was a type-instability, by using the @code_warntype macro, which is always your friend:

julia> @code_warntype temp1c(vertexList,elements)
Variables
  #self#::Core.Compiler.Const(temp1c, false)
  vertexList::Array{SArray{Tuple{3},Float64,1,3},1}
  elements::Array{SArray{Tuple{3},Int64,1,3},1}
  elementLength::Int64
  v::Any
  vNext::Any

Body::SArray{Tuple{3},Float64,1,3}
1 ─ %1 = Base.getindex(elements, 1)::SArray{Tuple{3},Int64,1,3}
│        (elementLength = Main.length(%1))
│   %3 = (1:elementLength::Core.Compiler.Const(3, false))::Core.Compiler.Const(1:3, false)
│        (v = Core._apply_iterate(Base.iterate, Main.SVector, %3))
│   %5 = Base.broadcasted(Main.:%, v, elementLength::Core.Compiler.Const(3, false))::Any
│   %6 = Base.broadcasted(Main.:+, %5, 1)::Any
│        (vNext = Base.materialize(%6))
│   %8 = Main.temp1b(vertexList, elements, v, vNext)::SArray{Tuple{3},Float64,1,3}
└──      return %8


Note that v and vNext are inferred to be of type Any, which is bad (and red in the REPL). Those are solved changing that to the ntuple generator (to figure out how to solve that type instability would be another problem, but probably one could ask a much more specific question in the forum if we had reached that point).

2 Likes

“Statically-sized” types like SVectors or tuples require some care to ensure type stability.

Vector from the standard library does not include its length in the type signature, so collect(1:n) is always a Vector regardless of the value of n.

On the other hand, calling SVector(1:n...) produces different types for different values of n, because the length of a StaticArray is included in the type.
Ideally, if n is a compile time constant, it should be propagated, and the type of SVector(1:n...) should be inferable, but the Julia compiler currently is unable to do that :frowning:
ntuple(f, n) can be inlined for small n if n is known at compile time.
Julia compiler can detect that elementLength = length(elements[1]) is a compile time constant if elements[1] is a static array (with length <=16) and, hence, infer ntuple return type statically.

Now, to specify types or not depends on the specific use case. Here, we heavily rely on vertexList and elements containing static vectors to get rid of allocations, so specifying types is justified.
You may also write a type-generic variant for other cases which might be less performant.

3 Likes

This is complicated =P.
Ok, so how can I add this to the code and avoid allocations?

function temp1c(vertexList,elements)
  elementLength=size(elements[1],1)
  v=SVector(ntuple(i->i,elementLength))
  vNext = v .% elementLength .+ 1
  randvec = SA_F64[1,2,3]
  for k in 1:size(elements,1)
    vertices = vertexList[elements[k]];
    edges = vertices[vNext]-vertices[v];
    edgesNorm = norm.(edges);
    foo = SVector(ntuple(i->dot(edges[i],randvec),elementLength))
  end
end
1 Like