Local arrays in for loops

I don’t know if you solved that some other way. But note that your function temp1 is using a global definition of elementLength that is probably defined somewhere there to do:

 edges = @SVector [vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength];

If I am wrong, please someone correct me, but you cannot define a static vector with a variable length at run time like that. If you don’t have elmentLength defined somewhere in the global scope at parse time, it returns the error:

julia> temp1(vertexList,elements)
┌ Error: Failed to revise /home/leandro/Drive/Work/JuliaPlay/ribeiro.jl
│   exception =
│    LoadError: UndefVarError: elementLength not defined

Edit: Actually I could make it run without the global definition of elementLength by initializing the array as with the explicit type signature, as:

edges = SVector{elementLength,SVector{elementLength,Float64}}([vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength])

But the function then gets much slower and allocates much more.

It was tricky to get rid of all allocations while keeping the function general for any element size, but I think this did the trick:

function temp3(vertexList::AbstractVector{SVector{N,T}},elements) where {N,T}
    acc = zero(MVector{N,T})
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]]
        for i in 1:N
            edge = vertices[i%N+1] - vertices[i]
            acc[i] += norm(edge)
        end
    end
    return SVector{N,T}(acc)
end

julia> @btime temp3($vertexList,$elements)
  4.223 μs (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
 663.956508776588
 656.6284760140031

Code
using LinearAlgebra
using StaticArrays

function temp3(vertexList::AbstractVector{SVector{N,T}},elements) where {N,T}
    acc = zero(MVector{N,T})
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]]
        for i in 1:N
            edge = vertices[i%N+1] - vertices[i]
            acc[i] += norm(edge)
        end
    end
    return SVector{N,T}(acc)
end

using BenchmarkTools

N=1000
vertexList = [ rand(SVector{3,Float64}) for i in 1:N ]
elements = [ SVector(rand(1:N),rand(1:N),rand(1:N)) for i in 1:N ]

@btime temp3($vertexList,$elements)


I figured that out (after a lot of messing around) right when I got your message! I had declared it globally in tests and didn’t notice it didn’t work locally.
So it works if you set it in the function title? I tried this:

function myfun!(inputs..., elements::Array{SArray{Tuple{elementLength},Int64,1,elementLength},1}, more inputs...) where {elementLength};
e = @SVector [vertices[v % elementLength+1]-vertices[v] for v in 1:elementLength];

And it doesn’t work. It needs to be an MArray to work?
Thanks a lot!

No, my understanding is that the @SVector macro is parsed at parse time. Thus a comprehension [ i for i in 1:N] that will be parsed by that macro has to be the N defined at the moment the expression is parsed.

The alternative is to not use the macro, as I mention above, and setting all the dimensions explicitly (even if parametrically), but then it became allocating there.

Passing as an argument does not work because the function does not get specialized to argument values. Thus the number has to be part of a type (you can do a trick with Val to pass a value as a type, maybe that works. But I think the other implementation is nicer)

Hm. Too bad. Guess I’ll have to set it globally. That’s not a deal breaker at the moment. Might make my life harder later, though.
Thanks a lot for the help!

Didi you see the temp3 version above that does not allocate anything? That one solves the problem, the trick is to parameterize the input arrays and use the sizes inside the function. It might be my eyes, but the result seemed pretty to me.

I would still need a vector of vectors:

edges = @SVector [vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength];

Where elementLength = size(elements[1],1). I use these later for various purposes. You got around this by not storing the edges for later use. How could I keep the edges while avoiding a global variable?
Thanks a lot!

The best I could come by is this:

function temp1(vertexList::AbstractVector{SVector{N,T1}},
               elements::AbstractVector{SVector{N,T2}};
               edges = zero(MVector{N,SVector{N,T1}})) where {N,T1,T2}
    acc = zero(SVector{N,T1})
    @inbounds for k in 1:size(elements,1)
        vertices = vertexList[elements[k]];
        for i in 1:N
           edges[i] = vertices[i%N+1]-vertices[i]
        end
        edgesNorm = norm.(edges)
        acc += edgesNorm
        # edges available to do more stuff
    end
    return acc
end

Here, edges is an optional parameter and the only possible allocation in the function. If you don’t pass it, you get:

julia> @btime temp1($vertexList,$elements)
  4.953 μs (1 allocation: 80 bytes)

If you preallocate it and pass it, you get:

julia> edges = zero(MVector{3,SVector{3,Float64}});

julia> @btime temp1($vertexList,$elements,edges=$edges)
  4.938 μs (0 allocations: 0 bytes)

Thanks for the help, but this would break loop parallelization. I can’t come up with a better solution than the global var either.
It will do for now.
Thanks again!

If you can, please post here the final solution you have. I could not avoid all allocations in the case of the global variables. I do not know what I am missing there.

I don’t really know what’s happening, but I can’t get it down to zero anymore. I can get it down to one by doing something really weird.

using StaticArrays
using LinearAlgebra
N=1000;
vertexList = Vector{StaticArrays.SVector{3,Float64}}(undef,N);
elements = Vector{StaticArrays.SVector{3,Int}}(undef,N);
for i in 1:1000
    vertexList[i] = SA_F64[rand(),rand(),rand()];
    elements[i] = SA[rand(1:N),rand(1:N),rand(1:N)];
end
elementLength = length(elements[1]);

function temp1(vertexList,elements)
    elementLength = length(elements[1]);
    acc = SA_F64[0,0,0];
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]];
        edges = @SVector [vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength];
        edgesNorm = norm.(edges);
        acc += edgesNorm;
    end
    return acc;
end
@time temp1(vertexList,elements);

Note that it only works if elementLength is defined twice. Otherwise, I get 18k allocations.
Is someone is able to explain this?

Well the part of the two definitions of the length of the vector is because, as I mentioned, the @SVector macro is evaluated at parse time. The elementLength used to define the length of the vector has nothing to do with the elementLength inside the function. Note this:

julia> using StaticArrays

julia> f(N) = (println(N), @SVector [0 for i in 1:N])
ERROR: LoadError: UndefVarError: N not defined
Stacktrace:
 [1] top-level scope at none:1
 [2] eval(::Module, ::Any) at ./boot.jl:331
 [3] @SVector(::LineNumberNode, ::Module, ::Any) at /home/leandro/.julia/packages/StaticArrays/l7lu2/src/SVector.jl:63
in expression starting at REPL[2]:1

julia> N = 3
3

julia> f(N) = (println(N), @SVector [0 for i in 1:N])
f (generic function with 1 method)

julia> f(1)
1
(nothing, [0, 0, 0])


Note that in the final call, which worked, the vector generated had still length 3. The function had already been defined when it was first parsed with N=3. That is why you need to define that length outside the function, but that breaks completely the generality of the function.

Besides that, my opinion is that the edges array you are willing to store and mutate is just not immutable enough to be stack allocated. If you want to have it mutable and flexible you will need to allocate it on the heap. But that may have some solution I am not seeing.

1 Like

I understood why I have to define it outside the function. What I meant to say in my previous post is that I also have to define it inside. That is the weird part. If I don’t, I get a lot of allocations.

Yes, that is weird.

Maybe THIS elementLength becomes type unstable and that causes the allocations. Not related anymore to the size of the static array.

Edit: yes, that is it:

julia> @code_warntype temp1(vertexList,elements)
Variables
  #self#::Core.Compiler.Const(temp1, false)
  vertexList::Array{SArray{Tuple{3},Float64,1,3},1}
  elements::Array{SArray{Tuple{3},Int64,1,3},1}
  acc::Any
  @_5::Union{Nothing, Tuple{Int64,Int64}}
  k::Int64
  #7::var"#7#8"{SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}}
  vertices::SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}
  256::var"#7#8"{SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}}
  edges::SArray{Tuple{3},_A,1,3} where _A
  edgesNorm::Any

Body::Any
1 ─       (acc = Base.getindex(Main.SA_F64, 0, 0, 0))
│   %2  = Main.size(elements, 1)::Int64
│   %3  = (1:%2)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_5 = Base.iterate(%3))
│   %5  = (@_5 === nothing)::Bool
│   %6  = Base.not_int(%5)::Bool
└──       goto #4 if not %6
2 ┄ %8  = @_5::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (k = Core.getfield(%8, 1))
│   %10 = Core.getfield(%8, 2)::Int64
│   %11 = Base.getindex(elements, k)::SArray{Tuple{3},Int64,1,3}
│         (vertices = Base.getindex(vertexList, %11))
│   %13 = Main.:(var"#7#8")::Core.Compiler.Const(var"#7#8", false)
│   %14 = Core.typeof(vertices)::Core.Compiler.Const(SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}, false)
│   %15 = Core.apply_type(%13, %14)::Core.Compiler.Const(var"#7#8"{SArray{Tuple{3},SArray{Tuple{3},Float64,1,3},1,3}}, false)
│         (#7 = %new(%15, vertices))
│         (256 = #7)
│   %18 = Core.apply_type(Main.SVector, 3)::Core.Compiler.Const(SArray{Tuple{3},T,1,3} where T, false)
│   %19 = (256)(1)::Any
│   %20 = (256)(2)::Any
│   %21 = (256)(3)::Any
│   %22 = Core.tuple(%19, %20, %21)::Tuple{Any,Any,Any}
│         (edges = (%18)(%22))
│   %24 = Base.broadcasted(Main.norm, edges)::Base.Broadcast.Broadcasted{StaticArrays.StaticArrayStyle{1},Nothing,typeof(norm),_
A} where _A<:Tuple
│         (edgesNorm = Base.materialize(%24))
│         (acc = acc + edgesNorm)
│         (@_5 = Base.iterate(%3, %10))
│   %28 = (@_5 === nothing)::Bool
│   %29 = Base.not_int(%28)::Bool
└──       goto #4 if not %29
3 ─       goto #2
4 ┄       return acc


The problem of course goes away if you set elementLength as const:

const elementLength = length(elements[1]);

I was finally able to get rid of all allocations!


function temp4(vertexList::AbstractVector{SVector{N,T}},elements) where {N,T}
    acc = zero(MVector{N,T})
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]]
        edges = fill(zero(SVector{N,T}),MVector{N,SVector{N,T}})
        for i in 1:N
            edges[i] = vertices[i%N+1] - vertices[i]
            acc[i] += norm(edges[i])
        end
    end
    return SVector{N,T}(acc)
end

using BenchmarkTools
N=1000
vertexList = [ rand(SVector{3,Float64}) for i in 1:N ];
elements = [ SVector(rand(1:N),rand(1:N),rand(1:N)) for i in 1:N ];

julia> @btime temp4($vertexList,$elements)
  4.237 μs (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
 670.809218674305
 672.9989019068689
 678.1027515172907

What did the trick was to use fill to create the the edges mutable array. For some reason zero allocated the array, but fill does not.

It was a good exercise. I apologize for perhaps taking too much of your time and perhaps discourage other people to answer your questions.

If I may leave a suggestion to get better advice from every one, always post your complete runnable code, with the output you are seeing, so people can readily copy and paste and reproduce the results, at each step. That helps a lot everyone trying to help.

edit:

Actually using zero also works. I don’t know why I didn’t get the solution before. I thought I had tried. Probably I got something wrong somewhere else:

function temp4(vertexList::AbstractVector{SVector{N,T}},elements) where {N,T}
    acc = zero(MVector{N,T})
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]]
        edges = zero(MVector{N,SVector{N,T}})
        for i in 1:N
            edges[i] = vertices[i%N+1] - vertices[i]
            acc[i] += norm(edges[i])
        end
        # edges available to do something else
    end
    return SVector{N,T}(acc)
end

using BenchmarkTools
N=1000
vertexList = [ rand(SVector{3,Float64}) for i in 1:N ];
elements = [ SVector(rand(1:N),rand(1:N),rand(1:N)) for i in 1:N ];

julia> @btime temp4($vertexList,$elements)
  4.231 μs (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
 678.0587466021068
 685.3610238961863
 673.6792450131684

2 Likes

Thanks a lot for your time and definitely don’t apologize after helping out so much!
Your example works and I could modify it a bit to make it more like my real code and it still worked. However, my real code does not work. For whatever reason, this method led to a lot of allocations and incremental changes from @SVector [..., globalVar] to using zero(MVector...) seemed to give me numbers of allocations that I could not explain at all (small changes would make the number jump by a factor of 35, which I don’t understand).
I spent several hours on this and whatever I do to your example to make it more like my code still leads to zero allocations, while all sorts of changes to my real code still led to hundreds of thousands of allocations.
I need to stop chasing this rabbit for a while. The @SVector version was only 10% faster than the version with tons of allocations, while the latter doesn’t require a global var. So I guess I’ll stick to that one for now.
Thanks again and I will post code from the start next time (as I said before, I thought this was a common issue, maybe with an easy solution!).

2 Likes

Just backing up a little bit, I started with your temp1 from Local arrays in for loops - #8 by Ribeiro and made only two very small changes:

  1. Removed all global variables
  2. Made v an SVector

Here’s your temp1 and my temp1b:

v = SVector((1:elementLength)...)
vNext = v.%elementLength .+ 1;

function temp1(vertexList,elements,N)
    for k in 1:N
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
    end
end

function temp1b(vertexList, elements, v, vNext)
    for k in 1:length(vertexList)
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
    end
end

@btime temp1($vertexList, $elements, $N)
@btime temp1b($vertexList, $elements, $v, $vNext)

and the difference is pretty striking:

julia> @btime temp1($vertexList, $elements, $N)
  278.985 μs (9000 allocations: 656.25 KiB)

julia> @btime temp1b($vertexList, $elements, $v, $vNext)
  2.401 μs (0 allocations: 0 bytes)

More than 100x faster, with no allocations, and no need to hard-code the number of elements.

The moral here is that the guidance against using global variables is incredibly important (that’s why it’s the first performance tip).

4 Likes

You see, @Ribeiro , that is why I was apologizing :smile:

At least there are some things in favour of what I have proposed. In the original problem your function was not returning anything, thus that was not really comparable, afterwards you added the computation of the acc (acceleration, I guess). Adding that to @rdeits temp1b gives:

function temp1b(vertexList, elements, v, vNext)
    acc = zero(SVector{3,Float64})
    for k in 1:length(vertexList)
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
        acc += edgesNorm
    end
    acc
end
julia> @btime temp1b($vertexList, $elements, $v, $vNext)
  7.639 μs (0 allocations: 0 bytes)

Still no allocations, but about 2x slower than my temp4.

Also, if we include the allocation of the auxiliary arrays v and vNext, we get:

function temp1b_alloc(vertexList,elements)
  elementLength = length(elements[1])
  v = SVector((1:elementLength)...)
  vNext = v.%elementLength .+ 1;
  temp1b(vertexList, elements, v, vNext)
end
julia> @btime temp1b_alloc($vertexList, $elements)
  9.019 μs (11 allocations: 416 bytes)

Of course than it allocates and it is somewhat slower still.***

This kind of thing is probably associated to type instabilities. Sometimes they are not easy to track. Whenever you can, post more of your code here (or in other thread, to get s fresh start).

***As shown below, these allocations can be solved by setting v with an ntuple.

Ok, but then why does this:

function temp1c(vertexList,elements)
    elementLength=size(elements[1],1)
    v=SVector((1:elementLength)...)
    vNext = v .% elementLength .+ 1
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
    end
end

perform 9011 allocations?

1 Like

SVector((1:elementLength)...) is not type-stable. Use ntuple(i->i, elementLength) instead because it can be inlined by the compiler.

function temp1c(vertexList, elements)
    elementLength=size(elements[1],1)
    v=SVector(ntuple(i->i, elementLength))
    vNext = v .% elementLength .+ 1
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
    end
end

I’d also annotate the type of the second argument to avoid relying on the array being non-empty.

2 Likes