Local arrays in for loops

Hello,
Is there a nice way to use vectors in arrays without spending too much time on allocations?
I have some parallel for loops and if I set 3 scalar variables to certain values, it seems to perform better than if I set one array with 3 elements (and it works in CUDA kernels). I don’t really know why the two are different at all.
It would make my code neater if I could use arrays. I understand StaticArrays are different from regular arrays for some reason. However, I would need to mutate them and have Vectors of StaticArrays.
Here’s an example of what I do in the for loop:

point1 = points[element[k][1]]; point2 = points[element[k][2]]; point3 = points[element[k][3]];
e1 = point2-point1; e2 = point3-point2; e3 = point1-point3;
E1 = norm(e1); E2 = norm(e2); E3 = norm(e3);

Note that each point is a StaticArray. I also do more complicated stuff and function calls.
I’d love to do:

vertices = points[panels[k]];
for v in 1:panelType
  vnext = v%numVertices+1;
  e[v] = vertices[vnext]-vertices[v];
  E[v] = norm(e[v]);
end

Thanks a lot!

Is there a reason you can’t use a multidimensional array?

@Mason you mean within my for loop have one variable per vertex per element? Because that would take up way too much memory and most of those variables are temporary.
Thanks!

You can do that with a vector of static arrays. In that code e[v] does not seem necessáry as an array (coukd be only a temporary vector). But that depends if you will use it later.

I don’t think your question is phrased very clearly and I’m having some trouble understanding. I am suggesting that instead of writing code like

v = [[1, 2, 3], [4, 5, 6]]

and accessing elements like

v[i][j]

you should instead write

m = [1 4
     2 5
     3 6]

and access elements as

m[i, j]

This will have the same (well, slightly less) memory requirements than an array of arrays, but will be more performant. If you’re using a vector of staticarrays, it’s usually better to just use GitHub - mateuszbaran/HybridArrays.jl: Arrays with both statically and dynamically sized axes in Julia

2 Likes

If you show us an actually runnable snippet of code that demonstrates a simplified version of your problem, we might be able to offer more specific advice.

5 Likes

I might be wrong, but maybe this note I have may be helpful: Immutable variables · JuliaNotes.jl

@Mason I started with a regular matrix and when I switched to a Vector of StaticArrays my code became 10x faster.
@leandromartinez98 Thanks for the link. I learned some stuff =].

I thought this was a common enough problem that I wouldn’t need to make a full example, but here’s one:

N=1000;
vertexList = Vector{StaticArrays.SVector{3,Float64}}(undef,N);
elements = Vector{StaticArrays.SVector{3,Int}}(undef,N);
for i in 1:1000
    vertexList[i] = SA_F64[rand(),rand(),rand()];
    elements[i] = SA[rand(1:N),rand(1:N),rand(1:N)];
end
elementLength = length(elements[1]);
v = collect(1:elementLength);
vNext = v.%elementLength .+ 1;

function temp1(vertexList,elements,N)
    for k in 1:N
        vertices = vertexList[elements[k]];
        edges = vertices[vNext]-vertices[v];
        edgesNorm = norm.(edges);
    end
end
function temp2(vertexList,elements,N)
    for k in 1:N
        vertex1= vertexList[elements[k][1]]; vertex2= vertexList[elements[k][2]]; vertex3= vertexList[elements[k][3]];
        edge1 = vertex2-vertex1; edge2 = vertex3-vertex2; edge3 = vertex1-vertex3;
        edgeNorm1 = norm(edge1); edgeNorm2 = norm(edge2); edgeNorm3 = norm(edge3);
    end
end
@btime temp1(vertexList,elements,N); # 1.054 ms (8000 allocations: 765.63 KiB)
@btime temp2(vertexList,elements,N); # 991.583 ns (0 allocations: 0 bytes)

So temp2 is waaaay better, but way less elegant. It also locks me in a certain elementLength.
Any tips?
Thanks a lot!

You mean temp2 is better right?

Note that N=length(vertexList), you don’t need to pass it as a parameter.

In temp1 you should passa v and vNext as parameters, otherwise they are global variables.

I think you can do something on the line you want. But then I have to get to the computer to try. :slightly_smiling_face:

Have you considered something like temp2 but with an additional loop on elementLength?

Sorry, I just edited that mistake.
I wrote the sample code real quick, so some stuff is odd.
But the changes you mentioned don’t make a difference.
Thanks!

I’m messing around a bit and this helps a lot:

edges = @SVector [vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength];

Because before edges was an array of StaticArrays. Now it’s a StaticArray of StaticArrays. It still allocates, but now the results are more comparable:

@btime temp1(vertexList,elements,N); # 1.060 μs (2 allocations: 224 bytes)
@btime temp2(vertexList,elements,N); # 991.583 ns (0 allocations: 0 bytes)

Any way I can get that down to zero?

Probably this is allocating something. Can you make it a static vector as well?

Just a curiosity: I have never seen the % notation before. Is that from some package? Where did you find it?

Small things: while you may not see a difference in one specific example, it is strongly advised to avoid global variables. Thus, also compute the lengths of the vectors inside the function (elementLength, for example). That avoids having to pass them as parameters while guaranteeing that the function remains generic.

Also @btime requires interpolating the input variable ($). Otherwise you may get wrong results.

1 Like

No, that is a StaticArray.
% is the integer division remainder, as in most other languages. That v%len+1 thing is just a way to loop over the vertices so that v is 1,2,3 and v%len+1 is 2,3,1.
I do set the length inside my functions. Again, I wrote that example real quick.
Can you point me to some material explaining the usage of $? I’ve seen it in some code and never figured out what it does. The documentation isn’t clear enough for me.
Thanks!

1 Like

Yes, of course I have seen it :joy: It got me confused because of the context and because in Fortran that is used to refer to fields of custom types.

Not really. It has something to do with how variables are parsed in macros. But I could not find a clear explanation (for me) either. But one needs to add the $ to get proper results. Otherwise there might be some benchmark associated to parsing the variable in the macro which should not be counted.

Neither of your functions, temp1 or temp2, return anything, and they don’t appear to have any side effects, so in theory the compiler can decide to not do any work at all. You should make sure that each function actually returns something.

Also, as far as I can tell, each iteration is independent from the previous, so the compiler could also decide to just run the last iteration of the loop, since the previous iterations have no observable effect.

1 Like
N=1000;
vertexList = Vector{StaticArrays.SVector{3,Float64}}(undef,N);
elements = Vector{StaticArrays.SVector{3,Int}}(undef,N);
for i in 1:1000
    vertexList[i] = SA_F64[rand(),rand(),rand()];
    elements[i] = SA[rand(1:N),rand(1:N),rand(1:N)];
end

function temp1(vertexList,elements)
    elementLength = length(elements[1]);
    v = collect(1:elementLength);
    vNext = v.%elementLength .+ 1;
    acc = SA_F64[0,0,0];
    for k in 1:size(elements,1)
        vertices = vertexList[elements[k]];
        edges = @SVector [vertices[v%elementLength+1]-vertices[v] for v in 1:elementLength];
        edgesNorm = norm.(edges);
        acc += edgesNorm;
    end
    return acc;
end
function temp2(vertexList,elements)
    acc1 = 0.0;
    acc2 = 0.0;
    acc3 = 0.0;
    for k in 1:size(elements,1)
        vertex1= vertexList[elements[k][1]]; vertex2= vertexList[elements[k][2]]; vertex3= vertexList[elements[k][3]];
        edge1 = vertex2-vertex1; edge2 = vertex3-vertex2; edge3 = vertex1-vertex3;
        edgeNorm1 = norm(edge1); edgeNorm2 = norm(edge2); edgeNorm3 = norm(edge3);
        acc1 += edgeNorm1; acc2 += edgeNorm2; acc3 += edgeNorm3;
    end
    return acc1, acc2, acc3;
end
@btime temp1($vertexList,$elements); # 5.050 μs (2 allocations: 224 bytes)
@btime temp2($vertexList,$elements); # 4.886 μs (0 allocations: 0 bytes)

Maybe you can update the example. It’s not clear if you have fixed the global variables issue or not.

Edit: Aha! Simulpost :smiley:

You will never get down to zero allocations, since v and vNext are allocated. If you remove the collect, it will go down to one allocation.

(BTW you should never collect ranges, just leave them alone.)

1 Like

Oh, that was dumb. I don’t need v and vNext. I kept them in due to the previous version.
Yeah, removing those gets rid of the 2 allocations. Thanks!
Ok, using StaticArrays basically got rid of most of my problems. I got some other points in my real code where there’s some questionable stuff happening, but I can’t reproduce it too easily in a cutdown, for some reason.
Thanks a lot for the help, folks!

I did it due to the vNext vector. (1:3).%3.+1 doesn’t work.

jl>  (1:3) .% 3 .+ 1
3-element Vector{Int64}:
 2
 3
 1

The reason it didn’t work was parsing ambiguity of 3.+1. It could mean 3 .+ 1 or 3. + 1. The error message explains it: ERROR: syntax: invalid syntax "3.+"; add space(s) to clarify

(Never collect ranges :wink:)

5 Likes