From my understanding:
p_j .= pg[:,j]
first creates a new 3-element vector (with its own malloc), which is then assigned in place to the existing p_j
p_j .= @view pg[:,j]
first creates a new view object (with its own small malloc, containing the size and a ptr), which is then assigned in place to the existing p_j
For large matrices, the view version is much more efficient, since it only stores the ptr and size members, whereas the non-view version copies all the elements. Unfortunately, in this case a 3-element vector and a view object are similar in size, and because the assignment sits in the inner loop, the allocation becomes a problem.
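To make the cost visible, here is a quick check (a sketch only; the helper names are mine, and the exact byte counts depend on the Julia version):

copy_assign!(dst, src, j) = (dst .= src[:, j]; nothing)       # slice copies the column
view_assign!(dst, src, j) = (dst .= @view src[:, j]; nothing) # view just references it

pg  = randn(3, 1000)
p_j = zeros(3)
copy_assign!(p_j, pg, 1); view_assign!(p_j, pg, 1)  # warm up, so compilation is excluded
@allocated copy_assign!(p_j, pg, 1)   # nonzero: a temporary 3-element vector
@allocated view_assign!(p_j, pg, 1)   # small or zero: just the view object, if anything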
I was hoping that with some trick, like explicitly specifying the size,
p_j[1:3] .= pg[1:3,j]
the compiler would be smart enough to realise it can unroll the loop and skip the interim allocation, similar to the tuple case. (Does anyone know of such a trick?)
Thus I was forced to either write the assignment loop manually with a for (where I have sometimes struggled with the loop variable itself becoming a new allocation) or write the assignments out explicitly, as I did, since you then only need to worry about the first and last element.
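For reference, both workarounds sketched out (pg, p_j and j as in the snippets above, with placeholder sizes):

pg  = randn(3, 10)
p_j = zeros(3)
j   = 1

# Option 1: manual assignment loop, no temporary is allocated
@inbounds for k in 1:3
    p_j[k] = pg[k, j]
end

# Option 2: fully explicit scalar assignments, as I ended up doing
p_j[1] = pg[1, j]
p_j[2] = pg[2, j]
p_j[3] = pg[3, j]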
The 75.5k allocations still sounds like your test case had some allocating assignment happening on every iteration. Or was that during the file reading and not from assignments in PackStep? When I tested, I only had the three allocations:
p_i = zeros(3)
p_j = zeros(3)
rij = zeros(3)
happening in each PackStep call. You can even get rid of those by passing them in, but I would only really consider that if the temp variables are big or if the function execution time is less than 10 ms or so, since it makes using functions clumsy. I still wish there was a way to tell Julia that a function is only allowed to use the stack and no dynamic memory allocations.
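For completeness, the buffer-passing pattern looks roughly like this (PackStep! here is a placeholder signature, not the real function); Base's @allocated at least lets you verify that a call stays allocation-free:

function PackStep!(p_i, p_j, rij, pg, j)   # hypothetical signature
    @inbounds for k in 1:3
        p_j[k] = pg[k, j]                  # reuses the caller's buffers, no alloc
    end
    # ... the rest of the step ...
    return nothing
end

pg  = randn(3, 100)
p_i = zeros(3); p_j = zeros(3); rij = zeros(3)   # allocated once, up front
PackStep!(p_i, p_j, rij, pg, 1)                  # warm up / compile
@allocated PackStep!(p_i, p_j, rij, pg, 1)       # should report 0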
In terms of a minimal example, the code in the other thread helps, but doesn’t work as-is due to needing to load a file. I replaced that with:
coords = randn(1000,2);
clamp!(coords, -1.0, 1.0)
but it means that if you call it multiple times, you cannot compare the final outputs exactly.
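If exact comparison matters, seeding the RNG makes the random input repeatable (using the Random standard library):

using Random
Random.seed!(1234)            # any fixed seed
coords = randn(1000, 2);
clamp!(coords, -1.0, 1.0)     # identical coords on every run, so outputs can be compared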
I still don’t understand why the vector approach should be so much slower, especially if @inbounds is specified. I thought on my side I got them to within 10% (though now that I think about it, the vector approach used the conditional sqrt while the tuple used sqrt on every call, so maybe the vector version is significantly slower after all). It probably makes some sense: for a tuple or SVector the size is fixed and can be handled explicitly at compile time, while vector operations need to cater for variable length. But the code is now mostly scalar operations, so I didn’t expect vectors to be so bad (15 times?).
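If someone wants to measure that gap in isolation, a sketch like this (assuming StaticArrays and BenchmarkTools are available; the function names are mine) separates the fixed-size effect from the rest of the simulation:

using StaticArrays, BenchmarkTools

# Vector path: the slice allocates a temporary on each call
norm_vec(pg, j) = (w = pg[:, j]; sqrt(w[1]^2 + w[2]^2 + w[3]^2))

# SVector path: size fixed at compile time, stays on the stack / in registers
norm_sv(pg, j) = (w = SVector{3}(pg[1, j], pg[2, j], pg[3, j]); sqrt(w[1]^2 + w[2]^2 + w[3]^2))

pg = randn(3, 1000)
@btime norm_vec($pg, 500)   # allocates, so noticeably slower
@btime norm_sv($pg, 500)    # zero allocations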
@inbounds should not break the simulation, unless there is an actual out-of-bounds indexing bug in the code.
@fastmath might give different results due to different rounding when the sequence of calculations is reordered. If that is a problem, then I would be careful about moving to a GPU, where the calculations are done by a different engine and where ArrayFire or OpenCL abstraction layers in between might change the sequence of calculations.
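The rounding point is easy to see even without @fastmath, since floating-point addition is not associative, and @fastmath licenses exactly this kind of reordering:

x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
x == y        # false: the two orders round differently
x - y         # tiny, but nonzero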
@threads should ideally not break the simulation either; however, race conditions make it more likely to cause weirdness if the code isn’t structured carefully.
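As a minimal sketch of the kind of race I mean (nothing to do with the actual simulation code):

using Base.Threads

function racy_sum(n)
    acc = 0.0
    @threads for i in 1:n
        acc += 1.0             # unsynchronized read-modify-write: a data race
    end
    return acc                 # usually < n when run with multiple threads
end

function atomic_sum(n)
    acc = Atomic{Float64}(0.0)
    @threads for i in 1:n
        atomic_add!(acc, 1.0)  # atomic update: always totals n
    end
    return acc[]
end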