Create a set of non-slice arrays referencing contiguous memory

I have a use case where I manipulate an AbstractVector{Vector{Int64}} whose inner vectors all have the same length. I then do many element-by-element computations between pairs of vectors in the set. There are N ~ 100 vectors of length L ~ 100 each.

Initially I used a concrete Vector to store these individual vectors. However, I thought it stood to reason that I would get some performance boost if all the vectors were contiguous in memory, so I attempted to instead generate my buffers as collect(eachcol(zeros(Int64, L, N))), which is a Vector{SubArray}. However, I actually saw a performance regression (about 3x) with this layout. I can only ascribe this to the extra indexing arithmetic used to access the slices through each SubArray. Since my L and N are dynamic (I in fact iterate over different combinations of parameters), I find the notion of using StaticArrays unappealing.

Is there a way / what is the best practice to create a set of arrays of the concrete Vector type that reference contiguous underlying memory? (Is this even possible, given that Vector supports push! and other resizing operations?) Should this be done with the new Memory type introduced in Julia 1.11? (Parenthetically, I would love to see a blog post showing how it is used; the dev docs still seem very terse.)

You can do something like this, but there might be better ways, and there might be reasons other than index computations for your performance problem. Also make sure that v is not garbage collected before you’re done with vecs.

v = zeros(Int, 10, 20);  # parent array providing the contiguous backing memory

# Wrap each column's memory in a plain Vector{Int} (no copy, no SubArray);
# in linear indexing the columns start every size(v, 1) elements.
vecs = [unsafe_wrap(Array, pointer(v, i), size(v, 1))
        for i in range(start=1, step=size(v, 1), length=size(v, 2))]

# The wrappers hold no reference to v, so keep it alive explicitly.
result = GC.@preserve v docomputation(vecs)
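A quick sanity check (with hypothetical values) that the wrappers really alias the parent’s memory rather than copying it:

w = zeros(Int, 10, 20)
ws = [unsafe_wrap(Array, pointer(w, i), size(w, 1))
      for i in range(start=1, step=size(w, 1), length=size(w, 2))]
ws[2][1] = 42
w[1, 2] == 42  # true: column 2 starts at linear index size(w, 1) + 1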

Could you share some code demonstrating the performance regression?

A major issue is likely alignment of the memory segments.

Consider the following two instructions. The first (MOVUPD) is the unaligned version and the second (MOVAPD) is the aligned version.

https://www.felixcloutier.com/x86/movupd

https://www.felixcloutier.com/x86/movapd

Julia aligns memory for you for vectors of a certain size. If you are taking arbitrary slices, then each column is likely no longer aligned.
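You can see this directly. A quick sketch (the parent’s base alignment depends on the allocator, so treat the exact offsets as illustrative): with a column stride that is not a multiple of the vector-register width, the column starts cannot all share the parent’s alignment.

A = zeros(Float64, 10, 20)   # column stride = 10 * 8 = 80 bytes

# 80 % 32 != 0, so relative to a 32-byte boundary the column starts
# alternate between two different offsets.
for j in 1:4
    p = UInt(pointer(A, (j - 1) * size(A, 1) + 1))
    println("column ", j, ": address mod 32 = ", p % 32)
end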

What’s the alignment size on x86? I have a fleeting memory that it was 32 bits (or am I confusing this with struct padding?), which would then not be an issue if my eltype is Int64.

See the documentation I linked.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit versions), 32-byte (256-bit version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection exception (#GP) will be generated. For EVEX encoded versions, the operand must be aligned to the size of the memory operand. To move double precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

Also, I can’t see any reason why contiguous vectors should improve performance here. What matters is what fits in the cache. With the contiguous layout you may actually see performance degradation in parallel workloads due to false sharing, i.e. the end of one vector sharing a cache line (typically 64 bytes) with the beginning of the next. The architecture of the cache also matters.
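A minimal sketch of the false-sharing effect (bump! is a hypothetical helper, not from the original code; run with at least two threads, e.g. julia -t 2, and time the two calls yourself). In the packed layout both counters almost certainly sit on one 64-byte line, so the cores invalidate each other’s caches on every write; in the padded layout each core keeps its own line.

using Base.Threads

packed = zeros(Int64, 2)    # counters adjacent: likely one cache line
padded = zeros(Int64, 16)   # counters 8 Int64s (64 bytes) apart

function bump!(xs, stride, iters)
    @threads for t in 1:2
        i = 1 + (t - 1) * stride
        for _ in 1:iters
            xs[i] += 1   # each thread hammers its own counter
        end
    end
end

bump!(packed, 1, 10^7)   # adjacent: the cache line ping-pongs between cores
bump!(padded, 8, 10^7)   # padded: no sharing, so no ping-pong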

Linear prefetching comes to mind. TLB misses come to mind (you should probably echo "always" > /sys/kernel/mm/transparent_hugepage/enabled for Julia). DRAM organization comes to mind (the thing with physical banks). That being said, we’re talking about 80 KB, so well within L2.

@mkitti Below is an MWE. Also, I always thought that memory-alignment issues were abstracted away by the CPU (like cache misses); I did not realize there are different instructions for aligned and unaligned operands. Out of curiosity, how does the compiler/LLVM know which machine code to generate? Is alignment some guarantee the compiler knows from type inference?

Here’s the example:

using BenchmarkTools

function f(arrs, n)
    L = length(arrs[1])
    for _ in 1:n
        # pick two random positions and fold element j into element i
        # across every vector in the set
        i = rand(1:L); j = rand(1:L)
        for arr in arrs
            arr[i] += arr[j]
        end
    end
end

arrs2 = collect(eachcol(rand(1:100, 40, 20)))  # Vector{<:SubArray}: columns of one matrix
arrs1 = map(collect, arrs2);                   # Vector{Vector{Int64}}: independent copies
all(map(==, arrs1, arrs2))                     # true

@btime f($arrs1, $(1024))
#  22.100 μs (0 allocations: 0 bytes)
@btime f($arrs2, $(1024))
#  27.400 μs (0 allocations: 0 bytes)