Create a set of non-slice arrays referencing contiguous memory

I have a use case where I manipulate an AbstractVector{Vector{Int64}} whose inner vectors all have the same length. I then do many element-by-element computations between pairs of vectors in the set. There are N ~ 100 vectors of length L ~ 100 each.

Initially I used a concrete Vector to store these individual vectors. However, I thought it stood to reason that I would get some performance boost if all the vectors were contiguous in memory, so I attempted to instead generate my buffers as collect(eachcol(zeros(Int64, L, N))), which is a Vector{SubArray}. However, I actually saw a performance regression (about 3x) with this layout. I can only ascribe this to the extra indexing arithmetic used to access the slices through each SubArray. Since my L and N are dynamic (I in fact iterate over different combinations of parameters), I find the notion of using StaticArrays unappealing.

Is there a way / what is the best practice to create a set of arrays of the concrete Vector type that reference contiguous underlying memory? (Is this even possible, given that Vector supports push! and other resizing operations?) Should this be done with the new Memory type introduced in Julia 1.11? (Parenthetically, I would love to see a blog post showing how it is used; the dev docs still seem very terse.)

You can do something like this, but there might be better ways, and there might be reasons other than index computations for your performance problem. Also make sure that v is not garbage collected before you’re done with vecs.

v = zeros(Int, 10, 20);  # parent array providing the contiguous backing memory

# Wrap each column's memory in a plain Vector{Int} (no copy, no SubArray);
# in linear indexing the columns start every size(v, 1) elements.
vecs = [unsafe_wrap(Array, pointer(v, i), size(v, 1))
        for i in range(start=1, step=size(v, 1), length=size(v, 2))]

# The wrappers hold no reference to v, so keep it alive explicitly.
result = GC.@preserve v docomputation(vecs)
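A quick sanity check (with hypothetical values) that the wrappers really alias the parent’s memory rather than copying it:

w = zeros(Int, 10, 20)
ws = [unsafe_wrap(Array, pointer(w, i), size(w, 1))
      for i in range(start=1, step=size(w, 1), length=size(w, 2))]
ws[2][1] = 42
w[1, 2] == 42  # true: column 2 starts at linear index size(w, 1) + 1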

Could you share some code demonstrating the performance regression?

A major issue is likely alignment of the memory segments.

Consider the following two instructions. The first (MOVUPD) is the unaligned version and the second (MOVAPD) is the aligned version.

https://www.felixcloutier.com/x86/movupd

https://www.felixcloutier.com/x86/movapd

Julia aligns memory for you for vectors of a certain size. If you are taking arbitrary slices, then each column is likely no longer aligned.
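You can see this directly. A quick sketch (the parent’s base alignment depends on the allocator, so treat the exact offsets as illustrative): with a column stride that is not a multiple of the vector-register width, the column starts cannot all share the parent’s alignment.

A = zeros(Float64, 10, 20)   # column stride = 10 * 8 = 80 bytes

# 80 % 32 != 0, so relative to a 32-byte boundary the column starts
# alternate between two different offsets.
for j in 1:4
    p = UInt(pointer(A, (j - 1) * size(A, 1) + 1))
    println("column ", j, ": address mod 32 = ", p % 32)
end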

What’s the alignment size on x86? I have a fleeting memory that it was 32 bits (or am I confusing this with struct padding?), which would then not be an issue if my eltype is Int64.

See the documentation I linked.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit versions), 32-byte (256-bit version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection exception (#GP) will be generated. For EVEX encoded versions, the operand must be aligned to the size of the memory operand. To move double precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

Also, I can’t see any reason why contiguous vectors should improve performance here. What matters is what fits in the cache. With the contiguous layout you may actually see performance degradation in parallel workloads due to false sharing, i.e. the end of one vector sharing a cache line (typically 64 bytes) with the beginning of the next. The architecture of the cache also matters.
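A minimal sketch of the false-sharing effect (bump! is a hypothetical helper, not from the original code; run with at least two threads, e.g. julia -t 2, and time the two calls yourself). In the packed layout both counters almost certainly sit on one 64-byte line, so the cores invalidate each other’s caches on every write; in the padded layout each core keeps its own line.

using Base.Threads

packed = zeros(Int64, 2)    # counters adjacent: likely one cache line
padded = zeros(Int64, 16)   # counters 8 Int64s (64 bytes) apart

function bump!(xs, stride, iters)
    @threads for t in 1:2
        i = 1 + (t - 1) * stride
        for _ in 1:iters
            xs[i] += 1   # each thread hammers its own counter
        end
    end
end

bump!(packed, 1, 10^7)   # adjacent: the cache line ping-pongs between cores
bump!(padded, 8, 10^7)   # padded: no sharing, so no ping-pong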

Linear prefetching comes to mind. TLB misses come to mind (you should probably echo "always" > /sys/kernel/mm/transparent_hugepage/enabled for Julia). DRAM organization comes to mind (the thing with physical banks). That being said, we’re talking about 80 KB, so well within L2.

@mkitti Below is an MWE. Also, I always thought that memory-alignment issues were abstracted away by the CPU (like cache misses); I did not realize there are different instructions for aligned and unaligned operands. Out of curiosity, how does the compiler/LLVM know which machine code to generate? Is alignment some guarantee the compiler knows from type inference?

Here’s the example:

using BenchmarkTools

function f(arrs, n)
    L = length(arrs[1])
    for _ in 1:n
        # pick two random positions and fold element j into element i
        # across every vector in the set
        i = rand(1:L); j = rand(1:L)
        for arr in arrs
            arr[i] += arr[j]
        end
    end
end

arrs2 = collect(eachcol(rand(1:100, 40, 20)))  # Vector{<:SubArray}: columns of one matrix
arrs1 = map(collect, arrs2);                   # Vector{Vector{Int64}}: independent copies
all(map(==, arrs1, arrs2))                     # true

@btime f($arrs1, $(1024))
#  22.100 μs (0 allocations: 0 bytes)
@btime f($arrs2, $(1024))
#  27.400 μs (0 allocations: 0 bytes)