SIMD.jl - vload() continuous blocks from higher-dimensional arrays?

jl_enthusiast · August 30, 2025, 11:29pm

Hi,

I am currently doing some practice projects in Julia, which involves using SIMD.jl for writing/generating microkernels. My issue is trying to determine what is the “best” way of loading Vec types from a higher-dimensional (currently 2D) array, where the loads are continuous (no need for gather ops).

In 1D, the following works:

using SIMD

arr = Vector{Float64}(undef, 100)

xs = vload(Vec{4,Float64}, arr, i)

In 2D however, linear indexing fails (lets say I want to load the elements of x[1:4, 1], i.e. those of linear index 1:4:

x = zeros(100, 100)

xs = vload(Vec{4,Float64}, x, 1)

(MethodError)

A workaround is using raw pointers:

x = zeros(100, 100)
xs = vload(Vec{4,Float64}, pointer(x, 1))

Which does work, but relies on raw pointers, and thus makes debugging/boundscheck a pain (and the code starts to look an awfully C-like…).

My question is, is there a better/safer/more idiomatic way of achieving this, or should I instead rewrite the kernels to operate on explicitly 1-D arrays (thus losing some abstraction)?

If yes, are there any examples of code where SIMD.jl is used in conjunction with explicitly higher-dimensional, but continuous loads?

Thank you for your help in advance.

DNF · August 30, 2025, 11:38pm

I cannot test now, but can’t you just work on vec(x), that turns x into a vector while reusing the underlying data?

BTW, I have only ever used the indexing interface in SIMD.jl, and find that so much nicer-looking. But almost everyone seems to be using vload/vstore. What is the advantage? It just looks needlessly more verbose to me.

jl_enthusiast · August 30, 2025, 11:54pm

Just checked, vec() works with loading from 2D arrays, however, it fails if its a view of an underlying 2D array (which I will likely also need down the line, however it makes sense that it fails since the view is not guaranteed to be continuous…).

using SIMD

x = zeros(128, 128)
u = @views x[:, 1:10] #continious-in-memory view


xu = vload(Vec{4,Float64}, pointer(x, 1)) #works
xy = vload(Vec{4,Float64}, vec(x), 1) #works

xp = vload(Vec{4,Float64}, pointer(u, 1)) #works
xv = vload(Vec{4,Float64}, vec(u), 1) #fails!

As of why everyone is using vload/vestore, no idea, I am only just beginning to properly dig into low-level SIMD control.

DNF · August 31, 2025, 2:13pm

I don’t know about all this stuff with vload, but I just use the indexing interface, avoiding pointers, like this:

julia> x = collect(reshape(1:24, 6, :));

julia> u = @views x[:, 1:3]
6×3 view(::Matrix{Int64}, :, 1:3) with eltype Int64:
 1   7  13
 2   8  14
 3   9  15
 4  10  16
 5  11  17
 6  12  18

julia> lane = VecRange{4}(1);

julia> u[lane]
<4 x Int64>[1, 2, 3, 4]

julia> u[lane+1]
<4 x Int64>[2, 3, 4, 5]

julia> u[lane+3]
<4 x Int64>[4, 5, 6, 7]

julia> u[lane+1, 2]  # also use it with 2D indexing
<4 x Int64>[8, 9, 10, 11]

Edit: made a more instructive example.

jl_enthusiast · August 31, 2025, 9:45pm

I believe this works for my current use case! (generated assembly looks good for my current test cases), thank you!

Topic		Replies	Views
Help defining (masked) vload and vstore operations for SArrays (or other isbits structs) using llvmcall Internals & Design llvm	9	1427	November 9, 2018
How to make the most of SIMD.jl when number of data elements is not divisible by SIMD width Performance simd	5	230	May 22, 2025
A simple SIMD.jl loop that is slower than a vanilla `@inbounds @simd` Performance simd	8	1984	June 27, 2021
@simd with array views General Usage performance , simd	5	1276	January 4, 2019
Static view General Usage question	5	1202	January 20, 2019

SIMD.jl - vload() continuous blocks from higher-dimensional arrays?

Related topics