Performant creation of vector of SVectors given a known formula

I want to create a vector of SVectors. I have to create all SVectors on the spot, given a formula that I know, and given also the dimensionality I know. The dimensionality can also be type-parameter, there is no problem with that (I already have a type that has it as a type parameter).

The problem is, the formula is accessing the elements of an array one by one, so I cannot reduce it to basic julia functions where SVectors know how to do their thing.

In short, this is what I am doing now:

function reconstruct(s::AbstractVector, D::Int,  Ď„::Int)
    L = length(s) - (D-1)*Ď„;
    T = eltype(s)
    data = Vector{SVector{D, T}}(L)
    for i in 1:L
        gen = (s[i+k*Ď„] for k in 0:(D-1))
        data[i] = SVector{D,T}(gen...)
    end
    V = typeof(s)
    T = eltype(s)
    return data
end

(once again: both Ď„ and D can be made type-parameters if need be)

I am quite sure that using a generator is not the most performant thing to do. I got a hunch that the most performant version would involve meta-programming and @generated? But unfortunately I do not have any metaprogramming skills whatsoever currently so I am asking for assistance!

If you get used to metaprogramming its actually not that hard :slight_smile:

This would be the generated function (where you have to make D a type parameter)

function reconstruct_impl(::Type{Val{D}}) where D
    gens = [:(s[i + $k*Ď„]) for k=0:D-1]

    quote
        L = length(s) - ($(D-1))*Ď„;
        T = eltype(s)
        data = Vector{SVector{$D, T}}(L)
        for i in 1:L
            data[i] = SVector{$D,T}($(gens...))
        end
        V = typeof(s)
        T = eltype(s)
        data
    end
end

@generated function reconstruct(s::AbstractVector, ::Val{D},  Ď„::Int) where D
    reconstruct_impl(Val{D})
end

You can check the generated code with reconstruct_impl(::Val{4}) and to call reconstruct you would have to do reconstruct(s, Val{4}(), Ď„).

I’d just add that using the Size trait for propagating sizes of static arrays is a good way to go.

We’d also like it easier to map over a “static” unit range, use generators & comprehensions, and so-on. ntuple(f, Val{D}) is often useful, also.

julia> function reconstruct2(s::AbstractVector{T}, V::Type{SVector{D, T}},  ?::Int) where {T, D}
           L = length(s) - (D-1)*? 
           v = zeros(V, L)
           for i in 1:L
               let i = i
                   v[i] = V(ntuple(k -> s[i+(k-1)*?], Val(D)))
               end
           end
           return v
       end

julia> a = rand(10^6);

julia> @btime reconstruct2(a, SVector{3, Float64}, 3);
  9.096 ms (2 allocations: 22.89 MiB)

seems pretty fast. Should be no need for a generated function here.

Thanks all for the kind replies!

@kristoffer.carlsson the first suggestion by Sascha benchmarks at 10.983 ms and yours at 15.318 ms in my machine (just for reference purposes).

Since you were all so kind to reply, maybe you will also reply to the following 3 questions as well:

  1. @kristoffer.carlsson why is this very mistical let i = i necessary?
  2. @saschatimme Why does this work without having to make Ď„ a type parameter as well?
  3. Is it possible to make this function a constructor for a type Reconstruction{T, D, τ} instead? I’ve been trying to make it happen but it seems the only way is to pass a type ::Reconstruction{T, D, τ} as an argument to the generated function itself.

Metaprogramming scares me but I really have to learn it to take full advantage of Julia!

1 Like

Because I don’t use the information during the code generation. You could also make τ a type parameter and then precompute the products 1τ, 2τ etc. But keep in mind that there will be a new function compiled for each type combination, so if the τ changes often this may be not a good idea.

Is it possible to make this function a constructor for a type Reconstruction{T, D, τ} instead? I’ve been trying to make it happen but it seems the only way is to pass a type ::Reconstruction{T, D, τ} as an argument to the generated function itself.

Can you not just define the constructor and then pass the type information to the generated function?

Yes, thank you. Doing something as simple as:


Reconstruction(s::AbstractVector{T}, D, Ď„) where {T} =
Reconstruction{T, D, Ď„}(reconstruct2(s, Val{D}(), Ď„))

worked fine!

A very simple solution that does not require you to play around too much with the type system and allows you to reason much better about data layout/performance:

function reconstruct_pointy(s::AbstractVector, D::Int,  tau::Int)
    L = length(s) - (D-1)*tau;
    T = eltype(s)
    assert(isbits(T))
    data = Vector{SVector{D, T}}(L)
    asmat = unsafe_wrap(Array, reinterpret(Ptr{T},pointer(data)), (D,L), false )
    for i in 1:L
        @simd for k in 0:(D-1)
            @inbounds  asmat[k+1,i] = s[i+k*tau]
        end
    end
    return data
end

Edit: This works because Vector{SVector} and Matrix are just syntactic sugar around the same bits.

If we have to write code like this, that would imo be a big failure.

6 Likes

As far as I know, there are various proposals underway to make mutation of immutables that are accessed by-reference easier and give it well-defined multi-threaded semantics.

This is not here yet, but ultimately the correct code should look like:

function reconstruct_future(s::AbstractVector, D::Int,  tau::Int)
    L = length(s) - (D-1)*tau;
    T = eltype(s)
    assert(isbits(T))
    data = Vector{SVector{D, T}}(L)
    for i in 1:L
        @simd for k in 1:D
            @inbounds  data[i][k] = s[i+(k-1)*tau]
        end
    end
    return data
end

But yes, this is currently a limitation (especially: Suppose you want to update data[i][k] without writing to data[i][k+1] which is possibly used by a different thread; currently this is undefined, and the compiler mostly just writes the fields you want to write, except when it doesn’t).

data[i] is immutable so I’m not sure what you mean with “updating” it. If you have multiple the location of data[i] that sounds like a race condition.

Matrix and Vector{SVector} are synonymous. You can independently update data[i][k]===asmat[k,i] for different k, same i, in different threads without any races; syntax that makes this hard without pointer-games is a shortcoming of julia, not the machine.

Today’s syntax requires us to “overwrite” the entirety of data[i] with the copied values. In 99% of the cases, the compiler removes the spurious reads and writes, such that this update only affects the single memory address of data[i][k]. If both threads try to update data[i][k1] and data[i][k2] at the same time, and the compiler was clever, there is no race condition. If the compiler was unclever, we get a race condition (e.g. thread one reads data[i][k1] and then stores it back, overwriting the update by thread two).

Yuyichao explained the current state in there https://discourse.julialang.org/t/on-modifying-immutables-by-reference-in-multithreaded-code/6580 and linked to the PR https://github.com/JuliaLang/julia/pull/21912. I really hope for Keno’s fix to be accepted at some point, in some variant.

Reinterpreting a matrix is a one-liner; dealing with fieldoffsets in Vector{some_struct} is much, much worse syntax.

I’m not so sure if the following completely satisfies your goal, but it performs 115X faster than your original implementation.

using BenchmarkTools
using StaticArrays

function reconstruct_insane(s::AbstractVector, D::Int,  Ď„::Int)
    L = length(s) - (D-1)*Ď„;
    T = eltype(s)
    data = Array{T}(D,L)
    for i = 0:D-1
      data[i+1,:] = s[1+i*Ď„:L+i*Ď„]' 
    end
    out = reinterpret(SVector{D,T}, data, (L,1))
end

const a = rand(10^6)
@btime reconstruct(a, 3, 4);
@btime reconstruct_insane(a, 3, 4);

  1.913 s (27997761 allocations: 1.26 GiB)
  16.554 ms (16 allocations: 45.78 MiB)

reinterpret gives a ReinterpretedArray on 0.7.

This gets 2x faster in my PC

function reconstruct(a, D, Ď„)
    T = eltype(a)
    n = length(a)-Ď„*(D-1)    
    out = Vector{T}(n*D)
    for i in 1:n, j in 1:D
        out[D*(i-1)+j] = a[i+(j-1)*Ď„]
    end
    
    return reinterpret(SVector{D, T}, out, (n,))
end

I second your question here. @kristoffer.carlsson, what’s the purpose of the let i = i?

Workaround for performance of captured variables in closures · Issue #15276 · JuliaLang/julia · GitHub

Regarding the reinterpretation variants:

It is better to allocate as Vector{SVector}, reinterpret for modification and return the original. That way all the “dirty-ness” is contained in your function (maybe even put a @noinline in front if your reinterpretation uses extra-dirty pointer games-- for example, my version could probably segfault if inlined and the result is never used).