Help with reducing allocations when indexing into custom Array

Hi everyone,
I am quite new to the language and was wondering if people could help me optimise this piece of code. The idea for this array type is to reduce the total Base.summarysize of a vector of vectors as much as possible. I did this by creating a type that holds a “pointer” into the actual vector, like StridedArrays, but more minimal, since the size of the array in StridedArrays does not really change.

struct FMat{N} <: AbstractArray{Int,1}
  data::Vector{<:Integer}
  @inline function FMat(vals::Vector{Vector{T}}) where {T}
    N = size(vals, 1)
    flattened = reduce(vcat, vals)
    A = Array{T,1}(undef, N + 1 + size(flattened, 1))
    A[1] = 1
    A[N+2:end] = flattened
    A[2:N+1] = accumulate(+, size(i, 1) for i in vals) .+ 1
    new{N}(A)
  end

  @inline function FMat{N}(vals::Vector{<:Int}) where {N}
    new{N}(vals)
  end
end

Base.length(::FMat{N}) where {N} = N

Base.size(::FMat{N}) where {N} = (N,)

@inline function Base.getindex(A::FMat{N}, i::Int) where {N}
  return A.data[A.data[i]+N+1:A.data[i+1]+N]
end

@inline function Base.getindex(A::FMat{N}, u::UnitRange) where {N}
  i_start = u.start
  i_stop = u.stop
  @views indices = A.data[i_start:i_stop+1] .- A.data[i_start] .+ 1
  a = vcat(indices, A.data[N+A.data[i_start]+1:N+A.data[i_stop+1]])
  return FMat{length(u)}(a)
end

Base.setindex!(A::FMat{N}, ::Int, ::Int) where {N} = A

Base.IndexStyle(::Type{<:FMat}) = IndexLinear()

The code is able to halve the total memory size of the vector of vectors, which I was quite happy with. But when I checked the actual performance of the indexing, it turned out to be ~5x slower.

running

elements = Vector{Vector{Int}}()
for i in 1:7
  nrows = rand(UInt16) + 1
  for _ in 1:nrows
    push!(elements, abs.(rand(Int16, i)))
  end
end
fm = FMat(elements)

julia> @btime fm[end]
  190.954 ns (9 allocations: 240 bytes)
julia> @btime elements[end]
  30.271 ns (1 allocation: 16 bytes)

If I instead change it to

julia> @btime @view fm[end]
  28.810 ns (1 allocation: 48 bytes)

This of course is a big improvement, but I don’t want to keep writing @view every time I want to get an element.

julia> @btime fm[end-5:end]
  2.708 μs (29 allocations: 1.77 KiB)
julia> @btime elements[end-5:end]
  118.097 ns (6 allocations: 192 bytes)
julia> @btime @view fm[end-5:end]
  80.627 ns (3 allocations: 96 bytes)
julia> @btime @view elements[end-5:end]
  94.805 ns (5 allocations: 128 bytes)

which in fact is a lot better. Any help would be greatly appreciated!

I don’t have the time to look into the details of this right now, but the type of the data field is abstract (Vector{<:Integer}). There is no way for the compiler to determine the memory layout of FMat from the type tag itself. You should make the element type of data a type parameter, like this:

struct FMat{N, T}
    data::Vector{T}
end
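As a minimal sketch of why this matters (using hypothetical names so they don’t clash with the definitions above), you can compare the field types directly: with an abstract field the compiler cannot know the layout from the type alone, while the parameterised version has a concretely typed field.

```julia
struct FMatAbstract{N}
    data::Vector{<:Integer}          # abstract field type: layout unknown
end

struct FMatTyped{N,T<:Integer}       # element type lifted into a parameter
    data::Vector{T}                  # concrete field type for any concrete T
end

# Only in the second case can the compiler determine the layout:
isconcretetype(fieldtype(FMatAbstract{3}, :data))     # false
isconcretetype(fieldtype(FMatTyped{3,Int}, :data))    # true
```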

The first thing you should do is improve the benchmarking. Observe

julia> @btime elements[end]
  38.625 ns (1 allocation: 16 bytes)

julia> @btime $elements[end]
  2.000 ns (0 allocations: 0 bytes)

The variable elements is a global, which can significantly mess up benchmarking, so when using BenchmarkTools, interpolate global variables with $.

Secondly, I suggest not using the length N as a type parameter, but instead use the element type of the inner vectors, so this, instead:

struct FMat2{T} <: AbstractVector{T}
    data::Vector{T}
    len::Int
end

Using a type parameter for length means your code will almost always be type unstable, for example when doing slices. The compiler cannot predict the output type, and that is bad.
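To see the instability concretely, here is a toy type (hypothetical, just for illustration) with the length stored in the type: the result type of a slice depends on a runtime value, so inference can only give the abstract type.

```julia
struct LenVec{N}                 # toy type: length stored in the type
    data::Vector{Int}
end

# The result is LenVec{length(r)}, a value only known at runtime,
# so the compiler cannot infer a concrete return type for this call.
slice(v::LenVec, r::UnitRange{Int}) = LenVec{length(r)}(v.data[r])

v = LenVec{5}(collect(1:5))
slice(v, 2:4)                    # a LenVec{3}, but not inferrable from the argument types
```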

Or perhaps you don’t need any type parameter at all, if you always want Int.

Here’s a simple implementation. Two notes: I personally never use @inline, as the compiler decides better than I do; and I strongly advise you not to use u.start or u.stop for ranges, as those are internal implementation details and not part of the API. Use first(u) and last(u).

struct FMat2{T} <: AbstractVector{T}
    data::Vector{T}
    len::Int
    FMat2{T}(vals::Vector{T}, n=1) where {T<:Integer} = new{T}(vals, n)
end
function FMat2(vals::Vector{Vector{T}}) where {T<:Integer}
    N = length(vals)
    M = sum(length, vals)
    A = Vector{T}(undef, N + 1 + M)
    A[1] = 1
    len = 0
    i, j = 1, N+1
    for vv in vals
        A[i+=1] = (len += length(vv)) + 1
    end
    for vv in vals
        for v in vv
            A[j+=1] = v
        end
    end
    return FMat2{T}(A, N)
end
Base.length(fm::FMat2) = fm.len
Base.size(fm::FMat2) = (length(fm),)

function Base.getindex(A::FMat2, i::Int)
    N = length(A)
    return A.data[A.data[i]+N+1:A.data[i+1]+N]
end

function Base.getindex(A::FMat2{T}, u::UnitRange) where {T}
    i_start = first(u)
    i_stop = last(u)
    N = length(A)
    indices = @views A.data[i_start:i_stop+1] .- A.data[i_start] .+ 1
    a = vcat(indices, @view A.data[N+A.data[i_start]+1:N+A.data[i_stop+1]])
    return FMat2{T}(a, length(u))
end

Base.setindex!(A::FMat2{T}, ::Int, ::Int) where {T} = A

Base.IndexStyle(::Type{<:FMat2}) = IndexLinear()

Try benchmarking, but remember the $.

I have not put @views on getindex, since I’m not sure what you are actually looking for, but if you do, scalar getindex is almost as fast as indexing directly into elements.

Have you considered creating two vectors? One for the indices, and one for the data?


Wow, this was incredibly helpful. Thank you very much. I did notice that your way of constructing the FMat is exactly the same speed, with the same number and size of allocations, as mine, which is nice to see. Indexing is now quite close to elements[index]:

julia> @btime last($fm)
  3.443 ns (0 allocations: 0 bytes)
julia> @btime last($elements)
  2.218 ns (0 allocations: 0 bytes)

Still, the time it takes is ~1.5x longer, but that is already way better. Especially the parametric type carrying the element type information made a big difference.

I will check if splitting vector into a separate vector for the indices and one for elements makes a big difference in performance.

Yeah, this was the initial idea.
So the FMat will be used to represent elements in a finite element mesh. In the end I will create one for every dimension, so 0D, 1D, 2D, and 3D. All of which eventually will be stored in a struct, or vector. However, I recently watched a video on data-oriented design, which indicated that a vector of structs could increase the size of the data by a lot because of padding and alignment. I didn’t know whether this was an issue here, so I ended up choosing a single vector. I am going to make the other structure now, and compare the two.

Oh, on my computer my constructor is approximately twice as fast, and with half as much allocation.

Depends on the data types, and which format is better depends on how you are accessing it. In any case, you could use StructArrays.jl to make the API look like an array of structs when internally it is stored as a struct of arrays.
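A rough sketch of what the StructArrays.jl approach could look like (the Node type here is made up for illustration):

```julia
using StructArrays

struct Node
    x::Float64
    y::Float64
end

# Looks like a Vector{Node}, but is stored internally as one array per field:
nodes = StructArray([Node(1.0, 2.0), Node(3.0, 4.0)])

nodes[1]     # array-of-structs style access, returns a Node
nodes.x      # struct-of-arrays style column access, returns the x values
```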

There are quite a few finite-element libraries in Julia already; I would look at how they store their meshes and data even if you want to write your own.

Consider using StaticArrays and just having a Vector{SVector{…}}. (edit: bad idea, mea culpa)

That is the wrong response!

You now learned that there is something, something about alignment padding, that affects vectors of structs.

Nice! Now you know that there is something that you could learn, and what keywords to google, and once you do, you are empowered to worry about the issue.

Since you have not learned the details yet, you cannot take any other action to mitigate alignment padding. Using half-knowledge to maybe avoid a maybe-issue that you don’t understand at all is super counter-productive!

(I personally think that you should learn the details. But reasonable opinions on that may differ – do you learn programming with assembly or with lisp / julia? Matter of taste. If you knew any details, you’d know that alignment is not an issue for your constructions)

I agree with you on this. However, I don’t have enough time to figure out all the details to completely understand it. So I thought it wasn’t a bad idea to choose the approach, which I knew has a small footprint, but is a little bit harder to implement.

This, on the other hand, I do not agree with. The README.md clearly states that for larger arrays a regular array gives better performance:

Note that in the current implementation, working with large StaticArrays puts a lot of stress on the compiler, and becomes slower than Base.Array as the size increases. A very rough rule of thumb is that you should consider using a normal Array for arrays larger than 100 elements.

Or do you mean that the FMat should instead use a Vector{SVector{Int}}, where the inner SVector{Int} contains the node numbers? If so, would that really improve performance? I am not doing linear algebra operations with the vectors, but instead using them as indices into the node matrix.

I think I misunderstood what you’re doing here. Sorry! Disregard my ill-considered advice on that.

Having read your code more, I think I would probably go with two backing arrays?

struct FMat{T}
    ptr::Vector{Int}
    payload::Vector{T}
end
Base.getindex(f::FMat, i) = view(f.payload, f.ptr[i]:f.ptr[i+1]-1)
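A self-contained sketch of that layout, with a constructor added and a hypothetical name so it doesn’t clash with the FMat above, using the convention that ptr[i]:ptr[i+1]-1 spans the i-th element’s data:

```julia
struct TwoVec{T}
    ptr::Vector{Int}       # ptr[i]:ptr[i+1]-1 indexes the i-th element's data
    payload::Vector{T}     # all inner vectors concatenated
end

function TwoVec(vals::Vector{Vector{T}}) where {T}
    ptr = Vector{Int}(undef, length(vals) + 1)
    ptr[1] = 1
    for (i, v) in enumerate(vals)
        ptr[i+1] = ptr[i] + length(v)   # cumulative offsets
    end
    return TwoVec(ptr, reduce(vcat, vals))
end

Base.getindex(f::TwoVec, i::Int) = view(f.payload, f.ptr[i]:f.ptr[i+1]-1)

f = TwoVec([[1, 2], [3], [4, 5, 6]])
f[3]    # a view of the third inner vector: [4, 5, 6]
```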

One of the big weirdnesses of your design is that the length of your FMat is part of the type. Is that really appropriate?