[ANN] FixedSizeArrays.jl: What Array probably should have been

Is it possible you introduced some type instability while switching to FixedSizeArrays.jl? Perhaps in some branch somewhere you return something like T[], which is not a FixedSizeVector? That could perhaps explain the increase in allocated memory.
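If you want to confirm that quickly, Base.return_types makes such a branch visible. A minimal sketch (f here is a made-up stand-in for the suspect function):

using FixedSizeArrays

# one branch silently falls back to Vector{Float64}
f(v) = isempty(v) ? Float64[] : v

Base.return_types(f, (FixedSizeVectorDefault{Float64},))
# -> Union{Vector{Float64}, FixedSizeVectorDefault{Float64}}, i.e. type-unstable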

If not that, perhaps it’s the issue pointed out by @foobar_lv2, where object identity would sometimes be preferable to value identity for the deduplicating effect.
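(For reference, the distinction at play, as a minimal sketch: Dict keys on value equality and so deduplicates equal-content arrays, while IdDict keys on object identity and keeps both.)

a = FixedSizeVectorDefault{Int}(undef, 2); a .= 1:2
b = FixedSizeVectorDefault{Int}(undef, 2); b .= 1:2
a == b, a === b                 # (true, false): equal values, distinct objects
length(Dict(a => 0, b => 0))    # 1: value equality deduplicates
length(IdDict(a => 0, b => 0))  # 2: object identity keeps both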


was special-cased in the compiler to be (effectively?) immutable.

This isn’t quite true. IIUC the actual “size” field was known by the compiler not to change, but the backing was still mutable (preventing optimization). Also as of 1.11, the compiler dropped this optimization since Array is just a normal Julia object.
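Indeed, on 1.11+ you can inspect Array like any other struct: it is just a mutable struct holding a MemoryRef and a size tuple (a quick check, based on my reading of the 1.11 layout):

julia> fieldnames(Array)
(:ref, :size)

julia> Base.ismutabletype(Array)
true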

julia> using FixedSizeArrays

julia> @noinline f(A::AbstractMatrix) = length(A)
f (generic function with 1 method)

julia> g() = f(FixedSizeMatrixDefault{Float64}(undef, 3, 3))
g (generic function with 1 method)

julia> h() = f(Matrix{Float64}(undef, 3, 3))
h (generic function with 1 method)

julia> code_llvm(g)
; Function Signature: g()
;  @ REPL[92]:1 within `g`
define i64 @julia_g_9806() local_unnamed_addr #0 {
top:
;  @ REPL[92] within `g`
  ret i64 9
}

julia> code_llvm(h)
; Function Signature: h()
;  @ REPL[91]:1 within `h`
define i64 @julia_h_9808() local_unnamed_addr #0 {
top:
  %gcframe1 = alloca [3 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe1, i8 0, i64 24, i1 true)
  %pgcstack = call ptr inttoptr (i64 4377493260 to ptr)(i64 4377493296) #12
  store i64 4, ptr %gcframe1, align 8
  %task.gcstack = load ptr, ptr %pgcstack, align 8
  %frame.prev = getelementptr inbounds nuw i8, ptr %gcframe1, i64 8
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe1, ptr %pgcstack, align 8
; ┌ @ boot.jl:651 within `Array`
; │┌ @ boot.jl:604 within `new_as_memoryref`
; ││┌ @ boot.jl:588 within `GenericMemory`
     %ptls_field = getelementptr inbounds nuw i8, ptr %pgcstack, i64 16
     %ptls_load = load ptr, ptr %ptls_field, align 8
     %"Memory{Float64}[]" = call noalias nonnull align 8 dereferenceable(96) ptr @ijl_gc_small_alloc(ptr %ptls_load, i32 664, i32 96, i64 4761282016) #7
     %"Memory{Float64}[].tag_addr" = getelementptr inbounds i8, ptr %"Memory{Float64}[]", i64 -8
     store atomic i64 4761282016, ptr %"Memory{Float64}[].tag_addr" unordered, align 8
     %memory_ptr = getelementptr inbounds nuw i8, ptr %"Memory{Float64}[]", i64 8
     %memory_data = getelementptr inbounds nuw i8, ptr %"Memory{Float64}[]", i64 16
     store ptr %memory_data, ptr %memory_ptr, align 8
     store i64 9, ptr %"Memory{Float64}[]", align 8
     %gc_slot_addr_0 = getelementptr inbounds nuw i8, ptr %gcframe1, i64 16
     store ptr %"Memory{Float64}[]", ptr %gc_slot_addr_0, align 8
; │└└
   %ptls_load11 = load ptr, ptr %ptls_field, align 8
   %"new::Array" = call noalias nonnull align 8 dereferenceable(48) ptr @ijl_gc_small_alloc(ptr %ptls_load11, i32 520, i32 48, i64 4761248240) #7
   %"new::Array.tag_addr" = getelementptr inbounds i8, ptr %"new::Array", i64 -8
   store atomic i64 4761248240, ptr %"new::Array.tag_addr" unordered, align 8
   %0 = getelementptr inbounds nuw i8, ptr %"new::Array", i64 8
   store ptr %memory_data, ptr %"new::Array", align 8
   store ptr %"Memory{Float64}[]", ptr %0, align 8
   %"new::Array.size_ptr" = getelementptr inbounds nuw i8, ptr %"new::Array", i64 16
   call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 8 dereferenceable(16) %"new::Array.size_ptr", ptr noundef nonnull align 8 dereferenceable(16) @"_j_const#1", i64 16, i1 false)
   store ptr %"new::Array", ptr %gc_slot_addr_0, align 8
; └
  %1 = call i64 @j_f_9812(ptr nonnull %"new::Array")
  %frame.prev14 = load ptr, ptr %frame.prev, align 8
  store ptr %frame.prev14, ptr %pgcstack, align 8
  ret i64 %1
}

Hence the title of the JuliaCon talk :slightly_smiling_face: My personal hope is that at some point we’ll be able to have a fixed-size array in Base, because it’s so useful in numerical applications. Of course, that raises the question of how to handle two different array types in Base.


Cthulhu tells me there was a type instability in my code unrelated to FixedSizeArrays. Once I fixed that, both the number of allocations and the total amount of allocated memory still go up with FixedSizeArray w.r.t. plain Array =(
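(For anyone following along, the check was along these lines; mykernel is a hypothetical stand-in for the real entry point:)

using Cthulhu
@descend mykernel(data)  # red Union{...} annotations in the typed view mark the instability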


I’ve added MutableSmallVector to my benchmarks above.

I think that FixedSizeArrays is a great package, and that the title of this thread sounds exactly right. However, looking at the benchmarks (which do not cover all sizes and element types!), I’m asking myself where, say, FixedSizeVector really shines in the current ecosystem. If you only need a mutable, indexable container, then it’s indeed a lightweight solution.

If you want algebraic operations, then for large vectors the difference from Vector seems negligible. For small vectors whose size is known in advance, MVector is faster (and so is MutableFixedVector). If the size is unknown but has a small upper bound, then MutableSmallVector appears to be the better choice. (MVector and Mutable(Fixed|Small)Vector need isbits elements, but that is usually the case.)

For which sizes and/or element types can FixedSizeVector play to its strengths? Or would it be for higher-dimensional arrays?
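For concreteness, here is the kind of head-to-head I mean, as a rough sketch (the conversion constructors are assumed from the respective packages’ documentation; results will of course vary with size and element type):

using BenchmarkTools, FixedSizeArrays, StaticArrays, SmallCollections

v  = rand(8)
fv = FixedSizeArray(v)          # size fixed at construction, stored as a value
mv = MVector{8}(v)              # size carried in the type
sv = MutableSmallVector{8}(v)   # bounded capacity, runtime length

@btime sum($v); @btime sum($fv); @btime sum($mv); @btime sum($sv)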


In my mind, the main competitor of FixedSizeArray is Base.Array itself, in the role of a general-purpose container. Specialised containers like MArray are different beasts: they can be useful when you want to do something smart in very specific cases (e.g. dispatching on the size, or when you know you only deal with very specific, small sizes), but they may also come at the cost of longer compilation latency.

The performance difference with Base.Array is probably negligible in microbenchmarks, but the benefit lies in enabling compiler optimisations that are simply impossible with the base type: improving the effect inference of an inner function (if, for example, it can be proven not to throw errors anymore) can have positive cascading effects on a larger program as a whole.
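For example, the difference shows up directly in what the compiler can prove about a call (a sketch; the exact effect bits printed vary across Julia versions):

using FixedSizeArrays

Base.infer_effects(length, (Vector{Float64},))                  # the size field of Array can change
Base.infer_effects(length, (FixedSizeVectorDefault{Float64},))  # this size never changes, so stronger effects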

I expect it’ll take some time for the ecosystem to digest a new container array type like FixedSizeArray. Also, FixedSizeArrays.jl is by no means perfect: along the way we found some small missed optimisations in Julia itself, like “inbounds propagation mostly missing in base/genericmemory.jl?” (JuliaLang/julia#56145) and “Memory stores aren’t vectorised in a `for` loop unless explicit `@inbounds` is used” (JuliaArrays/FixedSizeArrays.jl#70), so other small issues like that are possible, but ideally fixing them will have a positive impact on the whole ecosystem (it’s usually a matter of making Memory work efficiently).


How does this compare with SizedArray from StaticArrays.jl? IIRC SizedArray is also backed by normal memory, and the only notable difference I can see is that the size information is stored as a value instead of a type parameter.

Besides the size being a type parameter versus an instance value, SizedArray wraps any AbstractArray, while FixedSizeArray is a DenseArray that wraps a DenseVector (defaulting to Memory). For example, SizedArray can wrap a StepRange, but a FixedSizeArray can’t (though the constructor may copy the elements of an input AbstractArray into a Memory to wrap). Those two differences alone make the use cases fairly different, and there isn’t a straightforward answer as to which is faster. The static size of SizedArray may leverage some optimized StaticArrays code, but it could also be wrapping an AbstractArray whose getindex takes 2 minutes per call.
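A small illustration of the wrap-versus-copy difference (a sketch; the full SizedArray type parameters are spelled out to force wrapping rather than conversion):

using StaticArrays, FixedSizeArrays

r = 1:2:9                                       # a StepRange, not a dense array
s = SizedArray{Tuple{5},Int,1,1,typeof(r)}(r)   # wraps r itself
f = FixedSizeArray(r)                           # copies the elements into a Memory{Int}

s isa DenseArray, f isa DenseArray              # (false, true)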


I think the issue that this package is addressing is fundamentally the following:

julia> foo() = Base.llvmcall("""call void asm ";", "~{memory}"()
       ret void""", Nothing, Tuple{})  # the asm declares a clobber of all memory: an optimization barrier

julia> function bar(x)
       @inbounds a=x[1]
       foo()
       @inbounds b=x[1]
       a+b
       end

julia> function bar2(x)
       a=x.x
       foo()
       b=x.x
       a+b
       end

julia> mutable struct XM x::Int end

julia> @code_llvm bar2(XM(1))
; Function Signature: bar2(Main.XM)
;  @ REPL[52]:1 within `bar2`
define i64 @julia_bar2_5182(ptr noundef nonnull align 8 dereferenceable(8) %"x::XM") local_unnamed_addr #0 {
top:
;  @ REPL[52]:2 within `bar2`
; ┌ @ Base_compiler.jl:54 within `getproperty`
   %"x::XM.x" = load i64, ptr %"x::XM", align 8
; └
;  @ REPL[52]:3 within `bar2`
; ┌ @ REPL[50]:1 within `foo`
   call void asm ";", "~{memory}"() #4
; └
;  @ REPL[52]:4 within `bar2`
; ┌ @ Base_compiler.jl:54 within `getproperty`
   %"x::XM.x1" = load i64, ptr %"x::XM", align 8
; └
;  @ REPL[52]:5 within `bar2`
; ┌ @ int.jl:87 within `+`
   %0 = add i64 %"x::XM.x1", %"x::XM.x"
   ret i64 %0
; └
}

julia> mutable struct XI const x::Int end

julia> @code_llvm bar2(XI(1))
; Function Signature: bar2(Main.XI)
;  @ REPL[52]:1 within `bar2`
define i64 @julia_bar2_5188(ptr noundef nonnull align 8 dereferenceable(8) %"x::XI") local_unnamed_addr #0 {
top:
;  @ REPL[52]:3 within `bar2`
; ┌ @ REPL[50]:1 within `foo`
   call void asm ";", "~{memory}"() #4
; └
;  @ REPL[52]:5 within `bar2`
; ┌ @ int.jl:87 within `+`
   %.unbox = load i64, ptr %"x::XI", align 8
   %0 = shl i64 %.unbox, 1
   ret i64 %0
; └
}

We see that the const allows us to avoid reloading. Now,

julia> @code_llvm bar([1])
; Function Signature: bar(Array{Int64, 1})
;  @ REPL[51]:1 within `bar`
define i64 @julia_bar_5191(ptr noundef nonnull align 8 dereferenceable(24) %"x::Array") local_unnamed_addr #0 {
top:
;  @ REPL[51]:2 within `bar`
; ┌ @ essentials.jl:953 within `getindex`
   %memoryref_data = load ptr, ptr %"x::Array", align 8
   %0 = load i64, ptr %memoryref_data, align 8
; └
;  @ REPL[51]:3 within `bar`
; ┌ @ REPL[50]:1 within `foo`
   call void asm ";", "~{memory}"() #5
; └
;  @ REPL[51]:4 within `bar`
; ┌ @ essentials.jl:953 within `getindex`
   %memoryref_data6 = load ptr, ptr %"x::Array", align 8
   %1 = load i64, ptr %memoryref_data6, align 8
; └
;  @ REPL[51]:5 within `bar`
; ┌ @ int.jl:87 within `+`
   %2 = add i64 %1, %0
   ret i64 %2
; └
}

julia> @code_llvm bar(zeros(Int, 1, 1))
; Function Signature: bar(Array{Int64, 2})
;  @ REPL[51]:1 within `bar`
define i64 @julia_bar_5195(ptr noundef nonnull align 8 dereferenceable(32) %"x::Array") local_unnamed_addr #0 {
top:
;  @ REPL[51]:2 within `bar`
; ┌ @ essentials.jl:953 within `getindex`
   %memoryref_data = load ptr, ptr %"x::Array", align 8
   %0 = load i64, ptr %memoryref_data, align 8
; └
;  @ REPL[51]:3 within `bar`
; ┌ @ REPL[50]:1 within `foo`
   call void asm ";", "~{memory}"() #5
; └
;  @ REPL[51]:4 within `bar`
; ┌ @ essentials.jl:953 within `getindex`
   %memoryref_data5 = load ptr, ptr %"x::Array", align 8
   %1 = load i64, ptr %memoryref_data5, align 8
; └
;  @ REPL[51]:5 within `bar`
; ┌ @ int.jl:87 within `+`
   %2 = add i64 %1, %0
   ret i64 %2
; └
}

That is a missed optimization! Julia could “simply” specialize code emission such that non-Vector Arrays have a constant MemoryRef and size. If that optimization were still present today, we would be able to work around Vector’s resizeability by using n×1 matrices. (In the Matrix case, the load %memoryref_data5 = load ptr, ptr %"x::Array", align 8 is the redundant one: the contents of the array are not constant and must be reloaded since we clobbered memory, but the backing pointer of a non-resizable Matrix never actually changes.)

(The main useful takeaway here is the code snippet with the llvmcall asm memory clobber, which is a handy way to test such optimizations.)
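Applied here, one would hope to see the pointer load survive CSE for a FixedSizeArray, with only the element load repeated after the clobber (a sketch, reusing bar from above):

using FixedSizeArrays
x = FixedSizeVectorDefault{Int}(undef, 1); x[1] = 1
@code_llvm bar(x)  # expect a single load of the backing pointer across foo()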


Is it possible to use, e.g., the 64-byte-aligned memory provided by AlignedAllocs.jl as the backing memory for a FixedSizeVector?

The memory backend can be any DenseVector:
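For example, a plain Vector works as the backing store (a sketch, using the same wrapping constructor as in the next post):

using FixedSizeArrays

backing = zeros(Float64, 4)   # any DenseVector
fv = FixedSizeVector{Float64, Vector{Float64}}(backing)
fv isa DenseVector{Float64}   # true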


What am I missing? Or what do I need to define/specialize for this to work?

using AlignedAllocs, FixedSizeArrays

alignedmem = memalign_clear(Float32, 8)   # aligned buffer of 8 Float32s
alignedmem[:] = 1.0f0:8.0f0
alignment(alignedmem) == 64 # true, or a higher power of 2

# wrapping the aligned buffer seems to lose the extra alignment:
fixedmem = FixedSizeVector{Float32, typeof(alignedmem)}(alignedmem)
alignment(fixedmem) == 32     # expected 64
alignment(fixedmem.mem) == 32
# and likewise for the converting constructor:
alignment(FixedSizeVector{Float32}(alignedmem)) == 32

Sorry, I’m not very familiar with AlignedAllocs, maybe let’s move this discussion to another thread?

Done, split to this topic.


A post was merged into an existing topic: FixedSizeVectors with cache-aligned memory?