[ANN] ArrayAllocators.jl v0.3 composes with OffsetArrays.jl v1.12.1+ for faster zeros with offset indexing

ArrrayAllocators.jl v0.3.0+ and OffsetArrays.jl v1.12.1+ now compose. This means that you can now construct an OffsetArray of 0s using the following equivalent lines of code.

julia> using OffsetArrays, ArrayAllocators

julia> OA = OffsetArray{Int}(undef, -512:511); fill!(OA, 0);

julia> OA_calloc = OffsetArray{Int}(calloc, -512:511);

julia> OA[-512] == OA_calloc[-512]

julia> isequal(OA_calloc, OA)

In some cases, this can yield a reduction in the initialization time of the OffsetArray as illustrated below.

julia> using BenchmarkTools

julia> @btime begin
           OA2 = OffsetArray{Int}(undef, -512:511, -512:511)
           fill!(OA2, 0)
  1.483 ms (2 allocations: 8.00 MiB)

julia> @btime begin
           OA2 = OffsetArray{Int}(calloc, -512:511, -512:511)
  1.089 ms (8 allocations: 8.00 MiB)

julia> @btime begin
           fill!(OA2, 1)
       end setup = (OA2 = OffsetArray{Int}(undef, -512:511, -512:511));
  1.521 ms (0 allocations: 0 bytes)

julia> @btime begin
           fill!(OA2_calloc, 1)
       end setup = (OA2_calloc = OffsetArray{Int}(calloc, -512:511, -512:511));
  1.312 ms (0 allocations: 0 bytes)

I encourage you to vigorously benchmark your application when using this as the potential optimization may be operating system and hardware dependent.

Also with the subpackage NumaAllocators v0.2.0, you may now directly allocate an OffsetArray on a specific non-uniform memory architecture (NUMA) node.

julia> using OffsetArrays, NumaAllocators

julia> OA_numa_0 = OffsetArray{Int}(numa(0), -512:512, -9:9);

The composition was enabled by implementing Base.unsafe_wrap for OffsetArray. Neither package depends directly upon the other, so ensure that both packages are at the required version or later.

julia> ptr = Libc.calloc(1024, 8)
Ptr{Nothing} @0x0000000004e3ae10

julia> OA3 = unsafe_wrap(OffsetArray, Ptr{Int}(ptr), 1024);

If you missed the original ArrayAllocators.jl announcement, see below.


Another feature of ArrayAllocators.jl v0.3 is an alternate implementation of zeros via calloc. Note that this ArrayAllocators.zeros is not exported. Here ArrayAllocators.zeros(T, ...) is essentially just Array{T}(calloc, ...).

julia> using ArrayAllocators

julia> @time AAZ = ArrayAllocators.zeros(Int, 1024, 1024, 1024);
  0.029364 seconds (61.31 k allocations: 8.004 GiB, 99.89% compilation time)

julia> @time AAZ = ArrayAllocators.zeros(Int, 1024, 1024, 1024);
  0.000037 seconds (4 allocations: 8.000 GiB)

julia> @time BZ = Base.zeros(Int, 1024, 1024, 1024);
  4.448959 seconds (2 allocations: 8.000 GiB, 0.52% gc time)

julia> @time BZ = Base.zeros(Int, 1024, 1024, 1024);
  4.665584 seconds (2 allocations: 8.000 GiB, 2.48% gc time)

Note that on some operating systems this may defer the actual allocation of memory until writing.

julia> @time fill!(AAZ, 1)
  4.849603 seconds (8.17 k allocations: 491.429 KiB, 0.64% compilation time)

julia> @time fill!(AAZ, 2);
  1.634943 seconds

julia> @time fill!(BZ, 1);
  1.710879 seconds

julia> @time fill!(BZ, 2);
  1.717882 seconds

If you wanted to switch your code to use ArrayAllocators.zeros instead of Base.zeros, you can import it before your first reference to zeros.

julia> using ArrayAllocators: zeros

julia> @time zeros(Int, 1024, 1024, 1024);
  0.000028 seconds (4 allocations: 8.000 GiB)

julia> @time Base.zeros(Int, 1024, 1024, 1024);
  4.442414 seconds (2 allocations: 8.000 GiB, 0.10% gc time)

While this may look like a free lunch, note @Sukera’s caveats in the original post:

1 Like

Your package was a cool read! And as I somewhat picked up from the reading, calloc makes the bits 0 but not the elements zero, so I’ll watch out for that:

julia> using ArrayAllocators

julia> struct Zerone{T<:Real}

julia> Base.zero(::Type{Zerone{T}}) where T = Zerone{T}(oneunit(T))

julia> Z = zeros(Zerone{Float64}, 2, 2)
2×2 Matrix{Zerone{Float64}}:
 Zerone{Float64}(1.0)  Zerone{Float64}(1.0)
 Zerone{Float64}(1.0)  Zerone{Float64}(1.0)

julia> U = Array{Zerone{Float64}}(calloc, 2, 2)
2×2 Matrix{Zerone{Float64}}:
 Zerone{Float64}(0.0)  Zerone{Float64}(0.0)
 Zerone{Float64}(0.0)  Zerone{Float64}(0.0)

julia> Za = ArrayAllocators.zeros(Zerone{Float64}, 2, 2)
2×2 Matrix{Zerone{Float64}}:
 Zerone{Float64}(0.0)  Zerone{Float64}(0.0)
 Zerone{Float64}(0.0)  Zerone{Float64}(0.0)

I do have a question, though, if the array allocation is only deferred until next setindexing, am I right in thinking that alone would not improve performance? That the only scenario where I can save allocation time is if my OS has preallocated pages of 0s?

1 Like

Indeed - if “blob of zeros” is not a valid representation for your type, just using calloc will not help you.

Sort of. In general it depends on how your OS is handling calls to calloc in the libc you’re using. Linux at least has a dedicated zeroed page, to allow reads from calloced memory to be fast, but this of course doesn’t help with writes - the first write still has to allocate a page, which takes time. So in effect, the time an allocation takes is moved from the calloc call to the first real use of the memory.

There’s a bunch of discussion in this related thread:

but I personally don’t think calloc is a good idea, since it can mask performance problems due to the time being spent no longer being directly linked to the original allocation.

1 Like

My recommedation is to throughly benchmark your application. ArrayAllocators.jl makes it easier to switch allocation methods by providing a mechanism through the standard syntax.

Yes, the benchmarks will be confounded as some write operations will include allocation time.

The time savings will depend how you write. Allocation is not necessarily an all or none situation. Your operating system allocates “pages” of memory at a time. If you are not going to write to the entire array, there are potential savings. If you are going to write to all elements of the array, then this will not be very useful over undef.

1 Like