How to implement efficient custom array

How to implement a custom StridedArray or DenseArray?

I have some C-code which returns a pointer to an array, which I can wrap using unsafe_wrap. I need to build a custom array-implementation, but operations on this custom array is a lot slower than on the array (wrapped or not):

abstract type AbstractFooArray{T,N} <: DenseArray{T,N}

mutable struct FooArray{T,N} <: AbstractFooArray{T,N}

Base.IndexStyle(::Type{<:AbstractFooArray{T,N}}) where {T,N} = IndexLinear()
Base.getindex(a::AbstractFooArray{T,N}, i::Int) where {T,N} = a.array[i]
Base.setindex!(a::AbstractFooArray{T,N}, v, i::Int) where {T,N} = a.array[i] = v

Base.size(a::AbstractFooArray{T,N}) where {T,N} = size(a.array)

#Base.strides(a::AbstractFooArray{T,N}) where {T,N} = strides(a.array)
#Base.unsafe_convert(::Type{Ptr{T}}, a::AbstractFooArray{T,N}) where {T,N} = unsafe_convert(Ptr{T}, a.array)
#Base.has_fast_linear_indexing(a::AbstractFooArray{T,N}) where {T,N} = true

a = zeros(UInt8, (3, 2448, 2048));
b = unsafe_wrap(Array, pointer(a), size(a));

a_f = FooArray(a)
b_f = FooArray(b)
@time sum(a)
@time sum(b)
  0.002121 seconds (4 allocations: 160 bytes)
  0.002241 seconds (4 allocations: 160 bytes)
@time sum(a_f)
@time sum(b_f)
  0.562584 seconds (16 allocations: 400 bytes)
  0.546441 seconds (16 allocations: 400 bytes)

The reported numbers are after compilation warm-up (second run).

There might be other things but a biggy is that DenseArray is an abstract type and thus having a type field with that type is slow. See performance tips in the docs.


Thanks - that helped a lot - brought the gap to a factor ~ 5:

  0.002038 seconds (4 allocations: 160 bytes)
  0.002234 seconds (4 allocations: 160 bytes)
  0.010350 seconds (4 allocations: 160 bytes)
  0.009492 seconds (4 allocations: 160 bytes)

Maybe you don’t need a custom array type? You can create a Julia array that is backed by C-allocated memory. Unless you need different behavior a new type is unnecessary.

True, but in this case I would like to add reference counting with a custom dispose function to enable re-use of large arrays (returning it to a pool of arrays from which it can be re-used):