How to get a zero-overhead view?

In C, I pass around pointers and lengths, constructing views with pointer arithmetic. This is annoying to debug, but quite performant. Is it possible to get equivalent performance with Julia views? I am okay with segfaulting on out of bounds reads or writes.

So far, I’ve tried three view-like approaches, and have not been able to match the performance of manual pointer handling.

My implementations and benchmarks
abstract type ViewLike{T} <: AbstractVector{T} end

struct PointerView{T} <: ViewLike{T}
    ptr::Ptr{T}
    len::Int
end
Base.view(v::PointerView, range::UnitRange) = PointerView(v.ptr+(first(range)-1)*sizeof(v.ptr), length(range))
Base.getindex(v::PointerView, i::Int) = unsafe_load(v.ptr, i)
Base.setindex!(v::PointerView, x, i::Int) = unsafe_store!(v.ptr, x, i)
Base.size(v::PointerView) = (v.len,)
PointerView(v::Vector) = PointerView(pointer(v), length(v))

struct LoHiView{T} <: ViewLike{T}
    data::Vector{T}
    lo::Int
    hi::Int
end
Base.view(v::LoHiView, range::UnitRange) = LoHiView(v.data, v.lo+first(range)-1, v.lo+last(range)-1)
Base.getindex(v::LoHiView, i::Int) = @inbounds v.data[v.lo+i-1]
Base.setindex!(v::LoHiView, x, i::Int) = @inbounds v.data[v.lo+i-1] = x
Base.size(v::LoHiView) = (v.hi-v.lo+1,)
HiLoView(v::AbstractVector) = LoHiView(v, firstindex(v), lastindex(v))

struct ViewView{T} <: ViewLike{T}
    data::SubArray{T, 1, Vector{T}, Tuple{UnitRange{Int64}}, true}
end
Base.view(v::ViewView, range::UnitRange) = @inbounds ViewView(view(v.data, range))
Base.getindex(v::ViewView, i::Int) = @inbounds v.data[i]
Base.setindex!(v::ViewView, x, i::Int) = @inbounds v.data[i] = x
Base.size(v::ViewView) = size(v.data)
ViewView(v::AbstractVector) = ViewView(view(v, 1:length(v)))

function merge!(target::ViewLike, source::ViewLike)
    target_i = source_i = 1
    source_in_target_i = length(source)+1
    hi = length(target)
    while target_i < source_in_target_i <= hi
        if target[source_in_target_i] < source[source_i]
            x = target[source_in_target_i]
            source_in_target_i += 1
        else
            x = source[source_i]
            source_i += 1
        end
        target[target_i] = x
        target_i += 1
    end
    while target_i < source_in_target_i
        x = source[source_i]
        target[target_i] = x
        source_i += 1
        target_i += 1
    end
end

function mergesort_to!(target::ViewLike, source::ViewLike)
    len = length(source)
    len > 0 || return
    if len == 1
        target[1] = source[1]
        return
    end
    m = len >> 1
    mergesort_to!(view(target, m+1:len), view(source, m+1:len))
    mergesort!(view(source, 1:m), view(target, 1:m))
    merge!(target, view(source, 1:m))
end

function mergesort!(data::ViewLike, scratch::ViewLike)
    len = length(data)
    len > 1 || return
    m = len >> 1
    mergesort!(view(data, m+1:len), scratch)
    mergesort_to!(view(scratch, 1:m), view(data, 1:m))
    merge!(data, view(scratch, 1:m))
end

function mergesort_pointer!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    GC.@preserve v scratch mergesort!(PointerView(v), PointerView(scratch))
    v
end

function mergesort_lohi!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(HiLoView(v), HiLoView(scratch))
    v
end

function mergesort_view!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(ViewView(v), ViewView(scratch))
    v
end

# -------------------------

function merge!(target::Ptr, source::Ptr, offset::Int, source_len::Int)
    # assert source[1:source_len] does not share memory with target[1:offset+source_len]
    target_i = source_i = 1
    source_in_target_i = 1+offset
    hi = source_len + offset
    while target_i < source_in_target_i <= hi
        if unsafe_load(target, source_in_target_i) < unsafe_load(source, source_i)
            x = unsafe_load(target, source_in_target_i)
            source_in_target_i += 1
        else
            x = unsafe_load(source, source_i)
            source_i += 1
        end
        unsafe_store!(target, x, target_i)
        target_i += 1
    end
    while target_i < source_in_target_i
        x = unsafe_load(source, source_i)
        unsafe_store!(target, x, target_i)
        source_i += 1
        target_i += 1
    end
end

function mergesort_to!(target::Ptr, source::Ptr, len::Int)
    len > 0 || return
    if len == 1
        unsafe_store!(target, unsafe_load(source, 1), 1)
        return
    end
    m = len >> 1
    mergesort_to!(target+m*sizeof(target), source+m*sizeof(source), len-m)
    mergesort!(source, m, target)
    merge!(target, source, m, len-m)
end

function mergesort!(source::Ptr, len::Int, scratch::Ptr)
    len > 1 || return
    m = len >> 1
    mergesort!(source+m*sizeof(source), len-m, scratch)
    mergesort_to!(scratch, source, m)
    merge!(source, scratch, m, len-m)
end

function mergesort_pointer_manual!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    GC.@preserve v scratch mergesort!(pointer(v), length(v), pointer(scratch))
    v
end

# ---------------------

using Random
function test(f, n=500)
    for i in 1:n
        for _ in 1:n ÷ i
            x = randperm(i)
            sort(x) == f(copy(x)) || return x
        end
    end
end

function test(n::Int=500)
    for f in (mergesort_pointer_manual!, mergesort_pointer!, mergesort_lohi!, mergesort_view!)
        x = test(f, n)
        x === nothing || return f, x
    end
end

# ---------------------

using BenchmarkTools
@btime mergesort_pointer_manual!(x) setup=(x=rand(100)) evals=1;
# 2.250 μs (1 allocation: 496 bytes)
@btime mergesort_pointer!(x) setup=(x=rand(100)) evals=1;
# 2.583 μs (1 allocation: 496 bytes)
@btime mergesort_lohi!(x) setup=(x=rand(100)) evals=1;
# 3.250 μs (1 allocation: 496 bytes)
@btime mergesort_view!(x) setup=(x=rand(100)) evals=1;
# 3.333 μs (1 allocation: 496 bytes)

I guess with zero-overhead you mean something non-allocating?

Some time ago I also played around with the unsafe_* methods without success. The best I could come up with was unsafe_wrap, because it can give you back a full fledged Vector (but I think without strides, but that wasn’t needed for my application) and so saves writing the boiler plate.
However, sizeof(Vector{Float64}) = 24 whereas sizeof(Ptr{Float64}) + sizeof(Int) = 16, so your PointerView struct would save 8 bytes (IDK if the struct’s size should be added here, does it even have one?).

Another bonus of the wrapped Vector is that you don’t risk segfaults, provided you GC.preserved the underlying data.

In the end I gave up, because I realized this was not the bottleneck for my application. But perhaps this can be improved on once the array stuff will be moved from C into julia.

No. By zero overhead I mean no extra cycles at runtime.

Is there something wrong with @view?

3 Likes

Yes. It is slower than all the versions in the OP. About 1.6x slower than manual pointer handling.

Implementation & benchmark
function merge!(target::AbstractArray, source::AbstractArray)
    target_i = source_i = 1
    source_in_target_i = length(source)+1
    hi = length(target)
    while target_i < source_in_target_i <= hi
        if target[source_in_target_i] < source[source_i]
            x = target[source_in_target_i]
            source_in_target_i += 1
        else
            x = source[source_i]
            source_i += 1
        end
        target[target_i] = x
        target_i += 1
    end
    while target_i < source_in_target_i
        x = source[source_i]
        target[target_i] = x
        source_i += 1
        target_i += 1
    end
end

function mergesort_to!(target::AbstractArray, source::AbstractArray)
    len = length(source)
    len > 0 || return
    if len == 1
        target[1] = source[1]
        return
    end
    m = len >> 1
    mergesort_to!(@view(target[m+1:len]), @view(source[m+1:len]))
    mergesort!(@view(source[1:m]), @view(target[1:m]))
    merge!(target, @view(source[1:m]))
end

function mergesort!(data::AbstractArray, scratch::AbstractArray)
    len = length(data)
    len > 1 || return
    m = len >> 1
    mergesort!(@view(data[m+1:len]), scratch)
    mergesort_to!(@view(scratch[1:m]), @view(data[1:m]))
    merge!(data, @view(scratch[1:m]))
end

function mergesort_at_view!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(v, scratch)
    v
end

using BenchmarkTools
@btime mergesort_at_view!(x) setup=(x=rand(100)) evals=1;
# 3.666 μs (1 allocation: 496 bytes)

Well yes, you’re not using @inbounds, so all of those accesses (by default) do bounds checking. 1.6x is IIRC about what the overhead for that is like. That’s what a higher level language than C does :person_shrugging:

1 Like

Adding @inbounds speeds this up by 8%, so that’s not it.

Implementation & benchmark
function merge!(target::AbstractArray, source::AbstractArray)
    @inbounds begin
        target_i = source_i = 1
        source_in_target_i = length(source)+1
        hi = length(target)
        while target_i < source_in_target_i <= hi
            if target[source_in_target_i] < source[source_i]
                x = target[source_in_target_i]
                source_in_target_i += 1
            else
                x = source[source_i]
                source_i += 1
            end
            target[target_i] = x
            target_i += 1
        end
        while target_i < source_in_target_i
            x = source[source_i]
            target[target_i] = x
            source_i += 1
            target_i += 1
        end
    end
end

function mergesort_to!(target::AbstractArray, source::AbstractArray)
    @inbounds begin
        len = length(source)
        len > 0 || return
        if len == 1
            target[1] = source[1]
            return
        end
        m = len >> 1
        mergesort_to!(@view(target[m+1:len]), @view(source[m+1:len]))
        mergesort!(@view(source[1:m]), @view(target[1:m]))
        merge!(target, @view(source[1:m]))
    end
end

function mergesort!(data::AbstractArray, scratch::AbstractArray)
    @inbounds begin
        len = length(data)
        len > 1 || return
        m = len >> 1
        mergesort!(@view(data[m+1:len]), scratch)
        mergesort_to!(@view(scratch[1:m]), @view(data[1:m]))
        merge!(data, @view(scratch[1:m]))
    end
end

function mergesort_at_view!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    @inbounds mergesort!(v, scratch)
    v
end

@btime mergesort_at_view!(x) setup=(x=rand(100)) evals=1;
# 3.375 μs (1 allocation: 496 bytes)

I think you’re looking for something like UnsafeArrays.jl? I’ve used this package many times, it’s really nice.

6 Likes

Thanks, that works!

1 Like

How big was the performance gain you got from UnsafeArrays.jl? Just curious, because the Readme states

With Julia v1.5 and higher, using UnsafeArrays should not be necessary and is not likely to result in significant performance gains.

4 Likes

As the readme says, views no longer allocate in Julia. It doesn’t mean they are already zero-cost.
I guess the speed difference quantified in the first post doesn’t seem a

significant performance gain

for some (:

More generally, UnsafeArrays.jl are by no means obsolete. As far as I know, it’s the only no-allocation way to transform a C array returned from a foreign function into a Julia array.

Which is exactly my point. It might be worth mentioning somewhere in that Readme that UnsafeArrays.jl doesn’t only reduce allocations (which seems to be obsolete with current versions of Julia). To me, the quote I posted sounds like the package is not really useful anymore, but from this discussion here I understand that it is indeed useful and might help other people too.

So what I am curious about is if, in this particular example, the use of UnsafeArrays.jl is giving timings closer to the version mergesort_pointer_manual! or more on the end of mergesort_view! in the original post. Or something in between. I haven’t looked into the code of UnsafeArrays.jl enough to tell if it is completely equivalent to the approach mentioned in the first post.

UnsafeArrays.jl provided equivalent performance to my custom AbstractArray wrapper of a pointer. It was still a bit shy of manual pointer management, likely because I pass around three abstract arrays and know at design time that the lengths of two of them add up to the length of the third so I only pass two lengths when using manual pointer management.

A recap of the performance of various methods I tried in this thread is

@btime mergesort_pointer_manual!(x) setup=(x=rand(100)) evals=1;
# 2.458 μs (1 allocation: 496 bytes)
@btime mergesort_pointer!(x) setup=(x=rand(100)) evals=1;
# 2.875 μs (1 allocation: 496 bytes)
@btime mergesort_lohi!(x) setup=(x=rand(100)) evals=1;
# 3.416 μs (1 allocation: 496 bytes)
@btime mergesort_view!(x) setup=(x=rand(100)) evals=1;
# 3.542 μs (1 allocation: 496 bytes)
@btime mergesort_unsafe_arrays!(x) setup=(x=rand(100)) evals=1;
# 2.875 μs (1 allocation: 496 bytes)
Implementation details
abstract type ViewLike{T} <: AbstractVector{T} end

struct PointerView{T} <: ViewLike{T}
    ptr::Ptr{T}
    len::Int
end
Base.view(v::PointerView, range::UnitRange) = PointerView(v.ptr+(first(range)-1)*sizeof(v.ptr), length(range))
Base.getindex(v::PointerView, i::Int) = unsafe_load(v.ptr, i)
Base.setindex!(v::PointerView, x, i::Int) = unsafe_store!(v.ptr, x, i)
Base.size(v::PointerView) = (v.len,)
PointerView(v::Vector) = PointerView(pointer(v), length(v))

struct LoHiView{T} <: ViewLike{T}
    data::Vector{T}
    lo::Int
    hi::Int
end
Base.view(v::LoHiView, range::UnitRange) = LoHiView(v.data, v.lo+first(range)-1, v.lo+last(range)-1)
Base.getindex(v::LoHiView, i::Int) = @inbounds v.data[v.lo+i-1]
Base.setindex!(v::LoHiView, x, i::Int) = @inbounds v.data[v.lo+i-1] = x
Base.size(v::LoHiView) = (v.hi-v.lo+1,)
HiLoView(v::AbstractVector) = LoHiView(v, firstindex(v), lastindex(v))

struct ViewView{T} <: ViewLike{T}
    data::SubArray{T, 1, Vector{T}, Tuple{UnitRange{Int64}}, true}
end
Base.view(v::ViewView, range::UnitRange) = @inbounds ViewView(view(v.data, range))
Base.getindex(v::ViewView, i::Int) = @inbounds v.data[i]
Base.setindex!(v::ViewView, x, i::Int) = @inbounds v.data[i] = x
Base.size(v::ViewView) = size(v.data)
ViewView(v::AbstractVector) = ViewView(view(v, 1:length(v)))

function merge!(target::AbstractVector, source::AbstractVector)
    @inbounds begin
        target_i = source_i = 1
        source_in_target_i = length(source)+1
        hi = length(target)
        while target_i < source_in_target_i <= hi
            if target[source_in_target_i] < source[source_i]
                x = target[source_in_target_i]
                source_in_target_i += 1
            else
                x = source[source_i]
                source_i += 1
            end
            target[target_i] = x
            target_i += 1
        end
        while target_i < source_in_target_i
            x = source[source_i]
            target[target_i] = x
            source_i += 1
            target_i += 1
        end
    end
end

function mergesort_to!(target::AbstractVector, source::AbstractVector)
    @inbounds begin
        len = length(source)
        len > 0 || return
        if len == 1
            target[1] = source[1]
            return
        end
        m = len >> 1
        mergesort_to!(view(target, m+1:len), view(source, m+1:len))
        mergesort!(view(source, 1:m), view(target, 1:m))
        merge!(target, view(source, 1:m))
    end
end

function mergesort!(data::AbstractVector, scratch::AbstractVector)
    @inbounds begin
        len = length(data)
        len > 1 || return
        m = len >> 1
        mergesort!(view(data, m+1:len), scratch)
        mergesort_to!(view(scratch, 1:m), view(data, 1:m))
        merge!(data, view(scratch, 1:m))
    end
end

function mergesort_pointer!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    GC.@preserve v scratch mergesort!(PointerView(v), PointerView(scratch))
    v
end

function mergesort_lohi!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(HiLoView(v), HiLoView(scratch))
    v
end

function mergesort_view!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(ViewView(v), ViewView(scratch))
    v
end

using UnsafeArrays
function mergesort_unsafe_arrays!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    @uviews v scratch begin
        mergesort!(v, scratch)
    end
    v
end


# -------------------------

function merge!(target::Ptr, source::Ptr, offset::Int, source_len::Int)
    # assert source[1:source_len] does not share memory with target[1:offset+source_len]
    target_i = source_i = 1
    source_in_target_i = 1+offset
    hi = source_len + offset
    while target_i < source_in_target_i <= hi
        if unsafe_load(target, source_in_target_i) < unsafe_load(source, source_i)
            x = unsafe_load(target, source_in_target_i)
            source_in_target_i += 1
        else
            x = unsafe_load(source, source_i)
            source_i += 1
        end
        unsafe_store!(target, x, target_i)
        target_i += 1
    end
    while target_i < source_in_target_i
        x = unsafe_load(source, source_i)
        unsafe_store!(target, x, target_i)
        source_i += 1
        target_i += 1
    end
end

function mergesort_to!(target::Ptr, source::Ptr, len::Int)
    len > 0 || return
    if len == 1
        unsafe_store!(target, unsafe_load(source, 1), 1)
        return
    end
    m = len >> 1
    mergesort_to!(target+m*sizeof(target), source+m*sizeof(source), len-m)
    mergesort!(source, m, target)
    merge!(target, source, m, len-m)
end

function mergesort!(source::Ptr, len::Int, scratch::Ptr)
    len > 1 || return
    m = len >> 1
    mergesort!(source+m*sizeof(source), len-m, scratch)
    mergesort_to!(scratch, source, m)
    merge!(source, scratch, m, len-m)
end

function mergesort_pointer_manual!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    GC.@preserve v scratch mergesort!(pointer(v), length(v), pointer(scratch))
    v
end

# ---------------------

using Random
function test(f, n=500)
    for i in 1:n
        for _ in 1:n ÷ i
            x = randperm(i)
            sort(x) == f(copy(x)) || return x
        end
    end
end

function test(n::Int=500)
    for f in (mergesort_pointer_manual!, mergesort_pointer!, mergesort_lohi!, mergesort_view!, mergesort_unsafe_arrays!)
        x = test(f, n)
        x === nothing || return f, x
    end
end

# ---------------------

using BenchmarkTools
@btime mergesort_pointer_manual!(x) setup=(x=rand(100)) evals=1;
# 2.458 μs (1 allocation: 496 bytes)
@btime mergesort_pointer!(x) setup=(x=rand(100)) evals=1;
# 2.875 μs (1 allocation: 496 bytes)
@btime mergesort_lohi!(x) setup=(x=rand(100)) evals=1;
# 3.416 μs (1 allocation: 496 bytes)
@btime mergesort_view!(x) setup=(x=rand(100)) evals=1;
# 3.542 μs (1 allocation: 496 bytes)
@btime mergesort_unsafe_arrays!(x) setup=(x=rand(100)) evals=1;
# 2.875 μs (1 allocation: 496 bytes)

versioninfo()
# Julia Version 1.10.0-beta1
# Commit 6616549950e (2023-07-25 17:43 UTC)
# Platform Info:
#   OS: macOS (arm64-apple-darwin22.4.0)
#   CPU: 8 × Apple M2
#   WORD_SIZE: 64
#   LIBM: libopenlibm
#   LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
#   Threads: 1 on 4 virtual cores
# Environment:
#   JULIA_EDITOR = code

They are the exact same approach and memory layout. UnsafeArrays.jl is more general, but within the domain used in my example there is no runtime difference,

8 Likes

Very interesting, thanks!

1 Like

Beginner question here. Would it make a difference (interm of speed or else) to use Ref instead of pointer in the above code?
Julia’s documentation says “Calling Ref(array[,index]) is generally preferable to this function (i.e. pointer) as it guarantees validity”. I am not sure what validity means here.

1 Like

With UnsafeArrays and I presume this more limited view-like approach, you need to make sure the parent array is not garbage-collected until after you’re done, and the pointers alone won’t stop that. The risk of dangling pointers is what makes it unsafe.

What I find interesting about UnsafeArrays having a performance edge over views is that UnsafeArrays doesn’t actually replace Julia’s view and SubArrays, it more directly wraps the parent array. So it seems to me the performance difference should be rooted in the differences between UnsafeArrays/pointer-work and the parent array, which is Array in the benchmarks here.

1 Like

Sadly, yes. I believe this would impact performance. Refs are safer than pointers because they prevent the thing they are pointing to from being garbage collected. The way they do this is by maintaining both a pointer to the parent array and an offset. With pointers or UnsafeArrays, on the other hand, we take a pointer to the Array and add an offset to get a new pointer. This new pointer no longer points to the parent array, so the garbage collector can’t usefully follow it (unsafe) but unless the compiler is really clever (which in this case it is not) passing around just a pointer is faster than passing around a pointer and an offset.

2 Likes