How to get a zero-overhead view?

Lilith · August 12, 2023, 1:26pm

In C, I pass around pointers and lengths, constructing views with pointer arithmetic. This is annoying to debug, but quite performant. Is it possible to get equivalent performance with Julia views? I am okay with segfaulting on out of bounds reads or writes.

So far, I’ve tried three view-like approaches, and have not been able to match the performance of manual pointer handling.

My implementations and benchmarks

abstract type ViewLike{T} <: AbstractVector{T} end

struct PointerView{T} <: ViewLike{T}
    ptr::Ptr{T}
    len::Int
end
Base.view(v::PointerView, range::UnitRange) = PointerView(v.ptr+(first(range)-1)*sizeof(v.ptr), length(range))
Base.getindex(v::PointerView, i::Int) = unsafe_load(v.ptr, i)
Base.setindex!(v::PointerView, x, i::Int) = unsafe_store!(v.ptr, x, i)
Base.size(v::PointerView) = (v.len,)
PointerView(v::Vector) = PointerView(pointer(v), length(v))

struct LoHiView{T} <: ViewLike{T}
    data::Vector{T}
    lo::Int
    hi::Int
end
Base.view(v::LoHiView, range::UnitRange) = LoHiView(v.data, v.lo+first(range)-1, v.lo+last(range)-1)
Base.getindex(v::LoHiView, i::Int) = @inbounds v.data[v.lo+i-1]
Base.setindex!(v::LoHiView, x, i::Int) = @inbounds v.data[v.lo+i-1] = x
Base.size(v::LoHiView) = (v.hi-v.lo+1,)
HiLoView(v::AbstractVector) = LoHiView(v, firstindex(v), lastindex(v))

struct ViewView{T} <: ViewLike{T}
    data::SubArray{T, 1, Vector{T}, Tuple{UnitRange{Int64}}, true}
end
Base.view(v::ViewView, range::UnitRange) = @inbounds ViewView(view(v.data, range))
Base.getindex(v::ViewView, i::Int) = @inbounds v.data[i]
Base.setindex!(v::ViewView, x, i::Int) = @inbounds v.data[i] = x
Base.size(v::ViewView) = size(v.data)
ViewView(v::AbstractVector) = ViewView(view(v, 1:length(v)))

function merge!(target::ViewLike, source::ViewLike)
    target_i = source_i = 1
    source_in_target_i = length(source)+1
    hi = length(target)
    while target_i < source_in_target_i <= hi
        if target[source_in_target_i] < source[source_i]
            x = target[source_in_target_i]
            source_in_target_i += 1
        else
            x = source[source_i]
            source_i += 1
        end
        target[target_i] = x
        target_i += 1
    end
    while target_i < source_in_target_i
        x = source[source_i]
        target[target_i] = x
        source_i += 1
        target_i += 1
    end
end

function mergesort_to!(target::ViewLike, source::ViewLike)
    len = length(source)
    len > 0 || return
    if len == 1
        target[1] = source[1]
        return
    end
    m = len >> 1
    mergesort_to!(view(target, m+1:len), view(source, m+1:len))
    mergesort!(view(source, 1:m), view(target, 1:m))
    merge!(target, view(source, 1:m))
end

function mergesort!(data::ViewLike, scratch::ViewLike)
    len = length(data)
    len > 1 || return
    m = len >> 1
    mergesort!(view(data, m+1:len), scratch)
    mergesort_to!(view(scratch, 1:m), view(data, 1:m))
    merge!(data, view(scratch, 1:m))
end

function mergesort_pointer!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    GC.@preserve v scratch mergesort!(PointerView(v), PointerView(scratch))
    v
end

function mergesort_lohi!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(HiLoView(v), HiLoView(scratch))
    v
end

function mergesort_view!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(ViewView(v), ViewView(scratch))
    v
end

# -------------------------

function merge!(target::Ptr, source::Ptr, offset::Int, source_len::Int)
    # assert source[1:source_len] does not share memory with target[1:offset+source_len]
    target_i = source_i = 1
    source_in_target_i = 1+offset
    hi = source_len + offset
    while target_i < source_in_target_i <= hi
        if unsafe_load(target, source_in_target_i) < unsafe_load(source, source_i)
            x = unsafe_load(target, source_in_target_i)
            source_in_target_i += 1
        else
            x = unsafe_load(source, source_i)
            source_i += 1
        end
        unsafe_store!(target, x, target_i)
        target_i += 1
    end
    while target_i < source_in_target_i
        x = unsafe_load(source, source_i)
        unsafe_store!(target, x, target_i)
        source_i += 1
        target_i += 1
    end
end

function mergesort_to!(target::Ptr, source::Ptr, len::Int)
    len > 0 || return
    if len == 1
        unsafe_store!(target, unsafe_load(source, 1), 1)
        return
    end
    m = len >> 1
    mergesort_to!(target+m*sizeof(target), source+m*sizeof(source), len-m)
    mergesort!(source, m, target)
    merge!(target, source, m, len-m)
end

function mergesort!(source::Ptr, len::Int, scratch::Ptr)
    len > 1 || return
    m = len >> 1
    mergesort!(source+m*sizeof(source), len-m, scratch)
    mergesort_to!(scratch, source, m)
    merge!(source, scratch, m, len-m)
end

function mergesort_pointer_manual!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    GC.@preserve v scratch mergesort!(pointer(v), length(v), pointer(scratch))
    v
end

# ---------------------

using Random
function test(f, n=500)
    for i in 1:n
        for _ in 1:n ÷ i
            x = randperm(i)
            sort(x) == f(copy(x)) || return x
        end
    end
end

function test(n::Int=500)
    for f in (mergesort_pointer_manual!, mergesort_pointer!, mergesort_lohi!, mergesort_view!)
        x = test(f, n)
        x === nothing || return f, x
    end
end

# ---------------------

using BenchmarkTools
@btime mergesort_pointer_manual!(x) setup=(x=rand(100)) evals=1;
# 2.250 μs (1 allocation: 496 bytes)
@btime mergesort_pointer!(x) setup=(x=rand(100)) evals=1;
# 2.583 μs (1 allocation: 496 bytes)
@btime mergesort_lohi!(x) setup=(x=rand(100)) evals=1;
# 3.250 μs (1 allocation: 496 bytes)
@btime mergesort_view!(x) setup=(x=rand(100)) evals=1;
# 3.333 μs (1 allocation: 496 bytes)

fatteneder · August 12, 2023, 4:22pm

I guess with zero-overhead you mean something non-allocating?

Some time ago I also played around with the unsafe_* methods without success. The best I could come up with was unsafe_wrap, because it can give you back a full fledged Vector (but I think without strides, but that wasn’t needed for my application) and so saves writing the boiler plate.
However, sizeof(Vector{Float64}) = 24 whereas sizeof(Ptr{Float64}) + sizeof(Int) = 16, so your PointerView struct would save 8 bytes (IDK if the struct’s size should be added here, does it even have one?).

Another bonus of the wrapped Vector is that you don’t risk segfaults, provided you GC.preserved the underlying data.

In the end I gave up, because I realized this was not the bottleneck for my application. But perhaps this can be improved on once the array stuff will be moved from C into julia.

Lilith · August 12, 2023, 4:41pm

No. By zero overhead I mean no extra cycles at runtime.

Sukera · August 12, 2023, 4:49pm

Is there something wrong with @view?

Lilith · August 12, 2023, 5:04pm

Yes. It is slower than all the versions in the OP. About 1.6x slower than manual pointer handling.

Implementation & benchmark

function merge!(target::AbstractArray, source::AbstractArray)
    target_i = source_i = 1
    source_in_target_i = length(source)+1
    hi = length(target)
    while target_i < source_in_target_i <= hi
        if target[source_in_target_i] < source[source_i]
            x = target[source_in_target_i]
            source_in_target_i += 1
        else
            x = source[source_i]
            source_i += 1
        end
        target[target_i] = x
        target_i += 1
    end
    while target_i < source_in_target_i
        x = source[source_i]
        target[target_i] = x
        source_i += 1
        target_i += 1
    end
end

function mergesort_to!(target::AbstractArray, source::AbstractArray)
    len = length(source)
    len > 0 || return
    if len == 1
        target[1] = source[1]
        return
    end
    m = len >> 1
    mergesort_to!(@view(target[m+1:len]), @view(source[m+1:len]))
    mergesort!(@view(source[1:m]), @view(target[1:m]))
    merge!(target, @view(source[1:m]))
end

function mergesort!(data::AbstractArray, scratch::AbstractArray)
    len = length(data)
    len > 1 || return
    m = len >> 1
    mergesort!(@view(data[m+1:len]), scratch)
    mergesort_to!(@view(scratch[1:m]), @view(data[1:m]))
    merge!(data, @view(scratch[1:m]))
end

function mergesort_at_view!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(v, scratch)
    v
end

using BenchmarkTools
@btime mergesort_at_view!(x) setup=(x=rand(100)) evals=1;
# 3.666 μs (1 allocation: 496 bytes)

Sukera · August 12, 2023, 5:08pm

Well yes, you’re not using @inbounds, so all of those accesses (by default) do bounds checking. 1.6x is IIRC about what the overhead for that is like. That’s what a higher level language than C does

Lilith · August 12, 2023, 6:06pm

Adding @inbounds speeds this up by 8%, so that’s not it.

Implementation & benchmark

function merge!(target::AbstractArray, source::AbstractArray)
    @inbounds begin
        target_i = source_i = 1
        source_in_target_i = length(source)+1
        hi = length(target)
        while target_i < source_in_target_i <= hi
            if target[source_in_target_i] < source[source_i]
                x = target[source_in_target_i]
                source_in_target_i += 1
            else
                x = source[source_i]
                source_i += 1
            end
            target[target_i] = x
            target_i += 1
        end
        while target_i < source_in_target_i
            x = source[source_i]
            target[target_i] = x
            source_i += 1
            target_i += 1
        end
    end
end

function mergesort_to!(target::AbstractArray, source::AbstractArray)
    @inbounds begin
        len = length(source)
        len > 0 || return
        if len == 1
            target[1] = source[1]
            return
        end
        m = len >> 1
        mergesort_to!(@view(target[m+1:len]), @view(source[m+1:len]))
        mergesort!(@view(source[1:m]), @view(target[1:m]))
        merge!(target, @view(source[1:m]))
    end
end

function mergesort!(data::AbstractArray, scratch::AbstractArray)
    @inbounds begin
        len = length(data)
        len > 1 || return
        m = len >> 1
        mergesort!(@view(data[m+1:len]), scratch)
        mergesort_to!(@view(scratch[1:m]), @view(data[1:m]))
        merge!(data, @view(scratch[1:m]))
    end
end

function mergesort_at_view!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    @inbounds mergesort!(v, scratch)
    v
end

@btime mergesort_at_view!(x) setup=(x=rand(100)) evals=1;
# 3.375 μs (1 allocation: 496 bytes)

jondeuce · August 12, 2023, 6:41pm

I think you’re looking for something like UnsafeArrays.jl? I’ve used this package many times, it’s really nice.

Lilith · August 12, 2023, 10:24pm

Thanks, that works!

Sevi · August 13, 2023, 8:53am

How big was the performance gain you got from UnsafeArrays.jl? Just curious, because the Readme states

With Julia v1.5 and higher, using UnsafeArrays should not be necessary and is not likely to result in significant performance gains.

aplavin · August 13, 2023, 10:59am

As the readme says, views no longer allocate in Julia. It doesn’t mean they are already zero-cost.
I guess the speed difference quantified in the first post doesn’t seem a

significant performance gain

for some (:

More generally, UnsafeArrays.jl are by no means obsolete. As far as I know, it’s the only no-allocation way to transform a C array returned from a foreign function into a Julia array.

Sevi · August 13, 2023, 11:17am

Which is exactly my point. It might be worth mentioning somewhere in that Readme that UnsafeArrays.jl doesn’t only reduce allocations (which seems to be obsolete with current versions of Julia). To me, the quote I posted sounds like the package is not really useful anymore, but from this discussion here I understand that it is indeed useful and might help other people too.

So what I am curious about is if, in this particular example, the use of UnsafeArrays.jl is giving timings closer to the version mergesort_pointer_manual! or more on the end of mergesort_view! in the original post. Or something in between. I haven’t looked into the code of UnsafeArrays.jl enough to tell if it is completely equivalent to the approach mentioned in the first post.

Lilith · August 13, 2023, 2:33pm

UnsafeArrays.jl provided equivalent performance to my custom AbstractArray wrapper of a pointer. It was still a bit shy of manual pointer management, likely because I pass around three abstract arrays and know at design time that the lengths of two of them add up to the length of the third so I only pass two lengths when using manual pointer management.

A recap of the performance of various methods I tried in this thread is

@btime mergesort_pointer_manual!(x) setup=(x=rand(100)) evals=1;
# 2.458 μs (1 allocation: 496 bytes)
@btime mergesort_pointer!(x) setup=(x=rand(100)) evals=1;
# 2.875 μs (1 allocation: 496 bytes)
@btime mergesort_lohi!(x) setup=(x=rand(100)) evals=1;
# 3.416 μs (1 allocation: 496 bytes)
@btime mergesort_view!(x) setup=(x=rand(100)) evals=1;
# 3.542 μs (1 allocation: 496 bytes)
@btime mergesort_unsafe_arrays!(x) setup=(x=rand(100)) evals=1;
# 2.875 μs (1 allocation: 496 bytes)

Implementation details

abstract type ViewLike{T} <: AbstractVector{T} end

struct PointerView{T} <: ViewLike{T}
    ptr::Ptr{T}
    len::Int
end
Base.view(v::PointerView, range::UnitRange) = PointerView(v.ptr+(first(range)-1)*sizeof(v.ptr), length(range))
Base.getindex(v::PointerView, i::Int) = unsafe_load(v.ptr, i)
Base.setindex!(v::PointerView, x, i::Int) = unsafe_store!(v.ptr, x, i)
Base.size(v::PointerView) = (v.len,)
PointerView(v::Vector) = PointerView(pointer(v), length(v))

struct LoHiView{T} <: ViewLike{T}
    data::Vector{T}
    lo::Int
    hi::Int
end
Base.view(v::LoHiView, range::UnitRange) = LoHiView(v.data, v.lo+first(range)-1, v.lo+last(range)-1)
Base.getindex(v::LoHiView, i::Int) = @inbounds v.data[v.lo+i-1]
Base.setindex!(v::LoHiView, x, i::Int) = @inbounds v.data[v.lo+i-1] = x
Base.size(v::LoHiView) = (v.hi-v.lo+1,)
HiLoView(v::AbstractVector) = LoHiView(v, firstindex(v), lastindex(v))

struct ViewView{T} <: ViewLike{T}
    data::SubArray{T, 1, Vector{T}, Tuple{UnitRange{Int64}}, true}
end
Base.view(v::ViewView, range::UnitRange) = @inbounds ViewView(view(v.data, range))
Base.getindex(v::ViewView, i::Int) = @inbounds v.data[i]
Base.setindex!(v::ViewView, x, i::Int) = @inbounds v.data[i] = x
Base.size(v::ViewView) = size(v.data)
ViewView(v::AbstractVector) = ViewView(view(v, 1:length(v)))

function merge!(target::AbstractVector, source::AbstractVector)
    @inbounds begin
        target_i = source_i = 1
        source_in_target_i = length(source)+1
        hi = length(target)
        while target_i < source_in_target_i <= hi
            if target[source_in_target_i] < source[source_i]
                x = target[source_in_target_i]
                source_in_target_i += 1
            else
                x = source[source_i]
                source_i += 1
            end
            target[target_i] = x
            target_i += 1
        end
        while target_i < source_in_target_i
            x = source[source_i]
            target[target_i] = x
            source_i += 1
            target_i += 1
        end
    end
end

function mergesort_to!(target::AbstractVector, source::AbstractVector)
    @inbounds begin
        len = length(source)
        len > 0 || return
        if len == 1
            target[1] = source[1]
            return
        end
        m = len >> 1
        mergesort_to!(view(target, m+1:len), view(source, m+1:len))
        mergesort!(view(source, 1:m), view(target, 1:m))
        merge!(target, view(source, 1:m))
    end
end

function mergesort!(data::AbstractVector, scratch::AbstractVector)
    @inbounds begin
        len = length(data)
        len > 1 || return
        m = len >> 1
        mergesort!(view(data, m+1:len), scratch)
        mergesort_to!(view(scratch, 1:m), view(data, 1:m))
        merge!(data, view(scratch, 1:m))
    end
end

function mergesort_pointer!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    GC.@preserve v scratch mergesort!(PointerView(v), PointerView(scratch))
    v
end

function mergesort_lohi!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(HiLoView(v), HiLoView(scratch))
    v
end

function mergesort_view!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    mergesort!(ViewView(v), ViewView(scratch))
    v
end

using UnsafeArrays
function mergesort_unsafe_arrays!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    @uviews v scratch begin
        mergesort!(v, scratch)
    end
    v
end


# -------------------------

function merge!(target::Ptr, source::Ptr, offset::Int, source_len::Int)
    # assert source[1:source_len] does not share memory with target[1:offset+source_len]
    target_i = source_i = 1
    source_in_target_i = 1+offset
    hi = source_len + offset
    while target_i < source_in_target_i <= hi
        if unsafe_load(target, source_in_target_i) < unsafe_load(source, source_i)
            x = unsafe_load(target, source_in_target_i)
            source_in_target_i += 1
        else
            x = unsafe_load(source, source_i)
            source_i += 1
        end
        unsafe_store!(target, x, target_i)
        target_i += 1
    end
    while target_i < source_in_target_i
        x = unsafe_load(source, source_i)
        unsafe_store!(target, x, target_i)
        source_i += 1
        target_i += 1
    end
end

function mergesort_to!(target::Ptr, source::Ptr, len::Int)
    len > 0 || return
    if len == 1
        unsafe_store!(target, unsafe_load(source, 1), 1)
        return
    end
    m = len >> 1
    mergesort_to!(target+m*sizeof(target), source+m*sizeof(source), len-m)
    mergesort!(source, m, target)
    merge!(target, source, m, len-m)
end

function mergesort!(source::Ptr, len::Int, scratch::Ptr)
    len > 1 || return
    m = len >> 1
    mergesort!(source+m*sizeof(source), len-m, scratch)
    mergesort_to!(scratch, source, m)
    merge!(source, scratch, m, len-m)
end

function mergesort_pointer_manual!(v::AbstractVector, scratch::AbstractVector=similar(v, length(v)>>1))
    GC.@preserve v scratch mergesort!(pointer(v), length(v), pointer(scratch))
    v
end

# ---------------------

using Random
function test(f, n=500)
    for i in 1:n
        for _ in 1:n ÷ i
            x = randperm(i)
            sort(x) == f(copy(x)) || return x
        end
    end
end

function test(n::Int=500)
    for f in (mergesort_pointer_manual!, mergesort_pointer!, mergesort_lohi!, mergesort_view!, mergesort_unsafe_arrays!)
        x = test(f, n)
        x === nothing || return f, x
    end
end

# ---------------------

using BenchmarkTools
@btime mergesort_pointer_manual!(x) setup=(x=rand(100)) evals=1;
# 2.458 μs (1 allocation: 496 bytes)
@btime mergesort_pointer!(x) setup=(x=rand(100)) evals=1;
# 2.875 μs (1 allocation: 496 bytes)
@btime mergesort_lohi!(x) setup=(x=rand(100)) evals=1;
# 3.416 μs (1 allocation: 496 bytes)
@btime mergesort_view!(x) setup=(x=rand(100)) evals=1;
# 3.542 μs (1 allocation: 496 bytes)
@btime mergesort_unsafe_arrays!(x) setup=(x=rand(100)) evals=1;
# 2.875 μs (1 allocation: 496 bytes)

versioninfo()
# Julia Version 1.10.0-beta1
# Commit 6616549950e (2023-07-25 17:43 UTC)
# Platform Info:
#   OS: macOS (arm64-apple-darwin22.4.0)
#   CPU: 8 × Apple M2
#   WORD_SIZE: 64
#   LIBM: libopenlibm
#   LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
#   Threads: 1 on 4 virtual cores
# Environment:
#   JULIA_EDITOR = code

They are the exact same approach and memory layout. UnsafeArrays.jl is more general, but within the domain used in my example there is no runtime difference,

Sevi · August 13, 2023, 3:36pm

Very interesting, thanks!

gitboy16 · August 13, 2023, 4:54pm

Beginner question here. Would it make a difference (interm of speed or else) to use Ref instead of pointer in the above code?
Julia’s documentation says “Calling Ref(array[,index]) is generally preferable to this function (i.e. pointer) as it guarantees validity”. I am not sure what validity means here.

Benny · August 13, 2023, 10:13pm

With UnsafeArrays and I presume this more limited view-like approach, you need to make sure the parent array is not garbage-collected until after you’re done, and the pointers alone won’t stop that. The risk of dangling pointers is what makes it unsafe.

What I find interesting about UnsafeArrays having a performance edge over views is that UnsafeArrays doesn’t actually replace Julia’s view and SubArrays, it more directly wraps the parent array. So it seems to me the performance difference should be rooted in the differences between UnsafeArrays/pointer-work and the parent array, which is Array in the benchmarks here.

Lilith · August 14, 2023, 3:42pm

Sadly, yes. I believe this would impact performance. Refs are safer than pointers because they prevent the thing they are pointing to from being garbage collected. The way they do this is by maintaining both a pointer to the parent array and an offset. With pointers or UnsafeArrays, on the other hand, we take a pointer to the Array and add an offset to get a new pointer. This new pointer no longer points to the parent array, so the garbage collector can’t usefully follow it (unsafe) but unless the compiler is really clever (which in this case it is not) passing around just a pointer is faster than passing around a pointer and an offset.

Topic		Replies	Views
Passing sub-array by reference / performance Performance	18	4727	April 10, 2019
Is there a way to guarantee 0 allocations when accessing an array? General Usage	18	4306	June 1, 2017
Array views becoming dominant source of memory allocation Performance	31	5169	November 14, 2018
Why does `reinterpret` cause an extra allocation? General Usage	30	4153	February 24, 2018
Unsafe_store! sometimes slower than arrays' setindex!? Performance	15	1431	November 30, 2017

How to get a zero-overhead view?

Related topics