# Column iterator

I’d like to define a plain iterator over the columns of a matrix, in the style of julienned arrays. If I take the state to be

```julia
struct Slice{T} <: AbstractVector{T}
    a::Matrix{T}
    idx::Int
end
```

which of `size`, `length`, `getindex`, `setindex!`, `axes` do I have to define
to obtain a performant broadcast as below?

```julia
a = rand(2, 1000)
x = rand(2)

y = Slice(a, 1) # a[:, 1]
y .= x
```

I’d like to avoid `view`s and leverage that columns are contiguous.

What’s the problem with views? Are they allocating? Are you trying to call mutate-in-place functions?

Yeah, views allocate and have overhead, as they solve a considerably more difficult problem.

To be sure, I benchmarked, and it appears that creating the `view` is not the problem; the cost stems from broadcasting:

```julia
using BenchmarkTools

yy = rand(2, 10000)
y = rand(2)

function f5(yy, x)
    for i in 1:size(yy, 2)
        y = Slice1(yy, i)
        for k in eachindex(x)
            @inbounds y[k] = x[k]
        end
    end
end

function f6(yy, x)
    for i in 1:size(yy, 2)
        y = view(yy, :, i)
        for k in eachindex(x)
            @inbounds y[k] = x[k]
        end
    end
end

@btime f5($yy, $y)
@btime f6($yy, $y)
```

```
25.072 μs (0 allocations: 0 bytes)
27.160 μs (0 allocations: 0 bytes)
```

with

```julia
import Base: size, axes, setindex!

struct Slice1{T} <: AbstractVector{T}
    S::Matrix{T}
    idx::Int
    sz::Base.OneTo{Int}
end

@inline function Slice1(A::Matrix{T}, i::Int) where {T}
    Slice1(A, i, Base.OneTo(size(A, 1)))  # size(A) is a tuple; we need the column length
end

@inline size(S::Slice1) = (length(S.sz),)
@inline axes(S::Slice1) = (S.sz,)

@inline Base.IndexStyle(::Type{<:Slice1}) = IndexLinear()
@inline function setindex!(S::Slice1, x, k)
    @boundscheck checkbounds(S.S, k, S.idx)
    @inbounds S.S[k, S.idx] = x
end
```
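The snippet above only covers writing. For reading (e.g. `y[1]`, `sum(y)`, or displaying the slice), a matching `getindex` is also needed. A sketch, restating the struct so it runs standalone:

```julia
import Base: size, getindex

struct Slice1{T} <: AbstractVector{T}
    S::Matrix{T}
    idx::Int
    sz::Base.OneTo{Int}
end
Slice1(A::Matrix, i::Int) = Slice1(A, i, Base.OneTo(size(A, 1)))

@inline size(S::Slice1) = (length(S.sz),)

# the read half of the indexing interface
@inline function getindex(S::Slice1, k::Int)
    @boundscheck checkbounds(S.S, k, S.idx)
    @inbounds S.S[k, S.idx]
end

a = reshape(collect(1.0:10.0), 2, 5)
y = Slice1(a, 3)
y[1]  # same as a[1, 3]
```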

Compare with

```julia
function f7(yy, x)
    for i in 1:size(yy, 2)
        y = view(yy, :, i)
        @inbounds y .= x
    end
end

@btime f7($yy, $y)
```

```
169.822 μs (10000 allocations: 468.75 KiB)
```

I think broadcast is not strictly the problem: it acts like a function barrier and does not inline, thereby forcing the views to be allocated. You can define the inner loop as a separate function and `@inline`/`@noinline` it to check.
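A sketch of that check (`copycol!` and `f6_inlined` are hypothetical names, not from the thread): factoring the inner loop into a separate function and forcing inlining lets the compiler elide the view entirely.

```julia
# Hypothetical helper: copies x into y element by element.
# Marked @inline so the view below never escapes and can be elided.
@inline function copycol!(y, x)
    for k in eachindex(x)
        @inbounds y[k] = x[k]
    end
    return nothing
end

function f6_inlined(yy, x)
    for i in 1:size(yy, 2)
        copycol!(view(yy, :, i), x)
    end
end

# swap @inline for @noinline above and re-measure with
# @allocated / @btime to see the view allocations reappear
```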

Also, your benchmark is probably skewed by copying just two numbers a thousand times. If these numbers are representative of the application, fine. If not, you will discover that, while the allocations don’t disappear, the performance is almost the same on a 200×1000 example.
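For instance (a sketch restating `f6`/`f7` above under hypothetical names; no timings claimed here, only the allocation counts are compared):

```julia
# like f6 above: manual inner loop over each column view
function copy_cols_loop!(yy, x)
    for i in 1:size(yy, 2)
        y = view(yy, :, i)
        for k in eachindex(x)
            @inbounds y[k] = x[k]
        end
    end
end

# like f7 above: broadcast into each column view
function copy_cols_bcast!(yy, x)
    for i in 1:size(yy, 2)
        @inbounds view(yy, :, i) .= x
    end
end

yy = rand(200, 1000)
x = rand(200)
copy_cols_loop!(yy, x); copy_cols_bcast!(yy, x)  # warm up / compile first
@show @allocated copy_cols_loop!(yy, x)
@show @allocated copy_cols_bcast!(yy, x)
```

With tall columns the copy itself dominates, so the per-column view allocation matters much less than in the 2×N case.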

Cheers!


You are right, the allocation caused by the function barrier is a problem. The benchmark is representative of the application (low-dimensional ODEs). I don’t understand why, though: the view is just an immutable struct with two fields… what has to be allocated there?
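For reference, a `SubArray` actually carries a bit more than two fields; a quick way to see this (sketch):

```julia
yy = rand(2, 10)
v = view(yy, :, 1)
# a SubArray stores the parent array and the index tuple, plus
# precomputed offset/stride fields used for fast linear indexing
fieldnames(typeof(v))
```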

My philosophy on this is just wait for Base to catch up. I’m sure eventually they’ll be able to figure out a way to get rid of the views allocations.


I think the real difference is that in one case the views are completely optimized out (the inlined case), while in the other they are not. Unfortunately, I had to go back to `v0.6` to see this with `@code_warntype`, which marks `y <optimized out>`
(I find the output in `0.7` illegible).