Column iterator


#1

I’d like to define a plain iterator over the columns of a matrix, in the style of julienned arrays. If I take the state to be

struct Slice{T} <: AbstractVector{T}
    a::Matrix{T}
    idx::Int
end

which of size, length, getindex, setindex!, and axes do I have to define in order to obtain a performant broadcast such as the one below?

a = rand(2, 1000)
x = rand(2)

y = Slice(a, 1) # a[:, 1]
y .= x

I’d like to avoid views and exploit the fact that columns are contiguous.


#2

What’s the problem with views? Are they allocating? Are you trying to call mutate-in-place functions?


#3

Yeah, views allocate and add overhead, since they solve a considerably more difficult problem.


#4

To be sure, I benchmarked it, and it appears that creating the view is not the problem; the cost stems from broadcasting:

yy = rand(2, 10000)
y = rand(2)
using BenchmarkTools
function f5(yy, x) 
    for i in 1:size(yy, 2)
        y = Slice1(yy, i)
        for k in eachindex(x)
            @inbounds y[k] = x[k]
        end
    end
end
function f6(yy, x) 
    for i in 1:size(yy, 2)
        y = view(yy, :, i)
        for k in eachindex(x)
            @inbounds y[k] = x[k]
        end
    end
end



@btime f5($yy, $y)
@btime f6($yy, $y)
  25.072 μs (0 allocations: 0 bytes)
  27.160 μs (0 allocations: 0 bytes)

with

struct Slice1{T} <: AbstractVector{T}
    S::Matrix{T}
    idx::Int
    sz::Base.OneTo{Int}
end
@inline function Slice1(A::Matrix{T}, i::Int) where {T}
    Slice1(A, i, Base.OneTo(size(A, 1)))
end


@inline Base.size(S::Slice1) = (length(S.sz),)
@inline Base.axes(S::Slice1) = (S.sz,)


@inline Base.IndexStyle(::Type{<:Slice1}) = IndexLinear()
@inline function Base.setindex!(S::Slice1, x, k)
    @boundscheck checkbounds(S.S, k, S.idx)
    @inbounds S.S[k, S.idx] = x
end
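
For completeness, a matching getindex in the same style (not needed for the write-only benchmark above) might look like:

@inline function Base.getindex(S::Slice1, k)
    @boundscheck checkbounds(S.S, k, S.idx)
    @inbounds S.S[k, S.idx]
end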

#5

Compare with


function f7(yy, x) 
    for i in 1:size(yy, 2)
        y = view(yy, :, i)
        @inbounds y .= x
    end
end

@btime f7($yy, $y)
 169.822 μs (10000 allocations: 468.75 KiB)

#6

I think broadcast is not strictly the problem. It acts as a function barrier and does not inline, which forces the views to be allocated. You can define the inner loop as a separate function and mark it @inline or @noinline to check; see the sketch below.
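
Something like this, reusing the yy and y from the benchmark above (copycol! and f8 are just names picked for the sketch):

@noinline function copycol!(y, x)   # swap @noinline for @inline and compare
    for k in eachindex(x)
        @inbounds y[k] = x[k]
    end
    return y
end

function f8(yy, x)
    for i in 1:size(yy, 2)
        copycol!(view(yy, :, i), x)
    end
end

@btime f8($yy, $y)   # on the 0.6/0.7 versions discussed here, the @noinline variant should allocate a view per column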

Also, your benchmark is probably skewed by copying just two numbers a thousand times. If those sizes are representative of the application, fine. If not, you will find that, while the allocations don’t disappear, the performance is almost the same on a 200×1000 example; see below.
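
One way to check that claim, reusing f6 and f7 from above (the 200×1000 size is just the one mentioned here):

yy_big = rand(200, 1000)
y_big = rand(200)

@btime f6($yy_big, $y_big)   # manual loop over views
@btime f7($yy_big, $y_big)   # broadcast over views; still allocates, but the relative overhead shrinks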

Cheers!


#7

You are right, the allocation caused by the function barrier is the problem. The benchmark is representative of the application (low-dimensional ODEs). I don’t understand why, though: the view is just an immutable struct with two fields… what has to be allocated there?
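
For reference, one can inspect the wrapper directly (the exact field names may differ between Julia versions):

v = view(rand(2, 3), :, 1)
fieldnames(typeof(v))   # (:parent, :indices, :offset1, :stride1) on 0.7+
isbits(v)               # false: the parent reference means the view is not a plain bits struct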


#8

My philosophy on this is to just wait for Base to catch up. I’m sure they’ll eventually figure out a way to get rid of the view allocations.


#9

I think the real difference is that in one case the views are completely optimized out (the inlined case), while in the other they are not. Unfortunately, I had to go back to v0.6 to see this with @code_warntype, which marks y as <optimized out> (I find the 0.7 output illegible).