# Column iterator

I’d like to define a plain iterator over the columns of a matrix, in the style of julienned arrays. If I take the state to be

```julia
struct Slice{T} <: AbstractVector{T}
    a::Matrix{T}
    idx::Int
end
```

which of `size`, `length`, `getindex`, `setindex!`, `axes` do I have to define
to obtain a performant broadcast as below?

```julia
a = rand(2, 1000)
x = rand(2)

y = Slice(a, 1) # a[:, 1]
y .= x
```

I’d like to avoid `view`s and leverage that columns are contiguous.

What’s the problem with views? Are they allocating? Are you trying to call mutate-in-place functions?

Yeah, views allocate and have overhead, as they solve a considerably more difficult problem.

To be sure, I benchmarked, and it appears that creating the `view` is not the problem; the cost stems from broadcasting:

```julia
using BenchmarkTools

yy = rand(2, 10000)
y = rand(2)

function f5(yy, x)
    for i in 1:size(yy, 2)
        y = Slice1(yy, i)
        for k in eachindex(x)
            @inbounds y[k] = x[k]
        end
    end
end

function f6(yy, x)
    for i in 1:size(yy, 2)
        y = view(yy, :, i)
        for k in eachindex(x)
            @inbounds y[k] = x[k]
        end
    end
end

@btime f5($yy, $y)
@btime f6($yy, $y)
```

```
25.072 μs (0 allocations: 0 bytes)
27.160 μs (0 allocations: 0 bytes)
```

with

```julia
import Base: size, axes, setindex!

struct Slice1{T} <: AbstractVector{T}
    S::Matrix{T}
    idx::Int
    sz::Base.OneTo{Int}
end

@inline function Slice1(A::Matrix{T}, i::Int) where {T}
    Slice1(A, i, Base.OneTo(size(A, 1)))  # size(A) is a tuple; we need the column length
end

@inline size(S::Slice1) = (length(S.sz),)
@inline axes(S::Slice1) = (S.sz,)

@inline Base.IndexStyle(::Type{<:Slice1}) = IndexLinear()
@inline function setindex!(S::Slice1, x, k)
    @boundscheck checkbounds(S.S, k, S.idx)
    @inbounds S.S[k, S.idx] = x
end
```
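The snippet above only covers writing. For reading (e.g. `y[1]`, `sum(y)`, or displaying the slice), a matching `getindex` is also needed. A sketch, restating the struct so it runs standalone:

```julia
import Base: size, getindex

struct Slice1{T} <: AbstractVector{T}
    S::Matrix{T}
    idx::Int
    sz::Base.OneTo{Int}
end
Slice1(A::Matrix, i::Int) = Slice1(A, i, Base.OneTo(size(A, 1)))

@inline size(S::Slice1) = (length(S.sz),)

# the read half of the indexing interface
@inline function getindex(S::Slice1, k::Int)
    @boundscheck checkbounds(S.S, k, S.idx)
    @inbounds S.S[k, S.idx]
end

a = reshape(collect(1.0:10.0), 2, 5)
y = Slice1(a, 3)
y[1]  # same as a[1, 3]
```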

Compare with

```julia
function f7(yy, x)
    for i in 1:size(yy, 2)
        y = view(yy, :, i)
        @inbounds y .= x
    end
end

@btime f7($yy, $y)
```

```
169.822 μs (10000 allocations: 468.75 KiB)
```

I think broadcast is not strictly the problem: it acts like a function barrier and does not inline, thereby forcing the views to be allocated. You can define the inner loop as a separate function and `@inline`/`@noinline` it to check.
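A sketch of that check (`copycol!` and `f6_inlined` are hypothetical names, not from the thread): factoring the inner loop into a separate function and forcing inlining lets the compiler elide the view entirely.

```julia
# Hypothetical helper: copies x into y element by element.
# Marked @inline so the view below never escapes and can be elided.
@inline function copycol!(y, x)
    for k in eachindex(x)
        @inbounds y[k] = x[k]
    end
    return nothing
end

function f6_inlined(yy, x)
    for i in 1:size(yy, 2)
        copycol!(view(yy, :, i), x)
    end
end

# swap @inline for @noinline above and re-measure with
# @allocated / @btime to see the view allocations reappear
```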

Also, your benchmark is probably skewed by copying just two numbers a thousand times. If these numbers are representative of the application, fine. If not, you will discover that, while the allocations don’t disappear, the performance is almost the same on a 200×1000 example.
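For instance (a sketch restating `f6`/`f7` above under hypothetical names; no timings claimed here, only the allocation counts are compared):

```julia
# like f6 above: manual inner loop over each column view
function copy_cols_loop!(yy, x)
    for i in 1:size(yy, 2)
        y = view(yy, :, i)
        for k in eachindex(x)
            @inbounds y[k] = x[k]
        end
    end
end

# like f7 above: broadcast into each column view
function copy_cols_bcast!(yy, x)
    for i in 1:size(yy, 2)
        @inbounds view(yy, :, i) .= x
    end
end

yy = rand(200, 1000)
x = rand(200)
copy_cols_loop!(yy, x); copy_cols_bcast!(yy, x)  # warm up / compile first
@show @allocated copy_cols_loop!(yy, x)
@show @allocated copy_cols_bcast!(yy, x)
```

With tall columns the copy itself dominates, so the per-column view allocation matters much less than in the 2×N case.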

Cheers!


You are right, the allocation caused by the function barrier is a problem. The benchmark is representative of the application (low-dimensional ODEs). I don’t understand why, though: the view is just an immutable struct with two fields… what has to be allocated there?
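For reference, a `SubArray` actually carries a bit more than two fields; a quick way to see this (sketch):

```julia
yy = rand(2, 10)
v = view(yy, :, 1)
# a SubArray stores the parent array and the index tuple, plus
# precomputed offset/stride fields used for fast linear indexing
fieldnames(typeof(v))
```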

My philosophy on this is just wait for Base to catch up. I’m sure eventually they’ll be able to figure out a way to get rid of the views allocations.


I think the real difference is that in one case the views are completely optimized out (the inlined case), while in the other they are not. Unfortunately, I had to go back to `v0.6` to see this with `@code_warntype`, which marks `y <optimized out>`
(I find the output in `0.7` illegible).