julia> using BenchmarkTools
julia> for i=1000
           A=randn(2i,2i)
           vA=@view A[1:i,1:i]
           C=randn(i,i)
           @btime @. $C+=$vA
       end
1.325 ms (0 allocations: 0 bytes)
julia> for i=1000
           A=randn(2i,2i)
           vA=@view A[1:i,1:i]
           C=randn(i,i)
           @btime @. $C=$vA
       end
2.490 ms (0 allocations: 0 bytes)
Corroborating this, just in one line:
julia> let i=1000; @btime $(randn(i,i)) .= $(view(randn(2i,2i), 1:i, 1:i)) end;
983.500 μs (0 allocations: 0 bytes)
julia> let i=1000; @btime $(randn(i,i)) .+= $(view(randn(2i,2i), 1:i, 1:i)) end;
687.600 μs (0 allocations: 0 bytes)
The difference intuitively reverses without the noncontiguous view; maybe that’s a hint:
julia> let i=1000; @btime $(randn(i,i)) .= $(randn(i,i)) end;
270.300 μs (0 allocations: 0 bytes)
julia> let i=1000; @btime $(randn(i,i)) .+= $(randn(i,i)) end;
627.700 μs (0 allocations: 0 bytes)
Note that `A .= B` dispatches to a specialized method that is supposed to be a performance optimization, but maybe it is slower for a non-contiguous view.

Yes, exactly. But the default `copyto!(A, B)` is as slow as `A .= B`.
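For reference, that specialized method is the identity fast path in Base's broadcast machinery. A paraphrased sketch of it (the real code lives in `base/broadcast.jl` and details vary across Julia versions):

```julia
# Paraphrased sketch of Base's identity fast path for broadcast assignment
# (see base/broadcast.jl; not the literal source, which varies by version).
# When the materialized broadcast is literally `identity.(A)` with matching
# axes, it short-circuits to a plain copyto!(dest, A):
#
# @inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
#     axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
#     if bc.f === identity && bc.args isa Tuple{AbstractArray}
#         A = bc.args[1]
#         axes(dest) == axes(A) && return copyto!(dest, A)
#     end
#     # ... otherwise fall through to the generic broadcast loop ...
# end
```

So `C .= vA` takes the `copyto!` branch, while anything wrapping the right-hand side in a non-`identity` function falls through to the generic loop.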
The only thing I can do in this case is to use hand-written `for` loops.
I think that `identity`-dispatch is the difference; here’s what happens when I made `identity`-like functions (`@nospecialize` evidently makes no difference):
julia> id2(@nospecialize x)=x
id2 (generic function with 1 method)
julia> ids(x)=x
ids (generic function with 1 method)
julia> let i=1000
@btime $(randn(i,i)) .= $(view(randn(2i,2i), 1:i, 1:i))
@btime $(randn(i,i)) .= id2.($(view(randn(2i,2i), 1:i, 1:i)))
@btime $(randn(i,i)) .= ids.($(view(randn(2i,2i), 1:i, 1:i)))
@btime copyto!($(randn(i,i)), $(view(randn(2i,2i), 1:i, 1:i)))
@btime $(randn(i,i)) .= $(randn(i,i))
@btime $(randn(i,i)) .= id2.($(randn(i,i)))
@btime $(randn(i,i)) .= ids.($(randn(i,i)))
@btime copyto!($(randn(i,i)), $(randn(i,i)))
end;
982.700 μs (0 allocations: 0 bytes)
692.600 μs (0 allocations: 0 bytes)
687.600 μs (0 allocations: 0 bytes)
989.000 μs (0 allocations: 0 bytes)
297.000 μs (0 allocations: 0 bytes)
622.700 μs (0 allocations: 0 bytes)
617.500 μs (0 allocations: 0 bytes)
264.900 μs (0 allocations: 0 bytes)
So the `identity`-dispatch to `copyto!` works a treat on contiguous matrices with the same indices, not so much on views. Not sure why `.+=` adds almost no time on top of `.= id2.(...)`; maybe it’s just that much cheaper than everything else.
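One reason `.+=` can never hit the `copyto!` fast path: it is sugar for a full broadcast with `+`, so dispatch never sees `identity`. This can be checked with lowering (output paraphrased, not verbatim):

```julia
julia> Meta.@lower C .+= vA
# lowers, roughly, to:
#   Base.materialize!(C, Base.broadcasted(+, C, vA))
# i.e. a non-identity Broadcasted, which runs the same generic
# kernel as C .= id2.(vA) rather than the copyto!(C, vA) shortcut.
```

That would explain why `.+=` on the view lands in the same 620–690 μs range as the `id2`/`ids` variants.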
This is the loop it should be dispatching to. Not sure why this is slow?
I don’t know. But see the result:
julia> using BenchmarkTools
julia> cp!(C,A) = @inbounds @simd for i in CartesianIndices(C)
           C[i] = A[i]
       end
cp! (generic function with 1 method)
julia> for i=1000
           A=randn(2i,2i)
           vA=@view A[1:i,1:i]
           C=randn(i,i)
           @btime cp!($C,$vA)
       end
981.100 μs (0 allocations: 0 bytes)
compared to 2.49 ms in the OP.
I’m guessing that’s a version of the broadcast loop, because it gets me the 680–690 μs timing of the `.= id2.(...)` benchmark that evades the `identity` branch. For some reason, the `@simd` really matters there; if I take it out, the time more than doubles to 1.554 ms. But shouldn’t the non-contiguity of the view prevent SIMD? EDIT: Maybe not; I quickly changed the indices of the view to `1:2:2i, 1:2:2i` for less contiguity and it jumped up to 1.2 ms with `@simd`…
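One workaround worth trying (my own sketch, not from this thread): for a view like `A[1:i, 1:i]`, each column is a contiguous stretch of the parent array, so copying column by column lets each inner copy be contiguous. `colcopy!` is a hypothetical name:

```julia
# Hedged sketch: per-column copy of a column-contiguous view.
# Each column of A[1:i, 1:i] is contiguous in the parent, so each
# inner copyto! runs over contiguous memory.
function colcopy!(C::AbstractMatrix, A::AbstractMatrix)
    axes(C) == axes(A) || throw(DimensionMismatch("C and A must match"))
    for j in axes(C, 2)
        copyto!(view(C, :, j), view(A, :, j))  # contiguous column copy
    end
    return C
end
```

I haven’t benchmarked this against the broadcast kernel, so whether it recovers the contiguous `copyto!` timing would need measuring with `@btime`.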