Counter-intuitive performance difference

julia> using BenchmarkTools

julia> for i=1000
           A=randn(2i,2i)
           vA=@view A[1:i,1:i]
           C=randn(i,i)
           @btime @. $C+=$vA
       end
  1.325 ms (0 allocations: 0 bytes)

julia> for i=1000
           A=randn(2i,2i)
           vA=@view A[1:i,1:i]
           C=randn(i,i)
           @btime @. $C=$vA
       end
  2.490 ms (0 allocations: 0 bytes)


Corroborating this, just in one line:

julia> let i=1000; @btime $(randn(i,i)) .= $(view(randn(2i,2i), 1:i, 1:i)) end;
  983.500 μs (0 allocations: 0 bytes)

julia> let i=1000; @btime $(randn(i,i)) .+= $(view(randn(2i,2i), 1:i, 1:i)) end;
  687.600 μs (0 allocations: 0 bytes)

The difference reverses, as intuition would suggest, without the noncontiguous view; maybe that's a hint:

julia> let i=1000; @btime $(randn(i,i)) .= $(randn(i,i)) end;
  270.300 μs (0 allocations: 0 bytes)

julia> let i=1000; @btime $(randn(i,i)) .+= $(randn(i,i)) end;
  627.700 μs (0 allocations: 0 bytes)

Note that A .= B dispatches to a specialized method that is supposed to be a performance optimization, but maybe it is slower for a non-contiguous view.
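For context on that dispatch: a bare A .= B lowers to a lazy Broadcasted whose function is identity, and Base.Broadcast forwards exactly that case to copyto!. A small sketch (keep_loop is a made-up do-nothing wrapper) showing what the fast path keys on and how a wrapper function evades it:

```julia
# A .= B builds a lazy Broadcasted whose function is identity:
B = [Float64(i + 4j) for i in 1:4, j in 0:3]
bc = Base.Broadcast.broadcasted(identity, B)
@assert bc.f === identity   # this is what the copyto! fast path keys on

A = zeros(4, 4)
A .= B                      # identity broadcast: hits the copyto! fast path
@assert A == B

keep_loop(x) = x            # made-up do-nothing wrapper; its broadcast is no
C = zeros(4, 4)             # longer an identity Broadcasted, so the generic
C .= keep_loop.(B)          # broadcast loop runs instead
@assert C == B
```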


Yes, exactly. But the default copyto!(A, B) is as slow as A .= B.
The only workaround I see in this case is to fall back to hand-written for loops.

I think the identity dispatch is the difference. Here's what happens when I use identity-like functions (@nospecialize evidently makes no difference):

julia> id2(@nospecialize x)=x
id2 (generic function with 1 method)

julia> ids(x)=x
ids (generic function with 1 method)

julia> let i=1000
         @btime $(randn(i,i)) .= $(view(randn(2i,2i), 1:i, 1:i))
         @btime $(randn(i,i)) .= id2.($(view(randn(2i,2i), 1:i, 1:i)))
         @btime $(randn(i,i)) .= ids.($(view(randn(2i,2i), 1:i, 1:i)))
         @btime copyto!($(randn(i,i)), $(view(randn(2i,2i), 1:i, 1:i)))
         @btime $(randn(i,i)) .= $(randn(i,i))
         @btime $(randn(i,i)) .= id2.($(randn(i,i)))
         @btime $(randn(i,i)) .= ids.($(randn(i,i)))
         @btime copyto!($(randn(i,i)), $(randn(i,i)))
       end;
  982.700 μs (0 allocations: 0 bytes)
  692.600 μs (0 allocations: 0 bytes)
  687.600 μs (0 allocations: 0 bytes)
  989.000 μs (0 allocations: 0 bytes)
  297.000 μs (0 allocations: 0 bytes)
  622.700 μs (0 allocations: 0 bytes)
  617.500 μs (0 allocations: 0 bytes)
  264.900 μs (0 allocations: 0 bytes)

So the identity dispatch to copyto! works a treat on contiguous matrices with the same indices, but not on views. I'm not sure why .+= adds almost no time on top of .= id2.(...); maybe the addition is just that much cheaper than everything else.
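Given those numbers, one possible workaround is to dispatch on the source type yourself: keep the copyto! fast path for plain Arrays, and insert a do-nothing function for everything else to force the (apparently faster here) broadcast loop. Just a sketch; mycopy! and _id are made-up names:

```julia
# Sketch of a contiguity-aware copy (mycopy! and _id are made-up names):
_id(x) = x

# plain Array source: let the copyto! fast path do its thing
mycopy!(C::AbstractArray, A::Array) = copyto!(C, A)

# anything else (e.g. a SubArray): wrap in _id to evade the identity
# dispatch, so the generic broadcast loop runs instead
mycopy!(C::AbstractArray, A::AbstractArray) = (C .= _id.(A); C)

A  = randn(8, 8)
vA = @view A[1:4, 1:4]
C  = zeros(4, 4)
mycopy!(C, vA)
@assert C == A[1:4, 1:4]
```

Since Array is more specific than AbstractArray, plain-Array sources still take the first method with no ambiguity.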


This is the loop it should be dispatching to. Not sure why this is slow?

I don’t know. But see the result:

julia> using BenchmarkTools

julia> cp!(C,A) = @inbounds @simd for i in CartesianIndices(C)
           C[i] = A[i]
       end
cp! (generic function with 1 method)

julia> for i=1000
           A=randn(2i,2i)
           vA=@view A[1:i,1:i]
           C=randn(i,i)
           @btime cp!($C,$vA)
       end
  981.100 μs (0 allocations: 0 bytes)


compared to 2.49 ms in the OP.

I’m guessing that’s a version of the broadcast loop, because it gets me the 680-690 μs timing of the .= id2.( benchmark that evades the identity branch. For some reason, the @simd really matters there: if I take it out, the time more than doubles to 1.554 ms. But shouldn’t the non-contiguity of the view prevent SIMD? EDIT: maybe not. I quickly changed the indices of the view to 1:2:2i, 1:2:2i for even less contiguity and it only jumped up to 1.2 ms with @simd.
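For anyone wanting to reproduce the EDIT's strided variant, a self-contained sketch (a made-up small i so it finishes instantly; actual timings would of course need @btime from BenchmarkTools):

```julia
# Self-contained repro of the strided variant from the EDIT
cp!(C, A) = @inbounds @simd for i in CartesianIndices(C)
    C[i] = A[i]
end

i  = 100
A  = randn(2i, 2i)
vA = @view A[1:2:2i, 1:2:2i]   # stride 2 in both dimensions
C  = randn(i, i)
cp!(C, vA)
@assert C == A[1:2:2i, 1:2:2i]
```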
