What is the fastest way to update a particular column of a matrix using dot syntax?

Hi,

I have a square matrix M, and I would like to update it in place so that the kth column is multiplied by some scalar s. I’ve tried implementing this both with dot syntax and with a simple explicit for loop.

function diagright1!(M, s, col)
    @views M[:, col] .*= s
end

function diagright2!(M, s, col)
    d = size(M, 1)
    for ind in 1:d
        M[ind, col] = M[ind, col] * s
    end
end

I was hoping the dot syntax would be as performant as the for loop, but I got this benchmark result instead.

using BenchmarkTools

M = rand(500, 500);
s = 3.2
col = 2
b1 = @benchmarkable diagright1!(M1, $s, $col) setup = (M1 = copy($M))
b2 = @benchmarkable diagright2!(M2, $s, $col) setup = (M2 = copy($M))
run(b1)
run(b2)
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  2
  --------------
  minimum time:     277.000 ns (0.00% GC)
  median time:      327.000 ns (0.00% GC)
  mean time:        346.747 ns (0.00% GC)
  maximum time:     17.267 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     181.000 ns (0.00% GC)
  median time:      198.000 ns (0.00% GC)
  mean time:        203.403 ns (0.00% GC)
  maximum time:     13.936 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The dot syntax is >50% slower than the naive for-loop implementation. I’ve tried removing the @views macro and/or removing the . from .*=, but the implementation above was the fastest of the bunch.
My question is: is there some clever trick I missed that speeds up the dot-syntax implementation, or is this performance expected and thus unavoidable?

This looks like the overhead associated with constructing the view itself. There may be a better way to work around it, but one option is UnsafeArrays.jl, which provides a non-allocating (and unsafe) uview function. It works well in this case and seems to give excellent performance:

julia> using UnsafeArrays

julia> function diagright3!(M, s, col)
           uview(M, :, col) .*= s
       end
diagright3! (generic function with 1 method)

julia> b3 = @benchmarkable diagright3!(M3, $s, $col) setup = (M3 = copy($M))
Benchmark(evals=1, seconds=5.0, samples=10000)

julia> run(b3)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     159.000 ns (0.00% GC)
  median time:      376.000 ns (0.00% GC)
  mean time:        419.402 ns (0.00% GC)
  maximum time:     91.006 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The uview is unsafe because it doesn’t protect the original array from being garbage-collected, so if you were to hold onto the view longer than the original array you could get undefined behavior. It’s essentially a nicer wrapper around a raw pointer.
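If you want to guard against that, a minimal sketch (my addition, not something the package requires) is to wrap the use of the view in Base’s GC.@preserve, which keeps the parent array alive for the duration of the block:

using UnsafeArrays

function diagright3_safe!(M, s, col)  # hypothetical name, for illustration
    GC.@preserve M begin              # M cannot be collected inside this block
        uview(M, :, col) .*= s
    end
    return nothing
end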

Thank you for your response! I did not know about UnsafeArrays, and it looks useful for reducing memory allocation.

However, looking at your benchmark, the mean and median run times are actually slower than for the @views version of the algorithm. Do you have any insight into why that might be the case?

Is it actually slower on your machine? My other benchmark numbers were slower than yours, so you probably just have a faster processor.

Right! I blindly just compared your numbers with my numbers. Thanks for patiently pointing it out.

Here’s the benchmarking result on my machine. It is indeed a lot faster than either the plain dot syntax or the explicit for loop!

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     93.000 ns (0.00% GC)
  median time:      133.000 ns (0.00% GC)
  mean time:        146.088 ns (0.00% GC)
  maximum time:     3.158 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

You can also speed up the explicit loop by adding @inbounds; this makes it comparable to uview for me.

Returning nothing instead of the view also helped that version, but not as much.
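For concreteness, a minimal sketch of that return-nothing variant (my spelling of it, not the exact code I benchmarked):

function diagright1_nothing!(M, s, col)  # hypothetical name
    @views M[:, col] .*= s
    return nothing  # don’t return the SubArray produced by the broadcast
end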

I don’t know about speeding up the broadcasted version, but adding @inbounds makes a big difference for me:

function diagright2!(M, s, col)
    d = size(M, 1)
    for ind in 1:d
        M[ind, col] = M[ind, col] * s
    end
end

function diagright3!(M, s, col)
    for row in axes(M, 1)  # this is more general
        @inbounds M[row, col] *= s
    end
end


julia> @benchmark diagright2!(M, 3.2, 2) setup=(M=rand(500,500))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     289.777 ns (0.00% GC)
  median time:      323.277 ns (0.00% GC)
  mean time:        325.961 ns (0.00% GC)
  maximum time:     734.010 ns (0.00% GC)
  --------------
  samples:          6021
  evals/sample:     310

julia> @benchmark diagright3!(M, 3.2, 2) setup=(M=rand(500,500))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     52.567 ns (0.00% GC)
  median time:      59.098 ns (0.00% GC)
  mean time:        64.215 ns (0.00% GC)
  maximum time:     352.070 ns (0.00% GC)
  --------------
  samples:          6209
  evals/sample:     987

I tried @inbounds on my machine too.

function diagright4!(M, s, col)
    d = size(M, 1)
    for ind in 1:d
        @inbounds M[ind, col] = M[ind, col] * s
    end
end

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     55.000 ns (0.00% GC)
  median time:      112.000 ns (0.00% GC)
  mean time:        123.933 ns (0.00% GC)
  maximum time:     571.000 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

It is slightly faster than the uview version. Thus, performance-wise we have uview + broadcast ~ for loop + @inbounds. I’m glad to know that simple dot syntax with uview can be almost as fast as a for loop.

I suggest that you get used to using axes, eachindex, etc. instead of 1:n-style loops. They have the same performance but are more general. Might as well get into the habit😉
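For example (a sketch along those lines, not code from the posts above), eachindex also works on a view, so the column case can be written without referring to the parent’s axes at all:

function diagright5!(M, s, col)   # hypothetical name
    Mcol = @view M[:, col]
    for i in eachindex(Mcol)      # generic over whatever index style the view supports
        @inbounds Mcol[i] *= s
    end
    return nothing
end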

I have this habit (and I agree with the recommendation), but my experience is that, unfortunately, people who are learning Julia (or who don’t know Julia but just want to read the code as pseudocode) have a harder time understanding my code, so sometimes I have to switch back to 1:n-style indexing.

I don’t think the overhead is the creation of the view. The output of

using BenchmarkTools

function diagright1!(M, s, col)
    @views M[:, col] .*= s
end

function diagright4!(M, s, col)
    d = size(M, 1)
    for ind in 1:d
        @inbounds M[ind, col] = M[ind, col] * s
    end
end

col = 2
M = rand(500, col);
s = 3.2
@btime diagright1!(M1, $s, $col) setup = (M1 = copy($M))
@btime diagright4!(M1, $s, $col) setup = (M1 = copy($M))

is

  303.104 ns (2 allocations: 96 bytes)
  34.064 ns (0 allocations: 0 bytes)

while with M = rand(5000, col) it’s

  1.941 μs (2 allocations: 96 bytes)
  534.723 ns (0 allocations: 0 bytes)

If the overhead of diagright1! were view creation, you’d expect a constant overhead. Instead, it seems that the difference grows with the input size.

I reported the issue here: Broadcast with view is slow when it is input and output · Issue #35158 · JuliaLang/julia · GitHub

Hmm. Why is @views even helpful here? (Which it is, I tested it.) Shouldn’t M[:,col] .*= s do an element-by-element (hence non-allocating) assignment via the broadcasting framework?

You can check what is happening with Meta.@lower:

julia> Meta.@lower M[:,col] .*= s
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.dotview(M, :, col)
│   %2 = Base.getindex(M, :, col)
│   %3 = Base.broadcasted(*, %2, s)
│   %4 = Base.materialize!(%1, %3)
└──      return %4
))))

As you can see, the LHS (%1) is accessed through a view but the RHS (%2) copies the column.
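For comparison, a minimal sketch (mine, not from the lowered output above) of spelling the view out by hand, which is essentially what @views arranges for this expression:

v = view(M, :, col)  # one SubArray; no copy of the column data
v .= v .* s          # a single in-place broadcast that reads through the view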

Oh interesting. That is indeed a rather huge cost as the input size becomes larger. Thanks for creating the issue, and let’s hope it gets fixed!

Agreed, that is not how I expected broadcasting to work. We have

julia> Meta.@lower x .= x .+ 1
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.broadcasted(+, x, 1)
│   %2 = Base.materialize!(x, %1)
└──      return %2
))))

but

julia> Meta.@lower x[:] .= x[:] .+ 1
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.dotview(x, :)
│   %2 = Base.getindex(x, :)
│   %3 = Base.broadcasted(+, %2, 1)
│   %4 = Base.materialize!(%1, %3)
└──      return %4
))))

The equivalent expressions using .+= lower to the same things. So broadcasting does not fuse with getindex (though it apparently does with setindex!). I am quite surprised by this and think fusing would be a good feature to have.

Edit: This appears to be just a matter of syntax lowering. If we replace [:] with getindex.( ) it seems to work:

julia> Meta.@lower x[:] .= getindex.(x, :) .+ 1
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope'
1 ─ %1 = Base.dotview(x, :)
│   %2 = Base.broadcasted(getindex, x, :)
│   %3 = Base.broadcasted(+, %2, 1)
│   %4 = Base.materialize!(%1, %3)
└──      return %4
))))

Edit 2: Even though the code above lowers, it doesn’t do what you want. (In fact, it errors.) The reason is that getindex.(x, :) broadcasts over x, not over :. That is, it is roughly equivalent to [getindex(x[1], :), getindex(x[2], :), ...]. See https://github.com/JuliaLang/julia/issues/19169 for a relevant discussion of the complicated semantics of broadcasted indexing.
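To make that scalar-index behavior concrete, a toy example (my illustration, not from the issue):

julia> x = [[1, 2], [3, 4]];

julia> getindex.(x, 1)  # pairs each element of x with the scalar index 1
2-element Array{Int64,1}:
 1
 3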

Yeah, I agree it’s counter-intuitive that the view on the LHS is automatic but it’s not automatic for the RHS. It’d be nice if we could fix this.

@tkf I see you created an issue for this on GitHub. FYI I also created one, In broadcasted assignment, indexing on LHS is a view but indexing on RHS is not · Issue #35171 · JuliaLang/julia · GitHub, on a different aspect.

For people reading this now, this issue seems to have been fixed in 1.5:
https://github.com/JuliaLang/julia/issues/35158

julia> function diagright1!(M, s, col)
           @views M[:, col] .*= s
       end
diagright1! (generic function with 1 method)

julia> function diagright2!(M, s, col)
           d = size(M, 1)
           for ind in 1:d
               M[ind, col] = M[ind, col] * s
           end
       end
diagright2! (generic function with 1 method)

julia> function diagright3!(M, s, col)
           d = size(M, 1)
           @inbounds for ind in 1:d
               M[ind, col] = M[ind, col] * s
           end
       end
diagright3! (generic function with 1 method)

julia> M = rand(500, 500);

julia> s = 3.2
3.2

julia> @benchmark diagright1!(M1, $s, 2) setup = (M1 = copy($M)) #unaligned accesses given AVX512
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     61.930 ns (0.00% GC)
  median time:      63.224 ns (0.00% GC)
  mean time:        64.618 ns (0.00% GC)
  maximum time:     103.881 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     982

julia> @benchmark diagright2!(M1, $s, 2) setup = (M1 = copy($M))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     286.850 ns (0.00% GC)
  median time:      287.017 ns (0.00% GC)
  mean time:        288.267 ns (0.00% GC)
  maximum time:     331.038 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     346

julia> @benchmark diagright3!(M1, $s, 2) setup = (M1 = copy($M)) #unaligned accesses given AVX512
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     43.287 ns (0.00% GC)
  median time:      50.369 ns (0.00% GC)
  mean time:        52.284 ns (0.00% GC)
  maximum time:     93.080 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark diagright1!(M1, $s, 3) setup = (M1 = copy($M)) #aligned accesses given AVX512
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     38.126 ns (0.00% GC)
  median time:      38.352 ns (0.00% GC)
  mean time:        40.935 ns (0.00% GC)
  maximum time:     98.494 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

julia> @benchmark diagright2!(M1, $s, 3) setup = (M1 = copy($M))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     286.050 ns (0.00% GC)
  median time:      286.229 ns (0.00% GC)
  mean time:        287.626 ns (0.00% GC)
  maximum time:     338.982 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     280

julia> @benchmark diagright3!(M1, $s, 3) setup = (M1 = copy($M)) #aligned accesses given AVX512
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.261 ns (0.00% GC)
  median time:      27.755 ns (0.00% GC)
  mean time:        29.346 ns (0.00% GC)
  maximum time:     83.991 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

Oh interesting! Do you mind elaborating on why col = 2 and col = 3 have such different performance?
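A hedged sketch of the arithmetic behind the #aligned/#unaligned comments above (my reading, assuming the matrix’s Float64 data itself starts 64-byte aligned): in a 500×500 Matrix{Float64}, column k begins (k - 1) * 500 * 8 bytes into the buffer, and AVX512 works with 64-byte vectors.

julia> (2 - 1) * 500 * 8 % 64  # column 2 starts 4000 bytes in: not 64-byte aligned
32

julia> (3 - 1) * 500 * 8 % 64  # column 3 starts 8000 bytes in: 64-byte aligned
0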