In-place GEMM

My LHS matrix is huge and needs to be constructed in parts from sub-matrices, e.g.,

lhs = [ b11 b12
        b21 b22 ]

Is there a way to write the result directly into a sub-matrix, e.g. b22, to save memory? I believe gemm in Julia is also based on BLAS, which allows this, though it's painful to use directly.

Have you tried just creating a view of b22 (e.g. b22 = @view lhs[i2:end, j2:end]) and then doing an in-place operation (mul! or ldiv!) on that?
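That pattern does work, as long as the output block is not also one of the inputs. A minimal sketch (sizes and names hypothetical):

```julia
using LinearAlgebra

lhs = zeros(10, 10)
b22 = @view lhs[6:end, 6:end]   # 5×5 sub-block of the big matrix

A, B = rand(5, 5), rand(5, 5)
mul!(b22, A, B)                 # writes A*B straight into lhs[6:end, 6:end]
```

Because the view is strided, mul! still dispatches to BLAS, so nothing the size of lhs is allocated.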


Unless I’m missing something or misunderstanding, I don’t think this works. For example:

using LinearAlgebra

A = rand(8,8)
a22 = view(A, 4:8, 4:8)
mul!(a22, 10*randn(5,5), a22)

errors. This is expected, as the docstring for mul! says

mul!(Y, A, B) -> Y

  Calculates the matrix-matrix or matrix-vector product AB and stores the result in Y, overwriting the existing value of Y. Note that Y must not be aliased with either A or B.

To do in-place, overwriting matrix multiplication you need to use rmul! or lmul!. Unfortunately those functions don’t seem to work with sub-arrays:

julia> rmul!(a22,4*randn(5,5))
ERROR: MethodError: no method matching rmul!(::SubArray{Float64,2,Array{Float64,2},Tuple{UnitRange{Int64},UnitRange{Int64}},false}, ::Array{Float64,2})
Closest candidates are:
  rmul!(::AbstractArray, ::Number) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/generic.jl:161
  rmul!(::StridedArray{T, 2}, ::LowerTriangular{T,var"#s817"} where var"#s817"<:(StridedArray{T, 2} where T)) where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/triangular.jl:749
  rmul!(::StridedArray{T, 2}, ::Transpose{var"#s807",var"#s806"} where var"#s806"<:(LowerTriangular{T,var"#s805"} where var"#s805"<:(StridedArray{T, 2} where T)) where var"#s807") where T<:Union{Complex{Float32}, Complex{Float64}, Float32, Float64} at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/triangular.jl:759

Is there a reason why this shouldn’t work for subarrays?
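In the meantime, one workaround is to break the aliasing yourself by copying just the block, which allocates only the block's size rather than the whole matrix (a sketch, not an API from this thread):

```julia
using LinearAlgebra

A = rand(8, 8)
a22 = view(A, 4:8, 4:8)
M = 10 * randn(5, 5)

expected = M * Matrix(a22)   # reference result, computed before mutating
mul!(a22, M, copy(a22))      # copy(a22) removes the aliasing, so mul! is legal
```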

using LinearAlgebra, BenchmarkTools

t = zeros(1010, 1010)
v = rand(1000, 1000)
c = view(t, 11:1010, 11:1010)
@btime BLAS.gemm!('T', 'N', 1., v, v, 1., c)  # 6.135 ms (0 allocations: 0 bytes)

@btime v'v # 5.226 ms (3 allocations: 7.63 MiB)
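For completeness, the same accumulation can be written without calling BLAS directly: since Julia 1.3, the five-argument mul!(C, A, B, α, β) computes C = α*A*B + β*C in place and dispatches to gemm! for strided views like this one:

```julia
using LinearAlgebra

t = zeros(1010, 1010)
v = rand(1000, 1000)
c = view(t, 11:1010, 11:1010)

mul!(c, v', v, 1.0, 1.0)   # c = 1.0*v'v + 1.0*c, no extra allocation
```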

Octavian.matmul! is even better, since it:

  • is faster
  • allows mixed precision
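A sketch of what that might look like, assuming the Octavian.jl package is installed (the particular mixed-precision combination shown is an illustration, not something benchmarked in this thread):

```julia
using Octavian

A32, B32 = rand(Float32, 64, 64), rand(Float32, 64, 64)
C64 = Matrix{Float64}(undef, 64, 64)

# matmul! takes the same (C, A, B) argument order as LinearAlgebra.mul!;
# here Float32 inputs are written into a Float64 output.
Octavian.matmul!(C64, A32, B32)
```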

The error message is informative here:

julia> mul!(a22, 10*randn(5,5), a22)
ERROR: ArgumentError: output matrix must not be aliased with input matrix

mul! can’t write into the same array it’s reading, as this would overwrite data it still needs to read. This is true for any dense rectangular matrices, whether or not they are views.

And lmul!/rmul! only work in cases where it is possible to compute the output and overwrite the input sequentially, such as for various structured matrices. They don’t mind views, though:

julia> lmul!(Diagonal([0,100]), view(rand(4,4), 1:2, 3:4))
2×2 view(::Array{Float64,2}, 1:2, 3:4) with eltype Float64:
  0.0      0.0
 71.1938  90.2883

julia> lmul!(UpperTriangular(rand(2,2)), view(rand(4,4), 1:2, 3:4))
2×2 view(::Array{Float64,2}, 1:2, 3:4) with eltype Float64:
 0.0627902  0.525721
 0.0012771  0.494267

Thanks @mcabbott, that helped clarify things.

For some reason I was thinking BLAS had an overwriting version of GEMM for dense matrices. At least in a naive matrix multiplication algorithm it’s possible to do this by allocating a temporary array that can hold a single row or column of the output. I’m not sure why I thought standard BLAS implementations had such a routine. As far as I can tell they do not, and upon reflection it seems like this would only be useful in rare circumstances.
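For what it's worth, that row/column-buffer idea is easy to sketch in pure Julia: overwriting B with A*B needs only O(n) scratch space, because each output column depends only on the matching input column (helper name hypothetical):

```julia
using LinearAlgebra

# Naive in-place B ← A*B using a single temporary column; one vector of
# scratch space instead of a full copy of B.
function lmul_colbuf!(A::AbstractMatrix, B::AbstractMatrix)
    n = size(B, 1)
    size(A) == (n, n) || throw(DimensionMismatch("A must be square"))
    tmp = similar(B, n)                # holds one column of the output
    for j in axes(B, 2)
        mul!(tmp, A, view(B, :, j))    # tmp = A * B[:, j], no aliasing
        B[:, j] .= tmp                 # overwrite the column
    end
    return B
end

A, B = rand(4, 4), rand(4, 6)
C = A * B          # reference result
lmul_colbuf!(A, B)
```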