Non-allocating views

I recently installed Julia Version 1.5.0-beta1.0 and I was impressed with the non-allocating views. I just tried an example of LU factorisation written in vectorised style to get a feel for it, and the performance compared very well to the inbuilt generic_lufact! which uses straight loops.

swap!(x,y) = for i in eachindex(x, y) @inbounds x[i], y[i] = y[i], x[i] end

function lufact!(A, p)

    m, n = size(A)
    p .= 1:m;

    @views @inbounds begin
        for k = 1:m-1
            _, kp = findmax(i->abs(A[i,k]), k:m)  #35316
            if k != kp
                swap!(A[k,:], A[kp,:])
                swap!(view(p,k), view(p,kp))
            end
            A[k+1:m,k] ./= A[k,k]
            for j = k+1:n
                 A[k+1:m,j] .-= A[k,j] .* A[k+1:m,k]
            end
        end
    end
    A, p
end

Note: I used the two argument findmax from https://github.com/JuliaLang/julia/pull/35316#issuecomment-622629895 to find the maximum absolute value for the pivot. I hope this will make its official appearance in Julia shortly, because it seems very useful. But maybe there’s a better way to do this that I overlooked.

Quick timing test:

using LinearAlgebra:generic_lufact!
function lutest()
    N = 1000; A = rand(N,N); p = zeros(Int, N);
    B = copy(A); @time lufact!(B,p);
    C = copy(A); @time generic_lufact!(C);
    @show B ≈ C
end

I found the time was very competitive (faster even) compared to generic_lufact! which surprised me.

julia> lutest()
  0.115506 seconds
  0.206406 seconds (1 allocation: 7.938 KiB)
B ≈ C = true

By comparison, in Julia Version 1.4.2 the results were

julia> lutest()
  0.236483 seconds (1.51 M allocations: 91.735 MiB, 1.80% gc time)
  0.210450 seconds (2 allocations: 7.969 KiB)
B ≈ C = true

with lots of allocations.

I’m new to Julia, so any feedback on style or substance is appreciated. I also wondered if I overlooked an inbuilt function to swap the contents of two arrays (like BLAS swap).

Thanks for your time.

1 Like

Just to be sure, these are second timings, right? If you just executed lutest() a single time in each terminal, you are measuring compilation time. Also, if my memory does not fail me, from one minor version to the other the values of rand can change, so this can introduce additional noise if the computation time depends on the specific values.

Good point, yes I ran it multiple times, and the timings are consistent. The randomness doesn’t change much regarding the runtime.