Linpack Performance (as a learning exercise)

Hi everyone,

I am exploring Julia as a high-performance language, since everyone says Julia has great performance … if you do it right. I don’t think I’m doing it right, and I’d love to be educated.

Eventually, I want to use ML methods and automatic differentiation, but I’m starting with just making sure I can do some basic operations. I wrote the LINPACK algorithm in Julia since it’s a well-benchmarked algorithm with concrete performance comparisons. I don’t expect to reach BLAS performance, though maybe you will all prove me wrong!

Using numpy, I can get 50+ GFlops on the CPU, and on the GPU it closes in on nearly 100 GFlops. Neither is getting near the theoretical peak of this hardware, but it’s running in Python while I’m also running Firefox and writing this post - I don’t expect to hit the peak.
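For reference, this is roughly how I convert a timing into a flop rate (a minimal sketch using the standard 2n³/3 flop count for LU; lu_gflops is just a throwaway helper, and lu! here is the stdlib LinearAlgebra factorization, not my code):

using LinearAlgebra

# Rough flop-rate estimate for one n×n LU factorization: LU costs
# about (2/3) * n^3 floating-point operations.
function lu_gflops(n::Integer)
    A = rand(n, n)
    lu!(copy(A))            # warm-up call so compilation isn't timed
    B = copy(A)
    t = @elapsed lu!(B)     # time a single in-place factorization
    return (2 * n^3 / 3) / (t * 1e9)
end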

My Julia script actually has the best performance up to matrix sizes of 128x128 on the CPU, which is nice. After that, performance stops going up. When I profiled it, most of the hits were on this line here:

For context, if you don’t want to click the link, here is the loop:


"""
    form_gauss_vector!(_A, _k)

Form the Gauss vector for column `_k` and apply the elimination update
to the matrix in place. At each stage of the algorithm, less and less of
a column/matrix is needed, so the returned vector has length `N - _k + 1`.
"""
function form_gauss_vector!(_A::Matrix{T}, _k::Integer)::Vector{T} where {T<:Real}
    # Start with the values at every point in the target column
    N = size(_A, 1)

    # Scale by the pivot, which should be the biggest value at or below
    # the diagonal in this column, from the pivoting. Slicing already
    # allocates a new array, so no extra copy() is needed.
    gauss_vector = _A[_k:N, _k] / _A[_k, _k]
    target_row = @view _A[_k, _k:N]

    # Update the matrix by subtracting the outer product of the Gauss
    # vector and the pivot row from the lower-right corner. The pivot row
    # itself is skipped (idx starts at _k + 1), and the inner loop walks
    # down a column, which matches Julia's column-major storage.
    @inbounds for idy in _k:N
        t_y = idy - _k + 1
        for idx in _k+1:N
            # This is the performance-critical line:
            _A[idx, idy] -= gauss_vector[idx - _k + 1] * target_row[t_y]
        end
    end

    return gauss_vector
end

This function is meant to perform the _k-th step of Gaussian elimination on the matrix _A, in place, using row _k as the pivot row. I think I haven’t optimized this loop… well, I know I haven’t, or performance would be better. I am curious whether anyone can spot obvious errors, or suggest some reading material for writing better code here?
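In case the calling context matters, here is roughly how I drive it (a simplified sketch: the pivoting step is left out, and eliminate_all! is just a placeholder name, not the real driver):

# Simplified driver sketch: row pivoting is omitted, so as written this
# only behaves well on matrices that don't need row swaps.
function eliminate_all!(_A::Matrix{T}) where {T<:Real}
    N = size(_A, 1)
    gauss_vectors = Vector{Vector{T}}()
    for k in 1:N-1
        # The real code pivots here before forming the Gauss vector.
        push!(gauss_vectors, form_gauss_vector!(_A, k))
    end
    # _A now holds U in its upper triangle; the returned vectors are
    # the columns of L.
    return gauss_vectors
end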

Thanks in advance,
Corey

Around that size is when BLAS libraries start using multithreading. You might just be maxing out a single core.
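You can check and pin the BLAS thread count from Julia to make the comparison apples-to-apples (both calls are in the stdlib LinearAlgebra):

using LinearAlgebra

BLAS.get_num_threads()   # how many threads BLAS is currently using
BLAS.set_num_threads(1)  # pin BLAS to one thread for a single-core comparison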


Not using SIMD? Use @simd at least, if not LoopVectorization.jl

But as the matrix size gets bigger, your flop rate will go down due to the poor cache locality of LINPACK-style algorithms.
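For example, the hot loop could be annotated like this (an untested sketch; with LoopVectorization.jl you would put @turbo on the loop nest instead of @simd):

@inbounds for idy in _k:N
    # Hoist the row element out of the inner loop
    g_row = target_row[idy - _k + 1]
    @simd for idx in _k+1:N
        _A[idx, idy] -= gauss_vector[idx - _k + 1] * g_row
    end
end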