Linpack Performance (as an learning exercise)

Hi everyone,

I am exploring Julia as a high performance language since everyone says Julia has great performance … if you do it right. I don’t think I’m doing it right, and I’d love to be educated.

Eventually, I want to use ML methods and auto differentiation, but I’m starting with just making sure I can do some basic operations. I wrote the LINPACK algorithm in Julia since it’s a well benchmarked algorithm, with concrete performance comparisons. I don’t expect to get to BLAS performance though maybe you will all prove me wrong!

Using numpy, I can get 50+ GFlops on the CPU and on the GPU it closes in on nearly 100 GFlops. Neither is getting near the theoretical peak of this hardward, but ok it’s running with python while I’m also writing firefox and writing this post - I don’t expect to hit the peak.

My Julia script actually has the best performance up to matrix sizes of 128x128, on the CPU, which is nice. After that, performance stops going up. When I profiled it, most of the hits were on this line here:

For context if you don’t want to click the link, here is the loop:

function form_gauss_vector!(_A::Matrix{T}, _k::Integer)::Vector{T} where {T<:Real}
    Form the gauss vector and update the matrix value as the return

    The vector, at each stage of the algorithm, needs less and less of a column/matrix


    # Start with the values at every point in the target column
    N = size(_A, 1)

    gauss_vector = copy(_A[_k:N, _k] / _A[_k,_k])
    @views target_row   = _A[_k, _k:N]

    # Scale the gauss vector 
    # gauss_vector = (gauss_vector / a_nn) # Copy?

    # This should be the biggest value at or below the diagonal 
    # in this column, from the pivoting.

    # Update the matrix by subtracting off the gauss vector.
    # subtract off the low right corner from the outer product:

    # Skip the first row:

    @inbounds for idy in _k:N
        t_y = idy - _k + 1
        @inbounds for idx in _k+1:N
            # This is the performance critical line:
            _A[idx, idy] -= gauss_vector[idx - _k + 1] * target_row[t_y]

    return gauss_vector


This function is meant to perform gaussian elimination on row _k of the matrix _A, in place. I think I haven’t optimized this loop…well I know I haven’t, or performance would be better. I am curious if anyone can spot obvious errors, or make some suggestions for reading materials to write better code here?

Thanks in advance,

Around that size is when BLAS libraries start using multithreading. You might just be maxing out a single core.

1 Like

Not using SIMD? Use @simd at least, if not LoopVectorization.jl

But as the matrix size gets bigger your flop rate will go down due to the poor cache locality of LINPACK style algorithms.