Utilizing @turbo / @tturbo in Performance Critical Code

nosewitz · September 8, 2022, 1:45pm

I am writing a KNN prediction algorithm from scratch and I wanted to see if I can squeeze even more performance out of with LoopVectorization’s @turbo macros. However when trying to compile the method, it seems that initializing the number of rows to iterate through using size(::Matrix, 2) is not meshing with the macro. Below is the code:

# Simple struct
struct KNN
    k::Int
    M::Matrix
    Labels
end

# Prealloc
const prediction = Array{Float16}(undef, size(test_set, 1))

function predicts(model::KNN, test::Matrix)
    #println(size(model.M, 2), size(test, 2))
    size(model.M, 2) == size(test, 2) || throw("Dimensions do not match, test set must contain the same variables.")

    # For each row in test Matrix
        # compute distance with every member of the model's Matrix
        # sort based on distance
        # Look at classes and use mode to vote on majority
        # Assign prediction to Vector 
    # Repeat

    # Rotating for column-wise operations
    r1 = rotl90( model.M ) # Training data
    r2 =  rotl90( test ) # Test data

    nrows = size( r2, 2) # Number of test rows     <------ Incompatible?
    orows = size( r1, 2) # Number of train rows    <------ Incompatible?
    
    K = model.k

    # IDs
    id = collect(1:orows)
    #Prealloc distance array
    dist = Array{Float32}(undef, orows)


    @turbo for i in 1:nrows

        curr_row = r2[:, i]

        # Compute distance between the current row (internally  a column) and the training set
        for j in 1:orows
            dist[j]= 1/distance(curr_row, r1[:, j]; p = 2)^2
        end

        # Store and sort
        df = DataFrame( distances= dist, id = id)
        sort!(df, :distances, rev=true)[1:K, :]
        # Vote
        prediction[i] = mode( model.Labels[ df.id ] )
    end

    prediction
end

Here is the error message:

Closes function definition for predicts(model::KNN, test::Matrix)

ERROR: LoadError: Expression not recognized.
(Expr(:parameters, :((Expr(:kw, :p, 2)))))
in expression starting at c:\Users\aledo\Desktop\Julia_Files\dataf.jl:92

Where line 92 is the outer loop going from i = 1:nrows .

It’s not 100% critical that it works with @turbo as I am already getting good performance using Theads.@threads or @fastmath.

Thanks in advance : )

Elrod · September 8, 2022, 2:57pm

I’d suggest making KNN parametric:

struct KNN{T,L}
    k::Int
    M::Matrix{T}
    Labels::L
end

Float16 is also probably going to be generally suboptimal for performance at the moment.

There are a lot of things you can already do to improve performance, e.g.

function predicts(model::KNN, test::Matrix)
    #println(size(model.M, 2), size(test, 2))
    size(model.M, 2) == size(test, 2) || throw("Dimensions do not match, test set must contain the same variables.")

    # For each row in test Matrix
        # compute distance with every member of the model's Matrix
        # sort based on distance
        # Look at classes and use mode to vote on majority
        # Assign prediction to Vector 
    # Repeat

    # Rotating for column-wise operations
    r1 = rotl90( model.M ) # Training data
    r2 =  rotl90( test ) # Test data

    nrows = size( r2, 2) # Number of test rows     <------ Incompatible?
    orows = size( r1, 2) # Number of train rows    <------ Incompatible?
    
    K = model.k

    # IDs
    id = collect(1:orows)
    #Prealloc distance array
    dist = Array{Float32}(undef, orows)
    prediction = Array{Float32}(undef, size(test, 1))
    Threads.@threads for i in 1:nrows

        curr_row = @view r2[:, i]

        # Compute distance between the current row (internally  a column) and the training set
        for j in 1:orows
            dist[j]= 1/distance(curr_row, @view(r1[:, j]); p = 2)^2
        end

        # Store and sort
        sp = @view(sortperm(dist)[1:K])
        # Vote
        prediction[i] = mode( model.Labels[ @view(id[sp]) ] )
    end

    prediction
end

should already help a lot.

If you want @turbo to work, you’ll need lower level code. E.g., replace the call to distance with the actual loops calculating distance, and probably move the @turbo to the for j in 1:orows loop.

nosewitz · September 8, 2022, 6:24pm

This halved the time by 2x, now down to about 36 seconds, thank you!

As for @turbo, placing it in front j in 1:orow now throws this error:
nested task error: UndefVarError: j not defined

I assume it’s got something to do with the @threads.

Updated code:

    Threads.@threads for i in 1:nrows

        curr_row = @view r2[:, i]

        # Compute distance between the current row (internally  a column) and the training set
        #j=0
        @turbo for j = 1:orows
            dist[j] = 1/sum(sqrt.( (curr_row .- @view(r1[:, j]) ).^2))^2
        end

        # Store and sort
        sp = @view(sortperm(dist, rev= model.weighted )[1:K])
        # Vote
        prediction[i] = mode( model.Labels[ @view(id[sp]) ] )
    end

Elrod · September 8, 2022, 6:30pm

I’d try something like

@turbo for j = 1:orows
    distj = zero(eltype(dist))
    for k = axes(r1,1)
        distj += (r2[k,i] - r1[k,j])^2
    end
    dist[j] = 1/distj
end

FWIW, it’d probably be faster if r1 and r2’s memory layouts were transposed.

nosewitz · September 8, 2022, 7:22pm

Incredible! The speed is about 20-24 seconds now. So as I understand it, when we use @turbo we have to write granular code and? Also by transposing r1 and r2 do you mean keeping them in their original form?

This is how the test matrix looks:

julia> X_test
20000×6 Matrix{Float64}:
 3295.0  4044.0  1072.0   27.0  103.0   42.0
 2959.0  1825.0  2980.0   89.0   42.0  323.0
 2934.0  2779.0  1236.0  176.0  349.0  594.0
 2808.0  1423.0  1734.0   80.0  147.0  180.0
 3164.0  1338.0  2271.0   32.0   57.0  350.0
 3085.0  2598.0  1712.0   90.0  299.0  324.0
 3358.0  1332.0  1655.0   32.0   98.0  175.0
    ⋮                                    ⋮
 2742.0  2493.0  1753.0   64.0   26.0  446.0
 2735.0  1578.0  1879.0   19.0   65.0  457.0
 3091.0  2239.0  2431.0   25.0  323.0  258.0
 3278.0  1931.0  2439.0   63.0  115.0  470.0
 3178.0  4577.0  2154.0   42.0   93.0  551.0
 3262.0   390.0   807.0   28.0  261.0  408.0

So then I rotate 90 degrees to the left to transpose the rows, which then become columns.

julia> X_test |> rotl90
6×20000 Matrix{Float64}:
   42.0   323.0   594.0   180.0   350.0   324.0   175.0   150.0  …   218.0   324.0   446.0   457.0   258.0   470.0   551.0   408.0      
  103.0    42.0   349.0   147.0    57.0   299.0    98.0   316.0      193.0   354.0    26.0    65.0   323.0   115.0    93.0   261.0
   27.0    89.0   176.0    80.0    32.0    90.0    32.0    39.0       37.0    43.0    64.0    19.0    25.0    63.0    42.0    28.0      
 1072.0  2980.0  1236.0  1734.0  2271.0  1712.0  1655.0   966.0     1410.0  1976.0  1753.0  1879.0  2431.0  2439.0  2154.0   807.0      
 4044.0  1825.0  2779.0  1423.0  1338.0  2598.0  1332.0  5486.0     3208.0  4094.0  2493.0  1578.0  2239.0  1931.0  4577.0   390.0
 3295.0  2959.0  2934.0  2808.0  3164.0  3085.0  3358.0  3307.0  …  3296.0  3112.0  2742.0  2735.0  3091.0  3278.0  3178.0  3262.0

Is this the wrong approach? I thought that Julia performs better when we iterate through columns rather than rows. Thus by rotating the rows we can access them faster now that they are columns.

I’m learning a lot from this : )

Topic		Replies	Views
Efficient use of @turbo for linear algebra operations (LoopVectorization.jl) Performance linearalgebra , loopvectorization	6	4013	August 21, 2021
@turbo on sets of operations Performance	15	1004	August 10, 2021
@turbo speeds routine, slows down everything else Performance loopvectorization	16	2610	June 5, 2021
@turbo macro gives incorrect results Performance loopvectorization	4	458	October 26, 2022
@turbo macro giving slightly different results General Usage loopvectorization	6	505	January 28, 2023

Utilizing @turbo / @tturbo in Performance Critical Code

Related topics