# Utilizing @turbo / @tturbo in Performance-Critical Code

I am writing a KNN prediction algorithm from scratch and wanted to see if I can squeeze even more performance out of it with LoopVectorization’s @turbo macro. However, when trying to compile the method, it seems that initializing the number of rows to iterate through using `size(::Matrix, 2)` does not mesh with the macro. Below is the code:

```julia
# Simple struct
struct KNN
    k::Int
    M::Matrix
    Labels
end

# Prealloc
const prediction = Array{Float16}(undef, size(test_set, 1))

function predicts(model::KNN, test::Matrix)
    #println(size(model.M, 2), size(test, 2))
    size(model.M, 2) == size(test, 2) || throw("Dimensions do not match, test set must contain the same variables.")

    # For each row in test Matrix:
    #   compute distance with every member of the model's Matrix
    #   sort based on distance
    #   look at classes and use mode to vote on majority
    #   assign prediction to Vector
    #   repeat

    # Rotating for column-wise operations
    r1 = rotl90(model.M) # Training data
    r2 = rotl90(test)    # Test data

    nrows = size(r2, 2) # Number of test rows     <------ Incompatible?
    orows = size(r1, 2) # Number of train rows    <------ Incompatible?

    K = model.k

    # IDs
    id = collect(1:orows)
    # Prealloc distance array
    dist = Array{Float32}(undef, orows)

    @turbo for i in 1:nrows
        curr_row = r2[:, i]

        # Compute distance between the current row (internally a column) and the training set
        for j in 1:orows
            dist[j] = 1/distance(curr_row, r1[:, j]; p = 2)^2
        end

        # Store and sort
        df = DataFrame(distances = dist, id = id)
        sort!(df, :distances, rev = true)[1:K, :]
        # Vote
        prediction[i] = mode(model.Labels[df.id])
    end

    prediction
end
```

Here is the error message, which points at the function definition for `predicts(model::KNN, test::Matrix)`:

```
(Expr(:parameters, :((Expr(:kw, :p, 2)))))
in expression starting at c:\Users\aledo\Desktop\Julia_Files\dataf.jl:92
```

Line 92 is the outer loop, `for i in 1:nrows`.

It’s not 100% critical that it works with `@turbo`, as I am already getting good performance using `Threads.@threads` or `@fastmath`.

I’d suggest making `KNN` parametric:

```julia
struct KNN{T,L}
    k::Int
    M::Matrix{T}
    Labels::L
end
```

`Float16` is also probably going to be generally suboptimal for performance at the moment.
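For instance, a quick check that the fields then get concrete types (the `k`, sizes, and data here are all made up):

```julia
# Parametric struct: the compiler knows the concrete element type T
# and the label container type L, so field access is type-stable.
struct KNN{T,L}
    k::Int
    M::Matrix{T}
    Labels::L
end

# Float32 instead of Float16: Float16 arithmetic is largely emulated
# in software on current CPUs, so it is usually slower.
model = KNN(5, rand(Float32, 100, 6), rand(1:3, 100))
```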

There are a lot of things you can already do to improve performance, e.g.

```julia
function predicts(model::KNN, test::Matrix)
    #println(size(model.M, 2), size(test, 2))
    size(model.M, 2) == size(test, 2) || throw("Dimensions do not match, test set must contain the same variables.")

    # Rotating for column-wise operations
    r1 = rotl90(model.M) # Training data
    r2 = rotl90(test)    # Test data

    nrows = size(r2, 2) # Number of test rows
    orows = size(r1, 2) # Number of train rows

    K = model.k

    # IDs
    id = collect(1:orows)
    # Prealloc distance array
    dist = Array{Float32}(undef, orows)
    prediction = Array{Float32}(undef, size(test, 1))

    for i in 1:nrows
        curr_row = @view r2[:, i]

        # Compute distance between the current row (internally a column) and the training set
        for j in 1:orows
            dist[j] = 1/distance(curr_row, @view(r1[:, j]); p = 2)^2
        end

        # Store and sort
        sp = @view(sortperm(dist)[1:K])
        # Vote
        prediction[i] = mode(model.Labels[@view(id[sp])])
    end

    prediction
end
```

If you want `@turbo` to work, you’ll need lower-level code. E.g., replace the call to `distance` with the actual loops calculating the distance, and probably move the `@turbo` to the `for j in 1:orows` loop.


This halved the time; it's now down to about 36 seconds, thank you!

As for `@turbo`, placing it in front of `for j in 1:orows` now throws this error:
`nested task error: UndefVarError: j not defined`

I assume it’s got something to do with `Threads.@threads`.

Updated code:

```julia
Threads.@threads for i in 1:nrows
    curr_row = @view r2[:, i]

    # Compute distance between the current row (internally a column) and the training set
    #j = 0
    @turbo for j = 1:orows
        dist[j] = 1/sum(sqrt.((curr_row .- @view(r1[:, j])).^2))^2
    end

    # Store and sort
    sp = @view(sortperm(dist, rev = model.weighted)[1:K])
    # Vote
    prediction[i] = mode(model.Labels[@view(id[sp])])
end
```

I’d try something like

```julia
@turbo for j = 1:orows
    distj = zero(eltype(dist))
    for k = axes(r1, 1)
        distj += (r2[k, i] - r1[k, j])^2
    end
    dist[j] = 1/distj
end
```

FWIW, it’d probably be faster if `r1` and `r2`’s memory layouts were transposed.
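Roughly what I mean, sketched with plain loops and made-up sizes (prepend `@turbo` to the `j` loop once LoopVectorization is loaded): with samples along dimension 1, `train[j, k]` is contiguous in `j`, which is exactly the index the vectorizer sweeps.

```julia
# Hypothetical sizes; samples along dim 1, features along dim 2,
# i.e. the transpose of the r1/r2 layout above.
train = rand(Float32, 1000, 6)   # orows × features
test  = rand(Float32, 200, 6)    # nrows × features

orows = size(train, 1)
dist  = Vector{Float32}(undef, orows)

i = 1  # current test row
# train[j, k] is contiguous in j (Julia arrays are column-major),
# so the j loop reads sequential memory. Plain `for` shown here;
# replace with `@turbo for` after `using LoopVectorization`.
for j in 1:orows
    distj = zero(eltype(dist))
    for k in axes(train, 2)
        distj += (test[i, k] - train[j, k])^2
    end
    dist[j] = 1 / distj
end
```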


Incredible! The speed is about 20-24 seconds now. So, as I understand it, when we use `@turbo` we have to write more granular code? Also, by transposing `r1` and `r2`, do you mean keeping them in their original form?

This is how the test matrix looks:

```
julia> X_test
20000×6 Matrix{Float64}:
3295.0  4044.0  1072.0   27.0  103.0   42.0
2959.0  1825.0  2980.0   89.0   42.0  323.0
2934.0  2779.0  1236.0  176.0  349.0  594.0
2808.0  1423.0  1734.0   80.0  147.0  180.0
3164.0  1338.0  2271.0   32.0   57.0  350.0
3085.0  2598.0  1712.0   90.0  299.0  324.0
3358.0  1332.0  1655.0   32.0   98.0  175.0
⋮                                    ⋮
2742.0  2493.0  1753.0   64.0   26.0  446.0
2735.0  1578.0  1879.0   19.0   65.0  457.0
3091.0  2239.0  2431.0   25.0  323.0  258.0
3278.0  1931.0  2439.0   63.0  115.0  470.0
3178.0  4577.0  2154.0   42.0   93.0  551.0
3262.0   390.0   807.0   28.0  261.0  408.0
```

So then I rotate 90 degrees to the left, and the rows become columns.

```
julia> X_test |> rotl90
6×20000 Matrix{Float64}:
42.0   323.0   594.0   180.0   350.0   324.0   175.0   150.0  …   218.0   324.0   446.0   457.0   258.0   470.0   551.0   408.0
103.0    42.0   349.0   147.0    57.0   299.0    98.0   316.0      193.0   354.0    26.0    65.0   323.0   115.0    93.0   261.0
27.0    89.0   176.0    80.0    32.0    90.0    32.0    39.0       37.0    43.0    64.0    19.0    25.0    63.0    42.0    28.0
1072.0  2980.0  1236.0  1734.0  2271.0  1712.0  1655.0   966.0     1410.0  1976.0  1753.0  1879.0  2431.0  2439.0  2154.0   807.0
4044.0  1825.0  2779.0  1423.0  1338.0  2598.0  1332.0  5486.0     3208.0  4094.0  2493.0  1578.0  2239.0  1931.0  4577.0   390.0
3295.0  2959.0  2934.0  2808.0  3164.0  3085.0  3358.0  3307.0  …  3296.0  3112.0  2742.0  2735.0  3091.0  3278.0  3178.0  3262.0
```

Is this the wrong approach? I thought Julia performs better when we iterate through columns rather than rows, so by rotating, the rows become columns that we can access faster.
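While checking this in the REPL, I noticed `rotl90` isn't exactly a transpose: it also reverses the order of the features in each column (you can see that in the output above). That doesn't change a Euclidean distance, but `permutedims` gives the plain transpose (tiny made-up matrix):

```julia
A = [1 2 3;
     4 5 6]

# rotl90 turns rows into columns but reverses their order:
rotl90(A)       # [3 6; 2 5; 1 4]

# permutedims is the plain transpose:
permutedims(A)  # [1 4; 2 5; 3 6]
```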

I’m learning a lot from this : )