I guess I'm resurrecting this topic to see if the Julia community can squeeze more performance out of this kernel algorithm. I've written the equivalent algorithm in D and its timings are dramatically better than Julia's.
To recap, I'm computing kernel matrices; for example, here are some kernel types:
using Base.Threads: @threads, @spawn
using Random: shuffle!

# Kernel Methods
#===============#
abstract type AbstractKernel end

struct DotProduct <: AbstractKernel end
function kernel(K::DotProduct, x, y)
    ret::eltype(x) = 0
    @inbounds @simd for i in 1:length(x)
        ret += x[i] * y[i]
    end
    return ret
end

struct Gaussian{T} <: AbstractKernel
    gamma::T
end
function kernel(K::Gaussian, x, y)
    ret::eltype(x) = 0
    @inbounds @simd for i in 1:length(x)
        temp = x[i] - y[i]
        ret -= temp * temp
    end
    return exp(K.gamma * ret)
end
Here is the function I'm currently using to build the kernel matrix:
function calculateKernelMatrix(K::AbstractKernel, data::Array{T}) where {T <: AbstractFloat}
    n = size(data)[2]
    mat::Array{T, 2} = zeros(T, n, n)
    @threads for j in 1:n
        @inbounds for i in j:n
            mat[i, j] = kernel(K, data[:, i], data[:, j])
            mat[j, i] = mat[i, j]
        end
    end
    return mat
end
The timings:
x = rand(Float32, (784, 10_000));
calculateKernelMatrix(DotProduct(), x[:, 1:10]); # for compilation
calculateKernelMatrix(Gaussian(Float32(1.0)), x[:, 1:10]); # for compilation
@time mat1 = calculateKernelMatrix(DotProduct(), x);
# 38.315531 seconds (100.01 M allocations: 304.387 GiB, 23.55% gc time)
@time mat2 = calculateKernelMatrix(Gaussian(Float32(1.0)), x);
# 38.029511 seconds (100.01 M allocations: 304.387 GiB, 24.31% gc time)
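For what it's worth, I suspect (though I haven't verified it) that most of those 100 M allocations come from the column slices data[:, i] and data[:, j], which copy a 784-element vector on every kernel call. Below is a rough sketch of a view-based variant; calculateKernelMatrixViews is just an illustrative name, and the timings above are from the original version, not this one:

function calculateKernelMatrixViews(K::AbstractKernel, data::Array{T}) where {T <: AbstractFloat}
    n = size(data, 2)
    mat = zeros(T, n, n)
    @threads for j in 1:n
        @inbounds for i in j:n
            # pass views so the kernel sees each column without copying it
            mat[i, j] = kernel(K, view(data, :, i), view(data, :, j))
            mat[j, i] = mat[i, j]
        end
    end
    return mat
end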
D does the same computations in about 1.5 - 1.6 seconds, so I wonder whether Julia can do any better. I also tried using LoopVectorization's @avx, but it didn't make much difference; maybe I wasn't using it correctly (a sketch of roughly what I tried is below). I also tried running Julia with julia -O3 --check-bounds=no, but that made no difference either.
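In case it helps anyone spot my mistake, this is roughly how I applied @avx to the inner loop of the dot-product kernel (a sketch only; kernel_avx is just an illustrative name, and it needs the LoopVectorization package installed):

using LoopVectorization: @avx

function kernel_avx(K::DotProduct, x, y)
    ret = zero(eltype(x))
    @avx for i in 1:length(x)   # @avx in place of @inbounds @simd
        ret += x[i] * y[i]
    end
    return ret
end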
Many Thanks
P.S. Chapel does it in ~9 seconds.