Perhaps of interest: I have a package, Tullio, which is a convenient way to use both LoopVectorization and threads:
julia> using Tullio, LoopVectorization
julia> function ML_tullio(x,k)
           @tullio F[i] := x[i]*x[j]/(1+x[i]*x[j]) * (i!=j) # sum over j
           @tullio F[i] += -k[i]
       end
julia> ML_tullio(x,k) ≈ ML_baseline(x,k)
true
julia> @btime ML_tullio($x, $k); # threads + avx
8.328 ms (1178 allocations: 122.70 KiB)
julia> @btime ML_avx($x, $k); # just @avx, above
43.627 ms (2 allocations: 78.20 KiB)
julia> @btime ML_threaded_bounds_noif($x, $k); # just threads, above
15.554 ms (65 allocations: 86.27 KiB)
julia> @btime ML_baseline($x, $k);
196.572 ms (2 allocations: 78.20 KiB)
(This should work on the GPU too, but may not be quicker at this size.)
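In case the index notation is unfamiliar: roughly, the two `@tullio` lines compute what the plain loops below do. This is only a sketch, not how the macro is implemented. The name `ML_loops_sketch` is made up for illustration, and it assumes `x` and `k` are ordinary 1-based vectors of the same length. There is no threading or SIMD here, so it says nothing about speed.

```julia
# Hypothetical helper, not from the thread: plain loops for what the two @tullio lines compute.
# Assumes x and k are ordinary 1-based vectors of equal length.
function ML_loops_sketch(x::AbstractVector, k::AbstractVector)
    F = zeros(float(promote_type(eltype(x), eltype(k))), length(x))
    for i in eachindex(x)
        acc = zero(eltype(F))
        for j in eachindex(x)
            i == j && continue       # the (i != j) factor drops the diagonal term
            xij = x[i] * x[j]
            acc += xij / (1 + xij)   # j appears only on the right-hand side, so it is summed
        end
        F[i] = acc - k[i]            # the second line: F[i] += -k[i]
    end
    return F
end
```

It should agree with `ML_tullio(x, k)` up to floating-point rounding, since the summation order may differ.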