SpecialFunctions.jl's besselk() is ~10X slower than MATLAB's

I find that SpecialFunctions.jl’s besselk() is ~10X slower than MATLAB’s.

In MATLAB,

>> N = 10^6; v = rand(N,1);
>> tic; for x = v; besselk(1, x); end; toc
Elapsed time is 0.017116 seconds.
>> tic; for x = v; besselk(1, x); end; toc
Elapsed time is 0.017542 seconds.

With SpecialFunctions.jl,

julia> VERSION
v"1.10.2"

(@v1.10) pkg> st SpecialFunctions
Status `~/.julia/environments/v1.10/Project.toml`
  [276daf66] SpecialFunctions v2.3.1

julia> using SpecialFunctions

julia> N = 10^6; v = rand(N);

julia> @time for x = v; SpecialFunctions.besselk(1, x); end;
  0.158126 seconds (5.00 M allocations: 91.545 MiB, 1.42% gc time)

julia> @time for x = v; SpecialFunctions.besselk(1, x); end;
  0.173226 seconds (5.00 M allocations: 91.545 MiB, 0.56% gc time)

BesselK.jl is faster than SpecialFunctions.jl, but it still does not beat MATLAB:

(@v1.10) pkg> st BesselK
Status `~/.julia/environments/v1.10/Project.toml`
  [432ab697] BesselK v0.5.6

julia> using BesselK

julia> @time for x = v; BesselK.besselk(1, x); end;
  0.066285 seconds (4.00 M allocations: 76.286 MiB, 1.60% gc time)

julia> @time for x = v; BesselK.besselk(1, x); end;
  0.065575 seconds (4.00 M allocations: 76.286 MiB, 1.62% gc time)

Is there a way to make besselk() in Julia as fast as or faster than in MATLAB?

Another related issue I found is that besselk() is slower than besselj() in Julia, whereas the two run at similar speeds in MATLAB.

In MATLAB,

>> tic; for x = v; besselj(1, x); end; toc
Elapsed time is 0.017452 seconds.
>> tic; for x = v; besselj(1, x); end; toc
Elapsed time is 0.018163 seconds.

so besselj() runs at a similar speed to besselk().

On the other hand, in Julia,

julia> @time for x = v; SpecialFunctions.besselj(1, x); end;
  0.042908 seconds (4.00 M allocations: 76.286 MiB, 2.06% gc time)

julia> @time for x = v; SpecialFunctions.besselj(1, x); end;
  0.044744 seconds (4.00 M allocations: 76.286 MiB, 1.15% gc time)

so besselj() is a lot faster than besselk(). (Despite this speedup relative to besselk(), SpecialFunctions.jl’s besselj() is still slower than MATLAB’s.)

Note that the loop in global scope adds a lot of overhead. Either broadcast, or use another function such as map/map! or foreach.
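
For example, the broadcast and map versions would look something like this (unlike foreach, both allocate a vector for the results):

julia> ks = SpecialFunctions.besselk.(1, v);   # broadcast, with the order fixed at 1

julia> ks = map(x -> SpecialFunctions.besselk(1, x), v);   # equivalent, using map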

julia> @time foreach(Base.Fix1(SpecialFunctions.besselk,1), v);
  0.523409 seconds (1.00 M allocations: 15.508 MiB, 3.36% compilation time)

julia> @time foreach(Base.Fix1(SpecialFunctions.besselk,1), v);
  0.502420 seconds (1000.00 k allocations: 15.259 MiB)

julia> @time foreach(Base.Fix1(BesselK.besselk,1), v);
  0.085844 seconds

julia> @time foreach(Base.Fix1(BesselK.besselk,1), v);
  0.077590 seconds

julia> @time foreach(Base.Fix1(Bessels.besselk,1), v);
  0.086257 seconds

julia> @time foreach(Base.Fix1(Bessels.besselk,1), v);
  0.077424 seconds

besselj:

julia> @time foreach(Base.Fix1(SpecialFunctions.besselj,1), v);
  0.040554 seconds

julia> @time foreach(Base.Fix1(Bessels.besselj,1), v);
  0.030290 seconds

Also notice that the README of BesselK.jl mentions Bessels.jl as the fastest option. Unless you need AD, you can switch.


@Elrod, my real code does not run the loop in global scope, but I still find that the Julia version is significantly slower than the MATLAB version. (I am porting MATLAB code to Julia; the Julia port is noticeably slower, and the bottleneck is besselk().)

Following @Elrod’s and @juliohm’s comments, I tested foreach and Bessels:

julia> N = 10^6; v = rand(N);

julia> @time foreach(Base.Fix1(Bessels.besselk,1), v)
  0.027744 seconds

julia> @time foreach(Base.Fix1(Bessels.besselk,1), v)
  0.027165 seconds

Using foreach with Bessels.jl eliminated the allocations, but Julia still takes ~50% more time than MATLAB on the same computer.

How many cores on your machine? Here is what I get with Matlab:

>> N = 10^6; v = rand(N,1);
>> tic; for x = v; besselk(1, x); end; toc
Elapsed time is 0.033320 seconds.
>> tic; for x = v; besselk(1, x); end; toc
Elapsed time is 0.033011 seconds.

But the besselk loop in Matlab is multi-threaded. So, checking and then setting the number of computational threads to 1 results in

>> N = maxNumCompThreads

N =

     8

>> maxNumCompThreads(1);
>> tic; for x = v; besselk(1, x); end; toc
Elapsed time is 0.195831 seconds.
>> tic; for x = v; besselk(1, x); end; toc
Elapsed time is 0.195697 seconds.

On the same machine, I get this with Julia:

julia> @time foreach(x -> SpecialFunctions.besselk(1,x), v)
  0.298474 seconds (1.02 M allocations: 16.731 MiB, 2.71% gc time, 3.84% compilation time)

julia> @time foreach(x -> SpecialFunctions.besselk(1,x), v)
  0.292399 seconds (1.02 M allocations: 16.731 MiB, 4.04% compilation time)

which is slower, but much closer to the single-threaded Matlab time.
With multithreading in Julia:

julia> function runit(v)
           Threads.@threads for x in v
               SpecialFunctions.besselk(1,x)
           end
           nothing
       end
runit (generic function with 1 method)

julia> @time runit(v)
  0.098745 seconds (1.04 M allocations: 18.102 MiB, 13.26% gc time, 175.39% compilation time)

julia> @time runit(v)
  0.041363 seconds (1.00 M allocations: 15.264 MiB)

julia> @time runit(v)
  0.042982 seconds (1.00 M allocations: 15.264 MiB)

And for BesselK:

julia> function runit2(v)
           Threads.@threads for x in v
               BesselK.besselk(1,x)
           end
           nothing
       end
runit2 (generic function with 1 method)

julia> @time runit2(v)
  0.048516 seconds (26.86 k allocations: 1.832 MiB, 334.81% compilation time)

julia> @time runit2(v)
  0.009439 seconds (44 allocations: 4.938 KiB)

julia> @time runit2(v)
  0.009018 seconds (45 allocations: 4.953 KiB) 

About 3 times as fast as multithreaded Matlab. Not bad!


It’s probably important to note that both SpecialFunctions.jl and MATLAB call the exact same Fortran code, so most likely the only differences are how the benchmark is set up, how the Fortran code is embedded within each language, and, as @PeterSimon pointed out, what resources are being used.

Bessels.jl and BesselK.jl use different implementations from the Fortran code, so I would expect some deviations. Bessels.jl has dedicated besselk0 and besselk1 routines, so if you are always using orders of 0 or 1, it is advisable to call those directly.

julia> @btime Bessels.besselk1(x) setup=(x=rand()*50)
  8.049 ns (0 allocations: 0 bytes)

julia> @btime Bessels.besselk(1, x) setup=(x=rand()*50)
  19.726 ns (0 allocations: 0 bytes)

julia> @btime SpecialFunctions.besselk(1, x) setup=(x=rand()*50)
  111.757 ns (1 allocation: 16 bytes)
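
Combining the dedicated besselk1 routine with the threaded pattern shown earlier in the thread, and actually storing the results, a rough sketch could look like this (the helper name and the output buffer are just for illustration):

julia> using Bessels

julia> function besselk1_threaded!(out, v)
           # fill `out` with K_1(x) for each x in `v`, splitting the range across threads
           Threads.@threads for i in eachindex(v, out)
               out[i] = Bessels.besselk1(v[i])
           end
           return out
       end;

julia> out = similar(v); besselk1_threaded!(out, v);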

Perhaps the general Bessels.jl besselk routine could be improved: it does eventually call besselk1 under the hood, but there are a lot of checks on top of that, which is why you can see the time difference. However, the routine may or may not keep using besselk1 in the future, as we hope to make it AD compatible. Those details aren’t fleshed out yet, though.

Without seeing the full code, I am not sure that besselk1 is really the bottleneck (8 ns per evaluation), unless it is the only function you’re calling in a hot loop.

If you are evaluating many times, I would make sure the arguments are sorted, as that helps branch prediction. Unfortunately, we do not have a SIMD routine for this yet, but we could easily make one if you know that all your arguments satisfy x > 1.0.
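
For example, one way to do that, assuming you can reorder the inputs and then undo the permutation on the results:

julia> p = sortperm(v);                  # evaluate in ascending order of x

julia> sorted_vals = Bessels.besselk1.(v[p]);

julia> vals = similar(sorted_vals); vals[p] = sorted_vals;   # scatter back to the original order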

Anyway, if you are able, you can factor out an exp(x) and call besselk1x. That saves an exp(x) evaluation on every call (~35% of the runtime), which you can add back at the end (with the assumption that the majority of your arguments are larger than 1).

julia> @btime Bessels.besselk1x(x) setup=(x=rand()*50)
  5.250 ns (0 allocations: 0 bytes)
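
To spell out the factorization: besselk1x is the exponentially scaled routine, i.e. (if I have the convention right) besselk1x(x) == exp(x) * besselk1(x), so the unscaled value can be recovered at the end:

julia> x = 3.7;

julia> Bessels.besselk1x(x) * exp(-x) ≈ Bessels.besselk1(x)   # should be true, up to floating point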