Extremely slow n-th root of float

I’m benchmarking an algorithm translated from Fortran, doing something similar to Expokit.expmv – evaluating the result of exponentiating a matrix and applying it to a vector, by expanding the exp into a polynomial series.
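For context, the core idea is to approximate exp(A)*v using only matrix-vector products. A minimal sketch of the idea (a plain truncated Taylor series – not the Leja-point scheme the actual code uses):

using LinearAlgebra

# Approximate exp(A)*v by the truncated series sum_k A^k v / k!,
# building each term from the previous one with a single mat-vec.
function expmv_taylor(A, v; nterms=30)
    result = copy(v)
    term = copy(v)
    for k in 1:nterms
        term = (A * term) / k   # term is now A^k v / k!
        result += term
    end
    return result
end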

It’s not performing as well as it should (by a factor of 5-10). Profiling shows that, strangely, 50% of the entire runtime is spent in a line that simply calculates the n-th root of a float, Δ^ex, where Δ is a float and ex = 1/n for an integer n.
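The offending line boils down to this micro-benchmark (Δ and n here are made-up stand-ins, not the actual values from the coefficient calculation):

using BenchmarkTools

Δ, n = 2.3, 7
ex = 1 / n
@btime $(Ref(Δ))[] ^ $(Ref(ex))[]   # the Refs prevent constant-folding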

The full benchmark script, profile_propagate.jl, is meant to be included in the test-REPL of QuantumPropagators.jl (make devrepl or just julia --project=test from a checkout of that repo).

The routine extend_leja! where this occurs is an almost 1-to-1 transcription of the original Fortran code. When I profile that Fortran code, the corresponding routine is completely negligible. The expected behavior is for this algorithm to be dominated by the matrix-vector products, not by some scalar operations for calculating the expansion coefficients. (There’s also a diagonalization of a small Hessenberg matrix that takes much longer in the Julia code than it should, but I’ll look at that once I’ve figured out what’s going on with the calculation of the coefficients.)

Any ideas?

Are you on Linux? ^ is known to be very slow on Linux in Julia 1.6 and 1.7, but it’s better in <= 1.5 and >= 1.8.

It doesn’t have this problem on Windows or Mac.
But the slowness is supposedly because it’s calling the system pow function, which would mean Fortran should be experiencing slow ^ as well…

Huh:

julia> const libdpow = "/home/chriselrod/Documents/progwork/fortran/libdpow.so"
"/home/chriselrod/Documents/progwork/fortran/libdpow.so"

julia> dpow(x,y) = @ccall libdpow.dpow(x::Float64, y::Float64)::Float64
dpow (generic function with 1 method)

julia> syspow(x,y) = ccall(:pow, Float64, (Float64,Float64), x, y)
syspow (generic function with 1 method)

julia> @btime $(Ref(2.3))[]^$(Ref(1.2))[]
  28.336 ns (0 allocations: 0 bytes)
2.716898432499149

julia> @btime dpow($(Ref(2.3))[],$(Ref(1.2))[])
  15.797 ns (0 allocations: 0 bytes)
2.716898432499149

julia> @btime syspow($(Ref(2.3))[],$(Ref(1.2))[])
  70.757 ns (0 allocations: 0 bytes)
2.716898432499149

julia> versioninfo()
Julia Version 1.9.0-DEV.229
Commit 7cde4be23d* (2022-03-22 02:44 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 28 × Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake-avx512)
  Threads: 28 on 28 virtual cores
module pow
use ISO_C_BINDING
  
implicit none

contains

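! C-callable wrapper around Fortran's intrinsic x**y, for @ccall from Julia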
real(C_double) function dpow(x, y) bind(C, name="dpow")
    real(C_double), value, intent(in) :: x, y
    dpow = x**y
    return
end function dpow

end module pow

I compiled with

gfortran -O3 -shared -fPIC pow.f90 -o libdpow.so

@Oscar_Smith
Interestingly, I didn’t even compile the Fortran with -march=native (or with -Ofast)!

Related:
https://github.com/JuliaLang/julia/pull/44717

If you’re curious, the Fortran assembly

dpow:
.LFB0:
	.cfi_startproc
	jmp	pow@PLT
	.cfi_endproc

I guess it is jumping to a different pow than we’re calling with ccall(:pow, ...) above…
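One way to check that hypothesis (a sketch using Libdl to compare symbol addresses; my assumption is that ccall(:pow, ...) resolves inside the Julia process, which bundles openlibm, while the Fortran’s pow@PLT goes to the system libm):

using Libdl

# Compare where `pow` lives in Julia's bundled openlibm vs the system libm
# (the libm path is machine-specific):
openlibm_pow = dlsym(dlopen("libopenlibm"), :pow)
system_pow   = dlsym(dlopen("/usr/lib64/libm.so.6"), :pow)
openlibm_pow == system_pow   # `false` would mean two different implementations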

Also, compiling with ifort -fast instead, I get

julia> @btime dpow($(Ref(2.3))[],$(Ref(1.2))[])
  12.671 ns (0 allocations: 0 bytes)
2.716898432499149

But presumably this function is less accurate.


Can you test the accuracy of this? I’m curious how good it is.
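Something along these lines should do it – a sketch that measures the error in ULPs against a BigFloat reference, using the dpow wrapper from above on arbitrary random test points:

using Printf

# Error of dpow(x, y) in units in the last place (ULPs),
# relative to a correctly-rounded BigFloat reference.
function ulp_error(x, y)
    ref = Float64(big(x)^big(y))
    return abs(dpow(x, y) - ref) / eps(ref)
end

errs = [ulp_error(10rand(), 5rand()) for _ in 1:10^5]
@printf "max error: %.2f ULPs\n" maximum(errs)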

I’ll get around to it, working on something else at the moment.
But if you’re curious, you can install the Intel compilers through apt:
https://www.intel.com/content/www/us/en/develop/documentation/installation-guide-for-intel-oneapi-toolkits-linux/top/installation/install-using-package-managers/apt.html
They’re all free as in beer now (but not as in freedom).


Yes, I’m on Linux (Ubuntu), currently using Julia 1.7.2, and I’m comparing with ifort-compiled Fortran code (with -O3).

I just tried running it with 1.8.0-beta1, and I can confirm that ^ seems to be about twice as fast as on 1.7.2, which matches the comment in https://github.com/JuliaLang/julia/pull/44717#issue-1178579501. This is from eyeballing the ProfileSVG output; Pprof and StatProfilerHTML, which would give more quantitative info, both still crash on 1.8.

I also tried it on my Macbook (with 1.7), and it looked like the relative amount of time spent in ^ was indeed also shorter there, although overall performance was abysmal. I’m not sure I can trust the profiler there: it shows the majority of the time spent in top-level functions with names like .L10 and .L999 that I can’t make sense of. My guess is that this has to do with Rosetta (I haven’t had time yet to really look into the M1 support in 1.7). It doesn’t matter that much, since anything serious is going to run on the Linux workstation.

The overall algorithm is still much slower than it should be, but now the Hessenberg diagonalization is the slowest part, so I’ll have a look at that next. Also, I might have to be a bit more thorough with my Fortran benchmarking, maybe try gfortran instead of ifort. For one of the other methods in QuantumPropagators that’s far less involved (but limited to Hermitian matrices), I got the Julia code to be exactly as fast as the Fortran compiled with gfortran (which was a factor of two slower than ifort). I was hoping to get something similar here.

I’ll keep twiddling the code, and maybe also see how some of the existing Julia solutions like ExpoKit perform. (I’m just implementing my old tried-and-tested methods first, to have something to fall back on where I know exactly what it’s doing.)

https://github.com/JuliaLang/julia/pull/44717#issue-1178579501 isn’t in 1.8. That will be a future speedup in 1.9. The part that did go into 1.8 is https://github.com/JuliaLang/julia/pull/42271


Ah! Should I try the Nightly?

Note that the PR isn’t merged yet; to see the change, you would have to build the PR from source. I’m also planning on making a number of changes (that hopefully will have little to no performance impact for this) before it merges. And in the next few weeks I should have a PR that speeds up exp by 1-2 ns, which will also help this.


Also…

julia> const libm6 = "/usr/lib64/libm.so.6"
"/usr/lib64/libm.so.6"

julia> syspow(x,y) = @ccall libm6.pow(x::Float64, y::Float64)::Float64
syspow (generic function with 1 method)

julia> @btime syspow($(Ref(2.3))[],$(Ref(1.2))[])
  15.675 ns (0 allocations: 0 bytes)
2.716898432499149

So, actually, system libm is fast on Linux.

If it isn’t the system libm, which pow is Julia actually calling on 1.6 and 1.7, and why is it different on Windows, Mac, and Linux???

Wait, that’s really weird. Are 1.6 and 1.7 calling an LLVM specific pow somehow?
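One way to check: @code_llvm on the Float64 method should show whether it lowers to the llvm.pow intrinsic, in which case the actual pow is whatever LLVM and the JIT resolve the intrinsic to at codegen time – which could plausibly differ between platforms:

julia> f(x, y) = x^y;

julia> @code_llvm debuginfo=:none f(2.3, 1.2)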

Apparently, but why would that be different on each OS?