Serious slowdown with FFTW v1.1.0 and v1.2.0 on Windows

I have just upgraded from Julia 0.7.0 to 1.0.5 and 1.3.1, and experienced a serious slowdown.
The cause appears to be the FFTW version that each Julia release installs by default:
FFTW v0.2.4 for Julia 0.7.0
FFTW v1.1.0 for Julia 1.0.5
FFTW v1.2.0 for Julia 1.3.1

If I revert to FFTW v0.2.4 on Julia 1.0.5/1.3.1, the speed is restored.
I have written a minimal example that reproduces the >5x slowdown (see below).
I have tested it on both 64-bit Windows 7 and Windows 10, with the same results.

I am stuck, so I would greatly appreciate any help!

using FFTW, LinearAlgebra

function myfft(plan,x,y)
    mul!(y,plan,x)
    return
end

siz = (25,36,45)
x = zeros(ComplexF32,siz)
y = zeros(ComplexF32,siz)
plan = plan_fft(x, flags=FFTW.MEASURE)

println("with function call:")
@time for i=1:5000 myfft(plan,x,y) end
@time for i=1:5000 myfft(plan,x,y) end

println("without function call:")
@time for i=1:5000 mul!(y,plan,x) end
@time for i=1:5000 mul!(y,plan,x) end

Are you using threads? See: slowdown in threaded code from julia 1.2 to julia 1.4-DEV · Issue #121 · JuliaMath/FFTW.jl · GitHub

No, I am not using threads or anything fancy.
The above code is really that minimal, and behaves as reported.

Today I had access to an old Ubuntu Linux box.
On Julia 1.3.1 I saw no slowdown when switching from FFTW v0.2.4 to v1.2.0.

So it seems to be a Windows-specific problem.
I have changed the title accordingly.

If you look at the plan, is it significantly different? I wonder if the Windows build is not using AVX or something?

You are absolutely right, the plans differ indeed.
The two cases:

  1. Windows 7 64-bit | Julia 0.7.0 | FFTW v0.2.4 | ~1.7 seconds
FFTW forward plan for 25×36×45 array of Complex{Float32}
(dft-rank>=2/1                                          
  (dft-vrank>=1-x45/1                                   
    (dft-rank>=2/1                                      
      (dft-direct-25-x36 "n1fv_25_avx2")                
      (dft-buffered-36-x5/25-2                          
        (dft-vrank>=1-x5/1                              
          (dft-ct-dit/9                                 
            (dftw-direct-9/64 "t1fv_9_avx2")            
            (dft-direct-4-x9 "n1fv_4_sse2")))           
        (dft-r2hc-1                                     
          (rdft-rank0-iter-co/2-x5-x36))                
        (dft-nop))))                                    
  (dft-buffered-45-x225/900-1                           
    (dft-vrank>=1-x225/1                                
      (dft-ct-dit/3                                     
        (dftw-direct-3/8 "t1fuv_3_avx2_128")            
        (dft-direct-15-x3 "n1fv_15_avx2_128")))         
    (dft-r2hc-1                                         
      (rdft-rank0-iter-co/2-x225-x45))                  
    (dft-nop)))                                         
  2. Windows 7 64-bit | Julia 1.3.1 | FFTW v1.2.0 | ~9.5 seconds
FFTW forward plan for 25×36×45 array of Complex{Float32}
(dft-rank>=2/1                                          
  (dft-vrank>=1-x45/1                                   
    (dft-rank>=2/1                                      
      (dft-direct-25-x36 "n1_25")                       
      (dft-buffered-36-x5/25-2                          
        (dft-vrank>=1-x5/1                              
          (dft-ct-dit/3                                 
            (dftw-direct-3/4 "t1_3")                    
            (dft-direct-12-x3 "n1_12")))                
        (dft-r2hc-1                                     
          (rdft-rank0-iter-co/2-x5-x36))                
        (dft-nop))))                                    
  (dft-buffered-45-x225/900-1                           
    (dft-vrank>=1-x225/1                                
      (dft-ct-dit/3                                     
        (dftw-direct-3/4 "t1_3")                        
        (dft-direct-15-x3 "n1_15")))                    
    (dft-r2hc-1                                         
      (rdft-rank0-iter-co/2-x225-x45))                  
    (dft-nop)))

This is not using AVX (or SSE), so it is not surprising that it is a lot slower. You can get the configuration options by looking at the fftw_version C global:

julia> using FFTW

julia> unsafe_string(cglobal((:fftw_version,FFTW.libfftw3), UInt8))
"fftw-3.3.9-sse2-avx2-avx2_128"

(which indicates that, on my machine, FFTW was configured with SSE2 and AVX2.)
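
To read those configuration flags programmatically, a minimal sketch is below. It assumes the string always has the shape `fftw-<version>-<flag>-<flag>...` seen above; the helper name `parse_fftw_version` is hypothetical and not part of FFTW.jl.

```julia
# Split an FFTW version string into its version number and the SIMD flags
# it was configured with. Assumes the "fftw-<version>-<flags...>" layout
# shown in this thread; other layouts are not handled.
function parse_fftw_version(s::AbstractString)
    parts = split(s, '-')
    length(parts) >= 2 || throw(ArgumentError("unexpected version string: $s"))
    return (version = String(parts[2]), simd = String.(parts[3:end]))
end

info = parse_fftw_version("fftw-3.3.9-sse2-avx2-avx2_128")
println(info.version)   # 3.3.9
println(info.simd)
```

Note that this only reports how the library was configured, not which codelets the planner actually selects.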

Yes, I realize the difference.
However, my “slow” setup of Julia 1.3.1 + FFTW v1.2.0 also returns the info:

julia> using FFTW

julia> unsafe_string(cglobal((:fftw_version,FFTW.libfftw3), UInt8))
"fftw-3.3.9-sse2-avx2-avx2_128"

So I am still badly stuck.
It is also strange that nobody else complained about the precompiled binaries on Windows.
So I beg all Windows users here to test this minimal code for the appropriate AVX plan:

using FFTW

siz = (25,36,45)
x = zeros(ComplexF32,siz)
plan = plan_fft(x, flags=FFTW.MEASURE)

println(plan)
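
Rather than reading the printed plan by eye, one could scan it for SIMD codelet names. The sketch below is only a heuristic based on the plans quoted in this thread, where vectorized codelets carry `sse2` or `avx` in their quoted names (e.g. "n1fv_25_avx2") and scalar ones do not (e.g. "n1_25"). The helpers `codelets` and `uses_simd` are hypothetical names, not FFTW.jl functions, and the two sample strings are fragments copied from the plans above.

```julia
# Extract the quoted codelet names from the text of a printed FFTW plan.
codelets(plan_text::AbstractString) =
    [String(m.captures[1]) for m in eachmatch(r"\"([^\"]+)\"", plan_text)]

# Heuristic: treat a plan as vectorized if any codelet name mentions sse2 or avx.
uses_simd(plan_text::AbstractString) =
    any(name -> occursin(r"sse2|avx", name), codelets(plan_text))

# Fragments of the slow and fast plans quoted in this thread.
slow = """
(dft-ct-dit/3
  (dftw-direct-3/4 "t1_3")
  (dft-direct-15-x3 "n1_15"))
"""
fast = """
(dft-ct-dit/3
  (dftw-direct-3/8 "t1fuv_3_avx2_128")
  (dft-direct-15-x3 "n1fv_15_avx2_128"))
"""

println(uses_simd(slow))   # false
println(uses_simd(fast))   # true
```

With a real plan, the text could presumably be obtained with `sprint(show, plan)`, since printing the plan is what produces the listings above.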

I would like to test the MWE myself; however, printing the plan, whether with display or with println, shows nothing beyond the header line:

julia> println(plan)
FFTW forward plan for 25×36×45 array of Complex{Float32}

julia> display(plan)
FFTW forward plan for 25×36×45 array of Complex{Float32}

Sounds like you are using MKL, not FFTW.


My work computer runs Windows, so I can confirm

julia> plan = plan_fft(x, flags=FFTW.MEASURE)
FFTW forward plan for 25×36×45 array of Complex{Float32}
(dft-rank>=2/1
  (dft-vrank>=1-x45/1
    (dft-rank>=2/1
      (dft-direct-25-x36 "n1_25")
      (dft-buffered-36-x25/25-2
        (dft-vrank>=1-x25/1
          (dft-ct-dit/3
            (dftw-direct-3/4 "t1_3")
            (dft-direct-12-x3 "n1_12")))
        (dft-r2hc-1
          (rdft-rank0-iter-ci/2-x25-x36))
        (dft-nop))))
  (dft-buffered-45-x225/900-1
    (dft-vrank>=1-x225/1
      (dft-ct-dit/3
        (dftw-direct-3/4 "t1_3")
        (dft-direct-15-x3 "n1_15")))
    (dft-r2hc-1
      (rdft-rank0-iter-ci/2-x225-x45))
    (dft-nop)))

julia> @benchmark mul!($y, $plan, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.909 ms (0.00% GC)
  median time:      1.957 ms (0.00% GC)
  mean time:        2.270 ms (0.00% GC)
  maximum time:     54.895 ms (0.00% GC)
  --------------
  samples:          2193
  evals/sample:     1

It works within WSL:

julia> plan = plan_fft(x, flags=FFTW.MEASURE)
FFTW forward plan for 25×36×45 array of Complex{Float32}
(dft-rank>=2/1
  (dft-vrank>=1-x45/1
    (dft-rank>=2/1
      (dft-direct-25-x36 "n1fv_25_avx2")
      (dft-buffered-36-x25/25-2
        (dft-vrank>=1-x25/1
          (dft-ct-dit/9
            (dftw-direct-9/64 "t1fv_9_avx2")
            (dft-direct-4-x9 "n1fv_4_sse2")))
        (dft-r2hc-1
          (rdft-rank0-iter-ci/2-x25-x36))
        (dft-nop))))
  (dft-buffered-45-x225/900-1
    (dft-vrank>=1-x225/1
      (dft-ct-dit/3
        (dftw-direct-3/8 "t1fuv_3_avx2_128")
        (dft-direct-15-x3 "n1fv_15_avx2_128")))
    (dft-r2hc-1
      (rdft-rank0-iter-co/2-x225-x45))
    (dft-nop)))

julia> @benchmark mul!($y, $plan, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     278.800 μs (0.00% GC)
  median time:      285.100 μs (0.00% GC)
  mean time:        295.158 μs (0.00% GC)
  maximum time:     835.200 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The plan shows AVX, and it benchmarks at 278.8 microseconds versus 1.9 milliseconds (about 7x faster in WSL than Cygwin).


I can confirm that while the version string reports FFTW being configured with AVX and SSE2, the plan does not include those instructions, on FFTW.jl v1.1.0 & Julia 1.1.1.

The last fast version is v0.3. As soon as the PARTR thread work was merged in with v1.0.0, things got slower (~6x).

I have Julia 1.1.0 on Windows 10. I get

julia> unsafe_string(cglobal((:fftw_version,FFTW.libfftw3), UInt8))
"fftw-3.3.9-sse2-avx2-avx2_128"

but the plan is identical to @turtle’s slow one.

Dear @stevengj, @Elrod, @jebej, @mbaz and @jonas-kr,
Many thanks for all your precious time, feedback and help!

Now it is clear that this is a general problem of the compiled Windows binary,
and at present the user has no other choice than reverting to a previous version.
So with all due respect I am asking the core developers to fix this problem somehow.
Thank you very much in advance.
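
For anyone who needs the workaround in the meantime, reverting as described above could look roughly like the sketch below. It uses the standard `Pkg.add` and `Pkg.pin` calls; the helper names `fftw_spec` and `revert_fftw` are hypothetical, and whether v0.2.4 resolves in a given environment depends on the other packages installed there.

```julia
using Pkg

# Build the package spec for one specific FFTW.jl release.
fftw_spec(version::AbstractString) = Pkg.PackageSpec(name = "FFTW", version = version)

# Install that release and pin it so that later updates leave it in place.
# Assumption: "0.2.4" is the fast version reported at the top of this thread.
function revert_fftw(version::AbstractString = "0.2.4")
    Pkg.add(fftw_spec(version))
    Pkg.pin("FFTW")
end

# revert_fftw()   # uncomment to apply to the active environment
```

Once a fixed binary is available, `Pkg.free("FFTW")` should release the pin again.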

What is the result of the following?

ccall((:fftw_have_simd_sse2, FFTW.libfftw3), Cint, ())

julia> using FFTW

julia> ccall((:fftw_have_simd_sse2, FFTW.libfftw3), Cint, ())
ERROR: ccall: could not find function fftw_have_simd_sse2 in library libfftw3-3.dll
Stacktrace:
 [1] top-level scope at .\REPL[2]:100:

But the result is the same for both Julia 1.3.1 + FFTW v1.2.0 and Julia 0.7.0 + FFTW v0.2.4.

Oh, grr, that’s because it’s not explicitly marked as a DLL export on Windows, so even if it is there you can’t call it. (It might still be working fine internally to FFTW, but there is no way to check that from Julia.)
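
To check whether a symbol is exported before calling it, instead of hitting the ccall error, something like the following sketch with the Libdl standard library might help. The helper name `has_symbol` is hypothetical. As noted above, a `false` result only means the symbol is not exported from the DLL, not that the feature is missing inside FFTW.

```julia
using Libdl

# Return true if `sym` can be looked up in the shared library `lib`.
# Returns false when the library cannot be loaded or the symbol is not exported.
function has_symbol(lib::AbstractString, sym::Symbol)
    handle = Libdl.dlopen(lib; throw_error = false)
    handle === nothing && return false
    found = Libdl.dlsym(handle, sym; throw_error = false) !== nothing
    Libdl.dlclose(handle)
    return found
end

# Usage, assuming FFTW.libfftw3 is the library path string used in this thread:
# using FFTW
# has_symbol(FFTW.libfftw3, :fftw_version)
# has_symbol(FFTW.libfftw3, :fftw_have_simd_sse2)
```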

For further discussion, see: FFTW windows build is not using SIMD · Issue #534 · JuliaPackaging/Yggdrasil · GitHub

Is there anyone willing to test the binary at https://github.com/JuliaPackaging/Yggdrasil/pull/536#issuecomment-588548546?


Good news:

I downloaded the suggested file, unzipped it, and replaced all directories in
C:\Users\MYNAME\.julia\artifacts\2193a89e52669d43b28c47b83e738b73d6ed7a50
i.e. bin, include, lib, logs and share (share was not there originally).

And magic happened:
Julia 1.3.1 + FFTW v1.2.0 got fast again with the plan containing SSE2 + AVX instructions!
