This is a continuation of a discussion in issues 159 and 166 of LinearSolve.jl.
I have two computers with similar 8-core i7-9700 CPUs, one running Windows and the other Manjaro Linux. I ran the benchmark script (after implementing this PR), which measures the speed of LU factorizations of a Float64 square matrix whose order ranges over 4:8:500, using 8 threads. The OpenBLAS results under Windows are dramatically worse than under Linux. Plots are shown below, where the blue trace is the OpenBLAS result. First the Linux result:
And now the Windows result:
On Linux, OpenBLAS computes the factorization of a matrix of order 500 at a rate of about 110 GFLOPS, but on Windows it tops out at less than 40 GFLOPS.
My questions: Is this an expected result? Why is the performance of OpenBLAS, the default library used for Julia BLAS, so much worse on Windows than on Linux?
Note: more plots, for MKL and for runs using only a single thread, can be found in the issues linked above.
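For anyone who wants to reproduce this kind of measurement without the full benchmark script, here is a minimal sketch. It is not the LinearSolve.jl script: the helper names lu_flops and lu_gflops are hypothetical, and the flop count uses only the leading-order term 2n^3/3, which is an assumption about how the rate is computed.

```julia
using LinearAlgebra

# Leading-order flop count for an LU factorization of an n×n matrix (assumption: 2n^3/3).
lu_flops(n) = 2n^3 / 3

# Hypothetical helper: time `reps` in-place factorizations of copies of A
# and convert the average rate to GFLOPS.
function lu_gflops(n; reps = 20)
    A = rand(Float64, n, n)
    lu(A)                                  # warm-up so compilation is not timed
    copies = [copy(A) for _ in 1:reps]     # lu! overwrites its argument
    t = @elapsed begin
        for B in copies
            lu!(B)
        end
    end
    return lu_flops(n) * reps / t / 1e9
end

BLAS.set_num_threads(8)                    # matches the 8-thread runs described above
for n in 4:8:500
    println(n, "  ", round(lu_gflops(n); digits = 2), " GFLOPS")
end
```

Timings from @elapsed are noisy for the smallest orders, so a package such as BenchmarkTools would likely give steadier numbers.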
Based on another thread, I see you are using Julia 1.7.3:
Julia Version 1.7.3
As far as I can see, Julia 1.7.3 has
And check the OpenBLAS Changelog.txt:
Version 0.3.19: 19-Dec-2021
- fixed missing thread initialization for static builds on Windows/MSVC
Version 0.3.14: 17-Mar-2021
- Fixed compilation for DYNAMIC_ARCH with clang on Windows
- Added support for running the BLAS/CBLAS tests on Windows
- Fixed signatures of the tls callback functions for Windows x64
This is just a theory, but I think the first thing to try is upgrading the fairly old OpenBLAS and running the benchmark again. The results may be better with the newer version.
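Before and after such an upgrade, it may help to confirm what a given Julia session is actually using. A small sketch, assuming Julia 1.7 or later (where BLAS.get_config() is available); note that it reports which BLAS library is loaded, not necessarily the OpenBLAS version number:

```julia
using LinearAlgebra

versioninfo()                              # Julia version, OS, and CPU details
println(BLAS.get_config())                 # libraries loaded through libblastrampoline (Julia 1.7+)
println("BLAS threads: ", BLAS.get_num_threads())
```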
Also interesting, IMHO: OpenBLAS on Windows seems to have some limitations with AVX-512:
# OpenBLAS can't deal with avx512 on windows for some reason.
flags += "OPENBLAS_NO_AVX512=1 "
(source)
(But this is probably not related to your configuration.)
Thanks for looking into this. Following your suggestion, I installed Julia 1.8.0-rc3, which has OPENBLAS_BRANCH=v0.3.20. I then reran the script under Windows with 8 threads enabled, with the following result:
The OpenBLAS result is still very poor.
When I started Julia with only a single thread (which causes the script to call BLAS.set_num_threads(1)), OpenBLAS performed somewhat better:
Still, both results with Julia 1.8.0-rc3 on Windows are significantly worse than the results with Julia 1.7.3 on Linux.
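To isolate the effect of the BLAS thread count from the rest of the benchmark, a comparison along these lines might be useful. This is only a sketch: time_lu is a hypothetical helper, and the order 500 and the thread counts 1 and 8 are taken from the runs described above.

```julia
using LinearAlgebra

# Hypothetical helper: average wall time of lu! on n×n matrices
# with the BLAS thread count set to `nthreads`.
function time_lu(n, nthreads; reps = 20)
    BLAS.set_num_threads(nthreads)
    A = rand(n, n)
    lu(A)                                  # warm-up so compilation is not timed
    copies = [copy(A) for _ in 1:reps]     # lu! overwrites its argument
    t = @elapsed begin
        for B in copies
            lu!(B)
        end
    end
    return t / reps
end

for nt in (1, 8)
    ms = round(time_lu(500, nt) * 1e3; digits = 3)
    println(nt, " BLAS thread(s): ", ms, " ms per factorization")
end
```

If the single-thread time is not clearly larger than the 8-thread time on one OS but is on the other, that would point toward threading overhead rather than the kernels themselves.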
As far as I can see, OpenBLAS has its own internal benchmarks:
OpenBLAS/benchmark/scripts at develop · xianyi/OpenBLAS · GitHub
If one of the internal benchmarks reproduces a similar Linux-vs-Windows mismatch, it would clearly show that this is purely an OpenBLAS problem (and then it could be reported upstream).
Here are the results of running the Octave benchmark with Octave version 7.1.0 on Windows and Linux, using the settings From 4 To 500 Step=8 Loops=100: