Issue with XGBoost.jl and LIBSVM.jl when Julia 1.8.4

I’ve tried to dig fairly deeply into this issue, and I would put weight on the issue stemming from the disabling of thread-local storage on Windows (mingw builds) mentioned above by @mkitti.
The discussion leading to this change begins about here, as far as I can tell,
and continues later in the thread here.

This change appears to have affected packages that call into C libraries built via BinaryBuilder and that depend on CompilerSupportLibraries_jll for OpenMP functionality (of course, that doesn’t exactly narrow things down). To gather further data, I have noted three additional packages that are experiencing the exact issue given here (listed below):

Based on the error traces for XGBoost above and in the FastTransforms issue, both appear to be crashes caused by freeing memory while trying to use multiple threads. (The blis C library has strong warnings about the dangers of --disable-tls and race conditions.)

I would welcome others’ thoughts on the commonality linking these packages.

(Another PR from around the same time as the others is "[CompilerSupportLibraries] Add `libmsvcrt` to the Windows package" by giordano, Pull Request #5730 in JuliaPackaging/Yggdrasil, but I only mention it for interest’s sake.)

Essentially the fix is to enable TLS. To do this, the original issue by @jameson needs to be addressed.

If I understand correctly, a change in GCC led to the loss of some symbols, such as std::__once_functor, which @giordano was able to resolve by using --enable-tls=no.

If someone can figure out how to restore those symbols with --enable-tls=yes, then we might have a fix.

Maybe this is a GCC 12 bug?

Yes, I agree, that was my conclusion as well.

I’m not convinced honestly: Jameson said that TLS was never enabled before in the first place.

Of the changes in PR 45582 or otherwise, do you have a sense of the most likely cause? Thanks

No.

Ok, I have a consistent method of “solving” the crash behaviour. It’s not a solution to the bug but I think it may point the way to a means of fixing it.

The basic observation is that the crash doesn’t occur when loading each package’s respective .dll libraries directly (i.e. via a dlopen) but does occur when loading the dll through their respective JuliaBinaryWrappers package (e.g. XGBoost_jll.jl). The common thread linking the examples seems to be:

  1. the wrapper loading aspect,
  2. using CompilerSupportLibraries_jll as a build dependency in the build_tarballs.jl file, e.g.:
     Dependency("CompilerSupportLibraries_jll"; platforms=filter(!Sys.isbsd, platforms))
  3. and the use of OpenMP / GOMP / libgomp-1.dll within the wrapped C library.

So a method of “solving” the crash is to run the following before loading the package:

## "Solution" code:
import CompilerSupportLibraries_jll
using Libdl
dlclose(CompilerSupportLibraries_jll.libgomp_handle)

I’ve verified that this works for me with the following reproducible crash MWEs given here and in linked issues:

XGBoost.jl
using XGBoost
(X, y) = (randn(100,4), randn(100))
bst = xgboost((X, y), num_round=5, max_depth=6, objective="reg:squarederror")

LIBSVM.jl
# LIBSVM.jl (https://github.com/JuliaML/LIBSVM.jl/issues/95#issue-1517938994)
using LIBSVM
(X, y) = (randn(100,4), randn(100))
svmtrain(X', y)

BLIS.jl
# (https://github.com/JuliaLinearAlgebra/BLIS.jl/issues/23#issuecomment-1515754372)

using blis_jll
using LinearAlgebra
BLAS.lbt_find_backing_library("dgemm_", :ilp64)  # for information
BLAS.lbt_forward(blis_jll.blis_path; clear=false, verbose=true)
BLAS.lbt_find_backing_library("dgemm_", :ilp64) # for information
A = rand(10,5); B = rand(5,8);
A * B

I guess the question now is why is the above “solution” effective?

It seems to unload the default libgomp, possibly falling back to a system version? With the default one closed, dllist() shows this fallback being loaded:

"C:\\ProgramData\\Anaconda3\\Library\\mingw-w64\\bin\\libgomp-1.dll"

Is it working because otherwise there is some incompatibility with a GCC 12 version that julia 1.8.4 and above is loading by default?
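
For anyone wanting to check which libgomp their own session has picked up, here is a minimal sketch using only the Libdl standard library. The helper name loaded_libraries is made up for illustration and is not part of any package; it just filters dllist() by file name.

```julia
using Libdl

# Hypothetical helper (not part of any package): return the full paths of all
# currently loaded shared libraries whose file name contains `needle`.
# The match is case-insensitive because Windows file names are.
function loaded_libraries(needle::AbstractString)
    needle = lowercase(needle)
    return filter(path -> occursin(needle, lowercase(basename(path))), dllist())
end

# Print every libgomp copy the current process has loaded (possibly none).
foreach(println, loaded_libraries("libgomp"))
```

Running this once before and once after the dlclose call should show whether the path changes from the copy shipped with Julia to some other copy on the system.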

I just realised there is a whole parallel discussion about Windows library loading issues here that may have some relevance:

This suggests that an alternative build of CompilerSupportLibraries_jll might be able to resolve the issue.

This doesn’t say anything about what the alternative build would need to look like though. That something doesn’t work is amply clear, but no one can pinpoint what.

That’s not a great solution: you just happen to have a libgomp from Anaconda, and in general that’s a false assumption.

If you have a DLL that appears to be working you can always use the Preferences.jl override system:

https://docs.binarybuilder.org/stable/jll/#Overriding-specific-products

Yep, I knew that :slightly_smiling_face:. Hope it was clear that I’m still trying to get closer to what the actual solution should be, but not there yet.

That’s a good tool to have. Still, I think this shouldn’t be necessary, since it isn’t on the other platforms?

Not for stdlibs.

An update.

Firstly, I’m taking it as given that there is an incompatibility with the libgomp DLL shipping with mingw GCC 12 (and hence also julia 1.8.4 and above) and something in Windows (maybe a conflict with the C runtime msvcrt and similar, but not something I have the skills to resolve with any certainty). The other side of this is the observation that this same incompatibility doesn’t exist when (cross-)compiling with GCC 11.

So the question I needed to resolve is: why do JLLs built with dependencies=[Dependency(PackageSpec(name="CompilerSupportLibraries_jll" ... (for OpenMP/libgomp requirements) and build_tarballs(... products, dependencies; preferred_gcc_version=v"11") fail when run under Julia 1.8.4+?

I propose that one possible fix here is to 1) change the CSL Dependency to a BuildDependency if at all possible and 2) copy the necessary dependencies into the build tarball, while adding an explicit declaration in products of

LibraryProduct("libgomp", :libgomp)

The result of these two steps is that CSL is not invoked in the xxx_jll binary wrapper package when loading. Why is this important? When CompilerSupportLibraries_jll is a Dependency and is imported in a package, the JLLWrappers.@generate_init_footer() call involves calling JLLWrappers.get_julia_libpaths() to add Sys.BINDIR to LIBPATH on Windows (cf. this PR, and see also @mkitti's JLLWrappers.jl/issues/51 and jheinen/GR.jl#489).

The effect is that, on Windows, the copy on the Julia libpaths ends up being loaded: ..\\julia-1.8.4\\bin\\libgomp-1.dll. However, this is the newer, incompatible GCC 12 libgomp, not the BinaryBuilder-linked one that we need in order to avoid crashes.
Quite likely there are some other details I may have missed, or much simpler fixes that I would love to hear about.
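
To see whether a loaded libgomp is the one sitting in Julia's own bin directory, something like the following rough sketch could be used. The helper in_julia_bindir is a made-up name, and the check is a plain path comparison that does not resolve symlinks, so treat it as a quick diagnostic rather than a definitive test.

```julia
using Libdl

# Hypothetical helper: true when `path` sits directly in Julia's binary
# directory (Sys.BINDIR). Paths are compared case-insensitively on Windows.
function in_julia_bindir(path::AbstractString)
    canon(p) = Sys.iswindows() ? lowercase(normpath(p)) : normpath(p)
    return dirname(canon(path)) == canon(Sys.BINDIR)
end

# Report where each loaded libgomp copy comes from.
for path in dllist()
    occursin("libgomp", lowercase(basename(path))) || continue
    origin = in_julia_bindir(path) ? "shipped with Julia" : "found elsewhere"
    println(path, " (", origin, ")")
end
```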

Having a plethora of different copies of the same shared library doesn’t sound like a good solution. It also doesn’t address the problem.

I really appreciate your work on the BinaryBuilder and JLL ecosystem, having spent a bit of time now going through and using your solutions. Thanks.

I thought this was already some version of DLL hell :). For someone else who wants to have a crack at this, what would a solution look like? Would it be something like pointing to the thing in the C library code that’s causing the crash between the shared libraries?

Since CompilerSupportLibraries_jll is a standard library, you at least need to recompile the system image with a substitute version of CompilerSupportLibraries_jll for a more permanent fix.

Alternatively, recompile Julia with USE_BINARYBUILDER_CSL=0, changing csl.mk as needed.

My recommended way to compile Julia on Windows:

  1. Download MSYS2, run MSYS2 MINGW64
  2. git clone https://github.com/JuliaLang/julia.git
  3. cd julia
  4. git checkout v1.8.5
  5. make -j8

If that works, then substitute this for the last step:

  5. make USE_BINARYBUILDER_CSL=0 -j8

This is going to be quite difficult to get to compile though.

Another anecdote:

While compiling julia via msys2, I removed julia/usr/bin/libgomp-1.dll and then ran julia/usr/bin/julia. This ran and julia picked up the system libgomp.

julia> using Libdl

julia> filter(endswith("libgomp-1.dll"), dllist())
1-element Vector{String}:
 "D:\\msys64\\mingw64\\bin\\libgomp-1.dll"

XGBoost.jl then ran without issue.

julia> using XGBoost

julia> (X, y) = (randn(100,4), randn(100))
([-0.2872406656219727 -0.12530887476138095 -1.0645945831730095 -0.5476863629851729; 1.1796877272505861 -0.40424373371279 1.1259156453243035 -0.30888166542343876; … ; -0.849482346037964 -0.1579562943535686 -1.1079538137852494 -0.27463220454720016; 1.182042709625743 -0.87330578264992 -1.416683927904884 -0.4334293199841588], [-0.17352201292899774, 1.1631323948886183, 0.14563217219555555, 0.5894469249299029, -0.868977268778328, 0.9206360145384936, -0.11610698551105296, -1.9548248409028115, -0.2730516728874413, -0.5349293988464893  …  0.8795303941899613, 0.34443362475322187, 1.0898747246422278, 0.37545903666247926, 0.587411742539772, 1.9751852159772305, -0.11853522099175944, 1.698022088389855, -0.3255423700067697, -0.1102913206768139])

julia> bst = xgboost((X, y), num_round=5, max_depth=6, objective="reg:squarederror")
[ Info: XGBoost: starting training.
[ Info: [1]     train-rmse:0.86275966048601038
[ Info: [2]     train-rmse:0.72102870654282825
[ Info: [3]     train-rmse:0.61120932596553845
[ Info: [4]     train-rmse:0.51586851123155142
[ Info: [5]     train-rmse:0.45288159287658064
[ Info: Training rounds complete.

Next I checked to see if /mingw64/bin/libstdc++-6.dll had the missing symbols.

MINGW64 ~/julia/usr/bin
$ nm /mingw64/bin/libstdc++-6.dll 2> /dev/null | grep -E '(D|T) .*(_|et)_once_(functor|mutexv)'

MINGW64 ~/julia/usr/bin
$ nm libstdc++-6.dll 2> /dev/null | grep -E '(D|T) .*(_|et)_once_(functor|mutexv)'
00000003bea83a80 D _ZSt14__once_functor
00000003bea683d0 T _ZSt16__get_once_mutexv
00000003bea6b430 T _ZSt23__get_once_functor_lockv
00000003bea6b5e0 T _ZSt27__set_once_functor_lock_ptrPSt11unique_lockISt5mutexE

The msys2 libstdc++-6.dll appears to not have the needed symbols while the one distributed with Julia does.
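
The same kind of check can be roughly approximated from within Julia. This is only a sketch: missing_symbols is a made-up helper, and dlsym reports what the dynamic loader can resolve (exported symbols), which is not exactly what nm lists. The mangled names are the ones from the nm output above.

```julia
using Libdl

# Hypothetical helper: return the names in `names` that cannot be resolved in
# the library at `libpath`, or `nothing` if the library cannot be opened.
function missing_symbols(libpath::AbstractString, names)
    handle = dlopen(libpath; throw_error=false)
    handle === nothing && return nothing
    try
        return [String(n) for n in names if dlsym(handle, n; throw_error=false) === nothing]
    finally
        dlclose(handle)  # release the reference taken by dlopen above
    end
end

# Mangled names taken from the nm output for Julia's libstdc++-6.dll.
once_symbols = [
    "_ZSt14__once_functor",
    "_ZSt16__get_once_mutexv",
    "_ZSt23__get_once_functor_lockv",
    "_ZSt27__set_once_functor_lock_ptrPSt11unique_lockISt5mutexE",
]

# Pass a full path to test a specific copy; a bare name uses the normal search.
println(missing_symbols("libstdc++-6.dll", once_symbols))
```

An empty result would mean all four symbols resolve, while `nothing` means the library could not be opened at all (for example, on a non-Windows machine).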

xgboost.dll imports only a few functions from libgomp-1.dll:

N/A, 20 (0x00000014), GOMP_atomic_end, libgomp-1.dll, False, None
N/A, 21 (0x00000015), GOMP_atomic_start, libgomp-1.dll, False, None
N/A, 22 (0x00000016), GOMP_barrier, libgomp-1.dll, False, None
N/A, 38 (0x00000026), GOMP_loop_dynamic_next, libgomp-1.dll, False, None
N/A, 39 (0x00000027), GOMP_loop_dynamic_start, libgomp-1.dll, False, None
N/A, 42 (0x0000002a), GOMP_loop_end_nowait, libgomp-1.dll, False, None
N/A, 43 (0x0000002b), GOMP_loop_guided_next, libgomp-1.dll, False, None
N/A, 44 (0x0000002c), GOMP_loop_guided_start, libgomp-1.dll, False, None
N/A, 65 (0x00000041), GOMP_loop_ull_dynamic_next, libgomp-1.dll, False, None
N/A, 66 (0x00000042), GOMP_loop_ull_dynamic_start, libgomp-1.dll, False, None
N/A, 67 (0x00000043), GOMP_loop_ull_guided_next, libgomp-1.dll, False, None
N/A, 68 (0x00000044), GOMP_loop_ull_guided_start, libgomp-1.dll, False, None
N/A, 91 (0x0000005b), GOMP_parallel, libgomp-1.dll, False, None
N/A, 113 (0x00000071), GOMP_single_start, libgomp-1.dll, False, None
N/A, 233 (0x000000e9), omp_get_max_threads, libgomp-1.dll, False, None
N/A, 241 (0x000000f1), omp_get_num_procs, libgomp-1.dll, False, None
N/A, 245 (0x000000f5), omp_get_num_threads, libgomp-1.dll, False, None
N/A, 270 (0x0000010e), omp_get_thread_limit, libgomp-1.dll, False, None
N/A, 272 (0x00000110), omp_get_thread_num, libgomp-1.dll, False, None