Issue with XGBoost.jl and LIBSVM.jl on Julia 1.8.4

@vchuravy , I think you might be the only one who can solve this.

std::call_once is invoked to implement llvm::call_once:

However, it seems it was compiled against a std::mutex without _GLIBCXX_HAVE_TLS defined on Windows.

Why does our libLLVM not have _GLIBCXX_HAVE_TLS defined on Windows?
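For readers less familiar with the primitive in question, here is a minimal sketch of what `std::call_once` guarantees, plus a check of whether the standard library headers in use define `_GLIBCXX_HAVE_TLS`. This is not LLVM's code; `initialize`, `run_once_from_threads`, and `libstdcxx_has_tls` are hypothetical names made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical one-time initialization state (illustration only).
static std::once_flag init_flag;
static std::atomic<int> init_runs{0};

// Hypothetical initializer: just counts how often it is executed.
static void initialize() { ++init_runs; }

// Start n_threads threads that all race to run the initializer and
// return how many times it actually ran (std::call_once allows one run).
int run_once_from_threads(int n_threads) {
    assert(n_threads > 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i)
        workers.emplace_back([] { std::call_once(init_flag, initialize); });
    for (auto& w : workers) w.join();
    return init_runs.load();
}

// Report whether the libstdc++ headers seen by this translation unit
// define _GLIBCXX_HAVE_TLS. With another standard library the macro
// is simply absent and this returns false.
bool libstdcxx_has_tls() {
#ifdef _GLIBCXX_HAVE_TLS
    return true;
#else
    return false;
#endif
}
```

Calling `run_once_from_threads(8)` from `main` should report a single run of the initializer, and `libstdcxx_has_tls()` gives a quick way to compare two toolchains. Depending on the platform, the program may need to be built with `-pthread`.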

Are these related?

I am confused. What does libLLVM have to do with this problem? libLLVM is not a dependency of libgomp, is it?

libLLVM was compiled against a GCC configured without thread local storage on Windows. Upon updating GCC to GCC 12 we found missing symbols (e.g. _ZSt14__once_functor) needed by libLLVM as documented in the following pull request.

To restore those symbols, we had to explicitly disable thread local storage, in part due to an upstream change in GCC 12. However, by explicitly disabling thread local storage in GCC 12 we may have broken libgomp, or perhaps the way other libraries (e.g. XGBoost) interact with libgomp. So far, substituting a libgomp built with thread local storage appears to resolve the issue for those packages.

One potential solution, still unproven and complicated to test, is to enable GCC’s thread local storage on Windows, which does have native support for thread local storage. This would in part involve rebuilding libLLVM against a GCC with thread local storage enabled.

The specific questions to you, @vchuravy , are

  1. Is there a reason we have disabled thread local storage for LLVM on Windows? (My guess is no. That’s just how CompilerSupportLibraries_jll was built in the past)
  2. Could we enable thread local storage support on Windows and still have libLLVM function correctly?

To move in this direction, some additional work is needed to prove that thread local storage support is actually the issue, but it is the current lead suspect.
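As a small illustration of what thread local storage means at the C++ level, the sketch below gives each thread its own copy of a counter. All names are hypothetical. As far as I understand, when GCC is configured without native TLS, accesses like these go through libgcc's emulated TLS helper (`__emutls_get_address`) instead of a native TLS slot, but the observable behavior should be the same.

```cpp
#include <cassert>
#include <thread>

// Each thread gets its own independent copy of this counter.
thread_local int per_thread_counter = 0;

// Increment the calling thread's counter `times` times and return it.
int bump(int times) {
    for (int i = 0; i < times; ++i) ++per_thread_counter;
    return per_thread_counter;
}

// Returns true if increments made by a worker thread are invisible
// to the calling thread, i.e. the two counters are independent.
bool counters_are_independent() {
    per_thread_counter = 0;
    bump(3);  // the calling thread's counter is now 3
    int seen_by_worker = -1;
    std::thread worker([&] { seen_by_worker = bump(5); });
    worker.join();
    assert(seen_by_worker == 5);     // the worker started from its own 0
    return per_thread_counter == 3;  // the caller's copy is unchanged
}
```

Building this once with a TLS-enabled toolchain and once with a TLS-disabled one, then comparing the imported symbols, would be one way to see which mechanism a given build uses.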

I don’t have answers for those questions; my best guess is “probably”.
It sounds like we compiled libLLVM against a libstdc++ from GCC, and that baked in the fact that TLS was disabled. So yes, a “simple” recompilation might work?

Any news on this issue? 1.9.0 is imminent, but I don’t think it’s fixed, right?

Still waiting for someone to understand what the issue is in order to be able to come up with a fix.

In the absence of understanding the precise issue, what do you think about the utility of bumping the version of the MinGW-w64 release from v7 (2019-11-10) to v11 (2023-04-29)?

The current line in gcc_sources.jl seems to have been there since 2020 (Final cleanups to GCCBootstrap shards and RootFS by staticfloat · Pull Request #1607 · JuliaPackaging/Yggdrasil · GitHub).

(Side note: it seems mingw-w64 v11 fixed the parallel build race condition: [0_RootFS] Add GCC 12 by giordano · Pull Request #4980 · JuliaPackaging/Yggdrasil · GitHub)

Send the pull request. We’ll probably want to do that eventually, I think.

We generally use old toolchains for a reason (compatibility).

I’m using IDA for debugging, and it attributes the critical error to a different location:
_ZNSt12__shared_ptrIN7xgboost10SparsePageELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS1_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_.isra.751+0xE6

Interestingly, this path goes through 00007FF92124323F libgcc_s_seh-1.dll libgcc_s_seh-1___emutls_get_address+1EF, again suggesting possible involvement of TLS.

Address	Module	Function
00007FF9670AF3D3	ntdll.dll	ntdll_RtlIsZeroMemory+A3
00007FF9670B812F	ntdll.dll	ntdll_RtlpNtSetValueKey+44F
00007FF9670BAEF5	ntdll.dll	ntdll_RtlpNtSetValueKey+3215
00007FF9670B8192	ntdll.dll	ntdll_RtlpNtSetValueKey+4B2
00007FF9670B847A	ntdll.dll	ntdll_RtlpNtSetValueKey+79A
00007FF966FD47B1	ntdll.dll	ntdll_RtlFreeHeap+51
00007FF9670BE101	ntdll.dll	ntdll_RtlpNtSetValueKey+6421
00007FF966FD5BF0	ntdll.dll	ntdll_RtlGetCurrentServiceSessionId+13A0
00007FF936086F29	libwinpthread-1.dll	libwinpthread-1_sem_destroy+79
00007FF92124323F	libgcc_s_seh-1.dll	libgcc_s_seh-1___emutls_get_address+1EF
00007FF8C966E21C	libstdc++-6.dll	libstdc++-6__Znwy+1C
00000000067C6036	xgboost.dll	_ZNSt12__shared_ptrIN7xgboost10SparsePageELN9__gnu_cxx12_Lock_policyE2EEC2ISaIS1_EJEEESt19_Sp_make_shared_tagRKT_DpOT0_.isra.751+0xE6
000000000698DE8A	xgboost.dll	xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::DenseAdapter>(xgboost::data::DenseAdapter *,float,int)+0x27A
0000000006A41633	xgboost.dll	xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::DenseAdapter>(xgboost::data::DenseAdapter *,float,int,std::__cxx11::basic_string<char,std::char_traits<char>,std::allocator<char>> const&)+0x33
0000000006744A9D	xgboost.dll	XGDMatrixCreateFromMat+0x8D
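The mangled names in a trace like this can be decoded with the demangler that ships with libstdc++. The sketch below wraps `abi::__cxa_demangle` (available with GCC and Clang through `<cxxabi.h>`). The `demangle` helper itself is a made-up name, and it returns its input unchanged when demangling fails.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Demangle an Itanium-ABI symbol name; return the input unchanged
// if the demangler rejects it.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* raw = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0) return mangled;  // not a valid mangled name
    assert(raw != nullptr);           // success implies a malloc'ed buffer
    std::string result(raw);
    std::free(raw);                   // the caller owns the returned buffer
    return result;
}
```

For example, `demangle("_Znwy")` gives `operator new(unsigned long long)`, which is the libstdc++ frame in the trace once IDA's module prefix is dropped, and `demangle("_ZSt14__once_functor")` gives `std::__once_functor`. The same helper can be tried on the long `__shared_ptr` constructor symbol.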

We need someone to show whether the bug can be recreated in C++ using the libraries shipped with Julia. If we can show this is not specifically a Julia issue, we stand a better chance of getting help from upstream.
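A possible starting point for such a reproducer is sketched below. It is only a guess based on the stack trace, which shows a `make_shared` allocation reaching `__emutls_get_address`: the code allocates shared objects inside an OpenMP parallel loop. `Page` is a hypothetical stand-in for `xgboost::SparsePage`, and whether this triggers the crash at all is unknown.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for xgboost::SparsePage (illustration only).
struct Page {
    std::vector<float> data;
};

// Allocate n shared Page objects, in parallel when built with OpenMP,
// and return the total number of floats that were stored.
long allocate_in_parallel(int n) {
    assert(n >= 0);
    long total = 0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : total)
#endif
    for (int i = 0; i < n; ++i) {
        auto page = std::make_shared<Page>();  // goes through operator new
        page->data.assign(16, 1.0f);
        total += static_cast<long>(page->data.size());
    }
    return total;
}
```

To be meaningful, this would have to be built with `-fopenmp` using a MinGW-w64 GCC and run against the libgomp, libstdc++, and libgcc DLLs shipped with Julia. Without `-fopenmp` the loop simply runs serially.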

Should be fixed by [CompilerSupportLibraries_jll] Upgrade to v1.0.5 by giordano · Pull Request #50135 · JuliaLang/julia · GitHub

Many thanks to everybody who contributed to fixing this difficult problem. Amazing job, greatly appreciated. :+1:
