Issue with XGBoost.jl and LIBSVM.jl on Julia 1.8.4

Dear all,
I am using XGBoost.jl v2.2.0 and LIBSVM.jl v0.8.0 under Windows 11.
Both work fine with Julia 1.8.3:

julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 8 on 16 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

But when I use Julia 1.8.4

julia> versioninfo()
Julia Version 1.8.4
Commit 00177ebc4f (2022-12-23 21:32 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 8 on 16 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

both packages kill my Julia session, for instance after doing:

using XGBoost
(X, y) = (randn(100,4), randn(100))

doing

xgboost((X, y))

kills the process and closes Julia immediately. The same happens when I run any function of LIBSVM.jl.

Has anybody observed the same problem, or does anyone know what is happening?

For what it’s worth, I cannot reproduce this on my Mac:

julia> versioninfo()
Julia Version 1.8.4
Commit 00177ebc4fc (2022-12-23 21:32 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: 12 × Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 5 on 12 virtual cores
Environment:
  JULIA_LTS_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_PATH = /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia
  JULIA_EGLOT_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_NUM_THREADS = 5
  DYLD_LIBRARY_PATH = /usr/local/homebrew/Cellar/libomp/9.0.1/lib/
  JULIA_NIGHTLY_PATH = /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia

Given that you are having trouble with both of these packages, I think you would be entitled to raise an issue directly at JuliaLang/julia on GitHub and see if you can get help there. It is worth mentioning that both packages wrap C/C++ code.

OK, thanks, I will do that.

I already opened two issues: https://github.com/JuliaML/LIBSVM.jl/issues/95 and https://github.com/dmlc/XGBoost.jl/issues/153. The second raised an unexplained error on Windows.

Julia 1.8.5 should be released any minute now.
Given that it works on 1.8.3, I would wait for the new release and see if that fixes it, and revert to 1.8.3 in the meantime.

Something in 1.8.4, 1.9 betas, and the current nightly build is causing a heap corruption (exit code: 3221226356) on Windows when using ccall. I have not been able to pinpoint the change that is causing this issue. I hope that 1.8.5 fixes it.
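
For reference, the decimal exit code is easier to search for in hexadecimal. A minimal sketch in plain Julia (the variable names are just illustrative) shows that it is the same value as the c0000374 critical error that gdb reports further down in this thread, which Windows documents as STATUS_HEAP_CORRUPTION:

```julia
# Decimal exit code reported when the Julia process dies on Windows.
exit_code = 3221226356

# Render it in base 16 to compare it with the NTSTATUS code shown by gdb.
hex_code = string(exit_code, base = 16)
println(hex_code)  # prints c0000374
```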

Here is a short code snippet to reproduce the error:

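using XGBoost
x = rand(4,5)

# Output handle that libxgboost fills in.
o = Ref{XGBoost.DMatrixHandle}()
# Julia arrays are column-major, so the dimensions are passed reversed.
sz = reverse(size(x))
# The C function expects 32-bit floats.
xp = convert(Matrix{Cfloat}, x)
missing_value = NaN32
# This call crashes the session on the affected Windows builds.
XGBoost.xgbcall(XGBoost.XGDMatrixCreateFromMat, xp, sz[1], sz[2], missing_value, o)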

Here is the function that is being called in XGBoost.jl:

function XGDMatrixCreateFromMat(data, nrow, ncol, missing, out)
    @ccall libxgboost.XGDMatrixCreateFromMat(data::Ptr{Cfloat}, nrow::bst_ulong, ncol::bst_ulong, missing::Cfloat, out::Ptr{DMatrixHandle})::Cint
end

I am still seeing the error in v1.8.5. I am not sure what could be causing it. Here is the v1.8.3 to v1.8.4 changelog: Comparing v1.8.3...v1.8.4 (JuliaLang/julia on GitHub)

My best guess is that it was one of these PRs:
JuliaLang/julia pull request #47544 ([CompilerSupportLibraries_jll] Upgrade to libraries from GCC 12)
https://github.com/JuliaLang/julia/pull/46976 (Probe and dlopen() the correct libstdc++)

Yes, I just tried 1.8.5 and observed the same problem: Julia crashes. And as you said, @tylerjthomas9, the problem seems to persist in the 1.9 betas.

So far I have stayed on 1.8.3 for daily use, to be able to use XGBoost and LIBSVM, and I have gotten no solution from https://github.com/JuliaLang/julia/issues/48187. I assume that there are many users of XGBoost.jl and LIBSVM.jl under Windows; I don’t know whether any of them have found another strategy.
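
One possible stopgap, sketched here under the assumption that the reports in this thread are complete (Windows only, 1.8.3 fine, 1.8.4 and later affected), is to check the platform and Julia version before calling into either package. The function name affected_by_crash is hypothetical and is not part of XGBoost.jl or LIBSVM.jl:

```julia
# Hypothetical helper: true when the running setup matches the crash reports
# in this thread (Windows, Julia 1.8.4 or newer). No upper bound is known yet.
function affected_by_crash(v::VersionNumber = VERSION; windows::Bool = Sys.iswindows())
    return windows && v >= v"1.8.4"
end

if affected_by_crash()
    @warn "XGBoost.jl and LIBSVM.jl are reported to crash on this setup; Julia 1.8.3 is reported to work."
end
```

This only warns; it does not work around the crash itself.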

Can you compile Julia on Windows? In that case you could bisect (see the git-bisect documentation) and find which change caused the issue.
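
To illustrate the idea behind bisecting, here is a minimal sketch in Julia of the binary search that git bisect automates. The names first_bad and is_bad are hypothetical, and the sketch assumes that the first commit in the list is known to be good, the last is known to be bad, and every commit after the first bad one is also bad:

```julia
# Return the first commit for which `is_bad` is true, assuming commits[begin]
# is good, commits[end] is bad, and badness never disappears once introduced.
function first_bad(commits::AbstractVector, is_bad::Function)
    lo, hi = firstindex(commits), lastindex(commits)  # lo: known good, hi: known bad
    while hi - lo > 1
        mid = (lo + hi) ÷ 2
        if is_bad(commits[mid])
            hi = mid   # the culprit is at mid or earlier
        else
            lo = mid   # the culprit is after mid
        end
    end
    return commits[hi]
end

# Toy example: 20 numbered "commits", where number 13 introduced the crash.
commits = collect(1:20)
culprit = first_bad(commits, c -> c >= 13)
println(culprit)  # prints 13
```

In practice, is_bad would mean building Julia at that commit on Windows and running the xgboost example from the first post.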

The error first occurs with this commit: JuliaLang/julia@c8b72e2, "[CompilerSupportLibraries_jll] Upgrade to libraries from GCC 12 (#47544)".

I am still seeing this issue on v1.9.0-beta3 and the current nightly build (v1.10.0-DEV.398, 2023-01-19).

Same for me: I still observe the issue on v1.9.0-beta3.

Same for v1.9.0-beta4

Bugs don’t usually get fixed on their own; someone has to understand what the issue really is (so far we know only the symptoms) and find an appropriate solution.

I’m a bit confused about what is going on here.

The issue appears to have been bisected by tylerjthomas9 above to pull request #47544 in JuliaLang/julia, "[CompilerSupportLibraries_jll] Upgrade to libraries from GCC 12".

Tyler appears to have bisected the 1.8 branch. Did anyone try reverting the corresponding commit on the 1.8 branch to see if this resolves the issue?

The corresponding changes on the master branch are

Running julia under gdb from MSYS2, I ran the following Julia code:

using XGBoost

# training set of 100 datapoints of 4 features
(X, y) = (randn(100,4), randn(100))

# create and train a gradient boosted tree model of 5 trees
bst = xgboost((X, y), num_round=5, max_depth=6, objective="reg:squarederror")

I then got the following trace.

warning: Critical error detected c0000374

Thread 1 received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffbbf64f633 in ntdll!RtlIsZeroMemory () from C:\windows\SYSTEM32\ntdll.dl
(gdb) bt
#0  0x00007ffbbf64f633 in ntdll!RtlIsZeroMemory () from C:\windows\SYSTEM32\ntdll.dll
#1  0x00007ffbbf6583f2 in ntdll!RtlpNtSetValueKey () from C:\windows\SYSTEM32\ntdll.dll
#2  0x00007ffbbf6586da in ntdll!RtlpNtSetValueKey () from C:\windows\SYSTEM32\ntdll.dll
#3  0x00007ffbbf65e361 in ntdll!RtlpNtSetValueKey () from C:\windows\SYSTEM32\ntdll.dll
#4  0x00007ffbbf575bf0 in ntdll!RtlGetCurrentServiceSessionId ()
   from C:\windows\SYSTEM32\ntdll.dll
#5  0x00007ffbbf5747b1 in ntdll!RtlFreeHeap () from C:\windows\SYSTEM32\ntdll.dll
#6  0x00007ffbbd5b9c9c in msvcrt!free () from C:\windows\System32\msvcrt.dll
#7  0x0000000002d6a0ef in unsigned long long xgboost::SparsePage::Push<xgboost::data::DenseAdapterBatch>(xgboost::data::DenseAdapterBatch const&, float, int) ()
   from C:\Users\mkitti\.julia\artifacts\a1540ff6121e48fd4712006a269d6bf6bf8216e1\bin\xgboost.dll
#8  0x0000000002dbac97 in xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::DenseAdapter>(xgboost::data::DenseAdapter*, float, int) ()
   from C:\Users\mkitti\.julia\artifacts\a1540ff6121e48fd4712006a269d6bf6bf8216e1\bin\xgboost.dll
#9  0x0000000002e70883 in xgboost::DMatrix* xgboost::DMatrix::Create<xgboost::data::DenseAdapter>(xgboost::data::DenseAdapter*, float, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from C:\Users\mkitti\.julia\artifacts\a1540ff6121e48fd4712006a269d6bf6bf8216e1\bin\xgboost.dll
#10 0x0000000002b555fb in XGDMatrixCreateFromMat ()
   from C:\Users\mkitti\.julia\artifacts\a1540ff6121e48fd4712006a269d6bf6bf8216e1\bin\xgboost.dll
#11 0x000002248136cd09 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) b unsigned long long xgboost::SparsePage::Push<xgboost::data::DenseAdapterBatch>(xgboost::data::DenseAdapterBatch const&, float, int)

(gdb) s
Single stepping until exit from function _ZN7xgboost10SparsePage4PushINS_4data17DenseAdapterBatchEEEyRKT_fi,
which has no line number information.
operator new (sz=8) at /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/new_op.cc:47
47      /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/new_op.cc: No such file or directory.
(gdb) s
50      in /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/new_op.cc
(gdb) s
58      in /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/new_op.cc

(gdb) s
0x00000000029a9f9d in unsigned long long xgboost::SparsePage::Push<xgboost::data::DenseAdapterBatch>(xgboost::data::DenseAdapterBatch const&, float, int) () from C:\Users\kittisopikulm\.julia\artifacts\a1540ff6121e48fd4712006a269d6bf6bf8216e1\bin\xgboost.dll

(gdb) s
Single stepping until exit from function _ZN7xgboost10SparsePage4PushINS_4data17DenseAdapterBatchEEEyRKT_fi,
which has no line number information.
operator delete (ptr=0x18d2233a620) at /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/del_op.cc:49
49      /workspace/srcdir/gcc-12.1.0/libstdc++-v3/libsupc++/del_op.cc: No such file or directory.
(gdb) s
0x00007ffb58ae68e0 in free () from C:\Users\mkitti\.julia\juliaup\julia-1.8.5+0.x64.w64.mingw32\bin\libstdc++-6.dll

(gdb) s
Single stepping until exit from function free,
which has no line number information.
0x00007ffbbd5b9c80 in msvcrt!free () from C:\windows\System32\msvcrt.dll

Looks like the issue is somewhere around here:

0x00007ffbbf64f633 in ntdll!RtlIsZeroMemory () from C:\windows\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ffbbf64f633 in ntdll!RtlIsZeroMemory () from C:\windows\SYSTEM32\ntdll.dll
#1  0x00007ffbbf6583f2 in ntdll!RtlpNtSetValueKey () from C:\windows\SYSTEM32\ntdll.dll
#2  0x00007ffbbf6586da in ntdll!RtlpNtSetValueKey () from C:\windows\SYSTEM32\ntdll.dll
#3  0x00007ffbbf65e361 in ntdll!RtlpNtSetValueKey () from C:\windows\SYSTEM32\ntdll.dll
#4  0x00007ffbbf575bf0 in ntdll!RtlGetCurrentServiceSessionId ()
   from C:\windows\SYSTEM32\ntdll.dll
#5  0x00007ffbbf5747b1 in ntdll!RtlFreeHeap () from C:\windows\SYSTEM32\ntdll.dll
#6  0x00007ffbbd5b9c9c in msvcrt!free () from C:\windows\System32\msvcrt.dll
#7  0x000000000264d1f6 in xgboost::SparsePage::Push<xgboost::data::DenseAdapterBatch> (
    this=this@entry=0x20d9e121ee0, batch=..., missing=<optimized out>,
    missing@entry=nan(0x400000), nthread=<optimized out>)
    at /workspace/srcdir/xgboost/src/data/data.cc:1074
#8  0x000000000269de8a in xgboost::data::SimpleDMatrix::SimpleDMatrix<xgboost::data::DenseAdapter>
    (this=0x20d8a872100, adapter=0xb6a47fc460, missing=nan(0x400000), nthread=<optimized out>)
    at /workspace/srcdir/xgboost/src/data/simple_dmatrix.cc:144
#9  0x0000000002751633 in xgboost::DMatrix::Create<xgboost::data::DenseAdapter> (
    adapter=adapter@entry=0xb6a47fc460, missing=missing@entry=nan(0x400000),
    nthread=nthread@entry=1) at /workspace/srcdir/xgboost/src/data/data.cc:917
#10 0x0000000002454a9d in XGDMatrixCreateFromMat (data=<optimized out>, nrow=<optimized out>,
    ncol=<optimized out>, missing=nan(0x400000), out=0x20d909185d0)
    at /workspace/srcdir/xgboost/src/c_api/c_api.cc:457

I’ve debugged this to the degree that I know how.

I produced a version of XGBoost_jll.jl with debugging symbols here:

@staticfloat @giordano @gbaraldi if you can think of any other thing to look at, please let me know.

Thanks @mkitti. To check my understanding: does the problem ultimately come from XGBoost.jl? I ask because it was suggested here and here that the problem was on the Julia side (and the problem is also observed with LIBSVM.jl).

Let me try to describe what I’m seeing above in plain English.

XGBoost is a C++ project that uses OpenMP for parallelization. When XGDMatrixCreateFromMat is invoked, the method unsigned long long xgboost::SparsePage::Push<xgboost::data::DenseAdapterBatch>(xgboost::data::DenseAdapterBatch const&, float, int) is eventually called. This contains a potentially parallel for loop via OpenMP.

This parallel for loop is set up by interacting with libstdc++ and libgomp. Julia provides both of these libraries as part of its CompilerSupportLibraries_jll:

Some update to these libraries seems to be causing XGBoost to free some memory that should not be freed. Recent updates bumped the underlying gcc to version 12.

Another change is the disabling of thread local storage on Windows builds:

The issue is thus not directly with Julia itself but rather with libraries that it provides.
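
To check which copies of these libraries a given Julia session has actually loaded, something like the following sketch may help. It uses Libdl.dllist() from the standard library; the helper name loaded_libs and the name fragments it searches for are assumptions for illustration, and the exact file names differ by platform:

```julia
using Libdl

# Hypothetical helper: paths of currently loaded shared libraries whose file
# name contains any of the given fragments.
function loaded_libs(fragments = ("libstdc++", "libgomp"))
    return filter(Libdl.dllist()) do path
        any(f -> occursin(f, basename(path)), fragments)
    end
end

# Print one path per line; run this after `using XGBoost` to see which
# libstdc++ and libgomp the package ended up using.
foreach(println, loaded_libs())
```

Comparing the printed paths between Julia 1.8.3 and 1.8.4 should show whether the libraries come from the Julia installation in both cases.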

Thanks @mkitti

@mkitti Thanks for your insights. Do you see a possible fix?