I'm currently hitting out-of-memory errors both when training and when running inference.
When it fails only at inference, the error looks like this:
XGBoostError: (caller: XGBoosterPredictFromDMatrix)
[14:54:49] /workspace/srcdir/xgboost/src/c_api/../data/../common/device_helpers.cuh:431: Memory allocation error on worker 0: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
- Free memory: 2493513728
- Requested memory: 4624851000
Stack trace:
[bt] (0) /home/jiling/.julia/artifacts/dcee79537e0e0f3f2ef6acf4b886a1dd6adcc6c8/lib/libxgboost.so(+0x9239b4) [0x7f09fac089b4]
[bt] (1) /home/jiling/.julia/artifacts/dcee79537e0e0f3f2ef6acf4b886a1dd6adcc6c8/lib/libxgboost.so(dh::detail::ThrowOOMError(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long)+0x3d9) [0x7f09fac0d9b9]
[bt] (2) /home/jiling/.julia/artifacts/dcee79537e0e0f3f2ef6acf4b886a1dd6adcc6c8/lib/libxgboost.so(dh::detail::XGBDefaultDeviceAllocatorImpl<xgboost::Entry>::allocate(unsigned long)+0x386) [0x7f09fac254e6]
My setup is a bit awkward: I can only get either one slice of an A100 with 5 GB of VRAM, or two slices, but it seems the two slices can't be used together from Julia.
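One thing I've been considering as a workaround is predicting in chunks so the device never has to hold the whole DMatrix at once. A minimal sketch (assuming XGBoost.jl's `predict(::Booster, data)` accepts a row-sliced feature matrix; `chunk_rows` is a knob I made up, to be tuned against the 5 GB limit):

```julia
using XGBoost

# Predict in row chunks to bound peak GPU memory during inference.
# `bst` is a trained Booster, `X` a feature matrix with rows = samples.
function predict_in_chunks(bst, X; chunk_rows = 100_000)
    n = size(X, 1)
    preds = Float32[]
    for lo in 1:chunk_rows:n
        hi = min(lo + chunk_rows - 1, n)
        # Each call only materializes a chunk-sized DMatrix on the device.
        append!(preds, predict(bst, X[lo:hi, :]))
    end
    return preds
end
```

This wouldn't help with the training OOM, of course; it only caps the size of the buffer that `XGBoosterPredictFromDMatrix` has to allocate in one go.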