Multithreaded LIBSVM and XGBoost crashing

Hello,

I would like to perform cross-validation and a grid search with SVM (and XGBoost), using a parameter grid defined for each method.
In MATLAB, within each split, I can run parfor, and each thread trains some models on the data with specific parameter combinations.
In Julia, using LIBSVM.jl and XGBoost.jl, I would like to run a Threads.@threads for loop to do the same. However, this crashes with segfaults. I have also tried MLJScikitLearnInterface, with similar outcomes.
Interestingly, as you can see from the MWE below, the error only occurs if the function is called twice.

Note: Before the Threads.@threads loop, I set the number of BLAS threads to 1, so that each Julia thread runs its BLAS calls single-threaded and the cores are not oversubscribed. Moreover, each thread saves the model and the timing into preallocated arrays, at the index of the parameter combination it used, so AFAIU there should be no data race or Threads.threadid() issues.

A MWE would be the following:

using Random, LIBSVM, LinearAlgebra.BLAS

function gridsearch(gammas)
    models = Vector(undef, length(gammas))   # collector of models
    tim = zeros(length(gammas))              # training times
    prevBLASthreads = BLAS.get_num_threads()
    BLAS.set_num_threads(1)

    Threads.@threads for i ∈ eachindex(gammas)
        tim[i] = @elapsed models[i] = svmtrain(
            rand(10,100),   # For this example, random data
            rand(1:10,100),
            svmtype = SVC,
            kernel = Kernel.RadialBasis,
            gamma = gammas[i],
        )
    end
    BLAS.set_num_threads(prevBLASthreads)
    return models, minimum(tim)              # Let us just return the models and the best time as an example
end

@info "Starting"
out1, tim1 = gridsearch([.01; .05; .1; .5; .7])        # This runs
@info "First time done, fastest took $tim1"
out2, tim2 = gridsearch([.01; .05; .1; .5; .7])        # This crashes
@info "Second time done, fastest took $tim2"

I would highly appreciate some opinions on what is happening here. Thanks!

LIBSVM.jl is not thread safe.

This is because libsvm itself is not thread safe.
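If you want to stay in a single process anyway, one option is to serialize every entry into libsvm behind a lock; that makes the loop safe but also removes the parallel speedup for the training itself, so it mainly demonstrates the constraint. A minimal sketch of your loop (the lock name is mine):

const SVM_LOCK = ReentrantLock()

Threads.@threads for i ∈ eachindex(gammas)
    X, y = rand(10, 100), rand(1:10, 100)    # data preparation can still run in parallel
    lock(SVM_LOCK) do                        # only one task at a time may call into libsvm
        tim[i] = @elapsed models[i] = svmtrain(X, y;
            svmtype = SVC, kernel = Kernel.RadialBasis, gamma = gammas[i])
    end
end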

Thanks for the answer.
However, the problem also occurs with XGBoost.jl, which AFAIU should (or at least could) be thread safe:

https://docs.juliahub.com/XGBoost_jll/AJezb/1.7.1+1/autodocs/#XGBoost_jll.xgboost-Tuple{}

Here is a MWE:

using Random, XGBoost, LinearAlgebra.BLAS, Term

function gridsearch(nrounds)
    models = Vector(undef, length(nrounds))  # collector of models
    tim = zeros(length(nrounds))             # training times
    prevBLASthreads = BLAS.get_num_threads()
    BLAS.set_num_threads(1)

    Threads.@threads for i ∈ eachindex(nrounds)
        tim[i] = @elapsed models[i] = xgboost(
            (rand(100, 10), rand(1.0:10.0, 100)),  # 100 instances × 10 features; XGBoost treats rows as instances
            num_round=nrounds[i],
            verbosity = 1,
            watchlist=(;),
        )
    end
    BLAS.set_num_threads(prevBLASthreads)
    return models, minimum(tim)              # In the MWE, let us just return the models and the best time as an example
end

@info "Starting"
out1, tim1 = gridsearch([10; 50; 100; 250; 500]) # This runs
@info "First time done, fastest took $tim1"
importancereport(out1[1])

out2, tim2 = gridsearch([10; 50; 100; 250; 500]) # This crashes
@info "Second time done, fastest took $tim2"

It’s been a while since I’ve used MATLAB, but isn’t parfor distributed rather than threaded? I.e. the equivalent would be using Distributed and then divide your work among worker processes? (In which case you also wouldn’t have to worry about thread safety.)
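Roughly like this, as an untested sketch (the worker count and the random data are placeholders, and it assumes the trained models serialize back to the main process cleanly):

using Distributed
addprocs(4)                          # e.g. one worker per core you want to use

@everywhere using LIBSVM             # packages must be loaded on every worker

gammas = [.01; .05; .1; .5; .7]
results = pmap(gammas) do g          # each iteration runs in a separate process
    X, y = rand(10, 100), rand(1:10, 100)
    t = @elapsed model = svmtrain(X, y;
        svmtype = SVC, kernel = Kernel.RadialBasis, gamma = g)
    (model, t)                       # tuples are sent back to the main process
end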


I cannot reproduce the issue with XGBoost.jl with Julia 1.9.2 on Linux. What platform are you on?

(jl_ikzb83) pkg> st
Status `/tmp/jl_ikzb83/Project.toml`
  [22787eb5] Term v2.0.5
  [009559a3] XGBoost v2.3.2
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD FX(tm)-8350 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, bdver1)
  Threads: 8 on 8 virtual cores


I’m using a 10-core M1.

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores

with XGBoost v2.3.1, and also with XGBoost v2.3.2

Actually, I tried on a Linux machine and it works fine, so it seems the problem may be macOS-related.

Depends on the settings 🙂

I am not too familiar with Julia’s Distributed.
In this specific situation, within a CV split the data (a matrix X and a vector y) should ideally be shared between the processes: they are not modified by the training function and can be quite large, so copying them is unnecessary. Would something like this:

@distributed for i in eachindex(params)
    models[i] = train(X, y, params[i])
end

pass X and y by reference or copy them?

Distributed creates completely separate worker processes, so you have to make sure functions and data are available on each. There are SharedArrays for sharing data between workers on the same machine.
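A rough sketch of that pattern with your svmtrain example, assuming all workers live on one machine (SharedArrays are shared-memory only) and that the trained models serialize back cleanly:

using Distributed, SharedArrays
addprocs(4)
@everywhere using LIBSVM, SharedArrays

# Shared-memory arrays: workers read X and y without receiving their own copies
X = SharedMatrix{Float64}(10, 100)
y = SharedVector{Int}(100)
X .= rand(10, 100)
y .= rand(1:10, 100)

gammas = [.01; .05; .1; .5; .7]
models = pmap(gammas) do g
    # sdata returns the plain Array backed by the shared memory segment
    svmtrain(sdata(X), sdata(y); svmtype = SVC, kernel = Kernel.RadialBasis, gamma = g)
end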