Multithreaded LIBSVM and XGBoost crashing

Hello,

I would like to perform cross-validation and a grid search with SVM (and XGBoost), using a parameter grid defined for each method.
In MATLAB, within each split, I can run parfor, and each thread trains some models on the data with specific parameter combinations.
In Julia, using LIBSVM.jl and XGBoost.jl, I would like to run a Threads.@threads for loop to do the same. However, this crashes with segfaults. I have also tried MLJScikitLearnInterface, with similar outcomes.
Interestingly, as you can see from the MWE below, the error only occurs if the function is called twice.

Note: Before the Threads.@threads loop, I set the number of BLAS threads to 1, so that each Julia thread runs its BLAS calls single-threaded and the cores are not oversubscribed. Moreover, each thread saves the model and the timing into preallocated arrays, at the index of the parameter combination it used, so AFAIU there should be no data race or Threads.threadid() issues.

A MWE would be the following:

using Random, LIBSVM, LinearAlgebra.BLAS

function gridsearch(gammas)
    models = Vector(undef, length(gammas))   # collector of models
    tim = zeros(length(gammas))              # training times
    prevBLASthreads = BLAS.get_num_threads()
    BLAS.set_num_threads(1)

    Threads.@threads for i ∈ eachindex(gammas)
        tim[i] = @elapsed models[i] = svmtrain(
            rand(10,100),   # For this example, random data
            rand(1:10,100),
            svmtype = SVC,
            kernel = Kernel.RadialBasis,
            gamma = gammas[i],
        )
    end
    BLAS.set_num_threads(prevBLASthreads)
    return models, minimum(tim)              # Let us just return the models and the best time as an example
end

@info "Starting"
out1, tim1 = gridsearch([.01; .05; .1; .5; .7])        # This runs
@info "First time done, fastest took $tim1"
out2, tim2 = gridsearch([.01; .05; .1; .5; .7])        # This crashes
@info "Second time done, fastest took $tim2"

I would highly appreciate some opinions on what is happening here. Thanks!

LIBSVM.jl is not thread safe.

This is because libsvm itself is not thread safe.
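If you want to stay in a single process anyway, one option is to serialize every entry into libsvm behind a lock; that makes the loop safe but also removes the parallel speedup for the training itself, so it mainly demonstrates the constraint. A minimal sketch of your loop (the lock name is mine):

const SVM_LOCK = ReentrantLock()

Threads.@threads for i ∈ eachindex(gammas)
    X, y = rand(10, 100), rand(1:10, 100)    # data preparation can still run in parallel
    lock(SVM_LOCK) do                        # only one task at a time may call into libsvm
        tim[i] = @elapsed models[i] = svmtrain(X, y;
            svmtype = SVC, kernel = Kernel.RadialBasis, gamma = gammas[i])
    end
end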

Thanks for the answer.
However, the problem also occurs with XGBoost.jl, which AFAIU should (or at least could) be thread safe:

https://docs.juliahub.com/XGBoost_jll/AJezb/1.7.1+1/autodocs/#XGBoost_jll.xgboost-Tuple{}

Here is a MWE:

using Random, XGBoost, LinearAlgebra.BLAS, Term

function gridsearch(nrounds)
    models = Vector(undef, length(nrounds))  # collector of models
    tim = zeros(length(nrounds))             # training times
    prevBLASthreads = BLAS.get_num_threads()
    BLAS.set_num_threads(1)

    Threads.@threads for i ∈ eachindex(nrounds)
        tim[i] = @elapsed models[i] = xgboost(
            (rand(100, 10), rand(1.0:10.0, 100)),  # 100 instances × 10 features; XGBoost treats rows as instances
            num_round=nrounds[i],
            verbosity = 1,
            watchlist=(;),
        )
    end
    BLAS.set_num_threads(prevBLASthreads)
    return models, minimum(tim)              # In the MWE, let us just return the models and the best time as an example
end

@info "Starting"
out1, tim1 = gridsearch([10; 50; 100; 250; 500]) # This runs
@info "First time done, fastest took $tim1"
importancereport(out1[1])

out2, tim2 = gridsearch([10; 50; 100; 250; 500]) # This crashes
@info "Second time done, fastest took $tim2"

It’s been a while since I’ve used MATLAB, but isn’t parfor distributed rather than threaded? I.e. the equivalent would be using Distributed and then divide your work among worker processes? (In which case you also wouldn’t have to worry about thread safety.)
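Roughly like this, as an untested sketch (the worker count and the random data are placeholders, and it assumes the trained models serialize back to the main process cleanly):

using Distributed
addprocs(4)                          # e.g. one worker per core you want to use

@everywhere using LIBSVM             # packages must be loaded on every worker

gammas = [.01; .05; .1; .5; .7]
results = pmap(gammas) do g          # each iteration runs in a separate process
    X, y = rand(10, 100), rand(1:10, 100)
    t = @elapsed model = svmtrain(X, y;
        svmtype = SVC, kernel = Kernel.RadialBasis, gamma = g)
    (model, t)                       # tuples are sent back to the main process
end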


I cannot reproduce the issue with XGBoost.jl with Julia 1.9.2 on Linux. What platform are you on?

(jl_ikzb83) pkg> st
Status `/tmp/jl_ikzb83/Project.toml`
  [22787eb5] Term v2.0.5
  [009559a3] XGBoost v2.3.2
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD FX(tm)-8350 Eight-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, bdver1)
  Threads: 8 on 8 virtual cores


I’m using a 10-core M1.

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores

with XGBoost v2.3.1, and also with XGBoost v2.3.2

Actually, I tried on a Linux machine and it works fine, so it seems the problem may be macOS-related.

Depends on the settings 🙂

I am not too familiar with Julia’s Distributed.
In this specific situation, within a CV split the data (a matrix X and a vector y) should ideally be shared between the processes: they are not modified by the training function and can be quite large, so copying them is unnecessary. Would something like this:

@distributed for i in eachindex(params)
    models[i] = train(X, y, params[i])
end

pass X and y by reference or copy them?

Distributed creates completely separate worker processes, so you have to make sure functions and data are available on each. There are SharedArrays for sharing data between workers on the same machine.
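A rough sketch of that pattern with your svmtrain example, assuming all workers live on one machine (SharedArrays are shared-memory only) and that the trained models serialize back cleanly:

using Distributed, SharedArrays
addprocs(4)
@everywhere using LIBSVM, SharedArrays

# Shared-memory arrays: workers read X and y without receiving their own copies
X = SharedMatrix{Float64}(10, 100)
y = SharedVector{Int}(100)
X .= rand(10, 100)
y .= rand(1:10, 100)

gammas = [.01; .05; .1; .5; .7]
models = pmap(gammas) do g
    # sdata returns the plain Array backed by the shared memory segment
    svmtrain(sdata(X), sdata(y); svmtype = SVC, kernel = Kernel.RadialBasis, gamma = g)
end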