The creation of FFTW plans is not thread-safe (see https://discourse.julialang.org/t/fftw-plans-for-multiple-threads/21861), or is it? Executing a pre-allocated plan, however, should be thread-safe. I want to pre-compute a matrix factorization (which builds FFTW plans internally) and then use it from multiple threads for matrix-vector multiplications (which use the FFT internally). I guess the issue is that the multiplication creates another plan internally.
In the example below, pre-computing the factorization speeds up the multiplication by about 2x in a single thread. However, when the pre-computed factorization is used naively in a multi-threaded context, the results are wrong.
The code below performs the same computation with all four combinations of “threaded / not threaded” and “with / without pre-computed factorization”. How can I parallelize the non-threaded code path while still pre-computing the factorization?
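For context, the FFTW pattern that I assumed would be safe, plan once and then execute the pre-built plan from several tasks on separate arrays, looks roughly like this (a minimal sketch, not part of the reproducer):

using FFTW

n = 4096
p = plan_rfft(zeros(Float64, n))   # plan creation: the part that may not be thread-safe
ys = Vector{Vector{ComplexF64}}(undef, Threads.nthreads())
Threads.@threads for i in 1:Threads.nthreads()
    x = rand(n)        # each task works on its own input buffer
    ys[i] = p * x      # execute the pre-built plan; no new plan should be created here
end

The full reproducer follows.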
using Random, SHA, Test
using ToeplitzMatrices, LinearAlgebra, FFTW
using Folds
"""
Compute the matrix-vector product T*x, round to integers, take the result mod 2, and return it as a vector of 0/1 Int8 values.
"""
function toep_mul_mod2(
    T::AbstractMatrix,
    x::Vec,
) where {Vec<:Union{BitVector,SubArray{Bool,N,BitMatrix} where N}}
    # The multiplication is slow for BitVector input, so we convert to a Vector{Bool} first.
    # (Using Vector{Bool} throughout would use too much memory in my application.)
    round.(Int64, T * Vector(x)) .% 2 .% Int8
end
"""
Compute the matrix-vector product T*x from a pre-computed factorization of T, round to integers,
take the result mod 2, and return it as a vector of 0/1 Int8 values.
Note: in a multi-threaded context this gives wrong results!
"""
function toep_mul_mod2(Tfac::ToeplitzMatrices.ToeplitzFactorization, x::Any)
    x = convert(Vector{Float64}, collect(x))  # the multiplication is slow for BitVector input, so convert to a dense Float64 vector
    # pre-allocate result
    result = zeros(Float64, length(Tfac.tmp) - length(x) + 1)   # This uses a dirty hack to get the output size from the Factorization object
    mul!(result, Tfac, x, 1.0, 1.0)  # This use seems intended but is not well-documented
    # Just doing `Tfac*x` is not implemented
    # See https://github.com/JuliaLinearAlgebra/ToeplitzMatrices.jl/issues/101
    round.(Int64, result) .% 2 .% Int8
end
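# --- Hedged usage sketch (not part of the original benchmark) ---------------
# In a single thread, the plain and the pre-factorized method should agree.
# The 3×4 matrix and the input vector below are made up purely for illustration.
let
    Tsmall = Toeplitz{Float64}([1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 1.0])  # 3×4 Toeplitz
    Tsmall_fac = LinearAlgebra.factorize(Tsmall)
    xsmall = BitVector([true, false, true, true])
    # Both calls return a Vector{Int8}; this is expected to print `true`.
    @show toep_mul_mod2(Tsmall, xsmall) == toep_mul_mod2(Tsmall_fac, xsmall)
end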
"""
Split `xbits` into sub-vectors of length `block_size`; for each block compute the matrix-vector
product T*x, round, take mod 2, then concatenate the per-block results into one BitVector.
"""
function toep_mul_mod2_subblocks(xbits, T; prefactorize, use_Folds, block_size = 4096)
    Tfac = LinearAlgebra.factorize(T)
    if use_Folds
        return Folds.mapreduce(
            vcat,
            eachcol(reshape(xbits, block_size, :)),
            init = BitVector(),
        ) do block
            toep_mul_mod2(prefactorize ? Tfac : T, block) .% Bool
        end
    else
        return mapreduce(
            vcat,
            eachcol(reshape(xbits, block_size, :)),
            init = BitVector(),
        ) do block
            toep_mul_mod2(prefactorize ? Tfac : T, block) .% Bool
        end
    end
end
check_hash(x::BitVector) =
    (@show sha256(reinterpret(UInt8, x.chunks)) .|> Char |> String) ==
    "\e\u91i4\u85«Î*Ú½Òhaã\x1eðEvä¡»\t\b§àË.ãr\x1fÅY"
function test_main()
    # generates a random Toeplitz matrix and long BitVector
    rng = MersenneTwister(42)
    T = Toeplitz{Float64}(
        [1.0; rand(rng, [0.0, 1.0], 1759)],
        [1.0; rand(rng, [0.0, 1.0], 4095)],
    )
    x = rand(rng, [0, 1], 4096 * 1000) .% Bool
    for prefactorize in [true, false]
        for use_Folds in [true, false]
            # Exception occurs only when using threading (Folds.jl) AND pre-factorizing the matrix
            # The `InexactError` happens because the result of the matrix-vector multiplication is wrong.
            if prefactorize && use_Folds
                @show @test_throws InexactError toep_mul_mod2_subblocks(
                    x,
                    T;
                    use_Folds = true,
                    prefactorize = true,
                ) |> check_hash
            else
                @test toep_mul_mod2_subblocks(x, T; use_Folds, prefactorize) |> check_hash
            end
        end
    end
end
test_main()
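One workaround I can think of (an untested sketch, assuming the wrong results come from shared mutable state inside the `ToeplitzFactorization`, e.g. its `tmp` buffer and/or the FFT plans it holds): give every task its own factorization, so nothing is shared between threads. The function name and the contiguous chunking below are my own invention; the output order is preserved because each task handles a contiguous range of blocks and the per-task results are concatenated in order.

function toep_mul_mod2_subblocks_tasklocal(xbits, T; block_size = 4096)
    blocks = collect(eachcol(reshape(xbits, block_size, :)))
    # contiguous chunks of blocks, one chunk per task
    ranges = collect(Iterators.partition(eachindex(blocks),
                                         cld(length(blocks), Threads.nthreads())))
    results = Vector{BitVector}(undef, length(ranges))
    Threads.@threads for i in eachindex(ranges)
        Tfac_local = LinearAlgebra.factorize(T)   # one factorization per task
        results[i] = mapreduce(vcat, blocks[ranges[i]]; init = BitVector()) do block
            toep_mul_mod2(Tfac_local, block) .% Bool
        end
    end
    reduce(vcat, results; init = BitVector())
end

If shared state really is the culprit, an alternative would be to keep a single factorization and wrap the call to `toep_mul_mod2(Tfac, block)` in a `ReentrantLock`, at the cost of serializing the multiplications.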