Thread safety of `mul!` in `ToeplitzMatrices.jl` with FFTW plan inside Factorization

The creation of FFTW plans is not threadsafe https://discourse.julialang.org/t/fftw-plans-for-multiple-threads/21861. (Or is it?) However, if the plan is pre-allocated, that should be thread safe? I want to pre-compute a matrix factorization (that uses FFTW internally) and use it in multiple threads to do multiplications (which use FFT internally). I guess the issue is that the multiplication makes another plan internally.

In this example, it seems that pre-computing the factorization speeds up the multiplication by about 2x (in a single thread). However, when naively used in a multi-threaded context, the results are wrong.

The code below performs the same computation with all combinations of “threaded/not-threaded” and “with/without pre-computing factorization”. How to parallelize the non-threaded code-path with pre-factorization?

using Random, SHA, Test
using ToeplitzMatrices, LinearAlgebra, FFTW
using Folds

"""
Computes matrix vector product T*x, rounds numbers, takes mod2 and returns BitVector
"""
function toep_mul_mod2(
    T::AbstractMatrix,
    x::Vec,
) where {Vec<:Union{BitVector,SubArray{Bool,N,BitMatrix} where N}}
    # slow for BitVector input, so we reallocate a Vector{Bool}.
    # Using Vector{Bool} throughout uses too much memory in my application
    round.(Int64, T * Vector(x)) .% 2 .% Int8
end

"""
Computes matrix-vector-product T*x, rounds result, take mod2 and returns BitVector
Takes as input a factorization of a matrix.
Note: In a multi-threaded context, this gives wrong results!
"""
function toep_mul_mod2(Tfac::ToeplitzMatrices.ToeplitzFactorization, x::Any)
    x = convert(Vector{Float64}, collect(x))  # slow for BitVector input, so we reallocate a Vector{Bool}.

    # pre-allocate result
    result = zeros(Float64, length(Tfac.tmp) - length(x) + 1)   # This uses a dirty hack to get the output size from the Factorization object

    mul!(result, Tfac, x, 1.0, 1.0)  # This use seems intended but is not well-documented
    # Just doing `Tfac*x` is not implemented
    # See https://github.com/JuliaLinearAlgebra/ToeplitzMatrices.jl/issues/101

    round.(Int64, result) .% 2 .% Int8
end

"""
For each sub-vector of length `block_size`, compute matrix-vector-product T*x, 
round result, take mod2, concatenate results again and return BitVector
"""
function toep_mul_mod2_subblocks(xbits, T; prefactorize, use_Folds, block_size = 4096)

    Tfac = LinearAlgebra.factorize(T)

    if use_Folds
        return Folds.mapreduce(
            vcat,
            eachcol(reshape(xbits, block_size, :)),
            init = BitVector(),
        ) do block
            toep_mul_mod2(prefactorize ? Tfac : T, block) .% Bool
        end
    else
        return mapreduce(
            vcat,
            eachcol(reshape(xbits, block_size, :)),
            init = BitVector(),
        ) do block
            toep_mul_mod2(prefactorize ? Tfac : T, block) .% Bool
        end
    end
end

check_hash(x::BitVector) =
    (@show sha256(reinterpret(UInt8, x.chunks)) .|> Char |> String) ==
    "\e\u91i4\u85«Î*Ú½Òhaã\x1eðEvä¡»\t\b§àË.ãr\x1fÅY"

function test_main()
    # generates a random Toeplitz matrix and long BitVector
    rng = MersenneTwister(42)
    T = Toeplitz{Float64}(
        [1.0; rand(rng, [0.0, 1.0], 1759)],
        [1.0; rand(rng, [0.0, 1.0], 4095)],
    )
    x = rand(rng, [0, 1], 4096 * 1000) .% Bool

    for prefactorize in [true, false]
        for use_Folds in [true, false]
            # Exception occurs only when using threading (Folds.jl) AND pre-factorizing the matrix
            # The `InexactError` happens because the result of the matrix-vector multiplication is wrong.
            if prefactorize && use_Folds
                @show @test_throws InexactError toep_mul_mod2_subblocks(
                    x,
                    T;
                    use_Folds = true,
                    prefactorize = true,
                ) |> check_hash
            else
                @test toep_mul_mod2_subblocks(x, T; use_Folds, prefactorize) |> check_hash
            end

        end
    end
end

test_main()

Yes. Plan execution is thread-safe. But I don’t know how ToeplitzMatrices uses FFTW.

1 Like

Thanks for the response!

I guess, the problem might be the copyto! in the mul! function (ToeplitzMatrices.jl/linearalgebra.jl at 63c9ae4e0c697d745a6a454608ffe40d4b1f2118 · JuliaLinearAlgebra/ToeplitzMatrices.jl · GitHub), which performs the multiplications. It seems to modify the factorization object. I guess it’s simply that this mul! method is not thread safe. I wish there was a way to tell which methods or libraries are thread-safe, but that’s a big ask.

This leaves me with the options of giving up, or trying to figure out how to make ToeplitzMatrices.jl thread-safe, which might not even be possible without a big performance hit.

This is the function in question:

function mul!(
    y::StridedVector, A::ToeplitzFactorization, x::StridedVector, α::Number, β::Number
)
    n = length(x)
    m = length(y)
    vcvr_dft = A.vcvr_dft
    N = length(vcvr_dft)
    if m > N || n > N
        throw(DimensionMismatch(
            "Toeplitz factorization does not match size of input and output vector"
        ))
    end

    T = Base.promote_eltype(y, A, x, α, β)
    tmp = A.tmp
    dft = A.dft
    @inbounds begin
        copyto!(tmp, 1, x, 1, n)
        for i in (n+1):N
            tmp[i] = zero(eltype(tmp))
        end
        mul!(tmp, dft, tmp)
        for i in 1:N
            tmp[i] *= vcvr_dft[i]
        end
        dft \ tmp
        if iszero(β)
            for i in 1:m
                y[i] = α * maybereal(T, tmp[i])
            end
        else
            for i in 1:m
                y[i] = muladd(α, maybereal(T, tmp[i]), β * y[i])
            end
        end
    end

    return y
end

It re-uses the same A.tmp buffer, which is not thread-safe.

1 Like

Am I right to assume the copyto! is the problem because it modifies the buffer? Reading the buffers should be fine, right?

What I tried now is to modify the mul! function by introducing an option not to reuse the buffer:

function mul!(
    y::StridedVector, A::ToeplitzFactorization, x::StridedVector, α::Number, β::Number; dont_modify_A=false,
)
# ...

    if dont_modify_A
        tmp = copy(A.tmp)
    else
        tmp = A.tmp
    end

# ...

And then it seems to work correctly. However, I’d rather be sure that it has to work correctly…

The copyto! and the mul!. The point is that you can’t have multiple threads writing to the same temporary buffer in parallel.

1 Like

Thanks! It seems the copy operation I put in solves the problem, as I don’t think anything else modifies contents of the factorization.

I guess the package maintainers wouldn’t want to implement thread safety for this feature but I can always use my modified version.