FoldsCUDA not working with simple reduction

I was trying to find the largest error between two matrix columns and I wanted to try parallelizing it on the GPU:

using FLoops
using CUDA
using FoldsCUDA

#To loop through all column pairs
allpairs(v) = ((i,j) for j in v for i in v if i > j)

function maxScore(data::CuArray{T}) where T
    @floop CUDAEx() for (i,j) in allpairs(axes(data,2))
        X = view(data,:,i)
        Y = view(data,:,j)
        currentScore = sum(abs2,X-Y)

        # Keep the largest score seen so far
        @reduce() do (bestScore = zero(T); currentScore)
            if bestScore < currentScore
                bestScore = currentScore
            end
        end
    end
    return bestScore
end

When I run this function I get some odd errors about InvalidIRError and an unsupported dynamic function invocation (call to print_to_string(xs...)), which doesn’t make sense to me. Does anyone see the problem I’m missing? The data I was using as input was just data = CUDA.rand(100,100)

That probably means some error-handling code is being compiled, which often does I/O to a buffer because of string interpolation in the error message. Inspect the backtrace and try to avoid the code path leading to it. Worst case, this may need additional “quirks” that disable the offending string interpolation.
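One way to avoid such a code path, sketched under an assumption: if the backtrace points at the `X-Y` subtraction (which allocates a temporary array and can pull in error-throwing code), the squared distance can instead be accumulated element by element. The helper name `sqdist` below is hypothetical and not part of FLoops, CUDA, or FoldsCUDA. The sketch is checked on an ordinary CPU matrix only; whether it is enough to make the `CUDAEx()` loop compile has not been verified here.

```julia
# Hypothetical helper: sum of squared differences between columns i and j,
# computed with scalar indexing so no temporary array is allocated.
function sqdist(data::AbstractMatrix, i, j)
    s = zero(eltype(data))
    for k in axes(data, 1)
        d = data[k, i] - data[k, j]
        s += d * d
    end
    return s
end

# CPU check against the original allocating expression
A = [1.0 4.0; 2.0 6.0]
println(sqdist(A, 1, 2))   # columns [1, 2] and [4, 6]
```

Inside the loop body, `currentScore = sqdist(data, i, j)` would then stand in for the two views and `sum(abs2,X-Y)`.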