FoldsCUDA not working with simple reduction

I was trying to find the largest squared error between any pair of columns of a matrix, and I wanted to try parallelizing it on the GPU:

using FLoops
using CUDA
using FoldsCUDA

#To loop through all column pairs
allpairs(v) = ((i,j) for j in v for i in v if i > j)

function maxScore(data::CuArray{T}) where T

    @floop CUDAEx() for (i,j) in allpairs(axes(data,2))

        X = view(data,:,i)
        Y = view(data,:,j)
        currentScore = sum(abs2,X-Y)

        @reduce() do (bestScore = zero(T); currentScore)
            if bestScore < currentScore
                bestScore = currentScore
            end
        end
    end

    return bestScore
end

When I run this function I get some odd errors about InvalidIRError and an unsupported dynamic function invocation (call to print_to_string(xs...)), which doesn’t make sense to me. Does anyone see the problem I’m missing? The input I was using was just data = CUDA.rand(100,100).

That probably means some error-handling code is being compiled for the GPU. Such code often does I/O to a buffer because of string interpolation in the error message, which is most likely where the dynamic call to print_to_string comes from. Inspect the backtrace to see what triggers it and try to avoid the code path leading to it. Worst case, this may need additional “quirks” (https://github.com/JuliaGPU/CUDA.jl/blob/master/src/device/quirks.jl) that disable the offending string interpolation.
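
One code path worth ruling out, as a guess rather than something read off the backtrace: X-Y builds a temporary array on every iteration, and allocating arrays is generally not possible inside GPU device code. A minimal sketch of an allocation-free replacement is below. The helper name coldist2 is hypothetical (it is not part of FoldsCUDA or CUDA.jl), and it is only checked on the CPU here:

```julia
# Hypothetical helper: squared distance between columns i and j of a matrix.
# It uses only scalar indexing, so no temporary array is created.
function coldist2(data::AbstractMatrix{T}, i, j) where T
    acc = zero(real(T))
    for k in axes(data, 1)
        acc += abs2(data[k, i] - data[k, j])
    end
    return acc
end

# CPU check against the expression used in the original loop body
A = rand(Float32, 100, 100)
@assert coldist2(A, 1, 2) ≈ sum(abs2, view(A, :, 1) - view(A, :, 2))
```

Inside the @floop body, currentScore = coldist2(data, i, j) would then stand in for the two views and the sum. Whether that alone is enough for the loop to compile under CUDAEx() is untested here, so the backtrace is still the thing to check first.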