Threading with @views and @tensor

Hello, I am working on a code where array views are used in tensor contractions. I had written a similar one before and I was able to make it fast with the threading tools available in Julia, namely @sync and @spawn.

However, this new one is a bit more complex, with more @views and @tensor calls. I have reduced it as much as possible to show the example here:

# (assumes `using TensorOperations` for @tensor)
function example(T2, A)
    o = size(T2, 1)
    v = size(T2, 4)

    # Pre-allocate intermediate arrays, one per thread
    Ws = [Array{Float64}(undef, v, v, v) for i in 1:Threads.nthreads()]
    Evals = zeros(Threads.nthreads())

    @sync for i in 1:o
        Threads.@spawn begin

            id = Threads.threadid()
            W = Ws[id]

            @views V = A[i, :, :, :]
            for j in 1:i
                for k in 1:j

                    @views G = T2[k, j, :, :]

                    @tensor W[a,b,c]  =  V[a,b,d]*G[c,d]
                    @tensor W[a,b,c] -=  2*V[a,b,d]*G[d,c]

                    Evals[id] += sum(W)
                end
            end
        end
    end

    Et = sum(Evals)
    println("Final (T) contribution: $Et")
end

The problem is that this code is not stable. I believe there is some race condition that I cannot figure out: each time the code is run, a different result is returned (for the same T2 and A). Is it safe to use @views?

Note: if I remove the second @tensor call, the results become consistent up to 15 decimal places.

Using Threads.@threads instead works fine.

Personally, I have never heard of views not being thread-safe, but there was a recent post that mentions thread-unsafe behaviour with pre-allocated arrays that are indexed by threadid() whenever a “yield” is involved:

At first glance, this doesn't seem to be the case here, but maybe there is something weird happening under the hood in the @tensor macro?
You can find out whether your issue is connected with this by allocating W inside the loop (just for testing, of course).

Likewise, if you really suspect @views of causing the issue, you could temporarily remove the macro and write the indices explicitly to see if the problem remains.

It's curious that removing the second @tensor call seems to remove the thread-unsafety; it might be worth trying to write the loops over a, b, c explicitly.
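For reference, here is a sketch of what a plain-loop equivalent of the two @tensor calls from the original post would look like (the small sizes and random inputs are just for illustration, not the OP's data):

```julia
# Explicit-loop sketch of the combined contractions:
#   W[a,b,c]  =  V[a,b,d]*G[c,d]
#   W[a,b,c] -= 2*V[a,b,d]*G[d,c]
v = 3
V = rand(v, v, v)
G = rand(v, v)
W = zeros(v, v, v)
for a in 1:v, b in 1:v, c in 1:v
    acc = 0.0
    for d in 1:v
        # accumulate both terms in one pass over the contracted index d
        acc += V[a, b, d] * (G[c, d] - 2 * G[d, c])
    end
    W[a, b, c] = acc
end
```

Writing it out like this makes it easy to check whether the macro, rather than the threading, is responsible for the instability.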

Hopefully one of these ideas helps you :slight_smile:

First of all, let me repeat what Salmon just mentioned (thanks!): do not use Evals[threadid()] for reduction and especially do not use it with @spawn. It probably works somewhat OK for now but there is no guarantee that it works in the future.
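As an illustration of the safer alternative: have each task return its own partial result and combine the fetched values in the caller, so no shared per-thread slot is ever written. The `partial_sums` function and its chunked workload below are hypothetical stand-ins for the contraction loop in the OP:

```julia
# Sketch: each spawned task owns its partial sum; nothing is indexed
# by threadid(). `partial_sums` is a hypothetical simplified workload.
function partial_sums(xs, ntasks)
    chunks = collect(Iterators.partition(xs, cld(length(xs), ntasks)))
    tasks = [Threads.@spawn sum(chunk) for chunk in chunks]
    # Combining fetched results in task-creation order gives a grouping
    # that depends only on the chunking, not on OS-thread scheduling.
    return sum(fetch, tasks)
end
```

With this pattern the final reduction is deterministic for a fixed number of chunks, regardless of which thread runs which task.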

(Having said that, technically speaking, I don’t think the program in the OP has a data race.)

This is likely because sum(Evals) evaluates the sum with a different “grouping”/parenthesisation on each run. Let s(i) denote the contribution accumulated into Evals by the task for outer iteration i. Since Julia can schedule the tasks on arbitrary OS threads, there are multiple possible groupings:

# o == 4
# nthreads() == 3

Evals == [(s(1) + s(2)) + s(3), s(4), 0]
sum(Evals) == ((s(1) + s(2)) + s(3)) + s(4)

Evals == [s(1), s(2) + s(3), s(4)]
sum(Evals) == (s(1) + (s(2) + s(3))) + s(4)

Evals == [s(1), s(2), s(3) + s(4)]
sum(Evals) == (s(1) + s(2)) + (s(3) + s(4))

Since + on floats is not strictly associative, these results can be different.
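A minimal demonstration of that non-associativity:

```julia
# Floating-point + is commutative but not associative: grouping matters.
a, b, c = 1.0, 1e16, -1e16

left  = (a + b) + c   # 1.0 is rounded away when added to 1e16, so this is 0.0
right = a + (b + c)   # b + c is exactly 0.0, so this is 1.0
```

The same values summed in two groupings give 0.0 and 1.0.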

By the way, this is exactly why JuliaFolds packages like FLoops.jl and Folds.jl are very strict about deterministic reduction. By default, these packages compute a reproducible answer (for well-behaving user functions) even with non-exact associative functions like addition on floats.


Thank you. That is something I had never considered. How big can the error due to this associativity issue be? I was seeing pretty big variation, like ~0.01.
If I drop the + operation and just store the values, say as:

Evals = zeros(Threads.nthreads(), o)
Evals[id, i] = sum(W)

Does it still have the problem, just because of sum(W)?

julia> using Random

julia> a = rand(10^3);

julia> sum(a)

julia> sum(shuffle(a))

Very big: see the thread "Array ordering and naive summation" (Stefan Karpinski demonstrated that you can get any value in [0, 2^970) by summing the same 2046 values in different orders).

Hmm… This is tricky, but I think it's fine. (It's tricky because it would have the same problem if it were using the transpose layout Evals[i, id].)


Actually, since sum uses pairwise summation, it can still produce different results if an nthreads()-wide chunk of Evals crosses the base case of sum. I was assuming a naive serial summation.
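To illustrate the difference between pairwise and strictly serial summation (a sketch; the exact discrepancy depends on the data):

```julia
# Base.sum uses pairwise (blocked) summation; foldl(+, xs) is strict
# left-to-right accumulation. The two group the additions differently,
# so for long float vectors they typically disagree in the last few bits,
# while still agreeing to roughly machine precision overall.
xs = rand(10_000)
pairwise_result = sum(xs)
serial_result   = foldl(+, xs)
```

Both groupings are deterministic for a fixed array, which is why storing per-iteration values and summing them once at the end gives a reproducible answer.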