Getting different result from serial and multithreaded code in Julia

Most of the macros here don’t make sense. You only need a single @inbounds on the outer loop. @simd should be used only on the inner loop, and only when the code inside is simple.

Also @fastmath for creating a view doesn’t really do much.

I would recommend removing all of these macros and testing your code again.

The problem is not in my serial code; the main problem is in my multithreaded code. Removing those macros does not fix the problem in the multithreaded code.

Can you explain why the codes executed in the inner loops are different?

For both you and us, it would be helpful if you could create a minimal working example that demonstrates the problem. It’s very difficult to debug code that we cannot execute. Rather than giving us large parts of your code, try to reduce the problem down to the simplest demonstration of it. This often helps me solve the problem myself.

One thing I am worried about here is the order of the additions of many floating point values, especially if they vary over a wide range of magnitudes.

julia> 0.1 + 0.1 + 100000.
100000.2

julia> 100000. + 0.1 + 0.1
100000.20000000001

The order that you add floating point values can matter.
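A threaded reduction effectively splits the work into per-task partial sums, which is a different addition order than a serial left-to-right loop. A small sketch (with made-up data) of how that alone perturbs the last bits:

```julia
# One large value plus many small ones: summing left-to-right vs. in
# chunks (as a threaded reduction effectively does) rounds differently.
xs = [1e8; fill(0.1, 10_000)]

serial  = foldl(+, xs)                               # strict left-to-right order
chunked = sum(sum, Iterators.partition(xs, 1_000))   # partial sums, then combined

println(serial == chunked)   # the two orders typically disagree in the last bits
println(serial ≈ chunked)    # but isapprox still considers them equal
```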


when
@inbounds @simd for i in 1:nocc
    @inbounds @simd for j in 1:i

then

i , j = 1 , 1
i , j = 2 , 1
i , j = 2 , 2
i , j = 3 , 1
i , j = 3 , 2
i , j = 3 , 3
i , j = 4 , 1
i , j = 4 , 2
i , j = 4 , 3
i , j = 4 , 4
i , j = 5 , 1
i , j = 5 , 2
i , j = 5 , 3
i , j = 5 , 4
i , j = 5 , 5

when
@inbounds @simd for i in 1:nocc
    @inbounds @simd for j in 1:nocc

then

i , j = 1 , 1
i , j = 1 , 2
i , j = 1 , 3
i , j = 1 , 4
i , j = 1 , 5
i , j = 2 , 1
i , j = 2 , 2
i , j = 2 , 3
i , j = 2 , 4
i , j = 2 , 5
i , j = 3 , 1
i , j = 3 , 2
i , j = 3 , 3
i , j = 3 , 4
i , j = 3 , 5
i , j = 4 , 1
i , j = 4 , 2
i , j = 4 , 3
i , j = 4 , 4
i , j = 4 , 5
i , j = 5 , 1
i , j = 5 , 2
i , j = 5 , 3
i , j = 5 , 4
i , j = 5 , 5

Only the triangle indices (j ≤ i) are visited by the loop; the remaining elements are generated from the condition below.

                        if i!=j
                            t2new[j,i,:,:] += transpose(temp)
                        end

I thought this would make the for loop smaller and the code faster.
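The triangular pattern described above can be sketched in isolation (stand-in data, not the real tensors): compute each (i, j) block once for j ≤ i, and mirror its transpose into the (j, i) slot.

```julia
nocc, nvir = 5, 2
t2new = zeros(nocc, nocc, nvir, nvir)
for i in 1:nocc, j in 1:i
    temp = fill(Float64(10i + j), nvir, nvir)   # stand-in for the computed block
    t2new[i, j, :, :] += temp                   # block computed once for j ≤ i
    if i != j
        t2new[j, i, :, :] += transpose(temp)    # mirrored into the other triangle
    end
end
```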

Why does one loop have j in 1:i and the other j in 1:nocc?


@simd should only be used on an inner loop, and only when it is safe to re-order the operations and the inner operation is simple. See the help (?@simd) for more info on it.
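For reference, a minimal sketch of where @simd is appropriate (hypothetical function name): a simple, reorderable innermost reduction.

```julia
# @simd on a simple innermost loop: the reduction may be re-ordered,
# so the compiler is free to vectorize it.
function simd_dot(x, y)
    s = 0.0
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

simd_dot(ones(100), ones(100))   # 100.0
```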

Basically, I am calling a Julia program in my python code. Here’s a minimum working executable example.

multithreading_tensor.jl

using Distributed
using TensorOperations,SharedArrays
using Base.Threads
using PyCall
using IterTools
using Einsum,LinearAlgebra,LoopVectorization
using DependencyWalker, LibSSH2_jll
using HDF5

println("number of threads ", nthreads())

function build_temp(t1,nocc,nvir,Lov_df,Lvv_df,tau)
    nocc, nvir = size(t1)
    tempf = zeros(nvir,nvir)
    Lov_T120 = permutedims(Lov_df,(2,3,1))
    local t2new_temp = Array{Float64,4}(undef,(nocc,nocc,nvir,nvir))
    local temp = Array{Float64,2}(undef,(nvir,nvir))
    temps = [zeros(nvir,nvir) for i = 1:Threads.nthreads()]
    Threads.@threads for i in 1:nocc
        @inbounds begin
            id = Threads.threadid()
            temp = temps[id]
            for j in 1:nocc
                @views tau_ij_ = tau[i,j,:,:]
                @tensor begin
                    temp[a,b] += (-t1[k,b]*Lov_T120[k,d,t]*tau_ij_[c,d]*Lvv_df[t,a,c])
                    temp[a,b] -= t1[k,a]*Lov_T120[k,c,t]*tau_ij_[c,d]*Lvv_df[t,b,d]
                    temp[a,b] += Lvv_df[t,a,c]*tau_ij_[c,d]*Lvv_df[t,b,d]
                end
            end
        end
    end #sync
    temp_final= sum(temps)
    println("temp from multithreaded code ")
    println(temp_final)
    temp_final
end

function build_temp_serial(t1,nocc,nvir,Lov_df,Lvv_df,tau)
    nocc, nvir = size(t1)
    Lov_T120 = permutedims(Lov_df,(2,3,1))
    t2new_temp = Array{Float64,4}(undef,(nocc,nocc,nvir,nvir))
    temp = zeros(nvir,nvir)  # must start at zero, since the loop accumulates with +=
    for i in 1:nocc
        @inbounds begin
            for j in 1:nocc
                @views tau_ij_ = tau[i,j,:,:]
                @tensor begin
                    temp[a,b] += (-t1[k,b]*Lov_T120[k,d,t]*tau_ij_[c,d]*Lvv_df[t,a,c])
                    temp[a,b] -= t1[k,a]*Lov_T120[k,c,t]*tau_ij_[c,d]*Lvv_df[t,b,d]
                    temp[a,b] += Lvv_df[t,a,c]*tau_ij_[c,d]*Lvv_df[t,b,d]
                end
            end
        end
    end #sync
    println("temp from serial code ")
    println(temp)
    temp
end

pycall_multhread.py

import numpy as np

t1 = np.random.rand(5,2)
nocc,nvir = t1.shape
Lov_df = np.random.rand(60,5,2)
Lvv_df = np.random.rand(60,2,2)
tau = np.random.rand(5,5,2,2)

from julia import Main
Main.include("./multithreading_tensor.jl")
temp_multithread = Main.build_temp(t1,nocc,nvir,Lov_df,Lvv_df,tau)
temp = Main.build_temp_serial(t1,nocc,nvir,Lov_df,Lvv_df,tau)

#print(np.array_equal(temp_multithread, temp))
print(np.array(temp_multithread) == np.array(temp))

output is this:

number of threads 12
temp from multithreaded code 
[-5326.715913540711 -4519.379019415272; -4521.8646523288735 -3232.290541631156]
temp from serial code 
[-5326.715913540712 -4519.379019415272; -4521.864652328874 -3232.290541631155]
[[False  True]
 [False False]]

These outputs are the same up to re-ordering of the floating-point math, so they should be considered equivalent. Julia’s isapprox (or ≈) would report them as essentially equal; NumPy has np.allclose and np.isclose for the same purpose. Differences like this are very common with floating point arithmetic.
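For example, binding the two matrices printed above to a and b and comparing:

```julia
a = [-5326.715913540711  -4519.379019415272; -4521.8646523288735 -3232.290541631156]
b = [-5326.715913540712  -4519.379019415272; -4521.864652328874  -3232.290541631155]

a == b   # false: three entries differ in the last bit or two
a ≈ b    # true: isapprox(a, b) with the default relative tolerance
```

NumPy’s analogue would be np.allclose(a, b).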


This was a real surprise. Is this something new? Because I’m quite sure I remember seeing ‘official-seeming’ tutorials (don’t remember where) showing threadid used for this. I thought it was the canonical solution.

Is there any drop-in replacement?

See these discussions:

One solution appears to be using FLoops.jl; see the FLoops FAQ.

Basically, when I implement the function build_temp in the full code that I provided, it gives me the wrong result. In my MWE the input matrices (t1, Lov_df, Lvv_df, tau) are just random numbers, but in the original program these matrices are different and give significantly different results in comparison with the serial code.

The suggestion to avoid that pattern is old, as pointed out in the FLoops docs, but what’s new, AFAIU, is that since 1.7 tasks can effectively migrate between threads: Julia 1.7 says it can switch the thread your task is on. How often does that happen, and how can it be disabled?

I don’t know of any drop-in replacement other than executing the threaded loop at a higher level and defining a chunk index independent of the thread (which is what the ChunkSplitters.jl package I mentioned above does, and which I’m using to avoid this issue in simple parallel loops like this one).
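A minimal sketch of that chunk-index pattern (hypothetical chunked_sum, using plain Iterators.partition instead of ChunkSplitters.jl): the accumulator is indexed by the chunk, and each chunk belongs to exactly one loop iteration, so task migration cannot make two tasks share a buffer.

```julia
using Base.Threads

function chunked_sum(xs; nchunks = nthreads())
    chunks = collect(Iterators.partition(eachindex(xs), cld(length(xs), nchunks)))
    partials = zeros(length(chunks))
    @threads for c in eachindex(chunks)   # c indexes the chunk, not the thread
        for i in chunks[c]
            partials[c] += xs[i]          # each task owns exactly one slot
        end
    end
    return sum(partials)
end

chunked_sum(collect(1.0:100.0))   # 5050.0
```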

I am not sure how we can proceed given that your MWE doesn’t have the issues you have described. Have you tried running your real input matrices through the functions as given in the MWE and then comparing the outputs?


Good point.

The missing + seems important.

Thank you for the suggestion. Now my temp matches the serial code. I thought @tensor and @tensoropt didn’t make any difference.

Thank you for your responses. Now suppose I want to update the array t2new with t2new[i,j,:,:] += temp. But again I am getting a different t2new from the serial and multithreaded code.

Here is the GitHub link to my code (MWE): https://github.com/Sudipta-code/ccsd_julia.git

Fork this repository and run: python3 pycall_multhread.py. I have stored all the arrays in HDF5 files (e.g. Loo_df.h5, Lov_df.h5, Lvv_df.h5, t1.h5, t2new.h5, tau.h5).

Following is the julia code multithreading_tensor.jl which I am calling from a python code named pycall_multhread.py.

using Distributed
using TensorOperations,SharedArrays
using Base.Threads
using PyCall
using HDF5

println("number of threads ", nthreads())

Lovdf = h5open("Lov_df.h5", "r") do file_Lov
    read(file_Lov, "/Lov")
end

Lvvdf = h5open("Lvv_df.h5", "r") do file_Lvv
    read(file_Lvv, "/Lvv")
end

t1 = h5open("t1.h5", "r") do file_t1
    read(file_t1, "/t1")
end

tau = h5open("tau.h5", "r") do file_tau
    read(file_tau, "/tau")
end

t2new = h5open("t2new.h5", "r") do file_t2new
    read(file_t2new, "/t2new")
end


nocc,nvir = size(t1)
#t2new = zeros(nocc,nocc,nvir,nvir)
#temp = zeros(nvir,nvir)


function build_temp(t1,nocc,nvir,Lovdf,Lvvdf,tau)
    tempf = zeros(nvir,nvir)
    Lov_T120 = permutedims(Lovdf,(2,3,1))
    #t2new_temp = Array{Float64,4}(undef,(nocc,nocc,nvir,nvir))
    temp = Array{Float64,2}(undef,(nvir,nvir))
    temps = [zeros(nvir,nvir) for i = 1:Threads.nthreads()]
    t2new_temps = [zeros(nocc,nocc,nvir,nvir) for i = 1:Threads.nthreads()]
    Threads.@threads for i in 1:nocc
        @inbounds begin
            id = Threads.threadid()
            temp = temps[id]
            t2new =  t2new_temps[id]
            for j in 1:nocc
                @views tau_ij_ = tau[i,j,:,:]
                @tensoropt (k=>x, t=>100x, d=>100x, c=>100x, a=>100x, b=>100x) begin
                    temp[a,b] = (-t1[k,b]*Lov_T120[k,d,t]*tau_ij_[c,d]*Lvvdf[t,a,c])
                    temp[a,b] -= t1[k,a]*Lov_T120[k,c,t]*tau_ij_[c,d]*Lvvdf[t,b,d]
                    temp[a,b] += Lvvdf[t,a,c]*tau_ij_[c,d]*Lvvdf[t,b,d]
                end
                t2new[i,j,:,:] += temp
            end
        end
    end #sync
    temp_final= sum(temps)
    t2new_final = sum(t2new_temps)
    println("temp from multithreaded code ")
    display(temp_final)
    println("t2new from multithreaded code ")
    display(t2new_final)
    temp_final
end



function build_temp_serial(t1,nocc,nvir,Lovdf,Lvvdf,tau)
    Lov_T120 = permutedims(Lovdf,(2,3,1))
    t2new_temp = Array{Float64,4}(undef,(nocc,nocc,nvir,nvir))
    temp = Array{Float64,2}(undef,(nvir,nvir))
    for i in 1:nocc
        @inbounds begin
            for j in 1:nocc
                @views tau_ij_ = tau[i,j,:,:]
                @tensoropt (k=>x, t=>100x, d=>100x, c=>100x, a=>100x, b=>100x) begin
                    temp[a,b] = (-t1[k,b]*Lov_T120[k,d,t]*tau_ij_[c,d]*Lvvdf[t,a,c])
                    temp[a,b] -= t1[k,a]*Lov_T120[k,c,t]*tau_ij_[c,d]*Lvvdf[t,b,d]
                    temp[a,b] += Lvvdf[t,a,c]*tau_ij_[c,d]*Lvvdf[t,b,d]
                end
                t2new[i,j,:,:] += temp
            end
        end
    end #sync
    println("temp from serial code ")
    display(temp)
    println("t2new from serial code ")
    display(t2new)
    temp
end

temp_final = build_temp(t1,nocc,nvir,Lovdf,Lvvdf,tau)
temp = build_temp_serial(t1,nocc,nvir,Lovdf,Lvvdf,tau)

Now the problem is that when I comment out the line t2new[i,j,:,:] += temp, the array temp comes out the same for both the serial and multithreaded code. But when I restore that line (as written in the code, Line no:54), temp and t2new both come out different. Maybe I am making some silly mistake, but I cannot figure it out. I am new to Julia and also to parallel programming.

Shouldn’t this be += in both functions?

If I use +=, I don’t get the temp that I expect.

When temp[a,b] += (-t1[k,b]*Lov_T120[k,d,t]*tau_ij_[c,d]*Lvvdf[t,a,c]), then temp is:

2×2 Matrix{Float64}:
 -0.102307   -0.0585356
 -0.0585356  -0.0793371

When temp[a,b] = (-t1[k,b]*Lov_T120[k,d,t]*tau_ij_[c,d]*Lvvdf[t,a,c]), then temp is:

2×2 Matrix{Float64}:
 -0.0125728     1.8677e-17
  1.82111e-17  -0.00865733

And this is the temp I expect.
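A stripped-down illustration of the difference (stand-in vectors, not the real tensors): with = the buffer is reset at the top of every j iteration, so only the last iteration survives the loop, while with += contributions from all j accumulate.

```julia
temp_assign = zeros(3)
temp_accum  = zeros(3)
for j in 1:4
    contrib = fill(Float64(j), 3)
    temp_assign .= contrib    # overwrite: only j = 4 survives the loop
    temp_accum .+= contrib    # accumulate: 1 + 2 + 3 + 4 in every entry
end
temp_assign   # [4.0, 4.0, 4.0]
temp_accum    # [10.0, 10.0, 10.0]
```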

Sorry, I forgot to add the t2new.h5 file to the repository. Now it’s updated. Clone the ccsd_julia repository, set export JULIA_NUM_THREADS=12 (or a value appropriate for your system) in your terminal, and then run: python3 pycall_multhread.py