The actual code that runs into problems is like the following:
```julia
s = mol.max_buf_size
Threads.@threads for ijkl = 1:nbas^4
    t1, l = divrem(ijkl - 1, nbas)
    t2, k = divrem(t1, nbas)
    i, j = divrem(t2, nbas)
    shells = get!(() -> Cint[0, 0, 0, 0], task_local_storage(), :shell)::Vector{Cint}
    shells[1] = i; shells[2] = j; shells[3] = k; shells[4] = l
    is, ie, ni = _get_sen(i + 1, mol)
    js, je, nj = _get_sen(j + 1, mol)
    ks, ke, nk = _get_sen(k + 1, mol)
    ls, le, nl = _get_sen(l + 1, mol)
    buf = get!(() -> zeros(s, s, s, s), task_local_storage(), :cache)::Array{Float64, 4}
    buf .= 0.0
    ccall(
        (:cint2e_sph, lib_int()), Cint,
        (Ptr{Cdouble}, Ptr{Cint}, Ptr{Cint}, Cint, Ptr{Cint}, Cint, Ptr{Cdouble}, Ptr{Any}),
        buf, shells, mol.atm, mol.geo.n_atoms, mol.bas, nbas, mol.env, C_NULL
    )
    inte[is:ie, js:je, ks:ke, ls:le] .+= reshape(view(buf, 1:ni*nj*nk*nl), ni, nj, nk, nl)
end
```
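For context, the per-task scratch-buffer pattern used here can be reproduced in a self-contained sketch; `fill_block!` is a hypothetical toy kernel standing in for the libcint call, and everything is wrapped in a function:

```julia
using Base.Threads

# Hypothetical toy kernel standing in for the C integral routine:
# it writes `val` into the first slot of the scratch buffer.
fill_block!(buf, val) = (buf[1] = val; nothing)

function run!(out)
    Threads.@threads for idx in eachindex(out)
        # One scratch buffer per task, created on first use and reused afterwards.
        buf = get!(() -> zeros(4), task_local_storage(), :cache)::Vector{Float64}
        buf .= 0.0
        fill_block!(buf, Float64(idx))
        out[idx] = buf[1]   # each iteration writes a disjoint index of `out`
    end
    return out
end

run!(zeros(16))
```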
Yes, many threads may write to the same array `inte` at the same time, but it is guaranteed that they write to different parts of it.

I've printed the address of `buf`, and found something strange in the non-function approach. Most of the time, the address of `buf` in a given thread stays the same throughout the loop, and each thread has its own address. Occasionally, however, one thread happens to use the `buf` of another thread, which produces completely wrong results.

This peculiar behaviour may explain another fact that astonishes me even more. I've tried to use a new array as `buf` each time, with code like the following, but the assertion in the code can fail:
```julia
s = mol.max_buf_size
Threads.@threads for ijkl = 1:nbas^4
    t1, l = divrem(ijkl - 1, nbas)
    t2, k = divrem(t1, nbas)
    i, j = divrem(t2, nbas)
    shells = get!(() -> Cint[0, 0, 0, 0], task_local_storage(), :shell)::Vector{Cint}
    shells[1] = i; shells[2] = j; shells[3] = k; shells[4] = l
    is, ie, ni = _get_sen(i + 1, mol)
    js, je, nj = _get_sen(j + 1, mol)
    ks, ke, nk = _get_sen(k + 1, mol)
    ls, le, nl = _get_sen(l + 1, mol)
    buf = zeros(ni, nj, nk, nl)
    ccall(
        (:cint2e_sph, lib_int()), Cint,
        (Ptr{Cdouble}, Ptr{Cint}, Ptr{Cint}, Cint, Ptr{Cint}, Cint, Ptr{Cdouble}, Ptr{Any}),
        buf, shells, mol.atm, mol.geo.n_atoms, mol.bas, nbas, mol.env, C_NULL
    )
    @assert length(buf) == ni * nj * nk * nl
    inte[is:ie, js:je, ks:ke, ls:le] .+= buf
end
```
If a thread may sometimes use the `buf` of another thread, then the sizes will of course mismatch. But why does this happen?
I've also checked the documentation on `@threadcall`. If I understand it correctly, I should use `ccall` instead of `@threadcall` here, because I do want to prevent other tasks from executing until the call returns. For those who insist on trying: I did try `@threadcall` here, which results in `ConcurrencyViolationError(msg="lock must be held")`. But there is nothing to lock here. Only `inte` is shared among threads, and first, different threads handle different parts of it; second, if it were locked, the whole multi-threading would be completely useless.
What's even stranger: if I wrap the body of the for loop into a function and run it like the following, there seems to be no problem:
```julia
function _2e_inner(inte, ijkl, mol)
    # just like above
end

Threads.@threads for ijkl = 1:nbas^4
    _2e_inner(inte, ijkl, mol)
end
```
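A self-contained illustration of this function-wrapping pattern, with a dummy body in place of the integral code (the doubling kernel is just an invented placeholder):

```julia
using Base.Threads

# Dummy inner body: all state the iteration needs is passed in explicitly,
# so every local (including the scratch buffer) lives in function scope.
function _inner!(out, idx)
    buf = zeros(1)       # fresh, task-private scratch buffer
    buf[1] = 2.0 * idx   # placeholder computation
    out[idx] = buf[1]    # disjoint write: each iteration owns one index
    return nothing
end

out = zeros(8)
Threads.@threads for idx in 1:8
    _inner!(out, idx)
end
```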
However, for other reasons I cannot use a function here, and I'd like to know the exact cause of this peculiar behaviour, as well as possible solutions that do not involve wrapping the body in a function.
The actual code is in a quantum chemistry package that calls libcint for electron integrals. The construction of `bas` and `env` is somewhat troublesome, and I haven't found a simpler MWE.
```
julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 96 × Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
Threads: 8 on 96 virtual cores
```
Maybe a solution

Using `Threads.@spawn` instead of `Threads.@threads`. Any possible problems?
```julia
function _get_partition(x, N)
    res = zeros(Int, N)
    res .= x ÷ N
    for i = 1:(x - x ÷ N * N)
        res[i] += 1
    end
    res2 = zeros(Int, N + 1)
    for i = 1:N
        res2[i+1] = res2[i] + res[i]
    end
    return res, res2
end
```
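As a quick sanity check of the partition helper (same definition as above), splitting 10 items over 3 chunks gives chunk sizes `[4, 3, 3]` and offsets `[0, 4, 7, 10]`:

```julia
# Same partition helper as above: `res[i]` is the size of chunk i, and
# `res2[i]+1 : res2[i+1]` is the index range handled by chunk i.
function _get_partition(x, N)
    res = zeros(Int, N)
    res .= x ÷ N
    for i = 1:(x - x ÷ N * N)   # spread the remainder over the first chunks
        res[i] += 1
    end
    res2 = zeros(Int, N + 1)
    for i = 1:N
        res2[i+1] = res2[i] + res[i]
    end
    return res, res2
end

res, res2 = _get_partition(10, 3)   # res == [4, 3, 3], res2 == [0, 4, 7, 10]
```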
```julia
bs, bs2 = _get_partition(nbas^4, Threads.nthreads())
@sync for it = 1:Threads.nthreads()
    Threads.@spawn begin
        shells = Cint[0, 0, 0, 0]
        buf = zeros(s, s, s, s)
        for ijkl = bs2[it]+1:bs2[it+1]
            t1, l = divrem(ijkl - 1, nbas)
            t2, k = divrem(t1, nbas)
            i, j = divrem(t2, nbas)
            shells[1] = i; shells[2] = j; shells[3] = k; shells[4] = l
            is, ie, ni = _get_sen(i + 1, mol)
            js, je, nj = _get_sen(j + 1, mol)
            ks, ke, nk = _get_sen(k + 1, mol)
            ls, le, nl = _get_sen(l + 1, mol)
            buf .= 0.0
            ccall(
                (:cint2e_sph, lib_int()), Cint,
                (Ptr{Cdouble}, Ptr{Cint}, Ptr{Cint}, Cint, Ptr{Cint}, Cint, Ptr{Cdouble}, Ptr{Any}),
                buf, shells, mol.atm, mol.geo.n_atoms, mol.bas, nbas, mol.env, C_NULL
            )
            view(inte, is:ie, js:je, ks:ke, ls:le) .+= reshape(view(buf, 1:ni*nj*nk*nl), ni, nj, nk, nl)
        end
    end
end
```
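The chunked `@spawn` structure can be exercised on its own with a toy body; `chunked_fill!` below is my invented stand-in, reusing the same remainder-spreading partition logic:

```julia
using Base.Threads

# Toy chunked driver: task `it` fills indices offsets[it]+1 : offsets[it+1],
# so the spawned tasks write disjoint slices of the shared array.
function chunked_fill!(out, nchunks)
    n = length(out)
    sizes = fill(n ÷ nchunks, nchunks)
    for i in 1:(n - n ÷ nchunks * nchunks)   # spread the remainder
        sizes[i] += 1
    end
    offsets = cumsum(vcat(0, sizes))
    @sync for it in 1:nchunks
        Threads.@spawn begin
            buf = zeros(1)                   # one scratch buffer per spawned task
            for idx in (offsets[it]+1):offsets[it+1]
                buf[1] = Float64(idx)^2      # placeholder computation
                out[idx] = buf[1]
            end
        end
    end
    return out
end

chunked_fill!(zeros(10), 3)
```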
PS: for those who are interested, the reason for not using a function is the following. In my actual usage, the name `:cint2e_sph` needs to be passed as an argument, and `ccall` cannot accept a variable as the function symbol. Thus, in my actual code, I wrap it into a macro like the following:
```julia
macro _int_2e!(res, name, mol)
    return quote
        # something else
        ccall(
            ($(esc(name)), lib_int()), Cint,
            # something else
        )
    end
end
```
It seems, then, that the `ccall` part can never be put into a function, so the above solution is not practical.
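For what it's worth, one workaround I believe works (an assumption on my part, not taken from the original code): `ccall` also accepts a runtime function pointer as its first argument, and `Libdl.dlsym` can look that pointer up from a name held in a variable, so the call can live inside an ordinary function. A sketch against libc's `strlen` (the soname `libc.so.6` is a Linux/glibc assumption; the return and argument types here are strlen's):

```julia
using Libdl

# Look up `name` in the library at runtime and call through the pointer.
# Because the pointer is an ordinary value, this can sit inside a function.
function call_strlen(libpath::AbstractString, name::Symbol, s::String)
    handle = Libdl.dlopen(libpath)
    fptr = Libdl.dlsym(handle, name)   # runtime symbol lookup
    return ccall(fptr, Csize_t, (Cstring,), s)
end

call_strlen("libc.so.6", :strlen, "hello")
```

As far as I understand, the `(name, lib)` tuple form of `ccall` needs the name available when the code is lowered, but a `Ptr{Cvoid}` obtained via `dlsym` does not, which is why this route avoids the macro.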
The actual code, as requested

I've uploaded an MWE file. To use it, one first needs to compile the aforementioned libcint and modify `Base.DL_LOAD_PATH` accordingly. Set `METHOD` to one of `(:MNT, :MT, :MST, :NT)` and just run `main()`.
`:MNT` is non-threaded and correct; `:NT` seems to be correct, following the "maybe a solution" part above. `:MST` suffers from synchronization problems, as explained in the file; the actual error, if I add `@sync` inside my macro, is `ERROR: LoadError: syntax: invalid let syntax around task.jl:475`. `:MT` is the main problematic code, but the error only occurs occasionally, and one may need to run `main()` several times to trigger it. For the same reason, I'm not 100% sure whether the `:NT` way is really correct, or whether I just happened not to trigger the error during my several runs.
mwe.jl (10.8 KB)