CellListMap: How do I loop through its neighbour list in a thread-safe way?

Hello!

In the following piece of code, I noticed that adding Threads.@threads leads to a wrong result. My limited understanding is that this is because my arrays are not “thread-safe” - I want to learn how to make them so. The code is given below:

using CellListMap
using StaticArrays

N      = 100
points = rand(SVector{3,Float64},N)
system = InPlaceNeighborList(x=points, cutoff=0.5, parallel=false)
list   = neighborlist!(system)
# list is a vector of tuples in the format (index of point i, index of point j, Euclidean distance between i and j)

result_single_thread = zeros(N)
for L in list
    i = L[1]; j = L[2]; d = L[3]

    result_single_thread[i] += d
    result_single_thread[j] += d
end

result_multi_thread = zeros(N)
Threads.@threads for L in list
    i = L[1]; j = L[2]; d = L[3]

    result_multi_thread[i] += d
    result_multi_thread[j] += d
end
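# (The two += updates above are read-modify-write operations: when two threads
# handle pairs that share a particle index, their updates can interleave and
# increments get lost.)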

deviance_between_single_and_multi_thread = sum(result_single_thread .- result_multi_thread)

println("Deviance between single and multi thread")
println(deviance_between_single_and_multi_thread)
println("Sum SINGLE: $(sum(result_single_thread))")
println("Sum MULTI: $(sum(result_multi_thread))")

Which returns:

Deviance between single and multi thread
17.944560782803904
Sum SINGLE: 1050.8378293270782
Sum MULTI: 1032.8932685442744

Which means that my result using the multi-threaded approach is wrong.

How would I fix this issue?

Kind regards

@lmiq pinging you as you told me it was fine for me to do so :slight_smile:

I don’t think it has anything to do with CellListMap as such; I just included it in the title in case someone else in the future has a similar struggle / gap in knowledge.

Kind regards

You need to allocate one array of results for each thread, so that no two threads ever write to the same element, and add them up at the end. Packages such as FLoops.jl do that for you.
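A minimal sketch of that pattern in base Julia (no package), reusing N and list from your post; the task count and the chunking are illustrative choices:

# One private accumulator per task, so no two tasks write to the same array.
ntasks  = Threads.nthreads()
buffers = [zeros(N) for _ in 1:ntasks]
# Split the pair list into one contiguous index range per task.
chunk  = max(1, cld(length(list), ntasks))
ranges = collect(Iterators.partition(eachindex(list), chunk))
@sync for (t, r) in enumerate(ranges)
    Threads.@spawn begin
        buf = buffers[t]   # this task's private buffer
        for k in r
            i, j, d = list[k]
            buf[i] += d
            buf[j] += d
        end
    end
end
result_multi_thread = sum(buffers)  # combine the private results at the end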

I use this package in many cases: GitHub - m3g/ChunkSplitters.jl: Simple chunk splitters for parallel loop executions

With this pattern:

using ChunkSplitters
function chunk_splitters(N, list; nchunks = Threads.nthreads())
    results = [ zeros(N) for _ in 1:nchunks ]
    Threads.@threads for (list_range, ichunk) in chunks(list, nchunks)
        for ipair in list_range
            i, j, d = list[ipair]
            results[ichunk][i] += d
            results[ichunk][j] += d
        end 
    end
    return sum(results)
end
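For a quick sanity check of the pattern above against the serial loop from the first post (reusing its N, list, and result_single_thread):

result_chunked = chunk_splitters(N, list)
result_chunked ≈ result_single_thread  # should be true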

Note, however, that with such a simple operation you will probably be memory-bound (see this thread for some discussion about this).

I can get a good speedup from multi-threading for a list of about 3 million pairs.

Here is a complete running example, in which I preallocate the arrays used for multi-threading:

using Test
using CellListMap
using StaticArrays
using BenchmarkTools
using ChunkSplitters

function single_threaded(result, list)
    result .= 0.0
    for ipair in eachindex(list)
        i, j, d = list[ipair]
        result[i] += d
        result[j] += d
    end
    return result
end

function chunk_splitters(result, list, nchunks, results_threaded)
    # reset
    result .= 0.0
    for i in eachindex(results_threaded)
        results_threaded[i] .= 0.0
    end
    # compute for each chunk
    Threads.@threads for (Lrange, ichunk) in chunks(list, nchunks)
        for ipair in Lrange
            i, j, d = list[ipair]
            results_threaded[ichunk][i] += d
            results_threaded[ichunk][j] += d
        end 
    end
    # reduce
    for i in eachindex(results_threaded)
        result .+= results_threaded[i]
    end
    return result
end

function main()

    N      = 10^4
    # generates points with the atomic density of water and a realistic cutoff
    # for molecular systems
    points, box = CellListMap.xatomic(N)

    system = InPlaceNeighborList(x=points, cutoff=box.cutoff)
    list   = neighborlist!(system)
    result = zeros(N)

    nchunks = Threads.nthreads()
    results_threaded = [ copy(result) for _ in 1:nchunks ]

    println("Number of pairs in list: $(length(list))")

    @test single_threaded(copy(result), list) ≈ chunk_splitters(result, list, nchunks, results_threaded)
    @btime single_threaded($result, $list)
    @btime chunk_splitters($result, $list, $nchunks, $results_threaded)

    nothing
end

With which I get:

julia> main()
Number of pairs in list: 2689668
  7.570 ms (0 allocations: 0 bytes)
  2.188 ms (49 allocations: 4.36 KiB)

Thank you very much!

I gave it a go, but it broke my code, so I will have to try re-implementing it step-by-step at some point. I could confirm, though, that the error in the result is gone, so splitting the results across per-thread arrays works as intended.

As you mention, it will probably not speed up a simulation with a low N, but I like learning how to write both single- and multi-threaded versions.

Kind regards

Just a short update.

I got it to work now, no issues at all, so that is great!

As you mentioned, at this low resolution I am memory-bound; when I increase the number of particles I become “execution” bound - so in the end it was not worth it to do a threaded version for the CPU code :slight_smile:

Thank you for showing me how, I learned quite a bit :slight_smile:

Kind regards

What do you mean?

By the way, since you are updating the lists at every step, it is best not to compute the neighbor lists at all, and instead map the force computation directly using CellListMap; that already runs in parallel by default.

For large systems the memory used to store the neighbor list can be limiting, and computing the list is almost as expensive as computing forces for the same set of particles.
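For reference, a sketch of that approach based on the documented map_pairwise! interface; it accumulates the per-particle distance sums from the first post rather than an actual force, and reuses the xatomic helper from the benchmark above:

using CellListMap

N = 10^4
points, box = CellListMap.xatomic(N)
cl = CellList(points, box)

# map_pairwise! evaluates the given function for every pair within the cutoff,
# passing the positions, indices, squared distance, and the output accumulator.
# It runs multi-threaded by default, managing the per-thread copies internally.
result = map_pairwise!(
    (x, y, i, j, d2, out) -> begin
        d = sqrt(d2)
        out[i] += d
        out[j] += d
        return out
    end,
    zeros(N), box, cl,
)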


I don’t know the right word, but what I meant is something like this:

When I have few particles, the threaded approach is slower
When I have a lot of particles, the threaded approach is faster, but still too slow for my purposes

So when I am using the code “to learn” it is better for me to use fewer particles and a single-threaded approach

And when I have finally learned enough, I will try to use CellListMap on the GPU to speed up the process, since the force calculation becomes very slow on the CPU :slight_smile:

Kind regards
