Let me break my answer into pieces according to your questions.
For my test parameters, @btime gives me 2.89 s with no threads, 4.11 s with 2 threads, 3.92 s with 3 threads and 3.71 s with 4 threads (my machine has 4 physical cores). So although there seems to be some scaling with Threads.nthreads(), I am not getting into the range of the non-threaded loops.
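For reference, the kind of comparison I mean looks roughly like the sketch below. This is not my actual code: work(i) is a made-up stand-in for the real per-element computation, and I use @elapsed from Base here so the snippet has no dependencies (the numbers above come from @btime).

```julia
# Hypothetical stand-in for the real per-element work (not my actual kernel).
work(i) = sum(sin(i * x) for x in 0.0:0.001:1.0)

# Plain serial loop.
function serial!(out)
    for i in eachindex(out)
        out[i] = work(i)
    end
    return out
end

# Same loop, with iterations split across the available threads.
function threaded!(out)
    Threads.@threads for i in eachindex(out)
        out[i] = work(i)
    end
    return out
end

out_s = zeros(1000)
out_t = zeros(1000)

# Warm up first so that compilation time is not included in the timings.
serial!(out_s)
threaded!(out_t)

t_serial = @elapsed serial!(out_s)
t_threaded = @elapsed threaded!(out_t)
println("nthreads = ", Threads.nthreads(),
        ", serial = ", t_serial, " s, threaded = ", t_threaded, " s")
```

Julia has to be started with several threads (for example via the JULIA_NUM_THREADS environment variable) for the threaded version to use more than one.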
The structs consist of 3 arrays. Field1 has ~10^2 entries at most; in the tests it has about 10 entries. The other two contain up to 10^7 entries each; in the tests it is roughly a thousand each.
struct MyStruct
    Field1::Vector{Float64}
    Field2::Array{Float64, 4}
    Field3::Array{Float64, 4}
end
The computeKernel() functions compute one-dimensional integrals, so inside of these I define inner functions which are passed as kernels to an integration function. In that process I temporarily allocate memory for a buffer that stores intermediate results.
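To make the pattern concrete, here is a minimal sketch, again not my actual code. integrate! is a hypothetical trapezoidal rule standing in for the real integration routine, exp(-p * x) is a made-up integrand, and compute_all! is a hypothetical driver. The sketch allocates the buffer once per chunk of the loop instead of once per integral, which is a variation I could try; whether the temporary allocations are what limits the scaling is only an assumption at this point.

```julia
# Hypothetical trapezoidal rule; intermediate values go into a caller-supplied buffer.
function integrate!(buf::Vector{Float64}, f, a, b)
    n = length(buf)
    h = (b - a) / (n - 1)
    for k in 1:n
        buf[k] = f(a + (k - 1) * h)
    end
    return h * (sum(buf) - (buf[1] + buf[n]) / 2)
end

# Hypothetical driver: one integral per output element, threaded over chunks.
function compute_all!(out::Vector{Float64}, params::Vector{Float64}; npoints = 1001)
    nchunks = Threads.nthreads()
    chunksize = max(1, cld(length(out), nchunks))
    chunks = collect(Iterators.partition(eachindex(out), chunksize))
    Threads.@threads for c in chunks
        # Buffer allocated once per chunk and reused for every integral in it.
        buf = Vector{Float64}(undef, npoints)
        for i in c
            p = params[i]
            kernel = x -> exp(-p * x)   # inner function closing over p
            out[i] = integrate!(buf, kernel, 0.0, 1.0)
        end
    end
    return out
end

params = [0.5, 1.0, 2.0]
out = compute_all!(zeros(length(params)), params)
```

Each chunk owns its buffer, so no buffer is shared between iterations that may run at the same time.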
Within the calculation, elements of one struct are used to compute elements of the other struct, so there is in fact quite a lot of data access (so cache might be a concern?).
Not sure how I would test whether two threads write into the same cache line... The only thing I can say is that the loops are definitely thread-safe, in the sense that no chunk of memory is simultaneously accessed for both reading and writing.
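One rough check I can think of is to look at memory addresses directly. The sketch below assumes 64-byte cache lines (typical for x86-64, but an assumption about my machine) and uses two hypothetical helpers, cache_line and shared_boundaries. It only tells me whether the last element of one thread's chunk and the first element of the next chunk fall into the same line of one contiguous array; it does not measure actual false sharing.

```julia
# Assumption: 64-byte cache lines.
const CACHE_LINE_BYTES = UInt(64)

# Index of the cache line that holds element i of A.
cache_line(A::Array, i::Integer) = UInt(pointer(A, i)) ÷ CACHE_LINE_BYTES

# Count neighbouring chunks whose boundary elements share a cache line.
function shared_boundaries(A::Array, chunks)
    count = 0
    for k in 1:length(chunks)-1
        if cache_line(A, last(chunks[k])) == cache_line(A, first(chunks[k+1]))
            count += 1
        end
    end
    return count
end

A = zeros(1000)
chunks = [1:250, 251:500, 501:750, 751:1000]   # e.g. one chunk per thread
println("chunk boundaries sharing a cache line: ", shared_boundaries(A, chunks))
```

With chunks of a few hundred elements or more, at most one line per boundary can be shared, so I would expect this to matter mainly when threads write to interleaved or very small ranges.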