Warning: in these original benchmarks there was a problem with my usage of `@btime`, as pointed out by @mikmoore. See the corrected benchmarks (which already resolve question Q1) in a reply below.
Hi everyone. I am playing around with the Iteration interface and could use some help figuring out what happens behind the scenes in some usage examples.
I will be using
using .Iterators
using BenchmarkTools
prod = product(repeated(1:5, 9)...);
f(x) = sum(x)
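As a side note on the setup (just to illustrate the laziness), the product iterator knows its length and shape without materializing anything:

```julia
using Base.Iterators: product, repeated

it = product(repeated(1:5, 9)...)  # lazy Cartesian product; nothing is stored

println(length(it))  # 5^9 = 1953125, computed from the axes alone
println(size(it))    # (5, 5, 5, 5, 5, 5, 5, 5, 5)
```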
I am interested in understanding the trade-offs in performance and memory usage between different ways of iterating over `prod`, when `prod` would be too large to collect and keep in memory.
So, the standard way of iterating without collecting would be simply using `Iterators.map`:
@btime map(f, prod)
> 2.745 ms (5 allocations: 14.90 MiB)
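For comparison (my understanding, worth double-checking): plain `map` on an iterator eagerly materializes the full result array, which matches the ~15 MiB above (5^9 `Int64` values), whereas `Iterators.map` builds a lazy generator that allocates nothing until it is reduced or collected:

```julia
using Base.Iterators: product, repeated

it = product(repeated(1:5, 9)...)
g(x) = sum(x)

eager = map(g, it)            # Base.map: materializes all 5^9 results (~15 MiB)
lazy  = Iterators.map(g, it)  # Base.Generator: no element computed yet

println(eager isa Array)          # true
println(sum(lazy) == sum(eager))  # true; reducing `lazy` forces the iteration
```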
Good enough for memory, for if I collect `prod` and inspect it with `varinfo`, I get
name     size         summary
–––––––  –––––––––––  ––––––––––––––––––––––––––––––––––––––––––––
prodcol  134.111 MiB  5×5×5×5×5×5×5×5×5 Array{NTuple{9, Int64}, 9}
However, when I iterate with a for loop, I get
@btime for x in prod
    f(x)
end
> 234.641 ms (5859375 allocations: 864.27 MiB)
But if I first collect, I get
@btime for x in collect(prod)
    f(x)
end
> 420.362 ms (5858868 allocations: 491.73 MiB)
(Q1) What is the reason for the difference in runtimes and memory usage between the two for loops? Does the first one not collect prod to be able to iterate over it? And why is the map version so much faster? Is it simply because it avoids collecting and allocating memory?
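If it helps while the corrected numbers are in the reply below: two things commonly inflate such measurements. `@btime` evaluates its expression with `prod` as a non-const global unless it is interpolated (`@btime ... $prod`), and a `for` loop over an untyped global pays dynamic-dispatch and boxing costs on every iteration. A minimal Base-only sketch of the function-wrapped version:

```julia
using Base.Iterators: product, repeated

h(x) = sum(x)

# wrapping the loop in a function lets the compiler specialize on the
# iterator's concrete type; in global scope the iterator is an untyped
# global, so every iteration goes through dynamic dispatch
function consume(it)
    acc = 0
    for x in it
        acc += h(x)   # accumulate, so the work cannot be optimized away
    end
    acc
end

consume(product(repeated(1:5, 9)...))
```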
Second, since my real use case will involve an iterator which is very large, and for each element an expensive computation needs to be done, I am looking into ways of combining something similar to `map` (or anything which neither collects nor requires indexing) with multithreading.
Since `Threads.@threads` requires indexing, I tried instead:
using ThreadsX
@btime ThreadsX.map(f, prod);
> 119.817 ms (5865420 allocations: 485.55 MiB)
(Q2) As far as I could find out/guess, roughly, `ThreadsX.map` is not collecting `prod`, but rather partitioning the iterator into `nthreads` chunks and then running a serial map on each thread (though I am not sure this is correct!). Then, why is the memory usage so high compared to `Iterators.map`? And is the increased computation time just the overhead of distributing work across threads? I.e., if `f` were an expensive computation, should I see an improvement with ThreadsX?
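To check my mental model of the partitioning, here is a Base-only sketch (my own construction, not ThreadsX's actual implementation): split the product along its first axis, give each chunk to a task, and let each task reduce its sub-product serially and lazily:

```julia
using Base.Iterators: product, repeated
using Base.Threads: @spawn

f(x) = sum(x)

# one task per value of the first coordinate; each task iterates its
# sub-product lazily, so nothing is ever collected
function chunked_sum(r, n)
    tail = product(repeated(r, n - 1)...)
    tasks = [@spawn sum(f((i, t...)) for t in tail) for i in r]
    sum(fetch, tasks)
end

chunked_sum(1:5, 9)
```

With an expensive `f`, each task's serial reduction dominates and the spawn overhead becomes negligible.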
Lastly, I have noticed it is not possible to use `Threads.@threads for p in prod` since `prod` is not indexable. However, `FLoops.@floop` works:
using FLoops
@btime @floop for x in prod
    f(x)
end
> 40.263 μs (835 allocations: 54.41 KiB)
From the documentation, this should work as long as the iterator supports the SplittablesBase.jl interface (so, similar to `ThreadsX.map`), so fair enough.
(Q3) However, there is a drastic difference in both runtime and memory usage for `FLoops.@floop`. So, why is it so much faster than, say, `ThreadsX.map`, and how come it uses even less memory than `Iterators.map`?
and
(Q4) Do you have any recommendations/rules of thumb for structuring computations where `prod` is too big to collect, `f(x)` is very expensive to compute, and one has access to many cores on shared memory or a distributed-memory cluster?
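Not an answer, but for context, here is the pattern I have been sketching for the distributed case (stdlib `Distributed` only; the chunking choice, one chunk per value of the first coordinate, is my own assumption): parallelize over one axis of the product, keep each worker's slice lazy, and return only the reduced value:

```julia
using Distributed                 # addprocs(N) on a cluster; runs locally otherwise
using Base.Iterators: product, repeated

function distributed_total(r, n)
    tail = product(repeated(r, n - 1)...)
    partials = pmap(r) do i       # one chunk per value of the first coordinate
        s = 0
        for t in tail             # lazy on the worker; nothing is collected
            s += i + sum(t)       # stand-in for an expensive f((i, t...))
        end
        s
    end
    sum(partials)                 # combine the per-chunk reductions
end
```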