We evaluated the new `send_yield!` and `recv_yield!` implementations for broadcast operations, noting significant performance improvements over the previous `bcast_yield` (which used the older `send_yield` and `recv_yield`). All tests were performed with 16 local MPI ranks, measuring performance with `@time`.
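To make the distinction concrete, here is a toy sketch of the two styles; it is not Dagger.jl's actual `send_yield`/`recv_yield!` code, and it assumes MPI.jl's keyword-argument point-to-point API (v0.20+). The out-of-place path serializes the object and allocates a fresh copy on the receiver, while the in-place path writes the raw bytes into a buffer the caller already owns.

```julia
# Toy illustration (not the actual Dagger.jl implementation) of the two
# receive styles, assuming MPI.jl >= 0.20 keyword-argument signatures.
using MPI

# Out-of-place style: the payload is serialized on the sender and a brand-new
# object is allocated on the receiver for every message.
send_oop(obj, comm; dest, tag = 0) = MPI.send(obj, comm; dest = dest, tag = tag)
recv_oop(comm; source, tag = 0)    = MPI.recv(comm; source = source, tag = tag)

# In-place style: the caller supplies the destination buffer, so the bytes
# land directly in preallocated memory with no serialization round-trip.
send_ip!(buf, comm; dest, tag = 0)   = MPI.Send(buf, comm; dest = dest, tag = tag)
recv_ip!(buf, comm; source, tag = 0) = MPI.Recv!(buf, comm; source = source, tag = tag)
```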
DenseArray Broadcast Performance
We tested with `rand(100, 100)` for lightweight data and `rand(10000, 10000)` for massive data. We explored three implementation types for DenseArrays: total out-of-place (`send_yield`/`recv_yield`), in-place logic applied to out-of-place data, and total in-place (`send_yield!`/`recv_yield!`).
For lightweight data, the in-place logic applied to out-of-place data did not show a significant performance gain over either the total in-place or the total out-of-place implementation. For massive data, however, both the total in-place and the in-place-logic implementations performed substantially better than the total out-of-place one.
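As a rough, minimal sketch of the measurement setup, the two cases can be timed with plain MPI.jl collectives standing in for the Dagger broadcast paths (the script name `dense_bcast_bench.jl` is just a placeholder):

```julia
# Minimal benchmark sketch using plain MPI.jl collectives as a stand-in for
# the Dagger broadcast paths. Launch with, e.g.:
#   mpiexec -n 16 julia dense_bcast_bench.jl
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Massive dense payload from the post; swap in rand(100, 100) for the
# lightweight case.
A = rank == 0 ? rand(10_000, 10_000) : zeros(10_000, 10_000)

# Out-of-place: the root serializes the matrix and every receiving rank
# allocates a fresh copy.
MPI.Barrier(comm)
@time B = MPI.bcast(rank == 0 ? A : nothing, comm; root = 0)

# In-place: every rank already owns a correctly sized buffer and the
# broadcast writes straight into it.
MPI.Barrier(comm)
@time MPI.Bcast!(A, comm; root = 0)

MPI.Finalize()
```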
Lightweight data
# Total out-of-place
2.270116 seconds (346.71 k allocations: 18.757 MiB, 83.23% compilation time)
# In-place logic to out-of-place data
4.911291 seconds (302.37 k allocations: 13.880 MiB, 44.60% compilation time)
# Total in-place
2.019524 seconds (346.75 k allocations: 17.605 MiB, 89.63% compilation time)
Massive data
# Total out-of-place
111.232434 seconds (1.34 M allocations: 11.215 GiB, 3.82% gc time, 1.45% compilation time)
# in-place logic to out-of-place
70.331297 seconds (249.44 k allocations: 774.560 MiB, 0.65% gc time, 5.24% compilation time)
# Total in-place
32.585331 seconds (866.13 k allocations: 29.493 MiB, 6.32% compilation time)
Total in-place gave us a 3.4x speedup compared to total out-of-place.
SparseArray Broadcast Performance
For SparseArrays, we used `sprand(10000, 10000, 0.0008)` for lightweight data, `sprand(10000, 10000, 0.008)` for medium-sized data, and `sprand(10000, 10000, 0.6)` for massive/denser data. We utilized two implementations: total out-of-place and in-place logic applied to out-of-place data.
For lightweight sparse data, total out-of-place performed much better than the in-place logic. Conversely, for medium and massive sparse data, the in-place-logic implementation showed significantly better results. Its efficiency comes from serializing only the small metadata and sending the data payload directly.
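A minimal sketch of that idea, assuming a `Float64`/`Int` `SparseMatrixCSC` and MPI.jl's keyword-style collectives (an illustration of the approach, not Dagger.jl's actual code; `bcast_sparse` is a hypothetical helper): only the sizes are serialized, while the index structure and nonzero values travel as raw buffers into preallocated memory.

```julia
# Sketch of "serialize the metadata, send the payload directly" for a sparse
# matrix broadcast. Non-root ranks may pass `nothing` for A.
using MPI, SparseArrays

function bcast_sparse(A, comm; root = 0)
    rank = MPI.Comm_rank(comm)
    # Tiny, cheap-to-serialize header: matrix size and number of stored values.
    m, n, nz = MPI.bcast(rank == root ? (size(A)..., nnz(A)) : nothing, comm; root = root)
    # Index structure and numeric payload are plain bitstype vectors, so they
    # can be broadcast directly into preallocated buffers.
    colptr = rank == root ? A.colptr : Vector{Int}(undef, n + 1)
    rowval = rank == root ? A.rowval : Vector{Int}(undef, nz)
    nzval  = rank == root ? A.nzval  : Vector{Float64}(undef, nz)
    for buf in (colptr, rowval, nzval)
        MPI.Bcast!(buf, comm; root = root)
    end
    return rank == root ? A : SparseMatrixCSC(m, n, colptr, rowval, nzval)
end
```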
Lightweight data
# Total out-of-place
3.113323 seconds (441.72 k allocations: 74.634 MiB, 7.15% gc time, 63.99% compilation time)
# In-place logic to out-of-place data
9.231610 seconds (291.51 k allocations: 15.695 MiB, 45.60% compilation time)
Out-of-place delivered a 2.96x speedup compared to in-place.
Medium-sized data
# Total out-of-place
13.246720 seconds (450.35 k allocations: 466.911 MiB, 17.59% gc time, 32.13% compilation time)
# In-place logic to out-of-place data
5.510648 seconds (362.31 k allocations: 15.306 MiB, 27.50% compilation time)
Using in-place, we measured a roughly 2.4x performance improvement compared to out-of-place.
Massive/denser data
# Total out-of-place
248.220491 seconds (2.69 M allocations: 1.847 GiB, 10.94% gc time, 1.55% compilation time)
# In-place logic to out-of-place data
52.117064 seconds (882.54 k allocations: 942.113 MiB, 1.32% gc time, 5.26% compilation time)
We gained a 4.76x performance improvement using in-place.
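For reference, these speedup figures follow directly from the wall-clock times reported by `@time` above:

```julia
# Speedup ratios implied by the wall-clock times reported above.
speedup(slower, faster) = slower / faster

speedup(111.232434, 32.585331)  # dense, massive: total in-place vs. total out-of-place
speedup(9.231610, 3.113323)     # sparse, lightweight: out-of-place vs. in-place logic
speedup(13.246720, 5.510648)    # sparse, medium: in-place logic vs. out-of-place
speedup(248.220491, 52.117064)  # sparse, massive: in-place logic vs. out-of-place
```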
Conclusion
The new `send_yield!` and `recv_yield!` implementations often provide significant performance improvements for broadcast operations. Our tests indicate that while in-place strategies may not always benefit lightweight data, they are highly beneficial for achieving substantial performance gains with massive data volumes.
Moreover, we aim to implement broadcast using an MPI RMA window to improve performance with a very large number of ranks.