Dagger: Benchmarking Broadcast using send_yield! and recv_yield!

We evaluated the new `send_yield!` and `recv_yield!` implementations for broadcast operations, observing significant performance improvements over the previous `bcast_yield` (which used the older `send_yield` and `recv_yield`). All tests were run with 16 local MPI ranks, measuring performance with `@time`.


DenseArray Broadcast Performance

We tested with `rand(100, 100)` for lightweight data and `rand(10000, 10000)` for massive data. We explored three implementations for DenseArrays: total out-of-place (`send_yield`/`recv_yield`), in-place logic to out-of-place data, and total in-place (`send_yield!`/`recv_yield!`).

For lightweight data, the in-place logic to out-of-place implementation showed no significant performance gain over either the total in-place or the total out-of-place implementation. For massive data, however, both the total in-place and the in-place logic to out-of-place implementations performed substantially better than the total out-of-place implementation.
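The allocation columns in the results below are the key to this gap: an out-of-place receive materializes a fresh payload on every broadcast round, while an in-place receive reuses a preallocated buffer. Here is a minimal standalone Julia sketch of that difference; this is an illustration only, not Dagger's actual `send_yield!`/`recv_yield!` code, and the function names `recv_out_of_place` and `recv_in_place!` are hypothetical:

```julia
# Sketch only, not Dagger's implementation: contrast the allocation behavior
# of an out-of-place receive with an in-place receive into a reused buffer.

# Out-of-place: every "receive" materializes a fresh array (new payload allocation).
recv_out_of_place(src::Matrix{Float64}) = copy(src)

# In-place: the receiver reuses a preallocated buffer, so the payload memory
# is allocated once up front and reused across broadcast rounds.
recv_in_place!(buf::Matrix{Float64}, src::Matrix{Float64}) = copyto!(buf, src)

src = rand(1000, 1000)
buf = similar(src)                 # one-time allocation on the receiver side

recv_out_of_place(src)             # warm up to exclude compilation allocations
recv_in_place!(buf, src)

alloc_out = @allocated recv_out_of_place(src)   # ~8 MB for each call
alloc_in  = @allocated recv_in_place!(buf, src) # no payload allocation
```

Repeated over many ranks and rounds, the per-call payload allocation (and the GC pressure it causes) is what the total in-place numbers below avoid.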

Lightweight data

```
# Total out-of-place
2.270116 seconds (346.71 k allocations: 18.757 MiB, 83.23% compilation time)
# In-place logic to out-of-place data
4.911291 seconds (302.37 k allocations: 13.880 MiB, 44.60% compilation time)
# Total in-place
2.019524 seconds (346.75 k allocations: 17.605 MiB, 89.63% compilation time)
```

Massive data

```
# Total out-of-place
111.232434 seconds (1.34 M allocations: 11.215 GiB, 3.82% gc time, 1.45% compilation time)
# In-place logic to out-of-place data
70.331297 seconds (249.44 k allocations: 774.560 MiB, 0.65% gc time, 5.24% compilation time)
# Total in-place
32.585331 seconds (866.13 k allocations: 29.493 MiB, 6.32% compilation time)
```

Total in-place gives a 3.4x speedup over total out-of-place.


SparseArray Broadcast Performance

For SparseArrays, we used `sprand(10000, 10000, 0.0008)` for lightweight data, `sprand(10000, 10000, 0.008)` for medium-sized data, and `sprand(10000, 10000, 0.6)` for massive/denser data. We tested two implementations: total out-of-place and in-place logic to out-of-place.

For lightweight sparse data, total out-of-place performed much better than in-place. Conversely, for medium and massive sparse data, the in-place logic implementation showed significantly better results. This efficiency comes from serializing only the small metadata and sending the data payload buffers directly.
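To make the "metadata plus payload" idea concrete: a `SparseMatrixCSC` is internally a tiny header (dimensions, nonzero count) plus three flat buffers (`colptr`, `rowval`, `nzval`) that can be sent as raw arrays and reassembled on the receiver. The sketch below shows this decomposition in plain Julia; it is an assumption-laden illustration, not Dagger's actual wire format, and `split_csc`/`rebuild_csc` are hypothetical names:

```julia
using SparseArrays

# Sketch only, not Dagger's implementation: split a CSC matrix into a small
# metadata header and three flat payload buffers that could be sent directly.
function split_csc(A::SparseMatrixCSC)
    meta = (size(A, 1), size(A, 2), nnz(A))    # cheap-to-serialize header
    return meta, A.colptr, A.rowval, A.nzval   # contiguous payload buffers
end

# The receiver rebuilds the matrix around the buffers it received.
function rebuild_csc(meta, colptr, rowval, nzval)
    m, n, _ = meta
    return SparseMatrixCSC(m, n, colptr, rowval, nzval)
end

A = sprand(100, 100, 0.05)
meta, colptr, rowval, nzval = split_csc(A)
B = rebuild_csc(meta, colptr, rowval, nzval)
```

For very sparse (lightweight) matrices the three buffers are tiny, so generic serialization wins; as density grows, sending the buffers directly avoids serializing millions of stored values, which matches the crossover seen in the results below.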

Lightweight data

```
# Total out-of-place
3.113323 seconds (441.72 k allocations: 74.634 MiB, 7.15% gc time, 63.99% compilation time)
# In-place logic to out-of-place data
9.231610 seconds (291.51 k allocations: 15.695 MiB, 45.60% compilation time)
```

Out-of-place gives a 2.96x speedup over in-place.

Medium-sized data

```
# Total out-of-place
13.246720 seconds (450.35 k allocations: 466.911 MiB, 17.59% gc time, 32.13% compilation time)
# In-place logic to out-of-place data
5.510648 seconds (362.31 k allocations: 15.306 MiB, 27.50% compilation time)
```

Using in-place, we measured a 2.4x speedup over out-of-place.

Massive/denser data

```
# Total out-of-place
248.220491 seconds (2.69 M allocations: 1.847 GiB, 10.94% gc time, 1.55% compilation time)
# In-place logic to out-of-place data
52.117064 seconds (882.54 k allocations: 942.113 MiB, 1.32% gc time, 5.26% compilation time)
```

We gained a 4.76x speedup using in-place.


Conclusion

The new `send_yield!` and `recv_yield!` implementations often provide significant performance improvements for broadcast operations. Our tests indicate that while in-place strategies may not always benefit lightweight data, they deliver substantial performance gains on massive data volumes.

Next, we aim to implement broadcast using RMA windows to improve performance at large rank counts.
