A macro to unroll by hand but not by hand?

Results on my machine (I can’t see an order-of-magnitude slowdown):


julia> @btime v1($D, Val(4))
  16.125 μs (4 allocations: 31.75 KiB)
(0.6448711574501893, 0.2574365553023608, 0.15220038582379217, 0.10580079514815209)

julia> @btime v2($D, Val(4))
  3.365 μs (0 allocations: 0 bytes)
(0.6448711574501889, 0.25743655530236076, 0.15220038582379197, 0.10580079514815198)

julia> @btime v3($D, Val(4))
  3.458 μs (0 allocations: 0 bytes)
(0.6448711574501889, 0.25743655530236076, 0.15220038582379197, 0.10580079514815198)

julia> @btime v1($D, Val(30))
  318.875 μs (30 allocations: 238.12 KiB)
(0.6448711574501893, 0.2574365553023608, 0.15220038582379217, 0.10580079514815209, 0.08017002510945055, 0.06410806936660542, 0.05320097597224929, 0.04536635637370755, 0.03949783153807502, 0.03495613511968046, 0.031347930888187145, 0.028418959158835428, 0.025998145904505637, 0.02396649158751632, 0.022238857123897404, 0.0207528423483709, 0.019461747651250177, 0.01832997970811076, 0.017329970351146754, 0.01644006032068933, 0.015643014586369287, 0.01492496081474157, 0.014274617330067178, 0.013682722898519838, 0.01314160963951319, 0.012644879028781236, 0.012187153219416135, 0.011763882112192241, 0.011371192189755026, 0.011005766987281907)

julia> @btime v2($D, Val(30))
  17.791 μs (0 allocations: 0 bytes)
(0.6448711574501889, 0.25743655530236076, 0.15220038582379197, 0.10580079514815198, 0.08017002510945062, 0.06410806936660547, 0.05320097597224934, 0.04536635637370749, 0.03949783153807499, 0.0349561351196805, 0.031347930888187166, 0.028418959158835418, 0.025998145904505606, 0.02396649158751633, 0.0222388571238974, 0.0207528423483709, 0.019461747651250163, 0.01832997970811077, 0.017329970351146758, 0.016440060320689332, 0.015643014586369283, 0.014924960814741576, 0.014274617330067184, 0.01368272289851984, 0.01314160963951317, 0.012644879028781213, 0.012187153219416118, 0.011763882112192226, 0.01137119218975503, 0.01100576698728192)

julia> @btime v3($D, Val(30))
  28.458 μs (0 allocations: 0 bytes)
(0.6448711574501889, 0.25743655530236076, 0.15220038582379197, 0.10580079514815198, 0.08017002510945062, 0.06410806936660547, 0.05320097597224934, 0.04536635637370749, 0.03949783153807499, 0.0349561351196805, 0.031347930888187166, 0.028418959158835418, 0.025998145904505606, 0.02396649158751633, 0.0222388571238974, 0.0207528423483709, 0.019461747651250163, 0.01832997970811077, 0.017329970351146758, 0.016440060320689332, 0.015643014586369283, 0.014924960814741576, 0.014274617330067184, 0.01368272289851984, 0.01314160963951317, 0.012644879028781213, 0.012187153219416118, 0.011763882112192226, 0.01137119218975503, 0.01100576698728192)


Neither can I (although your results in the 30-fold unrolling case are interesting), but I think the big slowdown only shows up in combination with @tturbo.


Yes, I was following up on @tkf’s comment, which was, as I understand it, about loop unrolling in general. To be clear, I think there is usually no need for @generated or Base.Cartesian macros when all you need is to unroll a loop with a statically known number of iterations: in such cases, working with NTuples or SArrays is usually a better idea.
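
For concreteness, here is a minimal sketch of that style (the name v_tuple is mine; it mirrors the computation in the v3_floop example quoted further down). The unrolling comes entirely from ntuple, accumulate on tuples and tuple broadcasting, with no macros involved:

julia> function v_tuple(D, ::Val{M}) where {M}
           r = ntuple(_ -> 0.0, Val(M))  # M-tuple accumulator, length known statically
           for x in D
               # (exp(-x), exp(-x)*x, exp(-x)*x^2, ...) as an M-tuple
               e = accumulate(*, (exp(-x), ntuple(_ -> x, Val(M - 1))...))
               r = r .+ e  # tuple broadcasting keeps everything unrolled
           end
           r ./ length(D)
       end;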

Your specific use case is a bit more complicated though, mainly because of interactions with @tturbo (some of which fall into the broad category of “macros are powerful tools, but don’t compose too well with other tools”). In this particular case, my understanding is that @tturbo has to somewhat “understand” what your code does in order to transform it into something faster, and it doesn’t know how to handle ntuple.

2 Likes

Indeed, this was intended with respect to the @tturbo & @generated version.

This seems to be an accurate description of the problem. Maybe a non-macro unrolling on MVectors instead of NTuples would allow @tturbo to understand what is happening, but I am not sure.
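
For what it’s worth, a hedged sketch of that idea could look like the following (the name v_mvec is mine, and whether @tturbo actually accepts the inner loop is untested, as per the uncertainty above):

julia> using StaticArrays

julia> function v_mvec(D, ::Val{M}) where {M}
           r = zero(MVector{M, Float64})  # mutable, statically sized accumulator
           for x in D
               p = exp(-x)
               for k in 1:M  # plain indexed loop, the shape @tturbo-style tools expect
                   r[k] += p
                   p *= x  # running terms: exp(-x), exp(-x)*x, exp(-x)*x^2, ...
               end
           end
           Tuple(r) ./ length(D)
       end;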

By the way, what is wrong with the Base.Cartesian macros that pushes you to want to avoid them? I don’t get it.

I don’t think there’s anything inherently wrong with Base.Cartesian macros, and I do use them in a lot of HPC-related cases.

However, there is something inherent to macros that makes them very powerful tools, and that power comes with an associated cost. To name a few things that immediately come to mind:

  1. as a user, macros don’t compose too well with other macros (an example we’ve seen here is the order in which macros are expanded: how do you know whether @tturbo expands first, or whether it takes steps to expand all the macros in your code before transforming it? See the snippet after this list for one way to check.)
  2. as a developer, macros are difficult to get right (the most salient point probably being hygiene)
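
As a hedged illustration of point 1 (a generic Julia tooling tip, not specific to @tturbo): @macroexpand1 expands only the outermost macro, so you can check directly whether it recursed into the inner macros or left them alone.

julia> @macroexpand1 @inbounds @simd for i in 1:4
           s += i
       end

Here the returned expression still contains a literal @simd call, showing that @inbounds passes inner macros through unexpanded; the compiler expands them later.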

These issues would be manageable, but it happens that in Julia we do have a lot of other ways to do complicated things without involving macros:

  1. when macros are used to factor out code and improve DRYness, higher-order functions often help achieve the same result, in ways that are easier to debug and more composable
  2. when it comes to achieving high performance while keeping the code simple and readable (making sure a loop is unrolled, for example), there are numerous techniques (e.g. involving NTuples, function barriers or constant propagation) that sometimes allow achieving the same effect as writing a custom macro, but in more generic ways; see the sketch below for one such technique
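
As a hedged illustration of point 2 (the names here are mine, not from the thread): recursion over tuples bottoms out at the empty tuple, and for short tuples the compiler typically unrolls the recursion completely, so the “loop” disappears from the generated code.

julia> unrolled_map(f, ::Tuple{}) = ();  # base case: nothing left to map

julia> unrolled_map(f, t::Tuple) = (f(first(t)), unrolled_map(f, Base.tail(t))...);

julia> unrolled_map(x -> x^2, (1, 2, 3))
(1, 4, 9)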

On a more “social” note, the counterpart to this technical discussion is that there are a lot of questions (here on Discourse as well as in other forums) where people ask how to write a macro and/or how to fix some macro they’ve started developing. But it often happens that, over the course of the discussion, we come to the conclusion that they should have used another tool.
I can’t speak on behalf of @tkf, but I think one interesting point in their comment was addressed to future readers who will find this thread while searching for keywords related to “unrolling”. And it’s important, IMHO, that we present them with alternative ways to achieve effects similar to loop unrolling by means other than @generated functions and Base.Cartesian macros.


I don’t have anything to add to ffevotte’s comment :)

But here are some more random comments…

(1) It’d be interesting to see if you can leverage instruction-level parallelism by using a parallel prefix sum (see “Prefix sum” on Wikipedia) instead of accumulate on a tuple/SVector. I’m not sure whether it matters when you call exp, but I’ve found a similar approach to work well for simpler computations like cumsum.
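
For reference, here is a hedged sketch of such a tree-structured scan on tuples (my own illustration; it assumes op is associative and the tuple is non-empty). The two recursive halves are independent computations, which is what exposes instruction-level parallelism:

julia> function tree_prefix(op, t::NTuple{N,T}) where {N,T}
           N == 1 && return t
           h = N ÷ 2
           left  = tree_prefix(op, ntuple(i -> t[i], Val(h)))          # scan of the first half
           right = tree_prefix(op, ntuple(i -> t[h + i], Val(N - h)))  # scan of the second half
           # fold the running total of the left half into every element of the right half
           (left..., map(x -> op(left[end], x), right)...)
       end;

julia> tree_prefix(*, (2.0, 3.0, 4.0, 5.0))
(2.0, 6.0, 24.0, 120.0)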

(2) FYI, FLoops.jl works with the v3 example above, because you can write arbitrary parallel reductions with it:

julia> using FLoops

julia> function v3_floop(D, ::Val{M}) where {M}
           r0 = ntuple(_->0.0, Val(M))
           @floop for i in 1:length(D)
               Di = (exp(-D[i]), ntuple(_->D[i], Val(M-1))...)
               e = accumulate(*, Di)
               @reduce r = (.+)(r0, e)
           end
           r ./ length(D)
       end;

julia> v3_floop(rand(1000), Val(4))
(0.6231422289649688, 0.269264496773052, 0.16628201243649704, 0.11921451372815403)

although (.+)(r0, e) is a bit ugly. Last week I wrote a patch so that @reduce r .+= e or @reduce r .= r0 .+ e work. It’ll be released soon-ish.
