Could Transducers help here?

performance

#1

I’m trying to get a finer understanding of where Transducers.jl can help, especially by comparison to explicit loops and iterators.

Here is a small example (freely adapted from advent of code, day 6, part2).

A version fully implemented using loops looks like this:

manhattan(a, b) = (b.-a) .|> abs |> sum

function fun(points, N)
    count = 0
    for x in 1:N, y in 1:N
        s = 0
        for p in points
            s += manhattan((x,y), p)
        end
        if s < 50
            count += 1
        end
    end
    count
end
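As an aside, `manhattan` works directly on tuples thanks to broadcasting; a quick sanity check (toy values of my own, not part of the benchmark):

```julia
# Broadcasting `.-` and `.|>` over tuples: (4,6) .- (1,2) == (3,4),
# abs maps elementwise, sum adds the components.
manhattan(a, b) = (b .- a) .|> abs |> sum

manhattan((1, 2), (4, 6))  # |4 - 1| + |6 - 2| = 7
```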

It does not allocate, and runs relatively fast:

julia> N = 1000;
julia> points = zip(rand(1:N, 50), rand(1:N, 50)) |> collect;
julia> @benchmark fun($points, $N)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.647 ms (0.00% GC)
  median time:      33.373 ms (0.00% GC)
  mean time:        33.620 ms (0.00% GC)
  maximum time:     36.444 ms (0.00% GC)
  --------------
  samples:          149
  evals/sample:     1

Here is a second version, using iterators and generators to replace the outer loop with a use of count:

function fun_iter1(points, N)
    count(Iterators.product(1:N, 1:N)) do p1
        s = 0
        for p2 in points
            s += manhattan(p1, p2)
        end
        s < 50
    end
end

The readability is (IMO) improved, and the performance drop is negligible: one allocation (probably to store the Iterators.product(...) iterator):

julia> @benchmark fun_iter1($points, $N)
BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     33.272 ms (0.00% GC)
  median time:      33.629 ms (0.00% GC)
  mean time:        33.738 ms (0.00% GC)
  maximum time:     35.302 ms (0.00% GC)
  --------------
  samples:          149
  evals/sample:     1

Now, if I also want to replace the inner loop with a use of sum, we get an implementation which is (IMO) even clearer:

function fun_iter2(points, N)
    count(Iterators.product(1:N, 1:N)) do p1
        sum(manhattan(p1, p2) for p2 in points) < 50
    end
end

… but this time the performance is noticeably degraded. I attribute this to the allocation of a (small) generator object used as the argument of sum at each iteration of the outer loop:

julia> @benchmark fun_iter2($points, $N)
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  1000001
  --------------
  minimum time:     49.331 ms (2.08% GC)
  median time:      50.463 ms (2.68% GC)
  mean time:        51.291 ms (4.34% GC)
  maximum time:     95.284 ms (43.08% GC)
  --------------
  samples:          98
  evals/sample:     1

So I was wondering: would this kind of algorithm be a good candidate for Transducers? But I did not manage to write any Transducer-based version that performs better than this last iterator-based version.

(Of course, if anyone thinks of something to make the iterator-based version perform as well as the loop-only one, I’m also interested!)
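(One untested guess along those lines: the two-argument form sum(f, itr) folds directly over points, so it might sidestep the per-cell generator allocation:)

```julia
manhattan(a, b) = (b .- a) .|> abs |> sum

# Untested guess: sum(f, itr) reduces over `points` without building
# a generator object at each grid cell, unlike sum(f(x) for x in itr).
function fun_iter3(points, N)
    count(Iterators.product(1:N, 1:N)) do p1
        sum(p2 -> manhattan(p1, p2), points) < 50
    end
end
```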


#2

Here are a few ways to do it:

using Transducers

manhattan(a, b) = (b.-a) .|> abs |> sum

function fun_xf_simple(points, N)
    sum_distance_xf =
        Map(p1 -> mapfoldl(Map(p2 -> manhattan(p1, p2)), +, points))
    mapfoldl(
        sum_distance_xf |> Filter(x -> x < 50) |> Count(),
        right,
        Iterators.product(1:N, 1:N))
end

function fun_xf_setinput(points, N)
    dummy_input = Iterators.product(points, points)
    ed = eduction(MapSplat(manhattan), dummy_input)
    sum_distance_xf =
        Map(p1 -> foldl(+, setinput(ed, Iterators.product(points, (p1,)));
                        simd=true))
    mapfoldl(
        sum_distance_xf |> Filter(x -> x < 50) |> Count(),
        right,
        Iterators.product(1:N, 1:N))
end

function fun_xf_internal(points, N)
    rf = Transducers.reducingfunction(  # internal function
        MapSplat(manhattan),
        +,
        eltype(Iterators.product(points, points));
        simd = true)
    sum_distance_xf =
        Map(p1 -> transduce(rf, 0, Iterators.product(points, (p1,))))
    mapfoldl(
        sum_distance_xf |> Filter(x -> x < 50) |> Count(),
        right,
        Iterators.product(1:N, 1:N))
end

N = 1000
points = zip(rand(1:N, 50), rand(1:N, 50)) |> collect

For reference, the manual loop performance on my laptop is:

julia> @benchmark fun($points, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     22.833 ms (0.00% GC)
  median time:      22.903 ms (0.00% GC)
  mean time:        23.106 ms (0.00% GC)
  maximum time:     33.237 ms (0.00% GC)
  --------------
  samples:          217
  evals/sample:     1

Sadly, fun_xf_simple is super slow (~ a few seconds) with the current Transducers.jl. But fun_xf_setinput does a somewhat OK job (comparable to the iterator version?).

julia> @benchmark fun_xf_setinput($points, $N)
BenchmarkTools.Trial:
  memory estimate:  45.78 MiB
  allocs estimate:  2000053
  --------------
  minimum time:     46.484 ms (15.15% GC)
  median time:      47.965 ms (15.70% GC)
  mean time:        48.660 ms (16.24% GC)
  maximum time:     62.052 ms (14.04% GC)
  --------------
  samples:          103
  evals/sample:     1

I think the performance penalty of fun_xf_setinput comes from the keyword argument handling overhead in Julia. Indeed, if I use an internal function of Transducers.jl that skips keyword argument handling, I can get to the speed of the manual loop:

julia> @benchmark fun_xf_internal($points, $N)
BenchmarkTools.Trial:
  memory estimate:  2.47 KiB
  allocs estimate:  50
  --------------
  minimum time:     28.591 ms (0.00% GC)
  median time:      28.682 ms (0.00% GC)
  mean time:        29.099 ms (0.00% GC)
  maximum time:     40.810 ms (0.00% GC)
  --------------
  samples:          172
  evals/sample:     1

But I also note that the vanilla (non-Transducers.jl) mapfoldl does a great job here:

julia> function fun_foldl(points, N)
           count(Iterators.product(1:N, 1:N)) do p1
               mapfoldl(p2 -> manhattan(p1, p2), +, points; init=0) < 50
           end
       end
fun_foldl (generic function with 1 method)

julia> @benchmark fun_foldl($points, $N)
BenchmarkTools.Trial:
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     25.528 ms (0.00% GC)
  median time:      25.945 ms (0.00% GC)
  mean time:        26.076 ms (0.00% GC)
  maximum time:     36.849 ms (0.00% GC)
  --------------
  samples:          192
  evals/sample:     1

So, in this case, I’d implement the inner loop using the vanilla mapfoldl like this (untested):

sum_distance_xf = Map() do p1
    mapfoldl(p2 -> manhattan(p1, p2), +, points; init=0)
end

A few remarks

  • Yes, fun_xf_simple should not be that slow. I have some fixes in mind.
  • fun_xf_internal shows the “speed limit” of Transducers.jl. It’s very close to the manual loop so I think it’s reasonable.
  • fun_xf_setinput is the recommended way to get rid of the setup cost when you need to use transducers inside a tight loop. I’m hoping that future Julia will optimize away keyword argument handling so that it’s as fast as it can be.
  • Meanwhile, I’m wondering if I should make reducingfunction a public API so that we don’t need to wait for such optimizations.
  • Other than setting up sum_distance_xf, the transducers version is longer than the iterator version because Transducers.jl does not implement count. Maybe it makes sense to implement the shortcut count(xf::Transducer, itr) = mapfoldl(xf |> Count(), right, itr).
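(For intuition on why such a shortcut is natural: in vanilla Julia, count is itself just a fold. A rough sketch of my own, not Transducers.jl code:)

```julia
# Rough sketch (not Transducers.jl code): count(f, itr) expressed as a
# fold that adds 1 per match, mirroring the proposed `xf |> Count()` idea.
count_fold(f, itr) = mapfoldl(x -> f(x) ? 1 : 0, +, itr; init = 0)

count_fold(iseven, 1:10)  # same result as count(iseven, 1:10)
```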

By the way, I noticed that all these functions return 0 with these parameters. It looks like the threshold of 50 is too small. But what’s the threshold that gives us non-trivial results? Fortunately, Transducers.jl lets us reuse the code, so we can simply do:

mapfoldl(sum_distance_xf, +, Iterators.product(1:N, 1:N)) / N^2

which gives me ~34000. I think it’s nice to have this kind of reusability (although, to be fair, I think the outer loop is not critical to the performance so it’s possible to write efficient-enough composable code using iterators in this case).


#3

Many thanks!

Ah, I had tried a solution based on MapSplat(manhattan) applied to Iterators.product(points, (p1,)), but I was missing the setinput trick to be able to reuse the same transducer. That makes sense!

It looks like it would be useful, so I would be in favor of making it public. But maybe you feel that enlarging the API would prevent future re-designs of the internal code…

Maybe it would reduce the entry cost of using transducers for certain tasks. But at least the example about covariance in the Transducers tutorial makes it very clear how one should use transducers to count things.

I think so, too (even though, as you said, this example is not the best to illustrate the point).


#4

If readability is the main concern, then I would say that the first, explicit loop implementation is (by far!) the clearest. count, product, and especially do all reduce legibility. When you start introducing mapfoldl and the like you lose me completely.

Transducers.jl intrigues me, don’t get me wrong, but in this case we go from clear as glass (the explicit loop) to Brainfuck territory.


#5

@DNF Haha, I get what you are saying. It’s always possible to combine manual inner loop with transducers like this:

function sum_distance(p1)
    s = 0
    for p2 in points
        s += manhattan(p1, p2)
    end
    return s
end

sum_distance_xf = Map(sum_distance)

and you don’t need to use a do block :slight_smile:

But I’d like to argue that (vanilla) mapfoldl/foldl is not so hard. I think this is one of the things worth getting used to. For example, it gives us the opportunity to parallelize the code “for free” by replacing it with some parallel implementation of mapreduce/reduce. From this point of view, I recommend Guy Steele’s talks, like Four Solutions to a Trivial Problem and How to Think about Parallel Programming: Not!.
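To make the parallelization point concrete, here is one possible sketch (my own, untested beyond toy inputs, using only Base threading): chunk the outer grid over tasks and reduce the partial counts, reusing the fold-based inner loop unchanged.

```julia
manhattan(a, b) = (b .- a) .|> abs |> sum

# Untested sketch: split the outer loop into chunks, one task per chunk,
# then reduce the per-chunk counts. The mapfoldl inner loop is unchanged.
function fun_parallel(points, N)
    chunks = Iterators.partition(1:N, cld(N, Threads.nthreads()))
    tasks = map(chunks) do xs
        Threads.@spawn begin
            c = 0
            for x in xs, y in 1:N
                s = mapfoldl(p2 -> manhattan((x, y), p2), +, points; init = 0)
                c += s < 50
            end
            c
        end
    end
    sum(fetch, tasks)
end
```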