Concatenating iterables without allocating memory

Perhaps it is easiest to not try to be too smart and just do

(min1, max1), (min2, max2)  = extrema(signal1), extrema(signal2)
return min(min1, min2), max(max1, max2)
2 Likes

@mschauer, this approach was already suggested two times in previous posts.
I don’t think it is being “too smart”. In every modern language (C# with LINQ, Java with Streams) I am allowed to do such lazy concatenation of arrays without performance penalties. I think there is a bug or a design flaw that needs to be addressed.

I think this is because you are going “against the grain” of the memory layout here. Compare what you are trying with the “generic” but very natural version using an unzip *)

mins, maxs = unzip(extrema.(signals))
minimum(mins), maximum(maxs)

It is to me more natural to compute the extremas first of each series and combine the results than to combine the series and compute the extremas. That also respects the locality of the memory better.

*) something is indeed missing

2 Likes

It doesn’t seem so unnatural to me to want to treat two (or more!) series as one, and do some operation on it. For example, it could be that they are semantically one series that just happened to be split up for size reasons.

Perhaps the operation even depends on boundary effects which makes it hard to apply the operation on the parts separately.

2 Likes

@mschauer I really like your approach.

But please consider those 2 examples:

const signal1 = rand(10000000)
const signal2 = rand(10000000)
julia> @btime Iterators.flatten(($signal1, $signal2)) |> extrema
  283.061 ms (20000002 allocations: 610.35 MiB)
(6.728651613663317e-8, 0.9999999625593397)

julia> @btime Iterators.flatten(zip($signal1, $signal2)) |> extrema
  68.612 ms (3 allocations: 64 bytes)
(6.728651613663317e-8, 0.9999999625593397)

In the first one I am going exactly with the memory layout and allocation is high, performance is poor. In the second one I am going “against the grain”, interleaving data from both arrays. For some reason that I don’t understand this “against the grain” version is faster and optimal.

Also when I define my own function

function my_extrema(itr)
    begin
        vmin = vmax = first(itr)
        for item in itr
            vmax = max(item, vmax)
            vmin = min(item, vmin)
        end
    end

    return (vmin, vmax)
end

Performance as is good as I would expect for original extrema from Julia base library.

julia> @btime Iterators.flatten(($signal1, $signal2)) |> my_extrema
  68.700 ms (2 allocations: 48 bytes)
(6.728651613663317e-8, 0.9999999625593397)

If someone is able to explain it to me I would be thankful. I am going to investigate it anyway but I have no time this right now.

2 Likes

Good that you insist. I also agree with you that something is not right when computing extrema of a flattened iterator. It only has nothing to do with flatten (as you also said). Something is not right about how extrema works…

Just because I’m currently working with it, here’s OnlineStats’ performance:

julia> o = Extrema()
Extrema: n=0 | value=(Inf, -Inf)

julia> @btime foreach(x -> fit!(o, x), ($signal1, $signal2))
  51.465 ms (0 allocations: 0 bytes)

julia> value(o)
(4.8156351573069855e-8, 0.9999999801147899)

For comparison, here’s what flatten results in on my machine:

julia> @btime Iterators.flatten(zip($signal1, $signal2)) |> extrema

  43.234 ms (3 allocations: 64 bytes)
(4.8156351573069855e-8, 0.9999999801147899)

So OnlineStats needs zero allocations but takes 20% longer…

1 Like

@yakir12 , thanks for another solution. I am not concerned about those tens or hundreds bytes. Hundreds of megabytes and gigabytes from above examples are painful for me.
Probably your example is slightly slower, cause access to variables can be optimised better than fields of mutable struct. 20% slower is fine for me, 400% slower not.

1 Like

Out of curiosity, are you running on a system where you are not allowed to allocate (e.g. a GPU)? If not, why do you care about the total number of allocations and not just the time it takes to run the operation?

@kristoffer.carlsson I am starting a quantitative finance project and I am exploring possibility to use Julia as a backend for numeric calculations. I am not planning to use GPU, just CPU. You’re right, I care about total time of calculation only, but I can see that there is a correlation between time of execution and amount of allocations. Also operation that clearly should not allocate at all allocates a lot (610 MiB for 150 MiB of data processed) in some rare circumstances (Base.extrema) and works as expected for other cases (Base.sum, Base.minimum, my_extrema).
I am willing to investigate it when I find some time for debugging and reading generated lowered and native code but I am also hoping that this is a known issue that someone already works on.

I agree that it would be great if map, filter, etc. are lazy. But I think that core devs had to balance compilation time, execution time, and usability. I imagine non-lazy map was an OK solution for this especially in early days of Julia compiler.

Unlike many languages (except Clojure and what else?), Julia has builtin transducers. I think it’s cool that you can do:

julia> function demo()
           a = (1, 2, 3)
           b = SVector(4, 5, 6)
           return sum(x for xs in (a, b) for x in xs)
           # another way to write Iterators.flatten
       end
demo (generic function with 1 method)

julia> @code_llvm demo()

;  @ REPL[80]:2 within `demo'
define i64 @julia_demo_19412() {
L29.1:
;  @ REPL[80]:4 within `demo'
  ret i64 21
}

I tried to explain why doing flatten using transducers works so well in my talk in the last JuliaCon (YouTube). See, e.g., this slide.

If you care the performance of lazy iterators, I think Julia is one the best languages. Well, at least as long as you use foldl and reduce everywhere.

4 Likes

@tkf, thanks for links, I saw your recent PRs and I think the work you did is outstanding.

I think I did such things 10 years ago in F#, and can easily do stuff like this also in C#. I prepared even performance comparison with C#.

csharp> var a = new [] {1, 2, 3};
csharp> var b = new [] {4, 5, 6};
csharp> Time(() => new [] {a,b}.SelectMany(x => x).Sum());
00:00:00.0002290
julia> const a = [1,2,3];
julia> const b = [4,5,6];
julia> @btime sum(x for xs in ($a, $b) for x in xs)
  54.606 ns (8 allocations: 192 bytes)
21

225 ns vs 54 ns

csharp> var a = new [] {1, 2, 3};
csharp> var b = Enumerable.Repeat(1, 10000000);
csharp> Time(() => new [] {a,b}.SelectMany(x => x).Sum());
00:00:00.1963760
julia> const a = [1,2,3];
julia> const b = [1 for i in 1:10000000];
julia> @btime sum(x for xs in ($a, $b) for x in xs)
  8.107 ms (8 allocations: 192 bytes)
10000006

196 ms vs 8 ms

csharp> Time(() => new [] {a,b,a,b,a,b,a,b,a,b,a,b,a,b,a,b,a,b,a,b}.SelectMany(x => x).Sum());
00:00:01.8202740
julia> @btime sum(x for xs in (a,b,a,b,a,b,a,b,a,b,a,b,a,b,a,b,a,b,a,b) for x in xs)
  86.810 ms (44 allocations: 1.17 KiB)
100000060

1820 ms vs 86 ms

I used interactive C# session, one can say that compiled would be faster. I checked that with all optimisations enabled it’s 1200 ms (on mac, win could do better, but no more that twice). What I really wanted to prove, that you are wrong, but I failed. Julia seems to be superior in this field, no doubt.

Anyway I still think this behaviour of flatten iterator with extrema seems to be a bug and I opened an issue with more details: Extrema, flattened iterator and too much memory alloation ¡ Issue #34385 ¡ JuliaLang/julia ¡ GitHub

4 Likes

As I said in my first comment, reduce with Iterators.flatten was slow in Julia < 1.4 which did not have transducers. Maybe you are checking this only in a released version of Julia. See:

In 1.5.0-DEV.67:

julia> @btime sum(x for xs in ($a, $b) for x in xs)
  5.718 ns (0 allocations: 0 bytes)

In 1.3.1:

julia> @btime sum(x for xs in ($a, $b) for x in xs)
  40.254 ns (8 allocations: 192 bytes)
5 Likes

Right, I used 1.3.1, on 1.5 for the last one I have:

  43.255 ms (1 allocation: 16 bytes)
100000060

It’s more that 25x faster than C#. I wouldn’t believe if someone told me that. I wonder if more complicated iterators like linked lists or trees will be that much faster too.

1 Like

My guess is that there won’t be much improvement for the usual linked lists or trees as iteration over such structures probably is dominated by chasing the pointers. But, I think transducers would be a great framework for iterating over “hybrid” data structures like B-trees.

2 Likes