Very best way to concatenate array of arrays, while applying a function

BambOoxX · October 3, 2022, 9:59am

Continuing on this nice dicussion about reduce would there be a way to use the same syntax while applying another function, if at al relevant.

Basically, I have some calls that look like reduce(vcat, fun.(array_of_array)) where I want to apply fun to all items in array_of_array. Given that I am not interested in the result of a single call of fun, would it make a difference to have something like a reduce(vcat, a -> fun(a), array_of_array) ?

I am not familiar enough with reduce to say if something already exists or not to do that…

carstenbauer · October 3, 2022, 10:10am

Didn’t look at the other discussion but sounds just like mapreduce to me.

mariok90 · October 3, 2022, 10:11am

You could use mapreduce:

arr_of_arr = [[rand() for _ in 1:3] for _ in 1:5]

fun(x) = x / 2
mapreduce(fun, vcat, arr_of_arr)

edit: I was a little bit too slow

DNF · October 3, 2022, 10:18am

[rand(3) for _ in 1:5]

Another issue is if you want the map function to work element-wise on each subarray.

mcabbott · October 3, 2022, 10:37am

Note that reduce(vcat, fun.(array)) may well hit a special method for vectors of vectors or matrices, which allocates one array to write into. There is no such method for mapreduce(fun, vcat, array), and thus it will call vcat many times, allocating length(array)-1 arrays in the process. This will probably be much more expensive than the one temporary array from fun.(array).

If fun(a) returns a vector, then you can try vec(stack(fun, array)), with using Compat first.

BambOoxX · October 3, 2022, 10:55am

Damn, I forgot mapreduce. I do not have that map reflex yet.

I just benchmarked it quickly and it turns out your are very much right about this. For instance:

fun = x -> 3*x
arr_of_arr = [[rand() for _ in 1:1000] for _ in 1:100]
 @benchmark reduce(vcat,$fun.($arr_of_arr))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  136.800 μs …   5.901 ms  ┊ GC (min … max):  0.00% … 96.36%
 Time  (median):     170.600 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   224.968 μs ± 309.902 μs  ┊ GC (mean ± σ):  19.03% ± 13.03%

  ██▄▂▁▂▁  ▁                                                  ▁ ▂
  ███████▇▇█▇▆▅▄▄▃▄▃▄▄▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▇▆▆█ █
  137 μs        Histogram: log(frequency) by time       2.09 ms <

 Memory estimate: 1.54 MiB, allocs estimate: 104.

julia> @benchmark mapreduce($fun,vcat,$arr_of_arr)
BenchmarkTools.Trial: 1462 samples with 1 evaluation.
 Range (min … max):  2.166 ms … 15.543 ms  ┊ GC (min … max):  0.00% … 46.46%
 Time  (median):     3.730 ms              ┊ GC (median):    35.78%
 Time  (mean ± σ):   3.402 ms ±  1.167 ms  ┊ GC (mean ± σ):  23.45% ± 17.79%

  ▂▅▆█▅▂▂▂  ▁       ▁▇▇▇▆▄▃▂  ▁                              ▁
  ████████████▇▆▇▄▆▆█████████▇█▅▇▇▇▆▆▆▆▆▄▁▅▁▆▁▄▅▁▅▁▁▄▁▄▁▄▁▁▄ █
  2.17 ms      Histogram: log(frequency) by time      6.8 ms <

 Memory estimate: 39.30 MiB, allocs estimate: 297.

Do you know why mapreduce cannot hit the same methods than reduce ?

Per · October 3, 2022, 11:22am

That’s becaues reduce can check the sizes beforehand to know how much memory to allocate, but mapreduce cannot know the sizes of arrays output by the user function before it is called.

tbeason · October 3, 2022, 12:19pm

This is a strong statement and I only half mean it, but… I kind of wish I had never learned about mapreduce. I reach for it quite often now but I usually find out that it is doing things in a pretty inefficient way (because of the shortcomings listed here).

rocco_sprmnt21 · October 3, 2022, 1:36pm

using compat

julia> @benchmark reduce(vcat,$fun.($arr_of_arr))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  154.500 μs …   4.178 ms  ┊ GC (min … max):  0.00% … 83.20%        
 Time  (median):     215.400 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   284.858 μs ± 259.627 μs  ┊ GC (mean ± σ):  14.05% ± 14.42%        

  ▄▇█▇▆▅▄▄▃▃▂▁                                      ▁           ▂
  ███████████████▇▇▆▆▆▆▆▆▅▄▄▄▄▄▃▄▄▄▄▃▅▃▂▅▄▅▄▃▄▄▅▃▄▆██▇▇▇▆▆▆▇▆▆▆ █
  154 μs        Histogram: log(frequency) by time       1.43 ms <

 Memory estimate: 1.54 MiB, allocs estimate: 104.

julia> @benchmark mapreduce($fun,vcat,$arr_of_arr)
BenchmarkTools.Trial: 922 samples with 1 evaluation.
 Range (min … max):  3.434 ms … 15.972 ms  ┊ GC (min … max):  0.00% … 44.82%
 Time  (median):     4.903 ms              ┊ GC (median):    15.35%
 Time  (mean ± σ):   5.394 ms ±  1.416 ms  ┊ GC (mean ± σ):  14.93% ±  6.12%

          █▅▄▁▂
  ▂▃▄▂▃▃▃▅█████▇▆▄▅▄▃▃▃▃▃▃▂▃▃▂▃▂▃▃▂▃▃▂▃▃▄▃▃▂▂▃▁▂▂▂▂▂▁▁▁▁▁▁▁▂ ▃
  3.43 ms        Histogram: frequency by time        10.8 ms <

 Memory estimate: 39.30 MiB, allocs estimate: 297.

julia> @benchmark vec(stack($fun, $arr_of_arr))
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  116.700 μs …   3.675 ms  ┊ GC (min … max):  0.00% … 85.37%        
 Time  (median):     194.900 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   252.690 μs ± 194.317 μs  ┊ GC (mean ± σ):  11.77% ± 14.80%        

  ▁ ▂██▆▅▅▄▄▄▃▂▁▁                                               ▂
  █████████████████▇▅▇▆▆▆▆▅▅▅▅▄▅▄▅▅▆▆▆▆▆▇▇▇▇▆▆▇▆▇▆▆▅▆▆▅▅▅▅▅▅▃▃▅ █
  117 μs        Histogram: log(frequency) by time       1.15 ms <

 Memory estimate: 1.54 MiB, allocs estimate: 104.

Mason · October 3, 2022, 3:25pm

By the way, this will only work if fun always returns a vector of the same length.

mcabbott · October 4, 2022, 12:20am

This is true, although it could still do better than pairwise vcat, for instance by guessing that all are the same size as the first. Or by repeated doubling like append!.

Indeed, this is one of the things which makes stack easier.

Mostly nobody has got around to it. PR 31636 had a go. Another option would simply be to send this to reduce(vcat, map(f, xs)), as this will usually be better. (This is also what happens for mapreduce(f, op, xs, ys) at present.)

The other option would be to write a flatten companion for stack, since these clever reduce overloads are hard to discover & quite fragile – it’s easy to step off the fast path, as with init keyword for example.

Topic		Replies	Views
Fastest way to concatenate many arrays along existing axis? General Usage linearalgebra , arrays	5	419	June 19, 2024
Mapreduce with function that has multiple return values General Usage	2	619	June 18, 2020
Very best way to concatenate an array of arrays General Usage arrays	47	28252	July 24, 2021
Concatenate each General Usage arrays	4	311	October 26, 2021
Why is reduce with concatenation type unstable? General Usage question	2	559	August 19, 2020

Very best way to concatenate array of arrays, while applying a function

Related topics