How to quickly create array of array?

question

#1

I find that creating uninitialized arrays of arrays, e.g. with a comprehension:

f1(N,n) = [Matrix{Complex128}(N,N) for i=1:n]

is relatively slow, especially compared to creating a single large array:

f2(N,n) = Matrix{Complex128}(n*N,N)
julia> N=10;n=4000;

julia> @benchmark f1($N,$n)
BenchmarkTools.Trial:
  memory estimate:  6.93 MiB
  allocs estimate:  4002
  --------------
  minimum time:     220.194 μs (0.00% GC)
  median time:      291.726 μs (0.00% GC)
  mean time:        479.013 μs (39.90% GC)
  maximum time:     56.723 ms (97.35% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark f2($N,$n)
BenchmarkTools.Trial:
  memory estimate:  6.10 MiB
  allocs estimate:  2
  --------------
  minimum time:     17.727 μs (0.00% GC)
  median time:      22.393 μs (0.00% GC)
  mean time:        105.279 μs (73.39% GC)
  maximum time:     54.741 ms (99.64% GC)
  --------------
  samples:          10000
  evals/sample:     1

Is there a faster way to do this?

Thanks


#2

Assuming you want the arrays to be independent no. You are creating many small arrays so there will be much higher overhead. If you don’t actually need an array of array but just something with a similar indexing behavior, you should be able to define a wrapper around Array{Complex128,3} or Matrix{Complex128} that can be allocated faster.


#3

I see thanks, an Array{Complex128,3} might work.


#4

Unless I did something wrong, it seems that allocating all the views needed for matrix multiplication make this option slower in the end:

julia> X = Array{Complex128,3}(10,10,1000);

julia> U = rand(Complex128,10,10,1000);

julia> U2 = [U[:,:,i] for i=1:1000];

julia> X2 = [X[:,:,i] for i=1:1000];

julia> function inplace_cumprod1!(X,U)
           n = size(U,3)
           n == size(X,3) || throw(DimensionMismatch("not the same number of submatrices!"))
           @views copy!(X[:,:,1],U[:,:,1])
           @inbounds for i = 2:n
               @views A_mul_B!(X[:,:,i],U[:,:,i],X[:,:,i-1])
           end
       end
inplace_cumprod1! (generic function with 1 method)

julia> function inplace_cumprod2!(X,U)
           n = length(U)
           n == length(X) || throw(DimensionMismatch("not the same number of submatrices!"))
           copy!(X[1],U[1])
           @inbounds for i = 2:n
               A_mul_B!(X[i],U[i],X[i-1])
           end
       end
inplace_cumprod2! (generic function with 1 method)

julia> @benchmark inplace_cumprod1!($X,$U)
BenchmarkTools.Trial:
  memory estimate:  187.44 KiB
  allocs estimate:  2999
  --------------
  minimum time:     1.307 ms (0.00% GC)
  median time:      1.389 ms (0.00% GC)
  mean time:        1.416 ms (0.30% GC)
  maximum time:     2.222 ms (26.43% GC)
  --------------
  samples:          3521
  evals/sample:     1

julia> @benchmark inplace_cumprod2!($X2,$U2)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.213 ms (0.00% GC)
  median time:      1.289 ms (0.00% GC)
  mean time:        1.311 ms (0.00% GC)
  maximum time:     2.538 ms (0.00% GC)
  --------------
  samples:          3804
  evals/sample:     1

#5

If you actually allocates all the views this cannot possibly be faster. After all, avoiding those allocation was the whole point of using a single array. The allocation elision on 0.6 isn’t very good. Should be better on master.


#6

It seems that if I allocate the views in advance, it can be faster, both the allocation and the matrix multiplication"

function allocate_matrices1(N,n)
    return [Matrix{Complex128}(N,N) for i=1:n]
end

function allocate_matrices3(N,n)
    A = Array{Complex128,3}(N,N,n)
    return [view(A,:,:,i) for i=1:n]
end

Allocation:

julia> N=10;n=4000;

julia> @benchmark allocate_matrices1($N,$n)
BenchmarkTools.Trial:
  memory estimate:  6.93 MiB
  allocs estimate:  4002
  --------------
  minimum time:     227.346 μs (0.00% GC)
  median time:      290.482 μs (0.00% GC)
  mean time:        565.908 μs (47.15% GC)
  maximum time:     2.795 ms (0.00% GC)
  --------------
  samples:          8788
  evals/sample:     1


julia> @benchmark allocate_matrices3($N,$n)
BenchmarkTools.Trial:
  memory estimate:  6.38 MiB
  allocs estimate:  4006
  --------------
  minimum time:     71.532 μs (0.00% GC)
  median time:      79.929 μs (0.00% GC)
  mean time:        233.513 μs (62.77% GC)
  maximum time:     1.400 ms (92.38% GC)
  --------------
  samples:          10000
  evals/sample:     1

Cumprod (function in earlier post):

julia> A = allocate_matrices3(N,n); B = allocate_matrices3(N,n);

julia> @benchmark inplace_cumprod2!($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.638 ms (0.00% GC)
  median time:      5.902 ms (0.00% GC)
  mean time:        5.929 ms (0.00% GC)
  maximum time:     6.809 ms (0.00% GC)
  --------------
  samples:          842
  evals/sample:     1

julia> A = allocate_matrices1(N,n); B = allocate_matrices1(N,n);

julia> @benchmark inplace_cumprod2!($A,$B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.458 ms (0.00% GC)
  median time:      6.730 ms (0.00% GC)
  mean time:        6.754 ms (0.00% GC)
  maximum time:     7.545 ms (0.00% GC)
  --------------
  samples:          739
  evals/sample:     1

Is the matrix multiplication faster because the arrays are contiguous in memory?

Also, would there be any obvious problem with doing this?


#7

All code given so far are valid code so no danger/problem from this part. The compiler/optimizer support for allocation is expected to keep improving over time and among them inplace_cumprod1! should theoretically be the easist for the compiler to fully optimize. In another word, the only possible “problem” is “premature optimization”, or at least optimization to a certain compiler limitation.

Possibly. Hard to tell without more careful study/profiling.


#8

I had profiled which is why I was looking into this, but I just did it again and it’s not a problem anymore…


#9

But don’t views need to be re-allocated at each execution with this version?


#10

FWIW, both inplace_cumprod1! and inplace_cumprod2! are a good chunk faster on master on my computer.

julia> versioninfo()
Julia Version 0.6.2
Commit d386e40* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT ZEN)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, generic)

julia> using BenchmarkTools
INFO: Recompiling stale cache file /home/chris/.julia/lib/v0.6/BenchmarkTools.ji for module BenchmarkTools.

julia> X = Array{Complex128,3}(10,10,1000);

julia> U = rand(Complex128,10,10,1000);

julia> U2 = [U[:,:,i] for i=1:1000];

julia> X2 = [X[:,:,i] for i=1:1000];

julia> function inplace_cumprod1!(X,U)
                  n = size(U,3)
                  n == size(X,3) || throw(DimensionMismatch("not the same number of submatrices!"))
                  @views copy!(X[:,:,1],U[:,:,1])
                  @inbounds for i = 2:n
                      @views A_mul_B!(X[:,:,i],U[:,:,i],X[:,:,i-1])
                  end
              end
inplace_cumprod1! (generic function with 1 method)

julia> function inplace_cumprod2!(X,U)
                  n = length(U)
                  n == length(X) || throw(DimensionMismatch("not the same number of submatrices!"))
                  copy!(X[1],U[1])
                  @inbounds for i = 2:n
                      A_mul_B!(X[i],U[i],X[i-1])
                  end
              end
inplace_cumprod2! (generic function with 1 method)

julia> @benchmark inplace_cumprod1!($X,$U)
BenchmarkTools.Trial: 
  memory estimate:  187.44 KiB
  allocs estimate:  2999
  --------------
  minimum time:     874.181 μs (0.00% GC)
  median time:      887.376 μs (0.00% GC)
  mean time:        901.472 μs (0.63% GC)
  maximum time:     1.721 ms (40.30% GC)
  --------------
  samples:          5542
  evals/sample:     1

julia> @benchmark inplace_cumprod2!($X2,$U2)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     795.044 μs (0.00% GC)
  median time:      811.104 μs (0.00% GC)
  mean time:        817.069 μs (0.00% GC)
  maximum time:     1.646 ms (0.00% GC)
  --------------
  samples:          6114
  evals/sample:     1

julia> function allocate_matrices1(N,n)
           return [Matrix{Complex128}(N,N) for i=1:n]
       end
allocate_matrices1 (generic function with 1 method)

julia> function allocate_matrices3(N,n)
           A = Array{Complex128,3}(N,N,n)
           return [view(A,:,:,i) for i=1:n]
       end
allocate_matrices3 (generic function with 1 method)

julia> N=10;n=4000;

julia> A = allocate_matrices3(N,n); B = allocate_matrices3(N,n);

julia> @benchmark inplace_cumprod2!($A,$B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.473 ms (0.00% GC)
  median time:      3.512 ms (0.00% GC)
  mean time:        3.544 ms (0.00% GC)
  maximum time:     3.958 ms (0.00% GC)
  --------------
  samples:          1411
  evals/sample:     1

julia> A = allocate_matrices1(N,n); B = allocate_matrices1(N,n);

julia> @benchmark inplace_cumprod2!($A,$B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.614 ms (0.00% GC)
  median time:      3.660 ms (0.00% GC)
  mean time:        3.690 ms (0.00% GC)
  maximum time:     4.402 ms (0.00% GC)
  --------------
  samples:          1352
  evals/sample:     1
julia> versioninfo()
Julia Version 0.7.0-DEV.4020
Commit 1bbd518* (2018-02-19 22:10 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, znver1)
Environment:

julia> using BenchmarkTools, LinearAlgebra

julia> X = Array{Complex128,3}(10,10,1000);
WARNING: Base.Complex128 is deprecated, use ComplexF64 instead.
 in module Main
┌ Warning: `Array{T, 3}(m::Int, n::Int, o::Int) where T` is deprecated, use `Array{T, 3}(uninitialized, m, n, o)` instead.
│   caller = top-level scope
└ @ Core :0

julia> U = rand(Complex128,10,10,1000);
WARNING: Base.Complex128 is deprecated, use ComplexF64 instead.
 in module Main

julia> U2 = [U[:,:,i] for i=1:1000];

julia> X2 = [X[:,:,i] for i=1:1000];

julia> function inplace_cumprod1!(X,U)
           n = size(U,3)
           n == size(X,3) || throw(DimensionMismatch("not the same number of submatrices!"))
           @views copyto!(X[:,:,1],U[:,:,1])
           @inbounds for i = 2:n
               @views mul!(X[:,:,i],U[:,:,i],X[:,:,i-1])
           end
       end
inplace_cumprod1! (generic function with 1 method)

julia> function inplace_cumprod2!(X,U)
           n = length(U)
           n == length(X) || throw(DimensionMismatch("not the same number of submatrices!"))
           copyto!(X[1],U[1])
           @inbounds for i = 2:n
               mul!(X[i],U[i],X[i-1])
           end
       end
inplace_cumprod2! (generic function with 1 method)

julia> @benchmark inplace_cumprod1!($X,$U)
┌ Warning: `indmin` is deprecated, use `argmin` instead.
│   caller = minimum at trials.jl:112 [inlined]
└ @ Core trials.jl:112
┌ Warning: `indmax` is deprecated, use `argmax` instead.
│   caller = maximum at trials.jl:117 [inlined]
└ @ Core trials.jl:117
BenchmarkTools.Trial: 
  memory estimate:  187.38 KiB
  allocs estimate:  2998
  --------------
  minimum time:     720.254 μs (0.00% GC)
  median time:      763.515 μs (0.00% GC)
  mean time:        791.029 μs (1.42% GC)
  maximum time:     33.777 ms (97.57% GC)
  --------------
  samples:          6312
  evals/sample:     1

julia> @benchmark inplace_cumprod2!($X2,$U2)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     685.600 μs (0.00% GC)
  median time:      696.730 μs (0.00% GC)
  mean time:        698.834 μs (0.00% GC)
  maximum time:     864.746 μs (0.00% GC)
  --------------
  samples:          7147
  evals/sample:     1

julia> function allocate_matrices1(N,n)
           return [Matrix{Complex128}(N,N) for i=1:n]
       end
allocate_matrices1 (generic function with 1 method)

julia> function allocate_matrices3(N,n)
           A = Array{Complex128,3}(N,N,n)
           return [view(A,:,:,i) for i=1:n]
       end
allocate_matrices3 (generic function with 1 method)

julia> N=10;n=4000;

julia> A = allocate_matrices3(N,n); B = allocate_matrices3(N,n);
WARNING: Base.Complex128 is deprecated, use ComplexF64 instead.
 in module Main
WARNING: Base.Complex128 is deprecated, use ComplexF64 instead.
 in module Main
┌ Warning: `Array{T, 3}(m::Int, n::Int, o::Int) where T` is deprecated, use `Array{T, 3}(uninitialized, m, n, o)` instead.
│   caller = allocate_matrices3(::Int64, ::Int64) at REPL[12]:2
└ @ Main REPL[12]:2

julia> @benchmark inplace_cumprod2!($A,$B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.875 ms (0.00% GC)
  median time:      2.897 ms (0.00% GC)
  mean time:        2.953 ms (0.00% GC)
  maximum time:     3.398 ms (0.00% GC)
  --------------
  samples:          1692
  evals/sample:     1

julia> A = allocate_matrices1(N,n); B = allocate_matrices1(N,n);
WARNING: Base.Complex128 is deprecated, use ComplexF64 instead.
 in module Main
WARNING: Base.Complex128 is deprecated, use ComplexF64 instead.
 in module Main
┌ Warning: `Array{T, 2}(m::Int, n::Int) where T` is deprecated, use `Array{T, 2}(uninitialized, m, n)` instead.
│   caller = #7 at <missing>:0 [inlined]
└ @ Core <missing>:0
┌ Warning: `Array{T, 2}(m::Int, n::Int) where T` is deprecated, use `Array{T, 2}(uninitialized, m, n)` instead.
│   caller = #7 at <missing>:0 [inlined]
└ @ Core <missing>:0

julia> @benchmark inplace_cumprod2!($A,$B)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.113 ms (0.00% GC)
  median time:      3.148 ms (0.00% GC)
  mean time:        3.174 ms (0.00% GC)
  maximum time:     3.541 ms (0.00% GC)
  --------------
  samples:          1571
  evals/sample:     1

Can’t say I know what’s going on, but because I’d guess that the vast majority of time is spent on repeated BLAS calls, the percent difference in speed looks really large/impressive to me. Wouldn’t have thought there’s that much overhead to trim off.


#11

Nice, and there probably would be an even bigger improvement with those deprecations fixed!


#12

Yeah! I just replaced copy -> copyto and didn’t bother with those outside of the benchmarks. I updated my earlier post.
I assume the indmin and indmax within BenchmarkTools itself doesn’t interfere with the results.

A striking improvement, although there is only 1 less allocation. I guess we’ll all hear about it here on Discourse when the compiler starts regularly eliding allocations for views.


#13

I’ll take a free ~15% improvement any day!

Alright, thanks all for your help.