Vector of arrays initialization

Hi,

I want to create a container for a list of arrays with known sizes but different element types. The first thing that comes to mind is a vector of arrays, like this:

nv = 2
T = [Float32, Int32]
nsize = [4,4,4]
vsize = [3,1]

container =
   [zeros(T[iv], vsize[iv], nsize[1], nsize[2], nsize[3]) for iv in 1:nv]

Or I can start with an empty vector:

container2 = []
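
For reference, the two containers differ in their element type, which matters for the performance questions further down. A minimal sketch that only inspects the types, reusing the names from the snippet above (the comments describe what I would expect, not captured output):

```julia
T = [Float32, Int32]
nsize = [4, 4, 4]
vsize = [3, 1]

container = [zeros(T[iv], vsize[iv], nsize...) for iv in 1:2]
container2 = []

# The arrays have different element types, so the outer vector cannot have
# a concrete element type; each entry still carries its own concrete type.
println(eltype(container))              # a non-concrete array type
println(isconcretetype(eltype(container)))
println(typeof(container[1]))           # 4-dimensional Float32 array
println(typeof(container[2]))           # 4-dimensional Int32 array

# `[]` is shorthand for `Any[]`, so its element type is Any.
println(eltype(container2))
```

Either way, code that pulls an array out of the container has to discover its type at run time.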

I have a function that generates the arrays, and I want to store its output in the container. For simplicity, take:

# Stand-in for the real data source: returns a vsize × prod(nsize) matrix of ones.
@noinline function generateData(T, vsize, nsize)
   return ones(T, vsize, nsize[1]*nsize[2]*nsize[3])
end

The part that confuses me is how to copy the generated arrays into the container. I tested a small case:

nv = 2
T = [Float32, Int32]
nsize = [16,16,16]
vsize = [3,1]

@noinline function generateData(T, vsize, nsize)
   d = ones(T, vsize, nsize[1]*nsize[2]*nsize[3])
end

container =
   [zeros(T[iv], vsize[iv], nsize[1], nsize[2], nsize[3]) for iv in 1:nv]

container2 = []

# 1st time execution
container[1][:] = generateData(T[1], vsize[1], nsize);
@views container[1][:] = generateData(T[1], vsize[1], nsize);
container[1][:,:,:,:] = generateData(T[1], vsize[1], nsize);
container[1][:,:,:,:] = reshape(generateData(T[1], vsize[1], nsize), vsize[1], nsize[1], nsize[2], nsize[3]);
@views container[1][:,:,:,:] = generateData(T[1], vsize[1], nsize);

push!(container2,
   reshape(generateData(T[1], vsize[1], nsize), vsize[1], nsize[1], nsize[2], nsize[3]));

println("2nd time execution:")

@time   container[1][:] = generateData(T[1], vsize[1], nsize);

@time   @views container[1][:] = generateData(T[1], vsize[1], nsize);

@time   container[1][:,:,:,:] = generateData(T[1], vsize[1], nsize);

@time   container[1][:,:,:,:] = reshape(generateData(T[1], vsize[1], nsize), vsize[1], nsize[1], nsize[2], nsize[3]);

@time   @views container[1][:,:,:,:] = generateData(T[1], vsize[1], nsize);

container2 = []

@time   push!(container2,
   reshape(generateData(T[1], vsize[1], nsize), vsize[1], nsize[1], nsize[2], nsize[3]));

println("")

and got

2nd time execution:
  0.000021 seconds (3 allocations: 48.094 KiB)
  0.000030 seconds (3 allocations: 48.094 KiB)
  0.000061 seconds (40 allocations: 48.922 KiB)
  0.000062 seconds (40 allocations: 48.922 KiB)
  0.000017 seconds (4 allocations: 48.141 KiB)

Note that the shapes differ: the function returns a 2D array, while the container stores 4D arrays.
There must be something I am missing about slices and views.

  1. Does syntax like A[:] on the left-hand side of an assignment imply a view?
  2. For multi-dimensional arrays, what is the difference between A[:] and A[:,:,:,:]?
  3. In my use case, would it be better to start with an empty vector of type Any and push to it every time I call the data generation function? For performance, how can I take advantage of knowing the sizes and types of the data ahead of time?
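
A small sketch may help separate the cases in the first two questions. As far as I understand Julia's indexing rules, `A[inds...] = B` is lowered to `setindex!(A, B, inds...)`, which writes into `A` in place, so nothing is sliced or viewed on the left-hand side; a slice only allocates a copy when it is used as a value. `A[:]` uses a single linear index, while `A[:,:,:,:]` supplies one index per dimension. The array sizes below are arbitrary:

```julia
A = zeros(Float32, 3, 4, 4, 4)
B = ones(Float32, 3, 64)          # same number of elements, different shape

# Left-hand side: both forms call setindex! and overwrite A in place.
A[:] = B                          # linear indexing, i.e. setindex!(A, B, :)
@assert all(A .== 1)

A[:, :, :, :] = 2 .* B            # one index per dimension
@assert all(A .== 2)

# As a value: a slice is a copy, a view is not.
c = A[:]                          # new vector holding length(A) elements
v = @view A[:]                    # no copy; shares memory with A
A[1] = 7
@assert c[1] == 2                 # the copy did not change
@assert v[1] == 7                 # the view sees the update

@assert size(A[:]) == (192,)            # flattened
@assert size(A[:, :, :, :]) == size(A)  # shape preserved
```

Under that reading, `@views` should make no difference to the assignments being timed, since there is no slice on the left-hand side to turn into a view.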

If you’re doing benchmarking, then you’re presumably worried about performance … in which case you should really think hard about this data structure, because it’s terrible for performance (it forces Julia to do type-unstable dynamic dispatch). Are you sure you can’t promote them all to a common type?

e.g. all Float32 and Int32 values, from your example, can be represented exactly by a Float64 value.
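
To illustrate that suggestion: Float64 has a 53-bit significand, so every Int32 and every Float32 value converts to it without rounding. A quick check with made-up sample values:

```julia
ints   = Int32[typemin(Int32), -1, 0, 1, typemax(Int32)]
floats = Float32[0.1f0, 1.0f0 / 3.0f0, floatmax(Float32)]

# Converting to Float64 and back recovers the original values exactly.
@assert Int32.(Float64.(ints)) == ints
@assert Float32.(Float64.(floats)) == floats

# With a common element type the container itself is concretely typed.
container = [zeros(Float64, v, 4, 4, 4) for v in (3, 1)]
@assert eltype(container) == Array{Float64,4}
```

The same guarantee does not hold for 64-bit integers, whose values can need more than 53 bits.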

PS. Use @btime with $ interpolation for benchmarking this sort of thing. Using @time with global variables to time tiny calculations like this is a recipe for misleading results.


Thanks for the tips. The real data are not actually generated in Julia; they come from a file. Each array in the file may be of type Float32, Int32, UInt64, Float64, Bool, etc., so I’m not sure I can promote them to a common type.
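
If promotion is not an option, one commonly suggested pattern (my assumption, not something proposed in this thread) is to keep the arrays in a tuple, whose type records every element type, and to do the actual work inside a function that receives one array at a time (a "function barrier"). In the sketch below `fill_from!` is a hypothetical helper, `generateData` stands in for the file reader, and the four-entry `vsize` is made up:

```julia
T = (Float32, Int32, UInt64, Float64)
vsize = (3, 1, 3, 1)              # assumed: one leading size per type
nsize = (16, 16, 16)

# Stand-in for reading one array from the file.
generateData(::Type{S}, v, n) where {S} = ones(S, v, prod(n))

# Hypothetical helper acting as a function barrier: inside it, `dest` has a
# concrete type, so the copy is compiled for that element type.
function fill_from!(dest::AbstractArray, src::AbstractArray)
    length(dest) == length(src) || throw(DimensionMismatch("lengths differ"))
    copyto!(dest, src)            # in-place copy in linear order; shapes may differ
    return dest
end

# A tuple keeps the concrete type of every array in its own type.
container = Tuple(zeros(T[i], vsize[i], nsize...) for i in eachindex(T))

for i in eachindex(container)
    fill_from!(container[i], generateData(T[i], vsize[i], nsize))
end
```

Preallocating like this uses the known sizes and types up front; whether a tuple stays practical depends on how many arrays the file holds.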

Benchmarking with @btime:

using BenchmarkTools

nv = 4
T = [Float32, Int32, UInt64, Float64]
nsize = [16,16,16]
vsize = [3,1,3,1]   # assumed values: one entry per type, since vsize[iv] is indexed for iv in 1:nv

@noinline function generateData(T, vsize, nsize)
   d = ones(T, vsize, nsize[1]*nsize[2]*nsize[3])
end

container =
   [zeros(T[iv], vsize[iv], nsize[1], nsize[2], nsize[3]) for iv in 1:nv]

container2 = []

push!(container2,
   reshape(generateData(T[1], vsize[1], nsize), vsize[1], nsize[1], nsize[2], nsize[3]));

@btime   $container[1][:] = generateData($T[1], $vsize[1], $nsize);

@btime   @views $container[1][:] = generateData($T[1], $vsize[1], $nsize);

@btime   $container[1][:,:,:,:] = generateData($T[1], $vsize[1], $nsize);

@btime   $container[1][:,:,:,:] = reshape(generateData($T[1], $vsize[1], $nsize), $vsize[1], $nsize[1], $nsize[2], $nsize[3]);

@btime   @views $container[1][:,:,:,:] = generateData($T[1], $vsize[1], $nsize);

container2 = []

@time   push!(container2,
   reshape(generateData(T[1], vsize[1], nsize), vsize[1], nsize[1], nsize[2], nsize[3]));

println("")

gives

  3.298 μs (3 allocations: 48.09 KiB)
  3.349 μs (3 allocations: 48.09 KiB)
  25.219 μs (40 allocations: 48.92 KiB)
  27.910 μs (43 allocations: 49.09 KiB)
  29.277 μs (40 allocations: 48.92 KiB)
  0.000022 seconds (7 allocations: 48.312 KiB)

For the push! method, I don’t know how to use @btime, because container2 would grow drastically over the repeated runs…
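
One way around this, as far as I know, is BenchmarkTools' `setup` keyword combined with `evals=1`: the setup expression runs before each sample and is not timed, and `evals=1` means each sample performs exactly one `push!` on a fresh container. A sketch, where `make_block` is a hypothetical wrapper and `c` is local to the benchmark:

```julia
using BenchmarkTools

nsize = [16, 16, 16]
generateData(T, vsize, nsize) = ones(T, vsize, prod(nsize))

# Hypothetical wrapper: generate one array and give it its 4D shape.
make_block(nsize) = reshape(generateData(Float32, 3, nsize), 3, nsize...)

# `c` is rebuilt in the untimed setup phase, so it never grows across samples.
@btime push!(c, make_block($nsize)) setup=(c = Any[]) evals=1
```

This times generation, reshape and `push!` together, like the `@time` call above; moving `make_block` into `setup` would isolate the `push!` itself.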

DataFrames.jl (GitHub: JuliaData/DataFrames.jl, in-memory tabular data in Julia) is designed for this sort of circumstance.