Efficient parsing of arrays of arrays of arrays or arbitrary number of arrays of arrays?

Marcell_Havlik · July 30, 2020, 12:15pm

Hey Julianers,
I parsed a JSON file with JSON2.jl and now I have to postprocess the results it into an Array{Array{Float32, N},1}. The problem is that, the JSON parsed struct is an arrays of arrays of arrays of … and so on… like: Array{Array{Array{Float64,1},1},1} or Array{Array{Array{Array{Float64,1},1},1},1} (and it can be arbitrary dimensions in extreme cases)

I know the concrete type is array of tensors under the hood.

For me the fastest version was for Array of 2D tensors:
@time convert(Array{Array{Float32,2},1},[hcat(convert(Array{Array{Float32,1},1},arr)::Array{Array{Float32,1},1}...)' for arr in data])::Array{Array{Float32,2},1}

Can someone help me to achieve better performance?

My test code is:

@time data = [[[1::Any for i = 1:10] for j=1:10000] for k=1:100]
@time convert(Array{Array{Float32,2},1},[hcat(convert(Array{Array{Float32,1},1},arr)::Array{Array{Float32,1},1}...)' for arr in data])::Array{Array{Float32,2},1}

My results:

0.157031 seconds (1.24 M allocations: 172.817 MiB)
0.154512 seconds (1.06 M allocations: 231.872 MiB)

I see it is about 4MiB of memory. So reaching 50x bigger memory consumption size sounds like there is something wrong with my code.

How can I reach better performance?
Thanks in advance!

hendri54 · July 30, 2020, 5:56pm

Possibly useful here:

https://stackoverflow.com/questions/37476815/julia-converting-vector-of-arrays-to-array-for-arbitrary-dimensions

or

https://github.com/SciML/RecursiveArrayTools.jl#vectorofarray

Marcell_Havlik · July 30, 2020, 6:15pm

I think I have an idea, how to cover arbitrary dimensions of tensor! Can some expert check it?

From Flux:

# code from Flux
unsqueeze(xs, dim) = reshape(xs, (size(xs)[1:dim-1]..., 1, size(xs)[dim:end]...))
stack(xs, dim) = cat(unsqueeze.(xs, dim)..., dims=dim)

# custom codes
get_dims(arr) = (length(size(arr[1]))>0 ? (size(arr)..., get_dims(arr[1])...) : size(arr))

# gen data arriving from JSON.jl or JSON2.jl
arbitrary_1d_array_from_file(arr_size) = size(arr_size,1)>1 ? [arbitrary_1d_array_from_file(arr_size[2:end]) for _ in 1:arr_size[1]] : [1:arr_size[1];]

# helper functions
recursive_array(::Val{N}) where {N} = N>1 ? Array{recursive_array(Val(N-1)),1} : Array{Float32, 1} 
conv(::Val{N}, arr) where {N} = (type = recursive_array(Val(N)); convert(type,arr)::type) 
conv_reshaped(::Val{N}, arr) where {N} = convert(Array{Float32,N},arr)::Array{Float32,N}
stack_all(::Val{N},arr) where {N} = N>1 ? stack(stack_all(Val(N-1),arr),1) : stack(arr,1)

# The tests
d=arbitrary_1d_array_from_file([9,8,7,6,5])
dims= get_dims(d)
dims_size = length(dims)
println("Start arbitrary arrays $(typeof(d)) $dims")
res = d
res = conv(Val(dims_size),res)
res = stack_all(Val(dims_size-1),res)
res = reshape(res, dims)
res = conv_reshaped(Val(dims_size),res)
res_dims= get_dims(res)
println("Final array $(typeof(res)) $res_dims")

Result:

Start arbitrary arrays Array{Array{Array{Array{Array{Int64,1},1},1},1},1} (9, 8, 7, 6, 5)
Final array Array{Float32,5} (9, 8, 7, 6, 5)

I have problem with the speeds but… now at least it covers arbitrary dimensions of tensors.
I think this can be useful for a lot of julianers as data parsing will be more and more important.

I hope my code can help someone and he can also help to improve this mechanism.

Marcell_Havlik · July 30, 2020, 6:19pm

I checked the RecursiveArrayTools, but it was for 2D things. Or I just couldn’t understand how to use it for 3,4,5D tensors.

Thank you for the feedback I will check the other thing out tomorrow!

ChrisRackauckas · July 30, 2020, 6:31pm

It’s for ND. You can put any dimensional array, it’s just Vector of Array (i.e. vector on the outside)

Marcell_Havlik · July 31, 2020, 7:38am

I could only use it for Array{Array{T,N},1} case. How do you use for an array like mentioned before? (It lacks documentation and didn’t know how to figure it out.)

recursive_array(::Val{N}) where {N} = N>1 ? Array{recursive_array(Val(N-1)),1} : Array{Float32, 1} 
arr=recursive_array([9,8,7,6,5])

How would you reach this part of the array, like arr[2:4,4,3,2,2:3], can you help me?

ChrisRackauckas · July 31, 2020, 10:23am

oh, that’s a ComponentArray or a MultiScaleArray

Marcell_Havlik · July 31, 2020, 11:56am

MultiScaleArray is something that I was searching for! Thank you for the answer!

For now I think I will go with my solution and later on test the speed. I really have to know how does this scale against my test code.

Topic		Replies	Views
JSON list of lists to Julia matrix, preferably fast and with low memory overhead Performance	5	2672	January 18, 2021
Converting nested vectors of Any New to Julia	1	143	April 3, 2023
JSON write and read back Arrays General Usage json	2	1386	July 18, 2020
Converting array of array of array to multidimensional array General Usage arrays	9	2826	June 26, 2022
How to quickly create array of array? General Usage question	12	1015	February 20, 2018

Efficient parsing of arrays of arrays of arrays or arbitrary number of arrays of arrays?

Related topics