Efficient parsing of arrays of arrays of arrays, with an arbitrary depth of nesting?

Hey Julianers,
I parsed a JSON file with JSON2.jl and now I have to postprocess the result into an Array{Array{Float32,N},1}. The problem is that the parsed JSON struct is an array of arrays of arrays of … and so on, like Array{Array{Array{Float64,1},1},1} or Array{Array{Array{Array{Float64,1},1},1},1} (and the nesting can be arbitrarily deep in extreme cases).

I know that, under the hood, the concrete type is an array of tensors.

For me, the fastest version for an array of 2D tensors was:
@time convert(Array{Array{Float32,2},1},[hcat(convert(Array{Array{Float32,1},1},arr)::Array{Array{Float32,1},1}...)' for arr in data])::Array{Array{Float32,2},1}

Can someone help me to achieve better performance?

My test code is:

@time data = [[[1::Any for i = 1:10] for j=1:10000] for k=1:100]
@time convert(Array{Array{Float32,2},1},[hcat(convert(Array{Array{Float32,1},1},arr)::Array{Array{Float32,1},1}...)' for arr in data])::Array{Array{Float32,2},1}

My results:

0.157031 seconds (1.24 M allocations: 172.817 MiB)
0.154512 seconds (1.06 M allocations: 231.872 MiB)

The data itself should only be about 4 MiB, so roughly 50x that in memory consumption suggests there is something wrong with my code.

How can I reach better performance?
Thanks in advance!
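One likely culprit (my suggestion, not from the original post) is splatting 10,000 vectors into `hcat`. `reduce(hcat, vs)` has a specialized method in Base for a vector of vectors that avoids the splatting cost; a minimal sketch on the same test data:

```julia
# Sketch (assumption: same data shape as the test code above).
# reduce(hcat, vs) hits a specialized Base method for a vector of
# vectors, avoiding the cost of splatting 10_000 arguments into hcat.
to_matrix(arr) = permutedims(reduce(hcat, [Float32.(v) for v in arr]))

data = [[[1 for i = 1:10] for j = 1:10000] for k = 1:100]
res = [to_matrix(arr) for arr in data]   # Vector{Matrix{Float32}}, each 10000×10
```

`permutedims` materializes the transpose to match the `'` layout in the one-liner above; it can be dropped if a 10×10000 layout is acceptable.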

Possibly useful here:

or


I think I have an idea of how to cover arbitrary tensor dimensions! Can some expert check it?

From Flux:

# code from Flux
unsqueeze(xs, dim) = reshape(xs, (size(xs)[1:dim-1]..., 1, size(xs)[dim:end]...))
stack(xs, dim) = cat(unsqueeze.(xs, dim)..., dims=dim)

# custom codes
get_dims(arr) = (length(size(arr[1]))>0 ? (size(arr)..., get_dims(arr[1])...) : size(arr))

# gen data arriving from JSON.jl or JSON2.jl
arbitrary_1d_array_from_file(arr_size) = size(arr_size,1)>1 ? [arbitrary_1d_array_from_file(arr_size[2:end]) for _ in 1:arr_size[1]] : [1:arr_size[1];]

# helper functions
recursive_array(::Val{N}) where {N} = N>1 ? Array{recursive_array(Val(N-1)),1} : Array{Float32, 1} 
conv(::Val{N}, arr) where {N} = (type = recursive_array(Val(N)); convert(type,arr)::type) 
conv_reshaped(::Val{N}, arr) where {N} = convert(Array{Float32,N},arr)::Array{Float32,N}
stack_all(::Val{N},arr) where {N} = N>1 ? stack(stack_all(Val(N-1),arr),1) : stack(arr,1)

# The tests
d=arbitrary_1d_array_from_file([9,8,7,6,5])
dims= get_dims(d)
dims_size = length(dims)
println("Start arbitrary arrays $(typeof(d)) $dims")
res = d
res = conv(Val(dims_size),res)
res = stack_all(Val(dims_size-1),res)
res = reshape(res, dims)
res = conv_reshaped(Val(dims_size),res)
res_dims= get_dims(res)
println("Final array $(typeof(res)) $res_dims")

Result:

Start arbitrary arrays Array{Array{Array{Array{Array{Int64,1},1},1},1},1} (9, 8, 7, 6, 5)
Final array Array{Float32,5} (9, 8, 7, 6, 5)
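To make the Flux helpers quoted above concrete (a small example I've added, not from Flux's docs): `unsqueeze` inserts a singleton dimension at `dim`, and `stack` concatenates the unsqueezed arrays along it:

```julia
# Example of the Flux helpers quoted in the code above.
unsqueeze(xs, dim) = reshape(xs, (size(xs)[1:dim-1]..., 1, size(xs)[dim:end]...))
stack(xs, dim) = cat(unsqueeze.(xs, dim)..., dims=dim)  # shadows Base.stack on Julia ≥ 1.9

a = [1 2; 3 4]             # 2×2 matrix
u = unsqueeze(a, 1)        # 1×2×2: singleton dimension inserted in front
s = stack([a, a, a], 1)    # 3×2×2: three copies concatenated along the new dim
```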

I still have problems with the speed, but at least it now covers arbitrary tensor dimensions.
I think this can be useful for a lot of Julianers, as data parsing will become more and more important.

I hope my code can help someone, and that others can help improve this mechanism in turn.

I checked RecursiveArrayTools, but it seemed to be for 2D things. Or I just couldn’t understand how to use it for 3/4/5D tensors.

Thank you for the feedback, I will check the other thing out tomorrow!

It’s for ND. You can put in an array of any dimension; it’s just a Vector of Arrays (i.e., a vector on the outside).

I could only use it for the Array{Array{T,N},1} case. How do you use it for an array like the one mentioned before? (It lacks documentation, and I couldn’t figure it out on my own.)

arr = arbitrary_1d_array_from_file([9, 8, 7, 6, 5])  # nested vectors, as defined earlier

How would you reach a part of the array like arr[2:4, 4, 3, 2, 2:3]? Can you help me?
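For context (my addition, not from the thread): that mixed scalar/range slice is exactly what a dense array gives you once the nested vectors have been converted, e.g. by the pipeline above; on the raw nested vectors the scalar indices would have to be chained instead:

```julia
# Once converted to a dense array, normal multidimensional slicing applies.
arr = ones(Float32, 9, 8, 7, 6, 5)   # placeholder for the converted result
part = arr[2:4, 4, 3, 2, 2:3]        # 3×2 Matrix{Float32}
```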

Oh, that’s a ComponentArray or a MultiScaleArray.

MultiScaleArray is something that I was searching for! Thank you for the answer!

For now I think I will go with my solution and test the speed later on. I really need to know how this scales against my test code.
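As a starting point for that comparison (my sketch, with a hypothetical `reduce`-based variant), both pipelines can be timed on the generated test data; time a warmed-up call so JIT compilation is excluded:

```julia
# Hypothetical benchmark harness comparing the hcat-splat one-liner
# with a reduce(hcat, ...) variant on the test data from earlier.
data = [[[1 for i = 1:10] for j = 1:10000] for k = 1:100]

splat_version(data)  = [Matrix(hcat(arr...)') for arr in data]
reduce_version(data) = [permutedims(reduce(hcat, [Float32.(v) for v in arr])) for arr in data]

splat_version(data); reduce_version(data)   # warm up (JIT compilation)
@time splat_version(data)
@time reduce_version(data)
```

Note the element types differ (Int vs. Float32 here), so only the shapes and timings are directly comparable in this sketch.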