Hello,
I would like to know if there is a better way to stack multidimensional arrays, the way vcat does.
Below is the code I am working with.
data1 = rand(10000,3072);
data2 = rand(10000,3072);
data3 = rand(10000,3072);
data4 = rand(10000,3072);
data5 = rand(10000,3072);
function stack_Array_1()
Xaxis = [];
for i=1:5
push!(Xaxis,eval(Symbol("data$i")))
end
Xaxis
end
function stack_Array_2()
Yaxis = Array{Float64,2}(0,3072)
for i=1:5
Yaxis = vcat(Yaxis,eval(Symbol("data$i")))
end
Yaxis
end
@btime stack_Array_1();
2.319 μs (33 allocations: 1.61 KiB)
Any[5]
10000×3072 Array{Float64,2}:
10000×3072 Array{Float64,2}:
10000×3072 Array{Float64,2}:
10000×3072 Array{Float64,2}:
10000×3072 Array{Float64,2}:
@btime stack_Array_2();
1.973 s (75 allocations: 3.43 GiB)
50000×3072 Array{Float64,2}:
0.635014 0.462685 0.185559 0.295888 … 0.123956 0.971991 0.559269
0.409272 0.57993 0.820811 0.993251 0.12653 0.553527 0.577177
0.660994 0.114057 0.702278 0.119705 0.354153 0.681063 0.057957
0.11004 0.729124 0.25563 0.717678 … 0.554839 0.800087 0.779025
0.854646 0.217248 0.834483 0.49127 0.245325 0.748648 0.725246
0.733918 0.799065 0.349517 0.917985 0.619041 0.0812406 0.321144
⋮ ⋱ ⋮
0.397121 0.578101 0.832732 0.508987 0.85815 0.61081 0.735447
0.355548 0.771022 0.584872 0.0710232 0.901111 0.567234 0.735604
0.281855 0.117889 0.164787 0.719332 0.359149 0.668798 0.570658
0.0783613 0.947521 0.327537 0.722403 … 0.152016 0.173811 0.346503
0.905858 0.611356 0.158429 0.0897009 0.788216 0.790752 0.968152
0.293524 0.558019 0.123042 0.221605 0.325241 0.666398 0.310829
I would like the arrays to be vertically concatenated, which is what stack_Array_2() does, but its performance is poor. stack_Array_1() performs well, but it only collects the arrays in a vector, and I am not sure whether there is a way to stack them from there. Kindly let me know if there is any way to get the desired result.
Thank You.
dataarrays = [rand(10000, 3072) for _ in 1:5]
vcat(dataarrays...)
Don’t use eval here. Don’t vcat recursively, use the ... operator.
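A minimal sketch of the difference, using small matrices so it runs quickly. The names stack_recursive, stack_splat and parts are illustrative and not from the posts above. The reduce(vcat, parts) line is an extra alternative, not part of the advice given here.

```julia
# Recursive vcat: every iteration copies everything accumulated so far.
function stack_recursive(parts)
    out = zeros(0, size(parts[1], 2))
    for p in parts
        out = vcat(out, p)
    end
    return out
end

# Splatting: one vcat call, one allocation for the final result.
stack_splat(parts) = vcat(parts...)

# Small stand-ins for the five 10000×3072 batches.
parts = [rand(4, 3) for _ in 1:5]

# Both approaches produce the same 20×3 matrix.
@assert stack_splat(parts) == stack_recursive(parts)
@assert size(stack_splat(parts)) == (20, 3)

# reduce(vcat, parts) gives the same result without splatting.
@assert reduce(vcat, parts) == stack_splat(parts)
```

Because parts is built once as a vector, no eval is needed to reach the individual arrays.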
@Tamas_Papp
dataarrays = [rand(10000, 3072) for _ in 1:5]
I am working on the CIFAR10 Python pickle dataset, which has 5 batch files with 10000 records each (each of type Array{UInt8,2} and size (10000,3072)). Hence I used 5 different rand arrays as an example. Thank you for the list comprehension method above; it will be useful for testing.
Don’t use eval here
Kindly elaborate. Since I had to loop over 5 different datasets, I thought eval was my only option for looping over all of them. I will be glad to learn if that is the wrong way to go about it.
use the … operator.
Thank you for the splat operator. Performance has improved a bit with the code below. Please let me know if I am doing anything wrong.
@btime final_data1 = stack_Array_1();
2.353 μs (33 allocations: 1.61 KiB)
@btime final_data2 = stack_Array_2();
2.274 s (75 allocations: 3.43 GiB)
@btime final_data3 = stack_Array_3();
668.435 ms (49 allocations: 1.14 GiB)
function stack_Array_1()
Xaxis = [];
for i=1:5
push!(Xaxis,eval(Symbol("data$i")))
end
Xaxis
end
function stack_Array_2()
Yaxis = Array{Float64,2}(0,3072)
for i=1:5
Yaxis = vcat(Yaxis,eval(Symbol("data$i")))
end
Yaxis
end
function stack_Array_3()
Xaxis = [];
for i=1:5
push!(Xaxis,eval(Symbol("data$i")))
end
vcat(Xaxis...)
end
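The allocation totals reported by @btime above are consistent with simple arithmetic on the array sizes: the recursive version allocates intermediate results holding 1, 2, 3, 4 and 5 batches, while a single vcat(Xaxis...) allocates only the final result. A small check of that arithmetic (the variable names are illustrative):

```julia
# Size in bytes of one 10000×3072 Float64 batch.
batch_bytes = 10000 * 3072 * sizeof(Float64)

# Recursive vcat allocates results holding 1, 2, 3, 4 and 5 batches.
recursive_bytes = sum(k * batch_bytes for k in 1:5)

# A single vcat(Xaxis...) allocates only the final 5-batch result.
splat_bytes = 5 * batch_bytes

# Convert bytes to GiB.
to_gib(b) = b / 2^30

recursive_gib = to_gib(recursive_bytes)  # ≈ 3.43, matching stack_Array_2
splat_gib = to_gib(splat_bytes)          # ≈ 1.14, matching stack_Array_3
```

The remaining 1.14 GiB is the size of the 50000×3072 result itself, so it cannot be reduced further by changing how vcat is called.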
I told you above, yet you are posting the same code (with eval and recursive vcat).
eval is not necessary here. Generally, you should not touch eval unless for generated code.
I understand that rand is for the MWE, but just read whatever data structure you have into a vector of arrays, if that’s the most convenient.
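As a sketch of what reading into a vector of arrays could look like: load_batch below is a hypothetical stand-in for whatever actually reads one batch, and it only fabricates a small UInt8 matrix so that the example runs on its own.

```julia
# Hypothetical loader standing in for whatever reads one batch from disk;
# here it just fabricates a small UInt8 matrix.
load_batch(i) = fill(UInt8(i), 4, 3)

# Collect the batches directly into a vector: no numbered variables, no eval.
batches = [load_batch(i) for i in 1:5]

# One concatenation at the end.
X = vcat(batches...)
```

Unlike a vector created with [], the comprehension produces a vector with a concrete element type (here Array{UInt8,2}), which should also help the compiler.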
@Tamas_Papp
you should not touch eval unless for generated code.
Understood, I will keep this in mind. I didn’t know any other way to iterate over variables in a for loop, hence I used eval.
Below is the code I am working on, where I am trying to fix the performance issue with stacking the arrays.
using PyCall
@pyimport pickle
function load_pickle_data(ROOT)
xs=[]
ys=[]
for b=1:5
f=joinpath(ROOT, "data_batch_$b")
X,Y = pickle_batch(f)
push!(xs,X)
push!(ys,Y)
end
(vcat(xs...),ys)
end
function pickle_batch(file)
# Read the whole file into bytes; open(read, file) closes the file afterwards.
bytes = open(read, file)
datadict = pickle.loads(pybytes(bytes))
X=datadict["data"]
Y=datadict["labels"]
(X,Y)
end
Don’t iteratively create variables in a loop in normal code. Use some other data structure (an array, a dictionary, etcetera).
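For instance, a dictionary keyed by batch name can replace the numbered variables. This is only a sketch: load_batch is a hypothetical placeholder for the real reader, and the keys simply reuse the data_batch_$b file names from the code above.

```julia
# Hypothetical loader; stands in for reading "data_batch_$b" from disk.
load_batch(b) = fill(Float64(b), 2, 3)

# Instead of data1, data2, ..., key the batches by name in a dictionary.
batches = Dict("data_batch_$b" => load_batch(b) for b in 1:5)

# Look one up by name, or stack them all in a fixed order.
first_batch = batches["data_batch_1"]
stacked = vcat((batches["data_batch_$b"] for b in 1:5)...)
```

A plain vector is usually enough when the batches are only ever accessed by index; the dictionary is useful when they need to be looked up by name.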