Dynamically generate DataFrame in function with number of columns determined by length of array arguments

"""
I am trying to generate a function that creates a DataFrame dynamically with number of columns determined by length of array arguments (n) to the function. 
But I failed to generate the DataFrame dynamically because I don't understand how to use a `for loop` to generate DataFrame columns.

Noticed that the dataframe is defined in lines 19 through 30 and lines 25 through 28 are the old code that did not attempt dynamic generation of the DataFrame columns. 
The attempted for loop in lines 21 through 23 does not work and something needs to replace it.
"""

{
function testf(x, y, z) # inputs are arrays measured over time periods of a year; by definition, x - y = z, or y = x - z
n = length(x)
td = today()
byr = year(td) - t # calcs the current base year

if sum(y) == 0 && mode(y) == 0 && length(z) == n # inner if for case where y is unkown and will be derived from x and z
  	y = y .+ (x .- z)

	zx = zeros(n); # placeholder z to x ratio array 
	zy = zeros(n); # placeholder z to y ratio array
	for i in 1:n
  		zx[i] = z[i]/x[i]; # filling the zx ratio array through nth element
  		zy[i] = z[i]/y[i]; # filling the zy ratio array through nth element
	end # for i loop

	ave_zx = mean(zx); ave_zy = mean(zy); 
	ave_zx_rnd = round(ave_zx,digits=6); ave_zy_rnd = round(ave_zy,digits=6);

	df1 = DataFrame([:item => ["x", "y", "z", repeat("-", 12), "z/x", "z/y", "Ave(z)/Ave(x)", "Ave(z)/Ave(y)"], # begin dynamic DF generation based on n = length of data arrays
#
		for j in 1:n
			Symbol(byr-(j-1)) => [x[j], y[j], z[j], repeat("-", 12), round(zx[j],digits=6), round(zy[j],digits=6), "n/a", "n/a"],
		end # for loop
#
#  		Symbol(byr-0) => [x[1], y[1], z[1], repeat("-", 12), round(zx1,digits=6), round(zy1,digits=6), "n/a", "n/a"],   # this is previous code used to fill df1 before for loop idea 
#  		Symbol(byr-1) => [x[2], y[2], z[2], repeat("-", 12), round(zx2,digits=6), round(zy2,digits=6), "n/a", "n/a"],   # this is previous code used to fill df1 before for loop idea 
#  		Symbol(byr-2) => [x[3], y[3], z[3], repeat("-", 12), round(zx3,digits=6), round(zy3,digits=6), "n/a", "n/a"],   # this is previous code used to fill df1 before for loop idea
#  		Symbol("Ave($(byr-(2))-$(byr))") => [ave_x_rnd, ave_y_rnd, ave_z_rnd, repeat("-", 12), ave_zx_rnd, ave_zy_rnd, ave_z_to_ave_x_rnd, ave_z_to_ave_y_rnd]   # this is previous code used to fill df1 before for loop idea
	Symbol("Ave($(byr-(j-1))-$(byr))") => [ave_x_rnd, ave_y_rnd, ave_z_rnd, repeat("-", 12), ave_zx_rnd, ave_zy_rnd, ave_z_to_ave_x_rnd, ave_z_to_ave_y_rnd]
	]) # end dynamic DF generation based on n = length of data arrays

println()
show(df1) # display the dynamic DataFrame generated by the for loop
println()

else
	println("The function testf() has errors and cannot execute.")	
end # end if sum(y) == 0...

end # end function testf()
}



"""
I'm trying to generate output like this for the case of n=3.
"""

β”‚ Row β”‚ item β”‚ 2020 β”‚ 2019 β”‚ 2018 β”‚ Ave(2018-2020) β”‚
β”‚ β”‚ Any β”‚ Any β”‚ Any β”‚ Any β”‚ Any β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1 β”‚ x β”‚ 31323718 β”‚ 14342460 β”‚ 10353288 β”‚ 1.86732e7 β”‚
β”‚ 2 β”‚ y β”‚ 30609098 β”‚ 14188281 β”‚ 10066719 β”‚ 1.8288e7 β”‚
β”‚ 3 β”‚ z β”‚ 714620.0 β”‚ 154179.0 β”‚ 286569.0 β”‚ 3.85123e5 β”‚
β”‚ 4 β”‚ ------------ β”‚ ------------ β”‚ ------------ β”‚ ------------ β”‚ ------------ β”‚
β”‚ 5 β”‚ z/x β”‚ 0.022814 β”‚ 0.01075 β”‚ 0.027679 β”‚ 0.020414 β”‚
β”‚ 6 β”‚ z/y β”‚ 0.023347 β”‚ 0.010867 β”‚ 0.028467 β”‚ 0.020893 β”‚
β”‚ 7 β”‚ Ave(z)/Ave(x) β”‚ n/a β”‚ n/a β”‚ n/a β”‚ 0.020624 β”‚
β”‚ 8 β”‚ Ave(z)/Ave(y) β”‚ n/a β”‚ n/a β”‚ n/a β”‚ 0.021059 β”‚

![image|690x206](upload://hZVoNOa4yWWfAUe4DcUkkPcZHRk.png)

sorry, formatting is not ideal here:

function testf(x, y, z) # inputs are arrays measured over time periods of a year; by definition, x - y = z, or y = x - z
n = length(x)
td = today()
byr = year(td) - t # calcs the current base year

if sum(y) == 0 && mode(y) == 0 && length(z) == n # inner if for case where y is unkown and will be derived from x and z
  	y = y .+ (x .- z)

	zx = zeros(n); # placeholder z to x ratio array 
	zy = zeros(n); # placeholder z to y ratio array
	for i in 1:n
  		zx[i] = z[i]/x[i]; # filling the zx ratio array through nth element
  		zy[i] = z[i]/y[i]; # filling the zy ratio array through nth element
	end # for i loop

	ave_zx = mean(zx); ave_zy = mean(zy); 
	ave_zx_rnd = round(ave_zx,digits=6); ave_zy_rnd = round(ave_zy,digits=6);

	df1 = DataFrame([:item => ["x", "y", "z", repeat("-", 12), "z/x", "z/y", "Ave(z)/Ave(x)", "Ave(z)/Ave(y)"], # begin dynamic DF generation based on n = length of data arrays
#
		for j in 1:n
			Symbol(byr-(j-1)) => [x[j], y[j], z[j], repeat("-", 12), round(zx[j],digits=6), round(zy[j],digits=6), "n/a", "n/a"],
		end # for loop
#
#  		Symbol(byr-0) => [x[1], y[1], z[1], repeat("-", 12), round(zx1,digits=6), round(zy1,digits=6), "n/a", "n/a"],   # this is previous code used to fill df1 before for loop idea 
#  		Symbol(byr-1) => [x[2], y[2], z[2], repeat("-", 12), round(zx2,digits=6), round(zy2,digits=6), "n/a", "n/a"],   # this is previous code used to fill df1 before for loop idea 
#  		Symbol(byr-2) => [x[3], y[3], z[3], repeat("-", 12), round(zx3,digits=6), round(zy3,digits=6), "n/a", "n/a"],   # this is previous code used to fill df1 before for loop idea
#  		Symbol("Ave($(byr-(2))-$(byr))") => [ave_x_rnd, ave_y_rnd, ave_z_rnd, repeat("-", 12), ave_zx_rnd, ave_zy_rnd, ave_z_to_ave_x_rnd, ave_z_to_ave_y_rnd]   # this is previous code used to fill df1 before for loop idea
	Symbol("Ave($(byr-(j-1))-$(byr))") => [ave_x_rnd, ave_y_rnd, ave_z_rnd, repeat("-", 12), ave_zx_rnd, ave_zy_rnd, ave_z_to_ave_x_rnd, ave_z_to_ave_y_rnd]
	]) # end dynamic DF generation based on n = length of data arrays

println()
show(df1) # display the dynamic DataFrame generated by the for loop
println()

else
	println("The function testf() has errors and cannot execute.")	
end # end if sum(y) == 0...

end # end function testf()

Lots of code there, but you can just do

df = DataFrame()

for x in my_suff
    df[!, x] = ...
end
2 Likes

This is all pretty convoluted but if I read correctly it seems to boil down to the fact that you are putting a loop inside the DataFrames constructor, which doesn’t work.

The normal way is to do what Peter says above. Alternatively, you could replace your loop with a comprehension, so instead of

DataFrame(x => some_vector, 
          for i in 1:10
              Symbol(col_$i) => f(i)
          end)

you could have

DataFrame(x => some_vector, [Symbol(col_$i) => f(i) for i in 1:10]...)

update as of 12pm PDT Wed 6/23/21

here is a fixed update based on recommendations received this morning, stripping away complexity to address the posting question:

using DataFrames

julia> x = [1_000_000, 1_200_000, 1_500_000]; y = [800_000, 900_000, 1_100_000]; z = [200_000, 300_000, 400_000];

julia> n = length(x)
3

julia> df = DataFrame([Symbol(β€œcol_$j”) => [x[j], y[j], z[j]] for j in 1:n]…)
3Γ—3 DataFrame
β”‚ Row β”‚ col_1 β”‚ col_2 β”‚ col_3 β”‚
β”‚ β”‚ Int64 β”‚ Int64 β”‚ Int64 β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1 β”‚ 1000000 β”‚ 1200000 β”‚ 1500000 β”‚
β”‚ 2 β”‚ 800000 β”‚ 900000 β”‚ 1100000 β”‚
β”‚ 3 β”‚ 200000 β”‚ 300000 β”‚ 400000 β”‚

here is a more detailed version of the fix applied to my original posting question:

using Statistics, StatsBase, DataFrames

function testf(x, y, z; t=0) # inputs are arrays measured over time periods of a year; by definition, x - y = z, or y = x - z
n = length(x)
td = today()
yr = year(td) - t # calcs the current base year

if sum(y) == 0 && mode(y) == 0 && length(z) == n # inner if for case where y is unkown and will be derived from x and z
  	y = y .+ (x .- z)

	zx = zeros(n); # placeholder z to x ratio array 
	zy = zeros(n); # placeholder z to y ratio array
	for i in 1:n
  		zx[i] = z[i]/x[i]; # filling the zx ratio array through nth element
  		zy[i] = z[i]/y[i]; # filling the zy ratio array through nth element
	end # for i loop

	ave_zx = mean(zx); ave_zy = mean(zy); 
	ave_zx_rnd = round(ave_zx,digits=6); ave_zy_rnd = round(ave_zy,digits=6);

df1 = DataFrame(
:item => [β€œx”, β€œy”, β€œz”, repeat(β€œ-”, 12), β€œz/x”, β€œz/y”],
[Symbol(β€œ$(yr-(j-1))”) => [x[j], y[j], z[j], repeat(β€œ-”, 12), round(zx[j],digits=6), round(zy[j],digits=6)] for j in 1:n]…)

println()
show(df1) # display the dynamic DataFrame generated by the splat ...
println()

else
	println("The function testf() has errors and cannot execute.")	
end # end if sum(y) == 0...

end # end function testf()

julia> testf([1_000_000, 1_200_000, 1_500_000, 1_600_000], zeros(4), [200_000, 300_000, 400_000, 500_000])

6Γ—5 DataFrame
β”‚ Row β”‚ item β”‚ 2021 β”‚ 2020 β”‚ 2019 β”‚ 2018 β”‚
β”‚ β”‚ String β”‚ Any β”‚ Any β”‚ Any β”‚ Any β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1 β”‚ x β”‚ 1000000 β”‚ 1200000 β”‚ 1500000 β”‚ 1600000 β”‚
β”‚ 2 β”‚ y β”‚ 800000.0 β”‚ 900000.0 β”‚ 1.1e6 β”‚ 1.1e6 β”‚
β”‚ 3 β”‚ z β”‚ 200000 β”‚ 300000 β”‚ 400000 β”‚ 500000 β”‚
β”‚ 4 β”‚ ------------ β”‚ ------------ β”‚ ------------ β”‚ ------------ β”‚ ------------ β”‚
β”‚ 5 β”‚ z/x β”‚ 0.2 β”‚ 0.25 β”‚ 0.266667 β”‚ 0.3125 β”‚
β”‚ 6 β”‚ z/y β”‚ 0.25 β”‚ 0.333333 β”‚ 0.363636 β”‚ 0.454545 β”‚

I’m confused. Did you attempt the solution I presented earlier in the thread? Did it nt fit your use-case for some reason?

Additionally, please make sure you are blocking your code correctly.

Your breaks --- in the data frame are a bit unusual. See PrettyTables.jl for a better solution to break up your data frame into parts while printing.

Yes, indirectly through nilshg’s comments.

The essential part I learned from both you and nilshg is:

df = DataFrame([Symbol(β€œcol_$j”) => [x[j], y[j], z[j]] for j in 1:n]…)

This is how it looks in my function that is context specific to my application (with PrettyTables.jl):

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ item          β”‚         2020 β”‚         2019 β”‚         2018 β”‚ Ave(2018-2020) β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ R             β”‚   31,323,718 β”‚   14,342,460 β”‚   10,353,288 β”‚     18,673,155 β”‚
β”‚ C             β”‚   30,609,098 β”‚   14,188,281 β”‚   10,066,719 β”‚     18,288,033 β”‚
β”‚ G             β”‚      714,620 β”‚      154,179 β”‚      286,569 β”‚        385,123 β”‚
β”‚ ------------  β”‚ ------------ β”‚ ------------ β”‚ ------------ β”‚   ------------ β”‚
β”‚ G/R           β”‚     0.022814 β”‚      0.01075 β”‚     0.027679 β”‚       0.020414 β”‚
β”‚ G/C           β”‚     0.023347 β”‚     0.010867 β”‚     0.028467 β”‚       0.020893 β”‚
β”‚ Ave(G)/Ave(R) β”‚          n/a β”‚          n/a β”‚          n/a β”‚       0.020624 β”‚
β”‚ Ave(G)/Ave(C) β”‚          n/a β”‚          n/a β”‚          n/a β”‚       0.021059 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

What I mean to say is that having --- as a value in your DataFrame column is very awkward.

  1. It means your column are of type Vector{Any} which is bad for performance
  2. It means your vectors have an awkward string "----" in the middle.

PrettyTables allows you to have your data frame not include the "---" but display it with horizontal lines for legibility.