How can I create a DataFrame with many columns programmatically?

If I want to create a DataFrame with just 2 columns I can do:

DataFrame(a=rand(Normal(0, 1), 10), b=rand(Normal(0, 1), 10))

But what if I want to create a DataFrame with hundreds or thousands of columns, each with a given name?

For example, take the following column names (just 48 for this example; use whatever container you prefer):

col_names = vec(map(Iterators.product('a':'d', 'a':'d', string.(2001:2003))) do (x, y, z)
    x * y * '_' * z
end)

For this example, each column is filled with:

rand(Normal(0, 1), 10)

I will use it later to create an example for benchmarking purposes.

You can use .=> to broadcast the Pair operator:

julia> DataFrame(col_names .=> [randn(10) for _ in eachindex(col_names)])
10×48 DataFrame. Omitted printing of 42 columns
│ Row │ aa2001    │ ba2001    │ ca2001    │ da2001    │ ab2001     │ bb2001    │
│     │ Float64   │ Float64   │ Float64   │ Float64   │ Float64    │ Float64   │
├─────┼───────────┼───────────┼───────────┼───────────┼────────────┼───────────┤
│ 1   │ 0.187453  │ 0.922989  │ -1.8337   │ 1.10598   │ 0.2527     │ -1.26025  │
│ 2   │ -0.698976 │ 0.0275297 │ -0.779797 │ 0.325134  │ 0.184392   │ -0.541654 │
│ 3   │ -0.533002 │ 0.617138  │ -0.721395 │ -1.4459   │ -0.109285  │ 0.943458  │
│ 4   │ 2.59956   │ -0.24512  │ -0.556589 │ -1.33378  │ 2.12868    │ 0.856728  │
│ 5   │ 1.99969   │ 1.69739   │ -0.374264 │ 0.269507  │ -0.604224  │ 0.612185  │
│ 6   │ -0.136302 │ 0.922046  │ 1.21671   │ 1.17714   │ 0.90012    │ -0.58445  │
│ 7   │ 0.431472  │ 1.08326   │ -1.8062   │ -1.42047  │ 0.990874   │ -2.76279  │
│ 8   │ -0.602431 │ -0.300705 │ -0.184261 │ 0.613706  │ 0.232971   │ -0.548315 │
│ 9   │ -0.437662 │ -0.808732 │ -0.714415 │ 1.16602   │ 0.00941199 │ -0.352265 │
│ 10  │ 1.10546   │ 1.38544   │ -0.329173 │ -0.765127 │ 0.605886   │ 1.60454   │

There is a DataFrames constructor for this:

julia> using DataFrames, Random

julia> names = [randstring(5) for _ ∈ 1:10]
10-element Array{String,1}:
 "dzRQt"
 "kKXW7"
 "0JSL6"
 "Ns7VN"
 "uZyLz"
 "70f0T"
 "A3PrD"
 "Od9Lz"
 "Guazy"
 "pTw48"

julia> data = randn(10, 10)
10×10 Array{Float64,2}:
  0.309686   -1.1974    -0.0187716  -0.303907    1.32897   -0.277437   1.33409    1.88879    0.603044  -1.4253
 -0.338923   -1.03677    1.01156    -1.74512    -0.87579   -0.060289   0.643243  -1.37126    0.400429   0.689121
 -0.140837    0.193948  -0.411703   -0.260852    0.789106   0.842438   0.679892   0.834983  -1.18727   -0.178523
  0.0755439   1.50667   -0.0136337  -0.462559   -0.191108  -1.10486    2.57489   -0.682026   1.65719    1.08617
  0.403895    2.62865    0.257171    0.39861     1.11401    1.30457    0.767682   0.60543    0.449838   0.354192
  0.704756    1.01318   -1.47469     0.0364399   0.906231  -1.05733    0.169764  -0.142383  -1.41441   -0.861899
  0.833152    1.14731    1.2926     -0.913615    0.957537   1.25694    0.01692   -1.75855   -0.665406  -1.43099
  0.106316    0.833295  -0.269914   -0.867696    0.763117   0.651651   0.317162  -0.882739   0.139936   0.174196
  0.53614     0.346916  -0.541661   -1.94401     0.542825   0.882737   0.240241  -1.3405    -1.46032   -0.883309
 -0.315214   -1.39484   -1.02137     1.91367     0.965089   1.52959   -1.46762    0.435068   1.80926   -0.502492

julia> DataFrame(data, names)
10×10 DataFrame
│ Row │ dzRQt     │ kKXW7    │ 0JSL6      │ Ns7VN     │ uZyLz     │ 70f0T     │ A3PrD    │ Od9Lz     │ Guazy     │ pTw48     │
│     │ Float64   │ Float64  │ Float64    │ Float64   │ Float64   │ Float64   │ Float64  │ Float64   │ Float64   │ Float64   │
├─────┼───────────┼──────────┼────────────┼───────────┼───────────┼───────────┼──────────┼───────────┼───────────┼───────────┤
│ 1   │ 0.309686  │ -1.1974  │ -0.0187716 │ -0.303907 │ 1.32897   │ -0.277437 │ 1.33409  │ 1.88879   │ 0.603044  │ -1.4253   │
│ 2   │ -0.338923 │ -1.03677 │ 1.01156    │ -1.74512  │ -0.87579  │ -0.060289 │ 0.643243 │ -1.37126  │ 0.400429  │ 0.689121  │
│ 3   │ -0.140837 │ 0.193948 │ -0.411703  │ -0.260852 │ 0.789106  │ 0.842438  │ 0.679892 │ 0.834983  │ -1.18727  │ -0.178523 │
│ 4   │ 0.0755439 │ 1.50667  │ -0.0136337 │ -0.462559 │ -0.191108 │ -1.10486  │ 2.57489  │ -0.682026 │ 1.65719   │ 1.08617   │
│ 5   │ 0.403895  │ 2.62865  │ 0.257171   │ 0.39861   │ 1.11401   │ 1.30457   │ 0.767682 │ 0.60543   │ 0.449838  │ 0.354192  │
│ 6   │ 0.704756  │ 1.01318  │ -1.47469   │ 0.0364399 │ 0.906231  │ -1.05733  │ 0.169764 │ -0.142383 │ -1.41441  │ -0.861899 │
│ 7   │ 0.833152  │ 1.14731  │ 1.2926     │ -0.913615 │ 0.957537  │ 1.25694   │ 0.01692  │ -1.75855  │ -0.665406 │ -1.43099  │
│ 8   │ 0.106316  │ 0.833295 │ -0.269914  │ -0.867696 │ 0.763117  │ 0.651651  │ 0.317162 │ -0.882739 │ 0.139936  │ 0.174196  │
│ 9   │ 0.53614   │ 0.346916 │ -0.541661  │ -1.94401  │ 0.542825  │ 0.882737  │ 0.240241 │ -1.3405   │ -1.46032  │ -0.883309 │
│ 10  │ -0.315214 │ -1.39484 │ -1.02137   │ 1.91367   │ 0.965089  │ 1.52959   │ -1.46762 │ 0.435068  │ 1.80926   │ -0.502492 │

What if I want to initialize each column with something different?
What format should the “data” in your example have? A dictionary of vectors, a tuple, a vector, or something else?

In the example above, data is a matrix, and names is a vector of Strings or a vector of Symbols.
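
If each column should hold something different, one option is to keep the columns in a vector of vectors and pair them with the names by broadcasting =>. This is only a minimal sketch: the names and values below are made up for illustration, and it assumes a DataFrames version that accepts a vector of name => vector pairs.

```julia
using DataFrames

# Hypothetical names and per-column data, chosen only for illustration
col_names = ["id", "score", "label"]
columns = [collect(1:3), [0.5, 1.5, 2.5], ["x", "y", "z"]]

# Broadcasting => pairs each name with its own vector;
# every vector must have the same length (the number of rows)
df = DataFrame(col_names .=> columns)
```

Each column keeps its own element type here, which a single matrix cannot do because all of its entries share one element type.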

You can also just add the columns by name in a loop. This will be fast:

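using DataFrames, Distributions

# Build the 48 column names, e.g. "a_a_2001"
names = map(Iterators.product('a':'d', 'a':'d', string.(2001:2003))) do (x, y, z)
    x * '_' * y * '_' * z
end |> vec

df = DataFrame()
for n in names
    # df[!, n] = v adds column n, storing v without copying it
    df[!, n] = rand(Normal(0, 1), 10)
end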