DataFrame from array of arrays

array
dataframes

#1

Hello, all. I’m trying to create a DataFrame from an array of arrays that is returned from an API, and I’m having a heck of a time getting this coded! The goal is to not have to hard-code the column names, as these may change depending on what parameters are used when calling the API.

The array that I am trying to convert to a DataFrame looks like this:

78-element Array{Any,1}:
 Any["Emp", "year", "quarter", "sex", "agegrp", "ownercode", "firmsize", "seasonadj", "industry", "state", "county"]
 Any["3410", "2017", "3", "0", "A00", "A05", "0", "U", "00", "40", "001"]                                           
 Any["915", "2017", "3", "0", "A00", "A05", "0", "U", "00", "40", "003"]                                                                                  
 ⋮                                                                                                                  
 Any["23884", "2017", "3", "0", "A00", "A05", "0", "U", "00", "40", "131"]                                          
 Any["5099", "2017", "3", "0", "A00", "A05", "0", "U", "00", "40", "133"]                                           

If I hard-code the column names, I can create the DataFrame like this:

df = DataFrame(Emp=String[], year=String[], quarter=String[], sex=String[], agegrp=String[], ownercode=String[], firmsize=String[], seasonadj=String[], industry=String[], state=String[], county=String[])
for i = 2 : length(employment_data)
    push!(df.Emp, employment_data[i][1])
    push!(df.year, employment_data[i][2])
    push!(df.quarter, employment_data[i][3])
    push!(df.sex, employment_data[i][4])
    push!(df.agegrp, employment_data[i][5])
    push!(df.ownercode, employment_data[i][6])
    push!(df.firmsize, employment_data[i][7])
    push!(df.seasonadj, employment_data[i][8])
    push!(df.industry, employment_data[i][9])
    push!(df.state, employment_data[i][10])
    push!(df.county, employment_data[i][11])
end

Aside from being ridiculously verbose, the column names here can’t be changed. It seems like there should be an easy way to loop through the first array (as it contains the column names), create a DataFrame with these column names, and then push the rest of the arrays to the DataFrame, but I cannot achieve a working solution.


#2

I would probably use an array comprehension for this. Say x is the name of your array.

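using DataFrames

# the first inner array holds the column names
namelist = Symbol.(x[1])
df = DataFrame()

for (i, name) in enumerate(namelist)
    # current DataFrames releases need df[!, name]; older ones accepted df[name] = ...
    df[!, name] = [x[j][i] for j in 2:length(x)]
end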
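
For comparison, the row-by-row approach described in the question (build empty columns from the header, then push each remaining array) could be sketched roughly as below. The three-column `employment_data` here is a made-up stand-in for the real API response, and the sketch assumes every field is a string, as in the sample output above.

```julia
using DataFrames

# Hypothetical stand-in for the API response: header first, then one array per row.
employment_data = Any[
    Any["Emp", "year", "county"],
    Any["3410", "2017", "001"],
    Any["915", "2017", "003"],
]

header = Symbol.(employment_data[1])

# One empty String column per header entry, using the DataFrame(columns, names) constructor.
df = DataFrame([String[] for _ in header], header)

# push! accepts a vector whose length matches the number of columns.
for row in employment_data[2:end]
    push!(df, row)
end
```

Because the column names come from `employment_data[1]`, nothing needs to change when the API returns a different set of fields.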

#3

There also is a constructor DataFrame(columns, names), so for example:

julia> columns = Any[rand(10), rand(10)];

julia> DataFrame(columns, [:col1, :col2])
10×2 DataFrame
│ Row │ col1     │ col2     │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.620302 │ 0.763272 │
│ 2   │ 0.591029 │ 0.335824 │
│ 3   │ 0.684387 │ 0.24118  │
│ 4   │ 0.282933 │ 0.542262 │
│ 5   │ 0.942279 │ 0.185193 │
│ 6   │ 0.35253  │ 0.500711 │
│ 7   │ 0.74824  │ 0.49447  │
│ 8   │ 0.102255 │ 0.660015 │
│ 9   │ 0.485545 │ 0.897344 │
│ 10  │ 0.12191  │ 0.43754  │


EDIT: never mind, I thought you had the columns rather than the rows, so the above makes little sense.


#4

Each array is a row here, though. So OP would need to do some reshaping to get it to work.
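
A minimal sketch of that reshaping, assuming the same layout as in the question (header first, one inner array per row). The short `x` below is invented sample data; the idea is to collect the i-th entry of every row into its own vector and hand those vectors to the `DataFrame(columns, names)` constructor.

```julia
using DataFrames

# Hypothetical sample in the same shape as the API response.
x = Any[
    Any["Emp", "year", "county"],
    Any["3410", "2017", "001"],
    Any["915", "2017", "003"],
]

header = Symbol.(x[1])
rows = x[2:end]

# Turn the rows into columns: column i holds the i-th entry of every row.
columns = [[row[i] for row in rows] for i in eachindex(header)]

df = DataFrame(columns, header)
```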


#5

This is brilliant, thank you. For future readers who may also be new to Julia (like me), I’ve added some comments to your code which I think explain what is happening (please correct if I’m wrong!):

#= convert each item in array x[1] to a Symbol by broadcasting Symbol() across the array
with dot syntax =#
namelist = Symbol.(x[1])

# construct an empty DataFrame
df = DataFrame()

#= loop through the namelist array, create a column in the DataFrame entitled namelist[i]
and assign its values by using an array comprehension to build an array with the
appropriate values, starting at the second array in array x =#
for (i, name) in enumerate(namelist)
    # current DataFrames releases need df[!, name]; older ones accepted df[name] = ...
    df[!, name] = [x[j][i] for j in 2:length(x)]
end