Excel data to dataframes

I have some data in different columns of an Excel sheet. I want to read the different columns into separate dataframes. I know the the column name/number where the data for each dataframe starts but not number of rows of data they contain. One example of the type of data is shown:
Excel_example
I want two data frames in this case. The first one containing SECTOR as a header and containing all data in column A. The second one containing INDIVIDUAL as header with all the data in column C.

How can I do this using XLSX or otherwise?

If it’s something you only need to do once, you can use ClipData.jl to copy and paste into a DataFrame easily.

If it’s something you need to do programmatically, the solution is XLSX.jl, but I don’t know much about how to work with that package. Hopefully someone else can chime in.

Thanks. I need to do it programmatically.

Could you check the following:

using XLSX, DataFrames

xf = XLSX.readxlsx(filename)
m = xf[1][:]

df = DataFrame(m[2:end,:],:auto)
rename!(df, Symbol.(m[1,:]))

df_list = []
for (h,c) in pairs(eachcol(df))
    if all(ismissing.(c))
        select!(df, Not(h))
    else
        dh = DataFrame(; h => c);
        dh[!,h] = convert.(eltype(dh[!,1]), df[:,h])
        dropmissing!(dh, h)
        push!(df_list, dh)
    end
end

It creates one single dataframe df and pushes one dataframe per non-empty column into a vector of dataframes:

Output:
julia> df
4Γ—2 DataFrame
 Row β”‚ SECTOR   INDIVIDUAL 
     β”‚ Any      Any        
─────┼─────────────────────
   1 β”‚ IT       ONE
   2 β”‚ FINANCE  TWO
   3 β”‚ missing  THREE
   4 β”‚ missing  FOUR

julia> df_list
2-element Vector{Any}:
 2Γ—1 DataFrame
 Row β”‚ SECTOR  
     β”‚ String  
─────┼─────────
   1 β”‚ IT
   2 β”‚ FINANCE
 4Γ—1 DataFrame
 Row β”‚ INDIVIDUAL 
     β”‚ String     
─────┼────────────
   1 β”‚ ONE
   2 β”‚ TWO
   3 β”‚ THREE
   4 β”‚ FOUR
1 Like

Beter to do

df = DataFrame(m[2:end,:], :auto)

What are you expecting? It works as I expected.

julia> df_list = [DataFrame(rand(2,2), :auto) for i in 1:2]
2-element Vector{DataFrame}:
 2Γ—2 DataFrame
 Row β”‚ x1         x2        
     β”‚ Float64    Float64   
─────┼──────────────────────
   1 β”‚ 0.504304   0.338425
   2 β”‚ 0.0633497  0.0394208
 2Γ—2 DataFrame
 Row β”‚ x1        x2         
     β”‚ Float64   Float64    
─────┼──────────────────────
   1 β”‚ 0.899258  0.69782
   2 β”‚ 0.899377  0.00734581

julia> push!(df_list, DataFrame(h = [1, 2, 3]))
3-element Vector{DataFrame}:
 2Γ—2 DataFrame
 Row β”‚ x1         x2        
     β”‚ Float64    Float64   
─────┼──────────────────────
   1 β”‚ 0.504304   0.338425
   2 β”‚ 0.0633497  0.0394208
 2Γ—2 DataFrame
 Row β”‚ x1        x2         
     β”‚ Float64   Float64    
─────┼──────────────────────
   1 β”‚ 0.899258  0.69782
   2 β”‚ 0.899377  0.00734581
 3Γ—1 DataFrame
 Row β”‚ h     
     β”‚ Int64 
─────┼───────
   1 β”‚     1
   2 β”‚     2
   3 β”‚     3

I see. No, you want DataFrame(; h => c). The Pair syntax lets you work with names programmatically.

1 Like

Thank you for the code. Few questions:

  1. Why do the column type is shown as Any when it is string?
  2. This code results in same number of rows in all the dataframes, so it add data with missing if number of rows in one dataframe is lower than others. The data I have has different number of rows in each column. Is it possible to create this vector of dataframe with different number of rows, ie. if missing value can be excluded?

FWIW, I don’t think your answer is particularly compelling.

There are better functions in XLSX to work with this. @mjanun I will try and make an MWE with a solution I think is more elegant soon.

Thank you for looking in to it. I will wait for your example.

@mjanun, see code edited above to meet your requirement.

NB: supposedly code β€œnot particularly compelling nor elegant”

1 Like

Here is something I think might be a bit more robust

julia> using XLSX, DataFrames;

julia> function num_trailing_missing(x)
           n = length(x)
           s = 0
           while true
               n == 0 && break
               !ismissing(x[n]) && break
               s += 1
               n -= 1
           end
           s
       end;

julia> mat = XLSX.readxlsx("testdata.xlsx")[1][:];

julia> inds = [1:1, 3:3]; # You know the columns but not rows

julia> dfs = map(inds) do is
           data = mat[2:end, is]
           nms = mat[1, is]
           df = DataFrame(data, string.(nms))
           min_num_trailing_missings = minimum(num_trailing_missing.(eachcol(df)))
           df = df[1:(end - min_num_trailing_missings), :]
           # narrow the types
           transform(df, names(df) .=> ByRow(identity); renamecols = false)
       end
2-element Vector{DataFrame}:
 2Γ—1 DataFrame
 Row β”‚ SECTOR  
     β”‚ String  
─────┼─────────
   1 β”‚ IT
   2 β”‚ FINANCE
 4Γ—1 DataFrame
 Row β”‚ INDIVIDUAL 
     β”‚ Int64      
─────┼────────────
   1 β”‚          1
   2 β”‚          2
   3 β”‚          3
   4 β”‚          4

Overall this was harder than I thought. I don’t think it’s too different from @rafael.guerra 's answer, actually. However

  1. I take advantage of the fact that you know the starting and ending indices
  2. I narrow the types of the output so they are no longer Any
  3. I drop trailing missing rather than all missing values in the data frame.
1 Like

Thank you. One last question, I see that mat = XLSX.readxlsx("testdata.xlsx")[1][:] refers to the first sheet in the spreadsheet. How can I specify the sheet name instead e.g. if I want to refer to the sheet called Data?

Take a look at the docs with ? readxlsx. You just replace 1 with the name of the sheet, as a String.

1 Like

Are there in your response?

Thanks

No, when I wrote that I thought readtable could allow for subsets of columns, but I guess it only takes in the full sheet.