Excel data to dataframes

mjanun · July 25, 2021, 1:02pm

I have some data in different columns of an Excel sheet. I want to read the different columns into separate dataframes. I know the column name/number where the data for each dataframe starts, but not the number of rows of data each one contains. One example of the type of data is shown:
[Image: Excel_example]
I want two data frames in this case. The first one containing SECTOR as a header and containing all data in column A. The second one containing INDIVIDUAL as header with all the data in column C.

How can I do this using XLSX or otherwise?

pdeffebach · July 25, 2021, 1:12pm

If it’s something you only need to do once, you can use ClipData.jl to copy and paste into a DataFrame easily.

If it’s something you need to do programmatically, the solution is XLSX.jl, but I don’t know much about how to work with that package. Hopefully someone else can chime in.

mjanun · July 25, 2021, 1:45pm

Thanks. I need to do it programmatically.

rafael.guerra · July 25, 2021, 3:51pm

Could you check the following:

using XLSX, DataFrames

xf = XLSX.readxlsx(filename)
m = xf[1][:]

df = DataFrame(m[2:end,:],:auto)
rename!(df, Symbol.(m[1,:]))

df_list = []
for (h,c) in pairs(eachcol(df))
    if all(ismissing.(c))
        select!(df, Not(h))
    else
        dh = DataFrame(; h => c);
        dh[!,h] = convert.(eltype(dh[!,1]), df[:,h])
        dropmissing!(dh, h)
        push!(df_list, dh)
    end
end

It creates one single dataframe df and pushes one dataframe per non-empty column into a vector of dataframes:

Output:

julia> df
4×2 DataFrame
 Row │ SECTOR   INDIVIDUAL 
     │ Any      Any        
─────┼─────────────────────
   1 │ IT       ONE
   2 │ FINANCE  TWO
   3 │ missing  THREE
   4 │ missing  FOUR

julia> df_list
2-element Vector{Any}:
 2×1 DataFrame
 Row │ SECTOR  
     │ String  
─────┼─────────
   1 │ IT
   2 │ FINANCE
 4×1 DataFrame
 Row │ INDIVIDUAL 
     │ String     
─────┼────────────
   1 │ ONE
   2 │ TWO
   3 │ THREE
   4 │ FOUR

pdeffebach · July 25, 2021, 3:59pm

Better to do

df = DataFrame(m[2:end,:], :auto)

pdeffebach · July 25, 2021, 4:06pm

What are you expecting? It works as I expected.

julia> df_list = [DataFrame(rand(2,2), :auto) for i in 1:2]
2-element Vector{DataFrame}:
 2×2 DataFrame
 Row │ x1         x2        
     │ Float64    Float64   
─────┼──────────────────────
   1 │ 0.504304   0.338425
   2 │ 0.0633497  0.0394208
 2×2 DataFrame
 Row │ x1        x2         
     │ Float64   Float64    
─────┼──────────────────────
   1 │ 0.899258  0.69782
   2 │ 0.899377  0.00734581

julia> push!(df_list, DataFrame(h = [1, 2, 3]))
3-element Vector{DataFrame}:
 2×2 DataFrame
 Row │ x1         x2        
     │ Float64    Float64   
─────┼──────────────────────
   1 │ 0.504304   0.338425
   2 │ 0.0633497  0.0394208
 2×2 DataFrame
 Row │ x1        x2         
     │ Float64   Float64    
─────┼──────────────────────
   1 │ 0.899258  0.69782
   2 │ 0.899377  0.00734581
 3×1 DataFrame
 Row │ h     
     │ Int64 
─────┼───────
   1 │     1
   2 │     2
   3 │     3

pdeffebach · July 25, 2021, 4:41pm

I see. No, you want DataFrame(; h => c). The Pair syntax lets you work with names programmatically.
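A minimal sketch of the difference, using a made-up column name and values (h and c here are illustrative, not the variables from the loop above):

```julia
using DataFrames

h = :SECTOR                 # column name held in a variable (illustrative)
c = ["IT", "FINANCE"]       # made-up column values

df_literal = DataFrame(h = c)      # keyword syntax: the column is literally named "h"
df_pair    = DataFrame(h => c)     # Pair syntax: the column is named by the value of h
df_kwpair  = DataFrame(; h => c)   # same result; this form needs h to be a Symbol

names(df_literal), names(df_pair), names(df_kwpair)
```

Only the keyword form ignores the value stored in h, which is why it produces a column called h.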

mjanun · July 25, 2021, 4:53pm

Thank you for the code. A few questions:

Why is the column type shown as Any when the data are strings?
This code gives every dataframe the same number of rows, so it pads the shorter columns with missing. My data has a different number of rows in each column. Is it possible to create this vector of dataframes with different numbers of rows, i.e. with the missing values excluded?

pdeffebach · July 25, 2021, 4:53pm

FWIW, I don’t think your answer is particularly compelling.

There are better functions in XLSX to work with this. @mjanun I will try and make an MWE with a solution I think is more elegant soon.

mjanun · July 25, 2021, 4:54pm

Thank you for looking into it. I will wait for your example.

rafael.guerra · July 25, 2021, 5:27pm

@mjanun, see code edited above to meet your requirement.

NB: supposedly code “not particularly compelling nor elegant”

pdeffebach · July 25, 2021, 5:29pm

Here is something I think might be a bit more robust:

julia> using XLSX, DataFrames;

julia> function num_trailing_missing(x)
           n = length(x)
           s = 0
           while true
               n == 0 && break
               !ismissing(x[n]) && break
               s += 1
               n -= 1
           end
           s
       end;

julia> mat = XLSX.readxlsx("testdata.xlsx")[1][:];

julia> inds = [1:1, 3:3]; # You know the columns but not rows

julia> dfs = map(inds) do is
           data = mat[2:end, is]
           nms = mat[1, is]
           df = DataFrame(data, string.(nms))
           min_num_trailing_missings = minimum(num_trailing_missing.(eachcol(df)))
           df = df[1:(end - min_num_trailing_missings), :]
           # narrow the types
           transform(df, names(df) .=> ByRow(identity); renamecols = false)
       end
2-element Vector{DataFrame}:
 2×1 DataFrame
 Row │ SECTOR  
     │ String  
─────┼─────────
   1 │ IT
   2 │ FINANCE
 4×1 DataFrame
 Row │ INDIVIDUAL 
     │ Int64      
─────┼────────────
   1 │          1
   2 │          2
   3 │          3
   4 │          4

Overall this was harder than I thought. I don’t think it’s too different from @rafael.guerra's answer, actually. However:

I take advantage of the fact that you know the starting and ending column indices.
I narrow the types of the output so they are no longer Any.
I drop only the trailing missing values rather than all missing values in the data frame.
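The last two points can be illustrated on a single column in plain Julia. This is only a sketch with a made-up vector, not the code from the session above; it uses findlast in place of the num_trailing_missing helper and a broadcast identity in place of ByRow(identity):

```julia
# Hypothetical column as it might come out of a worksheet matrix:
# element type Any, with blank cells read as `missing`.
col = Any["IT", missing, "FINANCE", missing, missing]

# Index of the last non-missing entry (`nothing` if the column is entirely missing).
last_filled = findlast(!ismissing, col)
trimmed = last_filled === nothing ? col[1:0] : col[1:last_filled]

# Broadcasting `identity` rebuilds the vector with the tightest element type.
narrowed = identity.(trimmed)

eltype(col), eltype(narrowed), length(narrowed)
```

The missing value in the middle of the column survives, so the narrowed element type is Union{Missing, String} rather than String. Only the trailing blanks are dropped.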

mjanun · July 25, 2021, 6:07pm

Thank you. One last question: I see that mat = XLSX.readxlsx("testdata.xlsx")[1][:] refers to the first sheet in the spreadsheet. How can I specify the sheet by name instead, e.g. if I want to refer to the sheet called Data?

pdeffebach · July 25, 2021, 6:42pm

Take a look at the docs with ? readxlsx. You just replace 1 with the name of the sheet, as a String.
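A rough sketch of what that looks like. It first writes a small throwaway workbook so it can run on its own; the file path and cell contents are made up for illustration:

```julia
using XLSX

# Build a small throwaway workbook so the example is self-contained.
path = joinpath(mktempdir(), "testdata.xlsx")
XLSX.openxlsx(path, mode="w") do xf
    sheet = xf[1]
    XLSX.rename!(sheet, "Data")   # give the first sheet the name used in the question
    sheet["A1"] = "SECTOR"
    sheet["A2"] = "IT"
    sheet["C1"] = "INDIVIDUAL"
    sheet["C2"] = "ONE"
end

xf = XLSX.readxlsx(path)
XLSX.sheetnames(xf)      # lists the sheet names in the workbook
mat = xf["Data"][:]      # same as xf[1][:] here, but selected by name
```

With a real file, only the last three lines are needed.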

rafael.guerra · July 25, 2021, 7:11pm

Are those better XLSX functions in your response?

mjanun · July 25, 2021, 7:14pm

Thanks

pdeffebach · July 25, 2021, 7:21pm

No, when I wrote that I thought readtable could allow for subsets of columns, but I guess it only takes in the full sheet.
