Read DataFrame from Excel file skipping one line with units and "cleaning-up" column names

mjmat · May 21, 2024, 7:42pm

I suspect this has been asked before, but I can’t seem to find an answer to this (apparently) simple problem.
Let’s assume I have an Excel file with m columns, and the following rows:

1st row: variable names
2nd row: variable units
rows 3 to n: data (real numbers in my case)

The only way I found to do this is to read the whole data, and then delete the row with units, but it seems a bit inefficient, and also it results in all columns in the dataframe having the data type “Any” while they should be Float.

using DataFrames, XLSX
df = DataFrame(XLSX.readtable("myfile.xlsx", "mysheet"))
df = df[2:end,:]   # Remove 1st row, which has unit strings

Is there a “cleaner” way to do it? Or, at least, is there a simple way to “tell” the dataframe that all variables are real numbers?

Additional question, let’s assume one of the column names is “var A” (with a space). I know it’s acceptable in Julia, but I really prefer when DataFrame column names can be used as symbols. I would like the name “cleaned up”, i.e. changed to e.g. var_A, so that I can do things like x = df.var_A

Thanks!

Dan · May 21, 2024, 8:23pm

Couldn’t see any beautiful way to do this, but going through the requirements, this is possible:

julia> using XLSX, DataFrames, Unitful

julia> dss = XLSX.readdata("TestTable.xlsx", "Sheet1", "A1:C4")
4×3 Matrix{Any}:
   "var A"    "var B"      "var C"
   "m"        "m/s"        "kg"
 10.2       11.2       1000.0
 50.0       60.5       2000.0

julia> DataFrame(
  uparse.(dss[2,:]) .* eachcol(dss[3:end,:]), 
  replace.(dss[1,:],Ref(' '=>'_')))
2×3 DataFrame
 Row │ var_A      var_B        var_C     
     │ Quantity…  Quantity…    Quantity… 
─────┼───────────────────────────────────
   1 │    10.2 m  11.2 m s^-1  1000.0 kg
   2 │    50.0 m  60.5 m s^-1  2000.0 kg

mjmat · May 21, 2024, 8:53pm

Thanks, using unitful does seem much better than discarding units as I proposed, but it’s also more complex - in my simple case all variables have the same units so discarding them seems OK.

For the “clean-up” part, I was hoping for something more general, similar to Matlab’s matlab.lang.makeValidName()

rocco_sprmnt21 · May 21, 2024, 9:30pm

using DataFrames, XLSX
df = DataFrame(XLSX.readtable("mh.xlsx","Foglio1";header=false,first_row=5,column_labels=vec(XLSX.readdata("mh.xlsx",1,"C3:E3"))))

#the excel table
#2#	 C	D	E
#3#  v1	v2	v3
#4#  m	kg	s
#5#  1	2	3
#6#  2	5	8
#7#  3	8	13
#8#  4	11	18
#9#  5	14	23
#10# 6	17	28

julia> df = DataFrame(XLSX.readtable("mh.xlsx","Foglio1";header=false,first_row=5,column_labels=vec(XLSX.readdata("mh.xlsx",1,"C3:E3"))))
6×3 DataFrame
 Row │ v1   v2   v3  
     │ Any  Any  Any
─────┼───────────────
   1 │ 1    2    3
   2 │ 2    5    8
   3 │ 3    8    13
   4 │ 4    11   18
   5 │ 5    14   23
   6 │ 6    17   28

Dan · May 21, 2024, 11:47pm

Since all the data are floating points and of the same unit, you might also consider using a matrix. It is semantically appropriate, and might allow a plethora of matrix based analyses.

julia> Float64.(XLSX.readdata("TestTable.xlsx", "Sheet1", "A3:C4"))
2×3 Matrix{Float64}:
 10.2  11.2  1000.0
 50.0  60.5  2000.0

Note the range in the Sheet discards the names. The column names can be read separately.

mjmat · May 22, 2024, 12:26am

Thank you, calling vec() around the call to get column names is a nice trick.
By experimenting with this, I came up with this to get the variable names in a first pass, not knowing the number of columns in advance:

colnames = collect(skipmissing(XLSX.readdata("filename.xlsx", "sheetname","A1:XFD1" )))    # XFD1 corresponds to column 16384, the maximum in Excel

An then I can skip lines to read only the data rows and use the collected names

df = DataFrame(XLSX.readtable("filename.xlsx", "sheetname", header=false, first_row=3, column_labels=colnames))   # assuming data starts on row 3

mjmat · May 22, 2024, 12:32am

Thanks, it’s true that matrices are useful. In my case there is also a time stamp, so a DataFrame is more convenient, in my first message I forgot to mention that.

rocco_sprmnt21 · May 22, 2024, 7:24am

some ideas to make the expressions valid for slightly more general situations

julia> data=XLSX.readdata("mh.xlsx",1,"A1:AAA100");

julia> fr,fc=findfirst(!ismissing, data).I
(3, 3)

#julia> lc=findlast(!ismissing, data[fr,:])
#5

julia> lc=findnext(ismissing, data[fr,:],fc)-1
5

julia> mh=XLSX.readdata.("mh.xlsx",1,Base.splat(XLSX.CellRef).(Base.product(fr:fr+1,fc:lc)))
2×3 Matrix{String}:
 "v1"  "v2"  "v3"
 "m"   "kg"  "s"

julia> colnames=join.(eachcol(mh),'[').*']'
3-element Vector{String}:
 "v1[m]"
 "v2[kg]"
 "v3[s]"

julia> df = DataFrame(XLSX.readtable("mh.xlsx","Foglio1";header=false,first_row=fr+2,column_labels=colnames))
6×3 DataFrame
 Row │ v1[m]  v2[kg]  v3[s] 
     │ Any    Any     Any
─────┼──────────────────────
   1 │ 1      2       3
   2 │ 2      5       8
   3 │ 3      8       13
   4 │ 4      11      18
   5 │ 5      14      23
   6 │ 6      17      28

nilshg · May 22, 2024, 9:02am

using CSV; rename!(df, CSV.normalizenames(names(df)))

(would maybe be a good PR to XLSX.jl to have this there as well)

nilshg · May 22, 2024, 9:08am

I’ve made and issue here.

Nathan_Boyer · May 22, 2024, 2:57pm

Just so you know, you can still use dot syntax with string names:

julia> x = df."Name with Spaces"
5-element Vector{Int64}:
 1
 2
 3
 4
 5

mjmat · August 24, 2024, 6:14pm

It comes a bit late, but thanks, I didn’t know that and it’s really cool.

Topic		Replies	Views
Remove spaces and units in Dataframe header? New to Julia dataframes	6	1798	September 4, 2020
CSV, DataFrame read data file with string and Float64 columns New to Julia dataframes	3	75	September 3, 2024
Excel data to dataframes Data data , dataframes , xlsx , excel	16	4909	July 25, 2021
How to convert string column of dataframe to float with missing values? New to Julia	2	2251	January 12, 2021
How to clean/filter/remove wrong data from DataFrame New to Julia question , dataframes	13	677	February 21, 2022

Read DataFrame from Excel file skipping one line with units and "cleaning-up" column names

Related topics