I suspect this has been asked before, but I can’t seem to find an answer to this (apparently) simple problem.
Let’s assume I have an Excel file with m columns, and the following rows:
1st row: variable names
2nd row: variable units
rows 3 to n: data (real numbers in my case)
The only way I found to do this is to read the whole data, and then delete the row with units, but it seems a bit inefficient, and also it results in all columns in the dataframe having the data type “Any” while they should be Float.
using DataFrames, XLSX
df = DataFrame(XLSX.readtable("myfile.xlsx", "mysheet"))
df = df[2:end,:] # Remove 1st row, which has unit strings
Is there a “cleaner” way to do it? Or, at least, is there a simple way to “tell” the dataframe that all variables are real numbers?
Additional question, let’s assume one of the column names is “var A” (with a space). I know it’s acceptable in Julia, but I really prefer when DataFrame column names can be used as symbols. I would like the name “cleaned up”, i.e. changed to e.g. var_A, so that I can do things like x = df.var_A
Thanks, using unitful does seem much better than discarding units as I proposed, but it’s also more complex - in my simple case all variables have the same units so discarding them seems OK.
For the “clean-up” part, I was hoping for something more general, similar to Matlab’s matlab.lang.makeValidName()
Since all the data are floating points and of the same unit, you might also consider using a matrix. It is semantically appropriate, and might allow a plethora of matrix based analyses.
Thank you, calling vec() around the call to get column names is a nice trick.
By experimenting with this, I came up with this to get the variable names in a first pass, not knowing the number of columns in advance:
colnames = collect(skipmissing(XLSX.readdata("filename.xlsx", "sheetname","A1:XFD1" ))) # XFD1 corresponds to column 16384, the maximum in Excel
An then I can skip lines to read only the data rows and use the collected names
df = DataFrame(XLSX.readtable("filename.xlsx", "sheetname", header=false, first_row=3, column_labels=colnames)) # assuming data starts on row 3
Thanks, it’s true that matrices are useful. In my case there is also a time stamp, so a DataFrame is more convenient, in my first message I forgot to mention that.