How to convert string column of dataframe to float with missing values?

I have a (very simplified below) xlsx file like:

Start             | End               | Q1   | Q2   | Q3
10 Jan 2021 15:00 | 10 Jan 2021 17:00 | 5.50 | 2.70 | -
10 Jan 2021 15:03 | 10 Jan 2021 16:53 | 2.70 | 3.00 | 1.50

In fact, I have several more columns like Q1, Q2, Q3,… and there are around 300 rows. I would like to use Julia to make an analysis of this dataset, which arises from amultiple choice examination given online and downloaded from a Moodle server.

I have been able to read it via XLSX and then transform it to a dataframe. The columns are all recognized, to begin with, as strings. However, I would like to:
(1) transform the columns Q1, Q2, … to float numbers (where obvious), with the corresponding “-” strings transformed to missing values (not strings).
I do not want the “-” string to be replaced by, for instance, the float 0.00, because I would like to keep the record as to which questions were left unanswered (of course, the corresponding grade will, effectively, be considered 0.0, for the calculation of the final total grade)…

Later I will also want to
(2) transform the columns Start and End to proper Date/Time objects, possibly from the Dates package (I have never used it) and then, by subtracting End from Start to have a new column Duration for each test/examination.

I tried the first issue (1) quoted above without success and I will then try the second one later on. Right now, I am particularly concerned about issue (1).

Thanks for any help!

You can use passmissing from Missings.jl to skip over missings with a function, something like

julia> x = ["1.0", "4.5", missing];

julia> passmissing(parse).(Float64, x);

If you have

x = ["1.5", "-"]

you can first do replace(x, "-" => missing) to get missing values.

1 Like

if you convert the xlsx file to csv first, a lot of this can be done using CSV

using CSV, DataFrames, Dates

a = CSV.File("answers.csv", missingstring="-", types=Dict([1,2].=>DateTime),
    dateformat=DateFormat("d u Y H:M"))
DataFrame(a)
Start End Q1 Q2 Q3
DateTime DateTime Float64 Float64 Float64?
2021-01-10T15:00:00 2021-01-10T17:00:00 5.5 2.7 missing
2021-01-10T15:03:00 2021-01-10T16:53:00 2.7 3.0 1.5

See ?CSV.File and ?DateFormat for more info
2 Likes