How to convert string column of dataframe to float with missing values?

mocalvao · January 12, 2021, 10:17pm

I have a (very simplified below) xlsx file like:

Start             | End               | Q1   | Q2   | Q3
10 Jan 2021 15:00 | 10 Jan 2021 17:00 | 5.50 | 2.70 | -
10 Jan 2021 15:03 | 10 Jan 2021 16:53 | 2.70 | 3.00 | 1.50

In fact, I have several more columns like Q1, Q2, Q3,… and there are around 300 rows. I would like to use Julia to make an analysis of this dataset, which arises from amultiple choice examination given online and downloaded from a Moodle server.

I have been able to read it via XLSX and then transform it to a dataframe. The columns are all recognized, to begin with, as strings. However, I would like to:
(1) transform the columns Q1, Q2, … to float numbers (where obvious), with the corresponding “-” strings transformed to missing values (not strings).
I do not want the “-” string to be replaced by, for instance, the float 0.00, because I would like to keep the record as to which questions were left unanswered (of course, the corresponding grade will, effectively, be considered 0.0, for the calculation of the final total grade)…

Later I will also want to
(2) transform the columns Start and End to proper Date/Time objects, possibly from the Dates package (I have never used it) and then, by subtracting End from Start to have a new column Duration for each test/examination.

I tried the first issue (1) quoted above without success and I will then try the second one later on. Right now, I am particularly concerned about issue (1).

Thanks for any help!

pdeffebach · January 12, 2021, 11:07pm

You can use passmissing from Missings.jl to skip over missings with a function, something like

julia> x = ["1.0", "4.5", missing];

julia> passmissing(parse).(Float64, x);

If you have

x = ["1.5", "-"]

you can first do replace(x, "-" => missing) to get missing values.

Mattriks · January 12, 2021, 11:14pm

if you convert the xlsx file to csv first, a lot of this can be done using CSV

using CSV, DataFrames, Dates

a = CSV.File("answers.csv", missingstring="-", types=Dict([1,2].=>DateTime),
    dateformat=DateFormat("d u Y H:M"))
DataFrame(a)

Start	End	Q1	Q2	Q3
DateTime	DateTime	Float64	Float64	Float64?
2021-01-10T15:00:00	2021-01-10T17:00:00	5.5	2.7	missing
2021-01-10T15:03:00	2021-01-10T16:53:00	2.7	3.0	1.5

See ?CSV.File and ?DateFormat for more info

Topic		Replies	Views
Replace missing values based on column data type General Usage package , plotting , strings , dataframes , missing-values	7	929	February 10, 2023
Csv and dataframes and missing values New to Julia dataframes , csv	9	2178	June 12, 2021
Replace Character in a column New to Julia dataframes	4	1103	August 24, 2019
Not able convert string to missing New to Julia	2	808	April 26, 2022
I have a DataFrame with multiple columns of type Union{Missing, String}. What is the most concise manner of converting the non-missing values in Float? General Usage	2	586	January 29, 2021

How to convert string column of dataframe to float with missing values?

Related topics