Dealing with missing values in DataFrame (again)

I am wondering if there has been progress on the usability of the missing interface.
I am trying to write most of my R commands into Julia step by step to build a guide that would have interested people switch over easily.

However I am stuck just before I could get started.

I am working with standard dataset widely used in academic research in finance and economics, the crsp stock files (the daily stock file here)

using DataFrames, Pipe, CSV
df_dsf = CSV.File("./dsf.csv") |> DataFrame();
df_dsf[1:4, [:PERMNO, :date, :RET]]

This returns the dataset of interest:

4Γ—3 DataFrame
β”‚ Row β”‚ PERMNO β”‚ date     β”‚ RET       β”‚
β”‚     β”‚ Int64  β”‚ Int64    β”‚ String?   β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 10000  β”‚ 19860106 β”‚ missing   β”‚
β”‚ 2   β”‚ 10000  β”‚ 19860107 β”‚ C         β”‚
β”‚ 3   β”‚ 10000  β”‚ 19860108 β”‚ -0.024390 β”‚
β”‚ 4   β”‚ 10000  β”‚ 19860109 β”‚ 0.000000  β”‚

As you can see RET (stock returns) are a mix of strings (delisting flags for example), missing values, and floats.
Typically we parse this dataset to convert all to numeric and default to missing. For those familiar with R, a data.table syntax would be:

df_dsf[, ret_num := as.numeric(RET) ]

and as.numeric takes care of defaulting both missing and strings to missing (NA).

I am only diving into DataFrame so excuse the syntax but here is my approximation:

df_dsf = @pipe df_dsf |> 
    transform(_, :RET => ByRow(passmissing(x->tryparse(Float64, x))) => :ret_num);
replace!(df_dsf.ret_num , nothing => missing);

The first command is barely readable and assumes long time scavenging for information on all of its three functions (ByRow, passmissing and tryparse).
And on top of this because we cannot default to missing in parse (I saw a bunch of activity on GitHub around this very question), we need to add another command to convert all of the nothing (from parsing strings) ex-post.

I do not mean to be critical. I am here to learn. I hope we have or find easier way of expressing such simple data transformation using DataFrame.
This is maybe an extreme example (I have found other parts of DataFrame to work super well), but it is also the first data transformation lots of academics in finance are doing; it is typical and frequent if not fully representative.

Thanks for your help.

See this issue:
https://github.com/JuliaData/DataFrames.jl/issues/2314
That won’t address the tryparse annoyance, but note you can do something(tryparse(Float64, x), missing) instead calling replace! after the fact.

2 Likes

Unclear exactly what you want. I would just set it to missing right after reading in the file.

df_dsf = replace(df_dsf.RET,"C"=>missing) .|> identity

The call to identity will fix the column type to Union{Missing,Float64}.

Edit: And because there are like 8 missing codes (A,B,C,D,E,S,T,P), you can just generalize the replacement condition to cover all of them.

replace(v -> v isa String ? missing : v,df_dsf.RET) .|> identity

Thank you for the help.

Honestly, my personal problem is a little bit of a sideshow here. I just want to raise awareness to some of the problems when dealing with missing even at a very basic level.

I think the issue linked by @nalimilan captures parts of the experience I am trying to share. AFAIK this is still an open issue. I just want to make sure we try to find a resolution before we rush into freezing DataFrames.

I’ll add that CSV.File also accepts some keyword arguments related to parsing of missing values when reading files. This case would likely be covered by the missingstrings kwarg.

2 Likes

I’ve been playing around with this example and I do think there are some syntax things that will make things better without getting too much in the weeds with regards to missing values.

  1. DataFramesMeta to get rid of the anonymous function
  2. Using DataFramesMeta to hopefully make ByRow easier
  3. Hopefully using DataFramesMeta to work with literals instead of symbols
  4. Using passmissing(tryparse)
df_dsf = @pipe df_dsf |>
    @transform(_, ret_num = @row passmissing(tryparse)(Float64, RET)
    @transform(_, ret_num = @row isnothing(ret_num) ? missing : ret_num)

This is pretty good, and we haven’t worked with missing values yet.

Next, with regards to missing values we could add an option for tryparse to return a default value.

df_dsf = @pipe df_dsf |>
    @transform(_, ret_num = @row passmissing(tryparse)(Float64, RET, missing)

This looks pretty good to me, but we haven’t even touched base DataFrames missing handling. It would be helpful if you could comment on the issue discussion missings here about what specific ways your common problems would be most helped by the proposed chnages.

EDIT: not to mention the CSV.jl missingstring option, which would probably eliminate errant strings from your data frame in the first place.

EDIT: Also I’ve filed an issue for tryparse here.

1 Like