Select observations in a DataFrame from variable with some missing observations

using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame)

d = d[(d.sizerank .<= 50), :]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
Stacktrace:
 [1] checkindex(#unused#::Type{Bool}, inds::Base.OneTo{Int64}, i::Missing)
   @ Base .\abstractarray.jl:671
 [2] checkindex
   @ .\abstractarray.jl:686 [inlined]
 [3] getindex(df::DataFrame, row_inds::Vector{Union{Missing, Bool}}, #unused#::Colon)
   @ DataFrames ~\.julia\packages\DataFrames\zXEKU\src\dataframe\dataframe.jl:458
 [4] top-level scope
   @ REPL[30]:1

The variable sizerank is either type Int64 or missing.
Hence d.sizerank .<= 50 doesnโ€™t work well here.

I hoped the option missingstring="NA" would solve this, but d.sizerank .<= "50" does not correspond to the integer values <50.

using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame, missingstring="NA")
d = d[(d.sizerank .<= "50"), :] 

Hopefully there is an easy way to do this that Iโ€™m missingโ€ฆ

d[coalesce(d.sizerank .<= "50", false), :] uses false in place of missing for the filter.

Thanks but my problem is that

Ah, CSV.read has got the wrong type for that field. It needs to be Int.

d = CSV.read(r.body, DataFrame, missingstring="NA", types=Dict(:sizerank => Int))

Though I notice there are some errors when reading.

โ”Œ Warning: thread = 1 warning: error parsing Int64 around row = 4625, col = 8: ",", error=INVALID: DELIMITED 

Make sure you understand the structure of the csv so you can parse it correctly.

The following hack works, but it ainโ€™t great

using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame, missingstring="NA")
d.sizerank[d.sizerank .== ""] .= "1"
d.sizerank = parse.(Int, d.sizerank)
d = df[(d.sizerank .<= 50), :] 

This is a job for subset in the dev version of DataFrames

using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame)
subset(d, :sizerank => ByRow(<=(50)), skipmissing = true)
1 Like

Thanks! The package conflict breaks my current pkg environment.
Iโ€™ll try it as soon as DataFrames is updated

In the current version you can use filter with coalesce

using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame)
filter(:sizerank => x -> coalesce(x <= 50, false), d)
1 Like

DataFramesMetaโ€™s @where will drop missings by default.

โ€ฆ but it will be deprecated in favor of @subset and the missings behavior will probably be the same as subset.

If you have problems with the type you could also use something like this

emptytomissing(x) = isempty(x) ? missing : parse(Int,x)
select(df, :sizerank => ByRow(emptytomissing) => :newname)

Can also do passmissing(parse)(Int, x)

You can probably just pass CSV.read(..., missingstring="").

I downloaded the lasted DataFrames, but no luck

julia> using CSV, HTTP, DataFrames;

julia> link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";

julia> r = HTTP.get(link);

julia> d = CSV.read(r.body, DataFrame)
109593ร—9 DataFrame
    Row โ”‚ loc                Date       HPI      HPGyy      HPGqq       HPGmm       id      sizerank  i_treated 
        โ”‚ String             String     Float64  Float64    Float64     Float64     String  Int64?    Int64     
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
      1 โ”‚ Bancroft_and_Area  1/1/2010     131.1   5.64062    7.10785    14.2982     Canada   missing          0
      2 โ”‚ Bancroft_and_Area  2/1/2010     131.6   4.77708    9.84975     0.381388   Canada   missing          0
   โ‹ฎ    โ”‚         โ‹ฎ              โ‹ฎ         โ‹ฎ         โ‹ฎ          โ‹ฎ           โ‹ฎ         โ‹ฎ        โ‹ฎ          โ‹ฎ
 109592 โ”‚ Zapata, TX         2/1/2020   91218.0   0.865815  -1.21828    -0.508267   US           927          0
 109593 โ”‚ Zapata, TX         3/1/2020   90920.0   0.188432  -1.47271    -0.32669    US           927          0
                                                                                             109589 rows omitted

julia> unique(d.sizerank) |> sort
850-element Vector{Union{Missing, Int64}}:
   0
   1
   2
   3
   โ‹ฎ
 930
 932
 933
    missing

julia> subset(d, :sizerank => ByRow(<=(50)), skipmissing = true)
6273ร—9 DataFrame
  Row โ”‚ loc             Date       HPI       HPGyy      HPGqq      HPGmm       id      sizerank  i_treated 
      โ”‚ String          String     Float64   Float64    Float64    Float64     String  Int64?    Int64     
โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
    1 โ”‚ Atlanta, GA     1/1/2010   157645.0  -10.4162   -1.74821   -0.270761   US             9          0
    2 โ”‚ Atlanta, GA     2/1/2010   157070.0   -9.39036  -1.38192   -0.364744   US             9          0
  โ‹ฎ   โ”‚       โ‹ฎ             โ‹ฎ         โ‹ฎ          โ‹ฎ          โ‹ฎ          โ‹ฎ         โ‹ฎ        โ‹ฎ          โ‹ฎ
 6272 โ”‚ Washington, DC  2/1/2020   438354.0    2.97397   1.01905    0.182606   US             7          0
 6273 โ”‚ Washington, DC  3/1/2020   439386.0    2.9441    0.904356   0.235426   US             7          0
                                                                                          6269 rows omitted

julia> unique(d.sizerank) |> sort
850-element Vector{Union{Missing, Int64}}:
   0
   1
   2
   3
   โ‹ฎ
 930
 932
 933
    missing

you want subset!. subset (no !) returns a new DataFrame.

1 Like

Yes! This is why I should try to understand code before copy & pasting from discourse.
Either of the following (mutating or not) works

subset!(d, :sizerank => ByRow(<=(50)), skipmissing = true)
d = subset(d, :sizerank => ByRow(<=(50)), skipmissing = true)

How about defining a custom function for your first code:

sel123(x) = !ismissing(x) ? x<=50 : false
d[sel123.(d.sizerank),:]
1 Like