using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame)
d = d[(d.sizerank .<= 50), :]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
Stacktrace:
[1] checkindex(#unused#::Type{Bool}, inds::Base.OneTo{Int64}, i::Missing)
@ Base .\abstractarray.jl:671
[2] checkindex
@ .\abstractarray.jl:686 [inlined]
[3] getindex(df::DataFrame, row_inds::Vector{Union{Missing, Bool}}, #unused#::Colon)
@ DataFrames ~\.julia\packages\DataFrames\zXEKU\src\dataframe\dataframe.jl:458
[4] top-level scope
@ REPL[30]:1
The variable sizerank
is either type Int64
or missing
.
Hence d.sizerank .<= 50
doesnโt work well here.
I hoped the option missingstring="NA"
would solve this, but d.sizerank .<= "50"
does not correspond to the integer values <50.
using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame, missingstring="NA")
d = d[(d.sizerank .<= "50"), :]
Hopefully there is an easy way to do this that Iโm missingโฆ
jzr
April 16, 2021, 7:00pm
2
d[coalesce(d.sizerank .<= "50", false), :]
uses false
in place of missing
for the filter.
Thanks but my problem is that
jzr
April 16, 2021, 7:08pm
4
Ah, CSV.read has got the wrong type for that field. It needs to be Int
.
d = CSV.read(r.body, DataFrame, missingstring="NA", types=Dict(:sizerank => Int))
Though I notice there are some errors when reading.
โ Warning: thread = 1 warning: error parsing Int64 around row = 4625, col = 8: ",", error=INVALID: DELIMITED
Make sure you understand the structure of the csv so you can parse it correctly.
The following hack works, but it ainโt great
using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame, missingstring="NA")
d.sizerank[d.sizerank .== ""] .= "1"
d.sizerank = parse.(Int, d.sizerank)
d = df[(d.sizerank .<= 50), :]
This is a job for subset
in the dev
version of DataFrames
using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame)
subset(d, :sizerank => ByRow(<=(50)), skipmissing = true)
1 Like
Thanks! The package conflict breaks my current pkg environment.
Iโll try it as soon as DataFrames is updated
In the current version you can use filter
with coalesce
using CSV, HTTP, DataFrames
link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
r = HTTP.get(link)
d = CSV.read(r.body, DataFrame)
filter(:sizerank => x -> coalesce(x <= 50, false), d)
1 Like
DataFramesMetaโs @where
will drop missing
s by default.
โฆ but it will be deprecated in favor of @subset
and the missing
s behavior will probably be the same as subset
.
If you have problems with the type you could also use something like this
emptytomissing(x) = isempty(x) ? missing : parse(Int,x)
select(df, :sizerank => ByRow(emptytomissing) => :newname)
Can also do passmissing(parse)(Int, x)
jzr
April 16, 2021, 8:32pm
13
You can probably just pass CSV.read(..., missingstring="")
.
I downloaded the lasted DataFrames, but no luck
julia> using CSV, HTTP, DataFrames;
julia> link = "https://raw.githubusercontent.com/azev77/azev77/main/HP_m.csv";
julia> r = HTTP.get(link);
julia> d = CSV.read(r.body, DataFrame)
109593ร9 DataFrame
Row โ loc Date HPI HPGyy HPGqq HPGmm id sizerank i_treated
โ String String Float64 Float64 Float64 Float64 String Int64? Int64
โโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ Bancroft_and_Area 1/1/2010 131.1 5.64062 7.10785 14.2982 Canada missing 0
2 โ Bancroft_and_Area 2/1/2010 131.6 4.77708 9.84975 0.381388 Canada missing 0
โฎ โ โฎ โฎ โฎ โฎ โฎ โฎ โฎ โฎ โฎ
109592 โ Zapata, TX 2/1/2020 91218.0 0.865815 -1.21828 -0.508267 US 927 0
109593 โ Zapata, TX 3/1/2020 90920.0 0.188432 -1.47271 -0.32669 US 927 0
109589 rows omitted
julia> unique(d.sizerank) |> sort
850-element Vector{Union{Missing, Int64}}:
0
1
2
3
โฎ
930
932
933
missing
julia> subset(d, :sizerank => ByRow(<=(50)), skipmissing = true)
6273ร9 DataFrame
Row โ loc Date HPI HPGyy HPGqq HPGmm id sizerank i_treated
โ String String Float64 Float64 Float64 Float64 String Int64? Int64
โโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ Atlanta, GA 1/1/2010 157645.0 -10.4162 -1.74821 -0.270761 US 9 0
2 โ Atlanta, GA 2/1/2010 157070.0 -9.39036 -1.38192 -0.364744 US 9 0
โฎ โ โฎ โฎ โฎ โฎ โฎ โฎ โฎ โฎ โฎ
6272 โ Washington, DC 2/1/2020 438354.0 2.97397 1.01905 0.182606 US 7 0
6273 โ Washington, DC 3/1/2020 439386.0 2.9441 0.904356 0.235426 US 7 0
6269 rows omitted
julia> unique(d.sizerank) |> sort
850-element Vector{Union{Missing, Int64}}:
0
1
2
3
โฎ
930
932
933
missing
you want subset!
. subset
(no !
) returns a new DataFrame.
1 Like
Yes! This is why I should try to understand code before copy & pasting from discourse.
Either of the following (mutating or not) works
subset!(d, :sizerank => ByRow(<=(50)), skipmissing = true)
d = subset(d, :sizerank => ByRow(<=(50)), skipmissing = true)
How about defining a custom function for your first code:
sel123(x) = !ismissing(x) ? x<=50 : false
d[sel123.(d.sizerank),:]
1 Like