How to test missing value in CSV package?

Fred · December 5, 2017, 3:40pm

Hi,

I am trying to test whether there is a missing value at a specific position of a DataFrame obtained with the CSV package, but I could not find a function to do that. Thank you for your help.

julia> x = CSV.read("data/data.csv"; delim = '\t', null="NA")
4×5 DataFrames.DataFrame
│ Row │ id │ s1       │ s2       │ s3       │ s4       │
├─────┼────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ g1 │ 0.134978 │ 0.231912 │ 0.479582 │ 0.134978 │
│ 2   │ g2 │ 0.972158 │ 0.437821 │ missing  │ 0.848548 │
│ 3   │ g3 │ 0.152925 │ missing  │ 0.848548 │ 0.152925 │
│ 4   │ g4 │ 0.813864 │ 0.972158 │ 0.917429 │ 0.813864 │

julia> x[4][2] == missing
ERROR: UndefVarError: missing not defined

julia> isnan(x[4][2])
missing

julia> isnull(x[4][2])
false

julia> eltype(x[4][2])
Any

julia> isnumber(x[4][2])
ERROR: MethodError: no method matching isnumber(::Missings.Missing)
Closest candidates are:
  isnumber(::Char) at strings/utf8proc.jl:268
  isnumber(::AbstractString) at deprecated.jl:56

ExpandingMan · December 5, 2017, 3:47pm

You can simply do

ismissing(x[2, 4])  # row 2, column 4; or
ismissing(x[2, :s3])

Fred · December 5, 2017, 3:50pm

Thank you @ExpandingMan, but in Julia v0.6.1 this function does not exist. Do I have to load a package other than CSV?

julia> ismissing(x[4, 2])
ERROR: UndefVarError: ismissing not defined

nalimilan · December 5, 2017, 3:52pm

You need using Missings.

ExpandingMan · December 5, 2017, 3:52pm

ismissing ought to be exported from DataFrames. If that’s not happening, you are probably using an out-of-date version of DataFrames. (What do you get when you do Pkg.status("DataFrames")? It should say 0.11.1).

Unfortunately, in its current form the package manager is rather problematic, and sometimes something weird happens that causes it to refuse to update. If this is the case, it should at least tell you what’s causing the problem when you do Pkg.update("DataFrames"). In the worst-case scenario you can use git to pull the updated version manually.

@nalimilan, right now Missings seems to be re-exported from DataFrames. I do not need using Missings on DataFrames 0.11.1.

Fred · December 5, 2017, 3:56pm

Thanks again @ExpandingMan

I did not load the DataFrames package, to save some precompilation time, but if I load it the function ismissing() is available. So it is strongly advised to load the DataFrames package when using CSV…

julia> using DataFrames

julia> ismissing(x[2, 4])
true

Fred · December 5, 2017, 3:59pm

Thanks @nalimilan

Loading the Missings package is indeed enough to get ismissing()

julia> using Missings

julia> using CSV

julia> x = CSV.read("data/data.csv"; delim = '\t', null="NA")
4×5 DataFrames.DataFrame
│ Row │ id │ s1       │ s2       │ s3       │ s4       │
├─────┼────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ g1 │ 0.134978 │ 0.231912 │ 0.479582 │ 0.134978 │
│ 2   │ g2 │ 0.972158 │ 0.437821 │ missing  │ 0.848548 │
│ 3   │ g3 │ 0.152925 │ missing  │ 0.848548 │ 0.152925 │
│ 4   │ g4 │ 0.813864 │ 0.972158 │ 0.917429 │ 0.813864 │

julia> ismissing(x[2, 4])
true

ExpandingMan · December 5, 2017, 4:00pm

Yes, it’s typically a good idea to load whatever packages you are planning to work with. Note that DataFrames pre-compiles, so the compilation time should only be noticeable the first time you do using DataFrames. It should load quite fast on all subsequent imports.

Note also that since DataFrames re-exports Missings, if you do using DataFrames you do not need using Missings.
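
For readers on a more recent Julia release, here is a minimal sketch of the element-level test. It assumes Julia 1.0 or later, where missing and ismissing are available without loading any package. The vector s3 is a hypothetical stand-in for the s3 column of the table above, not the actual DataFrame.

```julia
# Hypothetical stand-in for column s3 of the table above (assumes Julia >= 1.0).
s3 = [0.479582, missing, 0.848548, 0.917429]

# ismissing always returns a Bool, so it is safe to use in an `if`.
second_is_missing = ismissing(s3[2])
first_is_missing = ismissing(s3[1])

# Comparing with == propagates missing instead of returning true or false.
comparison = (s3[2] == missing)

# === tests identity and also returns a Bool.
identical = (s3[2] === missing)

println(second_is_missing)  # true
println(first_is_missing)   # false
println(comparison)         # missing
println(identical)          # true
```

The third result is why x == missing is not a usable test: the comparison itself evaluates to missing, which cannot be used as a condition.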

orhan_abar · December 7, 2018, 5:00pm

I am trying to check whether the Term column has a missing value in my data frame, but ismissing gives the wrong output.

df_courses = CSV.File(file_path) |> DataFrame
println(df_courses)
println(df_courses[Symbol("Term")])        
println(ismissing(df_courses[Symbol("Term")]))

2×11 DataFrame
│ Row │ Course ID │ Course Name │ Prefix  │ Number │ Prerequisites │ Corequisites │ Strict-Corequisites │ Credit Hours │ Institution │ Cononical Name   │ Term    │
│     │ Int64⍰    │ String⍰     │ String⍰ │ Int64⍰ │ Int64⍰        │ Missing      │ Missing             │ Int64⍰       │ String⍰     │ String⍰          │ Int64⍰  │
├─────┼───────────┼─────────────┼─────────┼────────┼───────────────┼──────────────┼─────────────────────┼──────────────┼─────────────┼──────────────────┼─────────┤
│ 1   │ 1         │ Course 1    │ C       │ 1      │ missing       │ missing      │ missing             │ 3            │ UKY         │ Course Example 1 │ 1       │
│ 2   │ 2         │ Course 2    │ C       │ 2      │ 1             │ missing      │ missing             │ 2            │ UKY         │ Course Example 2 │ missing │
Union{Missing, Int64}[1, missing]
false

alejandromerchan · December 7, 2018, 5:44pm

You’re calling ismissing on the whole column “Term”. The column is a vector, and the vector itself is not missing, so the result is false. If you want to test for the presence of missing values in a column, you can do

sum(ismissing.(df_courses[Symbol("Term")]))

That would give you the number of missing values.
You can also do:

sum(ismissing.(df_courses[Symbol("Term")])) != 0

and that will tell you if you have missing values in your column.

I don’t know if there’s a better way of doing this, but it works.
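
To make the difference concrete, here is a small sketch. It uses a plain vector named term as a hypothetical stand-in for the Term column and assumes Julia 1.0 or later, so no packages are needed. Calling ismissing on the vector asks whether the vector itself is missing, while the dotted form asks the question of each element.

```julia
# Hypothetical stand-in for the Term column (assumes Julia >= 1.0).
term = Union{Missing, Int64}[1, missing]

# The vector itself is not `missing`, so this is false.
whole = ismissing(term)

# Broadcasting applies ismissing to each element: one Bool per entry.
per_element = ismissing.(term)

# count takes a predicate and counts matches without building a temporary array.
n_missing = count(ismissing, term)

println(whole)        # false
println(per_element)  # Bool[0, 1]
println(n_missing)    # 1
```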

nalimilan · December 7, 2018, 5:53pm

any(ismissing, df_courses.Term) should be faster, and much faster if there is a missing value at the beginning of the column.
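
A rough way to see the short-circuiting is to count how many times the predicate runs. In the sketch below, col and counted_ismissing are hypothetical names introduced only for illustration, and the code assumes Julia 1.0 or later.

```julia
# Hypothetical column with a missing value in the first position.
col = Union{Missing, Int64}[missing, 2, 3, 4, 5]

# Hypothetical wrapper around ismissing that records how often it is called.
calls = Ref(0)
counted_ismissing(v) = (calls[] += 1; ismissing(v))

# any stops at the first element for which the predicate returns true.
has_missing = any(counted_ismissing, col)
calls_any = calls[]

# The broadcast version evaluates every element before summing.
calls[] = 0
total = sum(counted_ismissing.(col))
calls_sum = calls[]

println((has_missing, calls_any))  # (true, 1)
println((total, calls_sum))        # (1, 5)
```

The broadcast form also allocates an intermediate Boolean array, which any(ismissing, col) avoids.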

orhan_abar · December 7, 2018, 6:01pm

Thank you for your fast reply @alejandromerchan and @nalimilan.
@alejandromerchan I think my main issue was that I was missing the dot after ismissing.
@nalimilan Yes, using any will be the better option for me.
