How to test for a missing value in the CSV package?

Hi,

I am trying to test whether there is a missing value at a specific position of a DataFrame obtained with the CSV package, but I could not find a function to do that. Thank you for your help.

julia> x = CSV.read("data/data.csv"; delim = '\t', null="NA")
4×5 DataFrames.DataFrame
│ Row │ id │ s1       │ s2       │ s3       │ s4       │
├─────┼────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ g1 │ 0.134978 │ 0.231912 │ 0.479582 │ 0.134978 │
│ 2   │ g2 │ 0.972158 │ 0.437821 │ missing  │ 0.848548 │
│ 3   │ g3 │ 0.152925 │ missing  │ 0.848548 │ 0.152925 │
│ 4   │ g4 │ 0.813864 │ 0.972158 │ 0.917429 │ 0.813864 │

julia> x[4][2] == missing
ERROR: UndefVarError: missing not defined

julia> isnan(x[4][2])
missing

julia> isnull(x[4][2])
false

julia> eltype(x[4][2])
Any

julia> isnumber(x[4][2])
ERROR: MethodError: no method matching isnumber(::Missings.Missing)
Closest candidates are:
  isnumber(::Char) at strings/utf8proc.jl:268
  isnumber(::AbstractString) at deprecated.jl:56

You can simply do

ismissing(x[2, 4])  # row 2, column 4, or
ismissing(x[2, :s3])
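For anyone reading this later, here is a minimal sketch of why ismissing is the test to use rather than ==. It assumes Julia 1.0 or later, where missing and ismissing are part of Base and no package is needed; the vector s3 is a hypothetical stand-in for the s3 column in the table above.

```julia
# Hypothetical stand-in for column s3 of the table above (Julia >= 1.0 assumed).
s3 = [0.479582, missing, 0.848548, 0.917429]

# ismissing always returns a Bool.
println(ismissing(s3[2]))   # true
println(ismissing(s3[1]))   # false

# Comparing with == propagates missing instead of returning true or false,
# so the result cannot be used directly as an `if` condition.
println(s3[2] == missing)   # missing

# === tests identity and does return a Bool.
println(s3[2] === missing)  # true
```

The same calls should work on an element of a DataFrame column, since indexing a single cell just returns the stored value.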

Thank you @ExpandingMan, but in Julia v0.6.1 this function does not exist. Do I have to load a package other than CSV?

julia> ismissing(x[4, 2])
ERROR: UndefVarError: ismissing not defined

You need to run using Missings first.

ismissing ought to be exported from DataFrames. If that’s not happening, you are probably using an out-of-date version of DataFrames. (What do you get when you do Pkg.status("DataFrames")? It should say 0.11.1).

Unfortunately, in its current form the package manager is rather problematic: sometimes something weird happens that causes it to refuse to update. If this is the case, it should at least tell you what’s causing the problem when you do Pkg.update("DataFrames"). In the worst-case scenario you can use git to pull the updated version manually.

@nalimilan, right now Missings seems to be re-exported from DataFrames. I do not need using Missings on DataFrames 0.11.1.

Thanks again @ExpandingMan

I did not load the DataFrames package, to save some precompilation time, but if I load it the function ismissing() is available. So it is strongly advised to load the DataFrames package when using CSV…

julia> using DataFrames

julia> ismissing(x[2, 4])
true

Thanks @nalimilan

Loading the Missings package is indeed enough to get ismissing():

julia> using Missings

julia> using CSV

julia> x = CSV.read("data/data.csv"; delim = '\t', null="NA")
4×5 DataFrames.DataFrame
│ Row │ id │ s1       │ s2       │ s3       │ s4       │
├─────┼────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ g1 │ 0.134978 │ 0.231912 │ 0.479582 │ 0.134978 │
│ 2   │ g2 │ 0.972158 │ 0.437821 │ missing  │ 0.848548 │
│ 3   │ g3 │ 0.152925 │ missing  │ 0.848548 │ 0.152925 │
│ 4   │ g4 │ 0.813864 │ 0.972158 │ 0.917429 │ 0.813864 │

julia> ismissing(x[2, 4])
true

Yes, it’s typically a good idea to load whatever packages you are planning to work with. Note that DataFrames pre-compiles, so the compilation time should only be noticeable the first time you do using DataFrames. It should load quite fast on all subsequent imports.

Note also that since DataFrames re-exports Missings, if you do using DataFrames you do not need using Missings.

I am trying to check whether the Term column of my data frame has a missing value, but ismissing gives the wrong output.

df_courses = CSV.File(file_path) |> DataFrame
println(df_courses)
println(df_courses[Symbol("Term")])        
println(ismissing(df_courses[Symbol("Term")]))
2×11 DataFrame
│ Row │ Course ID │ Course Name │ Prefix  │ Number │ Prerequisites │ Corequisites │ Strict-Corequisites │ Credit Hours │ Institution │ Cononical Name   │ Term    │
│     │ Int64⍰    │ String⍰     │ String⍰ │ Int64⍰ │ Int64⍰        │ Missing      │ Missing             │ Int64⍰       │ String⍰     │ String⍰          │ Int64⍰  │
├─────┼───────────┼─────────────┼─────────┼────────┼───────────────┼──────────────┼─────────────────────┼──────────────┼─────────────┼──────────────────┼─────────┤
│ 1   │ 1         │ Course 1    │ C       │ 1      │ missing       │ missing      │ missing             │ 3            │ UKY         │ Course Example 1 │ 1       │
│ 2   │ 2         │ Course 2    │ C       │ 2      │ 1             │ missing      │ missing             │ 2            │ UKY         │ Course Example 2 │ missing │
Union{Missing, Int64}[1, missing]
false

You’re testing whether the whole column “Term” is itself missing, which it is not: the column is a vector, so ismissing returns false. If you want to test for the presence of missing values in a column, you can do

sum(ismissing.(df_courses[Symbol("Term")]))

That would give you the number of missing values.
You can also do:

sum(ismissing.(df_courses[Symbol("Term")])) != 0

and that will tell you if you have missing values in your column.

I don’t know if there’s a better way of doing this, but it works.
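To make the difference concrete, here is a small sketch that runs without CSV or DataFrames. It assumes Julia 1.0 or later (missing in Base), and the vector term is a hypothetical stand-in for df_courses[Symbol("Term")]. The count call is an alternative I am adding for comparison, not something from the posts above.

```julia
# Hypothetical stand-in for the Term column (Julia >= 1.0 assumed).
term = Union{Missing, Int64}[1, missing]

# Applied to the vector itself: the vector is not `missing`, so this is false.
whole = ismissing(term)

# Broadcast with the dot: one Bool per element.
mask = ismissing.(term)

# Summing the mask counts the missing entries.
n_missing = sum(mask)

# count does the same without building the intermediate mask.
n_missing_alt = count(ismissing, term)

println(whole)          # false
println(n_missing)      # 1
println(n_missing_alt)  # 1
println(n_missing != 0) # true
```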


any(ismissing, df_courses.Term) should be faster, and much faster if there is a missing value at the beginning of the column.

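A rough way to see the short-circuiting is sketched below, assuming Julia 1.0 or later. The column col and the helper counting_ismissing are hypothetical and exist only to count how many elements get inspected.

```julia
# Hypothetical column with a missing value right at the start (Julia >= 1.0 assumed).
col = Union{Missing, Int}[i == 1 ? missing : i for i in 1:1_000_000]

# Counter used only to show how many elements the predicate visits.
calls = Ref(0)
counting_ismissing(v) = (calls[] += 1; ismissing(v))

# any stops at the first element for which the predicate returns true.
found = any(counting_ismissing, col)
println(found)     # true
println(calls[])   # 1

# The broadcast-and-sum version has to visit every element.
calls[] = 0
n = sum(counting_ismissing.(col))
println(n)         # 1
println(calls[])   # 1000000
```

If the only missing value is at the end of the column, both versions scan everything, but any still avoids allocating the temporary Bool array.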

Thank you for your fast replies, @alejandromerchan and @nalimilan.
@alejandromerchan I think my main issue was that I was missing the dot after ismissing.
@nalimilan Yes, using any will be the better option for me.