Missing values in character or string

I’m trying to read values out of a NetCDF file. Unfortunately, the missing values seem to be mistakenly stored as characters or strings by the data creator.

Here is my code to read data out of the nc file:

using NCDatasets
ds1 = NCDataset("path/xxx.nc")
A_nc = ds1["A"][:, :, :, :]
close(ds1)

size(A_nc) = (360, 180, 28, 192)
typeof(A_nc) = Array{Union{Missing, Float64}, 4}

Here is the error message:
┌ Warning: variable ‘A’ has a numeric type but the corresponding missing_value (-999.9) is a character or string. Comparing, e.g. an integer and a string (1 == “1”) will always evaluate to false. See the function NCDatasets.cfvariable how to manually override the missing_value attribute.
@ CommonDataModel ~/.julia/packages/CommonDataModel/pO4st/src/cfvariable.jl:122

What is a good solution to address this issue?

I tried the below:
DIC_nc[DIC_nc .== '-999.9'] .= NaN;
But get this new error:
ERROR: LoadError: syntax: character literal contains multiple characters

It’s interesting that when I tried to open the NetCDF file in Matlab, the class function shows the value is ‘double’, instead of character or string as claimed by Julia.

It seems like you just got a warning and no errors when obtaining A_nc.

In Julia, the missing values seem to have been replaced by missing. Use ismissing to locate them.

julia> A_nc = [0.1 0.2; 0.3 missing; 0.4 0.5]
3×2 Matrix{Union{Missing, Float64}}:
 0.1  0.2
 0.3   missing
 0.4  0.5

julia> A_nc[ismissing.(A_nc)] .= NaN
1-element view(reshape(::Matrix{Union{Missing, Float64}}, 6), [5]) with eltype Union{Missing, Float64}:
 NaN

julia> A_nc
3×2 Matrix{Union{Missing, Float64}}:
 0.1    0.2
 0.3  NaN
 0.4    0.5

The second error you encountered concerns the invalid syntax '-999.0'. In Julia, single quotes delimit a single character (a Char); strings, which may contain multiple characters, use double quotes.

julia> '-999.0'
ERROR: syntax: character literal contains multiple characters
Stacktrace:
 [1] top-level scope
   @ none:1

julia> "-999.0"
"-999.0"

julia> '8'
'8': ASCII/Unicode U+0038 (category Nd: Number, decimal digit)

The original error is really an issue with the NetCDF file itself.

Many thanks for the reply.

Unfortunately, it is not working. Although it shows
typeof(A_nc) = Array{Union{Missing, Float64}, 4},
when I use ‘@show’ to display the values, I see a bunch of -999.9 instead of missing.
As a result, the script below does nothing:
A_nc[ismissing.(A_nc)] .= NaN

The script below is indeed able to replace the -999.9s in the array with NaNs:
A_nc[A_nc .== -999.9] .= NaN
but the ‘typeof’ result remains the same:
typeof(A_nc) = Array{Union{Missing, Float64}, 4}

Below is the result of ‘@show’ for the array, after I replace -999.9 with NaNs:

A_nc[:, 50, 1, 1] = Union{Missing, Float64}[2059.642853474127, 2060.3117549314975, 2060.9158207445385, NaN, NaN, NaN, NaN, 2047.648519975586]

I was able to find a solution online:

using MappedArrays
A_nc = of_eltype(Float64, A_nc);

Please see also: Feature request: convert between Array{Union{T, Missing}, N} and Array{T, N} without copying · Issue #26681 · JuliaLang/julia · GitHub
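On why typeof stays the same: assigning NaN into the array changes its elements, not the container's element type, so it remains Union{Missing, Float64}. Below is a minimal sketch with a small made-up vector (the values are illustrative, not taken from the file), assuming no missing entries remain when converting:

```julia
# Hypothetical stand-in for a slice of A_nc; the values are made up.
x = Union{Missing, Float64}[2059.6, -999.9, 2047.6]

# In-place replacement changes the values but not the element type.
x[x .== -999.9] .= NaN
@show eltype(x)

# Converting builds a new, plain Float64 array (a copy).
# This throws an error if any element is still `missing`.
y = convert(Vector{Float64}, x)
@show eltype(y)
```

Unlike of_eltype, which wraps the array lazily, convert allocates a new array, so it trades memory for a plain Array{Float64}.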

You might like my package MissingsAsFalse.jl which provides a convenient syntax for missing comparisons. See here.

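Also consider isequal.(x, -999.9) instead of ==, since isequal always returns true or false, whereas == returns missing when an element is missing.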
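A small sketch of the difference, using a made-up vector rather than the real data. With ==, a missing element makes the comparison result missing, so the mask is no longer purely Bool; isequal yields a Bool for every element:

```julia
# Made-up example data containing both a sentinel value and a missing.
x = Union{Missing, Float64}[1.0, -999.9, missing]

# `==` propagates `missing`: the third entry of this mask is `missing`.
mask_eq = x .== -999.9

# `isequal` always returns a Bool; `missing` simply compares as false here.
mask_iseq = isequal.(x, -999.9)

# The Bool mask can be used safely for logical indexing.
x[mask_iseq] .= NaN
```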

Just found an easier way to solve this issue:
A_nc = nomissing(A_nc, NaN);
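For readers without NCDatasets loaded, a rough Base-only counterpart is coalesce.(A, NaN), which also produces a plain Float64 array. The sketch below uses made-up values and, as an extra step not present in the one-liner above, also maps the -999.9 sentinel to NaN:

```julia
# Made-up 2×2 array with one sentinel value and one missing.
A = Union{Missing, Float64}[0.1 -999.9; missing 0.4]

# replace: turn the sentinel into NaN (element type is still a Union here).
# coalesce.: turn each `missing` into NaN, which narrows the result to Float64.
B = coalesce.(replace(A, -999.9 => NaN), NaN)

@show typeof(B)
```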

Using missings in float arrays is a bad idea IMO.

I think it all would have worked for you if you had used GMT.

G = gmtread("xxx.nc")

Sadly, my institution requires us to use missing, on the grounds that NaN is not numerical and thus not supported by all programs.

Sorry??? NaNs are numeric. And that’s probably what was in the original data. But of course, do as you need.

It’s the other way around… NaN will be supported everywhere with basically the same semantics. missing is julia-specific and very useful, but maybe not in this case…

Thank you, Joaquim and Pdeffebach.

It’s interesting to hear about this. How specifically are NaNs stored as numerical values?

NaNs are bit patterns that, by convention, are recognized as 32- or 64-bit floating-point numbers (there are several NaNs). The point is that in memory they take exactly the same space as any other number of their type (I mean Float32 or Float64). Julia missings (which I don’t know what they really are), on the other hand, are not floating-point numbers, so when NaNs are replaced with missing, a copy of the array has to be made, as all numbers are no longer storable contiguously in memory. I have no idea how Julia manages the Union{Missing, Float}, but there are many posts in the forum mentioning how the presence of missings instead of NaNs makes processing way slower. That is why I said using missings in float arrays is a bad idea.

Demonstration.

julia> bitstring(0.0)
"0000000000000000000000000000000000000000000000000000000000000000"

julia> bitstring(1.0)
"0011111111110000000000000000000000000000000000000000000000000000"

julia> bitstring(2.0)
"0100000000000000000000000000000000000000000000000000000000000000"

julia> bitstring(0.1)
"0011111110111001100110011001100110011001100110011001100110011010"

julia> bitstring(NaN)
"0111111111111000000000000000000000000000000000000000000000000000"

julia> bitstring(-NaN)
"1111111111111000000000000000000000000000000000000000000000000000"
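
Extending the demonstration to the remark that there are several NaNs: any Float64 bit pattern with all exponent bits set and a nonzero fraction is a NaN. The sketch below builds one such pattern by hand (the specific hex value is just an illustrative choice) and checks that it is still an ordinary 8-byte Float64:

```julia
# A hand-picked bit pattern: exponent bits all ones, nonzero fraction.
other_nan = reinterpret(Float64, 0x7ff8000000000001)

@show isnan(other_nan)                           # it is a NaN
@show bitstring(other_nan) == bitstring(NaN)     # but not the same bits as NaN
@show sizeof(other_nan)                          # same size as any Float64
@show other_nan isa Float64
```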

bitstring(missing)

ERROR: ArgumentError: Missing not a primitive type
Stacktrace:
[1] bitstring(x::Missing)
@ Base ./intfuncs.jl:842
[2] top-level scope
@ REPL[1]:1

Yes. That’s exactly the point.

missing and NaN have different meanings and different purposes. As missing is not a number, (or a primitive type), bitstring is not defined for it.

missings are useful for “Don’t Know” survey responses, or when no value is applicable. Advice to “never” use missing is misplaced because often missing semantics are exactly what you want.

In your context, though, with a NetCDF data set, you definitely want to use NaN.
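
To make the difference in semantics concrete, here is a short Base-only sketch with made-up vectors. Both values propagate through arithmetic, but NaN is a Float64 while missing is its own type, and each has its own way of being skipped:

```julia
# Made-up data: the same gap marked two different ways.
v_nan  = [1.0, NaN, 3.0]          # Vector{Float64}
v_miss = [1.0, missing, 3.0]      # Vector{Union{Missing, Float64}}

# Both markers propagate through a plain sum.
s_nan  = sum(v_nan)               # NaN
s_miss = sum(v_miss)              # missing

# Each has its own idiom for ignoring the gaps.
clean_nan  = sum(filter(!isnan, v_nan))
clean_miss = sum(skipmissing(v_miss))

# Type-level difference: NaN is a number, missing is not.
@show NaN isa Float64
@show missing isa Number
```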

Besides the nomissing function, there is now also the keyword argument maskingvalue = NaN, which can be set per dataset or per variable to use a different value to mark missing data.

In your case, you can directly get an array of Float64s.

The warning:

┌ Warning: variable ‘A’ has a numeric type but the corresponding missing_value (-999.9) is a character or string. Comparing, e.g. an integer and a string (1 == “1”) will always evaluate to false. See the function NCDatasets.cfvariable how to manually override the missing_value attribute.

is triggered when a NetCDF file includes the string “-999.9” rather than the floating-point number -999.9 for the missing_value attribute (as the warning says).
See this issue for context:

It would be good to contact the author of the dataset. In my tests (in 2022), Python’s netCDF4 also failed to load such files.
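
The comparison problem the warning describes can be reproduced in plain Julia. In this sketch, attr is a hypothetical stand-in for the attribute as stored in the file, not something read through NCDatasets:

```julia
# Hypothetical stand-in for the missing_value attribute stored as a string.
attr = "-999.9"
value = -999.9

# A Float64 never equals a String, so this comparison is always false.
same_raw = value == attr

# Once the string is parsed to a number, the comparison behaves as intended.
same_parsed = value == parse(Float64, attr)

@show same_raw same_parsed
```

This is why the -999.9 entries are left in the array instead of being turned into missing: the stored string never matches any numeric element.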

I haven’t read the rest of the discussion (yet) and I don’t know the array, but I would say that the error lies in the attempt to “force” a string into one (only) character

DIC_nc[DIC_nc .== "-999.9"] .= NaN;

Might work (Not tested)