CSV, white spaces and data type. Force Float on empty fields

Hi, I have a csv file where either the column names or the column themselves have spaces like

#         date          ,    lat   ,    lon   , smajax , sminax , strike, q ,
1904-04-04 10:02:34.56 ,   41.802 ,   23.108 ,    8.6 ,    6.6 , 164.2 , B , 

Regarding the column names I was able to workaround the problem by using the option normalizenames=true but that does not work to remove the blanks on the columns. Another, more important problem, is that empty fields turn what would be Float64 columns into strings. See the smaxj column below.

julia> isc = CSV.File("isc-gem-cat.csv", header=93, normalizenames=true, limit=35) |> DataFrame
35×31 DataFrame
 Row │ _date                     lat      lon       smajax    sminax    strike   q       depth    unc      q_1     mw       unc_1    q_2     s        ⋯
     │ String                    Float64  Float64   String    String    String   String  Float64  Float64  String  Float64  Float64  String  String   ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │  1904-04-04 10:02:34.56    41.802    23.108      8.6       6.6    164.2    B         15.0      4.8   C         6.84     0.4    C       d       ⋯
   2 │  1904-04-04 10:26:00.88    41.758    23.249      8.3       6.9     15.2    B         15.0      4.8   C         7.02     0.4    C       d
   3 │  1904-06-25 14:45:39.14    51.424   161.638     33.6      18.7    116.2    C         15.0     25.0   C         7.5      0.4    C       d
   4 │  1904-06-25 21:00:38.72    52.763   160.277     28.6      14.6     43.1    C         30.0     10.3   C         7.7      0.4    C       d
   5 │  1904-12-20 05:44:20.44     8.962   -84.042     26.3      13.2     74.1    C         10.0     14.0   C         7.29     0.2    C       p       ⋯
   6 │  1905-02-14 08:46:34.49    51.89   -176.702     30.3      19.5    157.5    C         35.0     24.6   C         7.59     0.26   C       p
   7 │  1905-02-17 11:41:07.82    23.689    97.17      27.1      18.9    143.7    C         15.0     25.0   C         7.26     0.37   C       p
   8 │  1905-02-19 04:34:55.80   -14.123   169.457     54.7      12.1      3.4    C         15.0     25.0   C         7.33     0.54   C       p
   9 │  1905-03-04 15:59:53.26    -6.639   151.225     82.1      33.3    161.3    C         15.0     12.6   C         6.96     0.42   C       p       ⋯
  10 │  1905-03-04 23:17:22.03    -4.635   149.105     22.4      18.4    155.1    C         15.0     25.0   C         7.01     0.6    C       p
  11 │  1905-03-22 03:38:44.30    51.145  -179.859     18.0      13.4     91.5    C         35.0     11.8   C         7.18     0.32   C       p
  12 │  1905-04-03 16:01:54.20    41.706    21.369     24.5      12.1     99.5    C         15.0      8.5   C         5.44     0.35   C       p
  ⋮  │            ⋮                 ⋮        ⋮         ⋮         ⋮         ⋮       ⋮        ⋮        ⋮       ⋮        ⋮        ⋮       ⋮       ⋮      ⋱
  25 │  1905-06-30 17:07:35.01   -19.266   169.615     72.0      22.9    177.8    C         15.0     58.0   C         6.9      0.43   C       p       ⋯
  26 │  1905-07-06 16:20:46.93    38.417   144.119     31.5      13.9    131.3    C         15.0     25.0   C         7.67     0.2    C       p
  27 │  1905-07-09 09:40:39.79    49.709    98.483     19.9      11.2     28.4    C         15.0     25.0   C         7.95     0.2    B       d
  28 │  1905-07-11 08:39:39.56    50.24     98.5       29.1      16.5     18.1    C         15.0     25.0   C         6.9      0.32   C       p
  29 │  1905-07-14 22:02:03.35    49.77     99.124     23.7      14.3     55.1    C         15.0     25.0   C         6.86     0.52   C       p       ⋯
  30 │  1905-07-17 00:23:24.07   -23.039   171.862    193.6      21.3    160.1    C         35.0     12.8   C         6.72     0.29   B       p
  31 │  1905-07-23 02:46:24.01    49.292    96.843     26.1      14.5     19.7    B         15.0     25.0   C         8.33     0.2    B       d
  32 │  1905-07-23 20:35:04.13    49.35     96.215     17.8      12.8    173.0    C         10.0     25.0   C         6.19     0.67   C       p
  33 │  1905-08-04 05:08:57.25    41.743    19.925     10.3       8.9    180.0    C         15.0      7.2   C         6.19     0.2    C       p       ⋯
  34 │  1905-09-08 01:43:02.41    38.636    15.784                                C         15.0      6.7   C         7.19     0.2    B       p
  35 │  1905-09-14 19:41:40.64    50.462  -179.492     41.2      26.3    110.8    C         15.0     25.0   C         6.71     0.45   C       p

How can I force those columns to remain Floats ans have NaN instead of missing in the empty cells? I can do the following (after knowing the column name where it happens) but it inserts a missing and I need floats (C does not know what is a missing).

julia> isc = CSV.File("isc-gem-cat.csv", header=93, normalizenames=true, limit=35, types=Dict(:smajax=>Float64)) |> DataFrame

the other problem I referred above is the spaces in the column strings. e.g.

julia> isc[!,7]
35-element PooledArrays.PooledVector{String, UInt32, Vector{UInt32}}:
 " B "
 " B "
 " C "
 " C "

I understand they come from the fact that in the file that column is written as “, B ,” but those blanks should be stripped by default.

You can specify column types and even the date format. I use CSV in combination with DataFrames to rename the columns where col_types is an array of data types and col_names is an array of strings. I don’t remember if this method results in Missing or NaN being used.

data = DataFrame!(CSV.File(source,
		datarow=3,
		dateformat=Dates.DateFormat("yyyy-mm-dd HH:MM:SS.sss"),
		delim=',',
		header=col_names,
		types=col_types))

Yes, but one need to know in advance which columns need to be forced, and then there are the (tons in the real case) of warnings. The conversion to float should be automatic and would be nice to have an option replace empties by NaN instead of missing.
Also the behavior of leaving spaces around the string elements looks buggy to me.

And thanks for the DateFormat one. I was also struggling with it too.

julia> isc = CSV.File("isc-gem-cat.csv", header=93, normalizenames=true, limit=35, types=Dict(:smajax=>Float64)) |> DataFrame
┌ Warning: thread = 1 warning: error parsing Float64 around row = 127, col = 4: "        ,", error=INVALID: DELIMITED
└ @ CSV C:\Users\joaqu\.julia\packages\CSV\la2cd\src\file.jl:606
35×31 DataFrame
 Row │ _date                     lat      lon       smajax     sminax    strike   q       depth    unc      q_1     mw       unc_1    q_2     s       ⋯
     │ String                    Float64  Float64   Float64?   String    String   String  Float64  Float64  String  Float64  Float64  String  String  ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │  1904-04-04 10:02:34.56    41.802    23.108        8.6      6.6    164.2    B         15.0      4.8   C         6.84     0.4    C       d      ⋯
   2 │  1904-04-04 10:26:00.88    41.758    23.249        8.3      6.9     15.2    B         15.0      4.8   C         7.02     0.4    C       d
   3 │  1904-06-25 14:45:39.14    51.424   161.638       33.6     18.7    116.2    C         15.0     25.0   C         7.5      0.4    C       d
   4 │  1904-06-25 21:00:38.72    52.763   160.277       28.6     14.6     43.1    C         30.0     10.3   C         7.7      0.4    C       d
   5 │  1904-12-20 05:44:20.44     8.962   -84.042       26.3     13.2     74.1    C         10.0     14.0   C         7.29     0.2    C       p      ⋯
   6 │  1905-02-14 08:46:34.49    51.89   -176.702       30.3     19.5    157.5    C         35.0     24.6   C         7.59     0.26   C       p
   7 │  1905-02-17 11:41:07.82    23.689    97.17        27.1     18.9    143.7    C         15.0     25.0   C         7.26     0.37   C       p
   8 │  1905-02-19 04:34:55.80   -14.123   169.457       54.7     12.1      3.4    C         15.0     25.0   C         7.33     0.54   C       p
   9 │  1905-03-04 15:59:53.26    -6.639   151.225       82.1     33.3    161.3    C         15.0     12.6   C         6.96     0.42   C       p      ⋯
  10 │  1905-03-04 23:17:22.03    -4.635   149.105       22.4     18.4    155.1    C         15.0     25.0   C         7.01     0.6    C       p
  11 │  1905-03-22 03:38:44.30    51.145  -179.859       18.0     13.4     91.5    C         35.0     11.8   C         7.18     0.32   C       p
  12 │  1905-04-03 16:01:54.20    41.706    21.369       24.5     12.1     99.5    C         15.0      8.5   C         5.44     0.35   C       p
  ⋮  │            ⋮                 ⋮        ⋮          ⋮         ⋮         ⋮       ⋮        ⋮        ⋮       ⋮        ⋮        ⋮       ⋮       ⋮     ⋱
  25 │  1905-06-30 17:07:35.01   -19.266   169.615       72.0     22.9    177.8    C         15.0     58.0   C         6.9      0.43   C       p      ⋯
  26 │  1905-07-06 16:20:46.93    38.417   144.119       31.5     13.9    131.3    C         15.0     25.0   C         7.67     0.2    C       p
  27 │  1905-07-09 09:40:39.79    49.709    98.483       19.9     11.2     28.4    C         15.0     25.0   C         7.95     0.2    B       d
  28 │  1905-07-11 08:39:39.56    50.24     98.5         29.1     16.5     18.1    C         15.0     25.0   C         6.9      0.32   C       p
  29 │  1905-07-14 22:02:03.35    49.77     99.124       23.7     14.3     55.1    C         15.0     25.0   C         6.86     0.52   C       p      ⋯
  30 │  1905-07-17 00:23:24.07   -23.039   171.862      193.6     21.3    160.1    C         35.0     12.8   C         6.72     0.29   B       p
  31 │  1905-07-23 02:46:24.01    49.292    96.843       26.1     14.5     19.7    B         15.0     25.0   C         8.33     0.2    B       d
  32 │  1905-07-23 20:35:04.13    49.35     96.215       17.8     12.8    173.0    C         10.0     25.0   C         6.19     0.67   C       p
  33 │  1905-08-04 05:08:57.25    41.743    19.925       10.3      8.9    180.0    C         15.0      7.2   C         6.19     0.2    C       p      ⋯
  34 │  1905-09-08 01:43:02.41    38.636    15.784  missing                        C         15.0      6.7   C         7.19     0.2    B       p
  35 │  1905-09-14 19:41:40.64    50.462  -179.492       41.2     26.3    110.8    C         15.0     25.0   C         6.71     0.45   C       p

Have you tried missingstring = " " (or more whitespaces)?

Thanks, but doesn’t help. The number of spaces doesn’t seem constant in the file, which leads failures to find them all but above all, the data type remains String

julia> isc = CSV.File("isc-gem-cat.csv", header=93, normalizenames=true, limit=5, missingstring="        ", datarow=125) |> DataFrame
5×31 DataFrame
 Row │ _date                     lat         lon         smajax    sminax    strike   q       depth     unc      q_1     mw      unc_1   q_2     s    ⋯
     │ String                    String      String      String?   String?   String   String  String    String   String  String  String  String  Stri ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │  1905-07-23 20:35:04.13      49.350      96.215      17.8      12.8    173.0    C         10.0     25.0    C       6.19    0.67    C       p   ⋯
   2 │  1905-08-04 05:08:57.25      41.743      19.925      10.3       8.9    180.0    C         15.0      7.2    C       6.19    0.20    C       p
   3 │  1905-09-08 01:43:02.41      38.636      15.784   missing   missing             C         15.0      6.7    C       7.19    0.20    B       p
   4 │  1905-09-14 19:41:40.64      50.462    -179.492      41.2      26.3    110.8    C         15.0     25.0    C       6.71    0.45    C       p
   5 │  1905-09-15 06:02:20.96      55.751     163.952      31.3      23.2     75.4    C         15.0     12.1    C       7.84    0.46    C       p   ⋯

What strings do yo actually end up with (i.e. what’s in isc[isnosthing(tryparse(Float64, isc.smajax)), smajax])? Can you do missingstrings = [" ", " ", " "] or whatever is required to find them all?

smajax, sminax, strike and several others alike should be floats. Empty cells should not turn the entire column to string. Doing it seems a bug to me. Matlab does it right, see

I’m trying to replicate that in Julia.

I understand that. Empty cells do not turn columns to string in CSV.jl, provided that the missingstring is set/detected correctly:

julia> using CSV, DataFrames

julia> CSV.read(IOBuffer("a,b\n1.0,2.0\n,4.0\n"), DataFrame)
2×2 DataFrame
 Row │ a          b       
     │ Float64?   Float64 
─────┼────────────────────
   1 │       1.0      2.0
   2 │ missing        4.0

julia> CSV.read(IOBuffer("a,b\n1.0,2.0\n    ,4.0\n"), DataFrame, missingstring = "    ")
2×2 DataFrame
 Row │ a          b       
     │ Float64?   Float64 
─────┼────────────────────
   1 │       1.0      2.0
   2 │ missing        4.0

doing isc[isnosthing(tryparse(Float64, isc.smajax)), smajax] should show you where the parsing failed - which could just be different whitespace lengths that you haven’t specified, or something entirely different.

I am sorry but there are cases where it does.

julia> isc = CSV.File("isc-gem-cat.csv", header=93, normalizenames=true, limit=5, missingstring="        ", datarow=125) |> DataFrame
5×31 DataFrame
 Row │ _date                     lat         lon         smajax    sminax    strike   q       depth     unc      q_1     mw      unc_1   q_2     s    ⋯
     │ String                    String      String      String?   String?   String   String  String    String   String  String  String  String  Stri ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │  1905-07-23 20:35:04.13      49.350      96.215      17.8      12.8    173.0    C         10.0     25.0    C       6.19    0.67    C       p   ⋯
   2 │  1905-08-04 05:08:57.25      41.743      19.925      10.3       8.9    180.0    C         15.0      7.2    C       6.19    0.20    C       p
   3 │  1905-09-08 01:43:02.41      38.636      15.784   missing   missing             C         15.0      6.7    C       7.19    0.20    B       p
   4 │  1905-09-14 19:41:40.64      50.462    -179.492      41.2      26.3    110.8    C         15.0     25.0    C       6.71    0.45    C       p
   5 │  1905-09-15 06:02:20.96      55.751     163.952      31.3      23.2     75.4    C         15.0     12.1    C       7.84    0.46    C       p   ⋯
                                                                                                                                     18 columns omitted

julia> isc[isnosthing(tryparse(Float64, isc.smajax)), smajax]
ERROR: MethodError: no method matching tryparse(::Type{Float64}, ::Vector{Union{Missing, String}})
Closest candidates are:
  tryparse(::Type{Float64}, ::String) at parse.jl:247
  tryparse(::Type{Float64}, ::SubString{String}) at parse.jl:252
  tryparse(::Type{T}, ::AbstractString) where T<:Union{Float32, Float64} at parse.jl:287

Sorry I had multiple typos in my command :slight_smile:

isc[isnothing.(tryparse.(Float64, isc.smajax)), smajax]

Nope, it seems the error is triggered before the broadcast is expected.

julia> isc[isnothing.(tryparse.(Float64, isc.smajax)), smajax]
ERROR: MethodError: no method matching tryparse(::Type{Float64}, ::Missing)
Closest candidates are:
  tryparse(::Type{Float64}, ::String) at parse.jl:247
  tryparse(::Type{Float64}, ::SubString{String}) at parse.jl:252
  tryparse(::Type{T}, ::AbstractString) where T<:Union{Float32, Float64} at parse.jl:287

Thanks for trying to help but even if parsing to Float64 a string column works this is a too twisted solution and I don’t want to rely on it.

Ah sorry I forgot that of course some of your missing strings are actually detected - you need to do isc[isnothing.(skipmissing(tryparse).(Float64, isc.smajax)), :smajax]

julia> isc[isnothing.(skipmissing(tryparse).(Float64, isc.smajax)), :smajax]
ERROR: MethodError: objects of type Base.SkipMissing{typeof(tryparse)} are not callable

BTW CSVFiles does it right … but has other issues.

julia> df = DataFrame(load("isc-gem-cat.csv", skiplines_begin=123, header_exists=false))
39130×31 DataFrame
   Row │ Column1                  Column2  Column3   Column4    Column5    Column6    Column7  Column8  Column9  Column10  Column11  Column12  Column ⋯
       │ String                   Float64  Float64   Float64?   Float64?   Float64?   String   Float64  Float64  String    Float64   Float64   String ⋯
───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │ 1905-07-23 02:46:24.01    49.292    96.843       26.1       14.5       19.7  B           15.0     25.0  C             8.33      0.2   B      ⋯
     2 │ 1905-07-23 20:35:04.13    49.35     96.215       17.8       12.8      173.0  C           10.0     25.0  C             6.19      0.67  C
     3 │ 1905-08-04 05:08:57.25    41.743    19.925       10.3        8.9      180.0  C           15.0      7.2  C             6.19      0.2   C
     4 │ 1905-09-08 01:43:02.41    38.636    15.784  missing    missing    missing    C           15.0      6.7  C             7.19      0.2   B
     5 │ 1905-09-14 19:41:40.64    50.462  -179.492       41.2       26.3      110.8  C           15.0     25.0  C             6.71      0.45  C      ⋯
     6 │ 1905-09-15 06:02:20.96    55.751   163.952       31.3       23.2       75.4  C           15.0     12.1  C             7.84      0.46  C

You want passmissing, not skipmissing. That advice is unfortunately incorrect.

No error but no output either.

julia> isc[isnothing.(passmissing(tryparse).(Float64, isc.smajax)), :smajax]
Union{Missing, String}[]

Just to convert the column to float, this works

julia> isc = CSV.File("isc-gem-cat.csv", header=93, normalizenames=true, limit=5, datarow=125) |> DataFrame
5×31 DataFrame
 Row │ _date                     lat      lon       smajax    sminax    strike   q       depth    unc      q_1     mw       unc_1    q_2     s        ⋯
     │ String                    Float64  Float64   String    String    String   String  Float64  Float64  String  Float64  Float64  String  String   ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │  1905-07-23 20:35:04.13    49.35     96.215     17.8      12.8    173.0    C         10.0     25.0   C         6.19     0.67   C       p       ⋯
   2 │  1905-08-04 05:08:57.25    41.743    19.925     10.3       8.9    180.0    C         15.0      7.2   C         6.19     0.2    C       p
   3 │  1905-09-08 01:43:02.41    38.636    15.784                                C         15.0      6.7   C         7.19     0.2    B       p
   4 │  1905-09-14 19:41:40.64    50.462  -179.492     41.2      26.3    110.8    C         15.0     25.0   C         6.71     0.45   C       p
   5 │  1905-09-15 06:02:20.96    55.751   163.952     31.3      23.2     75.4    C         15.0     12.1   C         7.84     0.46   C       p       ⋯
                                                                                                                                     17 columns omitted

julia> tryparse.(Float64,isc[!, :smajax])
5-element Vector{Union{Nothing, Float64}}:
 17.8
 10.3
   nothing
 41.2
 31.3

This is an issue that we need to deal with.

The problem is that the passmissing returns missing for the entry with missing, not nothing. so you get no trues for isnothing. MWE

julia> using DataFrames

julia> isc = DataFrame(smajax = ["17.8", "10.3", missing])
3×1 DataFrame
 Row │ smajax  
     │ String? 
─────┼─────────
   1 │ 17.8
   2 │ 10.3
   3 │ missing 

julia> isc[isnothing.(passmissing(tryparse).(Float64, isc.smajax)), :smajax]
Union{Missing, String}[]

julia> passmissing(tryparse).(Float64, isc.smajax)
3-element Array{Union{Missing, Float64},1}:
 17.8
 10.3
   missing

With the function passmissing(tryparse)(Float64, x) The logic of this code would be

  1. If x is missing, return missing
  2. If it’s a String that cannot be parsed as a float, say "Hello", you get nothing
  3. If it can be parsed as a number, get nothing.

and isnothing(missing) = false)

You are doing some sort of conversion from missing in the background here, right? Or just not using the missingstring option?

No. I’m not doing anything else then the commands I shown.

This is what you had before, note the missingstring comand:

julia> isc = CSV.File("isc-gem-cat.csv", header=93, normalizenames=true, limit=5, missingstring="        ", datarow=125) |> DataFrame

Your most recent CSV.File command doesn’t have that