Hi, I have a csv file where either the column names or the column themselves have spaces like
# date , lat , lon , smajax , sminax , strike, q ,
1904-04-04 10:02:34.56 , 41.802 , 23.108 , 8.6 , 6.6 , 164.2 , B ,
Regarding the column names I was able to workaround the problem by using the option normalizenames=true
but that does not work to remove the blanks on the columns. Another, more important problem, is that empty fields turn what would be Float64 columns into strings. See the smaxj
column below.
julia> isc = CSV.File("isc-gem-cat.csv", header=93, normalizenames=true, limit=35) |> DataFrame
35×31 DataFrame
Row │ _date lat lon smajax sminax strike q depth unc q_1 mw unc_1 q_2 s ⋯
│ String Float64 Float64 String String String String Float64 Float64 String Float64 Float64 String String ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1904-04-04 10:02:34.56 41.802 23.108 8.6 6.6 164.2 B 15.0 4.8 C 6.84 0.4 C d ⋯
2 │ 1904-04-04 10:26:00.88 41.758 23.249 8.3 6.9 15.2 B 15.0 4.8 C 7.02 0.4 C d
3 │ 1904-06-25 14:45:39.14 51.424 161.638 33.6 18.7 116.2 C 15.0 25.0 C 7.5 0.4 C d
4 │ 1904-06-25 21:00:38.72 52.763 160.277 28.6 14.6 43.1 C 30.0 10.3 C 7.7 0.4 C d
5 │ 1904-12-20 05:44:20.44 8.962 -84.042 26.3 13.2 74.1 C 10.0 14.0 C 7.29 0.2 C p ⋯
6 │ 1905-02-14 08:46:34.49 51.89 -176.702 30.3 19.5 157.5 C 35.0 24.6 C 7.59 0.26 C p
7 │ 1905-02-17 11:41:07.82 23.689 97.17 27.1 18.9 143.7 C 15.0 25.0 C 7.26 0.37 C p
8 │ 1905-02-19 04:34:55.80 -14.123 169.457 54.7 12.1 3.4 C 15.0 25.0 C 7.33 0.54 C p
9 │ 1905-03-04 15:59:53.26 -6.639 151.225 82.1 33.3 161.3 C 15.0 12.6 C 6.96 0.42 C p ⋯
10 │ 1905-03-04 23:17:22.03 -4.635 149.105 22.4 18.4 155.1 C 15.0 25.0 C 7.01 0.6 C p
11 │ 1905-03-22 03:38:44.30 51.145 -179.859 18.0 13.4 91.5 C 35.0 11.8 C 7.18 0.32 C p
12 │ 1905-04-03 16:01:54.20 41.706 21.369 24.5 12.1 99.5 C 15.0 8.5 C 5.44 0.35 C p
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
25 │ 1905-06-30 17:07:35.01 -19.266 169.615 72.0 22.9 177.8 C 15.0 58.0 C 6.9 0.43 C p ⋯
26 │ 1905-07-06 16:20:46.93 38.417 144.119 31.5 13.9 131.3 C 15.0 25.0 C 7.67 0.2 C p
27 │ 1905-07-09 09:40:39.79 49.709 98.483 19.9 11.2 28.4 C 15.0 25.0 C 7.95 0.2 B d
28 │ 1905-07-11 08:39:39.56 50.24 98.5 29.1 16.5 18.1 C 15.0 25.0 C 6.9 0.32 C p
29 │ 1905-07-14 22:02:03.35 49.77 99.124 23.7 14.3 55.1 C 15.0 25.0 C 6.86 0.52 C p ⋯
30 │ 1905-07-17 00:23:24.07 -23.039 171.862 193.6 21.3 160.1 C 35.0 12.8 C 6.72 0.29 B p
31 │ 1905-07-23 02:46:24.01 49.292 96.843 26.1 14.5 19.7 B 15.0 25.0 C 8.33 0.2 B d
32 │ 1905-07-23 20:35:04.13 49.35 96.215 17.8 12.8 173.0 C 10.0 25.0 C 6.19 0.67 C p
33 │ 1905-08-04 05:08:57.25 41.743 19.925 10.3 8.9 180.0 C 15.0 7.2 C 6.19 0.2 C p ⋯
34 │ 1905-09-08 01:43:02.41 38.636 15.784 C 15.0 6.7 C 7.19 0.2 B p
35 │ 1905-09-14 19:41:40.64 50.462 -179.492 41.2 26.3 110.8 C 15.0 25.0 C 6.71 0.45 C p
How can I force those columns to remain Floats ans have NaN instead of missing in the empty cells? I can do the following (after knowing the column name where it happens) but it inserts a missing and I need floats (C does not know what is a missing).
julia> isc = CSV.File("isc-gem-cat.csv", header=93, normalizenames=true, limit=35, types=Dict(:smajax=>Float64)) |> DataFrame
the other problem I referred above is the spaces in the column strings. e.g.
julia> isc[!,7]
35-element PooledArrays.PooledVector{String, UInt32, Vector{UInt32}}:
" B "
" B "
" C "
" C "
I understand they come from the fact that in the file that column is written as “, B ,” but those blanks should be stripped by default.