Loading a csvfile using CSVFiles

I have a file with a date in the form mm/dd/yy and my guess is that it is interpreting it in dd/mm/yy format. I’ve been using CSVFiles to do this with the load command and have tried the following:

load("test.csv",colparsers=[5=>dateformat"mm/dd/yy"])

since it is the 5th column. I’ve also tried the column name as well am getting the following error:

MethodError: no method matching tofield(::Pair{Symbol,DateFormat{Symbol("mm/dd/yy"),
Tuple{Dates.DatePart{'m'},Dates.Delim{Char,1},Dates.DatePart{'d'},
Dates.Delim{Char,1},Dates.DatePart{'y'}}}}, ::TextParse.LocalOpts)

so obviously I’m not calling this correctly, but I can’t find documentation on this, nor am able to figure it out from the source code.

Here’s the first lines of that file:

Athlete,Age,Country,Year,Closing Ceremony Date,Sport,Gold Medals,Silver Medals,Bronze Medals,Total Medals
Michael Phelps,23,United States,2008,8/24/08,Swimming,8,0,0,8
Michael Phelps,19,United States,2004,8/29/04,Swimming,6,0,2,8
Michael Phelps,27,United States,2012,8/12/12,Swimming,4,2,0,6
Natalie Coughlin,25,United States,2008,8/24/08,Swimming,1,2,3,6

I know CSVFiles/TextParse.jl have been doing some refactoring lately, so probably just a missing method somewhere. In the mean time, you could use CSV.jl like:

julia> df = CSV.read("/Users/jacobquinn/Downloads/discourse.csv", dateformat="mm/dd/yy", transforms=Dict("Closing Ceremony Date"=>x->x + Dates.Year(2000)))
4×10 DataFrames.DataFrame. Omitted printing of 1 columns
│ Row │ Athlete          │ Age   │ Country       │ Year  │ Closing Ceremony Date │ Sport    │ Gold Medals │ Silver Medals │ Bronze Medals │
│     │ String           │ Int64 │ String        │ Int64 │ Date                  │ String   │ Int64       │ Int64         │ Int64         │
├─────┼──────────────────┼───────┼───────────────┼───────┼───────────────────────┼──────────┼─────────────┼───────────────┼───────────────┤
│ 1   │ Michael Phelps   │ 23    │ United States │ 2008  │ 2008-08-24            │ Swimming │ 8           │ 0             │ 0             │
│ 2   │ Michael Phelps   │ 19    │ United States │ 2004  │ 2004-08-29            │ Swimming │ 6           │ 0             │ 2             │
│ 3   │ Michael Phelps   │ 27    │ United States │ 2012  │ 2012-08-12            │ Swimming │ 4           │ 2             │ 0             │
│ 4   │ Natalie Coughlin │ 25    │ United States │ 2008  │ 2008-08-24            │ Swimming │ 1           │ 2             │ 3             │

1 Like

You need to pass a Dict to colparsers, not a vector, in this case:

load("test.csv",colparsers=Dict(5=>dateformat"mm/dd/yy"))

You would use a vector when you want to specify just all column parsers, not as Pairs, but just as a list of parsers.

I just realized that this will still not give you what you want: it will parse the years as the year 0008 etc… That seems a weird default (to me) in the julia base date parsing story… The python/posix approach to parsing two digit year numbers seems much more useful to me. It also seems to me that it would be super helpful if one could specify in the dateformat string how two digit years should be handled (given that I would guess that the current behavior is almost never very useful).