Read file with CSV.read


#1

Hello everyone,

I’m having trouble reading a simple file with CSV.read.
I have the following data:

1.0 2.462558e-04 11 -1.18791031e-04 +1.18791031e-04 +8.96777973e+02 +3.88470836e+02
1.0 2.462558e-04 12 +1.18790872e-04 -1.18790872e-04 -8.96777979e+02 -3.88470836e+02 
1.0 2.462558e-04 21 +1.18790871e-04 -1.18790871e-04 +8.40080497e+02 +3.20800442e+02
1.0 2.462558e-04 22 -1.18791028e-04 +1.18791028e-04 -8.40080491e+02 -3.20800447e+02

which I put in a file test.dat.
When I run

CSV.read("test.dat" ; datarow=1, delim=' ')

I get

ERROR: ArgumentError: data row (1) must come after header row (1)
Stacktrace:
 [1] #Source#12(::String, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T) at /home/omerchiers/.julia/v0.6/CSV/src/Source.jl:49
 [2] (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
 [3] #Source#11(::Char, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::DateFormat{Symbol("yyyy-mm-dd"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'}}}, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T, ::String) at /home/omerchiers/.julia/v0.6/CSV/src/Source.jl:25
 [4] (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}, ::String) at ./<missing>:0
 [5] #read#29(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /home/omerchiers/.julia/v0.6/CSV/src/Source.jl:294
 [6] (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{T} where T) at ./<missing>:0 (repeats 2 times)

If instead I change the file to

# Nothing here
1.0 2.462558e-04 11 -1.18791031e-04 +1.18791031e-04 +8.96777973e+02 +3.88470836e+02
1.0 2.462558e-04 12 +1.18790872e-04 -1.18790872e-04 -8.96777979e+02 -3.88470836e+02 
1.0 2.462558e-04 21 +1.18790871e-04 -1.18790871e-04 +8.40080497e+02 +3.20800442e+02
1.0 2.462558e-04 22 -1.18791028e-04 +1.18791028e-04 -8.40080491e+02 -3.20800447e+02

and type in the REPL

CSV.read("test.dat" ; datarow = 2, delim=' ')

I get:

4×2 DataFrames.DataFrame
│ Row │ #Nothing    │ here         │
├─────┼─────────────┼──────────────┤
│ 1   │ 1.0         │ 0.000246256  │
│ 2   │ 11.0        │ -0.000118791 │
│ 3   │ 0.000118791 │ 896.778      │
│ 4   │ 388.471     │ 1.0          │

Neither of these is what I want, obviously.
I’m on Julia v0.6.2
CSV 0.1.5
DataFrames 0.10.1

As a side note, I ran Pkg.update(), but for some reason DataFrames is not updated to 0.11.

Many thanks in advance,
Olivier


#2

This works for me with your sample data:

julia> CSV.read("foo.csv"; header=false, delim=' ', types=fill(Float64,7))
4×7 DataFrames.DataFrame
│ Row │ Column1 │ Column2     │ Column3 │ Column4      │ Column5      │
├─────┼─────────┼─────────────┼─────────┼──────────────┼──────────────┤
│ 1   │ 1.0     │ 0.000246256 │ 11.0    │ -0.000118791 │ 0.000118791  │
│ 2   │ 1.0     │ 0.000246256 │ 12.0    │ 0.000118791  │ -0.000118791 │
│ 3   │ 1.0     │ 0.000246256 │ 21.0    │ 0.000118791  │ -0.000118791 │
│ 4   │ 1.0     │ 0.000246256 │ 22.0    │ -0.000118791 │ 0.000118791  │

│ Row │ Column6  │ Column7  │
├─────┼──────────┼──────────┤
│ 1   │ 896.778  │ 388.471  │
│ 2   │ -896.778 │ -388.471 │
│ 3   │ 840.08   │ 320.8    │
│ 4   │ -840.08  │ -320.8   │

#3

This worked for me (Julia 0.6.2):

julia> using CSV
INFO: Recompiling stale cache file C:\Users\mcallistst\.julia\lib\v0.6\CSV.ji for module CSV.

julia> readdlm("test.dat",' ')
4x7 Array{Float64,2}:
 1.0  0.000246256  11.0  -0.000118791   0.000118791   896.778   388.471
 1.0  0.000246256  12.0   0.000118791  -0.000118791  -896.778  -388.471
 1.0  0.000246256  21.0   0.000118791  -0.000118791   840.08    320.8
 1.0  0.000246256  22.0  -0.000118791   0.000118791  -840.08   -320.8

julia> CSV.read("test.dat",delim = ' ')
3x7 DataFrames.DataFrame. Omitted printing of 2 columns
│ Row │ 1.0 │ 2.462558e-04 │ 11 │ -1.18791031e-04 │ +1.18791031e-04 │
├─────┼─────┼──────────────┼────┼─────────────────┼─────────────────┤
│ 1   │ 1.0 │ 0.000246256  │ 12 │ 0.000118791     │ -0.000118791    │
│ 2   │ 1.0 │ 0.000246256  │ 21 │ 0.000118791     │ -0.000118791    │
│ 3   │ 1.0 │ 0.000246256  │ 22 │ -0.000118791    │ 0.000118791     │

julia> CSV.read("test.dat",delim = ' ',datarow=1)
4x7 DataFrames.DataFrame. Omitted printing of 2 columns
│ Row │ Column1 │ Column2     │ Column3 │ Column4      │ Column5      │
├─────┼─────────┼─────────────┼─────────┼──────────────┼──────────────┤
│ 1   │ 1.0     │ 0.000246256 │ 11      │ -0.000118791 │ 0.000118791  │
│ 2   │ 1.0     │ 0.000246256 │ 12      │ 0.000118791  │ -0.000118791 │
│ 3   │ 1.0     │ 0.000246256 │ 21      │ 0.000118791  │ -0.000118791 │
│ 4   │ 1.0     │ 0.000246256 │ 22      │ -0.000118791 │ 0.000118791  │


#4

Thanks for the reply,

Unfortunately, this does not work either on my machine:

CSV.read("test.dat"; header=false, delim=' ', types=fill(Float64,7))
4×7 DataFrames.DataFrame
│ Row │ Column1 │ Column2     │ Column3     │ Column4      │ Column5      │ Column6      │ Column7  │
├─────┼─────────┼─────────────┼─────────────┼──────────────┼──────────────┼──────────────┼──────────┤
│ 1   │ 1.0     │ 0.000246256 │ 11.0        │ -0.000118791 │ 0.000118791  │ 896.778      │ 388.471  │
│ 2   │ 1.0     │ 0.000246256 │ 12.0        │ 0.000118791  │ -0.000118791 │ -896.778     │ -388.471 │
│ 3   │ #NULL   │ 1.0         │ 0.000246256 │ 21.0         │ 0.000118791  │ -0.000118791 │ 840.08   │
│ 4   │ 320.8   │ 1.0         │ 0.000246256 │ 22.0         │ -0.000118791 │ 0.000118791  │ -840.08  │

On top of that, I would like to have Ints for the third column.
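
One thing that may be worth checking, given the shifted rows in the output above: in the sample as pasted in the first post, the second data line ends with a trailing space. With delim=' ', a trailing space could be read as one extra empty field that then spills into the next row. That is only a guess from the output, not something verified against the CSV.jl source. A minimal sketch using only Base functions, with two of the sample lines hard-coded:

```julia
# Two sample rows copied from the first post; the second one ends with a space.
rows = [
    "1.0 2.462558e-04 11 -1.18791031e-04 +1.18791031e-04 +8.96777973e+02 +3.88470836e+02",
    "1.0 2.462558e-04 12 +1.18790872e-04 -1.18790872e-04 -8.96777979e+02 -3.88470836e+02 ",
]

# Splitting on a single space keeps empty fields, so a trailing space adds one.
nfields_strict = [length(split(r, ' ')) for r in rows]

# Stripping trailing whitespace first gives the same field count for every row.
nfields_clean = [length(split(rstrip(r), ' ')) for r in rows]

println(nfields_strict)  # prints [7, 8]
println(nfields_clean)   # prints [7, 7]
```

If the counts differ between rows in the real file, removing the trailing whitespace before parsing may be enough to line the columns up again.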


#5

Many thanks for the reply,

But none of the solutions works, except the readdlm version:

Option 1:

CSV.read("test.dat",delim = ' ')
ERROR: CSV.CSVError("error parsing a `Int64` value on column 3, row 2; encountered '.'")
Stacktrace:
 [1] checknullend at /home/omerchiers/.julia/v0.6/CSV/src/parsefields.jl:56 [inlined]
 [2] parsefield at /home/omerchiers/.julia/v0.6/CSV/src/parsefields.jl:127 [inlined]
 [3] parsefield at /home/omerchiers/.julia/v0.6/CSV/src/parsefields.jl:107 [inlined]
 [4] streamfrom(::CSV.Source, ::Type{DataStreams.Data.Field}, ::Type{Nullable{Int64}}, ::Int64, ::Int64) at /home/omerchiers/.julia/v0.6/CSV/src/Source.jl:195
 [5] streamto!(::DataFrames.DataFrame, ::Type{DataStreams.Data.Field}, ::CSV.Source, ::Type{Nullable{Int64}}, ::Type{Nullable{Int64}}, ::Int64, ::Int64, ::DataStreams.Data.Schema{true}, ::Base.#identity) at /home/omerchiers/.julia/v0.6/DataStreams/src/DataStreams.jl:173
 [6] stream!(::CSV.Source, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrame, ::DataStreams.Data.Schema{true}, ::DataStreams.Data.Schema{true}, ::Array{Function,1}) at /home/omerchiers/.julia/v0.6/DataStreams/src/DataStreams.jl:187
 [7] #stream!#5(::Array{Any,1}, ::Function, ::CSV.Source, ::Type{DataFrames.DataFrame}, ::Bool, ::Dict{Int64,Function}) at /home/omerchiers/.julia/v0.6/DataStreams/src/DataStreams.jl:151
 [8] stream!(::CSV.Source, ::Type{DataFrames.DataFrame}, ::Bool, ::Dict{Int64,Function}) at /home/omerchiers/.julia/v0.6/DataStreams/src/DataStreams.jl:145
 [9] #read#29(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /home/omerchiers/.julia/v0.6/CSV/src/Source.jl:299
 [10] (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{T} where T) at ./<missing>:0 (repeats 2 times)

Option 2:

CSV.read("test.dat",delim = ' ',datarow=1)
ERROR: ArgumentError: data row (1) must come after header row (1)
Stacktrace:
 [1] #Source#12(::String, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T) at /home/omerchiers/.julia/v0.6/CSV/src/Source.jl:49
 [2] (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
 [3] #Source#11(::Char, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::DateFormat{Symbol("yyyy-mm-dd"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'}}}, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T} where T, ::String) at /home/omerchiers/.julia/v0.6/CSV/src/Source.jl:25
 [4] (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}, ::String) at ./<missing>:0
 [5] #read#29(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at /home/omerchiers/.julia/v0.6/CSV/src/Source.jl:294
 [6] (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{T} where T) at ./<missing>:0 (repeats 2 times)

Could this be a problem of my DataFrames version?
I could use readdlm in the meantime, but I prefer the cleaner CSV option, since it is the one that will be supported in the long run.

Many thanks in advance.
Olivier
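
For the readdlm stopgap mentioned above, here is a minimal sketch that also yields Ints for the third column. It assumes a recent Julia, where readdlm lives in the DelimitedFiles standard library (on Julia 0.6 it is in Base, so the using line is not needed). The sample rows are hard-coded instead of read from test.dat, and trailing whitespace is stripped first as a precaution, since the second sample line as pasted ends with a space. The names lines, data, and ids are just for illustration.

```julia
using DelimitedFiles  # Julia 0.7 and later; on Julia 0.6 readdlm is in Base

# Sample rows from the first post (stand-in for the contents of test.dat).
lines = [
    "1.0 2.462558e-04 11 -1.18791031e-04 +1.18791031e-04 +8.96777973e+02 +3.88470836e+02",
    "1.0 2.462558e-04 12 +1.18790872e-04 -1.18790872e-04 -8.96777979e+02 -3.88470836e+02 ",
    "1.0 2.462558e-04 21 +1.18790871e-04 -1.18790871e-04 +8.40080497e+02 +3.20800442e+02",
    "1.0 2.462558e-04 22 -1.18791028e-04 +1.18791028e-04 -8.40080491e+02 -3.20800447e+02",
]

# Strip trailing whitespace so every row has the same number of fields,
# then let readdlm split on whitespace (no delimiter argument given).
cleaned = join(rstrip.(lines), "\n")
data = readdlm(IOBuffer(cleaned))

# readdlm returns floats for numeric input; convert the third column to Int.
ids = Int.(data[:, 3])
```

With a real file, readdlm("test.dat") would take the place of the IOBuffer call. The Int conversion throws an InexactError if the column ever contains a non-integer value, which is probably the behaviour you want for an index column.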


#6

I think this works with https://github.com/davidanthoff/CSVFiles.jl:

using FileIO, CSVFiles, DataFrames

 load("data.csv", spacedelim=true, header_exists=false) |> DataFrame

Make sure you do a Pkg.update() first: the underlying parser, https://github.com/JuliaComputing/TextParse.jl, only recently got support for whitespace-delimited files.


#7

DataFrames won’t be upgraded until all the packages you have installed that depend on it support version 0.11. Until then, it is better to use readtable or CSVFiles. See the "DataFrames 0.11 released" announcement for more details.


#8

Thanks to all of you for your help!
Olivier