My apologies to the developers of CSV.jl. This post is not to criticize or pick on them. They are to be lauded for writing such extensive documentation. However, this morning I was trying to read a well-formatted file into a DataFrame and ran into lots of problems. What follows illustrates how documentation may be misinterpreted or misunderstood. A lot of Julia documentation can be misinterpreted and misunderstood in the same ways.
file simple (tab delimited):
comment: first design date: 28-Oct-2021
rock paper scissors
granite A4 big
white quartz 8x10 snippers
Read the first line of the file; it's different and will be parsed separately
path="C:\\Users\\luser\\"
fname = "simple"
pathname = path*fname
fn = open(pathname,"r")
firstline = readline(fn);
header = split(firstline,"\t")
So far, so good. Now the fun begins.
Read the rest of the file into a dataframe. Here are relevant snippets of documentation:
CSV.read(source, sink::T; kwargs...) => T
CSV.File(input; kwargs...) => CSV.File
First attempt:
df = DataFrame()
dfr = CSV.read(fn, df; delim="\t")
julia> dfr = CSV.read(fn, df; delim="\t")
ERROR: MethodError: objects of type DataFrame are not callable
Stacktrace:
[1] |>(x::Tables.CopiedColumns{CSV.File}, f::DataFrame)
@ Base .\operators.jl:858
[2] read(source::IOStream, sink::DataFrame; copycols::Bool, kwargs::Base.Iterators.Pairs{Symbol, String, Tuple{Symbol}, NamedTuple{(:delim,), Tuple{String}}})
@ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\CSV.jl:91
[3] top-level scope
@ REPL[63]:1
Hmm, let's try CSV.File and pipe it directly to the dataframe:
julia> dff = CSV.File(fn; skipto=2) |> df
ERROR: MethodError: objects of type DataFrame are not callable
Stacktrace:
[1] |>(x::CSV.File, f::DataFrame)
@ Base .\operators.jl:858
[2] top-level scope
@ REPL[64]:1
Oops, I'm reusing "fn"; use the pathname instead:
julia> df = DataFrame()
0×0 DataFrame
julia> dfr = CSV.read(pathname, df; delim="\t", skipto=2)
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 1. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 2. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 3. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
ERROR: MethodError: objects of type DataFrame are not callable
Stacktrace:
[1] |>(x::Tables.CopiedColumns{CSV.File}, f::DataFrame)
@ Base .\operators.jl:858
[2] read(source::String, sink::DataFrame; copycols::Bool, kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:delim, :skipto), Tuple{String, Int64}}})
@ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\CSV.jl:91
[3] top-level scope
@ REPL[66]:1
That didn't work. Try CSV.File. I'll try piping the output to my empty dataframe:
julia> dff = CSV.File(pathname; skipto=2) |> df
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 1. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 2. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 3. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
ERROR: MethodError: objects of type DataFrame are not callable
Stacktrace:
[1] |>(x::CSV.File, f::DataFrame)
@ Base .\operators.jl:858
[2] top-level scope
@ REPL[67]:1
Hmm, maybe when on the front page it said:
That's quite a bit! Let's boil down a TL;DR:
Just want to read a delimited file or collection of files and do basic stuff with data? Use CSV.File(file) or CSV.read(file, DataFrame)
… you actually type in "DataFrame" and not the variable name for the dataframe
Let's give it a try:
julia> dfr = CSV.read(pathname, DataFrame; delim="\t", skipto=2)
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 1. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 2. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 3. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
3×4 DataFrame
 Row │ comment:      first design  date:     28-Oct-2021
     │ String15      String7       String15  Missing
─────┼──────────────────────────────────────────────────
   1 │ rock          paper         scissors  missing
   2 │ granite       A4            big       missing
   3 │ white quartz  8x10          snippers  missing
Hey, it read some data in. But why is the first line in the header? I put in skipto=2.
[break to re-read documentation on skipto]
Oh, it looks like the header is the line before the skipto line, so I need to put in header=false:
julia> dfr = CSV.read(pathname, DataFrame; delim="\t", skipto=2, header=false)
3×3 DataFrame
 Row │ Column1       Column2  Column3
     │ String15      String7  String15
─────┼─────────────────────────────────
   1 │ rock          paper    scissors
   2 │ granite       A4       big
   3 │ white quartz  8x10     snippers
Success at last!
Let's get CSV.File working too:
julia> dff = CSV.File(pathname; skipto=2, header=false) |> DataFrame
3×3 DataFrame
 Row │ Column1       Column2  Column3
     │ String15      String7  String15
─────┼─────────────────────────────────
   1 │ rock          paper    scissors
   2 │ granite       A4       big
   3 │ white quartz  8x10     snippers
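Putting the working pieces together, the whole task can be sketched as below. This is only a sketch under assumptions: the file is recreated in a temporary directory rather than at the original Windows path, and the tab positions in the first line are my guess from the four column names that CSV.jl reported above. It uses readline(pathname) instead of an open IOStream, so there is no half-read stream to trip over afterwards.

```julia
using CSV, DataFrames

# Assumption: recreate the tab-delimited "simple" file in a temporary directory.
# The tab positions in the first line are inferred from the column names above.
pathname = joinpath(mktempdir(), "simple")
write(pathname, """
comment:\tfirst design\tdate:\t28-Oct-2021
rock\tpaper\tscissors
granite\tA4\tbig
white quartz\t8x10\tsnippers
""")

# readline on a path opens the file, reads one line, and closes it again.
firstline = readline(pathname)
meta = split(firstline, "\t")

# Pass the type DataFrame (not an instance) as the sink.
# skipto=2 starts the data at line 2; header=false stops line 1 being used as column names.
df = CSV.read(pathname, DataFrame; delim="\t", skipto=2, header=false)

println(meta)
println(df)
```

The same keyword arguments should carry over to the CSV.File(pathname; ...) |> DataFrame form.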
Analysis
- I am often confused by Julia documentation as to whether I need to write out the type or provide a variable of that type. The use of "T" is the worst offender here: "Is that a variable T, or is that a type?" In the CSV.jl documentation, they put in DataFrame as the type. OK, this requires me to remember that 1) a DataFrame (an object containing rows and columns of data) is also a type (DataFrame) in Julia, and 2) type names are entered without quotes around them.
- Confusion on opening the file. When reading one line out, I created a variable, "fn", for the IOStream. I then reused that when working with CSV. I needed just the pathname string instead. However, in my defense, most of the examples given for CSV use IOBuffer(data) instead of a filename.
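The type-versus-variable confusion in the first point can be shown in a few lines. The stack traces above show the call ending up in |>(x, f::DataFrame), which suggests the sink is simply applied to the parsed file, so it has to be something callable. In this sketch the NamedTuple named table is a hypothetical stand-in for parsed data, not actual CSV.jl output.

```julia
using DataFrames

df = DataFrame()                  # an instance (a value)

# The type DataFrame is itself a value you can pass around; the instance is not a type.
@show DataFrame isa Type          # true
@show df isa Type                 # false
@show df isa DataFrame            # true

# x |> f is just f(x), so the right-hand side must be callable.
table = (a = [1, 2], b = ["x", "y"])   # hypothetical stand-in for parsed data

ok = table |> DataFrame           # calls the constructor DataFrame(table)

# Piping into the instance reproduces the MethodError from the walkthrough.
err = try
    table |> df
catch e
    e
end
@show err isa MethodError         # true
```

So CSV.read(file, DataFrame) works because the type name doubles as a constructor, while CSV.read(file, df) fails because an existing dataframe cannot be called.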
What makes the CSV.jl documentation confusing:
- The friendly front page really makes you feel good that CSV.jl is an easy way to read delimited files into a dataframe, especially when you get down to the "TL;DR" (recommendation: move that to the top). However, not a single simple example in the documentation shows how to read a delimited file into a dataframe! OK, that's a lie: there is one example of reading a zipped file into a dataframe (but since I'm not reading a zipped file, I'm not going to look there).
- Most of the examples use IOBuffer(data) as the source of the data. I get why: the data can be defined in the code, making it self-contained. You don't have to create a file for the example which has to travel along with the code. However, I didn't come to the CSV.jl documentation to read from a String or an IOBuffer. I came to read from a file, and few examples read from a file.
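For anyone else stuck translating the IOBuffer examples, the following sketch suggests the two forms are interchangeable: the first argument is either an IO object or a path string, and everything else stays the same. The temporary file path and the unquoted sample rows are assumptions made for illustration.

```julia
using CSV, DataFrames

# Sample data defined in the code, as in the IOBuffer-style documentation examples.
data = """
stockno,item,unit,price
10,hammer,EA,15.95
20,ladder,EA,84.95
30,nails,LB,6.95
"""

# Form 1: read from an in-memory buffer.
from_buffer = CSV.read(IOBuffer(data), DataFrame)

# Form 2: read the same content from a file by passing the path string instead.
path = joinpath(mktempdir(), "simple2.csv")   # hypothetical temporary location
write(path, data)
from_file = CSV.read(path, DataFrame)

# Both routes should give the same table.
println(from_buffer == from_file)
```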
Recommendations
- Try to make a distinction, or call out, whether a variable is needed or a type is needed. Again, "T" is the worst offender, and Julia documentation (and C++ documentation, and …) uses it both ways. If it is supposed to be a variable, use something other than T.
- Documentation should have some dead simple, knuckle-dragging examples of the simplest or most common uses. For example, this simple example in the CSV.jl documentation would have helped me a lot:
file simple2.csv (comma delimited):
stockno,item,unit,price
10,"hammer","EA", 15.95
20,"ladder","EA", 84.95
30,"nails","LB", 6.95
julia> pathname="simple2.csv"
"simple2.csv"
julia> df1 = CSV.read(pathname, DataFrame; delim=",")
3×4 DataFrame
 Row │ stockno  item     unit     price
     │ Int64    String7  String3  Float64
─────┼────────────────────────────────────
   1 │      10  hammer   EA         15.95
   2 │      20  ladder   EA         84.95
   3 │      30  nails    LB          6.95
julia> df2 = CSV.File(pathname; delim=",") |> DataFrame
3×4 DataFrame
 Row │ stockno  item     unit     price
     │ Int64    String7  String3  Float64
─────┼────────────────────────────────────
   1 │      10  hammer   EA         15.95
   2 │      20  ladder   EA         84.95
   3 │      30  nails    LB          6.95
That's it! The above would have been really helpful.
Now, recall what I said at the beginning regarding the developers of CSV.jl:
This post is not to criticize or pick on them. They are to be lauded for writing such extensive documentation.
I could write something similar for a lot of documentation (SQLite.jl, looking at you …).
The point of this post is just to illustrate how documentation can be misunderstood and give some general tips for making all documentation better.