How beginners misread documentation--Example: Me

My apologies to the developers of CSV.jl. This post is not meant to criticize or pick on them. They are to be lauded for writing such extensive documentation. However, this morning I was trying to read a well-formatted file into a DataFrame and ran into lots of problems. What follows illustrates how documentation may be misinterpreted or misunderstood. A lot of Julia documentation will be misinterpreted and misunderstood in the same ways.

file simple (tab delimited):

comment:	first design	date:	28-Oct-2021
rock	paper	scissors
granite	A4	big
white quartz	8x10	snippers

Read the first line of the file; it’s different and will be parsed separately.

path = "C:\\Users\\luser\\"   # every backslash in a Julia string literal must be escaped
fname = "simple"
pathname = path * fname

fn = open(pathname, "r")

firstline = readline(fn);

header = split(firstline, "\t")
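
A minimal sketch of the same first-line step, using the `open(...) do` form so the file is closed automatically. The temporary file and its contents are stand-ins for the `simple` file above, not my actual setup.

```julia
# Stand-in for the tab-delimited file `simple`, written to a temp path
# so the sketch is self-contained.
pathname = tempname()
write(pathname, "comment:\tfirst design\tdate:\t28-Oct-2021\nrock\tpaper\tscissors\n")

# The do-block form closes the file when the block exits, even on error.
firstline = open(pathname, "r") do io
    readline(io)
end

# Split the comment line on tabs into its four fields.
header = split(firstline, "\t")
println(header)
```

With this form there is no open handle left over to pass to a later call by accident.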

So far, so good. Now the fun begins.

Read the rest of the file into a dataframe. Here are relevant snippets of documentation:

CSV.read(source, sink::T; kwargs...) => T 

CSV.File(input; kwargs...) => CSV.File

First attempt:

df = DataFrame()
dfr = CSV.read(fn, df; delim="\t")

julia> dfr = CSV.read(fn, df; delim="\t")
ERROR: MethodError: objects of type DataFrame are not callable
Stacktrace:
 [1] |>(x::Tables.CopiedColumns{CSV.File}, f::DataFrame)
   @ Base .\operators.jl:858
 [2] read(source::IOStream, sink::DataFrame; copycols::Bool, kwargs::Base.Iterators.Pairs{Symbol, String, Tuple{Symbol}, NamedTuple{(:delim,), Tuple{String}}})
   @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\CSV.jl:91
 [3] top-level scope
   @ REPL[63]:1

Hmm, let’s try CSV.File and pipe it directly to dataframe:

julia> dff = CSV.File(fn; skipto=2) |> df
ERROR: MethodError: objects of type DataFrame are not callable
Stacktrace:
 [1] |>(x::CSV.File, f::DataFrame)
   @ Base .\operators.jl:858
 [2] top-level scope
   @ REPL[64]:1

Oops, I’m reusing “fn” – use the pathname instead:

julia> df = DataFrame()
0×0 DataFrame

julia> dfr = CSV.read(pathname, df; delim="\t", skipto=2)
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 1. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 2. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 3. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
ERROR: MethodError: objects of type DataFrame are not callable
Stacktrace:
 [1] |>(x::Tables.CopiedColumns{CSV.File}, f::DataFrame)
   @ Base .\operators.jl:858
 [2] read(source::String, sink::DataFrame; copycols::Bool, kwargs::Base.Iterators.Pairs{Symbol, Any, Tuple{Symbol, Symbol}, NamedTuple{(:delim, :skipto), Tuple{String, Int64}}})
   @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\CSV.jl:91
 [3] top-level scope
   @ REPL[66]:1

That didn’t work. Try CSV.File. I’ll try piping the output to my empty dataframe:

julia> dff = CSV.File(pathname; skipto=2) |> df
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 1. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 2. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 3. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
ERROR: MethodError: objects of type DataFrame are not callable
Stacktrace:
 [1] |>(x::CSV.File, f::DataFrame)
   @ Base .\operators.jl:858
 [2] top-level scope
   @ REPL[67]:1

Hmm, maybe when the front page said:

That's quite a bit! Let's boil down a TL;DR:

Just want to read a delimited file or collection of files and do basic stuff with data? Use CSV.File(file) or CSV.read(file, DataFrame)

… you actually type in “DataFrame” and not the variable name for the dataframe.
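
A small sketch of why the earlier `|> df` attempts failed, using only Base types so it runs without CSV.jl or DataFrames.jl. `Set` and `empty_set` here are stand-ins for `DataFrame` and `df`; the analogy is my assumption about what is going on, not something taken from the CSV.jl docs.

```julia
# `x |> f` is just `f(x)`, so the right-hand side must be callable.
# Type names are callable (they act as constructors); instances usually are not.
data = [3, 1, 2, 3]

as_set = data |> Set          # same as Set(data): the type name is called as a constructor

empty_set = Set{Int}()        # an instance, analogous to `df = DataFrame()`
err = try
    data |> empty_set         # tries to call empty_set(data)
catch e
    e
end

println(as_set == Set([1, 2, 3]))
println(err isa MethodError)  # the same kind of error as "objects of type DataFrame are not callable"
```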

Let’s give it a try:

julia> dfr = CSV.read(pathname, DataFrame; delim="\t", skipto=2)
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 1. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 2. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
┌ Warning: thread = 1 warning: only found 3 / 4 columns around data row: 3. Filling remaining columns with `missing`
└ @ CSV C:\Users\nxf79930\.julia\packages\CSV\nofYz\src\file.jl:634
3×4 DataFrame
 Row │ comment:      first design  date:     28-Oct-2021 
     │ String15      String7       String15  Missing     
─────┼───────────────────────────────────────────────────
   1 │ rock          paper         scissors      missing 
   2 │ granite       A4            big           missing 
   3 │ white quartz  8x10          snippers      missing 

Hey, it read some data in. But why is the first line in the header? I put in skipto=2.
[break to re-read documentation on skipto]

Oh, it looks like the header is the line before the skipto line, so I need to put in header=false:

julia> dfr = CSV.read(pathname, DataFrame; delim="\t", skipto=2, header=false)
3×3 DataFrame
 Row │ Column1       Column2  Column3  
     │ String15      String7  String15 
─────┼─────────────────────────────────
   1 │ rock          paper    scissors
   2 │ granite       A4       big
   3 │ white quartz  8x10     snippers

Success at last!

Let’s get CSV.File working too:

julia> dff = CSV.File(pathname; skipto=2, header=false) |> DataFrame
3×3 DataFrame
 Row │ Column1       Column2  Column3  
     │ String15      String7  String15 
─────┼─────────────────────────────────
   1 │ rock          paper    scissors
   2 │ granite       A4       big
   3 │ white quartz  8x10     snippers

Analysis

  1. I am often confused by Julia documentation as to whether I need to write out the type or provide a variable of that type. The use of “T” is the worst offender here – “Is that variable T, or is that a type?” In the CSV.jl documentation, they put in DataFrame as the type. OK, this requires me to remember that 1) a DataFrame (an object containing rows and columns of data) is also a type (DataFrame) in Julia, and 2) type names are entered without quotes around them.

  2. Confusion on opening the file. When reading one line out, I created a variable, “fn”, for the IOStream. I then reused that when working with CSV. I needed just the pathname string instead. However, in my defense, most of the examples given for CSV use IOBuffer(data) instead of a filename.
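
On point 2, one detail that may be worth keeping in mind: a stream remembers its position, so anything read from it after `readline` starts at the second line. The sketch below uses an `IOBuffer` with made-up contents as a stand-in for the open file handle `fn`. This may also be why the first attempt above, which passed `fn`, printed no column-count warnings, though that is my guess.

```julia
# An in-memory stand-in for the open file handle `fn`.
io = IOBuffer("comment:\tfirst design\nrock\tpaper\tscissors\ngranite\tA4\tbig\n")

firstline = readline(io)      # consumes the first line and advances the stream position
rest = read(io, String)       # reads from the current position onward, not from the start

println(firstline)
println(rest)
```

A pathname string, by contrast, always means "the whole file from the top", which is why `skipto` and `header` suddenly mattered once I switched to it.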

What makes the CSV.jl documentation confusing:

  1. The friendly front page really makes you feel good that CSV.jl is an easy way to read delimited files into a dataframe – especially when you get down to the “TL;DR” (recommendation: move that to the top). However, not a single example in the documentation shows how to read a delimited file into a dataframe! OK, that’s a lie – there is one example of reading a zipped file into a dataframe (but since I’m not reading a zipped file, I’m not going to look there).

  2. Most of the examples use IOBuffer(data) as the source of the data. I get why – the data can be defined in the code, making it self-contained. You don’t have to create a file for the example which has to travel along with the code. However, I didn’t come to the CSV.jl documentation to read from a String, or an IOBuffer. I came to read from a file–and few examples read from a file.

Recommendations

  1. Try to make a distinction, or call out, whether a variable is needed or a Type is needed. Again, “T” is the worst offender, and Julia documentation (and C++ documentation, and …) uses it both ways. If it is supposed to be a variable, use something other than T.

  2. Documentation should have some dead simple, knuckle-dragging examples of the simplest or most common uses. For example, this simple example in the CSV.jl documentation would have helped me a lot:

file simple2.csv (comma delimited):

stockno,item,unit,price
10,"hammer","EA", 15.95
20,"ladder","EA", 84.95
30,"nails","LB", 6.95

julia> pathname="simple2.csv"
"simple2.csv"

julia> df1 = CSV.read(pathname, DataFrame; delim=",")
3×4 DataFrame
 Row │ stockno  item     unit     price   
     │ Int64    String7  String3  Float64 
─────┼────────────────────────────────────
   1 │      10  hammer   EA         15.95
   2 │      20  ladder   EA         84.95
   3 │      30  nails    LB          6.95

julia> df2 = CSV.File(pathname; delim=",") |> DataFrame
3×4 DataFrame
 Row │ stockno  item     unit     price   
     │ Int64    String7  String3  Float64 
─────┼────────────────────────────────────
   1 │      10  hammer   EA         15.95
   2 │      20  ladder   EA         84.95
   3 │      30  nails    LB          6.95

That’s it! The above would have been really helpful.

Now, recall what I said at the beginning regarding the developers of CSV.jl:

This post is not meant to criticize or pick on them. They are to be lauded for writing such extensive documentation.

I could write something similar for a lot of documentation (SQLite.jl, looking at you …).

The point of this post is just to illustrate how documentation can be misunderstood and give some general tips for making all documentation better.

You might open a PR to CSV.jl adding your example to the docs to help the developers out.

T is not a confusing name to people who have internalized Julia naming conventions: CamelCase for Type variables, snake_case for non-type variables.

Since a variable can very well be a type and vice versa, the question “Is that variable T, or is that a type?” doesn’t really mean anything; it’s both.

To me the real problem in the documentation of CSV.jl is that the argument is named sink. To me, df is more β€œsink-like” than DataFrame. Renaming the argument SinkType or better yet OutputType would clear up a lot of the confusion.

This!

This argument is not a type; it’s effectively a function. E.g., a common way to read a CSV into a plain array-based Table is to provide sink=rowtable or columntable.

I find it more than confusing in this case, I think it’s wrong :)

The convention is that T is a generic name for a type so for example you can replace T with DataFrame and get a correct signature. In this case it gives

CSV.read(source, sink::DataFrame; kwargs...) => DataFrame

It’s wrong because it says that sink is a value of type DataFrame such as df, not a value of type Type such as DataFrame. So according to the documentation @blackeneth was 100% right to pass df as parameter.

I also find the " => T" part a bit confusing. AFAIK the correct way to write this signature is

CSV.read(source, sink::Type{T}; kwargs...)::T where {T}

but it’s probably best to keep things simple and write

CSV.read(source, sink::Type; kwargs...)

then explain in the following text that β€œCSV.read returns a value of the given sink type, for example CSV.read("data.csv", DataFrame) returns a data frame”.
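
To make the `Type{T}` convention concrete, here is a minimal sketch with a hypothetical function, `parse_all`, that is not part of CSV.jl and says nothing about how CSV.jl is implemented. It only shows that an argument annotated `::Type{T}` accepts the type name itself and rejects an instance.

```julia
# Hypothetical function (not part of CSV.jl) whose second argument is a type.
# `::Type{T}` matches the type itself; the result is a Vector of T.
function parse_all(strings, ::Type{T}) where {T}
    return T[parse(T, s) for s in strings]
end

numbers = parse_all(["1", "2", "3"], Int)     # pass the type name, like `DataFrame`

err = try
    parse_all(["1", "2", "3"], 0)             # pass an instance instead, like `df`
catch e
    e
end

println(numbers)
println(err isa MethodError)
```

Written as `sink::T`, the same signature would instead describe the second call, which is the reading that led to passing `df`.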

You are right, I read that wrong.