CSV.jl number of lines


I cannot use the keyword argument “resusebuffer” in the latest version of the CSV.jl package. How can I count the exact number of lines in a CSV file?

function countcsvlines(file)
    n = 0
    for row in CSV.Rows(file; resusebuffer=true)
        n += 1
    end
    return n
end

I tried to use:

CSV.Rows(fl.buf).ctx.rowsguess

but it returns numbers that are too high: a file with 5 lines gives 7, and a file with 52 lines gives 57.

Did you mean to write reusebuffer instead of resusebuffer (note the extra s in “resuse”)?

Apart from that, you may need to set the header keyword, depending on whether your dataset has a header row with column names (depending on the CSV, there may also be other metadata before the header). With that in place, it works for me:

julia> function countcsvlines(file)
           n = 0
           for row in CSV.Rows(file; header=false, reusebuffer=true)
               n += 1
           end
           return n
       end
countcsvlines (generic function with 1 method)

julia> countcsvlines("example.csv")
8

julia> readlines("example.csv")
8-element Vector{String}:
 "1,2,3,4"
 "4,5,6,7"
 "1,2,3,4"
 "4,5,6,7"
 "1,2,3,4"
 "4,5,6,7"
 "1,2,3,4"
 "4,5,6,7"

Thank you, I meant reusebuffer. My fault.

Just in case, there is a built-in function:

file = raw"C:\...\input.csv"
countlines(file)

Thank you @rafael.guerra. Is it possible to do something like the following code?

using CSV
csvRows = CSV.Rows("example.csv")
countlines(csvRows.buf)

That won’t work if the CSV contains a quoted field with a newline character - CSV.jl accounts for that.

CSV.Rows is a lazy iterator

CSV.Rows: an alternative approach for consuming delimited data, where the input is only consumed one row at a time

As such, you either have to iterate manually or collect it to figure out how many rows there are.

You can’t get around that limitation simply because CSV is much more complicated than you may initially assume - it’s not standardized, different escaping mechanisms exist and a newline character may not always indicate a new row.
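To make the difference concrete, here is a minimal Base-only sketch. The helper `count_records` and the sample data are hypothetical and are not how CSV.jl parses files: the helper only treats a newline as a row terminator when it falls outside double quotes, which is a simplification (real CSV parsing handles more cases, such as other quote or escape characters and empty rows).

```julia
# Hypothetical helper, not part of CSV.jl: counts records by treating a
# newline as a row terminator only when it falls outside double quotes.
function count_records(data::AbstractString)
    n = 0
    in_quotes = false
    pending = false                  # true if the current record has any content
    for c in data
        if c == '"'
            in_quotes = !in_quotes   # an escaped "" toggles twice, so the state is preserved
            pending = true
        elseif c == '\n' && !in_quotes
            n += 1
            pending = false
        else
            pending = true
        end
    end
    return pending ? n + 1 : n       # count a final record that lacks a trailing newline
end

# Assumed sample data: the second record contains a quoted field with a newline.
data = "id,label\n1,\"first\nsecond\"\n2,plain\n"

physical = countlines(IOBuffer(data))   # counts every newline character
logical = count_records(data)           # counts only newlines outside quotes
println("physical lines: ", physical, ", records: ", logical)
```

Here `countlines` sees one more line than there are records, because it cannot know that the newline inside the quoted field belongs to the data.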

Thank you for the explanation.

I think countlines will give incorrect results because CSVs can have escaped newline characters.


@Oscar_Smith and @Sukera, thanks for the insights.
Is this a common/possible situation when working with CSVs containing only numeric data?

This only occurs for CSVs with strings in them.


It’s unfortunate that you didn’t get an error for the misspelled keyword argument, though.


Thanks Oscar.
To better understand this, I tested a simple, familiar Excel scenario: using ALT+ENTER to split string labels across two lines within a single cell.
The OP’s function does return the correct number of CSV rows (3), while the physical CSV file on disk has 5 lines:

Excel: [image: escaped_sequences_newline_csv_Excel]

Notepad++: [image: escaped_sequences_newline_csv_notepadplus]

The Base function count() may do what you ask for here:

julia> csvRows = CSV.Rows(file; header=false);

julia> count(i -> i==i, csvRows)
3

NB: There should be a better way of writing the predicate i -> i==i.

It’s _ -> true; another alternative, Returns(true), will be available once Julia 1.7 is released.
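As a small Base-only sketch of why the constant predicate is preferable: `i -> i==i` is not true for every value (for example `NaN`), whereas `_ -> true` counts every element regardless of its contents. The sample data below is made up for illustration, and `eachline` stands in for `CSV.Rows` so the snippet runs without any packages.

```julia
# Assumed sample data; eachline stands in for a lazy row iterator such as CSV.Rows.
lines = "1,2,3,4\n4,5,6,7\n1,2,3,4\n"

# The constant predicate counts every element the iterator yields.
n_lambda = count(_ -> true, eachline(IOBuffer(lines)))
println("rows counted with _ -> true: ", n_lambda)

# Returns(true) only exists on Julia 1.7 and later, so guard its use.
if VERSION >= v"1.7"
    n_returns = count(Returns(true), eachline(IOBuffer(lines)))
    println("rows counted with Returns(true): ", n_returns)
end

# i -> i == i silently skips values that are not equal to themselves.
n_selfeq = count(i -> i == i, [1.0, NaN, 3.0])
println("elements counted with i -> i == i: ", n_selfeq, " of 3")
```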
