.csv number of rows

Is there a clever way to determine the number of rows in a large .csv or .tsv file without reading a whole column?

I’m currently doing:

CSV.File("big_file.csv") |> Tables.select(:first_column_name) |> DataFrame

which tells me what I need to know, but if there’s something which just counts the rows a little faster, I would love to know.

eh… you can just find out the number of lines in the file instead…?

julia> (open("./Final.csv") |> readlines |> length) -1
1936

Be careful with this approach: it will give the wrong answer for CSV files that have line breaks inside some values.
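
To make the pitfall concrete, here is a minimal sketch using only Base. The file contents and the variable names (`path`, `naive_rows`) are made up for illustration: the file has one header line and a single data row whose second field contains a quoted line break.

```julia
# Hypothetical two-column file: one header line and ONE data row whose
# second field contains a quoted line break.
path = tempname()
write(path, "id,note\n1,\"first line\nsecond line\"\n")

# Counting physical lines and subtracting the header sees three lines,
# so it reports two rows even though the file holds only one record.
naive_rows = length(readlines(path)) - 1
println(naive_rows)
```

A CSV-aware parser would treat the quoted field as a single value and count one row here.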


Thanks, I figured it was easy, but basic googling did not reveal it.

I always think that wc -l is much faster, but of course Windows doesn’t come with such tools preinstalled, which is a shame.

As @aplavin mentioned, just doing readlines can be incorrect for csv files w/ quoted newline characters. Using the readlines function is also pretty wasteful and will gobble up a lot of memory for really large files. In Base, the countlines function will be much more efficient.
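
A rough sketch of the countlines route, assuming the file has exactly one header line and no quoted newlines inside fields. The helper name `count_data_lines` and the demo file are hypothetical, not part of Base:

```julia
# Count data rows as (physical lines - 1 header line). countlines streams
# through the file instead of materializing every line as a String.
# Assumption: no quoted field contains a newline.
count_data_lines(path) = max(countlines(path) - 1, 0)

# Small demo on a temporary file with made-up contents.
path = tempname()
write(path, "a,b\n1,2\n3,4\n")
println(count_data_lines(path))
```

Like readlines, this counts physical lines, so it will still overcount if any value spans several lines.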

For a more general purpose solution for csv files that may contain quoted newline characters, this should be extremely fast/efficient:

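using CSV

function countcsvlines(file)
    n = 0
    # CSV.Rows streams the file row by row and handles quoted newlines;
    # reusebuffer=true reuses one row buffer instead of allocating per row.
    for row in CSV.Rows(file; reusebuffer=true)
        n += 1
    end
    return n
end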

UPDATE: @quinnj's snippet spelled the keyword resusebuffer=true.
I have found that reusebuffer=true works much better.

But other than that…
Amazing. This is gonna help so much with pre-allocating. Loading this and a first row reader into every CSV analysis from now on!!!