Is there a clever way to determine the number of rows in a large .csv or .tsv file without reading a whole column?
I’m currently doing:
CSV.File("big_file.csv") |> Tables.select(:first_column_name) |> DataFrame
which tells me what I need to know, but if there’s something which just counts the rows a little faster, I would love to know.
eh… you can just find out the number of lines in the file instead…?
julia> (open("./Final.csv") |> readlines |> length) -1
Be careful with this approach: it will give the wrong answer for CSV files that have line breaks inside some values.
Thanks, I figured it was easy, but basic googling did not reveal it.
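To make that caveat concrete, here is a small sketch using only Base. The file contents and the variable names are made up for illustration: a header plus a single data row whose second field contains a quoted line break.

```julia
# Hypothetical sample: one header line and ONE data row.
# The quoted field in that row spans two physical lines.
path = tempname()
write(path, "id,note\n1,\"first line\nsecond line\"\n")

# readlines splits on every line break, whether or not it is inside quotes.
physical_lines = length(readlines(path))

# Subtracting the header therefore overcounts the data rows for this file.
naive_rows = physical_lines - 1
```

Here the naive count comes out one higher than the number of records in the file, because the quoted line break is treated as a row boundary.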
I always think that wc -l is much faster, but of course Windows doesn’t come with such things pre-built, which is a shame.
As @aplavin mentioned, just doing readlines can be incorrect for csv files with quoted newline characters. Using the readlines function is also pretty wasteful and will gobble up a lot of memory for really large files, since it holds every line in memory at once. In Base, the countlines function will be much more efficient.
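A minimal sketch of the countlines route, assuming the file has exactly one header line and no line breaks inside quoted fields. The helper name count_data_rows and the sample file are just for illustration.

```julia
# Count data rows by counting physical lines and dropping the header.
# Assumes one header line and no quoted line breaks in the data.
function count_data_rows(path::AbstractString)
    return countlines(path) - 1
end

# Small demonstration on a temporary file with a header and two rows.
path = tempname()
write(path, "a,b\n1,2\n3,4\n")
nrows = count_data_rows(path)
```

Unlike readlines, countlines only scans for line endings and never builds a vector of strings, so memory use stays flat for big files.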
For a more general purpose solution for csv files that may contain quoted newline characters, this should be extremely fast/efficient:
n = 0
for row in CSV.Rows(file; resusebuffer=true)
    n += 1
end
UPDATE: @quinnj spells it resusebuffer=true, but I have found that reusebuffer=true works much better.
But other than that…
Amazing. This is gonna help so much with pre-allocating. Loading this and a first row reader into every CSV analysis from now on!!!
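For anyone landing here later, this is roughly how the CSV.Rows suggestion could be wrapped up for pre-allocation, with the keyword spelled reusebuffer. It is only a sketch: it assumes CSV.jl is installed, and the function name count_csv_rows, the sample file, and the Float64 buffer are hypothetical.

```julia
using CSV

# Count parsed rows by streaming through the file with CSV.Rows.
# Quoted line breaks are handled by the CSV parser, so they do not
# inflate the count the way a plain line count would.
function count_csv_rows(path::AbstractString)
    n = 0
    for _ in CSV.Rows(path; reusebuffer=true)
        n += 1
    end
    return n
end

# Hypothetical sample: two data rows, the first with a quoted line break.
path = tempname()
write(path, "id,note\n1,\"line one\nline two\"\n2,plain\n")

n = count_csv_rows(path)

# Use the count to pre-allocate storage before the real parsing pass.
buffer = Vector{Float64}(undef, n)
```

The row objects are never stored, only counted, which is why reusing the buffer is safe here.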