CSV.Rows usage & Tables.jl interface


Hello,

I am a newbie in Julia, so please bear with me. I am trying to read a CSV file containing fairly large taxi data, about 124M lines. I would like to read it line by line, and according to the documentation I should be using the CSV.Rows interface for that.

I would like to read the file and calculate the minimum and maximum values in two columns, the lat and long columns, to understand how wide the covered area is based on those minimum and maximum lat/long values.

I have not been able to find a way to directly access the cells of a CSV.Row, and I really could not comprehend the data structure that CSV.Rows is composed of. I had to resort to iterators to access the columns of a single row. I confess that I did not look at the CSV source code, which would probably help me understand it.

How can I access a row cell directly, without an iterator? Why is it not possible to access it like arow[rowindex, columnindex]?

I tried to read the Tables.jl interface documentation as well, since CSV implements it. Unfortunately I find it very unintuitive, or at least the documentation is lacking for me.

My current implementation is as follows. I did not like being forced to use iterators to read the columns in the first loop: I needed to initialize the four variables with the values extracted from the first data row, so I had to use two loops, even though one of them runs only once, just to read the first line and initialize the variables.

In addition to my question above: even though the following is a throwaway script to extract just one piece of information, what would be a better or more concise way to write it? I am sure there are many.

using CSV
using Dates

file = "allCars.csv"
# Column types: date, latitude, longitude, vehicle id
column_types = Dict(1 => DateTime, 2 => Float64, 3 => Float64, 4 => Int32)
read_rows(number_of_lines) = CSV.Rows(file,
                                      delim=',',
                                      skipto=2,
                                      limit=number_of_lines,
                                      header=[:date, :lat, :long, :vehicle],
                                      types=column_types,
                                      dateformat="yyyy-mm-dd HH:MM",
                                      reusebuffer=true)

# First loop runs only once: initialize the extrema from the first data row.
for loc in read_rows(1)
  min_lat = loc.lat
  max_lat = loc.lat
  min_long = loc.long
  max_long = loc.long

  # Second loop scans the whole file, updating the running extrema.
  for loc in read_rows(typemax(Int64))
    if min_lat > loc.lat
      min_lat = loc.lat
    end
    if max_lat < loc.lat
      max_lat = loc.lat
    end
    if min_long > loc.long
      min_long = loc.long
    end
    if max_long < loc.long
      max_long = loc.long
    end
  end

  println("min lat-long: $(min_lat):$(min_long), max lat-long $(max_lat):$(max_long)")
end

It looks like you are trying to work around the way for-loop scoping works in Julia. See this post for workarounds. Just put your code in a function and you won’t have to worry about it.
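For example, a minimal sketch of that suggestion (untested, reusing the read_rows helper from your post): wrapping the whole scan in a function makes the loop variables ordinary locals, and initializing the extrema with Inf/-Inf removes the need for the separate first loop.

function bounding_box()
    # Locals in a function body: no global/soft-scope surprises in the loop.
    min_lat = min_long = Inf
    max_lat = max_long = -Inf
    for loc in read_rows(typemax(Int64))
        min_lat = min(min_lat, loc.lat)
        max_lat = max(max_lat, loc.lat)
        min_long = min(min_long, loc.long)
        max_long = max(max_long, loc.long)
    end
    println("min lat-long: $(min_lat):$(min_long), max lat-long $(max_lat):$(max_long)")
end

bounding_box()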


Thank you for the response. Yes, that is one of the reasons, but my question is different. Thanks for the link, I will definitely read it.

The question here is: can you afford to load the whole file into memory? If so, accessing cells like this is easily possible by loading the data into a Tables.jl-compatible object (e.g. a DataFrame):

using CSV, DataFrames
df = DataFrame(CSV.File("allCars.csv"))

df[rowindex, columnindex]
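With the whole table in memory, the extrema would then be one call away. A sketch, assuming the file's header row actually names these columns lat and long (in the original script custom names were supplied instead):

extrema(df.lat)   # (minimum, maximum) latitude
extrema(df.long)  # (minimum, maximum) longitude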

If you cannot afford to load the whole file into memory, then iterating over it line by line is more or less the best you can do, no?

You can avoid that by initializing min_lat etc. with e.g. typemax(Float64).
But maybe better, have a look at OnlineStats.jl (https://github.com/joshday/OnlineStats.jl), which is designed exactly for this use case: calculating statistics on data (such as minimum, maximum, mean, etc.) when the data does not fit into memory.
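For instance, a minimal sketch (untested) of how OnlineStats.jl could track both extrema in a single pass over the file, reusing the read_rows helper from the original post; Extrema, fit! and value are the library's single-pass statistic, update, and accessor functions:

using OnlineStats

lat_stat = Extrema()
long_stat = Extrema()
for loc in read_rows(typemax(Int64))
    fit!(lat_stat, loc.lat)    # update running (min, max) for latitude
    fit!(long_stat, loc.long)  # update running (min, max) for longitude
end
println("lat: $(value(lat_stat)), long: $(value(long_stat))")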


You can just do extrema(r -> r.lat, rows) and extrema(r -> r.long, rows), or separately minimum(r -> r.lat, rows) and maximum(r -> r.lat, rows). Here rows is the object returned by the CSV.Rows(...) call.
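In this thread's terms, that would look something like the following (untested; it assumes the rows object can be iterated more than once, and each call makes a separate pass over the file):

rows = read_rows(typemax(Int64))
min_lat, max_lat = extrema(r -> r.lat, rows)     # first pass over the file
min_long, max_long = extrema(r -> r.long, rows)  # second pass over the file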


Hello, thank you for the response. I used this approach to avoid loading 123M lines into a DataFrame. If I initialize the min or max externally with the type's min or max value, it stays the same, since the values in the dataset will never be smaller than typemin or larger than typemax. I will look at the libraries. Thanks again.

I have checked out the OnlineStats.jl library. It is a wonderful library and I changed my code accordingly. I deeply thank you for pointing it out. I will post my final code.


It seems you’ve resolved your issue in other ways, but just as a comment: the CSV.Rows object is specifically designed not to provide random access to cells, only access to the current row being iterated. So while iterating like:

for loc in read_rows(typemax(Int64))
    # loc is a row
end

you can access the individual cells of the current row via the loc object, as you’ve seen, by doing loc.lat or loc.long. You can also access individual values by using getindex, like loc[1] or loc[:lat], and you can see all possible column names by doing propertynames(loc).
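Put together, a small sketch of those access patterns (using the read_rows helper from the original post):

for loc in read_rows(2)
    @show propertynames(loc)         # (:date, :lat, :long, :vehicle)
    @show loc.lat loc[:long] loc[1]  # property, symbol, and integer indexing
end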

Happy to help answer any other questions about CSV.Rows if that’s helpful; it’s just meant to be a more memory-efficient structure for iterating over large CSV files without needing to materialize the entire file in memory at once.


Perhaps using the reduce function could give your code a more compact form, if that’s what you were looking for.
For example, in the following way (which I haven’t tested, and I don’t know the details of the read_rows function well):

extr(coord) = reduce((l, lcurr) -> (m = min(l.m, lcurr[coord]),
                                    M = max(l.M, lcurr[coord])),
                     read_rows(typemax(Int64)),
                     init = (m = Inf, M = -Inf))
extr(:lat)
extr(:long)
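Note that each extr call re-reads the whole file, so for 124M lines a single-pass variant that folds both coordinates at once might be worth it. Again an untested sketch in the same style:

box = reduce(read_rows(typemax(Int64));
             init = (min_lat = Inf, max_lat = -Inf, min_long = Inf, max_long = -Inf)) do l, r
    # Update all four running extrema from the current row r in one pass.
    (min_lat = min(l.min_lat, r.lat),    max_lat = max(l.max_lat, r.lat),
     min_long = min(l.min_long, r.long), max_long = max(l.max_long, r.long))
end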