Hello,
I am new to Julia, so please bear with me. I am trying to read a CSV file containing a fairly large taxi dataset, about 124M lines. I would like to read it line by line. According to the documentation, I should use the CSV.Rows interface for this.
I would like to read the file and calculate the minimum and maximum values of two columns, namely the lat and long columns. I want to understand how wide the covered area is, based on the minimum and maximum lat/long values.
I have not been able to find a way to access the cells of a CSV.Row directly. I could not work out the data structure that CSV.Rows is composed of, so I had to resort to iterators to access the columns of a single row. I admit that I have not looked at the CSV source code, which would probably help me understand it.
How can I access a row cell directly without an iterator? Why is it not possible to access it like arow[rowindex, columnindex]?
I also tried to read about the Tables.jl interface, since CSV implements it. Unfortunately, I find it very unintuitive, or the documentation is lacking, at least for me.
My current implementation is below. What I do not like about it is that I am forced to use an iterator to read the columns in the first loop. I needed to initialize the four variables with values extracted from the first data row, so I had to use two loops, even though one of them runs only once, just to read the first line and initialize the variables.
In addition to the question above: even though the following is a throwaway script to extract just one piece of information, what would be a better or more concise way to write it? I am sure there are many.
using CSV
using DataFrames
using Dates
file = "allCars.csv"
column_types = Dict(1=>DateTime,2=>Float64,3=>Float64,4=>Int32)
read_rows(number_of_lines) = CSV.Rows(file,
                                       delim=',',
                                       skipto=2,
                                       limit=number_of_lines,
                                       header=[:date, :lat, :long, :vehicle],
                                       types=column_types,
                                       dateformat="yyyy-mm-dd HH:MM",
                                       reusebuffer=true)
for loc in read_rows(1)
  min_lat = loc.lat
  max_lat = loc.lat
  min_long = loc.long
  max_long = loc.long
  for loc in read_rows(typemax(Int64)) 
    if min_lat > loc.lat
      min_lat = loc.lat 
    end
    if max_lat < loc.lat 
      max_lat = loc.lat 
    end    
    if min_long > loc.long
      min_long = loc.long 
    end
    if max_long < loc.long 
      max_long = loc.long 
    end    
  end  
  
  println("min lat-long: $(min_lat):$(min_long), max lat-long $(max_lat):$(max_long)")
end
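
For comparison, here is a minimal sketch of a single-loop variant that avoids the separate initialization pass by seeding the minima with Inf and the maxima with -Inf. The helper name bounding_box and the sample data are hypothetical, made up for illustration. It assumes only that each row exposes lat and long as properties (as loc.lat and loc.long do in the script above) and that neither column contains missing values. It has not been run against the real file.

```julia
# Hypothetical helper: one pass over any iterable of rows that expose
# `lat` and `long` as properties. Assumes the values are never `missing`.
function bounding_box(rows)
    # Seed so that the first row always replaces the initial values.
    min_lat, max_lat = Inf, -Inf
    min_long, max_long = Inf, -Inf
    for loc in rows
        min_lat = min(min_lat, loc.lat)
        max_lat = max(max_lat, loc.lat)
        min_long = min(min_long, loc.long)
        max_long = max(max_long, loc.long)
    end
    return (min_lat = min_lat, max_lat = max_lat,
            min_long = min_long, max_long = max_long)
end

# Made-up sample rows standing in for the rows of the CSV file.
sample = [(lat = 39.9, long = 116.3),
          (lat = 40.1, long = 116.5),
          (lat = 39.8, long = 116.4)]

box = bounding_box(sample)
println("min lat-long: $(box.min_lat):$(box.min_long), max lat-long $(box.max_lat):$(box.max_long)")
```

With the script above, the call would presumably become bounding_box(read_rows(typemax(Int64))). Wrapping the loop in a function also keeps the four accumulators out of global scope.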