Hello,
I am new to Julia, so please bear with me. I am trying to read a fairly large CSV file of taxi data, about 124M lines, and I would like to read it line by line. According to the documentation, I should use the CSV.Rows interface for that.
I would like to read the file and calculate the minimum and maximum values of two columns, namely the lat and long columns. I want to understand how wide the covered area is, based on the minimum and maximum lat/long values.
I have not been able to find a way to access the cells of a CSV.Row directly, and I could not really comprehend the data structure that CSV.Rows is composed of. I had to resort to iterators to access the columns of a single row. I confess that I did not look at the CSV source code, which would probably help me understand it.
How can I access a row cell directly without an iterator? Why is it not possible to access it like arow[rowindex, columnindex]?
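To make the question concrete, here is a minimal sketch of the kind of direct cell access I have in mind. It assumes that each row produced by CSV.Rows follows the Tables.jl row interface, so that `Tables.getcolumn` accepts either a column name or a column position. The in-memory sample data is made up for illustration. Because CSV.Rows streams one row at a time, I would expect only a column index to make sense here, not a row index.

```julia
using CSV, Tables

# Tiny in-memory stand-in for the real file (made-up sample data).
data = IOBuffer("""
date,lat,long,vehicle
2020-01-01 10:00,41.0,29.0,1
2020-01-01 10:01,41.5,28.5,2
""")

# Parse only the lat/long columns as Float64; the rest stay as strings.
rows = CSV.Rows(data; types=Dict(2 => Float64, 3 => Float64))

lats = Float64[]
for row in rows
    # Assumption: a row supports access by property, by name, and by position.
    by_property = row.lat
    by_name     = Tables.getcolumn(row, :lat)
    by_position = Tables.getcolumn(row, 2)
    @assert by_property == by_name == by_position
    push!(lats, by_position)
end
```

If that assumption holds, the column can be chosen at run time by name or by number without writing an inner iterator over the row.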
I also tried to read about the Tables.jl interface, since CSV implements it. Unfortunately, I find it very unintuitive, or the documentation is lacking, at least for me.
My current implementation is as follows. What I do not like about it is that I am forced to use an iterator to read the columns in the first loop. I needed to initialize the four variables with the values extracted from the first data row, so I had to use two loops, even though one of them runs only once, just to read the first line and initialize the variables.
In addition to my question above: even though the following is a throwaway script to extract just one piece of information, what would be a better or more concise way to write it? I am sure there are many.
using CSV
using DataFrames
using Dates

file = "allCars.csv"
column_types = Dict(1=>DateTime, 2=>Float64, 3=>Float64, 4=>Int32)

# Build a row iterator over the first `number_of_lines` data rows.
read_rows(number_of_lines) = CSV.Rows(file,
    delim=',',
    skipto=2,
    limit=number_of_lines,
    header=[:date, :lat, :long, :vehicle],
    types=column_types,
    dateformat="yyyy-mm-dd HH:MM",
    reusebuffer=true)

# Outer loop runs exactly once, only to seed the four variables
# from the first data row.
for loc in read_rows(1)
    min_lat = loc.lat
    max_lat = loc.lat
    min_long = loc.long
    max_long = loc.long

    # Inner loop scans the whole file and updates the running extremes.
    for loc in read_rows(typemax(Int64))
        if min_lat > loc.lat
            min_lat = loc.lat
        end
        if max_lat < loc.lat
            max_lat = loc.lat
        end
        if min_long > loc.long
            min_long = loc.long
        end
        if max_long < loc.long
            max_long = loc.long
        end
    end

    println("min lat-long: $(min_lat):$(min_long), max lat-long $(max_lat):$(max_long)")
end
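One possible way to avoid the seeding loop, sketched here under a few assumptions: the extremes start at `Inf` and `-Inf`, so the first row always replaces them, and the columns contain no missing values. The helper name `bounding_box` and the sample rows are made up for illustration. The function only relies on each row exposing `.lat` and `.long`, which is how the script above already reads them.

```julia
# Single pass over any iterable of rows that expose `.lat` and `.long`.
# `bounding_box` is a hypothetical helper name, not part of CSV.jl.
function bounding_box(rows)
    # Sentinels: any real coordinate is smaller than Inf and larger than -Inf.
    min_lat, max_lat = Inf, -Inf
    min_long, max_long = Inf, -Inf
    for loc in rows
        min_lat = min(min_lat, loc.lat)
        max_lat = max(max_lat, loc.lat)
        min_long = min(min_long, loc.long)
        max_long = max(max_long, loc.long)
    end
    return (min_lat = min_lat, max_lat = max_lat,
            min_long = min_long, max_long = max_long)
end

# Stand-in rows (made-up values) so the sketch runs without the real file.
sample = [(lat = 41.0, long = 29.0),
          (lat = 40.5, long = 29.4),
          (lat = 41.2, long = 28.7)]

box = bounding_box(sample)
println("min lat-long: $(box.min_lat):$(box.min_long), max lat-long: $(box.max_lat):$(box.max_long)")
```

With the script above, the call would presumably be `bounding_box(read_rows(typemax(Int64)))`. Wrapping the loop in a function also keeps the four variables local, which tends to matter for performance in Julia.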