Hello! I’m not sure what the best way to sort is in this case, but here is the plan: I would like to turn the block of code below into a loop (I was advised by @ChrisRackauckas not to loop through rows, and to use DataFramesMeta or Query instead).
using DataFrames
df = readtable("file1.csv", nrows=1000000, skipstart=0)
sort!(df, cols = [:Type])
writetable("sorted_file1_1.csv", df)
The idea is to read the 5.2 GB CSV file a million rows at a time, sort each chunk by the :Type column, write it out with writetable, then repeat for the next million rows via readtable("file1.csv", nrows=1000000, skipstart=1000000), and so on.
Two variables I can think of right away: skipstart= will increase by 1 million for each new file, and the writetable filename will count up from sorted_file1_1 by 1 for every new file (e.g. the next file would be named sorted_file1_2.csv), as in the sketch below.
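To make that concrete, here is a rough sketch of the loop I have in mind (nchunks is a placeholder for however many million-row pieces the file turns out to have; I am also not sure how skipstart interacts with the header line, so chunks after the first may need header=false plus explicit column names):

using DataFrames

nchunks = 6  # placeholder: roughly ceil(total rows / 1000000)
for i in 1:nchunks
    offset = (i - 1) * 1000000
    # NOTE: if readtable expects a header line after the skipped rows,
    # chunks with i > 1 may need header=false and explicit names
    df = readtable("file1.csv", nrows=1000000, skipstart=offset)
    sort!(df, cols=[:Type])
    writetable("sorted_file1_$i.csv", df)
end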
This approach is naive, because it doesn’t first determine the total number of rows in the file, break it up into approximately equal-sized files, and then sort all of the files.
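If it helps, the total row count could be found up front with countlines, assuming one CSV record per line (no embedded newlines inside quoted fields), and used to size near-equal chunks:

# count data rows (minus 1 for the header line), assuming no
# embedded newlines inside quoted fields
totalrows = countlines("file1.csv") - 1
nchunks = 6                           # however many pieces are wanted
chunksize = cld(totalrows, nchunks)   # ceiling division -> near-equal chunks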
After sorting, I need to read into a model only the df[:Type] .== "Trade" rows, using TextLineReader with the decode_csv operation from TensorFlow.jl.
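One idea I had (not sure it is the right place to filter): since each chunk is already in a DataFrame, the Trade rows could be written to separate files during the same loop, so that TextLineReader only ever sees pre-filtered CSVs. A sketch, with trades_file1_$i.csv as a placeholder filename:

# inside the loop above, after sort!(df, cols=[:Type]):
trades = df[df[:Type] .== "Trade", :]   # keep only the Trade rows
writetable("trades_file1_$i.csv", trades)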