Hello! I’m not sure what the best way to sort is in this case, but here is the plan: I would like to turn the block of code below into a loop (I was advised by @ChrisRackauckas not to loop through rows, and to use DataFramesMeta or Query):
```julia
using DataFrames
df = readtable("file1.csv", nrows = 1_000_000, skipstart = 0)
sort!(df, cols = [:Type])
writetable("sorted_file1_1.csv", df)
```
The idea is to read the 5.2 GB CSV file 1 million rows at a time, sort each chunk by the `:Type` column, write it out with `writetable`, then repeat for the next million rows via `readtable("file1.csv", nrows = 1_000_000, skipstart = 1_000_000)`.
Two variables I can think of right away are:

- `skipstart` will increase by 1 million for each new file.
- the filename passed to `writetable` will increase by 1 for each new file (e.g. the next file would be named `sorted_file1_2.csv`).
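Putting those two variables together, the loop might look something like the sketch below. It follows the `readtable`/`writetable` call pattern from the snippet above; `n_chunks` is a placeholder you would replace with the real chunk count, and the `isfile` guard just skips the work when the sample file isn't on disk:

```julia
using DataFrames

const CHUNK = 1_000_000

# rows to skip before chunk i (1-based): 0, 1_000_000, 2_000_000, ...
chunk_offset(i) = (i - 1) * CHUNK

# output name for chunk i: sorted_file1_1.csv, sorted_file1_2.csv, ...
chunk_name(i) = "sorted_file1_$(i).csv"

n_chunks = 6   # placeholder: compute this from the real row count
for i in 1:n_chunks
    isfile("file1.csv") || break   # skip when the sample file isn't present
    df = readtable("file1.csv", nrows = CHUNK, skipstart = chunk_offset(i))
    sort!(df, cols = [:Type])
    writetable(chunk_name(i), df)
end
```

One caveat with this pattern: once `skipstart` moves past the first line, the header row is no longer part of the chunk, so the later chunks may need explicit column names.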
This approach is naive, because it doesn’t first count the total number of rows in the file, break it into approximately equal-sized files, and then sort each of them.
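A less naive split would count the data rows first and divide them into near-equal chunks. A minimal sketch, assuming the file has a single header line (the fallback row count is just a placeholder used when the file isn't on disk):

```julia
chunk = 1_000_000

# count data rows: total lines minus one header line;
# the fallback number is an assumption for illustration only
nrows_total = isfile("file1.csv") ? countlines("file1.csv") - 1 : 5_000_000

n_chunks = cld(nrows_total, chunk)                 # ceiling division: last chunk may be smaller
last_chunk = nrows_total - (n_chunks - 1) * chunk  # rows in the final, possibly smaller chunk
```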
After sorting, I need to read and input into a model only the `df[:Type] .== "Trade"` rows, using `TextLineReader` with the `decode_csv` operation from TensorFlow.jl.