I want to split (partition) a large CSV file into separate smaller CSV files based on the value of a particular column. I also want to do this row by row, without loading the entire file into memory.
My current solution uses CSV.jl, iterating over the rows and selecting the IO stream based on the value of column `col1`. But I don't know the best way to write each row efficiently. Currently I'm using `Tables.eachcolumn`, as below:
```julia
using CSV, Tables

# getIO(col1) returns the appropriate IO stream (an open CSV file)

csvfile = CSV.File(filename)
sch = Tables.schema(csvfile)
numcols = length(sch.names)

for row in csvfile
    # write(getIO(row.col1), row)  # I want to do something like this
    io = getIO(row.col1)
    Tables.eachcolumn(sch, row) do val, col, name
        print(io, val)
        col < numcols && print(io, ", ")
    end
    println(io)
end
```
Is there a better way?