I want to split (partition) a large CSV file into separate smaller CSV files based on the value of a particular column.
I also want to do this row by row, without loading the entire file into memory.
My current solution uses CSV.jl, iterating over the rows and selecting the output IO stream based on the value of the column col1. But I don’t know the best way to efficiently write each row; currently I’m using Tables.eachcolumn as below:
using CSV, Tables

# getIO(col1) -> returns the appropriate output IO stream (an open `csv` file) for that value of col1

csvfile = CSV.File(filename)
sch = Tables.schema(csvfile)
numcols = length(sch.names)
for row in csvfile
    # write(getIO(row.col1), row)  # I want to do something like this
    io = getIO(row.col1)
    Tables.eachcolumn(sch, row) do val, col, name
        print(io, val)
        col < numcols && print(io, ", ")
    end
    println(io)
end
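For reference, getIO could be implemented along these lines. This is only a sketch, not the code I actually have: the Dict cache and the part_<value>.csv naming scheme are just illustrative.

# Sketch only: lazily open and cache one output file per distinct value of col1.
const output_files = Dict{String,IOStream}()

function getIO(col1)
    get!(output_files, string(col1)) do
        open("part_$(col1).csv", "w")   # illustrative naming scheme
    end
end

# remember to close every cached handle when finished:
# foreach(close, values(output_files))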
If the first column is easy to parse (no escaping, etc.), then I’d skip using CSV so that you don’t need to parse the other columns:
julia> open("/tmp/t1.csv") do f1
           open("/tmp/t2.csv", "w") do f2
               for l in eachline(f1)
                   c1 = parse(Int64, l[1:findfirst(==(','), l) - 1])
                   if c1 == 2
                       write(f2, l, "\n")
                   end
               end
           end
       end
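To actually partition by the column value rather than filter on a single one, the same line-based loop could route each raw line through a getIO-style cache. A sketch, assuming getIO caches one open output file per value as in your example:

open("/tmp/t1.csv") do f1
    for l in eachline(f1)
        c1 = parse(Int64, l[1:findfirst(==(','), l) - 1])
        write(getIO(c1), l, "\n")   # write the raw line; no need to parse the other columns
    end
end
# close the cached output files afterwards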
Thanks Rob! This is much simpler and about 3x faster than my original method using CSV.
As an aside, I tried using SubString rather than indexing into the line, because I thought it would save allocating a new string on each iteration. But there was no improvement (in fact it was slightly slower).
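Concretely, the comparison was along these lines (a reconstruction, not the exact code I benchmarked):

# indexing into the line (allocates a new String for the slice)
c1 = parse(Int64, l[1:findfirst(==(','), l) - 1])

# SubString variant (avoids copying the slice, but was not faster in practice)
c1 = parse(Int64, SubString(l, 1, findfirst(==(','), l) - 1))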
I would certainly not call it that; I’d call it a bash solution. It also doesn’t do what the OP asks, which is to partition a file based on a column value. No doubt that can be achieved in bash with awk, for example, but I too would prefer an implementation in Julia, for the following reasons:
- The logic to parse the key column and select the output file can be reused more easily if written as a Julia function (like getIO in the OP’s example) than if embedded in an awk command.
- With a pure Julia solution, we can cleanly unit test this functionality with mock data, using in-memory IO streams (see the sketch below).
- It is guaranteed to work (and work the same) on all platforms.
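To illustrate the second point, here is a hypothetical sketch (the name partition_csv and its signature are made up for illustration): the routing logic is factored into a function taking an input stream and a key-to-stream callback, which can then be exercised entirely in memory with IOBuffers.

# Hypothetical: the row-routing logic as a function of an input stream and a
# key -> output-stream callback, so it can be tested without touching the filesystem.
function partition_csv(input::IO, getio)
    for l in eachline(input)
        key = l[1:findfirst(==(','), l) - 1]
        write(getio(key), l, "\n")
    end
end

# in-memory test with mock data
input = IOBuffer("1,a\n2,b\n1,c\n")
outputs = Dict{String,IOBuffer}()
partition_csv(input, key -> get!(IOBuffer, outputs, key))
@assert String(take!(outputs["1"])) == "1,a\n1,c\n"
@assert String(take!(outputs["2"])) == "2,b\n"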