Partition a large CSV file into smaller files without loading into memory

question
#1

I want to split (partition) a large csv file into separate smaller csv files based on the value of a particular column.

I also want to do this row by row, without loading the entire file into memory.

My current solution is using CSV and iterating over the rows, selecting the io stream based on the value of column col1. But I don’t know the best way to efficiently write that row. Currently I’m using Tables.eachcolumn as below:

getIO(col1) -> returns appropriate io stream (an open `csv` file)

csvfile = CSV.File(filename)
sch = Tables.schema(csvfile)

for row in csvfile
    #write(getIO(row.col1), row)  # I want to do something like this

    io = getIO(row.col1)
    Tables.eachcolumn(sch, row) do val, col, name
        print(io, val)
        col < numcols && print(io, ", ")
    end
    println(io)
end

Is there a better way?

#2

I just use julia’s run to call sprint.

Works on Windows too if u install git and set path to include all the bins from git.

See https://www.google.com/url?sa=t&source=web&rct=j&url=https://stackoverflow.com/questions/54956361/julia-how-to-execute-a-system-command-from-julia-code&ved=2ahUKEwj24eCaiPXgAhUTf30KHan4D2gQFjAAegQIBRAB&usg=AOvVaw3xxTqGVTL2DejClxfiZtfM

#3

If the first column is easy to parse (no escaping, etc.), then I’d skip using CSV so that you don’t need to parse the other columns:

julia> open("/tmp/t1.csv") do f1
           open("/tmp/t2.csv", "w") do f2
               for l in eachline(f1)
                   c1 = parse(Int64, l[1:findfirst(==(','), l) - 1])
                   if c1 == 2
                       write(f2, l, "\n")
                   end
               end
           end
       end
3 Likes
#4

Perhaps you meant split?
I would prefer a pure Julia solution though.

#5

Thanks Rob! This is much simpler and about 3x faster than my original method using CSV.

As an aside, I tried using SubString rather then indexing into the line because I thought it would save allocating a new string on each iteration. But there was no improvement (in fact slightly slower).

#6
run(`split large_file.csv`)

is as Julia as it comes

1 Like
#7

I would certainly not call it that. I’d call it a bash solution. It also doesn’t do what OP asks, which is to partition a file based on a column value. No doubt that can be achieved in bash with awk for example, but I too would prefer an implementation in Julia, for the following reasons:

  • The logic to parse and select file based on column can more easily be reused if written as a Julia function (like getIO in OP’s example) than embedded in an awk command
  • With a pure Julia solution, we can cleanly unit test this functionality with mock data, using in-memory IO streams
  • Guaranteed to work (and work the same) on all platforms

I’d second Rob’s solution.

2 Likes