I want to split (partition) a large CSV file into separate smaller CSV files based on the value of a particular column.
I also want to do this row by row, without loading the entire file into memory.
My current solution uses CSV.jl, iterating over the rows and selecting the output IO stream based on the value of the column col1. But I don’t know the best way to efficiently write each row; currently I’m using Tables.eachcolumn as below:
using CSV, Tables

# getIO(col1) -> returns the appropriate output IO stream (an open `csv` file) for that value of col1

csvfile = CSV.File(filename)
sch = Tables.schema(csvfile)
numcols = length(sch.names)
for row in csvfile
    # write(getIO(row.col1), row)  # I want to do something like this
    io = getIO(row.col1)
    Tables.eachcolumn(sch, row) do val, col, name
        print(io, val)
        col < numcols && print(io, ", ")
    end
    println(io)
end
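For reference, getIO could be implemented along these lines. This is only a sketch, not the code I actually have: the Dict cache and the part_<value>.csv naming scheme are just illustrative.

# Sketch only: lazily open and cache one output file per distinct value of col1.
const output_files = Dict{String,IOStream}()

function getIO(col1)
    get!(output_files, string(col1)) do
        open("part_$(col1).csv", "w")   # illustrative naming scheme
    end
end

# remember to close every cached handle when finished:
# foreach(close, values(output_files))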
If the first column is easy to parse (no escaping, etc.), then I’d skip using CSV so that you don’t need to parse the other columns:
julia> open("/tmp/t1.csv") do f1
           open("/tmp/t2.csv", "w") do f2
               for l in eachline(f1)
                   c1 = parse(Int64, l[1:findfirst(==(','), l) - 1])
                   if c1 == 2
                       write(f2, l, "\n")
                   end
               end
           end
       end
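To actually partition by the column value rather than filter on a single one, the same line-based loop could route each raw line through a getIO-style cache. A sketch, assuming getIO caches one open output file per value as in your example:

open("/tmp/t1.csv") do f1
    for l in eachline(f1)
        c1 = parse(Int64, l[1:findfirst(==(','), l) - 1])
        write(getIO(c1), l, "\n")   # write the raw line; no need to parse the other columns
    end
end
# close the cached output files afterwards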
Thanks Rob! This is much simpler and about 3x faster than my original method using CSV.
As an aside, I tried using SubString rather than indexing into the line, because I thought it would save allocating a new string on each iteration. But there was no improvement (in fact it was slightly slower).
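Concretely, the comparison was along these lines (a reconstruction, not the exact code I benchmarked):

# indexing into the line (allocates a new String for the slice)
c1 = parse(Int64, l[1:findfirst(==(','), l) - 1])

# SubString variant (avoids copying the slice, but was not faster in practice)
c1 = parse(Int64, SubString(l, 1, findfirst(==(','), l) - 1))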
I would certainly not call it that; I’d call it a bash solution. It also doesn’t do what the OP asks, which is to partition a file based on a column value. No doubt that can be achieved in bash with awk, for example, but I too would prefer an implementation in Julia, for the following reasons:
- The logic to parse the key column and select the output file can be reused more easily if written as a Julia function (like getIO in the OP’s example) than if embedded in an awk command.
- With a pure Julia solution, we can cleanly unit test this functionality with mock data, using in-memory IO streams (see the sketch below).
- It is guaranteed to work (and work the same) on all platforms.
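To illustrate the second point, here is a hypothetical sketch (the name partition_csv and its signature are made up for illustration): the routing logic is factored into a function taking an input stream and a key-to-stream callback, which can then be exercised entirely in memory with IOBuffers.

# Hypothetical: the row-routing logic as a function of an input stream and a
# key -> output-stream callback, so it can be tested without touching the filesystem.
function partition_csv(input::IO, getio)
    for l in eachline(input)
        key = l[1:findfirst(==(','), l) - 1]
        write(getio(key), l, "\n")
    end
end

# in-memory test with mock data
input = IOBuffer("1,a\n2,b\n1,c\n")
outputs = Dict{String,IOBuffer}()
partition_csv(input, key -> get!(IOBuffer, outputs, key))
@assert String(take!(outputs["1"])) == "1,a\n1,c\n"
@assert String(take!(outputs["2"])) == "2,b\n"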