Concatenate CSV files without loading them

I’m wondering what the best way would be to concatenate multiple CSV files – all of which contain the same header row (they are compatible) – into one CSV file.

I know I can do something like (adapted from @Chris_Foster’s one-liner; recent versions of CSV.jl require a sink argument such as DataFrame):

vcat(CSV.read.(file_names, DataFrame)...) |> CSV.write("one_big_file.csv")

But since I never need the data loaded, is there a faster way that avoids fully parsing the data…?

See my answer here on removing the first line from a text file. Don’t parse the data, just copy it blindly to a new file, but skip the first (header) line for everything but the first file.
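A minimal sketch of that approach (the file names here are toy examples, not from the original post): copy each line verbatim to the output, skipping the header line of every file after the first.

```julia
# Toy setup so the sketch is runnable: two small CSVs with the same header.
write("part1.csv", "a,b\n1,2\n")
write("part2.csv", "a,b\n3,4\n")
file_names = ["part1.csv", "part2.csv"]

# Merge by copying raw lines, never parsing the CSV fields.
open("one_big_file.csv", "w") do output
    for (i, file) in enumerate(file_names)
        open(file) do input
            # Skip the header line for every file except the first.
            lines = i == 1 ? eachline(input) : Iterators.drop(eachline(input), 1)
            for line in lines
                println(output, line)
            end
        end
    end
end
```

After this, one_big_file.csv contains a single header row followed by the data rows of both files.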

Very handy, @stevengj! But we still have the issue of merging all these files without loading them into memory, right, @yakir12? What’s the best way to solve that?

Steve’s nice solution, using Iterators.drop(eachline(input), 1), reads and writes the files line by line.

I have just tried it to merge two 47GB files and it seemed to work, taking ~10 min on my PC laptop.

After you read the first line to strip off the header, it will be a lot faster to read the file in chunks, as in this example: How to obtain the result of a diff between 2 files in a loop? - #4 by stevengj
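The chunked copy from that linked example can be sketched as a small helper (the name copyrest! is mine, not from the post): after readline has consumed the header, stream the remaining bytes in fixed-size binary chunks rather than line by line.

```julia
# Hypothetical helper (name is illustrative): copy everything remaining
# in `input` to `output` in fixed-size binary chunks.
function copyrest!(output::IO, input::IO; bufsize::Int = 32768)
    buf = Vector{UInt8}(undef, bufsize)
    while !eof(input)
        nb = readbytes!(input, buf)       # read up to bufsize bytes
        write(output, view(buf, 1:nb))    # write only the bytes actually read
    end
    return output
end
```

Usage: call readline(input) once to drop the header, then copyrest!(output, input).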

In the example linked, the chunks seem to be 32768 bytes long. Why this value?

It’s a power of 2 and a common size for L1 cache.

Thanks Oscar.
In my laptop I see this:

(screenshot of CPU cache sizes: L1 = 320 KB)

Does it mean that I should use a chunk size = 2^18 = 262144 (< 320 KB)?

It will probably make only a minor difference, but feel free to do some benchmarks…
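One way to run such a benchmark (file names and the small test file are illustrative; for meaningful numbers you would use a file far larger than the OS page cache):

```julia
# Compare chunked-copy times for two buffer sizes on the same file.
function chunked_copy_time(src::String, dst::String, bufsize::Int)
    buf = Vector{UInt8}(undef, bufsize)
    open(src) do input
        open(dst, "w") do output
            @elapsed while !eof(input)
                nb = readbytes!(input, buf)
                write(output, view(buf, 1:nb))
            end
        end
    end
end

# Toy input; real benchmarks need GB-scale data to see the difference.
testfile = "bench_input.csv"
write(testfile, "h1,h2\n" * repeat("1,2\n", 10_000))
t_small = chunked_copy_time(testfile, "out_small.csv", 32_768)
t_big   = chunked_copy_time(testfile, "out_big.csv", 262_144)
println("32 KiB: $(t_small)s, 256 KiB: $(t_big)s")
```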

@Oscar_Smith, FYI, I’ve observed ~22% speed gains on my laptop (L1 cache = 320 KB) when using a chunk size of 262_144 bytes instead of 32_768 bytes.

In any case, doing it in chunks was much faster than doing it line by line. The code used to merge two 37 GB CSV files was adapted from Steve’s original and is provided below.

Original code by @stevengj (adapted)
open("merge_two_37GB.csv", "w") do output
   buf = Vector{UInt8}(undef, 262144)   # chunk size: 262144 => 252 s; 32768 => 323 s
   for (i, file) in enumerate(files)
      open(file, "r") do input
         header = readline(input)            # consume the (repeated) CSV header
         i == 1 && println(output, header)   # write it only for the first file
         # note: assumes each input file ends with a newline; otherwise the
         # first row of the next file would run onto the same output line
         while !eof(input)                   # stream the rest in binary chunks
            nb = readbytes!(input, buf)
            write(output, view(buf, 1:nb))
         end
      end
   end
end

This approach worked like a charm!!! so fast! thanks!