Concatenate csv files without loading them

I’m wondering what the best way would be to concatenate multiple CSV files – all of which contain the same header row (they are compatible) – into one CSV file.

I know I can do something like (taken from @c42f’s one-liner):

vcat(CSV.read.(file_names)...) |> CSV.write("one_big_file.csv")

But since I never need the data loaded, is there a faster way that avoids fully parsing the data…?

1 Like

See my answer here on removing the first line from a text file. Don’t parse the data, just copy it blindly to a new file, but skip the first (header) line for everything but the first file.

4 Likes

Very handy @stevengj ! But we still have the issue of merging all these files without loading them into memory, right @yakir12 ? What’s the best way to solve that?

Steve’s nice solution using: Iterators.drop(eachline(input), 1) reads and writes the files line by line.

I have just tried it to merge two 47GB files and it seemed to work, taking ~10 min on my PC laptop.

After you read the first line to strip off the header, it will be a lot faster to read the file in chunks, as in this example: How to obtain the result of a diff between 2 files in a loop? - #4 by stevengj

1 Like

In the example linked, the chunks seem to be 32768 bytes long. Why this value?

It’s a power of 2 and a common size for L1 cache.

1 Like

Thanks Oscar.
In my laptop I see this:

L1_L2_L3_cache

Does it mean that I should use a chunk size = 2^18 = 262144 (< 320 KB)?

it probably will be a minor difference, but feel free to do some benchmarks…

1 Like

@Oscar_Smith, fyi, I’ve observed ~22% speed gains on my laptop (with L1 cache = 320 KB) when using a chunk size of 262_144 bytes instead of 32_768 bytes.

In any case, doing it by chunks seemed much faster than doing it line by line. The code used to merge 2 x 37 GB csv files was adapted from Steve’s original and is provided below.

Original code by @stevengj (adapted)
open("merge_two_37GB.csv", "w") do output
   isfirst = true
   for file in files
      open(file, "r") do input
         if isfirst
            println(output, readline(input))
            isfirst=false
         else
            readline(input)      # read repeated CSV header but
            println(output)      # print carriage return only
         end
         buf = Vector{UInt8}(undef, 262144)   # L1 cache: 262144 => 252 s; 32768 => 323 s
         while !eof(input)
            nb = readbytes!(input, buf)
            write(output, view(buf,1:nb))
         end
      end
   end
end
2 Likes

This approach worked like a charm!!! so fast! thanks!