I’m wondering what the best way would be to concatenate multiple CSV files – all of which contain the same header row (they are compatible) – into one CSV file.
I know I can do something like (taken from @c42f’s one-liner):
vcat(CSV.read.(file_names)...) |> CSV.write("one_big_file.csv")
But since I never need the data loaded, is there a faster way that avoids fully parsing the data…?
See my answer here on removing the first line from a text file. Don’t parse the data, just copy it blindly to a new file, but skip the first (header) line for everything but the first file.
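That suggestion can be sketched as follows. This is a minimal, self-contained example (the tiny sample files and the output name are placeholders for illustration, not from the original posts):

```julia
# Two tiny sample files with the same header, created here so the
# sketch is runnable; substitute your own list of CSV paths.
write("a.csv", "x,y\n1,2\n")
write("b.csv", "x,y\n3,4\n")
file_names = ["a.csv", "b.csv"]

open("one_big_file.csv", "w") do output
    for (i, file) in enumerate(file_names)
        open(file, "r") do input
            # keep the header only from the first file
            lines = i == 1 ? eachline(input) : Iterators.drop(eachline(input), 1)
            for line in lines
                println(output, line)
            end
        end
    end
end
```

No CSV parsing happens here: each line is copied verbatim, so the files only need byte-compatible headers.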
Very handy, @stevengj! But we still have the issue of merging all these files without loading them into memory, right @yakir12? What’s the best way to solve that?
Steve’s nice solution, using Iterators.drop(eachline(input), 1), reads and writes the files line by line.
I have just tried it to merge two 47GB files and it seemed to work, taking ~10 min on my PC laptop.
After you read the first line to strip off the header, it will be a lot faster to read the file in chunks, as in this example: How to obtain the result of a diff between 2 files in a loop? - #4 by stevengj
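The chunked-copy idea can be sketched for a single file like this (a minimal sketch; the file names and the 32 KiB buffer size are illustrative assumptions, and a sample file is generated so the snippet runs on its own):

```julia
# Generate a small sample file; in practice this would be a large CSV.
write("big.csv", "x,y\n" * join(("$(i),$(i)" for i in 1:100), "\n") * "\n")

open("copy.csv", "w") do output
    open("big.csv", "r") do input
        println(output, readline(input))   # consume and re-emit the header
        buf = Vector{UInt8}(undef, 32768)  # reusable chunk buffer (32 KiB)
        while !eof(input)
            nb = readbytes!(input, buf)    # number of bytes actually read
            write(output, view(buf, 1:nb)) # write only the filled portion
        end
    end
end
```

For all files after the first, the `println` of the header would be replaced by a bare `readline(input)` that discards it, as in the full code further down the thread.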
In the linked example, the chunks seem to be 32768 bytes long. Why this value?
It’s a power of 2 and a common size for L1 cache.
Thanks Oscar.
On my laptop the L1 cache is 320 KB. Does that mean I should use a chunk size of 2^18 = 262144 bytes (< 320 KB)?
It will probably make only a minor difference, but feel free to run some benchmarks…
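Such a benchmark could be sketched like this (a rough, illustrative setup: the file contents, sizes, and the use of @time are assumptions; for serious measurements, much larger files and something like BenchmarkTools.jl would be more reliable):

```julia
# Copy src to dst in chunks of `bufsize` bytes.
function copy_chunked(src, dst, bufsize)
    buf = Vector{UInt8}(undef, bufsize)
    open(dst, "w") do output
        open(src, "r") do input
            while !eof(input)
                nb = readbytes!(input, buf)
                write(output, view(buf, 1:nb))
            end
        end
    end
end

# Small synthetic file; real benchmarks need files large enough
# to dominate timing noise.
write("bench.csv", repeat("1,2,3\n", 10^5))
for bufsize in (32_768, 262_144)
    @time copy_chunked("bench.csv", "out.csv", bufsize)
end
```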
@Oscar_Smith, FYI, I’ve observed ~22% speed gains on my laptop (L1 cache = 320 KB) when using a chunk size of 262_144 bytes instead of 32_768 bytes.
In any case, copying by chunks was much faster than going line by line. The code used to merge 2 x 37 GB CSV files, adapted from Steve’s original, is provided below.
Original code by @stevengj (adapted)
open("merge_two_37GB.csv", "w") do output
    isfirst = true
    for file in files
        open(file, "r") do input
            if isfirst
                println(output, readline(input)) # keep the header from the first file
                isfirst = false
            else
                readline(input)  # skip the repeated CSV header, but
                println(output)  # write a newline only, in case the previous
                                 # file lacked a trailing newline
            end
            buf = Vector{UInt8}(undef, 262144) # L1 cache: 262144 => 252 s; 32768 => 323 s
            while !eof(input)
                nb = readbytes!(input, buf)
                write(output, view(buf, 1:nb))
            end
        end
    end
end
This approach worked like a charm! So fast, thanks!