Hi!
I have large CSV/TSV tables and I want to change some column names using a Julia program.
I think that loading these large tables into DataFrames would consume a lot of RAM and be slow as well.
My idea was to use readdlm()/writedlm() on each line of the file, but reading and then rewriting every line seems inefficient, because 99% of the file is unchanged.
To extend the discussion: when one wants to modify a file at the very beginning or at a very specific position, is there a way to efficiently copy a large block of text that is unchanged? Or is the only possibility to parse the file line by line? In that case, is readdlm()/writedlm() the best solution for text files?
Thanks for your comments!
io = open("file.csv", "r+")
# find the position just after the first tab
while read(io, Char) != '\t'
end
tabpos = position(io)
println(tabpos) # 3
# find the position just after the first newline
while read(io, Char) != '\n'
end
nlpos = position(io)
println(nlpos) # 27
# position writing just after the first tab
seek(io, tabpos)
# write the new column names (overwrites the existing characters in place)
write(io, "col1\tcol2\tcol3")
close(io)
That’s not possible, but you could overwrite the remaining characters with spaces. This is of course not a general solution, just something you can do in your particular example, and it wouldn’t harm the later interpretation of the file’s content as a table.
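A minimal sketch of that idea, continuing the example above (the offsets 3 and 27 are the values printed there, and the padding assumes plain ASCII column names so characters and bytes coincide):

io = open("file.csv", "r+")
newnames = "col1\tcol2\tcol3"
# the old header occupies bytes 3 to 25 (the newline is at 26), i.e. 27 - 1 - 3 = 23 characters,
# so pad the replacement with spaces to exactly that length
seek(io, 3)
write(io, rpad(newnames, 23))
close(io)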
The following can be improved in various ways, but if you don’t have any extreme requirements it should work just fine.
open(outfilename, "w") do output
    open(infilename, "r") do input
        header = readline(input, keep = true)
        # ...transform header...
        write(output, header)
        for line in eachline(input, keep = true)
            write(output, line)
        end
    end
end
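For the "...transform header..." step, one simple possibility (the column names here are made up) is a plain string replacement:

header = replace(header, "old_name" => "new_name") # rename a single column, leaving the rest of the line untouched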
Yes, that’s unavoidable.
But typically the software that reads the data in trims the column names, so in the end the column name should come out as “col3” without the spaces. If you are in control of this, even better, because that’s what you should do when reading in data: always trim (remove leading and trailing spaces from) the data fields, at least when we are talking about data analysis.
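As a rough sketch of that kind of trimming (assuming a tab-separated header in the same file.csv):

colnames = strip.(split(readline("file.csv"), '\t')) # remove leading/trailing whitespace from each column name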
If you have the space you need for rearranging the header line, that’s fine. Otherwise there’s a big difference between reading everything at once into memory and reading just a line at a time. But if you want to improve on that, I would do
open(outfilename, "w") do output
    open(infilename, "r") do input
        header = readline(input, keep = true)
        # ...transform header...
        write(output, header)
        while !eof(input)
            write(output, read(input, blocksize))
        end
    end
end
where you can tune blocksize depending on how much memory you want to handle at once and the characteristics of your file system.
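For example (the 1 MiB value is just an arbitrary starting point, not a recommendation):

blocksize = 2^20 # read 1 MiB per iteration; larger blocks mean fewer reads but more memory held at once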
Personally I wouldn’t bother with doing an in-place header change unless the file was gigantic, but in that case I just wouldn’t change it at all. Things like a column name with unexpected space padding will always come back to bite you at a later time.
I realize the limitations of modifying a file without rewriting it, since it is not possible to insert/remove characters. It is an interesting possibility when the number of modified characters is exactly the same as the number of original characters.