Changing strings at the beginning of a large file

Hi!
I have large CSV/TSV tables and I want to change some column names using a Julia program.
I think that loading these large tables into DataFrames would consume a lot of RAM and be slow as well.

My idea was to use readdlm()/writedlm() on each line of the file, but reading and then rewriting every line seems inefficient, because 99% of the file is unchanged.

To extend the discussion: when one wants to modify a file at the very beginning, or at a very specific position, is there a way to copy a large unchanged block of text efficiently? Or is the only possibility to parse the file line by line? In that case, are readdlm()/writedlm() the best tools for text files?
Thanks for your comments! :wink:

You can directly write at any position in a file using open, write, seek, and close.
You can find the functions you need in
https://docs.julialang.org/en/v1.7/base/io-network/#General-I/O
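
For example (a minimal sketch; the file name and offset are purely illustrative):

io = open("data.txt", "r+")   # "r+": read and write, keeps existing content
seek(io, 3)                   # move to byte offset 3 (offsets are 0-based)
write(io, "XYZ")              # overwrite exactly 3 bytes at that position
close(io)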

@oheil thank you very much for your answer. I wrote a toy program to test it :slight_smile:
I have a simple table in “file.csv”

id	column1	column2	column3
row1	1	2	3
row2	4	5	6
row3	7	8	9

I want to replace “column” with “col”

io = open("file.csv", "r+")

# find the first tabulation (read advances the stream; i ends up
# at the offset just past the first '\t')
i = 1
while read(io, Char) != '\t'
    seek(io, i)
    i += 1
end
println(i) # 3

# find the first newline (same scan; i ends up just past the '\n')
i = 1
while read(io, Char) != '\n'
    seek(io, i)
    i += 1
end

println(i) # 27

# position writing just after the first tabulation
seek(io, 3)
# write the new column names
write(io, "col1\tcol2\tcol3")

close(io)

The result of this program is

id	col1	col2	col32	column3
row1	1	2	3
row2	4	5	6
row3	7	8	9

because my modification is shorter than the original one. I did not find a way to remove characters; is that possible?

That’s not possible, but you could overwrite the remaining characters with spaces. This is of course not a general solution, just something you can do in your particular example, where it won’t harm the later interpretation of the file’s content as a table.

@oheil thank you!
Indeed, if I do

spaces = repeat(" ", 9)
write(io, "col1\tcol2\tcol3$spaces")

I obtain

id	col1	col2	col3         
row1	1	2	3
row2	4	5	6
row3	7	8	9

But the name of the last column is not “col3”, it is “col3” followed by 9 spaces. I suppose that is an unavoidable limitation to keep in mind.

This can be improved in various ways, but if you don’t have any extreme requirements it should work just fine.

open(outfilename, "w") do output
    open(infilename, "r") do input
        header = readline(input, keep = true)
        # ...transform header...
        write(output, header)
        for line in eachline(input, keep = true)
            write(output, line)
        end
    end
end
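
With the example in this thread, the transform would simply be header = replace(header, "column" => "col").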

Yes, that’s unavoidable.
But typically the software that reads the data in trims the column names, so in the end the column name should come out as “col3” without the spaces. If you are in control of this, even better, because that’s what you should do when reading in data: always trim (remove leading and trailing spaces) from the data fields. At least when we are talking about data analysis.
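
For example (just a sketch of such a trim on the header fields):

line = "id\tcol1\tcol2\tcol3         "   # header as it now looks on disk
names = strip.(split(line, '\t'))        # strip removes leading/trailing whitespace
# names == ["id", "col1", "col2", "col3"]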

The point of this thread is actually to avoid exactly that, because the files are large and the changes small.

@GunnarFarneback, your solution is similar to what I was about to do :slight_smile:

Yes, my idea in this post was to avoid a “waste of energy” from reading and writing many lines without any changes :wink:

If you have the space you need for rearranging the header line, that’s fine. Otherwise, there’s a big difference between reading everything into memory at once and reading just a line at a time. But if you want to improve on that, I would do

open(outfilename, "w") do output
    open(infilename, "r") do input
        header = readline(input, keep = true)
        # ...transform header...
        write(output, header)
        while !eof(input)
            write(output, read(input, blocksize))
        end
    end
end

where you can tune blocksize depending on how much memory you want to handle at once and the characteristics of your file system.
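
For example (the value is just a starting point, not a recommendation):

blocksize = 2^20   # read 1 MiB per chunk; tune up or down as needed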

Personally I wouldn’t bother with doing an in-place header change unless the file was gigantic, but in that case I just wouldn’t change it at all. Things like a column name with unexpected space padding will always come back to bite you at a later time.

@GunnarFarneback thank you for your comment, the option to tune blocksize is interesting.

You could process the whole header line also as follows:

open(file, "r+") do io
    str1 = readline(io)                      # header without the trailing '\n'
    str2 = replace(str1, "column" => "col")
    str2 = rpad(str2, length(str1)) * "\n"   # pad back to the original line length
    seekstart(io)
    write(io, str2)
end

@rafael.guerra Thank you for this elegant solution :wink:

I realize the limitations of modifying a file without rewriting it, since it is not possible to insert or remove characters. It is an interesting possibility when the modified text has exactly the same number of characters as the original.
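
For example, a defensive pattern (just a sketch combining the ideas above) could be:

# Overwrite the header in place only when the byte count is unchanged;
# otherwise fall back to rewriting the whole file, as discussed above.
open("file.csv", "r+") do io
    old = readline(io, keep = true)
    new = replace(old, "column" => "col")
    if sizeof(new) == sizeof(old)
        seekstart(io)
        write(io, new)    # safe: exactly the same number of bytes
    else
        @warn "replacement changes the line length; rewrite the file instead"
    end
end

(With “column” => “col” this takes the warning branch, which is exactly the point.)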