Hi!
I have large CSV/TSV tables and I want to change some column names using a Julia program.
I think that loading these large tables into DataFrames would consume a lot of RAM and be slow as well.
My idea was to use readdlm()/writedlm() on each line of the file, but reading and then rewriting every line seems inefficient, because 99% of the file is unchanged.
To extend the discussion: when one wants to modify a file at the very beginning or at a very specific position, is there a way to efficiently copy a large block of text that is unchanged? Or is the only possibility to parse the file line by line? In that case, is readdlm()/writedlm() the best solution for text files?
Thanks for your comments!
io = open("file.csv", "r+")
# find the position just after the first tab
while read(io, Char) != '\t'
end
tabpos = position(io)
println(tabpos) # 3
# find the position just after the first newline
while read(io, Char) != '\n'
end
nlpos = position(io)
println(nlpos) # 27
# position writing just after the first tab
seek(io, tabpos)
# write the new column names (overwrites the existing characters in place)
write(io, "col1\tcol2\tcol3")
close(io)
That’s not possible, but you could overwrite the remaining characters with spaces. This is of course not a general solution, just something you can do in your particular example, and it wouldn’t harm the later interpretation of the file’s content as a table.
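A minimal sketch of that idea, continuing the example above (the offsets 3 and 27 are the values printed there, and the padding assumes plain ASCII column names so characters and bytes coincide):

io = open("file.csv", "r+")
newnames = "col1\tcol2\tcol3"
# the old header occupies bytes 3 to 25 (the newline is at 26), i.e. 27 - 1 - 3 = 23 characters,
# so pad the replacement with spaces to exactly that length
seek(io, 3)
write(io, rpad(newnames, 23))
close(io)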
The following can be improved in various ways, but if you don’t have any extreme requirements it should work just fine.
open(outfilename, "w") do output
    open(infilename, "r") do input
        header = readline(input, keep = true)
        # ...transform header...
        write(output, header)
        for line in eachline(input, keep = true)
            write(output, line)
        end
    end
end
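For the "...transform header..." step, one simple possibility (the column names here are made up) is a plain string replacement:

header = replace(header, "old_name" => "new_name") # rename a single column, leaving the rest of the line untouched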
Yes, that’s unavoidable.
But typically the software that reads the data in trims the column names, so in the end the column name should come out as “col3” without the spaces. If you are in control of this, even better, because that’s what you should do when reading in data: always trim (remove leading and trailing spaces from) the data fields, at least when we are talking about data analysis.
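As a rough sketch of that kind of trimming (assuming a tab-separated header in the same file.csv):

colnames = strip.(split(readline("file.csv"), '\t')) # remove leading/trailing whitespace from each column name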
If you have the space you need for rearranging the header line, that’s fine. Otherwise there’s a big difference between reading everything at once into memory and reading just a line at a time. But if you want to improve on that, I would do
open(outfilename, "w") do output
    open(infilename, "r") do input
        header = readline(input, keep = true)
        # ...transform header...
        write(output, header)
        while !eof(input)
            write(output, read(input, blocksize))
        end
    end
end
where you can tune blocksize depending on how much memory you want to handle at once and the characteristics of your file system.
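For example (the 1 MiB value is just an arbitrary starting point, not a recommendation):

blocksize = 2^20 # read 1 MiB per iteration; larger blocks mean fewer reads but more memory held at once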
Personally I wouldn’t bother with doing an in-place header change unless the file was gigantic, but in that case I just wouldn’t change it at all. Things like a column name with unexpected space padding will always come back to bite you at a later time.
I realize the limitations of modifying a file without rewriting it, since it is not possible to insert/remove characters. It is an interesting possibility when the number of modified characters is exactly the same as the number of original characters.