Hi everyone,
how would I go about removing the first line of a (potentially very large) text file, ideally without having to load the file into a DataFrame or copying the full file?
Best
Jakob
Check out working with text files from the Julia wikibook.
open("file_to_read") do f
    io = open("test.txt", "w")
    i = 1
    for l in eachline(f)
        i != 1 && println(io, l)
        i += 1
    end
    close(io)
end
Do you just want to save it back to disk in place? It’d be nice to just tell the filesystem that the file starts at a new place, but I don’t think that’s a supported operation on any system, and even if it were, the data would likely have to be removed in exact multiples of 1 or 2k. I think the only way to do it is to work through the entire file and rewrite it.
Do you need to write a program? I mean, if you are on Linux you would do:
tail -n +2 input.txt > output.txt
Not sure what the Windows equivalent would be… I’d probably install the Ubuntu “App” to get a bash prompt with tail, and run it from there.
Note that the println-based Julia loop above adds an unconditional newline at the end of the output, which may or may not be OK.
Even simpler:
open("file_to_read") do input
    open("test.txt", "w") do output
        for line in Iterators.drop(eachline(input), 1)
            println(output, line)
        end
    end
end
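A small variation on the loop above, sketched as one way to sidestep the trailing-newline issue: with keep=true, eachline leaves each line's terminator attached, so write copies the bytes unchanged and a file without a final newline stays that way. The function name drop_first_line_keep and the paths are made up for illustration.

```julia
# Hypothetical helper: copy `src` to `dst` without its first line.
# `src` and `dst` must be different files, because `dst` is truncated on open.
function drop_first_line_keep(src::AbstractString, dst::AbstractString)
    open(src) do input
        open(dst, "w") do output
            # keep=true keeps "\n" or "\r\n" on each line, so nothing is added or lost
            for line in Iterators.drop(eachline(input; keep=true), 1)
                write(output, line)
            end
        end
    end
    return dst
end
```

Usage would be something like drop_first_line_keep("file_to_read", "test.txt").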
If you want to maximize speed, and avoid the extraneous newline pointed out by @yuyichao, it would be faster to read everything after the newline in a block:
open("file_to_read") do input
    readuntil(input, '\n')
    write("file_to_write", read(input))
end
This implementation is even shorter than the code based on eachline. (And it will be vastly faster than spawning a Unix program like tail, not to mention being more portable.)
The only downside of this approach is that it might take a lot of memory if you have an enormous file. A more general implementation would probably read the data in large blocks, similar to this code.
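For what it's worth, here is a minimal sketch of what such a block-wise version could look like. It is not the linked code; the name drop_first_line_blocks and the 1 MiB default for blocksize are arbitrary choices for illustration, not tuned values.

```julia
# Hypothetical helper: copy everything after the first line in fixed-size blocks,
# so peak memory is roughly one block instead of the whole file.
# `src` and `dst` must be different files, since both are open at the same time.
function drop_first_line_blocks(src::AbstractString, dst::AbstractString;
                                blocksize::Integer = 1 << 20)
    open(src) do input
        readuntil(input, '\n')              # discard the first line
        open(dst, "w") do output
            buf = Vector{UInt8}(undef, blocksize)
            while !eof(input)
                n = readbytes!(input, buf)  # reads at most length(buf) bytes
                write(output, view(buf, 1:n))
            end
        end
    end
    return dst
end
```

Because the input and output are open simultaneously, this variant is not suitable for overwriting the file in place.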
why not use POSIX tools…
https://superuser.com/questions/284258/remove-first-line-in-bash
POSIX tools are great, but Julia code that relies on them will not be portable.
Also, spawning executables takes a lot of time, so for simple tasks it is often orders of magnitude faster to use Julia code than to spawn a POSIX command-line program.
For example, run(pipeline(`tail -n +2 $inputname`, stdout=outputname)) seems to be about 10²× slower on my computer than my native-Julia readuntil code above for most files.
I agree with everything you said. It’s just that the OP didn’t say they need to do this programmatically multiple times on different occasions, so I thought I’d mention sed in case it’s just a one-time thing.
If you are generating the files in Julia, then the most efficient solution would be to create a type for which you’ve implemented the IO methods. That way you can filter out the first line and write the rest of the data.
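One possible reading of this suggestion, as a rough sketch only: wrap the destination stream in a small IO subtype that swallows bytes up to and including the first newline. The type name SkipFirstLine is hypothetical, and the sketch assumes that Base's generic fallbacks route print and println of strings through the single-byte write method, which makes it simple but not fast.

```julia
# Hypothetical wrapper: forwards writes to `inner`, except for the first line.
mutable struct SkipFirstLine{T<:IO} <: IO
    inner::T
    skipped::Bool   # true once the first newline has been seen
end
SkipFirstLine(io::IO) = SkipFirstLine(io, false)

# Single-byte write: the only method this sketch implements.
function Base.write(s::SkipFirstLine, b::UInt8)
    if s.skipped
        return write(s.inner, b)
    end
    # still inside the first line: drop the byte, and notice the newline
    b == UInt8('\n') && (s.skipped = true)
    return 1        # report the byte as consumed
end

Base.close(s::SkipFirstLine) = close(s.inner)
```

You would then write to SkipFirstLine(open("file_to_write", "w")) exactly as you would to the plain file, and the first line never reaches the disk.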
This is really neat! (There is an extra ) though after "file_to_write".) One can even point "file_to_read" and "file_to_write" to the same file to do the in-place replacement. As I’ve changed the workflow to cutting large files into smaller chunks, memory is not really an issue anymore.
Thanks for all the replies, btw. Julia really does have an awesome community!
If you want the option to do this in-place, I would change the code to:
write("file_to_write",
      open("file_to_read") do input
          readuntil(input, '\n')
          read(input)
      end)
so that the file is closed after reading and before it is opened for writing.
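If holding the whole remainder in memory is a concern, a different technique is to stream into a temporary file and rename it over the original afterwards. This is only a sketch under that assumption; the name drop_first_line_inplace is made up, and the replacement file will get the temp file's permissions rather than the original's.

```julia
# Hypothetical helper: remove the first line of `path` via a temporary file.
function drop_first_line_inplace(path::AbstractString)
    # create the temp file next to `path` so the final rename stays on one filesystem
    tmppath, tmpio = mktemp(dirname(abspath(path)))
    try
        open(path) do input
            readuntil(input, '\n')   # discard the first line
            write(tmpio, input)      # stream the remainder into the temp file
        end
    finally
        close(tmpio)
    end
    mv(tmppath, path; force=true)    # replace the original only after a complete write
    return path
end
```

The original file is untouched until the mv, so an error halfway through the copy leaves it intact.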