Preserving File Structure when Reading and Writing

I’m having some trouble preserving the structure of a CSV file that I’m attempting to alter with a Julia program. The data is originally displayed in the file like this:

Created as New Dataset,Sample 001 By Lab Date Thursday, August 23 2018
cm-1,A
4000.00,0.0084
3999.00,0.0084
3998.00,0.0084

But after running it through the program, it comes out like this:

cm-1,A4000.00,0.00843999.00,0.00843998.00,0.00843997.00,0.00843996.00,0.0...

It seems like everything is being output on the same line.

Here’s the code which is creating this result:

for i = firstindex(fileList):lastindex(fileList)
    local lines
    open(fileList[i]) do reader
        lines = readlines(reader)
    end
    println(lines[1])
    open(fileList[i], "w") do writer
        for i = firstindex(lines):lastindex(lines)
            if (i!=1)
                write(writer, lines[i])
            end
        end
    end
end

If it matters, what I’m trying to accomplish is just removing the first line from the file.

readlines has an argument chomp which defaults to true and causes the newline character in each line to be stripped off. So the lines you’re getting out have the right content but no newlines, and thus writing them back to a file results in everything being on the same line. Try readlines(reader, chomp=false).
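To make the effect concrete, here is a minimal sketch. It is written for Julia 1.0 or later, where as far as I know the keyword is spelled keep (with the opposite sense of chomp); on 0.6 you would substitute chomp=false. The temp file and variable names are made up for the illustration. It also shows that println(io, line) appends a newline itself, which is another way to get one row per line:

```julia
# Sketch: why stripped lines run together when written back.
path, io = mktemp()                      # throwaway file for the demonstration
write(io, "cm-1,A\n4000.00,0.0084\n3999.00,0.0084\n")
close(io)

stripped = readlines(path)               # default: newlines are removed
kept     = readlines(path; keep=true)    # newlines retained (Julia 1.0+ keyword)

# write() adds nothing, so stripped lines end up on one line:
joined_write = sprint(out -> foreach(l -> write(out, l), stripped))
# "cm-1,A4000.00,0.00843999.00,0.0084"

# println() appends "\n" after each line:
joined_println = sprint(out -> foreach(l -> println(out, l), stripped))

# write() is fine when the lines still carry their own newlines:
joined_kept = sprint(out -> foreach(l -> write(out, l), kept))

rm(path)
```

Note that println always writes "\n", so a file that originally used "\r\n" line endings would come out with "\n" endings; keeping the original newlines avoids that.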

For more information, see:

help?> readlines
search: readlines readline readlink

  readlines(stream::IO=STDIN; chomp::Bool=true)
  readlines(filename::AbstractString; chomp::Bool=true)

  Read all lines of an I/O stream or a file as a vector of strings. Behavior is equivalent to saving the result of reading readline repeatedly with the same arguments and saving the resulting lines as a vector of
  strings.

help?> readline
search: readline readlines readlink

  readline(stream::IO=STDIN; chomp::Bool=true)
  readline(filename::AbstractString; chomp::Bool=true)

  Read a single line of text from the given I/O stream or file (defaults to STDIN). When reading from a file, the text is assumed to be encoded in UTF-8. Lines in the input end with '\n' or "\r\n" or the end of an
  input stream. When chomp is true (as it is by default), these trailing newline characters are removed from the line before it is returned. When chomp is false, they are returned as part of the line.

The above answers your specific question, but I have some unsolicited advice. I do this kind of thing all the time and used to write loops like this, but I have since learned a few tricks that make things more compact (and, I think, clearer too; compactness certainly isn’t the only goal). Please feel free to disregard all of the following :smile:

  1. Rather than reading the entire file into memory (which is what happens when you create lines), you can process it line by line. It doesn’t matter much if your files are tiny, but with huge ones it can make a difference. Look into the function eachline().
  2. If you stick with loading the lines into an array, your writer loop could be something like for line in lines[2:end]; write(writer, line); end, rather than using the index and checking each time that it’s not the first one (lines[2:end] gives you all but the first entry).
  3. An alternative is to consume the first line of your reader with the print statement, and then just write the rest of the stream to a new file:
function stripfirst(infile::String, outfile::String)
    open(infile, "r") do reader
        println(readline(reader)) # consumes the first line
        write(outfile, reader) # copies the rest of the stream to outfile
    end
end
shell> cat testfile.csv
Created as New Dataset,Sample 001 By Lab Date Thursday, August 23 2018
cm-1,A
4000.00,0.0084
3999.00,0.0084
3998.00,0.0084

julia> function stripfirst(infile::String, outfile::String)
           open(infile, "r") do reader
               println(readline(reader)) # consumes the first line
               write(outfile, reader)
           end
       end
stripfirst (generic function with 1 method)

julia> stripfirst("testfile.csv", "testout.csv")
Created as New Dataset,Sample 001 By Lab Date Thursday, August 23 2018
52

shell> cat testout.csv
cm-1,A
4000.00,0.0084
3999.00,0.0084
3998.00,0.0084

EDIT: I just noticed you’re trying to edit files in-place, in which case points 1 and 3 are invalid… sorry! Will leave up just FYI, but point 2 is still operable.
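For the case where you do write to a separate output file, point 1 might look something like the sketch below. It assumes Julia 1.0 or later (for the keep keyword), and dropfirstline is a made-up name for the example. infile and outfile must be different paths, since opening the output with "w" truncates it before anything is read:

```julia
# Hypothetical helper: copy `infile` to `outfile` without its first line,
# reading one line at a time instead of loading the whole file into memory.
function dropfirstline(infile::AbstractString, outfile::AbstractString)
    open(outfile, "w") do writer
        # keep=true retains each line's newline (Julia 1.0+ keyword);
        # Iterators.drop lazily skips the first line.
        for line in Iterators.drop(eachline(infile; keep=true), 1)
            write(writer, line)
        end
    end
    return outfile
end
```

Usage would be along the lines of dropfirstline("testfile.csv", "testout.csv").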


When I try replacing lines = readlines(reader) with readlines(reader, chomp=false), I get the following error:

ERROR: LoadError: MethodError: no method matching eachline(::IOStream; chomp=false)
Closest candidates are:
  eachline(::IO; keep) at io.jl:871 got unsupported keyword argument "chomp"
  eachline() at io.jl:871 got unsupported keyword argument "chomp"
  eachline(!Matched::AbstractString; keep) at io.jl:875 got unsupported keyword argument "chomp"

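On 1.0 the keyword is keep rather than chomp, and its sense is inverted: the signature is readlines(::IO; keep=false), so pass keep=true to retain the newlines.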

Are you on Julia 1.0? It looks like chomp was changed to keep.

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.0 (2018-08-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

help?> readline
search: readline readlines readlink

  readline(io::IO=stdin; keep::Bool=false)
  readline(filename::AbstractString; keep::Bool=false)

  Read a single line of text from the given I/O stream or file (defaults to stdin). When reading from a file, the text is assumed to be encoded in
  UTF-8. Lines in the input end with '\n' or "\r\n" or the end of an input stream. When keep is false (as it is by default), these trailing newline
  characters are removed from the line before it is returned. When keep is true, they are returned as part of the line.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> open("my_file.txt", "w") do io
             write(io, "JuliaLang is a GitHub organization.\nIt has many members.\n");
         end
  57

  julia> readline("my_file.txt")
  "JuliaLang is a GitHub organization."

  julia> readline("my_file.txt", keep=true)
  "JuliaLang is a GitHub organization.\n"

  julia> rm("my_file.txt")

I changed it to keep = true, and everything worked. Thank you!
Also, @kevbonham, I implemented your second tip about iterating from 2 to the end, and I think it both looks and works a lot better. Thank you!
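For anyone finding this later, the two changes combined might look roughly like this. It is only a sketch, assuming Julia 1.0 or later; stripfirstlines is a made-up name, and fileList is the vector of paths from the original post:

```julia
# Hypothetical helper: remove the first line of each file in place.
function stripfirstlines(fileList)
    for path in fileList
        # keep=true retains each line's trailing newline (Julia 1.0+ keyword)
        lines = readlines(path; keep=true)
        isempty(lines) && continue       # nothing to strip from an empty file
        print(lines[1])                  # show the line being removed
        open(path, "w") do writer
            for line in lines[2:end]     # everything except the first line
                write(writer, line)
            end
        end
    end
end
```

Because the whole file is read before it is reopened for writing, this is safe to run in place, at the cost of holding each file in memory.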


Totally off-topic, but it is a really bad idea to remove a human-readable file header. Maybe you should rethink your whole workflow so that you can keep the header line.

This is a good point - CSV.read allows you to specify which line the data starts on for example.

Normally I’d agree with you, but the way that the file is structured (specifically the existence of that “August 23 2018” part) renders the read-in from CSV.read unusable. Plus, I’m working with copies of files so it doesn’t matter so much if the header gets chopped off.

@kevbonham That said, if there’s an option with CSV.read that allows me to select what line to start with I’d much rather use that. I was looking through the documentation here, and the only parameter that seems like it would allow me to start reading at the second line didn’t work.

After you have found the solution, please consider making a pull request to improve the CSV.jl documentation. You will be the best expert for this matter. (Sorry that I cannot help with the problem but I haven’t used CSV.jl.)

Which one did you try? Pretty sure header=2 will do the trick. If you only do datarow=3, it will still try to use the first row as the header. But if you provide the header row, it assumes the data starts on the row after. Not at a computer right now to check, but if that doesn’t work you should definitely open an issue.

  • header : column names can be provided manually as a complete Vector{String}, or as an Int/AbstractRange which indicates the row/rows that contain the column names

  • datarow::Int : specifies the row on which the actual data starts in the file; by default, the data is expected on the next row after the header row(s); for a file without column names (header), specify datarow=1

Ah, I was using datarow = 3 before; changing it to header = 2 worked like a charm. Thank you!
