Ignore consecutive whitespaces with CSV.read(...)

question

#1

I am reading a file with CSV.read(...) that contains many consecutive spaces between columns because the colmuns are padded. I passed the keyword argument delim=' ', but it thinks that every space counts as a new column, instead of ignoring long stretches of consecutive white spaces.

How can I do this?


#2

What delimiter is it using? Have you tried CSV.read(filename, delim=delim) without anything else? That should work. If at all possible you should try to have csv’s with quote characters (e.g. “”") for strings, though of course one cannot always control what one gets.

Worst case scenario you can just use strip on relevant column, for example

df[:bad_column] = strip.(df[:bad_column])

#3

That’s what I did. Precisely

CSV.read(filename, delim=' ')

#4

Sorry, I meant what is the actual file delimiter (as in commas, tabs)? I think it should work if you only give it that and nothing else, otherwise you’ll have to use strip.


#5

The delimiter is whitespace. But the columns are padded. This means that there is more than one space between columns


#6

Whitespace as in tab? If it’s tab you should do delim='\t'. Maybe you’re implying that it’s just fixed width and there are no delimiters. I’ve never encountered such a thing, but I’ve seen some crazy stuff so I guess nothing would shock me at this point. If that’s the case you might be hard pressed to find any CSV readers that support this. I can’t think of any hack that would get CSV.jl to read such a thing properly, but if that really is your use case the CSV.jl contributors may have some ideas.


#7

If that’s indeed a fixed-width file, see:



#8

Thanks, that’s just my issue. Good to know it’s on the Todo list


#9

For CSVFiles.jl you can use the spacedelim option:

using CSVFiles, DataFrames
df = DataFrame(load("filename.csv", spacedelim=true))

#10

I think it is a normal behaviour for counting double space as the one with missing column. Otherwise, there is an ambiguity on how to define possible NAs


#11

I see your point. But the interesting thing is that DataFrames.read_table(..., separator=' ') handles this without problems. It automatically ignores the padding blank spaces.


#12

I’ve been using CSVFiles.jl (linked above) for a while now, and find it much more flexible than the “recommended” CSV.jl. If the only source of my files were programmers with good hygiene, CSV might work. But as I get all kinds of crap, I need that flexibility.

Another option (slightly dangerous) option is just to process the files to replace consectutive whitespace with tabs. I end up doing stuff like this all the time with sed, eg:

$ cat $FILE | sed -E -e 's/[\t ]+/\t/g' > new_file.tsv

You could do the same thing in Julia too:

open("old_file.txt") do bad; open("new_file.tsv", "w+") do good
    for l in eachline(bad, strip=false)
        replace!(l, r"[\t ]+", "\t")
        write(good, l)
    end
end; end

Or something… Note: both of these solutions are approximate, I didn’t test them and (esp on the Julia) some of the syntax may be off


#13

Is there a way to do the stripping in Julia, but without creating a new file? Maybe one can modify the stream before it is fed to the DataFrame reader.


#14

I’m sure there is. I don’t know how to do it though…