Ignore consecutive whitespaces with CSV.read(...)

question

#1

I am reading a file with CSV.read(...) that contains many consecutive spaces between columns because the colmuns are padded. I passed the keyword argument delim=' ', but it thinks that every space counts as a new column, instead of ignoring long stretches of consecutive white spaces.

How can I do this?


#2

What delimiter is it using? Have you tried CSV.read(filename, delim=delim) without anything else? That should work. If at all possible you should try to have csv’s with quote characters (e.g. β€œβ€") for strings, though of course one cannot always control what one gets.

Worst case scenario you can just use strip on relevant column, for example

df[:bad_column] = strip.(df[:bad_column])

#3

That’s what I did. Precisely

CSV.read(filename, delim=' ')

#4

Sorry, I meant what is the actual file delimiter (as in commas, tabs)? I think it should work if you only give it that and nothing else, otherwise you’ll have to use strip.


#5

The delimiter is whitespace. But the columns are padded. This means that there is more than one space between columns


#6

Whitespace as in tab? If it’s tab you should do delim='\t'. Maybe you’re implying that it’s just fixed width and there are no delimiters. I’ve never encountered such a thing, but I’ve seen some crazy stuff so I guess nothing would shock me at this point. If that’s the case you might be hard pressed to find any CSV readers that support this. I can’t think of any hack that would get CSV.jl to read such a thing properly, but if that really is your use case the CSV.jl contributors may have some ideas.


#7

If that’s indeed a fixed-width file, see:



#8

Thanks, that’s just my issue. Good to know it’s on the Todo list


#9

For CSVFiles.jl you can use the spacedelim option:

using CSVFiles, DataFrames
df = DataFrame(load("filename.csv", spacedelim=true))

#10

I think it is a normal behaviour for counting double space as the one with missing column. Otherwise, there is an ambiguity on how to define possible NAs


#11

I see your point. But the interesting thing is that DataFrames.read_table(..., separator=' ') handles this without problems. It automatically ignores the padding blank spaces.


#12

I’ve been using CSVFiles.jl (linked above) for a while now, and find it much more flexible than the β€œrecommended” CSV.jl. If the only source of my files were programmers with good hygiene, CSV might work. But as I get all kinds of crap, I need that flexibility.

Another option (slightly dangerous) option is just to process the files to replace consectutive whitespace with tabs. I end up doing stuff like this all the time with sed, eg:

$ cat $FILE | sed -E -e 's/[\t ]+/\t/g' > new_file.tsv

You could do the same thing in Julia too:

open("old_file.txt") do bad; open("new_file.tsv", "w+") do good
    for l in eachline(bad, strip=false)
        replace!(l, r"[\t ]+", "\t")
        write(good, l)
    end
end; end

Or something… Note: both of these solutions are approximate, I didn’t test them and (esp on the Julia) some of the syntax may be off


#13

Is there a way to do the stripping in Julia, but without creating a new file? Maybe one can modify the stream before it is fed to the DataFrame reader.


#14

I’m sure there is. I don’t know how to do it though…


#15

Hello,
I am afraid that I have the same issue. I have a table with multiple spaces separating columns (not necessarily fixed size):

shell> head merlin.err
    FAMILY     PERSON     MARKER      RATIO
         A    EP01223   rs710865     0.0214
         A    EP01227 rs11249215     0.0107
         A    EP01228 rs11249215    0.00253
         A    EP01228 rs10903129     0.0116

The function readtable in DataFrames is able to read this table by treating repeated spaces as delimiter:

julia> using DataFrames
julia> dat=readtable("merlin.err",header=true, separator=' ')
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
944Γ—4 DataFrames.DataFrame
β”‚ Row β”‚ FAMILY     β”‚ PERSON     β”‚ MARKER     β”‚ RATIO   β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ A          β”‚ EP01223    β”‚ rs710865   β”‚ 0.0214  β”‚
β”‚ 2   β”‚ A          β”‚ EP01227    β”‚ rs11249215 β”‚ 0.0107  β”‚
β”‚ 3   β”‚ A          β”‚ EP01228    β”‚ rs11249215 β”‚ 0.00253 β”‚
β”‚ 4   β”‚ A          β”‚ EP01228    β”‚ rs10903129 β”‚ 0.0116  β”‚

But I cannot do this with CSV.read, even with the ignorerepeated option:

julia> using CSV
julia> dat = CSV.read("merlin.err",header=true, delim=' ', ignorerepeated=true)
941Γ—5 DataFrames.DataFrame
β”‚ Row β”‚ Column1    β”‚ FAMILY     β”‚ PERSON     β”‚ MARKER     β”‚ RATIO    β”‚
β”‚     β”‚ String⍰    β”‚ String⍰    β”‚ String⍰    β”‚ String⍰    β”‚ Float64⍰ β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ missing    β”‚ A          β”‚ EP01223    β”‚ rs710865   β”‚ 0.0214   β”‚
β”‚ 2   β”‚ missing    β”‚ A          β”‚ EP01227    β”‚ rs11249215 β”‚ 0.0107   β”‚
β”‚ 3   β”‚ missing    β”‚ A          β”‚ EP01228    β”‚ rs11249215 β”‚ 0.00253  β”‚
β”‚ 4   β”‚ missing    β”‚ A          β”‚ EP01228    β”‚ rs10903129 β”‚ 0.0116   β”‚

I am in julia 1.0.0, with CSV v0.4.3.
I should add that it is not just the first column with all missing values. The first column has some non-missing values as well, and other columns have missing values too (missing values caused by repeated spaces).
Any help is appreciated, thanks!!


#16

Can you file a bug against CSV.jl, with a small example to reproduce the problem?