Ignore consecutive whitespaces with CSV.read(...)

I am reading a file with CSV.read(...) that contains many consecutive spaces between columns because the colmuns are padded. I passed the keyword argument delim=' ', but it thinks that every space counts as a new column, instead of ignoring long stretches of consecutive white spaces.

How can I do this?

What delimiter is it using? Have you tried CSV.read(filename, delim=delim) without anything else? That should work. If at all possible you should try to have csv’s with quote characters (e.g. β€œβ€") for strings, though of course one cannot always control what one gets.

Worst case scenario you can just use strip on relevant column, for example

df[:bad_column] = strip.(df[:bad_column])

That’s what I did. Precisely

CSV.read(filename, delim=' ')

Sorry, I meant what is the actual file delimiter (as in commas, tabs)? I think it should work if you only give it that and nothing else, otherwise you’ll have to use strip.

The delimiter is whitespace. But the columns are padded. This means that there is more than one space between columns

Whitespace as in tab? If it’s tab you should do delim='\t'. Maybe you’re implying that it’s just fixed width and there are no delimiters. I’ve never encountered such a thing, but I’ve seen some crazy stuff so I guess nothing would shock me at this point. If that’s the case you might be hard pressed to find any CSV readers that support this. I can’t think of any hack that would get CSV.jl to read such a thing properly, but if that really is your use case the CSV.jl contributors may have some ideas.

If that’s indeed a fixed-width file, see:

2 Likes

Thanks, that’s just my issue. Good to know it’s on the Todo list

1 Like

For CSVFiles.jl you can use the spacedelim option:

using CSVFiles, DataFrames
df = DataFrame(load("filename.csv", spacedelim=true))
1 Like

I think it is a normal behaviour for counting double space as the one with missing column. Otherwise, there is an ambiguity on how to define possible NAs

I see your point. But the interesting thing is that DataFrames.read_table(..., separator=' ') handles this without problems. It automatically ignores the padding blank spaces.

I’ve been using CSVFiles.jl (linked above) for a while now, and find it much more flexible than the β€œrecommended” CSV.jl. If the only source of my files were programmers with good hygiene, CSV might work. But as I get all kinds of crap, I need that flexibility.

Another option (slightly dangerous) option is just to process the files to replace consectutive whitespace with tabs. I end up doing stuff like this all the time with sed, eg:

$ cat $FILE | sed -E -e 's/[\t ]+/\t/g' > new_file.tsv

You could do the same thing in Julia too:

open("old_file.txt") do bad; open("new_file.tsv", "w+") do good
    for l in eachline(bad, strip=false)
        replace!(l, r"[\t ]+", "\t")
        write(good, l)
    end
end; end

Or something… Note: both of these solutions are approximate, I didn’t test them and (esp on the Julia) some of the syntax may be off

3 Likes

Is there a way to do the stripping in Julia, but without creating a new file? Maybe one can modify the stream before it is fed to the DataFrame reader.

I’m sure there is. I don’t know how to do it though…

Hello,
I am afraid that I have the same issue. I have a table with multiple spaces separating columns (not necessarily fixed size):

shell> head merlin.err
    FAMILY     PERSON     MARKER      RATIO
         A    EP01223   rs710865     0.0214
         A    EP01227 rs11249215     0.0107
         A    EP01228 rs11249215    0.00253
         A    EP01228 rs10903129     0.0116

The function readtable in DataFrames is able to read this table by treating repeated spaces as delimiter:

julia> using DataFrames
julia> dat=readtable("merlin.err",header=true, separator=' ')
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
944Γ—4 DataFrames.DataFrame
β”‚ Row β”‚ FAMILY     β”‚ PERSON     β”‚ MARKER     β”‚ RATIO   β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ A          β”‚ EP01223    β”‚ rs710865   β”‚ 0.0214  β”‚
β”‚ 2   β”‚ A          β”‚ EP01227    β”‚ rs11249215 β”‚ 0.0107  β”‚
β”‚ 3   β”‚ A          β”‚ EP01228    β”‚ rs11249215 β”‚ 0.00253 β”‚
β”‚ 4   β”‚ A          β”‚ EP01228    β”‚ rs10903129 β”‚ 0.0116  β”‚

But I cannot do this with CSV.read, even with the ignorerepeated option:

julia> using CSV
julia> dat = CSV.read("merlin.err",header=true, delim=' ', ignorerepeated=true)
941Γ—5 DataFrames.DataFrame
β”‚ Row β”‚ Column1    β”‚ FAMILY     β”‚ PERSON     β”‚ MARKER     β”‚ RATIO    β”‚
β”‚     β”‚ String⍰    β”‚ String⍰    β”‚ String⍰    β”‚ String⍰    β”‚ Float64⍰ β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ missing    β”‚ A          β”‚ EP01223    β”‚ rs710865   β”‚ 0.0214   β”‚
β”‚ 2   β”‚ missing    β”‚ A          β”‚ EP01227    β”‚ rs11249215 β”‚ 0.0107   β”‚
β”‚ 3   β”‚ missing    β”‚ A          β”‚ EP01228    β”‚ rs11249215 β”‚ 0.00253  β”‚
β”‚ 4   β”‚ missing    β”‚ A          β”‚ EP01228    β”‚ rs10903129 β”‚ 0.0116   β”‚

I am in julia 1.0.0, with CSV v0.4.3.
I should add that it is not just the first column with all missing values. The first column has some non-missing values as well, and other columns have missing values too (missing values caused by repeated spaces).
Any help is appreciated, thanks!!

1 Like

Can you file a bug against CSV.jl, with a small example to reproduce the problem?

2 Likes

Argh… I remember having to move heaven and earth to get a PR for readtable() accepted that interpreted multiple whitespace characters as a single delimiter, as in the cases above, and has been the custom in several decades of unix text file tables. Now readtable() is deprecated and CSV.read() shows the same old behavior, which IMHO is a form of regression…

Doesn’t ignorerepeated=true give the intended behavior? See https://github.com/JuliaData/CSV.jl/pull/266

4 Likes

You can also try the spacedelim=true option with CSVFiles.jl.

1 Like