Ignore consecutive whitespaces with CSV.read(...)

e3c6 · May 31, 2018, 7:33pm

I am reading a file with CSV.read(...) that contains many consecutive spaces between columns because the colmuns are padded. I passed the keyword argument delim=' ', but it thinks that every space counts as a new column, instead of ignoring long stretches of consecutive white spaces.

How can I do this?

ExpandingMan · May 31, 2018, 7:45pm

What delimiter is it using? Have you tried CSV.read(filename, delim=delim) without anything else? That should work. If at all possible you should try to have csv’s with quote characters (e.g. “”") for strings, though of course one cannot always control what one gets.

Worst case scenario you can just use strip on relevant column, for example

df[:bad_column] = strip.(df[:bad_column])

e3c6 · May 31, 2018, 7:59pm

That’s what I did. Precisely

CSV.read(filename, delim=' ')

ExpandingMan · May 31, 2018, 8:03pm

Sorry, I meant what is the actual file delimiter (as in commas, tabs)? I think it should work if you only give it that and nothing else, otherwise you’ll have to use strip.

e3c6 · May 31, 2018, 8:15pm

The delimiter is whitespace. But the columns are padded. This means that there is more than one space between columns

ExpandingMan · May 31, 2018, 8:17pm

Whitespace as in tab? If it’s tab you should do delim='\t'. Maybe you’re implying that it’s just fixed width and there are no delimiters. I’ve never encountered such a thing, but I’ve seen some crazy stuff so I guess nothing would shock me at this point. If that’s the case you might be hard pressed to find any CSV readers that support this. I can’t think of any hack that would get CSV.jl to read such a thing properly, but if that really is your use case the CSV.jl contributors may have some ideas.

nalimilan · May 31, 2018, 8:19pm

If that’s indeed a fixed-width file, see:

e3c6 · May 31, 2018, 8:53pm

Thanks, that’s just my issue. Good to know it’s on the Todo list

davidanthoff · May 31, 2018, 11:28pm

For CSVFiles.jl you can use the spacedelim option:

using CSVFiles, DataFrames
df = DataFrame(load("filename.csv", spacedelim=true))

aakhmetz · June 1, 2018, 1:36am

I think it is a normal behaviour for counting double space as the one with missing column. Otherwise, there is an ambiguity on how to define possible NAs

e3c6 · June 1, 2018, 1:48am

I see your point. But the interesting thing is that DataFrames.read_table(..., separator=' ') handles this without problems. It automatically ignores the padding blank spaces.

kevbonham · June 2, 2018, 1:28pm

I’ve been using CSVFiles.jl (linked above) for a while now, and find it much more flexible than the “recommended” CSV.jl. If the only source of my files were programmers with good hygiene, CSV might work. But as I get all kinds of crap, I need that flexibility.

Another option (slightly dangerous) option is just to process the files to replace consectutive whitespace with tabs. I end up doing stuff like this all the time with sed, eg:

$ cat $FILE | sed -E -e 's/[\t ]+/\t/g' > new_file.tsv

You could do the same thing in Julia too:

open("old_file.txt") do bad; open("new_file.tsv", "w+") do good
    for l in eachline(bad, strip=false)
        replace!(l, r"[\t ]+", "\t")
        write(good, l)
    end
end; end

Or something… Note: both of these solutions are approximate, I didn’t test them and (esp on the Julia) some of the syntax may be off

e3c6 · June 2, 2018, 11:40pm

Is there a way to do the stripping in Julia, but without creating a new file? Maybe one can modify the stream before it is fed to the DataFrame reader.

kevbonham · June 3, 2018, 4:36pm

I’m sure there is. I don’t know how to do it though…

crsl4 · January 8, 2019, 4:03pm

Hello,
I am afraid that I have the same issue. I have a table with multiple spaces separating columns (not necessarily fixed size):

shell> head merlin.err
    FAMILY     PERSON     MARKER      RATIO
         A    EP01223   rs710865     0.0214
         A    EP01227 rs11249215     0.0107
         A    EP01228 rs11249215    0.00253
         A    EP01228 rs10903129     0.0116

The function readtable in DataFrames is able to read this table by treating repeated spaces as delimiter:

julia> using DataFrames
julia> dat=readtable("merlin.err",header=true, separator=' ')
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
944×4 DataFrames.DataFrame
│ Row │ FAMILY     │ PERSON     │ MARKER     │ RATIO   │
├─────┼────────────┼────────────┼────────────┼─────────┤
│ 1   │ A          │ EP01223    │ rs710865   │ 0.0214  │
│ 2   │ A          │ EP01227    │ rs11249215 │ 0.0107  │
│ 3   │ A          │ EP01228    │ rs11249215 │ 0.00253 │
│ 4   │ A          │ EP01228    │ rs10903129 │ 0.0116  │

But I cannot do this with CSV.read, even with the ignorerepeated option:

julia> using CSV
julia> dat = CSV.read("merlin.err",header=true, delim=' ', ignorerepeated=true)
941×5 DataFrames.DataFrame
│ Row │ Column1    │ FAMILY     │ PERSON     │ MARKER     │ RATIO    │
│     │ String⍰    │ String⍰    │ String⍰    │ String⍰    │ Float64⍰ │
├─────┼────────────┼────────────┼────────────┼────────────┼──────────┤
│ 1   │ missing    │ A          │ EP01223    │ rs710865   │ 0.0214   │
│ 2   │ missing    │ A          │ EP01227    │ rs11249215 │ 0.0107   │
│ 3   │ missing    │ A          │ EP01228    │ rs11249215 │ 0.00253  │
│ 4   │ missing    │ A          │ EP01228    │ rs10903129 │ 0.0116   │

I am in julia 1.0.0, with CSV v0.4.3.
I should add that it is not just the first column with all missing values. The first column has some non-missing values as well, and other columns have missing values too (missing values caused by repeated spaces).
Any help is appreciated, thanks!!

nalimilan · January 8, 2019, 4:15pm

Can you file a bug against CSV.jl, with a small example to reproduce the problem?

davidavdav · October 8, 2019, 8:01am

Argh… I remember having to move heaven and earth to get a PR for readtable() accepted that interpreted multiple whitespace characters as a single delimiter, as in the cases above, and has been the custom in several decades of unix text file tables. Now readtable() is deprecated and CSV.read() shows the same old behavior, which IMHO is a form of regression…

nalimilan · October 8, 2019, 12:02pm

Doesn’t ignorerepeated=true give the intended behavior? See https://github.com/JuliaData/CSV.jl/pull/266

davidanthoff · October 8, 2019, 4:34pm

You can also try the spacedelim=true option with CSVFiles.jl.

Topic		Replies	Views
Is there no standard way to read files with fixed width columns in the new DataFrames ecosystem? Data	6	2588	November 29, 2017
Reading data text files delimited with both spaces & tabs General Usage csv , io	13	4266	July 19, 2021
Reading text data: `readdlm` is deprecated, so how CSV package is used? General Usage question , text-data	9	564	July 17, 2024
Reading a txt. file with two different delimiters General Usage csv	5	2023	May 15, 2022
CSV, DataFrame read data file with string and Float64 columns New to Julia dataframes	3	60	September 3, 2024

Ignore consecutive whitespaces with CSV.read(...)

Related topics