I am reading a file with CSV.read(...) that contains many consecutive spaces between columns because the colmuns are padded. I passed the keyword argument delim=' ', but it thinks that every space counts as a new column, instead of ignoring long stretches of consecutive white spaces.
What delimiter is it using? Have you tried CSV.read(filename, delim=delim) without anything else? That should work. If at all possible you should try to have csvβs with quote characters (e.g. ββ") for strings, though of course one cannot always control what one gets.
Worst case scenario you can just use strip on relevant column, for example
Sorry, I meant what is the actual file delimiter (as in commas, tabs)? I think it should work if you only give it that and nothing else, otherwise youβll have to use strip.
Whitespace as in tab? If itβs tab you should do delim='\t'. Maybe youβre implying that itβs just fixed width and there are no delimiters. Iβve never encountered such a thing, but Iβve seen some crazy stuff so I guess nothing would shock me at this point. If thatβs the case you might be hard pressed to find any CSV readers that support this. I canβt think of any hack that would get CSV.jl to read such a thing properly, but if that really is your use case the CSV.jl contributors may have some ideas.
I think it is a normal behaviour for counting double space as the one with missing column. Otherwise, there is an ambiguity on how to define possible NAs
I see your point. But the interesting thing is that DataFrames.read_table(..., separator=' ') handles this without problems. It automatically ignores the padding blank spaces.
Iβve been using CSVFiles.jl (linked above) for a while now, and find it much more flexible than the βrecommendedβ CSV.jl. If the only source of my files were programmers with good hygiene, CSV might work. But as I get all kinds of crap, I need that flexibility.
Another option (slightly dangerous) option is just to process the files to replace consectutive whitespace with tabs. I end up doing stuff like this all the time with sed, eg:
open("old_file.txt") do bad; open("new_file.tsv", "w+") do good
for l in eachline(bad, strip=false)
replace!(l, r"[\t ]+", "\t")
write(good, l)
end
end; end
Or somethingβ¦ Note: both of these solutions are approximate, I didnβt test them and (esp on the Julia) some of the syntax may be off
Is there a way to do the stripping in Julia, but without creating a new file? Maybe one can modify the stream before it is fed to the DataFrame reader.
Hello,
I am afraid that I have the same issue. I have a table with multiple spaces separating columns (not necessarily fixed size):
shell> head merlin.err
FAMILY PERSON MARKER RATIO
A EP01223 rs710865 0.0214
A EP01227 rs11249215 0.0107
A EP01228 rs11249215 0.00253
A EP01228 rs10903129 0.0116
The function readtable in DataFrames is able to read this table by treating repeated spaces as delimiter:
julia> using DataFrames
julia> dat=readtable("merlin.err",header=true, separator=' ')
WARNING: readtable is deprecated, use CSV.read from the CSV package instead
944Γ4 DataFrames.DataFrame
β Row β FAMILY β PERSON β MARKER β RATIO β
βββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌββββββββββ€
β 1 β A β EP01223 β rs710865 β 0.0214 β
β 2 β A β EP01227 β rs11249215 β 0.0107 β
β 3 β A β EP01228 β rs11249215 β 0.00253 β
β 4 β A β EP01228 β rs10903129 β 0.0116 β
But I cannot do this with CSV.read, even with the ignorerepeated option:
julia> using CSV
julia> dat = CSV.read("merlin.err",header=true, delim=' ', ignorerepeated=true)
941Γ5 DataFrames.DataFrame
β Row β Column1 β FAMILY β PERSON β MARKER β RATIO β
β β Stringβ° β Stringβ° β Stringβ° β Stringβ° β Float64β° β
βββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββββΌβββββββββββ€
β 1 β missing β A β EP01223 β rs710865 β 0.0214 β
β 2 β missing β A β EP01227 β rs11249215 β 0.0107 β
β 3 β missing β A β EP01228 β rs11249215 β 0.00253 β
β 4 β missing β A β EP01228 β rs10903129 β 0.0116 β
I am in julia 1.0.0, with CSV v0.4.3.
I should add that it is not just the first column with all missing values. The first column has some non-missing values as well, and other columns have missing values too (missing values caused by repeated spaces).
Any help is appreciated, thanks!!
Arghβ¦ I remember having to move heaven and earth to get a PR for readtable() accepted that interpreted multiple whitespace characters as a single delimiter, as in the cases above, and has been the custom in several decades of unix text file tables. Now readtable() is deprecated and CSV.read() shows the same old behavior, which IMHO is a form of regressionβ¦