Is it possible in CSV.jl, DelimitedFiles.jl, or other, to read text files that contain both spaces & tabs, in the header and/or data sections?
An example is provided below.
NB: the hidden spaces and tabs should be there after copy and paste, also added row with missing value
Col1 Col2 Col3
1 1 012
2 1 013
2 1 015
Currently, the workaround used is to replace all tabs by spaces in a text editor and then read the file using CSV.jl:
using CSV, DataFrames
df = CSV.read(file, header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
readdlm does so by default?
julia> using DelimitedFiles
"Col1" "Col2" "Col3"
1 1 12
2 1 13
@GunnarFarneback, thanks for your response. The result is a matrix of Any that needs further parsing in order to use the convenience tools in DataFrames and DataFramesMeta to handle missings, etc. (my input example was too simple).
The following code (adapted from here) converts the matrix of Any into a dataframe, allowing further processing as indicated.
It does not seem to be as “efficient” as CSV.jl (where there is a lot of machinery for handling types, etc.) but it works and will mark your response as a solution, if no other ideas pop up.
M = readdlm(file)
header = M[1,:]
data = M[2:end,:]
df = DataFrame(data,:auto)
rename!(df, names(df) .=> Symbol.(header[:]))
You can do the replacement in Julia without too much fuzz:
df = CSV.read(IOBuffer(replace(read("/tmp/testfile.txt"), UInt8('\t') => UInt8(' '))), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
Gunnar, this is really brilliant.
No clue how you managed to implement the replacement operation using the IOBuffer.
The docs mention such intermediate operations only vaguely. Your example should be part of the docs, IMHO.
There’s nothing deep going on here. If split into intermediate results:
a = read(file); # Read a full file into a UInt8 vector
b = replace(a, UInt8('\t') => UInt8(' ')) # Replace tabs by spaces
c = IOBuffer(b) # IOBuffer may optionally operate on a pre-existing array
So, the file is actually read “twice”? Once into IOBuffer memory and then again from the memory buffer with CSV.read()?
Unless you’re sure the file is ASCII, you should probably do
a = read(file, String)
b = replace(a, '\t' => ' ')
c = IOBuffer(b)
@cjdoris, how do you plug-it in the one-liner above?
ERROR: MethodError: no method matching findnext(::UInt8, ::String, ::Int64)
Can you show us what you tried?
It reads it more than two times, since
replace also generate intermediate array.
If you want to avoid it, then you should better use
a = read(file)
map!(c -> c == UInt8('\t') ? UInt8(' ') : c, a, a)
df = CSV.read(IOBuffer(a), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
@cjdoris, sorry messed it up here. It works perfectly
Thank you, very good to know.
Should we use piping to adapt your solution to Gunnar’s one-liner?
df = CSV.read(IOBuffer(read(file) |> x->map!(c-> c==UInt8('\t') ? UInt8(' ') : c, x, x)), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
I also saw such problems that need a simple preprocessing of a file before feeding it to CSV or another reader library, and also had to resort to similar workarounds.
It should be possible to implement general functions like
map_stream_bytes(::Function, io) -> io,
filter_... as well): they would take and return IO streams without reading all the content to memory at once. Or maybe something like
iostream_from_iterator_of_chars is even better.
Didn’t go as far as implementing these functions myself though.