Reading data text files delimited with both spaces & tabs

,

Is it possible in CSV.jl, DelimitedFiles.jl, or other, to read text files that contain both spaces & tabs, in the header and/or data sections?

An example is provided below.

NB: the hidden spaces and tabs should be there after copy and paste, also added row with missing value

Col1   Col2 	Col3 
1	1		012
2 	1		013
2 	1   
2 	1	    015

Currently, the workaround used is to replace all tabs by spaces in a text editor and then read the file using CSV.jl:

using CSV, DataFrames
df = CSV.read(file, header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)

Thank you.

readdlm does so by default?

julia> using DelimitedFiles
       readdlm("/tmp/testfile.txt")
3Ă—3 Matrix{Any}:
  "Col1"   "Col2"    "Col3"
 1        1        12
 2        1        13
1 Like

@GunnarFarneback, thanks for your response. The result is a matrix of Any that needs further parsing in order to use the convenience tools in DataFrames and DataFramesMeta to handle missings, etc. (my input example was too simple).

The following code (adapted from here) converts the matrix of Any into a dataframe, allowing further processing as indicated.

It does not seem to be as “efficient” as CSV.jl (where there is a lot of machinery for handling types, etc.) but it works and will mark your response as a solution, if no other ideas pop up.

using DelimitedFiles
M = readdlm(file)
header = M[1,:]
data = M[2:end,:]
df = DataFrame(data,:auto)
rename!(df, names(df) .=> Symbol.(header[:]))

Thanks.

You can do the replacement in Julia without too much fuzz:

df = CSV.read(IOBuffer(replace(read("/tmp/testfile.txt"), UInt8('\t') => UInt8(' '))), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
5 Likes

Gunnar, this is really brilliant.

No clue how you managed to implement the replacement operation using the IOBuffer.

The docs mention such intermediate operations only vaguely. Your example should be part of the docs, IMHO.

Thanks again.

There’s nothing deep going on here. If split into intermediate results:

a = read(file);                             # Read a full file into a UInt8 vector
b = replace(a, UInt8('\t') => UInt8(' '))   # Replace tabs by spaces
c = IOBuffer(b)                             # IOBuffer may optionally operate on a pre-existing array
2 Likes

So, the file is actually read “twice”? Once into IOBuffer memory and then again from the memory buffer with CSV.read()?

Unless you’re sure the file is ASCII, you should probably do

a = read(file, String)
b = replace(a, '\t' => ' ')
c = IOBuffer(b)
2 Likes

@cjdoris, how do you plug-it in the one-liner above?
Get error:

ERROR: MethodError: no method matching findnext(::UInt8, ::String, ::Int64)

Can you show us what you tried?

It reads it more than two times, since replace also generate intermediate array.
If you want to avoid it, then you should better use map!

a = read(file)
map!(c -> c == UInt8('\t') ? UInt8(' ') : c, a, a)
df = CSV.read(IOBuffer(a), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
1 Like

@cjdoris, sorry messed it up here. It works perfectly :slight_smile:

Thank you, very good to know.

Should we use piping to adapt your solution to Gunnar’s one-liner?
E.g.:

df = CSV.read(IOBuffer(read(file) |> x->map!(c-> c==UInt8('\t') ? UInt8(' ') : c, x, x)), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
1 Like

I also saw such problems that need a simple preprocessing of a file before feeding it to CSV or another reader library, and also had to resort to similar workarounds.
It should be possible to implement general functions like map_stream_bytes(::Function, io) -> io, map_stream_chars, map_stream_lines (and filter_... as well): they would take and return IO streams without reading all the content to memory at once. Or maybe something like iostream_to_iterator_of_chars/iostream_from_iterator_of_chars is even better.
Didn’t go as far as implementing these functions myself though.

1 Like