Reading data text files delimited with both spaces & tabs

rafael.guerra · July 18, 2021, 10:49am

Is it possible in CSV.jl, DelimitedFiles.jl, or other, to read text files that contain both spaces & tabs, in the header and/or data sections?

An example is provided below.

NB: the hidden spaces and tabs should be there after copy and paste, also added row with missing value

Col1   Col2 	Col3 
1	1		012
2 	1		013
2 	1   
2 	1	    015

Currently, the workaround used is to replace all tabs by spaces in a text editor and then read the file using CSV.jl:

using CSV, DataFrames
df = CSV.read(file, header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)

Thank you.

GunnarFarneback · July 18, 2021, 11:17am

readdlm does so by default?

julia> using DelimitedFiles
       readdlm("/tmp/testfile.txt")
3×3 Matrix{Any}:
  "Col1"   "Col2"    "Col3"
 1        1        12
 2        1        13

rafael.guerra · July 18, 2021, 11:54am

@GunnarFarneback, thanks for your response. The result is a matrix of Any that needs further parsing in order to use the convenience tools in DataFrames and DataFramesMeta to handle missings, etc. (my input example was too simple).

The following code (adapted from here) converts the matrix of Any into a dataframe, allowing further processing as indicated.

It does not seem to be as “efficient” as CSV.jl (where there is a lot of machinery for handling types, etc.) but it works and will mark your response as a solution, if no other ideas pop up.

using DelimitedFiles
M = readdlm(file)
header = M[1,:]
data = M[2:end,:]
df = DataFrame(data,:auto)
rename!(df, names(df) .=> Symbol.(header[:]))

Thanks.

GunnarFarneback · July 18, 2021, 12:34pm

You can do the replacement in Julia without too much fuzz:

df = CSV.read(IOBuffer(replace(read("/tmp/testfile.txt"), UInt8('\t') => UInt8(' '))), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)

rafael.guerra · July 18, 2021, 12:47pm

Gunnar, this is really brilliant.

No clue how you managed to implement the replacement operation using the IOBuffer.

The docs mention such intermediate operations only vaguely. Your example should be part of the docs, IMHO.

Thanks again.

GunnarFarneback · July 18, 2021, 1:02pm

There’s nothing deep going on here. If split into intermediate results:

a = read(file);                             # Read a full file into a UInt8 vector
b = replace(a, UInt8('\t') => UInt8(' '))   # Replace tabs by spaces
c = IOBuffer(b)                             # IOBuffer may optionally operate on a pre-existing array

rafael.guerra · July 18, 2021, 1:08pm

So, the file is actually read “twice”? Once into IOBuffer memory and then again from the memory buffer with CSV.read()?

cjdoris · July 18, 2021, 3:16pm

Unless you’re sure the file is ASCII, you should probably do

a = read(file, String)
b = replace(a, '\t' => ' ')
c = IOBuffer(b)

rafael.guerra · July 18, 2021, 3:20pm

@cjdoris, how do you plug-it in the one-liner above?
Get error:

ERROR: MethodError: no method matching findnext(::UInt8, ::String, ::Int64)

cjdoris · July 18, 2021, 3:25pm

Can you show us what you tried?

Skoffer · July 18, 2021, 3:36pm

It reads it more than two times, since replace also generate intermediate array.
If you want to avoid it, then you should better use map!

a = read(file)
map!(c -> c == UInt8('\t') ? UInt8(' ') : c, a, a)
df = CSV.read(IOBuffer(a), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)

rafael.guerra · July 18, 2021, 4:11pm

@cjdoris, sorry messed it up here. It works perfectly

rafael.guerra · July 18, 2021, 4:28pm

Thank you, very good to know.

Should we use piping to adapt your solution to Gunnar’s one-liner?
E.g.:

df = CSV.read(IOBuffer(read(file) |> x->map!(c-> c==UInt8('\t') ? UInt8(' ') : c, x, x)), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)

aplavin · July 19, 2021, 4:54am

I also saw such problems that need a simple preprocessing of a file before feeding it to CSV or another reader library, and also had to resort to similar workarounds.
It should be possible to implement general functions like map_stream_bytes(::Function, io) -> io, map_stream_chars, map_stream_lines (and filter_... as well): they would take and return IO streams without reading all the content to memory at once. Or maybe something like iostream_to_iterator_of_chars/iostream_from_iterator_of_chars is even better.
Didn’t go as far as implementing these functions myself though.

Topic		Replies	Views
Ignore consecutive whitespaces with CSV.read(...) Data question	18	20722	October 8, 2019
How to read .tsv files in julia New to Julia	2	513	August 18, 2023
Question about DelimitedFiles Data question , package	1	319	July 30, 2020
Best practice for reading large matrix with repeated spaces as text General Usage dataframes , csv	8	495	April 20, 2023
Reading a txt. file with two different delimiters General Usage csv	5	2091	May 15, 2022

Reading data text files delimited with both spaces & tabs

Related topics