Is it possible in CSV.jl, DelimitedFiles.jl, or other, to read text files that contain both spaces & tabs, in the header and/or data sections?
An example is provided below.
NB: the hidden spaces and tabs should be there after copy and paste, also added row with missing value
Col1 Col2 Col3
1 1 012
2 1 013
2 1
2 1 015
Currently, the workaround used is to replace all tabs by spaces in a text editor and then read the file using CSV.jl:
using CSV, DataFrames
df = CSV.read(file, header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
Thank you.
readdlm
does so by default?
julia> using DelimitedFiles
readdlm("/tmp/testfile.txt")
3Ă—3 Matrix{Any}:
"Col1" "Col2" "Col3"
1 1 12
2 1 13
1 Like
@GunnarFarneback, thanks for your response. The result is a matrix of Any that needs further parsing in order to use the convenience tools in DataFrames and DataFramesMeta to handle missings, etc. (my input example was too simple).
The following code (adapted from here) converts the matrix of Any into a dataframe, allowing further processing as indicated.
It does not seem to be as “efficient” as CSV.jl (where there is a lot of machinery for handling types, etc.) but it works and will mark your response as a solution, if no other ideas pop up.
using DelimitedFiles
M = readdlm(file)
header = M[1,:]
data = M[2:end,:]
df = DataFrame(data,:auto)
rename!(df, names(df) .=> Symbol.(header[:]))
Thanks.
You can do the replacement in Julia without too much fuzz:
df = CSV.read(IOBuffer(replace(read("/tmp/testfile.txt"), UInt8('\t') => UInt8(' '))), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
5 Likes
Gunnar, this is really brilliant.
No clue how you managed to implement the replacement operation using the IOBuffer.
The docs mention such intermediate operations only vaguely. Your example should be part of the docs, IMHO.
Thanks again.
There’s nothing deep going on here. If split into intermediate results:
a = read(file); # Read a full file into a UInt8 vector
b = replace(a, UInt8('\t') => UInt8(' ')) # Replace tabs by spaces
c = IOBuffer(b) # IOBuffer may optionally operate on a pre-existing array
2 Likes
So, the file is actually read “twice”? Once into IOBuffer memory and then again from the memory buffer with CSV.read()?
Unless you’re sure the file is ASCII, you should probably do
a = read(file, String)
b = replace(a, '\t' => ' ')
c = IOBuffer(b)
2 Likes
@cjdoris, how do you plug-it in the one-liner above?
Get error:
ERROR: MethodError: no method matching findnext(::UInt8, ::String, ::Int64)
Can you show us what you tried?
It reads it more than two times, since replace
also generate intermediate array.
If you want to avoid it, then you should better use map!
a = read(file)
map!(c -> c == UInt8('\t') ? UInt8(' ') : c, a, a)
df = CSV.read(IOBuffer(a), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
1 Like
@cjdoris, sorry messed it up here. It works perfectly
Thank you, very good to know.
Should we use piping to adapt your solution to Gunnar’s one-liner?
E.g.:
df = CSV.read(IOBuffer(read(file) |> x->map!(c-> c==UInt8('\t') ? UInt8(' ') : c, x, x)), header=1, delim=" ", ignorerepeated=true, type=Int64, DataFrame)
1 Like
I also saw such problems that need a simple preprocessing of a file before feeding it to CSV or another reader library, and also had to resort to similar workarounds.
It should be possible to implement general functions like map_stream_bytes(::Function, io) -> io
, map_stream_chars
, map_stream_lines
(and filter_...
as well): they would take and return IO streams without reading all the content to memory at once. Or maybe something like iostream_to_iterator_of_chars
/iostream_from_iterator_of_chars
is even better.
Didn’t go as far as implementing these functions myself though.
1 Like