Reading text data: `readdlm` is deprecated, so how is the CSV package used?

In the issue tracker of DelimitedFiles, I found this comment about reading text data files:

readdlm is also effectively deprecated and the CSV package should be used

I’m trying to read a plain text file listing numbers delimited by newlines and spaces.

Does somebody know where to find a simple tutorial to use the CSV package to do this?

I just blindly tried the CSV package. I got an object containing objects like Vector{CSV.Row}, and I didn’t know how to check whether my read had been successful.
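For the record, this is roughly what I tried; the inspection calls are a sketch I pieced together only later:

using CSV
f = CSV.File("my-datafile.txt")
length(f)                 # number of rows parsed
propertynames(first(f))   # the column names CSV detected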

The data file includes comment lines indicated by “#”. The file can be read with

using DelimitedFiles
a = readdlm("my-datafile.txt"; comments=true)
a[2,5] # -> the number at row 2, column 5

This is a very simple interface. You just get a 2D array. How does one use the CSV package to do this?

Aside: The data file I’m trying to read turns out to include the Unicode “zero width no-break space” (U+FEFF), which readdlm() fails to handle. So I was trying to do something about that when I found the deprecation comment above.
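This is what I had started to sketch for the U+FEFF problem before finding that comment (untested beyond my one file):

using DelimitedFiles
raw = String(read("my-datafile.txt"))
clean = replace(raw, '\ufeff' => "")        # drop every zero width no-break space
a = readdlm(IOBuffer(clean); comments=true)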

The first step to try blindly is probably this:

using CSV, DataFrames

myfile = "infile.txt"
df = CSV.read(myfile, DataFrame)

And see what df contains or what warnings come up.
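If the read succeeds, a few standard DataFrames calls help to check the result (a quick sketch):

size(df)        # (number of rows, number of columns)
first(df, 5)    # the first five rows
describe(df)    # per-column summary: eltype, extrema, missing count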

The corresponding starting point in the docs is here: Home · CSV.jl; read downwards from the sentence “That’s quite a bit! Let’s boil down a TL;DR:”


Thanks! I’ve made some progress using that. But I got stuck on delimiters. According to the thread (from 2021) which I quote at the end of this message, you have to preprocess the input text file if it uses multiple delimiters. Is that still true today? I thought it was quite usual to be able to specify a Regex, along the lines of

   df = CSV.read(myfile, DataFrame; delim=r"\s+") # any sequence of "space" characters

so that any nonzero-length sequence of “space” characters ( \s ) is treated as a single delimiter.
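For reference, this is the behaviour I have in mind, illustrated with Base’s split:

julia> split("1.0\t 2.0   3.0", r"\s+")
3-element Vector{SubString{String}}:
 "1.0"
 "2.0"
 "3.0"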

Currently, CSV.read() isn’t able to “guess” the number of columns in my data file, presumably because the file uses a mixture of tabs and spaces. readdlm() detects the delimiters correctly.

You are right; this is a bit disappointing. My guess is that the expected convenience was dropped in favor of performance.

At least, treating consecutive whitespace as a single column delimiter should be an option; or, more generally, treating runs of any character from a given string or list of characters. A regex may be too much to ask. Perhaps you could open an issue with this request; in fact there is already one, and you can support the feature request linked below by adding a comment to it.

From the discussion you quoted, the best solution is to do the editing on the fly in Julia:

df = read(myfile) |>                                             # file as raw bytes
     x -> map!(c -> c == UInt8('\t') ? UInt8(' ') : c, x, x) |>  # tabs -> spaces, in place
     x -> CSV.read(x, DataFrame; delim=" ", ignorerepeated=true)
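Spelled out without the pipe chain, the same thing reads:

bytes = read(myfile)
map!(c -> c == UInt8('\t') ? UInt8(' ') : c, bytes, bytes)
df = CSV.read(bytes, DataFrame; delim=" ", ignorerepeated=true)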

By the way, do you know how to tell CSV.read to ignore a trailing delimiter at the end of a line? Without removing it, I get an extra column full of missings.

As a workaround, because I don’t know how to apply two filters to a single stream, I first replace tabs with spaces and then remove the line-ending spaces, like this:

  b = replace(b, r" +\n" => "\n")

(though I don’t know whether this will work if the text file uses the DOS line ending \r\n).
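Maybe a capture group would preserve either line ending; just a sketch, I haven’t tested it on a DOS-style file:

b = replace(b, r"[ \t]+(\r?\n)" => s"\1")   # strip trailing blanks, keep \n or \r\n as-is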

Perhaps the real problem is that the above solution does not work on strings but on a Vector{UInt8} (an array of bytes), because that’s what read returns into the pipe. Hence the replacement with map!.

To insert filters which work on strings, you can do:

read(myfile) |> x -> join(Array{Char}(x))   # byte-by-byte Char conversion; assumes ASCII content

This outputs the whole file as one (probably large) string to the REPL.
For CSV.read, we again need the Vector{UInt8}:

collect(UInt8, "line1\nline2")
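As an aside, assuming UTF-8 text, the round trip can also be written without the Char array:

s = String(copy(x))          # bytes -> String (copy, because String may take ownership of the vector)
x2 = collect(codeunits(s))   # String -> Vector{UInt8} again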

Putting it all together:

df = read(myfile) |>                                             # file as raw bytes
     x -> map!(c -> c == UInt8('\t') ? UInt8(' ') : c, x, x) |>  # tabs -> spaces, in place
     x -> join(Array{Char}(x)) |>                                # bytes -> String
     x -> replace(x, r"\s+\n" => "\n") |>                        # drop trailing blanks per line
     x -> collect(UInt8, x) |>                                   # String -> bytes again
     x -> CSV.read(x, DataFrame; delim=" ", ignorerepeated=true)

This thread has moved into something quite theoretical by iterating on the starting problem while sticking to the starting solution and enhancing it step by step. I am not so happy with the result now, and I am not sure whether it is still something that performs and scales well.

On the other hand, cleaning up the original source data file manually in an editor is not a good solution either. It’s error-prone, can’t be repeated or undone, is opaque, tedious, … definitely not something I would recommend.

Hopefully, if other people see this, they will correct me and provide something more appropriate for your kind of dirty data.


Note that tabular IO packages in Julia (including CSV) are very flexible in terms of reading into different data structures.
You don’t generally need something as heavy as DataFrames just to read tables/matrices and work with them:

using CSV, Tables

CSV.read("file.txt", Tables.matrix)  # returns a plain 2d matrix
CSV.read("file.txt", columntable)  # returns a simple columnar table: namedtuple of vectors
CSV.read("file.txt", rowtable)  # returns a simple row-table: vector of namedtuples

Try:
stripwhitespace=true

For example:

df = CSV.read(IOBuffer(replace(read(inputfile), UInt8('\t') => UInt8(' '))), DataFrame;
     delim=' ', stripwhitespace=true, ignorerepeated=true, comment="#")

Thank you all for your help!

I confirm that stripwhitespace=true removes spaces at the end of the line.

I also confirm that Tables.matrix best fits my needs right now. (I just want to access the elements as df[3, 5].)

So, the final solution, with the tab replacement from above filled in, is

df = CSV.read(IOBuffer(replace(read(myfile), UInt8('\t') => UInt8(' '))), Tables.matrix;
     delim=' ', stripwhitespace=true, ignorerepeated=true, comment="#")
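Wrapped into a small helper for reuse (the function name is just my choice):

using CSV, Tables

function read_table(path)
    bytes = replace(read(path), UInt8('\t') => UInt8(' '))   # tabs -> spaces
    CSV.read(IOBuffer(bytes), Tables.matrix;
             delim=' ', stripwhitespace=true, ignorerepeated=true, comment="#")
end

a = read_table("my-datafile.txt")
a[2, 5]   # row 2, column 5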

So, as @oheil says, it would be nice if multiple characters could be specified as delim. Then CSV.read() would be as convenient as readdlm() for this kind of tabulated, space-delimited text data file.

I did add a comment to the GitHub issue @oheil mentioned.


[Aside] CSV.jl, as its name suggests, is designed for CSV and CSV-like data files, which aren’t optimized for viewing on the computer screen. I often need to use Excel or a similar application just to view a CSV file.

On the other hand, I often encounter tabulated text data files, which are designed to be intelligible on the command terminal.

I don’t know how those tab characters got into the data file my colleague gave me, but the file is perfectly readable with just

$ less thedatafile.txt

on the command terminal.
