Reading text data: `readdlm` is deprecated, so how is the CSV package used?

In the issue tracker of DelimitedFiles, I found this comment about reading text data files:

readdlm is also effectively deprecated and the CSV package should be used

I’m trying to read a plain text file listing numbers delimited by newlines and spaces.

Does somebody know where to find a simple tutorial to use the CSV package to do this?

I just blindly tried the CSV package. I got an object containing objects like Vector{CSV.Row}, and I didn’t know how to check whether my read had been successful.
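For the record, this is roughly what I tried; the inspection calls are a sketch I pieced together only later:

using CSV
f = CSV.File("my-datafile.txt")
length(f)                 # number of rows parsed
propertynames(first(f))   # the column names CSV detected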

The data file includes comment lines indicated by “#”. The file can be read with

using DelimitedFiles
a = readdlm("my-datafile.txt"; comments=true)
a[2,5] # -> the number at row 2, column 5

This is a very simple interface. You just get a 2D array. How does one use the CSV package to do this?

Aside: The data file I’m trying to read turns out to include the Unicode “zero width no-break space” (U+FEFF), which readdlm() fails to handle. So I was trying to do something about that when I found the deprecation comment above.
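This is what I had started to sketch for the U+FEFF problem before finding that comment (untested beyond my one file):

using DelimitedFiles
raw = String(read("my-datafile.txt"))
clean = replace(raw, '\ufeff' => "")        # drop every zero width no-break space
a = readdlm(IOBuffer(clean); comments=true)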

The first step to try blindly is probably this:

using CSV, DataFrames

myfile = "infile.txt"
df = CSV.read(myfile, DataFrame)

And see what df contains or what warnings come up.
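If the read succeeds, a few standard DataFrames calls help to check the result (a quick sketch):

size(df)        # (number of rows, number of columns)
first(df, 5)    # the first five rows
describe(df)    # per-column summary: eltype, extrema, missing count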

The corresponding starting point in the docs is here: Home · CSV.jl; read downwards from the sentence “That’s quite a bit! Let’s boil down a TL;DR:”


Thanks! I’ve made some progress using that. But I got stuck on delimiters. According to the thread (from 2021) which I quote at the end of this message, you have to preprocess the input text file if it uses multiple delimiters. Is that still true today? I thought it was quite usual to be able to specify a Regex, along the lines of

   df = CSV.read(myfile, DataFrame; delim=r"\s+") # any sequence of "space" characters

so that any nonzero-length sequence of “space” characters ( \s ) is treated as a single delimiter.
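For reference, this is the behaviour I have in mind, illustrated with Base’s split:

julia> split("1.0\t 2.0   3.0", r"\s+")
3-element Vector{SubString{String}}:
 "1.0"
 "2.0"
 "3.0"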

Currently, CSV.read() isn’t able to “guess” the number of columns in my data file, presumably because the file uses a mixture of tabs and spaces. readdlm() detects the delimiters correctly.

You are right; this is a bit disappointing. My guess is that the expected convenience was dropped in favor of performance.

At least, treating consecutive whitespace as a single column delimiter should be an option; or, more generally, treating runs of any character from a given string or list of characters. A regex may be too much to ask. Perhaps you could open an issue with this request; in fact there is already one, and you can support the feature request linked below by adding a comment to it.

From the discussion you quoted, the best solution is to do the editing on the fly in Julia:

df = read(myfile) |>                                             # file as raw bytes
     x -> map!(c -> c == UInt8('\t') ? UInt8(' ') : c, x, x) |>  # tabs -> spaces, in place
     x -> CSV.read(x, DataFrame; delim=" ", ignorerepeated=true)
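Spelled out without the pipe chain, the same thing reads:

bytes = read(myfile)
map!(c -> c == UInt8('\t') ? UInt8(' ') : c, bytes, bytes)
df = CSV.read(bytes, DataFrame; delim=" ", ignorerepeated=true)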

By the way, do you know how to tell CSV.read to ignore a trailing delimiter at the end of a line? Without removing it, I get an extra column full of missings.

As a workaround, because I don’t know how to apply two filters to a single stream, I first replace tabs with spaces and then remove the line-ending spaces, like this:

  b = replace(b, r" +\n" => "\n")

(though I don’t know whether this will work if the text file uses the DOS line ending \r\n).
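Maybe a capture group would preserve either line ending; just a sketch, I haven’t tested it on a DOS-style file:

b = replace(b, r"[ \t]+(\r?\n)" => s"\1")   # strip trailing blanks, keep \n or \r\n as-is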

Perhaps the real problem is that the above solution does not work on strings but on a Vector{UInt8} (an array of bytes), because that’s what read returns into the pipe. Hence the replacement with map!.

To insert filters which work on strings, you can do:

read(myfile) |> x -> join(Array{Char}(x))   # byte-by-byte Char conversion; assumes ASCII content

This outputs the whole file as one (probably large) string to the REPL.
For CSV.read, we again need the Vector{UInt8}:

collect(UInt8, "line1\nline2")
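As an aside, assuming UTF-8 text, the round trip can also be written without the Char array:

s = String(copy(x))          # bytes -> String (copy, because String may take ownership of the vector)
x2 = collect(codeunits(s))   # String -> Vector{UInt8} again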

Putting it all together:

df = read(myfile) |>                                             # file as raw bytes
     x -> map!(c -> c == UInt8('\t') ? UInt8(' ') : c, x, x) |>  # tabs -> spaces, in place
     x -> join(Array{Char}(x)) |>                                # bytes -> String
     x -> replace(x, r"\s+\n" => "\n") |>                        # drop trailing blanks per line
     x -> collect(UInt8, x) |>                                   # String -> bytes again
     x -> CSV.read(x, DataFrame; delim=" ", ignorerepeated=true)

This thread has moved into something quite theoretical by iterating on the starting problem while sticking to the starting solution and enhancing it step by step. I am not so happy with the result now, and I am not sure whether it is still something that performs and scales well.

On the other hand, cleaning up the original source data file manually in an editor is not a good solution either. It’s error-prone, can’t be repeated or undone, is opaque, tedious, … definitely not something I would recommend.

Hopefully, if other people see this, they will correct me and provide something more appropriate for your kind of dirty data.


Note that tabular IO packages in Julia (including CSV) are very flexible in terms of reading into different data structures.
You don’t generally need something as heavy as DataFrames just to read tables/matrices and work with them:

using CSV, Tables

CSV.read("file.txt", Tables.matrix)  # returns a plain 2d matrix
CSV.read("file.txt", columntable)  # returns a simple columnar table: namedtuple of vectors
CSV.read("file.txt", rowtable)  # returns a simple row-table: vector of namedtuples

Try:
stripwhitespace=true

For example:

df = CSV.read(IOBuffer(replace(read(inputfile), UInt8('\t') => UInt8(' '))), DataFrame;
     delim=' ', stripwhitespace=true, ignorerepeated=true, comment="#")

Thank you all for your help!

I confirm that stripwhitespace=true removes spaces at the end of the line.

I also confirm that Tables.matrix best fits my needs right now. (I just want to access the elements as df[3, 5].)

So, the final solution, with the tab replacement from above filled in, is

df = CSV.read(IOBuffer(replace(read(myfile), UInt8('\t') => UInt8(' '))), Tables.matrix;
     delim=' ', stripwhitespace=true, ignorerepeated=true, comment="#")
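Wrapped into a small helper for reuse (the function name is just my choice):

using CSV, Tables

function read_table(path)
    bytes = replace(read(path), UInt8('\t') => UInt8(' '))   # tabs -> spaces
    CSV.read(IOBuffer(bytes), Tables.matrix;
             delim=' ', stripwhitespace=true, ignorerepeated=true, comment="#")
end

a = read_table("my-datafile.txt")
a[2, 5]   # row 2, column 5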

So, as @oheil says, it would be nice if multiple characters could be specified as delim. Then CSV.read() would be as convenient as readdlm() for this kind of tabulated, space-delimited text data file.

I did add a comment to the GitHub issue @oheil mentioned.


[Aside] CSV.jl, as its name suggests, is designed for CSV and CSV-like data files, which aren’t optimized for viewing on the computer screen. I often need to use Excel or a similar application just to view a CSV file.

On the other hand, I often encounter tabulated text data files, which are designed to be intelligible on the command terminal.

I don’t know how those tab characters got into the data file my colleague gave me, but the file is perfectly readable with just

$ less thedatafile.txt

on the command terminal.
