Bug with DelimitedFiles.readdlm when header=true

Hello,

Thanks for providing an easy and fast way to read delimited files.

When I use header=true as an option, the output Data skips 2 lines instead of 1, which is a bug.

Data, Header = DelimitedFiles.readdlm(path, ',', header=true)

I filed issue #39831.

Many thanks for providing a fix.
Joseph

cf. header=true, skips 2 lines · Issue #39831 · JuliaLang/julia · GitHub

As I mentioned there, it’d really help if you could provide a small example CSV file that reproduces this behavior.

Thanks, Stilly, for looking at the problem:

The input file which causes the problem is attached:
CSVfile
The input file has the following format:

Id , SELECT_1
11122911000031011, 1
11122911000031012, 1

The code

Data, Header = DelimitedFiles.readdlm(path, ',', header=true)

println(Data[1,:])   # gives 11122911000031012

which is not what is expected.

Hope that helps to debug,
Joseph

I can see that a problem might arise if the data is treated as floating point. I am able to read the file in as integers:

julia> Data, Header= DelimitedFiles.readdlm("Smap_Id_Select.csv", ',' , Int, header=true);

julia> println(Int.(Data[1,:]))
[11122911000031011, 1]

whereas when read as floating point:

julia> Data, Header= DelimitedFiles.readdlm("Smap_Id_Select.csv", ',' , header=true);

julia> println(Int.(Data[1,:]))
[11122911000031012, 1]

I won’t say that this is a DelimitedFiles bug exactly; it is rather a floating-point artifact.
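
To illustrate where the off-by-one comes from (a minimal check, nothing specific to readdlm): the Id is larger than 2^53, so it is not exactly representable as a Float64 and rounds to the nearest representable integer:

julia> 11122911000031011 > 2^53   # above 2^53 not every integer has an exact Float64
true

julia> Float64(11122911000031011)  # rounds to the nearest representable value
1.1122911000031012e16

julia> Int(Float64(11122911000031011))
11122911000031012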

Thanks for investigating. In this case I have only Int64, but what happens if Id is an integer and the data is Float64?

My workaround is to avoid using header=true, as follows:

# Read data
Data = DelimitedFiles.readdlm("Smap_Id_Select.csv", ',')

# Read header
Header = Data[1, 1:end]

# Remove first row
Data = Data[1:end .≠ 1, 1:end]

Such rounding might be happening in more places than just the first row, so removing the first row is not the solution. If you have data of mixed types, then it is best to read it in by specifying the element type to be Any.

julia> Data, Header= DelimitedFiles.readdlm("Smap_Id_Select.csv", ',' , Any, header=true);

julia> Data[1:2, :]
2×2 Array{Any,2}:
 11122911000031011  1
 11122911000031012  1

julia> Data[1,1] |> typeof
Int64

For more sophisticated cases you may consider using a DataFrame:

julia> using CSV, DataFrames

julia> df = DataFrame(CSV.File("Smap_Id_Select.csv"))
16281×2 DataFrame
   Row │ Id                 SELECT_1 
       │ Int64              Int64    
───────┼─────────────────────────────
     1 │ 11122911000031011         1
     2 │ 11122911000031012         1
     3 │ 11122911000031013         1
     4 │ 11122911000031014         1
[...]

Some further explanation is required.
When I remove the first row, I am removing the row which has the headings.

What is the fastest way to read .csv => Array and not DataFrames?

The issue with specifying header=true but no element type is that, by default, the type is assumed to be Float64, and you get rounding errors on the column 1 data.

In your example this can be fixed by specifying header=true and the element type Int for the data.

But …

In this case I have only Int64, but what happens if Id is an integer and the data is Float64?

Well, here you can specify header=true with type Any.
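
For example, with a hypothetical mixed.csv that has an integer Id column and a Float64 value column (a small sketch, assuming DelimitedFiles is loaded), each cell keeps its own type:

julia> write("mixed.csv", "Id,VALUE_1\n11122911000031011,1.5\n11122911000031012,2.5\n");  # hypothetical file for illustration

julia> Data, Header = DelimitedFiles.readdlm("mixed.csv", ',', Any, header=true);

julia> typeof.(Data[1, :])
2-element Array{DataType,1}:
 Int64
 Float64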

Or use CSV, which can do a better job of inferring column types, or you can explicitly specify them.

Your solution of reading without header=true and manually slicing out the first row works because the element type is automatically inferred as Any, since the first row (the header) contains strings while the data rows are numeric.
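
You can check this quickly (assuming the same file):

julia> eltype(DelimitedFiles.readdlm("Smap_Id_Select.csv", ','))
Any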

BTW, I think you can simplify your slicing. Instead of

# Read header
Header = Data[1, 1:end]

# Remove first row
Data = Data[1:end .≠ 1, 1:end]

you can write

Header = Data[1, :]
Data = Data[2:end, :]

What is the fastest way to read .csv => Array and not DataFrames?

Check out CSV.jl:
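
For instance, here is a sketch (assuming the same file) using CSV.File together with Tables.matrix to get a plain Matrix rather than a DataFrame:

julia> using CSV, Tables

julia> M = Tables.matrix(CSV.File("Smap_Id_Select.csv"));   # Matrix{Int64} for this file, since both columns are Int

julia> M[1, 1]
11122911000031011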

I would like to thank you for providing DelimitedFiles. Since it is part of the core Julia distribution, I expect DelimitedFiles to work as expected, with no surprises.

To my understanding, nowhere in the DelimitedFiles documentation is it written that one must pass Any, or else one might not get the expected results.

From my perspective, DelimitedFiles has a bug which needs to be fixed so that users get the results the documentation promises.

Once again thanks for providing a free tool,
Joseph

My comment is that if DelimitedFiles is outdated, it may be a good idea to replace it with CSV.jl. I understand that there must be some built-in tools to easily convert a DataFrame into an Array.
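
For what it's worth, that conversion is a one-liner; a sketch, assuming the df from the CSV example above:

julia> using DataFrames

julia> A = Matrix(df);   # converts the DataFrame to a plain Array (Matrix{Int64} for this file)

julia> A[1, 1]
11122911000031011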