Bug with DelimitedFiles.readdlm when header=true

Hello,

Thanks for providing an easy and fast way to read delimited files.

When I use header=true as an option, the output Data skips 2 lines instead of 1, which is a bug.

Data, Header = DelimitedFiles.readdlm(path, ',', header=true)

I filed issue #39831.

Many thanks for providing a fix.
Joseph

cf. header=true, skips 2 lines · Issue #39831 · JuliaLang/julia · GitHub

As I mentioned there, it’d really help if you could provide a small example CSV file that reproduces this behavior.

Thanks, Stilly, for looking at the problem:

The input file which causes the problem is attached:
CSVfile
The input file has the following format:

Id , SELECT_1
11122911000031011, 1
11122911000031012, 1

The code

Data, Header = DelimitedFiles.readdlm(path, ',', header=true)

println(Data[1,:])   # gives 11122911000031012

which is not what is expected.

Hope that helps to debug,
Joseph

I can see that a problem might arise if the data is treated as floating point. I am able to read the file in as integers:

julia> Data, Header= DelimitedFiles.readdlm("Smap_Id_Select.csv", ',' , Int, header=true);

julia> println(Int.(Data[1,:]))
[11122911000031011, 1]

whereas when read as floating point:

julia> Data, Header= DelimitedFiles.readdlm("Smap_Id_Select.csv", ',' , header=true);

julia> println(Int.(Data[1,:]))
[11122911000031012, 1]

I won’t say that this is a DelimitedFiles bug exactly; it is rather a floating-point artifact.
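
To illustrate where the off-by-one comes from (a minimal check, nothing specific to readdlm): the Id is larger than 2^53, so it is not exactly representable as a Float64 and rounds to the nearest representable integer:

julia> 11122911000031011 > 2^53   # above 2^53 not every integer has an exact Float64
true

julia> Float64(11122911000031011)  # rounds to the nearest representable value
1.1122911000031012e16

julia> Int(Float64(11122911000031011))
11122911000031012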

Thanks for investigating. In this case I have only Int64, but what happens if Id is an integer and the data is Float64?

My workaround is to avoid using header=true, as follows:

# Read data
Data = DelimitedFiles.readdlm("Smap_Id_Select.csv", ',')

# Read header
Header = Data[1, 1:end]

# Remove first row
Data = Data[1:end .≠ 1, 1:end]

Such rounding might be happening in more places than just the first row, so removing the first row is not the solution. If you have data of mixed types, then it is best to read it in by specifying the element type to be Any.

julia> Data, Header= DelimitedFiles.readdlm("Smap_Id_Select.csv", ',' , Any, header=true);

julia> Data[1:2, :]
2×2 Array{Any,2}:
 11122911000031011  1
 11122911000031012  1

julia> Data[1,1] |> typeof
Int64

For more sophisticated cases you may consider using a DataFrame:

julia> using CSV, DataFrames

julia> df = DataFrame(CSV.File("Smap_Id_Select.csv"))
16281×2 DataFrame
   Row │ Id                 SELECT_1 
       │ Int64              Int64    
───────┼─────────────────────────────
     1 │ 11122911000031011         1
     2 │ 11122911000031012         1
     3 │ 11122911000031013         1
     4 │ 11122911000031014         1
[...]

Some further explanation is required.
When I remove the first row, I am removing the row which has the headings.

What is the fastest way to read .csv => Array and not DataFrames?

The issue with specifying header=true but no element type is that, by default, the type is assumed to be Float64, and you get rounding errors on the column 1 data.

In your example this can be fixed by specifying header=true and the element type Int for the data.

But …

In this case I have only Int64, but what happens if Id is an integer and the data is Float64?

Well, here you can specify header=true with type Any.
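
For example, with a hypothetical mixed.csv that has an integer Id column and a Float64 value column (a small sketch, assuming DelimitedFiles is loaded), each cell keeps its own type:

julia> write("mixed.csv", "Id,VALUE_1\n11122911000031011,1.5\n11122911000031012,2.5\n");  # hypothetical file for illustration

julia> Data, Header = DelimitedFiles.readdlm("mixed.csv", ',', Any, header=true);

julia> typeof.(Data[1, :])
2-element Array{DataType,1}:
 Int64
 Float64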

Or use CSV, which can do a better job of inferring column types, or you can explicitly specify them.

Your solution of reading without header=true and manually slicing out the first row works because the element type is automatically inferred as Any, since the first row (the header) contains strings while the data rows are numeric.
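
You can check this quickly (assuming the same file):

julia> eltype(DelimitedFiles.readdlm("Smap_Id_Select.csv", ','))
Any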

BTW, I think you can simplify your slicing. Instead of

# Read header
Header = Data[1, 1:end]

# Remove first row
Data = Data[1:end .≠ 1, 1:end]

you can write

Header = Data[1, :]
Data = Data[2:end, :]

What is the fastest way to read .csv => Array and not DataFrames?

Check out CSV.jl:
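
For instance, here is a sketch (assuming the same file) using CSV.File together with Tables.matrix to get a plain Matrix rather than a DataFrame:

julia> using CSV, Tables

julia> M = Tables.matrix(CSV.File("Smap_Id_Select.csv"));   # Matrix{Int64} for this file, since both columns are Int

julia> M[1, 1]
11122911000031011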

I would like to thank you for providing DelimitedFiles. Since it is part of the core Julia distribution, I expect DelimitedFiles to work as expected, with no surprises.

To my understanding, nowhere in the DelimitedFiles documentation is it written that one must pass Any, or else one might not get the expected results.

From my perspective, DelimitedFiles has a bug which needs to be fixed so that users get the results the documentation promises.

Once again thanks for providing a free tool,
Joseph

My comment is that if DelimitedFiles is outdated, it may be a good idea to replace it with CSV.jl. I understand that there must be some built-in tools to easily convert a DataFrame into an Array.
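
For what it's worth, that conversion is a one-liner; a sketch, assuming the df from the CSV example above:

julia> using DataFrames

julia> A = Matrix(df);   # converts the DataFrame to a plain Array (Matrix{Int64} for this file)

julia> A[1, 1]
11122911000031011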