CSV.jl fails when reading multiple inputs at once if variable name "name" occurs

Hi there,

I am importing multiple CSV files using the excellent CSV.jl package. However, importing all files in one go fails when one of the columns is named “name”.

Is this a bug, and can someone recommend a workaround? In my real example, I cannot change the content of the csv files I am importing.

I could of course import only one file at a time, but I like the idea of importing them all in one go.

MWE:

using CSV

# Data with ordinary variable names - works
data = [
    "a,b,c\n1,2,3\n4,5,6\n",
    "a,b,c\n7,8,9\n10,11,12\n",
]
CSV.File(map(IOBuffer, data))   # Works

# Only one set of data with variable name "name" - works
data = [
    "a,name,c\n1,2,3\n4,5,6\n",
]
CSV.File(map(IOBuffer, data)) # Works

# Data where the second variable name is "name" - fails
data = [
    "a,name,c\n1,2,3\n4,5,6\n",
    "a,name,c\n7,8,9\n10,11,12\n",
]
CSV.File(map(IOBuffer, data)) # Fails

REPL output:

4-element CSV.File:
 CSV.Row: (a = 1, b = 2, c = 3)   
 CSV.Row: (a = 4, b = 5, c = 6)   
 CSV.Row: (a = 7, b = 8, c = 9)   
 CSV.Row: (a = 10, b = 11, c = 12)

2-element CSV.File:
 CSV.Row: (a = 1, name = 2, c = 3)
 CSV.Row: (a = 4, name = 5, c = 6)

ERROR: MethodError: Cannot `convert` an object of type SentinelArrays.ChainedVector{Int64, Vector{Int64}} to an object of type String
Closest candidates are:
  convert(::Type{String}, ::WeakRefStrings.WeakRefString) at C:\Users\B046326\.julia\packages\WeakRefStrings\31nkb\src\WeakRefStrings.jl:81
  convert(::Type{String}, ::FilePathsBase.AbstractPath) at C:\Users\B046326\.julia\packages\FilePathsBase\9kSEl\src\path.jl:117       
  convert(::Type{String}, ::String) at essentials.jl:218
  ...
Stacktrace:
 [1] CSV.File(name::SentinelArrays.ChainedVector{Int64, Vector{Int64}}, names::Vector{Symbol}, types::Vector{Type}, rows::Int64, cols::Int64, columns::Vector{CSV.Column}, lookup::Dict{Symbol, CSV.Column})
   @ CSV C:\Users\B046326\.julia\packages\CSV\b8ebJ\src\file.jl:106
 [2] CSV.File(sources::Vector{IOBuffer}; source::Nothing, kw::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ CSV C:\Users\B046326\.julia\packages\CSV\b8ebJ\src\file.jl:940
 [3] CSV.File(sources::Vector{IOBuffer})
   @ CSV C:\Users\B046326\.julia\packages\CSV\b8ebJ\src\file.jl:891
 [4] top-level scope
   @ c:\Users\B046326\Levy\projects\bam_io\test\temp.jl:150

julia>

This looks like a bug: the data is aggregated in a temporary f::File, which has a field called name, and f.name is then accessed when creating the return value.

That access goes through getproperty, which for this type is defined to first look in a dict for a column with that name, and only if there is none does it return the field from the struct.

So in this case f.name returns a column, but it is passed at a position in the File constructor that expects a string.

Could probably be solved by using some of the getters defined at the top of the file instead of dot syntax?

That would mean a fix in CSV, though. I don’t have a suggestion for a simple local workaround other than loading the files separately.

Seems like a brittle way of overriding getproperty though, easy for something like this to happen.
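
To illustrate the mechanism, here is a minimal sketch. ToyFile and getname are made-up names for this illustration, not CSV.jl's actual definitions; the only thing assumed is that column lookup takes priority over struct fields, as described above.

```julia
# Hypothetical stand-in for a file type; NOT CSV.jl's actual definition.
struct ToyFile
    name::String                      # struct field holding the source name
    lookup::Dict{Symbol,Vector{Int}}  # column name => column data
end

# Column lookup takes priority; struct fields are only a fallback.
function Base.getproperty(f::ToyFile, s::Symbol)
    lookup = getfield(f, :lookup)
    return haskey(lookup, s) ? lookup[s] : getfield(f, s)
end

# Hypothetical getter that bypasses the overloaded getproperty.
getname(f::ToyFile) = getfield(f, :name)

plain = ToyFile("plain.csv", Dict(:a => [1, 4]))
clash = ToyFile("clash.csv", Dict(:a => [1, 4], :name => [2, 5]))

plain.name      # no column called name, so the String field is returned
clash.name      # the column called name shadows the field: a Vector{Int}
getname(clash)  # getfield always reads the struct field: a String
```

Any code that writes f.name and expects the String field breaks as soon as a column happens to be called name, whereas a getter based on getfield is unaffected.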

Some combination of the keyword header and a preprocessing step on the names in the header?

CSV.File

Does this meet the need?

vcat(CSV.File.(map(IOBuffer, data))...)
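
A variation on loading the inputs separately, assuming DataFrames.jl is installed and a DataFrame is the desired end result: read each input on its own and concatenate afterwards. This is only a sketch; the variable names dfs and combined are made up for the example. It avoids the multi-input call that raises the error in the MWE.

```julia
using CSV, DataFrames

data = [
    "a,name,c\n1,2,3\n4,5,6\n",
    "a,name,c\n7,8,9\n10,11,12\n",
]

# Read every input on its own, so only one source is parsed per call.
dfs = [CSV.read(IOBuffer(s), DataFrame) for s in data]

# Stack the per-input tables vertically into a single DataFrame.
combined = reduce(vcat, dfs)
```

With file paths instead of strings, the comprehension would call CSV.read(path, DataFrame) for each path.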

Added a PR to fix this, so one workaround could be to use that branch until it gets merged.

That is amazing. Today I saw that @quinnj merged the bugfix into the main branch on GitHub, so I guess a new version of the package will resolve this soon. I hope to get good enough one day to make a PR myself. For now I can only express my gratitude to @albheim and @quinnj. Thanks!

Thanks for this idea. To make progress, I made a workaround without parsing the headers at all and instead hardcoding them, like this:

CSV.read(
      # Collect needed to materialize the vector since SubDataFrames are views into the parent
      collect(subdf.desired_filepath),
      DataFrame;
      header = [
          "date",
          "code",
          "isin",
          "sedol",
          "country",
          "currency",
          "exchange",
          "sector",
          "indexshares",
          "rate",
          "price",
          "indexweight",
          #"name"                 # CSV.jl has a bug that does not allow this variable name when reading multiple
                                  # files in one go, see
                                  # https://discourse.julialang.org/t/csv-jl-fails-when-reading-multiple-inputs-at-once-if-variable-name-name-occurs/97610
          "security_name",
      ],
      skipto = 5,
  )

I decided to not wait for a new version of the CSV package but to move to the main branch of CSV.jl where @albheim got his bugfix implemented.

In practice, I simply opened the package manager in the REPL and entered:

add CSV#main

Then I could remove my workaround and simply use:

CSV.read(
      # Collect needed to materialize the vector since SubDataFrames are views into the parent
      collect(subdf.desired_filepath),
      DataFrame;
      header = 4,
)

Entering status in the package manager now shows:

[336ed68f] CSV v0.10.9 https://github.com/JuliaData/CSV.jl.git#main

Open source is pretty cool when it works like this. What a great community!
