Find string with special characters in data frame read with the CSV package

NunonuN · August 18, 2023, 9:47pm

Supposing I have the data frame:

using DataFrames

df = DataFrame(year = [2012, 1993, 1991, 1984, 1957, 1972, 1980], lang = ["Julia", "R", "Python", "Matlab", "Fortran", "C", "C++"])

I can find the creation year of C++ by executing the following command:

df[df[:, 2] .== "C++", :][1, 1]

However, if I read this data frame from a text file using CSV

using CSV

dfr = CSV.read("data.csv", DataFrame)

and perform the same search, i.e.,

dfr[dfr[:, 2] .== "C++", :][1, 1]

I get a “BoundsError”, because the returned data frame is empty.
That is, the searching command works if I build the data frame by hand, but fails if I read it from a text file using CSV.
I also tried to use the filter function, but the results are similar.
What am I doing wrong?

I’m using Julia version 1.9.1, CSV v0.10.11, and DataFrames v1.6.1.
I’m a Manjaro user.

bkamins · August 18, 2023, 10:20pm

I cannot reproduce your problem. I did CSV.write("data.csv", df) and read it back and all worked without an issue.

Also, if you are interested, this is how I would write the operation you perform:

julia> only(dfr.year[dfr.lang .== "C++"])
1980
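
For readers unfamiliar with only, here is a minimal sketch of how it behaves, using two plain vectors as hypothetical stand-ins for the year and lang columns. It returns the single matching element, and it throws an error when there is no match, rather than handing back an empty result that only fails later on indexing.

```julia
# Hypothetical stand-ins for the `year` and `lang` columns.
years = [2012, 1972, 1980]
langs = ["Julia", "C", "C++"]

# Exactly one match: `only` returns the element itself.
year = only(years[langs .== "C++"])
println(year)

# No match: `only` throws an ArgumentError, which is captured here for inspection.
no_match = try
    only(years[langs .== "Rust"])
catch err
    err
end
println(typeof(no_match))
```

The same error is raised when more than one row matches, so a duplicated language name would also be caught.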

NunonuN · August 18, 2023, 10:25pm

Thank you for your reply.
In that case, I guess something is wrong with my system. I’ll try to figure that out.

And thanks for your code suggestion. It’s elegant, and I’ll be using it from now on.

nilshg · August 19, 2023, 7:21am

Can you show what dfr looks like after you read it in?

nalimilan · August 19, 2023, 1:37pm

Can you share the CSV file?

rocco_sprmnt21 · August 19, 2023, 2:34pm

While waiting for the OP to clarify what happened, I tried a few conjectures, and this seems to be one probable (or at least possible) explanation:

str="""
2012  Julia   
1993  R
1991  Python
1984  Matlab
1957  Fortran
1972  C
1980  C++
"""

open("datacpp.csv", "w") do file
   write(file, str)
end


julia> dfr = CSV.read("datacpp.csv", DataFrame)
6×1 DataFrame
 Row │ 2012  Julia    
     │ String15       
─────┼────────────────
   1 │ 1993  R
   2 │ 1991  Python
   3 │ 1984  Matlab
   4 │ 1957  Fortran
   5 │ 1972  C
   6 │ 1980  C++

julia> dfr[dfr[:, 2] .== "C++", :][1, 1]
ERROR: BoundsError: attempt to access data frame with 1 column at index [2]

rocco_sprmnt21 · August 19, 2023, 4:10pm

I know. I was just trying to reconstruct the crime scene

NunonuN · August 21, 2023, 4:50am

Thank you all, I appreciate your interest in my question.

You can download the original .csv file from here.
If needed, how would you suggest I share a file with you? I guess it’s not possible to share files here in the forum, right? Is there a platform you recommend for this kind of case?

My guess is that the problem has to do with the encoding of some special characters, in particular the “+” sign.
I updated my OS today and currently only the data frame read from the original .csv file shows the problem. Here are the tests I ran.
I’ll use the following helper function, to make testing easier:

function year_created(df, lang::String)
  res = df[:, 1][lowercase.(df[:, 2]) .== lowercase(lang)]
  !isempty(res) && return only(res)
  error("Could not find the programming language.")
end

Testing on the .csv file mentioned above (FAIL)

NOTE: I saved the file as “plangs01.csv”.

julia> df1 = CSV.read("plangs01.csv", DataFrame)
73×2 DataFrame
 Row │ year   plang
     │ Int64  String31
─────┼───────────────────────────────────
   1 │  1951  Regional Assembly Language
   2 │  1952  Autocode
   3 │  1954  IPL
  ⋮  │   ⋮                ⋮
  70 │  2011  Red
  71 │  2011  Elixir
  72 │  2012  Julia
  73 │  2014  Swift

julia> year_created(df1, "julia")
2012

julia> year_created(df1, "c++")
ERROR: Could not find the programming language.
Stacktrace:
 [1] year_created(df::DataFrame, lang::String)
   @ Main ./REPL[101]:4
 [2] top-level scope
   @ REPL[109]:1

julia> year_created(df1, "c#")
2001

Testing on a fresh data frame (PASS)

julia> df2 = DataFrame(
       year = [1993, 1991, 1984, 1957, 1972, 1980, 2012],
       lang = ["R", "Python", "MATLAB", "FORTRAN", "C", "C++", "Julia"]
       )
7×2 DataFrame
 Row │ year   lang
     │ Int64  String
─────┼────────────────
   1 │  1993  R
   2 │  1991  Python
   3 │  1984  MATLAB
   4 │  1957  FORTRAN
   5 │  1972  C
   6 │  1980  C++
   7 │  2012  Julia

julia> year_created(df2, "JUlia")
2012

julia> year_created(df2, "c++")
1980

Testing on a newly created .csv file (PASS)

julia> str = """
       "year","lang"
       1993,R
       1991,Python
       1984,Matlab
       1957,Fortran
       1972,C
       1980,C++
       2001,C#
       2012,Julia
       """
"\"year\",\"lang\"\n1993,R\n1991,Python\n1984,Matlab\n1957,Fortran\n1972,C\n1980,C++\n2001,C#\n2012,Julia\n"

julia> open("plangs03.csv", "w") do file
       write(file, str)
       end
93

julia> df3 = CSV.read("plangs03.csv", DataFrame)
8×2 DataFrame
 Row │ year   lang
     │ Int64  String7
─────┼────────────────
   1 │  1993  R
   2 │  1991  Python
   3 │  1984  Matlab
   4 │  1957  Fortran
   5 │  1972  C
   6 │  1980  C++
   7 │  2001  C#
   8 │  2012  Julia

julia> year_created(df3, "julia")
2012

julia> year_created(df3, "c++")
1980

julia> year_created(df3, "c#")
2001

I’m curious to know what your Test 1 results are…

@rocco_sprmnt21, as you can see from my Test 3, your example is currently working for me, but I used to have the same problem you’re reporting.

NunonuN · August 21, 2023, 5:04am

In fact, I found the culprit!
By examining the .csv file, I realised that some lines were ending in a white space, which caused the corresponding languages to not match the input in the year_created function in my last post. By deleting those spaces, everything works fine.
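
A minimal sketch of how such trailing whitespace can be made visible from the REPL, in case it helps someone else. The values in langs are hypothetical and only mimic what a hand-edited file might contain.

```julia
# Hypothetical values, as they might come out of a hand-edited CSV file.
langs = ["Julia", "C++ ", "C#"]

# `==` compares strings exactly, so a trailing space prevents a match.
println("C++ " == "C++")

# `repr` prints the surrounding quotes, which exposes invisible trailing characters.
foreach(s -> println(repr(s)), langs)

# Collect every entry that changes when surrounding whitespace is stripped.
padded = [s for s in langs if s != strip(s)]
println(padded)
```

Running the last comprehension on the string column of a data frame should list the affected entries in the same way.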

nilshg · August 21, 2023, 5:21am

If you are working with files where data was entered manually, you’ll often find trailing whitespace. In this case you can use strip to remove it.
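
A minimal sketch of that idea, assuming a small hand-built data frame in place of the original file (the data and the column names are illustrative only):

```julia
using DataFrames

# Hypothetical data with the kind of trailing space found in the original file.
df = DataFrame(year = [1980, 2012], lang = ["C++ ", "Julia"])

# Strip leading and trailing whitespace from every entry. `strip` returns
# `SubString`s, so `String.` converts them back to plain strings.
df.lang = String.(strip.(df.lang))

# The exact comparison now succeeds.
year = only(df.year[df.lang .== "C++"])
println(year)
```

CSV.jl also documents a stripwhitespace keyword for CSV.read which, if I read the documentation correctly, does the same thing at parse time. I have not tested it against the original file.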

rocco_sprmnt21 · August 21, 2023, 7:40am

My example was deliberately different from yours: I did not use the comma (the default separator) as the field separator, so that a search for the string “c++” with exact equality fails (*).
In my case it fails because the resulting data frame has only one column, which holds the year and the language of each row together in a single string.
When analyzing situations like this, it can help to loosen the matching criterion, for example with the contains(str, substr) function (as implicitly suggested by @Dan) or something similar.

(*) I tried to simulate a cut and paste operation, hypothesizing what could have happened.
I selected the REPL output of the dataframe and pasted it embedding it in a string, to be able to save it as a text file.
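
To illustrate the looser matching mentioned above, here is a small sketch. The rows vector is hypothetical and mimics a file that ended up in a single string column.

```julia
# Hypothetical rows, mimicking a file that was parsed into one string column.
rows = ["1993 R", "1991 Python", "1980 C++", "2012 Julia"]

# Exact equality finds nothing, because no row is exactly "C++".
exact = filter(==("C++"), rows)

# `contains(haystack, needle)` matches on substrings instead.
loose = filter(r -> contains(r, "C++"), rows)

println(exact)
println(loose)
```

Substring matching is a diagnostic aid more than a fix: a needle such as "C" would also match the "C++" row.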

NunonuN · August 21, 2023, 5:31pm

This is interesting, because I partially tested your code, and it worked. I created the same string as in Test 3 above, but replaced the commas with spaces. If you try it, you’ll see that it works as long as there is no extra space after the first line.

To make it clearer (the difference between stra and strb is a trailing space after “lang”):

Test A (PASS)

stra = """
       "year" "lang"
       1993 R
       1991 Python
       1980 C++
       2012 Julia
       """
"\"year\" \"lang\"\n1993 R\n1991 Python\n1980 C++\n2012 Julia\n"

open("testa.csv", "w") do file
  write(file, stra)
end
53

dfa = CSV.read("testa.csv", DataFrame)
4×2 DataFrame
 Row │ year   lang    
     │ Int64  String7 
─────┼────────────────
   1 │  1993  R
   2 │  1991  Python
   3 │  1980  C++
   4 │  2012  Julia

year_created(dfa, "julia")
2012

year_created(dfa, "c++")
1980

Test B (FAIL)

strb = """
       "year" "lang" 
       1993 R
       1991 Python
       1980 C++
       2012 Julia
       """
"\"year\" \"lang\" \n1993 R\n1991 Python\n1980 C++\n2012 Julia\n"

open("testb.csv", "w") do file
  write(file, strb)
end
54

dfb = CSV.read("testb.csv", DataFrame)
4×1 DataFrame
 Row │ year        
     │ String15    
─────┼─────────────
   1 │ 1993 R
   2 │ 1991 Python
   3 │ 1980 C++
   4 │ 2012 Julia

year_created(dfb, "julia")
ERROR: BoundsError: attempt to access data frame with 1 column at index [2]
Stacktrace:
 [1] getindex
   @ ~/.julia/packages/DataFrames/58MUJ/src/other/index.jl:193 [inlined]
 [2] getindex(df::DataFrame, #unused#::Colon, col_ind::Int64)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/dataframe/dataframe.jl:543
 [3] (::var"#11#12")(df::DataFrame, lang::String)
   @ Main ./REPL[15]:2
 [4] top-level scope
   @ REPL[52]:1

I’m not a software engineer, but I wonder if this behaviour is expected from a programming language or if Julia should be made “more robust” with respect to this kind of small difference, which might be hard to catch…

rocco_sprmnt21 · August 21, 2023, 6:31pm

I believe that the CSV package (like others) uses heuristics to be as convenient as possible.
For example, in the case in question, if it finds a list of lines that all have the same “structure” [Number, Spaces, Word], it will assume that the text is meant as two columns of data separated by spaces.

If some of these lines have extra trailing spaces, it can no longer split them into columns unambiguously, so it puts everything into a single column. That is what happened in my example as well [looking closer, I saw that there were trailing spaces in one of the lines].

These rules should be spelled out in the CSV documentation. If they are not, you can ask the package maintainers about it.

rafael.guerra · August 21, 2023, 10:35pm

You could read using space delimiter: CSV.read(file, DataFrame, delim=" ") .

This will create a column of missings that can be cleaned out:

using CSV, DataFrames
dfb = CSV.read("testb.csv", DataFrame, delim=" ")
dfb[!, Not(all.(ismissing, eachcol(dfb)))]

Dan · August 22, 2023, 1:03am

An actual use of LLMs (using Llama2-13B):

convert the following into a valid CSV table:
"year" "lang"
1993 R
1991 Python
1980 C++
2012 Julia

Output:

Sure! Here's the valid CSV table:

year,lang
1993,R
1991,Python
1980,C++
2012,Julia
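
The same conversion can also be done deterministically in a few lines of Julia. This is only a sketch: it assumes every line holds a year followed by a language name, and the variable names are made up for illustration.

```julia
# Space-separated input with a trailing space after the header, as in Test B.
raw = "\"year\" \"lang\" \n1993 R\n1991 Python\n1980 C++\n2012 Julia\n"

# For each non-empty line: strip surrounding whitespace, split on the first
# run of whitespace only (limit=2), and re-join the two fields with a comma.
lines = [join(split(strip(l), r"\s+"; limit=2), ",")
         for l in eachline(IOBuffer(raw)) if !isempty(strip(l))]

csv = join(lines, "\n") * "\n"
print(csv)
```

Using limit=2 keeps multi-word names such as “Regional Assembly Language” in a single field, under the stated assumption that the year always comes first.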
