Find string with special characters in data frame read with the CSV package

Supposing I have the data frame:

using DataFrames

df = DataFrame(year = [2012, 1993, 1991, 1984, 1957, 1972, 1980], lang = ["Julia", "R", "Python", "Matlab", "Fortran", "C", "C++"])

I can find the creation year of C++ by executing the following command:

df[df[:, 2] .== "C++", :][1, 1]

However, if I read this data frame from a text file using CSV

using CSV

dfr = CSV.read("data.csv", DataFrame)

and perform the same search, i.e.,

dfr[dfr[:, 2] .== "C++", :][1, 1]

I get a โ€œBoundsErrorโ€, because the returned data frame is empty.
That is, the searching command works if I build the data frame by hand, but fails if I read it from a text file using CSV.
I also tried to use the filter function, but the results are similar.
What am I doing wrong?

Iโ€™m using Julia version 1.9.1, CSV v0.10.11, and DataFrames v1.6.1.
Iโ€™m a Manjaro user.

I cannot reproduce your problem. I did CSV.write("data.csv", df) and read it back and all worked without an issue.

Also, if you were interested, this is how I would writhe the operation you perform:

julia> only(dfr.year[dfr.lang .== "C++"])
1980
1 Like

Thank you for your reply.
In that case, I guess something is wrong with my system. Iโ€™ll try to figure that out.

And thanks for your code suggestion. Itโ€™s elegant, and Iโ€™ll be using it from now on. :+1:

Can you show what dfr looks like after you read it in?

Can you share the CSV file?

While waiting for the OP to clarify how the facts went, I tried some conjectures and this seems to be one probable (or possible)

str="""
2012  Julia   
1993  R
1991  Python
1984  Matlab
1957  Fortran
1972  C
1980  C++
"""

open("datacpp.csv", "w") do file
   write(file, str)
end


julia> dfr = CSV.read("datacpp.csv", DataFrame)
6ร—1 DataFrame
 Row โ”‚ 2012  Julia    
     โ”‚ String15       
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚ 1993  R
   2 โ”‚ 1991  Python
   3 โ”‚ 1984  Matlab
   4 โ”‚ 1957  Fortran
   5 โ”‚ 1972  C
   6 โ”‚ 1980  C++

julia> dfr[dfr[:, 2] .== "C++", :][1, 1]
ERROR: BoundsError: attempt to access data frame with 1 column at index [2]
1 Like

I know. I was just trying to reconstruct the crime scene :smile:

2 Likes

Thank you all, I appreciate your interest in my question.

You can download the original .csv file from here.
In case of need, how would you suggest I share a file with you? I guess itโ€™s not possible to share files here in the forum, right? Any platform you recommend for this kind of cases?

My guess is that the problem has to do with the encoding of some special characters, in particular the โ€œ+โ€ sign.
I updated my OS today and currently only the data frame read from the original .csv file shows the problem. Here are the tests I ran.
Iโ€™ll use the following helper function, to make testing easier:

function year_created(df, lang::String)
  res = df[:, 1][lowercase.(df[:, 2]) .== lowercase(lang)]
  !isempty(res) && return only(res)
  error("Could not find the programming language.")
end
  1. Testing on the .csv file mentioned above (FAIL)

NOTE: I saved the file as โ€œplangs01.csvโ€.

julia> df1 = CSV.read("plangs01.csv", DataFrame)
73ร—2 DataFrame
 Row โ”‚ year   plang
     โ”‚ Int64  String31
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚  1951  Regional Assembly Language
   2 โ”‚  1952  Autocode
   3 โ”‚  1954  IPL
  โ‹ฎ  โ”‚   โ‹ฎ                โ‹ฎ
  70 โ”‚  2011  Red
  71 โ”‚  2011  Elixir
  72 โ”‚  2012  Julia
  73 โ”‚  2014  Swift

julia> year_created(df1, "julia")
2012

julia> year_created(df1, "c++")
ERROR: Could not find the programming language.
Stacktrace:
 [1] year_created(df::DataFrame, lang::String)
   @ Main ./REPL[101]:4
 [2] top-level scope
   @ REPL[109]:1

julia> year_created(df1, "c#")
2001
  1. Testing on a fresh data frame (PASS)
julia> df2 = DataFrame(
       year = [1993, 1991, 1984, 1957, 1972, 1980, 2012],
       lang = ["R", "Python", "MATLAB", "FORTRAN", "C", "C++", "Julia"]
       )
7ร—2 DataFrame
 Row โ”‚ year   lang
     โ”‚ Int64  String
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚  1993  R
   2 โ”‚  1991  Python
   3 โ”‚  1984  MATLAB
   4 โ”‚  1957  FORTRAN
   5 โ”‚  1972  C
   6 โ”‚  1980  C++
   7 โ”‚  2012  Julia

julia> year_created(df2, "JUlia")
2012

julia> year_created(df2, "c++")
1980
  1. Testing on a newly created .csv file (PASS)
julia> str = """
       "year","lang"
       1993,R
       1991,Python
       1984,Matlab
       1957,Fortran
       1972,C
       1980,C++
       2001,C#
       2012,Julia
       """
"\"year\",\"lang\"\n1993,R\n1991,Python\n1984,Matlab\n1957,Fortran\n1972,C\n1980,C++\n2001,C#\n2012,Julia\n"

julia> open("plangs03.csv", "w") do file
       write(file, str)
       end
93

julia> df3 = CSV.read("plangs03.csv", DataFrame)
8ร—2 DataFrame
 Row โ”‚ year   lang
     โ”‚ Int64  String7
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚  1993  R
   2 โ”‚  1991  Python
   3 โ”‚  1984  Matlab
   4 โ”‚  1957  Fortran
   5 โ”‚  1972  C
   6 โ”‚  1980  C++
   7 โ”‚  2001  C#
   8 โ”‚  2012  Julia

julia> year_created(df3, "julia")
2012

julia> year_created(df3, "c++")
1980

julia> year_created(df3, "c#")
2001

Iโ€™m curious to know what your Test 1 results areโ€ฆ :thinking:

@rocco_sprmnt21, as you can see from my Test 3, your example is currently working for me, but I used to have the same problem youโ€™re reporting.

In fact, I found the culprit!
By examining the .csv file, I realised that some lines were ending in a white space, which caused the corresponding languages to not match the input in the year_created function in my last post. By deleting those spaces, everything works fine. :sunglasses:

If you are working with fumes where data was entered manually youโ€™ll often find trailing whitespace, in this case you can use strip to remove it.

1 Like

My example was different from yours, deliberately not having used the comma (default separator) as field separator, so that the search for the string โ€œc++โ€, with the criterion of exact equality, fails(*).
In my case, it fails because the created dataframe has only one column containing, for each row, the string โ€œyear langโ€.
It might be useful in analyzing situations of this type to loosen the matching criterion, using for example (as implicitly suggested by @Dan) the contains(str, substr) function or similar.

(*) I tried to simulate a cut and paste operation, hypothesizing what could have happened.
I selected the REPL output of the dataframe and pasted it embedding it in a string, to be able to save it as a text file.

This is interesting, because I have partially tested your code, and it had worked. I created the same string as in Test 3 above, but replaced the commas with spaces. If you try it, youโ€™ll see that it works as long as there is no extra space after the first line.

To make it clearer (the difference between stra and strb is a trailing space after โ€œlangโ€):

  • Test A (PASS)

    stra = """
           "year" "lang"
           1993 R
           1991 Python
           1980 C++
           2012 Julia
           """
    "\"year\" \"lang\"\n1993 R\n1991 Python\n1980 C++\n2012 Julia\n"
    
    open("data/eng/testa.csv", "w") do file
      write(file, stra)
    end
    53
    
    dfa = CSV.read("testa.csv", DataFrame)
    4ร—2 DataFrame
     Row โ”‚ year   lang    
         โ”‚ Int64  String7 
    โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
       1 โ”‚  1993  R
       2 โ”‚  1991  Python
       3 โ”‚  1980  C++
       4 โ”‚  2012  Julia
    
    year_created(dfa, "julia")
    2012
    
    year_created(dfa, "c++")
    1980
    
  • Test B (FAIL)

    strb = """
           "year" "lang" 
           1993 R
           1991 Python
           1980 C++
           2012 Julia
           """
    "\"year\" \"lang\" \n1993 R\n1991 Python\n1980 C++\n2012 Julia\n"
    
    open("data/eng/testb.csv", "w") do file
      write(file, strb)
    end
    54
    
    dfb = CSV.read("testb.csv", DataFrame)
    4ร—1 DataFrame
     Row โ”‚ year        
         โ”‚ String15    
    โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
       1 โ”‚ 1993 R
       2 โ”‚ 1991 Python
       3 โ”‚ 1980 C++
       4 โ”‚ 2012 Julia
    
    year_created(dfb, "julia")
    ERROR: BoundsError: attempt to access data frame with 1 column at index [2]
    Stacktrace:
     [1] getindex
       @ ~/.julia/packages/DataFrames/58MUJ/src/other/index.jl:193 [inlined]
     [2] getindex(df::DataFrame, #unused#::Colon, col_ind::Int64)
       @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/dataframe/dataframe.jl:543
     [3] (::var"#11#12")(df::DataFrame, lang::String)
       @ Main ./REPL[15]:2
     [4] top-level scope
       @ REPL[52]:1
    

Iโ€™m not a software engineer, but I wonder if this behaviour is expected from a programming language or if Julia should be made โ€œmore robustโ€ relatively to this kind of small differences that might be hard to catchโ€ฆ

I believe that the CSV package (like others) uses heuristics to be as convenient as possible.
For example in the case in question if it finds a list of lines with the same โ€œstructureโ€ [Number, Spaces, Word] it will think it is doing what it likes by interpreting the text as two columns of data separated by spaces.

If it finds any of these lines that have some extra trailing spaces, it canโ€™t arbitrarily split into columns and put everything together, as was the case in my case as well [looking closer I saw that there were trailing spaces in one of the lines].

Reading the CSV documentation I believe these rules should be made explicit. Otherwise, you can ask the package maintainers about it.

1 Like

You could read using space delimiter: CSV.read(file, DataFrame, delim=" ") .

This will create a column of missings that can be cleaned out:

using CSV, DataFrames
dfb = CSV.read("testb.csv", DataFrame, delim=" ")
dfb[!, Not(all.(ismissing, eachcol(dfb)))]
1 Like

An actual use of LLMs (using Llama2-13B):

convert the following into a valid CSV table:
"year" "lang"
1993 R
1991 Python
1980 C++
2012 Julia

Output:

Sure! Here's the valid CSV table:

year,lang
1993,R
1991,Python
1980,C++
2012,Julia
2 Likes