New String7 type from CSV.File |> DataFrame makes my function fail (it takes a String as input)

My script worked until yesterday, but it now fails with the following error:

ERROR: LoadError: MethodError: no method matching get_symbols_ta(::String7, ::String, ::String)
Closest candidates are:
  get_symbols_ta(::String, ::String, ::String) at C:\cygwin64\home\davidj\pb\anto-port\src\Yfjd.jl:190
Stacktrace:
 [1] get_adj(tick::String7, ddeb::String, dfin::String)
   @ Main C:\cygwin64\home\davidj\pb\anto-port\src\functions_gets.jl:9
 [2] getit(tick::String7, ddeb::String, dfin::String)
   @ Main C:\cygwin64\home\davidj\pb\anto-port\src\functions_gets.jl:14
 [3] get_tatous(aitick::Vector{String7})
   @ Main C:\cygwin64\home\davidj\pb\anto-port\src\functions_gets.jl:54
 [4] get_tatous3(aitick::Vector{String7}, ainb::Vector{Int64})
   @ Main C:\cygwin64\home\davidj\pb\anto-port\src\functions_gets.jl:82
 [5] macro expansion
   @ C:\cygwin64\home\davidj\pb\anto-port\src\tactantoh.jl:98 [inlined]
 [6] macro expansion
   @ C:\Users\davidj\.julia\packages\TimerOutputs\SSeq1\src\TimerOutput.jl:236 [inlined]
 [7] macro expansion
   @ C:\cygwin64\home\davidj\pb\anto-port\src\tactantoh.jl:96 [inlined]
 [8] top-level scope
   @ C:\Users\davidj\.julia\packages\TimerOutputs\SSeq1\src\TimerOutput.jl:236
in expression starting at C:\cygwin64\home\davidj\pb\anto-port\src\tactantoh.jl:6

I traced the appearance of String7 (which my get_symbols_ta function does not accept) to the following:

 dfaitxt="""
           Tick    Actions     Coursi  Nombre Domaine                      InCac40
           TTE.PA  Total       36.32   36      Energie                     Y
           DG.PA   Vinci       77.57   13      construction                Y
           NK.PA   Imerys      35.92   29      Meteaux/TerresRares         N
           PUB.PA  Publicis    28.99   33      Communication/Publicité     Y
           SGO.PA  SaintGobain 28.65   17      Travaux                     Y
           EN.PA   Bouygues    30.71   16      Telecomms/Construction      N
               BN.PA   Danone      58.56   8       Alimentaire                 Y
           ADP.PA  ADP         92.82   5       Aeroports                   N
           SK.PA   SEB         143.59  3       Electromenager              N
           CS.PA   AXA         21.45   15      Assurances                  Y
           SOI.PA  SOITEC      166.72  2       Technologie/Semiconducteurs N
           KORI.PA Korian      30.90   16      Sante                       N
           ENGI.PA ENGIE       12.22   41      Energie                     Y
           """
julia> dfai=CSV.File(IOBuffer(dfaitxt),delim=" ",ignorerepeated=true) |> DataFrame
13×6 DataFrame
 Row │ Tick     Actions      Coursi   Nombre  Domaine                      InCac40
     │ String7  String15     Float64  Int64   String31                     String1
─────┼─────────────────────────────────────────────────────────────────────────────
   1 │ TTE.PA   Total          36.32      36  Energie                      Y
   2 │ DG.PA    Vinci          77.57      13  construction                 Y
   3 │ NK.PA    Imerys         35.92      29  Meteaux/TerresRares          N
   4 │ PUB.PA   Publicis       28.99      33  Communication/Publicité      Y
   5 │ SGO.PA   SaintGobain    28.65      17  Travaux                      Y
   6 │ EN.PA    Bouygues       30.71      16  Telecomms/Construction       N
   7 │ BN.PA    Danone         58.56       8  Alimentaire                  Y
   8 │ ADP.PA   ADP            92.82       5  Aeroports                    N
   9 │ SK.PA    SEB           143.59       3  Electromenager               N
  10 │ CS.PA    AXA            21.45      15  Assurances                   Y
  11 │ SOI.PA   SOITEC        166.72       2  Technologie/Semiconducteurs  N
  12 │ KORI.PA  Korian         30.9       16  Sante                        N
  13 │ ENGI.PA  ENGIE          12.22      41  Energie                      Y

julia>

So, my question: how do I make things work again? Should I convert Tick to String? Or is there a way for get_symbols_ta to accept a String7?

For information, I am on CSV 0.9.4 and DataFrames 1.2.2. The functions worked with CSV 0.8.5 and DataFrames 1.2.0.

Thank you for any help and suggestions.

https://csv.juliadata.org/stable/reading.html#stringtype

Try passing stringtype=String to CSV.File.
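
For anyone landing here later, a minimal sketch of what that keyword does, assuming CSV.jl 0.9 or newer and DataFrames are installed. The table is a shortened copy of the one in the question; the variable names are only for illustration.

```julia
using CSV, DataFrames

# Shortened copy of the table from the question.
txt = """
Tick    Actions  Coursi  Nombre
TTE.PA  Total    36.32   36
DG.PA   Vinci    77.57   13
"""

# stringtype=String asks CSV.jl for plain String columns
# instead of inline string types such as String7.
df = CSV.File(IOBuffer(txt); delim=" ", ignorerepeated=true, stringtype=String) |> DataFrame

# Element type of the Tick column after parsing.
tick_type = eltype(df.Tick)
```

With this, functions declared for ::String arguments should accept values from df.Tick without further changes.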

You should write your code to accept ::AbstractString instead of String. That will get rid of this particular method error and you get to take advantage of the performance of String7.
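
A small sketch of the idea, using only Base. The function describe_request is a hypothetical stand-in for get_symbols_ta (whose body is not shown in this thread), and SubString stands in for String7 as an example of a string type that is not String.

```julia
# Hypothetical stand-in for get_symbols_ta; the body is illustrative only.
# Declaring the arguments as AbstractString accepts String, SubString,
# String7 and any other string type.
function describe_request(tick::AbstractString, ddeb::AbstractString, dfin::AbstractString)
    return string(tick, " from ", ddeb, " to ", dfin)
end

# SubString is a Base string type that is not String.
tick = SubString("TTE.PA,DG.PA", 1, 6)

result = describe_request(tick, "2021-01-01", "2021-10-01")
```

Had the signature been (::String, ::String, ::String), the call above would raise the same kind of MethodError as in the question.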


String7 etc. are meant as an optimization, and as I suspected, they break a lot of code that accepts only String. Such code needs to accept AbstractString instead, which is what most code should use in function definitions, if it declares a type at all. That's been true for a while, even if you've gotten away without it: String isn't the only string type, there are others in packages, but I guess they are not yet widely used. CSV.jl, however, is widely used; I suspect it is the first widely used package to define its own string types.

That's my theory anyway. A workaround is to convert to String (or ask for it in the first place) and then call functions that only accept String. I have my own idea, no code yet, for changing string handling in Julia along the lines of String7 while keeping the API the same, all hidden behind String. I would prefer a single good default String type in Base, so that others aren't needed.
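
The conversion workaround can be sketched as follows, assuming the InlineStrings package (which provides String7) is installed. The vector here only mimics the Tick column from the question.

```julia
using InlineStrings

# Mimics the element type CSV.jl produced for the Tick column.
ticks = String7.(["TTE.PA", "DG.PA"])

# Broadcasting the String constructor gives a plain Vector{String}.
plain = String.(ticks)
```

For the DataFrame in the question, the equivalent would presumably be dfai.Tick = String.(dfai.Tick) before calling the functions that expect String.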


Thank you for the question and thanks, @pdeffebach, for the solution. Actually, a few days ago I had a problem with this using MLJ (and DecisionTree), because the String types were different. Converting all the string types in the DataFrame to String solved all my problems.

Dear @oheil, thank you, this is a good “quick-and-dirty” fix that only needs a change to the CSV call, so it is easier. I tried it and it works fine, thank you very much!


Dear @pdeffebach, thank you, it works and, as you said, I get much better performance with string handling. The only inconvenience (is it really one?) is that I had to modify my library module.

So I am marking your proposal as the solution, since performance is better.


Dear @Palli, thank you for your insightful remarks. When you have code (or at least an implementation outline) for your “new but identical” String API, I would be interested to see and understand it. A question: when you say other String types would not be needed in Base, do you mean you would propose replacing the current implementation with your new one?

As an aside, I saw that String7 etc. are defined in both WeakRefStrings and the brand new InlineStrings. Is this new module an intended fork (in fact a split?) of WeakRefStrings? Or is there another reason for the fork or separation?

Just answering here; the immediate issue is resolved, so this is kind of off-topic:

What you could do, and I believe there's even a Julia issue on it, is have a hybrid of the inline string and a regular String, i.e.

struct HybridString
  prefix::String7  # would prefer String8 type, would also fit in a register and saving length not needed
  full_string::String  # possibly type Ptr{String}
end

Often your prefix is the whole string, and full_string can then be a null pointer, which does not stress the garbage collector. It would take as much memory as String15/16, so that's the tradeoff. If you know your strings are never longer than String7 allows, then those are still a good idea. Many strings in your String31 column could fit in my type of string, but in your dataframe all the strings in that column are padded to 31 bytes. Most strings are short, and long strings do not have to be fully inline.
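
To make the tradeoff concrete, here is a rough Base-only sketch of such a hybrid. It is an illustration under my own assumptions, not the design described above: the name HybridSketch is hypothetical, the prefix is packed into a UInt64 rather than a String7, and `nothing` plays the role of the null pointer. Strings containing NUL bytes are not handled, since the prefix is zero-padded and the length is not stored.

```julia
# Hypothetical sketch: 8-byte prefix plus an optional full string.
struct HybridSketch
    prefix::UInt64               # first 8 bytes, most significant first, zero-padded
    full::Union{Nothing,String}  # nothing when the string fits entirely in the prefix
end

function HybridSketch(s::String)
    bytes = codeunits(s)
    p = UInt64(0)
    for i in 1:8
        b = i <= length(bytes) ? bytes[i] : 0x00
        p = (p << 8) | UInt64(b)
    end
    return HybridSketch(p, length(bytes) <= 8 ? nothing : s)
end

# Most comparisons are decided by one integer comparison on the prefix;
# the full strings are only consulted when the prefixes are equal.
function Base.isless(a::HybridSketch, b::HybridSketch)
    a.prefix != b.prefix && return a.prefix < b.prefix
    fa = a.full === nothing ? "" : a.full
    fb = b.full === nothing ? "" : b.full
    return isless(fa, fb)
end

words = ["ENGI.PA", "TTE.PA", "Technologie/Semiconducteurs", "Technologie", "DG.PA", "EN.PA"]
hybrids = HybridSketch.(words)

# Sorting by the hybrid representation gives the same order as sorting the Strings.
same_order = sortperm(hybrids) == sortperm(words)
```

Short tickers such as "TTE.PA" carry no heap-allocated string at all in this sketch, while the two "Technologie..." entries share a prefix and fall back to the full comparison.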

Then I have an idea to have 24 letters in the 8-byte prefix, but even without that the prefix should give you fast sorting already. My idea would give localized sorting for e.g. Icelandic alphabet, English and more languages, and O(1) indexing in most cases.


Very interesting! Thank you for sharing your views, and please continue the very good work.