I have a large Dictionary, and currently I am using strings as keys. I am getting bad performance, which could be related to this (but I am not sure). I am wondering if it makes sense to rewrite the code so that the Dictionaries use Ints as keys, instead of Strings.
I would just have to keep track somewhere else of the integer that indexes each string, but I would only use that to print some stuff to the user. The strings are actually irrelevant to the code logic.
Before I rewrite my code (could take some time), I would like some advice. Could this have an impact on performance?
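For reference, the bookkeeping described above could look roughly like the sketch below. The names `StringIndex`, `intern!` and `lookup_name` are hypothetical helpers made up for illustration, not part of Julia or of any package.

```julia
# Hypothetical helper type: maps each distinct string to a dense integer id.
struct StringIndex
    ids::Dict{String,Int}    # string -> id, only needed while loading data
    names::Vector{String}    # id -> string, only needed for printing
end

StringIndex() = StringIndex(Dict{String,Int}(), String[])

# Return the id of `s`, assigning the next free id the first time it is seen.
function intern!(idx::StringIndex, s::AbstractString)
    key = String(s)
    get!(idx.ids, key) do
        push!(idx.names, key)
        length(idx.names)
    end
end

# Recover the original string, e.g. when printing results to the user.
lookup_name(idx::StringIndex, id::Integer) = idx.names[id]

idx = StringIndex()
a = intern!(idx, "alice")
b = intern!(idx, "bob")
a2 = intern!(idx, "alice")   # same string, so the same id comes back

# The main data structure is now keyed by Int instead of String.
scores = Dict{Int,Float64}(a => 1.5, b => 2.0)
for (id, v) in scores
    println(lookup_name(idx, id), " => ", v)
end
```

Because the ids are handed out densely from 1, the main container could possibly even be a plain Vector indexed by id instead of a Dict, if that fits the rest of the code.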
Well, profiling will tell you how much time is taken up accessing the Dicts (presumably your code does something else too?). If that is already a small percentage, then switching to Ints will have little effect.
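A minimal sketch of how that profiling could be done with the `Profile` standard library. The `workload` function and the generated keys are hypothetical stand-ins for the real code, which will look different.

```julia
using Profile

# Hypothetical stand-in for the real workload: Dict lookups mixed with other work.
function workload(d::Dict{String,Int}, ks::Vector{String})
    total = 0
    for k in ks
        total += d[k]              # Dict access
        total += length(k) % 3     # some unrelated work
    end
    return total
end

ks = [string("key_", i) for i in 1:100_000]
d = Dict(k => i for (i, k) in enumerate(ks))

workload(d, ks)                    # run once first so compilation is not profiled
Profile.clear()
@profile for _ in 1:50
    workload(d, ks)
end

# Print the collected samples as a call tree; look for how many of them
# land in Dict-related frames compared with everything else.
Profile.print(maxdepth = 12)
```

If only a small share of the samples ends up in the Dict lookups, changing the key type is unlikely to be worth the rewrite.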
In my testing for a package that I’m working on (which uses Dict{String,X} as a way to associate a String UUID with whatever X is), using strings of length 16 performed about as well as using a UInt64, which was the alternative that I was comparing against. I was mostly iterating over all the key/value pairs in the dictionary, not doing much else.
I think this only worked well, however, because the strings fit nicely into SIMD lanes, so they can be quickly compared and such. If you have a mix of different string lengths, and/or they don’t fall on a power-of-two boundary, then you may get worse results. But please take what I say with a grain of salt, I’m no expert on how strings are processed in Julia. Profiling and benchmarking a few operations that are common in your application/package would be the way to go.
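A rough way to run that kind of comparison yourself, as a sketch only: the key format, the sizes and the `sum_lookups` helper are assumptions made for illustration, and it measures lookups rather than iteration, so the numbers will not necessarily match what I saw.

```julia
# Hypothetical data: 16-character string keys and matching UInt64 keys.
n = 200_000
str_keys = [string(i, base = 16, pad = 16) for i in 1:n]
int_keys = UInt64.(1:n)

str_dict = Dict(k => i for (i, k) in enumerate(str_keys))
int_dict = Dict(k => i for (i, k) in enumerate(int_keys))

# Sum the values found by looking up every key once.
function sum_lookups(d, ks)
    total = 0
    for k in ks
        total += d[k]
    end
    return total
end

# Warm up so that compilation time is not included in the timings.
sum_lookups(str_dict, str_keys)
sum_lookups(int_dict, int_keys)

t_str = @elapsed sum_lookups(str_dict, str_keys)
t_int = @elapsed sum_lookups(int_dict, int_keys)
println("String keys: ", t_str, " s")
println("UInt64 keys: ", t_int, " s")
```

A single `@elapsed` measurement is noisy; the `@btime` macro from the BenchmarkTools.jl package gives more reliable numbers if you can add it as a dependency. Swapping in keys of mixed lengths is an easy way to check whether the string length matters for your case.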
I am having the same problem reported by cossio. Basically 100% of my code is related to string handling, dictionaries, hashes, and so on. My dicts are huge (really big) and I must use string content as keys. Aside from that, I make use of plenty of fuzzy matching and string distance functions, always reading and writing a huge number of CSV or TSV files.
Given the core business of my company, I have managed a lot of projects related to string data frames over the last few years. We ran a lot of tests with new languages to find the best option for us besides Python.
Although I am in love with Julia, I must confess its performance does not compare well against APIs like Pandas, NimData (a DataFrame written in Nim), or Kniren (DataFrames and data wrangling in Go).
Anyway, I will certainly continue to support the language and spread the word among my colleagues. I am sure upcoming Julia releases will improve in the string handling area.
Those aren’t languages; those are libraries built on the languages. That’s very different. Besides, Julia does have packages like Pandas.jl if you really need the features and performance of Pandas right now.
I think the problem here isn’t strings but that dictionaries leave a lot of performance behind. I don’t know how that compares to other languages though.