Performance of using strings as keys in Dict (vs. integers)

question

#1

I have a large Dictionary, and currently I am using strings as keys. I am getting bad performance, which could be related to this (but I am not sure). I am wondering if it makes sense to rewrite the code so that the Dictionaries use Ints as keys, instead of Strings.

I would just have to keep track somewhere else of which integer indexes each string, but I would only use that mapping to print output for the user. The strings are actually irrelevant to the code logic.

Before I rewrite my code (could take some time), I would like some advice. Could this have an impact on performance?
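
For concreteness, the bookkeeping described above could look roughly like the sketch below. The `KeyIndex` type and the `id!` function are hypothetical names made up for illustration, not anything from Base or a package, and the word list is placeholder data.

```julia
# Hypothetical helper: gives each distinct string a consecutive Int id
# and remembers the reverse mapping for printing.
struct KeyIndex
    ids::Dict{String,Int}    # string -> integer id
    names::Vector{String}    # integer id -> string (only needed for output)
end

KeyIndex() = KeyIndex(Dict{String,Int}(), String[])

# Return the id of `s`, assigning the next free id on first sight.
function id!(idx::KeyIndex, s::AbstractString)
    key = String(s)
    return get!(idx.ids, key) do
        push!(idx.names, key)
        length(idx.names)
    end
end

idx = KeyIndex()
counts = Dict{Int,Int}()             # the main logic only sees Int keys

for word in ["apple", "pear", "apple"]
    k = id!(idx, word)
    counts[k] = get(counts, k, 0) + 1
end

# Translate ids back to strings only when printing for the user.
for (k, n) in counts
    println(idx.names[k], " => ", n)
end
```

Note that every string still gets hashed once when it is converted to an id, so this would only help if the same keys are looked up many times afterwards.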


#2

Have you tried profiling?
https://docs.julialang.org/en/latest/manual/profile/


#3

If I profile the code, how will that tell me if changing to Int keys will improve performance?


#4

Well, profiling will tell you how much time is spent accessing the Dicts (presumably your code does something else too?). If that is already a small percentage, then switching to Ints will make little difference.
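
A minimal sketch of what that could look like with the `Profile` standard library. The `count_keys` function and the generated keys are a hypothetical stand-in for the real workload, which isn't shown in this thread.

```julia
using Profile

# Hypothetical workload: count occurrences using String keys.
function count_keys(ks)
    counts = Dict{String,Int}()
    for k in ks
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

# Placeholder data: 200_000 keys drawn from 10_000 distinct strings.
sample_keys = [string("key_", rand(1:10_000)) for _ in 1:200_000]

count_keys(sample_keys)        # run once so compilation is not profiled
Profile.clear()                # discard any samples collected earlier
@profile count_keys(sample_keys)

# Flat listing sorted by sample count; frames from Base's dict.jl
# indicate time spent inside Dict operations.
Profile.print(format=:flat, sortedby=:count)
```

If the dict-related frames account for only a small share of the samples, the key type is probably not the bottleneck.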


#5

In my testing for a package that I’m working on (which uses Dict{String,X} as a way to associate a String UUID with whatever X is), using strings of length 16 performed about as well as using a UInt64, which was the alternative that I was comparing against. I was mostly iterating over all the key/value pairs in the dictionary, not doing much else.

I think this only worked well, however, because the strings fit nicely into SIMD lanes, so they can be quickly compared and such. If you have a mix of different string lengths, and/or they don’t fall on a power-of-two boundary, then you may get worse results. But please take what I say with a grain of salt, I’m no expert on how strings are processed in Julia. Profiling and benchmarking a few operations that are common in your application/package would be the way to go.
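
A rough sketch of such a comparison, using only the standard library. The sizes, the random keys, and the `lookup_all` helper are assumptions for illustration; this is not the test from my package, and a single `@elapsed` measurement is noisy, so a dedicated tool such as BenchmarkTools.jl would give more trustworthy numbers.

```julia
using Random

n = 100_000

# Two dictionaries with the same values but different key types.
str_dict = Dict(randstring(16) => i for i in 1:n)   # 16-character String keys
int_dict = Dict(rand(UInt64) => i for i in 1:n)     # UInt64 keys

str_keys = collect(keys(str_dict))
int_keys = collect(keys(int_dict))

# Hypothetical helper: look up every key once and sum the values.
function lookup_all(d, ks)
    total = 0
    for k in ks
        total += d[k]
    end
    return total
end

# Warm up so that compilation time is not included in the measurement.
lookup_all(str_dict, str_keys)
lookup_all(int_dict, int_keys)

t_str = @elapsed lookup_all(str_dict, str_keys)
t_int = @elapsed lookup_all(int_dict, int_keys)

println("String keys: ", t_str, " s")
println("UInt64 keys: ", t_int, " s")
```

Repeating this with string lengths and access patterns that match the real application is what matters; the numbers from random keys may not carry over.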


#6

What versions of Julia have you been testing on?


#7

Hello all,

I am having the same problem reported by cossio. Essentially all of my code deals with string handling, dictionaries, hashes, and so on. My dicts are huge, and I must use string content as keys. Besides that, I use many fuzzy-matching and string-distance functions, constantly reading and writing a very large number of CSV or TSV files.

I read in this contentious thread: https://www.reddit.com/r/Julia/comments/629qkz/about_a_year_ago_an_article_titled_giving_up_on/
that string handling and text formatting are among the areas where Julia is being improved, and I am very happy about that, mainly because I love the language.

Given my company’s core business, I have managed many projects involving string data frames in recent years. We ran a lot of tests with new languages to find the best option for us besides Python.

Although I am in love :heart: with Julia, I must confess that its performance does not compare well against APIs like Pandas, NimData (a DataFrame written in Nim), or Kniren (a DataFrame and data wrangling library in Go).

Anyway, I will certainly continue to support the language and spread the word among my colleagues. I am sure upcoming Julia versions will improve in the string handling area.


#8

Those aren’t languages; they are libraries built on top of languages. That’s very different. Besides, Julia does have things like Pandas.jl if you really need the features and performance of Pandas right now.

I think the problem here isn’t strings but that dictionaries leave a lot of performance on the table. I don’t know how that compares to other languages, though.


#9

If you can post an example of slow code (with comparison to Python), I’m sure that people here will be happy to help.