Char vs. String for Dict key


#1

I’m getting to know Julia and found some quick little exercises at exercism.io (http://exercism.io/languages/julia/exercises). In the Nucleotide Count exercise, based on the runtests.jl file, the preferred Dict structure is:
Dict('A' => 0, 'C' => 0, 'G' => 0, 'T' => 0)

Is this preferred to String-based keys, i.e. Dict("A" => 0)?

My first solution did this for String keys (trimmed slightly). count is the Dict, myStr the input string.

for i in myStr
    count["$i"] += 1
end

This failed the test because of the data type of the key. So I changed to using indexes on the input string, which worked as well, but felt less elegant somehow. I realize it’s quite subjective.

for i in 1:length(myStr)
    count[myStr[i]] += 1
end

Is there a reason to choose one over the other, aside from passing the supplied runtests.jl? Style, idiom, performance?

Thanks.


#2

Most idiomatic, I think, is to use symbols:

Dict(:A => 0, :C => 0, :G => 0, :T => 0)

although for this particular exercise that might not be an option.

I think chars are simple fixed-size values while strings are heavier objects, so chars should be more performant. I would think symbols are also performant. Why don’t you benchmark your variants using BenchmarkTools.jl?
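
For reference, the two variants under discussion could be sketched roughly as follows. The function names `count_char_keys` and `count_string_keys` are hypothetical, chosen only for illustration, and the error type is an assumption rather than what the exercise requires.

```julia
# Hypothetical helper names; both count nucleotides in a DNA string.

# Variant 1: Char keys. Iterating a String yields Char values directly.
function count_char_keys(dna::AbstractString)
    counts = Dict('A' => 0, 'C' => 0, 'G' => 0, 'T' => 0)
    for c in dna
        haskey(counts, c) || throw(DomainError(c, "not a nucleotide"))
        counts[c] += 1
    end
    return counts
end

# Variant 2: String keys. Each Char is converted with string(c),
# which builds a new String for every character of the input.
function count_string_keys(dna::AbstractString)
    counts = Dict("A" => 0, "C" => 0, "G" => 0, "T" => 0)
    for c in dna
        key = string(c)
        haskey(counts, key) || throw(DomainError(c, "not a nucleotide"))
        counts[key] += 1
    end
    return counts
end
```

With BenchmarkTools.jl installed, something like `@btime count_char_keys($dna)` and `@btime count_string_keys($dna)` on the same input `dna` should show whether the difference matters for your data.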


#3

Thanks for the reply. I played around with converting a String to a Symbol and realized that iterating over a string creates a Char. I’m not sure how I missed that before. The following works just fine for the purposes of the provided exercise.

for i in str1
    if !haskey(count, i)
        throw(ErrorException("Key not found"))
    end
    count[i] += 1
end

On a side note, figuring out how to get from String to Symbol to Char and back is entertaining.
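
In case it helps anyone else, here is a minimal sketch of those conversions using only Base functions. The variable names are just for illustration.

```julia
# Conversions between String, Symbol, and Char using Base only.
s = "A"                            # a one-character String
c = first(s)                       # String -> Char (first character)
sym = Symbol(s)                    # String -> Symbol
sym_from_char = Symbol(c)          # Char -> Symbol

back_to_string = String(sym)       # Symbol -> String
back_to_char = first(String(sym))  # Symbol -> String -> Char
char_to_string = string(c)         # Char -> String

# Iterating a String yields Char values, not one-character Strings.
element_type = eltype("ACGT")
```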

Thanks for your recommendation of BenchmarkTools, I’ll take a look at it.


#4

(Note you can use triple backticks to typeset whole blocks of code at once)


#5

Note that String is generally a more heavyweight object than a Char. A String combines a length and an array of (UTF-8-encoded) characters (though technically it no longer uses an Array), whereas a Char is just one number (that can fit in a single CPU register). Operations on strings will typically be more expensive than operations on chars, e.g.

julia> k = "A"; @btime hash($k);
  10.671 ns (0 allocations: 0 bytes)

julia> k = 'A'; @btime hash($k);
  3.790 ns (0 allocations: 0 bytes)

julia> k = "A"; @btime isequal($k,$k);
  5.041 ns (0 allocations: 0 bytes)

julia> k = 'A'; @btime isequal($k,$k);
  1.893 ns (0 allocations: 0 bytes)
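
The size difference can also be checked directly. This is a small sketch using only Base functions; the variable names are made up for illustration.

```julia
# A Char is a fixed-size bits value; a String is a variable-length object.
char_is_bits = isbitstype(Char)       # true: plain bits, no heap pointer
string_is_bits = isbitstype(String)   # false
char_bytes = sizeof(Char)             # size of a Char in bytes

# sizeof on a String reports the number of bytes of UTF-8 data,
# which can differ from the number of characters.
ascii_bytes = sizeof("A")
accented_bytes = sizeof("\u00e9")     # "é" as a single code point
accented_chars = length("\u00e9")
```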

#6

And symbols are as fast as chars:

julia> @btime isequal(:a,:a);
  1.689 ns (0 allocations: 0 bytes)

julia> @btime isequal('k', 'c');
  1.688 ns (0 allocations: 0 bytes)

julia> @btime isequal("k", "v");
  5.033 ns (0 allocations: 0 bytes)

#7

Not for all operations, e.g. hash and isless are both slower for symbols.