Hello everyone,
Is there a function built into Julia to count the frequency of each unique character in a string? I think I saw one a few days ago, but I can't remember it and can't find it in the docs.
Thanks~
Regards
Si
You can use countmap from StatsBase:
julia> using StatsBase
[ Info: Recompiling stale cache file /Users/dpsanders/.julia/compiled/v1.0/StatsBase/EZjIG.ji for StatsBase [2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91]
julia> s = "hello"
"hello"
julia> countmap([c for c in s])
Dict{Char,Int64} with 4 entries:
'h' => 1
'l' => 2
'e' => 1
'o' => 1
I wonder if this will be faster and will use less memory if you run this on a large array of strings.
s = "hello"
count_char(s) = begin
res = Dict{Char, Int}()
for c in s
if haskey(res, c)
res[c] += 1
else
res[c] = 1
end
end
res
end
@time count_char(s)
@time count_char(s)
@dpsanders @xiaodai thanks both~ I just remembered there is a function built into Julia that can also do that, but I can't remember its name. Maybe my memory is wrong.
Sorry, it is my bad memory~ I found it in Think Julia: How to Think Like a Computer Scientist.
There is no built-in function to do this count; I just saw the histogram() function in Think Julia: How to Think Like a Computer Scientist.
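For reference, the histogram function in Think Julia builds the same kind of Dict by hand. A sketch from memory (not a verbatim copy of the book's code):

```julia
# Sketch of a Think-Julia-style histogram: map each character
# in the string to the number of times it occurs.
function histogram(s)
    d = Dict{Char,Int}()
    for c in s
        if haskey(d, c)
            d[c] += 1
        else
            d[c] = 1
        end
    end
    return d
end

histogram("hello")  # Dict('h' => 1, 'l' => 2, 'e' => 1, 'o' => 1)
```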
counter from the DataStructures package would be an option as well and should be equivalent in performance to the hand-made solution.
using DataStructures
s = "hello"
counter(s)
julia> using BenchmarkTools
julia> @time count_char(s);
0.106494 seconds (212.95 k allocations: 10.674 MiB)
julia> @time count_char(s);
0.000003 seconds (8 allocations: 704 bytes)
julia> @btime count_char($s);
183.447 ns (4 allocations: 544 bytes)
You should use BenchmarkTools for this, otherwise you won’t get accurate timing and memory info. Especially for short runtimes the timing is dramatically off.
Yeah, but sometimes using BenchmarkTools and waiting for it to compile is a pain. Close enough is close enough.
It seems a bit strange, BTW, that countmap cannot handle generators. This should be faster and seems more elegant:
countmap(c for c in s)
But this is off by more than an order of magnitude.
I’ve never felt any compilation pain with BenchmarkTools.
Not that I want to argue, but it's the relative timings that matter here, not the absolute ones. Once you know the fastest way, you have to choose it regardless of the absolute times.
Well, but then you at least need longer runtimes for the test. As it is now, you cannot tell the difference between 1 ns and 3 µs, so you won't know the relative runtimes either.
This appears to be a bit faster than the solution above:
function count_char2(s)
res = Dict{Char, Int}()
for c in s
res[c] = get(res, c, 0) + 1
end
return res
end
(BTW, that performance difference is hard to detect with @time.)
If ASCII is enough it is possible to squeeze a bit more out of it
function count_char_ascii(s)
if !isascii(s)
error("non ascii string")
end
z = zeros(Int, 256)
for c in codeunits(s)
@inbounds z[c] += 1
end
let z = z
return Dict(Char(i) => z[i] for i in 1:length(z) if z[i] != 0)
end
end
julia> using Random
julia> s = randstring(10^6);
julia> @btime c1 = count_char(s);
20.043 ms (10 allocations: 5.34 KiB)
julia> @btime c2 = count_char2(s);
15.341 ms (10 allocations: 5.34 KiB)
julia> @btime c3 = count_char_ascii(s);
1.931 ms (15 allocations: 7.56 KiB)
Good question! There are lots of ways to do this, some better than others depending on the specifics of your problem. I’ll walk through several methods and compare them. For all methods, I’ll use the following sample string:
using Random
s = randstring(100_000)
oneliner(s) = Dict([(c, count(collect(s) .== c)) for c in unique(s)])
julia> @btime oneliner($s)
28.911 ms (401 allocations: 24.68 MiB)
Minimal code. No external dependencies. Stupid algorithm. Only consider this if performance doesn’t matter.
count_char(s) = begin
res = Dict{Char, Int}()
for c in s
if haskey(res, c)
res[c] += 1
else
res[c] = 1
end
end
res
end
julia> @btime count_char($s)
1.684 ms (10 allocations: 5.25 KiB)
Decent implementation, but it can be made both faster and less verbose, since it does three lookups for each duplicate key (first haskey, then obtaining the current count, then writing an updated count).
function count_char2(s)
res = Dict{Char, Int}()
for c in s
res[c] = get(res, c, 0) + 1
end
return res
end
julia> @btime count_char2($s)
1.320 ms (10 allocations: 5.25 KiB)
A better version than the previous one, since it only does two lookups (one read and one write). It's also less verbose and doesn't duplicate logic. All-around better.
using DataStructures
julia> @btime counter($s)
1.335 ms (11 allocations: 5.27 KiB)
This is pretty much the same as the previous implementation, but you don’t need to write the code yourself.
using StatsBase
julia> @btime countmap([c for c in $s])
949.163 μs (13 allocations: 395.97 KiB)
countmap in StatsBase is in general the best method I know of for this task, and would be my method of choice in most cases. It uses an introspective algorithm: either a single-lookup dictionary approach or a radix sort, depending on the input data. Unfortunately, it doesn't accept generator input, which wastes some time and memory in this case.
function count_char_single_lookup(s)
res = Dict{Char, Int}()
for c in s
index = Base.ht_keyindex2!(res, c)
if index > 0
@inbounds res.vals[index] += 1
else
@inbounds Base._setindex!(res, 1, c, -index)
end
end
res
end
julia> @btime count_char_single_lookup($s)
765.020 μs (10 allocations: 5.25 KiB)
Extracting the method used by countmap, we can gain some performance. I would be quite wary of using this method in code that you intend to live on for a while, since it exploits the internal, undocumented structure of Dict, which could change at some point in the future.
Base.hash(c::Char) = UInt(c)
julia> @btime count_char_single_lookup($s)
487.873 μs (10 allocations: 5.25 KiB)
The default hash implementation for primitive types is not very efficient. It's possible to squeeze a bit more performance out of dictionary-based implementations by using a more efficient hash method. Again, I'd be wary of using this method in practice, since it redefines the hash code for Char globally. I guess one could use a custom Char type to get around that, but I'm not sure if that can be done in a clean and efficient way (suggestions appreciated).
using StatsBase
julia> @btime countmap([UInt(c) for c in $s]);
448.074 μs (16 allocations: 1.53 MiB)
To get even faster, we can turn to non-dictionary based approaches. By making the keys integers, countmap switches to a radix sort implementation. At the end, one could trivially change the keys back to Chars (see Kristoffer's post). Just like above, we could extract this code to avoid creating the temporary array to gain some performance. (Or better yet, add support for generators in StatsBase.)
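The conversion back to Char keys could look something like this (a sketch: countmap on the UInt keys takes the fast integer path, and a comprehension rebuilds a Char-keyed Dict afterwards):

```julia
using StatsBase

s = "hello"
# Integer keys let countmap use its radix sort implementation.
m = countmap([UInt(c) for c in s])
# Rebuild a Char-keyed Dict from the integer-keyed result.
char_counts = Dict(Char(k) => v for (k, v) in m)
```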
function count_limited_range(s)
counts = zeros(Int, 128)
@inbounds for c in codeunits(s)
counts[c] += 1
end
counts
end
julia> @btime count_limited_range($s)
72.770 μs (1 allocation: 1.14 KiB)
Finally, if the input data is guaranteed to be within a limited range, such as ASCII, by far the fastest way is to use a fixed-size array of counters.
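If you need the result as a Dict{Char,Int} like the other methods return, the counts vector can be converted afterwards. A sketch (to_dict is just an illustrative helper name; since the count for codeunit value c is stored at index c, index i corresponds to Char(i)):

```julia
function count_limited_range(s)
    counts = zeros(Int, 128)
    @inbounds for c in codeunits(s)
        counts[c] += 1
    end
    counts
end

# Index i in the counts vector holds the count for Char(i), so a
# comprehension over the nonzero slots rebuilds the Char-keyed Dict.
to_dict(counts) = Dict(Char(i) => counts[i] for i in eachindex(counts) if counts[i] != 0)

to_dict(count_limited_range("hello"))  # Dict('h' => 1, 'l' => 2, 'e' => 1, 'o' => 1)
```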
You would do it something like this:
struct FastHashChar
c::Char
end
Base.hash(c::FastHashChar, h::UInt) = xor(c.c % UInt, h)
Base.isequal(c1::FastHashChar, c2::FastHashChar) = c1.c == c2.c
Base.Char(c1::FastHashChar) = c1.c
function count_char_single_lookup(s)
res = Dict{FastHashChar, Int}()
for _c in s
c = FastHashChar(_c)
index = Base.ht_keyindex2!(res, c)
if index > 0
@inbounds res.vals[index] += 1
else
@inbounds Base._setindex!(res, 1, c, -index)
end
end
return Dict(Char(c) => n for (c,n) in res)
end
(While it is always a fun exercise to try and optimize the heck out of some little problem like this, 99.9% of the time it is not worth it and you should just use the simplest code, like countmap. It is nice that Julia allows us to go as crazy as we want in optimization, however! Next someone is going to suggest a SIMD implementation…)
I suggest using SIMD:
function count_limited_range_simd(s)
counts = zeros(Int, 128)
@inbounds @simd for c in codeunits(s)
counts[c] += 1
end
return counts
end
julia> @btime count_limited_range($s);
78.288 μs (1 allocation: 1.14 KiB)
julia> @btime count_limited_range_simd($s);
60.394 μs (1 allocation: 1.14 KiB)
I agree with you, but I have to say that reading all these answers is a very good way to learn different and efficient ways of coding. The original poster can always choose the best solution for them, so I'd say publicly playing the optimization game does more good than harm.