How to count the frequency of each unique character in a string?

#1

Hello everyone,
Is there a built-in Julia function to count the frequency of each unique character in a string? I think I saw one a few days ago, but I can't remember it and can't find it in the docs.

Thanks~
Regards
Si


#2

You can use countmap from StatsBase:

julia> using StatsBase

julia> s = "hello"
"hello"

julia> countmap([c for c in s])
Dict{Char,Int64} with 4 entries:
  'h' => 1
  'l' => 2
  'e' => 1
  'o' => 1
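
For what it's worth, countmap works on any vector, and collect(s) produces the same Vector{Char} as the comprehension, so this is equivalent:

```julia
using StatsBase

s = "hello"
# collect(s) yields a Vector{Char}, same as [c for c in s]
freqs = countmap(collect(s))
freqs['l']  # 2
```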

#3

I wonder whether this would be faster and use less memory when run on a large array of strings.

s = "hello"

count_char(s) = begin
    res = Dict{Char, Int}()
    for c in s
        if haskey(res, c)
            res[c] += 1
        else
            res[c] = 1
        end
    end
    res
end

@time count_char(s)
@time count_char(s)

#4

@dpsanders @xiaodai thanks both~ I just remember there is a built-in Julia function that can also do this, but I can't remember its name. Maybe my memory is wrong.


#5

Sorry, it was my bad memory~ I found it at https://benlauwens.github.io/ThinkJulia.jl/latest/book.html#_word_histogram.


#6

There is no built-in function to do this count; I just saw the histogram() function in https://benlauwens.github.io/ThinkJulia.jl/latest/book.html#_word_histogram


#7

counter from the DataStructures package would be an option as well and should be equivalent in performance to the hand-made solution.

using DataStructures
s = "hello"
counter(s)
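
For reference, counter returns an Accumulator, which you index like a Dict; as far as I recall, missing keys simply count as zero instead of throwing:

```julia
using DataStructures

cnt = counter("hello")
cnt['l']  # 2
cnt['z']  # 0 (missing keys return zero rather than a KeyError)
```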

#8

julia> using BenchmarkTools

julia> @time count_char(s);
  0.106494 seconds (212.95 k allocations: 10.674 MiB)

julia> @time count_char(s);
  0.000003 seconds (8 allocations: 704 bytes)

julia> @btime count_char($s);
  183.447 ns (4 allocations: 544 bytes)

You should use BenchmarkTools for this, otherwise you won't get accurate timing and memory info. Especially for short runtimes, the timing is dramatically off.


#9

Yeah but sometimes using BenchmarkTools and waiting for it to compile is a pain. Close enough is close enough.


#10

It seems a bit strange, BTW, that countmap cannot handle generators. This should be faster and seems more elegant:

countmap(c for c in s)

#11

But this is off by more than an order of magnitude.

I’ve never felt any compilation pain with BenchmarkTools.


#12

Not that I want to argue, but it's the relative timings that matter here, not the absolute ones. Once you know the fastest way, you choose it regardless of the absolute times.


#13

Well, but then you at least need longer runtimes for the test. As it is now, you cannot tell the difference between 1 ns and 3 µs, so you won't know the relative runtimes.


#14

This appears to be a bit faster than the haskey-based solution above:

function count_char2(s)
    res = Dict{Char, Int}()
    for c in s
        res[c] = get(res, c, 0) + 1
    end
    return res
end
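
As a quick sanity check: since this iterates Chars rather than bytes, it also handles non-ASCII input (the definition is repeated so the snippet runs on its own):

```julia
function count_char2(s)
    res = Dict{Char, Int}()
    for c in s
        res[c] = get(res, c, 0) + 1
    end
    return res
end

# Multi-byte characters are counted correctly, one Char at a time
d = count_char2("héllo")
d['é']  # 1
d['l']  # 2
```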

(BTW, that performance difference is hard to detect with @time.)


#15

If ASCII is enough, it is possible to squeeze out a bit more:

function count_char_ascii(s)
    if !isascii(s)
        error("non ascii string")
    end
    z = zeros(Int, 256)
    for c in codeunits(s)
        @inbounds z[c] += 1
    end
    let z = z
        return Dict(Char(i) => z[i] for i in 1:length(z) if z[i] != 0)
    end
end

julia> using Random

julia> s = randstring(10^6);

julia> @btime c1 = count_char(s);
  20.043 ms (10 allocations: 5.34 KiB)

julia> @btime c2 = count_char2(s);
  15.341 ms (10 allocations: 5.34 KiB)

julia> @btime c3 = count_char_ascii(s);
  1.931 ms (15 allocations: 7.56 KiB)

#16

Good question! There are lots of ways to do this, some better than others depending on the specifics of your problem. I’ll walk through several methods and compare them. For all methods, I’ll use the following sample string:

using Random

s = randstring(100_000)

One-liner

oneliner(s) = Dict([(c, count(collect(s) .== c)) for c in unique(s)])

julia> @btime oneliner($s)
  28.911 ms (401 allocations: 24.68 MiB)

Minimal code and no external dependencies, but a naive algorithm: it scans the whole collected string once per unique character. Only consider this if performance doesn't matter.

Dictionary with haskey

count_char(s) = begin
    res = Dict{Char, Int}()
    for c in s
        if haskey(res, c)
            res[c] += 1
        else
            res[c] = 1
        end
    end
    res
end

julia> @btime count_char($s)
  1.684 ms (10 allocations: 5.25 KiB)

A decent implementation, but it can be made both faster and less verbose, since it does three lookups for each duplicate key (first haskey, then reading the current count, then writing the updated count).

Dictionary with default getter

function count_char2(s)
    res = Dict{Char, Int}()
    for c in s
        res[c] = get(res, c, 0) + 1
    end
    return res
end

julia> @btime count_char2($s)
  1.320 ms (10 allocations: 5.25 KiB)

A better version than the previous one, since it only does two lookups (one read and one write). It's also less verbose and doesn't duplicate logic. All-around better.

DataStructures counter

using DataStructures

julia> @btime counter($s)
  1.335 ms (11 allocations: 5.27 KiB)

This is pretty much the same as the previous implementation, but you don’t need to write the code yourself.

StatsBase countmap

using StatsBase

julia> @btime countmap([c for c in $s])
  949.163 μs (13 allocations: 395.97 KiB)

countmap in StatsBase is in general the best method I know of for this task, and would be my method of choice in most cases. It chooses between a single-lookup dictionary approach and a radix sort depending on the input data. Unfortunately, it doesn't accept generator input, which wastes some time and memory in this case.

Single lookup dictionary

function count_char_single_lookup(s)
    res = Dict{Char, Int}()
    for c in s
        index = Base.ht_keyindex2!(res, c)
        if index > 0
            @inbounds res.vals[index] += 1
        else
            @inbounds Base._setindex!(res, 1, c, -index)
        end
    end
    res
end

julia> @btime count_char_single_lookup($s)
  765.020 μs (10 allocations: 5.25 KiB)

Extracting the method used by countmap, we can gain some performance. I would be quite wary of using this method in code that you intend to live on for a while, since it exploits the internal, undocumented structure of Dict, which could change at some point in the future.

Dictionary with faster hash

Base.hash(c::Char) = UInt(c)

julia> @btime count_char_single_lookup($s)
  487.873 μs (10 allocations: 5.25 KiB)

The default hash implementation for primitive types is not very efficient. It's possible to squeeze a bit more performance out of dictionary-based implementations by using a more efficient hash method. Again, I'd be wary of using this method in practice, since it redefines the hash code for Char globally. I guess one could use a custom Char type to get around that, but I'm not sure whether that can be done in a clean and efficient way (suggestions appreciated).

StatsBase countmap with radix sort

using StatsBase

julia> @btime countmap([UInt(c) for c in $s]);
  448.074 μs (16 allocations: 1.53 MiB)

To get even faster, we can turn to non-dictionary based approaches. By making the keys integers, countmap switches to a radix sort implementation. At the end, one could trivially change the keys back to Chars (see Kristoffer’s post). Just like above, we could extract this code to avoid creating the temporary array to gain some performance. (Or better yet, add support for generators in StatsBase.)
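
Putting that together, a sketch of the radix-sort path with the keys converted back to Chars afterwards:

```julia
using StatsBase

s = "hello"
# Integer keys send countmap down its radix-sort code path
m = countmap([UInt(c) for c in s])
# Convert the keys back to Chars at the end, as suggested above
d = Dict(Char(k) => v for (k, v) in m)
d['l']  # 2
```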

Fixed-size array

function count_limited_range(s)
    counts = zeros(Int, 128)
    @inbounds for c in codeunits(s)
        counts[c] += 1
    end
    counts
end

julia> @btime count_limited_range($s)
  72.770 μs (1 allocation: 1.14 KiB)

Finally, if the input data is guaranteed to be within a limited range, such as ASCII, by far the fastest way is to use a fixed-size array of counters.
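
If a Dict is still wanted at the end, the counts array converts back cheaply, since index i of the array corresponds to the character with code i (the function is repeated here so the snippet is self-contained):

```julia
function count_limited_range(s)
    counts = zeros(Int, 128)
    @inbounds for c in codeunits(s)
        counts[c] += 1
    end
    counts
end

# Keep only the nonzero slots; Char(i) recovers the character for slot i
to_dict(counts) = Dict(Char(i) => counts[i] for i in 1:length(counts) if counts[i] != 0)

d = to_dict(count_limited_range("hello"))
d['l']  # 2
```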


#17

You would do it something like this:

struct FastHashChar
    c::Char
end
Base.hash(c::FastHashChar, h::UInt) = xor(c.c % UInt, h)
Base.isequal(c1::FastHashChar, c2::FastHashChar) = c1.c == c2.c
Base.Char(c1::FastHashChar) = c1.c

function count_char_single_lookup(s)
    res = Dict{FastHashChar, Int}()
    for _c in s
        c = FastHashChar(_c)
        index = Base.ht_keyindex2!(res, c)
        if index > 0
            @inbounds res.vals[index] += 1
        else
            @inbounds Base._setindex!(res, 1, c, -index)
        end
    end
    return Dict(Char(c) => n for (c,n) in res)
end
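
If you'd rather avoid the Dict internals entirely, the same wrapper type also works through the plain public Dict API; the custom hash still helps, you just lose the single-lookup trick. A sketch (count_char_fasthash is a name I made up; the type is repeated from above so this runs standalone):

```julia
struct FastHashChar
    c::Char
end
# Cheap hash: just the character's code point mixed with the seed
Base.hash(c::FastHashChar, h::UInt) = xor(c.c % UInt, h)
Base.isequal(c1::FastHashChar, c2::FastHashChar) = c1.c == c2.c

function count_char_fasthash(s)
    res = Dict{FastHashChar, Int}()
    for c in s
        k = FastHashChar(c)
        res[k] = get(res, k, 0) + 1
    end
    # Unwrap the keys back to plain Chars for the caller
    return Dict(c.c => n for (c, n) in res)
end

counts = count_char_fasthash("hello")
counts['l']  # 2
```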

#18

(While it is always a fun exercise to try to optimize the heck out of some little problem like this, 99.9% of the time it is not worth it and you should just use the simplest code, like countmap. It is nice that Julia allows us to go as crazy as we want in optimization, however! Next someone is going to suggest a SIMD implementation…)


#19

I suggest using SIMD:

function count_limited_range_simd(s)
    counts = zeros(Int, 128)
    @inbounds @simd for c in codeunits(s)
        counts[c] += 1
    end
    return counts
end

julia> @btime count_limited_range($s);
  78.288 μs (1 allocation: 1.14 KiB)

julia> @btime count_limited_range_simd($s);
  60.394 μs (1 allocation: 1.14 KiB)

#20

I agree with you, but I have to say that reading all these answers is a very good way to learn different and efficient ways of coding. The original poster can always choose the best solution for them, so I would say it does more good than harm to publicly play the optimization game :slight_smile: