# Fast String processing in Julia vs. Python

I am trying to convert this notebook on probability by Peter Norvig. I observe that Python runs much faster than Julia. I couldn’t pin down the problem in my Julia code.

What is happening here is that we are looping over a massive array of strings that represent all the possible combinations of 5 card draws from a standard deck and we are checking for certain straightforward patterns to work out the probabilities.

Here is the Julia code:

``````using Combinatorics

"counts the number of occurances of l in s "
count_o(s,l)=count(x->x==l,s)

"The set of ways of concatenating one item from collection A with one from B."
cross(A, B)=["\$a\$b" for a in A for b in B]

"All combinations of n items; each combo as a concatenated str."
function combos(items, n)
[join(combo," ") for combo in combinations(items,n)]
end

"""The probability of an event, given a sample space of equiprobable outcomes.
event can be either a set of outcomes, or a predicate (true for outcomes in the event)."""
function P(event, space)
event_ = such_that(event, space)
length(intersect(event_,space))//length(space)
end

#Making use of Julia's multiple dispatch
"The subset of elements in the collection for which the predicate is true."
function such_that(predicate::Function, collection)
filter(predicate,collection)
end

"Default return for a collection"
function such_that(event, collection)
event
end

suits = "SHDC"
ranks = "A23456789TJQK"
deck  = cross(ranks, suits)
Hands = combos(deck, 5) #A massive list of all possible hands

four_kind(hand)=any(count_o(hand,rank) == 4 for rank in ranks)

P(four_kind, Hands) #Warm up

@time P(four_kind, Hands)
``````

I get ` 46.577004 seconds (5.20 M allocations: 119.030 MB, 0.10% gc time)`

The Python code:

``````import itertools
from fractions import Fraction
from math import factorial

def cross(A, B):
"The set of ways of concatenating one item from collection A with one from B."
return {a + b
for a in A for b in B}

def combos(items, n):
"All combinations of n items; each combo as a concatenated str."
return {' '.join(combo)
for combo in itertools.combinations(items, n)}

def choose(n, c):
"Number of ways to choose c items from a list of n items."
return factorial(n) // (factorial(n - c) * factorial(c))

def P(event, space):
"""The probability of an event, given a sample space of equiprobable outcomes.
event can be either a set of outcomes, or a predicate (true for outcomes in the event)."""
if is_predicate(event):
event = such_that(event, space)
return Fraction(len(event & space), len(space))

is_predicate = callable

def such_that(predicate, collection):
"The subset of elements in the collection for which the predicate is true."
return {e for e in collection if predicate(e)}

suits = 'SHDC'
ranks = 'A23456789TJQK'
deck  = cross(ranks, suits)

Hands = combos(deck, 5)

def flush(hand):
return any(hand.count(suit) == 5 for suit in suits)

P(flush, Hands)

``````

It runs in just 4 seconds.

Any ideas on how I could improve the code, or if I can’t, why is Julia slow here?

1 Like

Try `cross(A, B) = (string(a, b) for a in A for b in B)`.
This avoids string interpolation, and uses a generator instead of an array.

1 Like

Just note that you don’t need to avoid string interpolateion. The two are exactly the same.

``````julia> expand(:("\$a\$b"))
:((Base.string)(a, b))
``````
3 Likes

The performance blocks are in a combination of `P`, `count_o` and `four_kinds`. Not sure how to make them any faster.

You appear to be using `set` in python and `Array` in julia and you are intersecting them. The `Set` should be the correct datastructure to use in julia.

1 Like

Using sets improved things a bit. So I modified `combos` to

``````function combos(items, n)
Set(join(combo," ") for combo in combinations(items,n))
end
``````

Now I get
`19.393026 seconds (5.20 M allocations: 119.007 MB, 0.25% gc time)`

Certainly twice a fast as before, but not yet near python’s 4 seconds.

You need a `Set` in `cross` too.

That wouldn’t work, because `combinations` from `StatsBase` requires an indexable object, a `Set` is not indexable.

Just doing `Hands = combos(deck, 5)` is already terribly slow.
I don’t understand why it’s necessary to convert these to strings.

Doing `combinations(deck, 5)` seems to be much faster on Julia v0.6 (where it now lives in the `Combinatorics.jl` package).

Basically each possible 5 card hand is represented as a string. This is keeping with the python implementation. Arguably, I could create a `hand` type and define operators on it. But that would be more code, and I would like to have simple (yet fast) tutorial. I am still puzzled as to why Julia is so much slower that python here (4 seconds total run time).
On my computer, the `combo` version with the `Set` ran at 17 seconds, while `Array` version ran 7 seconds. The `Set` implementation was faster, however, in the `intersect`.

Can you post the current version of the code?

By the way, the Python version is checking for a flush and the Julia for four hands. Not sure how much difference that makes.

Did you actually check the function `count_o`? I’m not sure it works correctly:

``````julia> count_o(s,l)=count(x->x==l,s)
count_o (generic function with 1 method)

julia> count_o("ABCA", "A")
0
``````
``````julia> s = "ABCA"
"ABCA"

julia> count(x->x=="A", s)
0

julia> count(x->x=='A', s)
2
``````

Ignore me, this is irrevelant since

``````julia v0.5> [suit for suit in "ABCD"]
4-element Array{Char,1}:
'A'
'B'
'C'
'D'
``````

Defining

``````have_flush(hand) = any(suit->count_o(hand, suit) == 5, suits)
``````

and reducing everything to

``````count(have_flush, Hands)
``````

gives 6.1 seconds on my machine, compared to 4.9 for Python.

This completely avoids the intersection though.

On Julia 0.6 latest master, using an iterator for `combos`:

``````function combos(items, n)
(join(combo, " ") for combo in combinations(items,n))
end

Hands = combos(deck, 5)
``````

I get

``````julia> @time count(have_flush, Hands) // length(Hands)
2.866761 seconds (36.41 M allocations: 1.628 GiB, 6.31% gc time)
33//16660
``````

I thought it is worth to elaborate more on what @dpsanders proposed.

Apart from using `Set` in the computations I think that the problem is that

``````four_kind(hand)=any(count_o(hand,rank) == 4 for rank in ranks)
``````

uses `ranks` that is a global variable. Using local variable in your definition of `four_kind` like this:

``````function four_kind(hand)
r::String = ranks
any(count_o(hand,rank) == 4 for rank in r)
end
``````

speeds things up.

@dpsanders definition of `have_flush` avoids this problem although global variable is used.