Fast String processing in Julia vs. Python


#1

I am trying to convert this notebook on probability by Peter Norvig. I observe that Python runs much faster than Julia. I couldn’t pin down the problem in my Julia code.

What is happening here is that we are looping over a massive array of strings that represent all the possible combinations of 5 card draws from a standard deck and we are checking for certain straightforward patterns to work out the probabilities.

Here is the Julia code:

using Combinatorics

"counts the number of occurances of l in s "
count_o(s,l)=count(x->x==l,s)

"The set of ways of concatenating one item from collection A with one from B."
cross(A, B)=["$a$b" for a in A for b in B]

"All combinations of n items; each combo as a concatenated str."
function combos(items, n)
    [join(combo," ") for combo in combinations(items,n)]
end

"""The probability of an event, given a sample space of equiprobable outcomes.
event can be either a set of outcomes, or a predicate (true for outcomes in the event)."""
function P(event, space)
    event_ = such_that(event, space)
    length(intersect(event_,space))//length(space)
end

#Making use of Julia's multiple dispatch
"The subset of elements in the collection for which the predicate is true."
function such_that(predicate::Function, collection)
    filter(predicate,collection)
end

"Default return for a collection"
function such_that(event, collection)
    event
end

suits = "SHDC"
ranks = "A23456789TJQK"
deck  = cross(ranks, suits)
Hands = combos(deck, 5) #A massive list of all possible hands

four_kind(hand)=any(count_o(hand,rank) == 4 for rank in ranks)

P(four_kind, Hands) #Warm up

@time P(four_kind, Hands)

I get 46.577004 seconds (5.20 M allocations: 119.030 MB, 0.10% gc time)

The Python code:

import itertools
from fractions import Fraction
from math import factorial

def cross(A, B):
    "The set of ways of concatenating one item from collection A with one from B."
    return {a + b 
            for a in A for b in B}


def combos(items, n):
    "All combinations of n items; each combo as a concatenated str."
    return {' '.join(combo) 
            for combo in itertools.combinations(items, n)}


def choose(n, c):
    "Number of ways to choose c items from a list of n items."
    return factorial(n) // (factorial(n - c) * factorial(c))


def P(event, space): 
    """The probability of an event, given a sample space of equiprobable outcomes.
    event can be either a set of outcomes, or a predicate (true for outcomes in the event)."""
    if is_predicate(event):
        event = such_that(event, space)
    return Fraction(len(event & space), len(space))

is_predicate = callable

def such_that(predicate, collection): 
    "The subset of elements in the collection for which the predicate is true."
    return {e for e in collection if predicate(e)}


suits = 'SHDC'
ranks = 'A23456789TJQK'
deck  = cross(ranks, suits)


Hands = combos(deck, 5)


def flush(hand):
    return any(hand.count(suit) == 5 for suit in suits)

P(flush, Hands)

It runs in just 4 seconds.

Any ideas on how I could improve the code, or if I can’t, why is Julia slow here?


#2

Try cross(A, B) = (string(a, b) for a in A for b in B).
This avoids string interpolation, and uses a generator instead of an array.


#3

Just note that you don’t need to avoid string interpolateion. The two are exactly the same.

julia> expand(:("$a$b"))
:((Base.string)(a, b))

#4

The performance blocks are in a combination of P, count_o and four_kinds. Not sure how to make them any faster.


#5

You appear to be using set in python and Array in julia and you are intersecting them. The Set should be the correct datastructure to use in julia.


#6

Using sets improved things a bit. So I modified combos to

function combos(items, n)
    Set(join(combo," ") for combo in combinations(items,n))
end

Now I get
19.393026 seconds (5.20 M allocations: 119.007 MB, 0.25% gc time)

Certainly twice a fast as before, but not yet near python’s 4 seconds.


#7

You need a Set in cross too.


#8

That wouldn’t work, because combinations from StatsBase requires an indexable object, a Set is not indexable.


#9

Just doing Hands = combos(deck, 5) is already terribly slow.
I don’t understand why it’s necessary to convert these to strings.


#10

Doing combinations(deck, 5) seems to be much faster on Julia v0.6 (where it now lives in the Combinatorics.jl package).


#11

Basically each possible 5 card hand is represented as a string. This is keeping with the python implementation. Arguably, I could create a hand type and define operators on it. But that would be more code, and I would like to have simple (yet fast) tutorial. I am still puzzled as to why Julia is so much slower that python here (4 seconds total run time).
On my computer, the combo version with the Set ran at 17 seconds, while Array version ran 7 seconds. The Set implementation was faster, however, in the intersect.


#12

Can you post the current version of the code?


#13

By the way, the Python version is checking for a flush and the Julia for four hands. Not sure how much difference that makes.


#14

Did you actually check the function count_o? I’m not sure it works correctly:

julia> count_o(s,l)=count(x->x==l,s)
count_o (generic function with 1 method)

julia> count_o("ABCA", "A")
0

#15
julia> s = "ABCA"
"ABCA"

julia> count(x->x=="A", s)
0

julia> count(x->x=='A', s)
2

#16

Ignore me, this is irrevelant since

julia v0.5> [suit for suit in "ABCD"]
4-element Array{Char,1}:
 'A'
 'B'
 'C'
 'D'

#17

Defining

have_flush(hand) = any(suit->count_o(hand, suit) == 5, suits)

and reducing everything to

count(have_flush, Hands)

gives 6.1 seconds on my machine, compared to 4.9 for Python.


#18

This completely avoids the intersection though.


#19

On Julia 0.6 latest master, using an iterator for combos:

function combos(items, n)
    (join(combo, " ") for combo in combinations(items,n))
end

Hands = combos(deck, 5)

I get

julia> @time count(have_flush, Hands) // length(Hands)
  2.866761 seconds (36.41 M allocations: 1.628 GiB, 6.31% gc time)
33//16660

#20

I thought it is worth to elaborate more on what @dpsanders proposed.

Apart from using Set in the computations I think that the problem is that

four_kind(hand)=any(count_o(hand,rank) == 4 for rank in ranks)

uses ranks that is a global variable. Using local variable in your definition of four_kind like this:

function four_kind(hand)
    r::String = ranks
    any(count_o(hand,rank) == 4 for rank in r)
end

speeds things up.

@dpsanders definition of have_flush avoids this problem although global variable is used.