Fast String processing in Julia vs. Python

You are of course completely correct. Embarrassing that I didn’t notice the first error that we always tell people to avoid!!

Try wrapping everything in a function .

Thanks @dpsanders and @bkamins

So this the latest version I have so far

using Combinatorics

"counts the number of occurances of l in s "
count_o(s,l)=count(x->x==l,s)
"The set of ways of concatenating one item from collection A with one from B."
cross(A, B)=["$a$b" for a in A for b in B]
"All combinations of n items; each combo as a concatenated str."
function combos(items, n)
    (join(combo," ") for combo in combinations(items,n))
end

"""The probability of an event, given a sample space of equiprobable outcomes.
event can be either a set of outcomes, or a predicate (true for outcomes in the event)."""
function P(event, space)
    length(intersect(event,space))//length(space)
end

function P(event::Function, space)
    count(event,space)//length(space)
end


suits = "SHDC"
ranks = "A23456789TJQK"
deck  = cross(ranks, suits)
Hands = combos(deck, 5) #A massive list of all possible hands
 
function flush(hand)
    s::String=suits
    any(count_o(hand,suit) == 5 for suit in s)
end

P(flush, Hands) #Warm up

Total execution time for Julia is now down to 3.94 vs python at 4.36. This is pretty decent and it also makes for nicer code. The use of generators and avoiding set operators really made a big difference.

However, I am still not fully satisfied. I am still surprised the python did better when full sets and set operators were used.
Can I infer form that the python’s handling of sets must be better that Julia’s at this stage?

For the sake of completeness, the implementation using sets is

using Combinatorics

"counts the number of occurances of l in s "
count_o(s,l)=count(x->x==l,s)
"The set of ways of concatenating one item from collection A with one from B."
cross(A, B)=["$a$b" for a in A for b in B]
"All combinations of n items; each combo as a concatenated str."
function combos(items, n)
    Set(join(combo," ") for combo in combinations(items,n))
end

"""The probability of an event, given a sample space of equiprobable outcomes.
event can be either a set of outcomes, or a predicate (true for outcomes in the event)."""
function P(event, space)
    event_ = such_that(event, space)
    length(intersect(event_,space))//length(space)
end

#Making use of Julia's multiple dispatch
"The subset of elements in the collection for which the predicate is true."
function such_that(predicate::Function, collection)
    filter(predicate,collection)
end;
"Default return for a collection"
function such_that(event, collection)
    event
end

suits = "SHDC"
ranks = "A23456789TJQK"
deck  = cross(ranks, suits)
Hands = combos(deck, 5) #A massive list of all possible hands

function flush(hand)
    s::String=suits
    any(count_o(hand,suit) == 5 for suit in s)
end

P(flush, Hands) #Warm up

This takes 14.1 seconds.

I continued with more experiments where I tried to preserve use of Sets so as not to deviate from the python code.

In the first variation, I used collect instead of simply passing the generator to the Set constructor.

function combos(items, n)
    Set(collect(join(combo," ") for combo = combinations(items,n)))
end

The speed improved and execution time went down to 8.7 seconds.

Then I did another variation when I mapped to a preallocated array.

function combos(items, n)
    combs=combinations(items,n)
    res=Vector{String}(length(combs))
    Set(map!(x->join(x," "),res,collect(combs)))
end

Speed improved further to 6.1 seconds.

Conclusions:

  1. Constructing a set is always faster when a preallocated array is provided
  2. map! always works faster than collect
  3. There is probably room for improving Set construction with a generator input.
2 Likes

On the risk of getting off topic, I would like to encourage a non-string solution for this problem. I see two problems with the string approach:

  1. The current implementation relies on the fact that the characters in suits and ranks don’t overlap, which is just a fortuitousness in the English language.
  2. In Python, using strings is probably more performant than using an abstraction, because operations on builtin types are much faster than on user-defined types.

But in Julia we can have abstractions at zero cost, which can also make the code safer. In this spirit I would like to offer the following version:

using Combinatorics

const suits = "SHDC"
const ranks = "A23456789TJQK"
 
immutable Card{TS,TR}
   suit :: TS
   rank :: TR
end

"""The probability of an event, given a sample space of equiprobable outcomes.
event can be either a set of outcomes, or a predicate (true for outcomes in the event)."""
function P(event::Function, space)
    count(event,space)//length(space)
end

function flush(hand)
    any( count(card -> card.suit==s, hand) == 5 for s in suits )
end

function main()
    deck  = [ Card(s,r) for r in ranks for s in suits ]
    hands = combinations(deck, 5) 
    P(flush, hands) #Warm up
    @time p = P(flush, hands)
    println(p)
end

main()

With julia 0.5.0, this gets me:

  0.624516 seconds (20.79 M allocations: 991.425 MB, 9.39% gc time)

The main speedup comes from not using strings, but using the dedicated Card type, and having a hand simply be a list of Cards (thus avoiding the strings). The other changes I made (explicitly marking global variables as const, and putting the top-level code into a main function) have little impact on the speed. A little more speed can actually be gained by changing the defintion of suits as follows:

@enum SUIT spades hearts diamonds clubs
const suits = instances(SUIT)

which gets me

  0.531291 seconds (20.79 M allocations: 991.425 MB, 10.73% gc time)

(I guess on current architectures, Chars are too small so that operating on them individually incurs too much overhead.)

Note: on my machine the original Python version takes 5.4 seconds.

8 Likes

Very nice!

Probably some more speed can be gained (also in the original version) by replacing the flush function to just check if all the suits are equal to each other.

@dpsanders is right, in my version replacing flush by

function flush(hand)
    s1 = hand[1].suit
    count(card -> card.suit==s1, hand) == 5
end

yields

  0.388377 seconds (15.59 M allocations: 872.454 MB, 11.35% gc time)

i.e. another ~25% faster.

2 Likes

Instead of an enum, you can just use integers instead of Chars for suits and ranks, and get a similar speedup. I am curious as to why there’s still so much allocation.

Or just

function flush(hand)
    hand[1].suit == hand[2].suit == hand[3].suit == hand[4].suit == hand[5].suit
end

which is even slightly faster. Not sure how to write this more beautifully.

all(card.suit==s1 for card in hand)

Ah, the allocations come from traversing the generator of the combinations.

Indeed, if you actually do hands = collect(combinations(deck, 5)), then the allocations go down to 0, and the goes down to 0.05 secs!

But that moves much of the work from the P function into the collect operation, so to be fair we should time that as well. Rewriting the bottom part to

function main()
    deck  = [ Card(s,r) for r in ranks for s in suits ]
    hands = combinations(deck, 5)
    p = P(flush, hands)
    println(p)
end

@time main() # Warm up
@time main()

gives me

  0.905246 seconds (18.43 M allocations: 961.952 MB, 27.77% gc time)
  0.419223 seconds (18.19 M allocations: 951.772 MB, 14.00% gc time)

but adding the collect call for the combinations gives me

  1.461934 seconds (18.45 M allocations: 982.869 MB, 31.46% gc time)
  1.061860 seconds (18.19 M allocations: 971.600 MB, 32.99% gc time)
3 Likes

Another way to speed up string processing code which I don’t think has been mentioned here is to use a different type of string. Julia defaults to UTF8, which is a variable length encoding and can slow some calculations down. You may want to look at LegacyStrings.jl or some of the stuff in JuliaString if you need even more speed.

2 Likes

It’s unclear (to me at least) which version of Julia you’re using here, @mbeltagy, so I should mention that strings on 0.6-alpha should be much faster than they were on 0.5. They have, until 0.6 been slower than Python, but should now be the same (or hopefully even faster). In the future, we have a few potential tricks that could make them even faster still, but they’re fairly tricky tricks, so it’s hard to say for sure. Of course, we want all aspects of Julia to be at least as fast any other language, but we haven’t quite gotten to that state of performance nirvana yet. Strings are something Python has been very good at for a long time (and Perl is even better); now at least Julia should be competitive.

I should also mention that although we surely can and will, we often haven’t spent that much time honing the performance of generic data structures like hashes and dictionaries. This isn’t because we don’t want them to be fast (we do), but because they’re not as important as they are in a language like Python where they’re your only option. In Julia, when performance counts, your best bet is often to use a different, better suited data structure. In this case, @traktofon’s post showed that you can get a huge speedup that way – far faster than you could ever get with generic data structures, no matter how much we hone them.

5 Likes

Thanks @StefanKarpinski
I really learned a great deal here. I am appreciating how Julia works and how best to put its power to effective use.
I tested on 0.5 and 0.6 (master branch). The results were very similar.
For the original code that I first posted, total execution was 52 seconds for 0.5 and 54 for 0.6.
For the one that uses sets (more comparable to Python) it was 5.3 and 5.4 seconds.
As far I as I can tell, the difference is quite minimal between the two versions.

In the code below, I just timed the main function after warmup.

using Combinatorics

count_o(s,l)=count(x->x==l,s)
cross_(A, B)=["$a$b" for a in A for b in B]
function combos(items, n)
    combs=combinations(items,n)
    res=Vector{String}(length(combs))
    map!(x->join(x," "),res,collect(combs))
end

function P(event, space)
    event_ = such_that(event, space)
    length(intersect(event_,Set(space)))//length(space)
end

function such_that(predicate::Function, collection)
    filter(predicate,collection)
end;

suits = "SHDC"
ranks = "A23456789TJQK"

function flush(hand)
    s::String=suits
    any(count_o(hand,suit) == 5 for suit in s)
end

function main()
    deck  = cross_(ranks, suits)
    Hands = combos(deck, 5) #A massive list of all possible hands
    P(flush, Hands) #Warm up
end

6 posts were split to a new topic: ANN: Cards.jl

I’ve split the discussion of my new Cards.jl package into a new thread.