Performance of Julia as a general-purpose programming language

bernhard · November 16, 2017, 9:50am

Thanks.

I realize now, there is actually a ton of literature and SO questions on those topics.
http://eternallyconfuzzled.com/arts/jsw_art_rand.aspx

rfourquet · November 16, 2017, 12:43pm

Actually, the base library does this in two steps: generate a random integer i in 1:length(231:33:423), and then indexing this StepRange at i, i.e. (231:33:423)[i]. So your approach generalizes very easily too. But as @bkamins nicely explained, this is not as perfectly distributed as the base method.

stevengj · November 16, 2017, 1:09pm

This is a consequence of how floating-point numbers work, not of how rand is implemented. A Float64 is a sign bit, 11 exponent bits, and a 52-bit significand (plus an implicit 1 at the beginning). For normalized values in [0,1), the sign bit and the exponent bits are fixed. So, it is only the significand bits that are random.

Tamas_Papp · November 16, 2017, 1:15pm

What was surprising to me is that even rand(UInt64) will get 52 random bits twice, then ⊻ them. I thought it would get 64 random bits directly.

bkamins · November 16, 2017, 1:49pm

Yes. But in [0,1] they are not distributed uniformly. They are distributed uniformly when you consider Close1Open2 interval.

You can see it by running:

[mean(bits(rand())[60]-'0' for i in 1:10^6) for i in 13:64]

and

[mean(bits(rand(Base.Random.Close1Open2))[60]-'0' for i in 1:10^6) for i in 13:64]

rfourquet · November 16, 2017, 4:20pm

It’s because the dSFMT library produces natively 52 random bits at a time. Some time ago I had experimented using SFMT instead, which produces directly 64 bits (or 128), but the benefits were not substantial compared to base’s approach (of course I may have had my benchmark wrong).

Seif_Shebl · November 17, 2017, 5:17pm

Actually, nothing stops you in Julia from achieving 1X C speed. You just need to be more careful and limit the amount of dynamism in your code. As others pointed out, SVectors are your best shoot here. In the following code, I used SVectors and only modified two lines; line 73 now reads trunc(Int, 1+rand()*game.nbFreeMoves) and line 80 contains a ternary condition which evaluates in both cases to the same function, I removed the condition. I also allowed inlining of most functions via the @inline macro.

The following code executes at the same speed as the C++ version on my machine.

const NB_GAMES = 1000

using StaticArrays
const EYE9 = eye(SMatrix{9, 9, Int})
const BOARD_LUT = SVector(EYE9, 2*EYE9)

struct Tictactoe
  board::SVector{9,Int}
  nbFreeMoves   :: Int
  currentPlayer :: Int
  nextPlayer    :: Int
end

Tictactoe() = Tictactoe( [0, 0, 0, 0, 0, 0, 0, 0, 0] , 9, 1, 2)

@inline function computeFinish(b::SVector{9,Int}, nbFreeMoves::Int)
  if b[1] != 0 && b[1] == b[2] == b[3]
    true, b[1]
  elseif b[4] != 0 && b[4] == b[5] == b[6]
    true, b[4]
  elseif b[7] != 0 && b[7] == b[8] == b[9]
    true, b[7]
  elseif b[1] != 0 && b[1] == b[4] == b[7]
    true, b[1]
  elseif b[2] != 0 && b[2] == b[5] == b[8]
    true, b[2]
  elseif b[3] != 0 && b[3] == b[6] == b[9]
    true, b[3]
  elseif b[1] != 0 && b[1] == b[5] == b[9]
    true, b[1]
  elseif b[3] != 0 && b[3] == b[5] == b[7]
    true, b[1]
  elseif nbFreeMoves == 0  # no winner, no empty cell -> draw
    true, 0
  else  # no winner, empty cell -> game not finished
    false, 0
  end
end

@inline function findIndex(board::SVector{9,Int}, moveIndex::Int)
  nbFree = 0
  k = 1
  while k<=9 
    if board[k] == 0
      nbFree += 1
      if nbFree == moveIndex
        break
      end
    end
    k += 1
  end
  k
end

@inline function playIndex(game::Tictactoe, moveIndex::Int)
  k = findIndex(game.board, moveIndex)
  if k <= 9
    board = game.board + BOARD_LUT[game.currentPlayer][k,:]
    Tictactoe(board, game.nbFreeMoves-1, game.nextPlayer, game.currentPlayer)
  else
    game
  end
end

abstract type Bot end

@inline function runGame(game::Tictactoe, bot1::Bot, bot2::Bot)
  while true
    finished, winner = computeFinish(game.board, game.nbFreeMoves)
    if finished
      return winner
    end
    moveIndex = genmove(game, bot1) # game.currentPlayer == 1, genmove(game, bot1), genmove(game, bot2) 
    game = playIndex(game, moveIndex)
  end
end

struct RandomBot <: Bot end

@inline genmove(game::Tictactoe, bot::RandomBot) = trunc(Int, 1+rand()*game.nbFreeMoves) # ceil(Int,rand()*game.nbFreeMoves)

struct McBot <: Bot
  maxSims::Int
end

function evalMove(game::Tictactoe, moveIndex::Int, nbSims::Int)
  bot = RandomBot()
  player = game.currentPlayer
  game1 = playIndex(game, moveIndex)
  evalMove1(unused) = runGame(game1, bot, bot) == player ? 1 : 0
  mapreduce(evalMove1, (+), 0, 1:nbSims)
end

function genmove(game::Tictactoe, bot::McBot)
  nbMoves = game.nbFreeMoves
  nbMoveSims = max(1, div(bot.maxSims, nbMoves))
  simulate(moveIndex) = evalMove(game, moveIndex, nbMoveSims)
  wins = map(simulate, [moveIndex for moveIndex in 1:nbMoves])     
  indmax(wins)
end

@inline function runGames(game::Tictactoe, bot1::Bot, bot2::Bot, nbGames::Int)
  res = [0, 0, 0]
  for _ in 1:nbGames
    winner = 1 + runGame(game, bot1, bot2)                     
    res[winner] += 1
  end
  res
end

function runXp(nbSims::Int)
  game = Tictactoe()
  botA = RandomBot()
  botB = McBot(nbSims)
  resAB = runGames(game, botA, botB, NB_GAMES)
  resBA = runGames(game, botB, botA, NB_GAMES)
  println("RandomBot McBot ", nbSims, " ", NB_GAMES, " ", 
          resAB[1], " ", resAB[2], " ", resAB[3], " ", 
          resBA[1], " ", resBA[2], " ", resBA[3], " ", 
          resAB[1]+resBA[1], " ", resAB[2]+resBA[3], " ", resAB[3]+resBA[2])
end

println("botA botB nbSims nbGames drawsA1B2 winsA1 winsB2 drawsB1A2 winsB1 winsA2 draws winsA winsB")
vecNbSims = [1, 10, 100, 1000, 10000]
@time map(runXp, vecNbSims)

juliendehos · November 17, 2017, 9:47pm

Line 80, the condition is needed because runGame is also used in runGames (with different bots). Inlining works great: the SVector version is almost as fast as the C++ version. Thanks.
Is there a way to predict which function will be interesting to inline ?

Seif_Shebl · November 18, 2017, 7:04pm

Oh, yes I missed this.

Well, you can still beat C++ using only SVector without struct. The following version is 5% faster than C++. Enjoy.

const NB_boardS = 1000

using StaticArrays
const EYE9 = eye(SMatrix{9, 9, Int})
const BOARD_LUT = SVector(EYE9, 2*EYE9)

const vec9 = SVector{9,Int}
const vec3 = SVector{3,Int}

@inline function computeFinish(b::SVector{9,Int}, nbFreeMoves::Int)
  if b[1] != 0 && b[1] == b[2] == b[3]
    true, b[1]
  elseif b[4] != 0 && b[4] == b[5] == b[6]
    true, b[4]
  elseif b[7] != 0 && b[7] == b[8] == b[9]
    true, b[7]
  elseif b[1] != 0 && b[1] == b[4] == b[7]
    true, b[1]
  elseif b[2] != 0 && b[2] == b[5] == b[8]
    true, b[2]
  elseif b[3] != 0 && b[3] == b[6] == b[9]
    true, b[3]
  elseif b[1] != 0 && b[1] == b[5] == b[9]
    true, b[1]
  elseif b[3] != 0 && b[3] == b[5] == b[7]
    true, b[1]
  elseif nbFreeMoves == 0  # no winner, no empty cell -> draw
    true, 0
  else  # no winner, empty cell -> game not finished
    false, 0
  end
end

@inline function findIndex(board::SVector{9,Int}, moveIndex::Int)
  nbFree = 0
  k = 1
  while k<=9 
    if board[k] == 0
      nbFree += 1
      if nbFree == moveIndex
        break
      end
    end
    k += 1
  end
  k
end

@inline function playIndex(board, last3, moveIndex::Int)
  k = findIndex(board, moveIndex)
  if k <= 9
    board = board + BOARD_LUT[last3[2]][k,:]
    board, vec3(last3[1]-1,last3[3],last3[2]) 
  else
    board, last3
  end
end

abstract type Bot end

@inline function runGame(board, last3, bot1::Bot, bot2::Bot)
  while true
    finished, winner = computeFinish(board, last3[1])
    if finished
      return winner
    end
    moveIndex = last3[2] == 1 ? genmove(board, last3, bot1) : genmove(board, last3, bot2) 
    board, last3 = playIndex(board, last3, moveIndex)
  end
end

struct RandomBot <: Bot end

@inline genmove(board, last3, bot::RandomBot) = trunc(Int, 1+rand()*last3[1]) 

struct McBot <: Bot
  maxSims::Int
end

function evalMove(board, last3, moveIndex::Int, nbSims::Int)
  bot = RandomBot()
  player = last3[2]
  board1,vvv3 = playIndex(board, last3, moveIndex)
  evalMove1(unused) = runGame(board1, vvv3, bot, bot) == player ? 1 : 0
  mapreduce(evalMove1, (+), 0, 1:nbSims)
end

function genmove(board, last3, bot::McBot)
  nbMoves = last3[1]
  nbMoveSims = max(1, div(bot.maxSims, nbMoves))
  simulate(moveIndex) = evalMove(board, last3, moveIndex, nbMoveSims)
  wins = map(simulate, [moveIndex for moveIndex in 1:nbMoves])     # 
  indmax(wins)
end

function runGames(board, last3, bot1::Bot, bot2::Bot, nbboards::Int)
  res = [0, 0, 0]
  for _ in 1:nbboards
    winner = 1 + runGame(board, last3, bot1, bot2)                # 
    res[winner] += 1
  end
  res
end

function runXp(nbSims::Int)
  board = vec9(0,0,0,0,0,0,0,0,0)
  last3 = vec3(9,1,2)
  botA = RandomBot()
  botB = McBot(nbSims)
  resAB = runGames(board, last3, botA, botB, NB_boardS)
  resBA = runGames(board, last3, botB, botA, NB_boardS)
  println("RandomBot McBot ", nbSims, " ", NB_boardS, " ", 
          resAB[1], " ", resAB[2], " ", resAB[3], " ", 
          resBA[1], " ", resBA[2], " ", resBA[3], " ", 
          resAB[1]+resBA[1], " ", resAB[2]+resBA[3], " ", resAB[3]+resBA[2])
end

println("botA botB nbSims nbboards drawsA1B2 winsA1 winsB2 drawsB1A2 winsB1 winsA2 draws winsA winsB")
vecNbSims = [1, 10, 100, 1000, 10000]
@time map(runXp, vecNbSims)

Elrod · November 19, 2017, 1:43am

Awesome work!

I remember in the benchmark thread, many said there’s more variability between compilers than between Julia and C/Fortran/&Co.

So, testing your code (following compiling via runXp(1)) on my machine:

julia> versioninfo()
Julia Version 0.6.1
Commit 0d7248e* (2017-10-24 22:15 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen Threadripper 1950X 16-Core Processor
  WORD_SIZE: 64
  BLAS: libopenblas (ZEN)
  LAPACK: liblapack
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, generic)

julia> @time map(runXp, vecNbSims);
RandomBot McBot 1 1000 25 301 674 33 861 106 58 407 1535
RandomBot McBot 10 1000 22 265 713 24 934 42 46 307 1647
RandomBot McBot 100 1000 7 115 878 7 982 11 14 126 1860
RandomBot McBot 1000 1000 3 66 931 0 1000 0 3 66 1931
RandomBot McBot 10000 1000 3 55 942 0 1000 0 3 55 1942
  7.797082 seconds (100.84 k allocations: 12.569 MiB, 0.11% gc time)

Testing clang++-4.0:

$ clang++-4.0 -o no_lam_clang tictactoe_cmp.cpp -march=native -Ofast -std=c++11
$ time ./no_lam_clang
botA botB nbSims nbGames drawsA1B2 winsA1 winsB2 drawsB1A2 winsB1 winsA2 draws winsA winsB
RandomBot McBot 1 1000 84 441 475 9 922 69 93 510 1397
RandomBot McBot 10 1000 90 399 511 9 938 53 99 452 1449
RandomBot McBot 100 1000 49 217 734 5 982 13 54 230 1716
RandomBot McBot 1000 1000 28 163 809 3 990 7 31 170 1799
RandomBot McBot 10000 1000 22 153 825 4 988 8 26 161 1813

real	0m27.158s
user	0m27.155s
sys	0m0.000s

Makes me wonder if I did something wrong with clang???
If not: that is quite the trouncing by Julia as an llvm frontend!

$ g++ -o no_lam_5_4 tictactoe_cmp.cpp -march=native -Ofast -std=c++11
$ time ./no_lam_5_4
botA botB nbSims nbGames drawsA1B2 winsA1 winsB2 drawsB1A2 winsB1 winsA2 draws winsA winsB
RandomBot McBot 1 1000 111 413 476 1 917 82 112 495 1393
RandomBot McBot 10 1000 87 416 497 6 948 46 93 462 1445
RandomBot McBot 100 1000 53 247 700 4 986 10 57 257 1686
RandomBot McBot 1000 1000 28 154 818 6 991 3 34 157 1809
RandomBot McBot 10000 1000 35 132 833 4 991 5 39 137 1824

real	0m8.165s
user	0m8.159s
sys	0m0.004s
$ g++-7.2 -o no_lam tictactoe_cmp.cpp -march=native -Ofast
$ time ./no_lam
botA botB nbSims nbGames drawsA1B2 winsA1 winsB2 drawsB1A2 winsB1 winsA2 draws winsA winsB
RandomBot McBot 1 1000 85 419 496 6 927 67 91 486 1423
RandomBot McBot 10 1000 81 432 487 8 935 57 89 489 1422
RandomBot McBot 100 1000 58 243 699 7 990 3 65 246 1689
RandomBot McBot 1000 1000 42 148 810 6 987 7 48 155 1797
RandomBot McBot 10000 1000 29 129 842 8 984 8 37 137 1826

real	0m8.028s
user	0m8.026s
sys	0m0.000s

gcc (versions 5.4 and 7.2) did much better, but still not as fast as Julia.

Did I do something wrong with clang – taking about 3.5 times longer than Julia?
Does that imply anything about the future of Julia with newer versions of llvm?

juliendehos · November 19, 2017, 10:28am

@Seif_Shebl: yes, but to be fair, we should implement this in the C++ code too.

@Elrod: same problem on my machine (linux, gcc 7.2 = 12s, clang 4.0.1 = 36s). I don’t know why…