Help to get my slow Julia code to run as fast as Rust/Java/Lisp

Are you sure? If I do

n = Base.GMP.MPZ.set_ui!(big"1", 1)

then the ns should be independent for each call. Even though it may reuse some global, that’s only a problem if calls execute in parallel.

Feels a bit hacky though. I wonder if the need for set_ui! is a bug.
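For context, set_ui! wraps GMP's mpz_set_ui and re-initializes the (shared) object in place, so each call starts from a known value. A sketch (add_ui! is an internal wrapper, so its exact signature is an assumption on my part):

n = Base.GMP.MPZ.set_ui!(big"1", 1)  # same underlying object every call, but value reset to 1
Base.GMP.MPZ.add_ui!(n, n, 41)       # n == 42 now; the next call's set_ui! resets it again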

Why are we using BigInt in the first place? Neither Rust nor Java seem to use their BigInt versions and instead just use UInt8[] and at least Rust passes around views of them since they’re not written to anyway. We can in theory do the exact same thing (well, minus the not-writing guarantee):

julia> a = Dict([0x01] => "a", [0x2] => "b")
Dict{Vector{UInt8}, String} with 2 entries:
  [0x01] => "a"
  [0x02] => "b"

julia> b = [0x1, 0x2]
2-element Vector{UInt8}:
 0x01
 0x02

julia> a[view(b, 1:1)]
"a"

Seems like the ground is much less level than we initially thought :smiley:

2 Likes

Thanks for your comment. But the behaviour I demonstrated seems to indicate that there's a bug, doesn't it? big"1" should not just reuse the object, but also make sure it has the correct value?

See above:

Also on the master branch they do use BigInt:

I guess we could spin up Int256 via BitIntegers.jl though?
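Something like this, assuming BitIntegers.jl is installed (it predefines Int256/UInt256):

using BitIntegers

n = UInt256(0)
n = n * 10 + 7  # plain fixed-width integer arithmetic, no BigInt allocation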

BTW, mul_ui! is faster than mul_si!, and it's OK here because you multiply by a positive number, 10.

I think creating a muladd! method would make the code cleaner. There is an addmul in the GMP library, but curiously no muladd. Maybe there’s no performance advantage.
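For what it's worth, a rough sketch of such a muladd!, built from the internal Base.GMP.MPZ wrappers (these are undocumented internals, so the exact names and signatures are an assumption):

using Base.GMP: MPZ

# in-place n = n*m + d, avoiding intermediate BigInt allocations
function muladd!(n::BigInt, m::Integer, d::Integer)
    MPZ.mul_ui!(n, n, UInt(m))  # n *= m
    MPZ.add_ui!(n, n, UInt(d))  # n += d
    n
end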

I see :man_facepalming: Missed that, since I've only worked on julia-optimisations. Which branch are the most recent benchmarks from, then? master, or some older version where Vec<u8> was a thing?

In any case, the benchmark runner from master also doesn’t work for me, so I can’t even reproduce any timings :man_shrugging: @renatoathaydes do you maybe know what’s going on here? I’m just running ./benchmark.sh from the master branch of your repo. I still get the same error I posted above, about not being able to find pid_rusage in libproc.

Couldn't help but try to optimize this further :sweat_smile:. Here is my code: https://github.com/jakobnissen/prechelt_benchmark

Note that it is optimized for runtime only, with compile time subtracted. Here are a few of the optimizations:

  • Use UInt256 from BitIntegers.jl instead of BigInt.
  • Use a lookup table to translate words to numbers, operating on the bytes of the input string instead of its chars (see the sketch after this list).
  • sizehint! the dictionary.
  • Do not allocate a new string when printing; instead, make multiple calls to print.
  • Do not allocate new arrays of strings when recursing in printTranslations where possible; instead, push! and pop! from the existing array (perhaps even more allocations can be removed).
  • The dict's valtype is now Union{Tuple{String}, Vector{String}}.
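To illustrate the byte-level lookup table (a sketch under my own naming, not the actual code from the repo; the letter-to-digit mapping is the one from the benchmark problem):

# 256-entry table mapping ASCII letter bytes to phone digits; 0xff marks "not a letter"
const BYTE_TO_DIGIT = let t = fill(0xff, 256)
    for (i, letters) in enumerate(("e", "jnq", "rwx", "dsy", "ft", "am", "civ", "bku", "lop", "ghz"))
        for ch in letters
            t[UInt8(ch) + 1] = UInt8(i - 1)            # lowercase letter
            t[UInt8(uppercase(ch)) + 1] = UInt8(i - 1) # uppercase letter
        end
    end
    t
end

using BitIntegers  # for UInt256

function word_key(word::AbstractString)
    n = UInt256(1)  # leading-1 sentinel, as in the other solutions in this thread
    for b in codeunits(word)
        d = BYTE_TO_DIGIT[b + 1]
        d == 0xff && continue  # skip quotes and other non-letters in the wordlist
        n = n * 10 + d
    end
    n
end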

The result, as median runtimes when printing to an IOBuffer:
Before (v1): 856 ms
After (v2): 28.8 ms
Speedup: 30x

The majority of time in the optimized version is spent building the dictionary, on things like rehashing it and allocating the individual arrays.
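On the rehashing part specifically, reserving capacity up front is what the sizehint! optimization above does (a sketch; 75_000 is just a guess at the wordlist size):

using BitIntegers  # for UInt256

dict = Dict{UInt256, Union{Tuple{String}, Vector{String}}}()
sizehint!(dict, 75_000)  # grow the hash table once instead of rehashing repeatedly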

4 Likes

@DNF Nope - the macro runs at parse time and allocates the number once. When re-running the code, it reuses the same object and does not re-initialize it. It is, in fact, correct behaviour.

I understand that it reuses the object without re-initializing, but it seems very odd that it doesn’t reset the value.

I see no hint of this behaviour in the docstring:

help?> @big_str
  @big_str str
  @big_str(str)


  Parse a string into a BigInt or BigFloat, and throw an ArgumentError if the string is not a valid number. For
  integers _ is allowed in the string as a separator.

  Examples
  ≡≡≡≡≡≡≡≡≡≡

  julia> big"123_456"
  123456

  julia> big"7891.5"
  7891.5

Does this mean that anytime one writes x = big"143" in a function, one might get something that is dramatically different from 143?

It seems really dangerous and surprising, do you disagree with that?
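A minimal sketch of the hazard in question, assuming the reuse behaviour described above (the internal MPZ call stands in for any code that mutates the value in place):

function f()
    n = big"1"                     # the same BigInt object on every call
    Base.GMP.MPZ.add_ui!(n, n, 1)  # mutate it in place: n += 1
    n
end

f()  # 2
f()  # 3 -- the "literal" was mutated by the previous call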

Btw, this is what I get for my code when benchmarking against the other implementations:

Benchmarking...
===> java -cp build/java Main
Proc,Memory(bytes),Time(ms)
java,0,290
Proc,Memory(bytes),Time(ms)
java,0,354
Proc,Memory(bytes),Time(ms)
java,0,702
Proc,Memory(bytes),Time(ms)
java,0,766
===> java -cp build/java Main2
Proc,Memory(bytes),Time(ms)
java,0,375
Proc,Memory(bytes),Time(ms)
java,0,590
Proc,Memory(bytes),Time(ms)
java,0,1785
Proc,Memory(bytes),Time(ms)
java,0,2471
===> ./phone_encoder
Proc,Memory(bytes),Time(ms)
./phone_encoder,0,142
Proc,Memory(bytes),Time(ms)
./phone_encoder,0,274
Proc,Memory(bytes),Time(ms)
./phone_encoder,0,804
Proc,Memory(bytes),Time(ms)
./phone_encoder,0,1683
===> j1 src/julia/phone_encoder.jl
Proc,Memory(bytes),Time(ms)
j1,0,802
Proc,Memory(bytes),Time(ms)
j1,0,825
Proc,Memory(bytes),Time(ms)
j1,0,1243
Proc,Memory(bytes),Time(ms)
j1,0,1618

@jakobnissen’s version fails validation for me.

2 Likes

It seems really dangerous and surprising, do you disagree with that?

I sort of disagree. It’s not documented in @big_str, since it’s part of how macros in general work. It is documented under Julia’s section on macros: Metaprogramming · The Julia Language

It actually is a feature, not a bug (although it can cause bugs if people are unaware of it). The advantage is that all the work of creating the object is done at compile time, so it can be much more efficient. You can think of it as a constant that is associated with the function. If you mutate the constant, then yes, that can lead to problems.

1 Like

Yeah, I have at least a rudimentary understanding of macros, but this just seems a bit crazy. The docs don't say anything about reusing memory, so I don't think this is obvious just because it's a macro.

For small numbers, setting the correct value after creating it is basically free, but I guess that it will be expensive and complicated for really big numbers.

I always understood the big"" macro to be there only because it was necessary for creating some numbers that would otherwise be truncated. I had no idea there was a performance optimization in there, let alone one this dangerous.

@jakobnissen’s version fails validation for me.

Whooops, that’s because I accidentally printed an extra space for each line. Just pushed a fix to my repo.

@DNF After thinking about it, you're probably right. It would be better if the default behaviour was that it allocated a new integer every time the function with the macro was run. It could work like BioSequences' @dna_str, except it could default to dynamic allocation. People who want the performance can then opt in to the macro optimization. Should be easy to implement.
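Something like this hypothetical sketch (my own naming; not how Base actually implements @big_str):

macro bigcopy_str(s)
    n = parse(BigInt, s)  # parsing work happens once, at macro-expansion time
    :(deepcopy($n))       # but every evaluation hands out a fresh copy
end

x = bigcopy"143"  # an independent BigInt each time this line runs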

Will you make an issue on the Julia repo, or should I?

2 Likes

It would be great if you could (I should really do something else now), but if not, I’ll do it tonight.

1 Like

What am I missing here:

words2 = push!(copy(words), string(nthDigit(digits, start)))
printTranslations(io, num, digits, dict, start + 1, words2)

Won’t push!/pop! work here as well?
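That is, replacing the copy with (a sketch of what I mean):

push!(words, string(nthDigit(digits, start)))
printTranslations(io, num, digits, dict, start + 1, words)
pop!(words)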

Sure, there is no difference there. Also, I made the issue on the Julia repo about @big_str.

2 Likes

I don’t think you pushed anything, but if I fix the printing I get

Benchmarking...
===> java -cp build/java Main
Proc,Memory(bytes),Time(ms)
java,0,308
Proc,Memory(bytes),Time(ms)
java,0,390
Proc,Memory(bytes),Time(ms)
java,0,591
Proc,Memory(bytes),Time(ms)
java,0,828
===> java -cp build/java Main2
Proc,Memory(bytes),Time(ms)
java,0,336
Proc,Memory(bytes),Time(ms)
java,0,545
Proc,Memory(bytes),Time(ms)
java,0,1798
Proc,Memory(bytes),Time(ms)
java,0,1907
===> ./phone_encoder
Proc,Memory(bytes),Time(ms)
./phone_encoder,0,127
Proc,Memory(bytes),Time(ms)
./phone_encoder,0,263
Proc,Memory(bytes),Time(ms)
./phone_encoder,0,802
Proc,Memory(bytes),Time(ms)
./phone_encoder,0,1711
===> j1 src/julia/phone_encoder.jl
Proc,Memory(bytes),Time(ms)
j1,0,2043
Proc,Memory(bytes),Time(ms)
j1,0,2831
Proc,Memory(bytes),Time(ms)
j1,0,6398
Proc,Memory(bytes),Time(ms)
j1,0,8082
===> j1 src/julia/phone_encoder_pfitzseb.jl
Proc,Memory(bytes),Time(ms)
j1,0,703
Proc,Memory(bytes),Time(ms)
j1,0,773
Proc,Memory(bytes),Time(ms)
j1,0,1155
Proc,Memory(bytes),Time(ms)
j1,0,1664
===> j1 src/julia/phone_encoder_jakni.jl
Proc,Memory(bytes),Time(ms)
j1,0,1074
Proc,Memory(bytes),Time(ms)
j1,0,1153
Proc,Memory(bytes),Time(ms)
j1,0,1496
Proc,Memory(bytes),Time(ms)
j1,0,1896

Oh, and here is my version (just realized I hadn’t posted that anywhere):

Replaced BigInt usage with two Int128s, optimized printing:
#=
# Port of Peter Norvig's Common Lisp program from http://norvig.com/java-lisp.html.
#
# - Julia version: 1.6.2
# - Author: Renato Athaydes
# - Date: 2021-07-24
=#
const emptyStrings = String[]

function printTranslations(dict, num, digits, start=1, words=String[])
    if start > length(digits)
        print(num, ": ")
        for (i, word) in enumerate(words)
            print(word)
            i == length(words) ? println() : print(' ')
        end
        return
    end
    foundWord = false
    # the number is split across two Int128s: n2 accumulates digits behind a
    # leading-1 sentinel until it reaches 1e25, after which digits go into n1
    n1, n2 = Int128(0), Int128(1)
    for i in start:length(digits)
        if n2 >= 1e25
            n1 = n1 * 10 + nthDigit(digits, i)
        else
            n2 = n2 * 10 + nthDigit(digits, i)
        end
        for word in get(dict, (n1, n2), emptyStrings)
            foundWord = true
            push!(words, word)
            printTranslations(dict, num, digits, i + 1, words)
            pop!(words)
        end
    end
    if !foundWord &&
        !(!isempty(words) && length(words[end]) == 1 && isdigit(words[end][begin]))
        push!(words, string(nthDigit(digits, start)))
        printTranslations(dict, num, digits, start + 1, words)
        pop!(words)
    end
end

function loadDictionary(file)::Dict{Tuple{Int128, Int128}, Vector{String}}
    local dict = Dict{Tuple{Int128, Int128}, Vector{String}}()
    for word in eachline(file)
        push!(get!(() -> String[], dict, wordToNumber(word)), word)
    end
    dict
end

function nthDigit(digits::String, i::Int64)::UInt
    UInt(digits[i]) - UInt('0')
end

function charToDigit(ch::Char)::UInt
    ch = lowercase(ch)
    ch == 'e' && return 0
    ch in ('j', 'n', 'q') && return 1
    ch in ('r', 'w', 'x') && return 2
    ch in ('d', 's', 'y') && return 3
    ch in ('f', 't') && return 4
    ch in ('a', 'm') && return 5
    ch in ('c', 'i', 'v') && return 6
    ch in ('b', 'k', 'u') && return 7
    ch in ('l', 'o', 'p') && return 8
    ch in ('g', 'h', 'z') && return 9
    throw(DomainError(ch, "Not a letter"))
end

function wordToNumber(word::String)::Tuple{Int128, Int128}
    n1, n2 = Int128(0), Int128(1)  # same two-Int128 encoding as in printTranslations
    for ch in word
        if isletter(ch) && isascii(ch)
            if n2 >= 1e25
                n1 = n1 * 10 + charToDigit(ch)
            else
                n2 = n2 * 10 + charToDigit(ch)
            end
        end
    end
    n1, n2
end


function main(dictionary, input)
    dict = open(dictionary) do file
        loadDictionary(file)
    end

    open(input) do file
        for num in eachline(file)
            printTranslations(dict, num, filter(isdigit, num))
        end
    end
end

if length(ARGS) == 2
    main(ARGS...)
end

3 Likes

That's the one with two Int128s? Seems like our bottleneck truly is GMP then, since that's what we're binding against :confused: It's also hard to optimize from the Julia side, since it's a ccall that does all the work here.

Yes. Not really surprising. Imho BigInt should only be used if you don't actually have an idea of how big your numbers can get.

7 Likes