Code slower than Rust equivalent

fcharih · October 7, 2021, 4:46am

Hello,

I am trying to port code from a Rust codebase (not my own code) to Julia. Even though the code is nearly identical to the Rust one, it appears to run ~10x slower. Code is not clean at all – I am just doing a “literal” translation from the original code. It produces the exact same results, and follows the same sequence of operations (goes through the branches the same number of times, and so on). Are there blatantly obvious reasons why this code is so much slower than its Rust equivalent?

The code is below:

function compute_similar_smers(smer::UInt64, seed::Seed, hit_threshold::Int64)::Set{UInt64}
    encoded::UInt64 = SPRINT.Utils.encode_smer(smer)
    similar_smers::Set{UInt64} = Set()
    smer_score::Int16 = 0
    seed_current_digit::UInt64 = 0
    smer_current_digit::UInt64 = 0

    # Compute score of smer with itself (baseline)
    for i in 1:length(seed)
        seed_current_digit = seed.value & (31 << (i * 5))
        if seed_current_digit != 0
            smer_current_digit = encoded & (31 << (i * 5));
            smer_current_digit = smer_current_digit >> (i * 5);
            smer_score += PAM120_SPRINT[smer_current_digit,smer_current_digit]
        end
    end

    change_smer_digit(encoded, UInt64(0), seed, hit_threshold - smer_score, similar_smers)
    similar_smers
end

function change_smer_digit(
    smer::UInt64,
    digit_position::UInt64,
    seed::Seed,
    score_needed::Int64,
    similar_smers::Set{UInt64}
)
    
    if digit_position == length(seed)
        return
    end

    current_seed_digit::UInt64 = seed.value & (31 << (digit_position * 5))
    smer_current_digit::UInt64 = 0
    y::UInt64 = 0

    if current_seed_digit == 0
        change_smer_digit(smer, digit_position + 1, seed, score_needed, similar_smers)
    elseif digit_position == length(seed) - 1
        smer_current_digit = (smer & (31 << (digit_position * 5))) >> (digit_position * 5)

        for i in 0:19
            if PAM120_SPRINT[smer_current_digit,PAM120_ORDERED_SPRINT[smer_current_digit,i]] >= score_needed
                y = smer & (~(31 << (digit_position * 5)))
                y = y | (PAM120_ORDERED_SPRINT[smer_current_digit,i] << (5 * digit_position))
                push!(similar_smers, SPRINT.Utils.decode_smer(y, seed))
            else
                break
            end
        end
    else
        smer_current_digit = (smer & (31 << (digit_position * 5))) >> (digit_position * 5)
        smer_current_digit_plus1::UInt64 = get_smer_dig_plus1(smer, digit_position, seed)
        new_score_needed::Int64 = 0

        for i in 0:19
            if PAM120_SPRINT[smer_current_digit,PAM120_ORDERED_SPRINT[smer_current_digit,i]] >= score_needed
                y = smer & (~(31 << (digit_position * 5)))
                y = y | (PAM120_ORDERED_SPRINT[smer_current_digit,i] << (5 * digit_position))
                push!(similar_smers, SPRINT.Utils.decode_smer(y, seed))
                new_score_needed = score_needed + PAM120_SPRINT[smer_current_digit_plus1,smer_current_digit_plus1] - PAM120_SPRINT[smer_current_digit,PAM120_ORDERED_SPRINT[smer_current_digit,i]]
                change_smer_digit(y, digit_position + 1, seed, new_score_needed, similar_smers)
            else
                break
            end
        end
    end
end

function get_smer_dig_plus1(smer::UInt64, digit_position::UInt64, seed::Seed)::UInt64
    seed_current_digit::UInt64 = seed.value & (31 << ((digit_position + 1) * 5))

    if seed_current_digit != 0
        return (smer & (31 << ((digit_position + 1) * 5))) >> ((digit_position + 1) * 5)
    else
        return get_smer_dig_plus1(smer, digit_position + 1, seed)
    end
end

The PAM120_SPRINT and PAM120_ORDERED_SPRINT are global variables, but I was told that by declaring them as const, Julia would be able to optimize for that. Could it have anything to do with memory allocation (for my input, it runs in ~0.019s with ~233k allocations on second call to exclude compilation time, which seems like a lot, given that my function change_smer_digits is called ~4000 times including recursive calls)?

Cheers,
Francois

Sukera · October 7, 2021, 4:52am

What’s Seed? Do you have a link to the original Rust code?

There definitely shouldn’t be 200k allocations if rust isn’t doing that, so my guess would be something has gone wrong during translation. My first idea would be to avoid the global variables - const only fixes their type, not their values, and since that means the value could change at any time, it’s possible that lookups to the global scope are inserted everywhere, preventing SIMD, possibly allocating etc.

Sukera · October 7, 2021, 4:56am

fcharih:

encoded::UInt64 = SPRINT.Utils.encode_smer(smer)
    similar_smers::Set{UInt64} = Set()
    smer_score::Int16 = 0
    seed_current_digit::UInt64 = 0
    smer_current_digit::UInt64 = 0

These type annotations don’t really help the compiler - they only serve as an aid, preventing you from accidentally assigning a value of a different type.

Also, Set() should probably be Set{Uint64}(), to directly create an object of the correct type instead of having to convert it.

Also - what does @code_warntype say, anything red/falling back to Any?

fcharih · October 7, 2021, 8:00pm

Thanks for your comments. No point in sharing the Rust code, really – it is identical.

@code_warntype didn’t suggest any type instability, and passing the globals as typed arguments to the functions only helps marginally.

This is really odd…

Oscar_Smith · October 7, 2021, 8:06pm

What is the definition of Seed?

fcharih · October 7, 2021, 8:14pm

Just a simple struct.

struct Seed
    seed_string::String
    value::UInt64
end

kristoffer.carlsson · October 7, 2021, 8:37pm

What are your inputs then? In general, please give enough information for people to at least have a chance of running your code.

Jeff_Emanuel · October 7, 2021, 8:43pm

Does SPRINT.Utils.decode_smer allocate? It looks like there are up to 20 calls each to push! and decode_smer per change_smer_digit call.

fcharih · October 7, 2021, 8:46pm

I ran my tests with --track-allocation=user and it appears that the bulk of the allocations occur on the lines where I push! to my Set, so it does seem as though SPRINT.Utils.decode_smer allocates.

Jeff_Emanuel · October 7, 2021, 8:48pm

push! itself can allocate

fcharih · October 7, 2021, 8:49pm

To be honest, I’m unclear as to how Julia allocates. Are there any resources that cover that? I Googled it, but didn’t manage to find anything that clarified it for me.

Jeff_Emanuel · October 7, 2021, 8:56pm

Try calling sizehint! on your Set if you know how many elements you expect it to grow to.

fcharih · October 7, 2021, 9:14pm

That didn’t make a big difference, but thank you.

Oh well, I give up! Not super important anyway! It’s just surprising that equivalent code would generate such different times of execution.

giordano · October 7, 2021, 10:00pm

Not really, what is idiomatic in a language may not be in a different language which has different properties. Verbatim translations are unlikely to be fair for both of them.

Providing a minimal reproducible example may help others help you, otherwise we go back to the

fcharih · October 7, 2021, 10:22pm

Fair point… I was not expecting others to debug my code, but merely to see whether there was something obviously wrong. Thank you though!

giordano · October 7, 2021, 10:31pm

You may be surprised to know that topics titled “my code in Julia is x times slower than with Language Y” are those that attract most eyes

Sukera · October 7, 2021, 10:35pm

I’m quite sure that, at least semantically, it’s not identical - otherwise you wouldn’t observe a difference It can be a little subtle sometimes when translating, so that’s why I asked. I’m just a casual rust user, but I think I’ve got enough context from my experiments with it and experience with C to figure out hoe to translate Rust to julia.

I should probably make a doc PR at some point, but the rule of thumb I’m going by is that type instabilities, access to global variables, slicing of arrays (without @view), string joining/creation and growing of arrays all either allocate or can cause allocations down the road. If the majority of time is spent in push!, it suggests that either you’re growing the array a lot of times or not enough work is done otherwise to offset the growth heuristic. Julia uses doubling of memory when growing a vector with full capacity, such that the cost amortizes to O(1) when growing (it’s a common tactic, as far as I know).

As has been mentioned above, if it’s really equivalent semantically it should have the same or at least very similar runtime - if it doesn’t and the code is as close as can be, it’s either a bug or a missing feature in julia. For that comparison, we’d need to be able to run the code ourselves and compare to the rust version though

jling · October 8, 2021, 12:04am

please at least give the community a chance of doing free labor for you? You posted a question “why this Julia code is much slower than Rust equivalent” without 1) runnable code 2) Rust code

Topic		Replies	Views
Help to get my slow Julia code to run as fast as Rust/Java/Lisp Performance	100	4581	August 6, 2021
Rust / Julia Comparison Post Performance rust	56	5173	February 1, 2022
Funny Benchmark with Julia (no longer) at the bottom Performance benchmark	149	5991	November 4, 2023
Trivial port from C++/Rust to Julia Performance performance , benchmark , loopvectorization	51	3169	December 28, 2023
Julia slower than Python to sort and reverse a list of integers Performance	40	2571	April 28, 2023

Code slower than Rust equivalent

Related topics