Finding the memory allocation in some code

performance

#1

Hi,

I’ve been trying to find where the memory allocations are in my code. I have this function in a package, which I don’t expect to allocate, because I avoid arrays and use immutable types:

const CDN = Union{BioSequences.DNACodon, BioSequences.RNACodon}
const DEFAULT_TRANS = BioSequences.ncbi_trans_table[1]
const CDN_POS_MASKS = (0xFFFFFFFFFFFFFFFC, 0xFFFFFFFFFFFFFFF3, 0xFFFFFFFFFFFFFFCF)

function expected_NG86(codon::C, k::Float64, code::GeneticCode) where {C <: CDN}
    cdn_bits = UInt64(codon)
    aa = code[codon]
    S = N = 0.0
    for (pos, msk) in enumerate(CDN_POS_MASKS)
        @inbounds for base in 0:3
            # Create the neighbor codon.
            ins = base << (2 * (pos - 1))
            neighbor = C((cdn_bits & msk) | ins)
            if codon == neighbor # Codon created is not a neighbor: should happen 3 times.
                continue
            end
            # See if the mutation is transition or transversion.
            cdn_purine = ispurine(codon[pos])
            neighbor_purine = ispurine(neighbor[pos])
            istransition = (cdn_purine && neighbor_purine) || (!cdn_purine && !neighbor_purine)
            # See if the protein changes between codon and neighbor, and update
            # N and S counts accordingly.
            inc = ifelse(istransition, 1.0, k)
            neighbor_aa = code[neighbor]
            if neighbor_aa == AA_Term
                N += inc
            elseif neighbor_aa == aa
                S += inc
            else
                N += inc
            end
        end
    end
    normalization = (N + S) / 3
    return (S / normalization), (N / normalization)
end

cdn = DNACodon(0x000000000000000) # a test codon to benchmark and look at allocation.

Both DNACodon and RNACodon are primitive 64-bit types. Here’s the full type definition:

primitive type Kmer{T<:NucleicAcid, K} <: Sequence 64 end

const DNAKmer{K} = Kmer{DNA, K}
const RNAKmer{K} = Kmer{RNA, K}
const DNACodon = DNAKmer{3}
const RNACodon = RNAKmer{3}

Benchmarking it with BenchmarkTools shows it is very fast, but reports 3 allocations:

julia> @benchmark GeneticVariation.expected_NG86($cdn, $1.0, $GeneticVariation.DEFAULT_TRANS)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     71.827 ns (0.00% GC)
  median time:      74.665 ns (0.00% GC)
  mean time:        83.535 ns (4.07% GC)
  maximum time:     1.955 μs (92.34% GC)
  --------------
  samples:          10000
  evals/sample:     972

Given this function might be run on 1000 such RNACodons or DNACodons, that’s 3000 allocations when, as far as I can tell, I shouldn’t need to allocate anything in my code: it should all be immutable values on the stack.

So I tracked allocations and produced a .mem file:

- function expected_NG86(codon::C, k::Float64, code::GeneticCode) where {C <: CDN}
   101471     cdn_bits = UInt64(codon)
        0     aa = code[codon]
        0     S = N = 0.0
        0     for (pos, msk) in enumerate(CDN_POS_MASKS)
        0         @inbounds for base in 0:3
        -             # Create the neighbor codon.
        0             ins = base << (2 * (pos - 1))
        0             neighbor = C((cdn_bits & msk) | ins)
        0             if codon == neighbor # Codon created is not a neighbor: should happen 3 times.
        0                 continue
        -             end
        -             # See if the mutation is transition or transversion.
        0             cdn_purine = ispurine(codon[pos])
        0             neighbor_purine = ispurine(neighbor[pos])
        0             istransition = (cdn_purine && neighbor_purine) || (!cdn_purine && !neighbor_purine)
        -             # See if the protein changes between codon and neighbor, and update
        -             # N and S counts accordingly.
        0             inc = ifelse(istransition, 1.0, k)
        0             neighbor_aa = code[neighbor]
        0             if neighbor_aa == AA_Term
        0                 N += inc
        0             elseif neighbor_aa == aa
        0                 S += inc
        -             else
        0                 N += inc
        -             end
        -         end
        -     end
        0     normalization = (N + S) / 3
        0     return (S / normalization), (N / normalization)
        - end

So there are no allocations, except on the first line, where I convert the RNACodon or DNACodon type to a UInt64. This falls back to the convert method, which is defined as:

Base.convert(::Type{UInt64}, x::DNAKmer{K}) where {K} = reinterpret(UInt64, x)
Base.convert(::Type{UInt64}, x::RNAKmer{K}) where {K} = reinterpret(UInt64, x)
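For reference, here’s a self-contained sketch of what I mean (using a stand-in 64-bit primitive type, MyCodon, rather than the real BioSequences Kmer), confirming that reinterpret between two 64-bit bits types is free once compiled:

```julia
# Stand-in for a 64-bit codon type; the real Kmer is also `primitive ... 64`.
primitive type MyCodon 64 end

to_uint(x::MyCodon) = reinterpret(UInt64, x)
from_uint(u::UInt64) = reinterpret(MyCodon, u)

roundtrip(u::UInt64) = to_uint(from_uint(u))

roundtrip(UInt64(42))                            # warm up / compile
@assert roundtrip(UInt64(42)) === UInt64(42)     # the bits come back unchanged
@assert @allocated(roundtrip(UInt64(42))) == 0   # the reinterpret is a no-op
```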

So all I’m doing is reinterpreting one 64-bit bits type as a UInt64, and yet I see a memory allocation? I’m not sure why this is, because if I do:

julia> @benchmark UInt64($cdn)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.598 ns (0.00% GC)
  median time:      1.898 ns (0.00% GC)
  mean time:        1.911 ns (0.00% GC)
  maximum time:     22.747 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

0 allocations are reported. But if I do:

julia> @time UInt64(cdn)
  0.000004 seconds (4 allocations: 160 bytes)
0x0000000000000000

It tells me 4 allocations.

So I’m baffled as to why this one line in my otherwise non-allocating function, which simply reinterprets a 64-bit value, appears to allocate according to the .mem file.

This function used to use lots of short-lived small arrays and did an awful amount of allocation that really slowed it down, so I’m pleased I’ve got it this far; I’m just trying to figure out why the remaining allocation is there!


#2

In my experience, allocations attributed to the first line of a function are not related to that actual line, but to the calling of the function itself. Have you tried minimizing your example, i.e. removing as much code as you can while still reproducing the problem?
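A minimal sketch of the kind of calling overhead I mean (stand-in names, not your real codon types): a non-const global argument has no concrete type at the call site, so the value can get boxed on each call, whereas a const global (or a value interpolated into @benchmark with $) does not:

```julia
nonconst_x = 0x0000000000000001        # type-unstable global
const const_x = 0x0000000000000001     # type-stable global

to_bits(x::UInt64) = reinterpret(UInt64, x)

to_bits(const_x)                       # warm up / compile
@assert @allocated(to_bits(const_x)) == 0
# `@time to_bits(nonconst_x)` at the REPL may report a few small
# allocations purely from boxing the untyped global, not from to_bits.
```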


#3

Since I may want to reuse some of the codon manipulation, I’ve moved it into a separate minimal function, again with the reinterpretation to a UInt64 as the first line:

function replace_base(codon::C, position::Int, base::UInt64) where {C <: CDN}
    bits = reinterpret(UInt64, codon)
    msk = CDN_POS_MASKS[position]
    ins = base << (2 * (position - 1))
    newcdn = C((bits & msk) | ins)
    return newcdn
end

This time BenchmarkTools tells me no memory allocation occurs:

julia> @benchmark GeneticVariation.replace_base($cdn, $1, $0x0000000000000001)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.094 ns (0.00% GC)
  median time:      3.489 ns (0.00% GC)
  mean time:        3.595 ns (0.00% GC)
  maximum time:     28.471 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

I also made a smaller version of the expected_NG86 function above, taking out more than half the loop body and only doing the UInt64 conversion and a bit of counting. I still get those 3 allocations according to BenchmarkTools:

function expected_NG86_min(codon::C, k::Float64, code::GeneticCode) where {C <: CDN}
    cdn_bits = UInt64(codon)
    S = N = 0.0
    for (pos, msk) in enumerate(CDN_POS_MASKS)
        @inbounds for base in 0:3
            # Create the neighbor codon.
            ins = base << (2 * (pos - 1))
            neighbor = C((cdn_bits & msk) | ins)
            S += 1.0
            N += 1.0
        end
    end
    normalization = (N + S) / 3
    return (S / normalization), (N / normalization)
end
julia> @benchmark GeneticVariation.expected_NG86_min($cdn, $1.0, $GeneticVariation.DEFAULT_TRANS)
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  3
  --------------
  minimum time:     58.451 ns (0.00% GC)
  median time:      59.382 ns (0.00% GC)
  mean time:        68.993 ns (8.32% GC)
  maximum time:     3.078 μs (96.53% GC)
  --------------
  samples:          10000
  evals/sample:     981

And in the .mem file, very similar allocation values are assigned to the first line of both functions, even though BenchmarkTools tells me replace_base did not allocate any memory! So I think you may be right that it’s some kind of calling overhead.

  - function replace_base(codon::C, position::Int, base::UInt64) where {C <: CDN}
    48496     bits = reinterpret(UInt64, codon)
        0     msk = CDN_POS_MASKS[position]
        0     ins = base << (2 * (position - 1))
        0     newcdn = C((bits & msk) | ins)
        0     return newcdn
        - end
        - 
        - function expected_NG86_min(codon::C, k::Float64, code::GeneticCode) where {C <: CDN}
    48272     cdn_bits = UInt64(codon)
        0     S = N = 0.0
        0     for (pos, msk) in enumerate(CDN_POS_MASKS)
        0         @inbounds for base in 0:3
        -             # Create the neighbor codon.
        0             ins = base << (2 * (pos - 1))
        0             neighbor = C((cdn_bits & msk) | ins)
        0             S += 1.0
        0             N += 1.0
        -         end
        -     end
        0     normalization = (N + S) / 3
        0     return (S / normalization), (N / normalization)
        - end

#4

I believe $GeneticVariation.DEFAULT_TRANS should be $(GeneticVariation.DEFAULT_TRANS). You’re interpolating the module, but not the constant, so the DEFAULT_TRANS lookup still happens inside the benchmark loop. Making this change makes the allocations go away in both functions.

My test code:

using BioSequences
using BioSequences.GeneticCode
using GeneticVariation
using BenchmarkTools

const CDN = Union{BioSequences.DNACodon, BioSequences.RNACodon}
const DEFAULT_TRANS = BioSequences.ncbi_trans_table[1]
const CDN_POS_MASKS = (0xFFFFFFFFFFFFFFFC, 0xFFFFFFFFFFFFFFF3, 0xFFFFFFFFFFFFFFCF)


function expected_NG86_min(codon::C, k::Float64, code::GeneticCode) where {C <: CDN}
    cdn_bits = UInt64(codon)
    S = N = 0.0
    for (pos, msk) in enumerate(CDN_POS_MASKS)
        @inbounds for base in 0:3
            # Create the neighbor codon.
            ins = base << (2 * (pos - 1))
            neighbor = C((cdn_bits & msk) | ins)
            S += 1.0
            N += 1.0
        end
    end
    normalization = (N + S) / 3
    return (S / normalization), (N / normalization)
end

cdn = DNACodon(0x000000000000000) # a test codon to benchmark and look at allocation.

@benchmark expected_NG86_min($cdn, 1.0, $(GeneticVariation.DEFAULT_TRANS))
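The precedence behaviour behind this can be seen directly in the parser: $ binds tighter than field access, so $A.B splices only A and leaves the .B lookup to run on every benchmark evaluation, while $(A.B) splices the resolved constant (A and B here are just placeholder names):

```julia
ex1 = Meta.parse("\$A.B")
@assert ex1.head === Symbol(".")                 # parsed as a field access...
@assert ex1.args[1] == Expr(Symbol("\$"), :A)    # ...where only `A` is spliced
@assert ex1.args[2] === QuoteNode(:B)            # `.B` is evaluated at run time

ex2 = Meta.parse("\$(A.B)")
@assert ex2 == Expr(Symbol("\$"), Expr(Symbol("."), :A, QuoteNode(:B)))
```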