I don't understand allocations

I was trying to investigate why write(file, hton(Float32(somereal))) was allocating, and I came up with this simple test, run with --track-allocation=user, whose results left me more puzzled than before.

        - 
        - 
        - function writematton(file::IOStream, m::AbstractMatrix)
        -     for j in 1:size(m)[2]
        -         for i in 1:size(m)[1]
        -             write(file, hton(Float32(m[i,j])))
        -         end
        -     end
        - end
        - 
        - function writematf32(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     1600             write(file, Float32(m[i,j]))
        0         end
        0     end
        - end
        - 
        - function writemat(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     3200             write(file, m[i,j])
        0         end
        0     end
        - end
        - 
        - a=rand(10,10)
        - fileton = open("testton.dat", "w")
        - writemat(fileton, a)
        - filef32 = open("testf32.dat", "w")
        - writematf32(filef32,a)
        - file = open("test.dat", "w")
        - writemat(file, a)
        - 

I believe write is allocating a String for the output of the Float32.

I thought that write wrote the canonical binary representation…
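A quick check confirms that expectation (a minimal sketch, using an IOBuffer instead of a file): write on a Float32 does emit the raw 4-byte binary representation, in host byte order; hton only byte-swaps the value before writing.

```julia
# Sketch: write(io, x::Float32) emits the canonical 4-byte binary
# representation of the value (no text conversion involved).
io = IOBuffer()
n = write(io, 1.0f0)     # returns the number of bytes written: 4
bytes = take!(io)
# Reinterpreting the 4 bytes recovers the original value.
value = reinterpret(Float32, bytes)[1]
```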

I was “more” puzzled because I had made a silly mistake (I forgot to save); this was the intended test:

        - 
        - 
        - function writematton(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     1600             write(file, hton(Float32(m[i,j])))
        0         end
        0     end
        - end
        - 
        - function writematf32(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     1600             write(file, Float32(m[i,j]))
        0         end
        0     end
        - end
        - 
        - function writemat(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     1600             write(file, m[i,j])
        0         end
        0     end
        - end
        - 
        - a=rand(10,10)
        - fileton = open("testton.dat", "w")
        - writematton(fileton, a)
        - filef32 = open("testf32.dat", "w")
        - writematf32(filef32,a)
        - file = open("test.dat", "w")
        - writemat(file, a)
        - 

Now it is clear to me that the allocation is write’s fault.

I think it would be more technically correct to say that it’s allocating a Vector{UInt8} (and arguably shouldn’t be). If I understand what’s happening here correctly, we are allocating for storage of the 4-byte value.

I see; however, I can’t help noticing that this “buffer” is completely redundant.

It is not the first time I have found myself struggling with more allocations than expected, e.g. array broadcasting that made me turn everything into loops (and ask what the purpose is of a fancy synthetic vector notation that penalizes performance).

It is a pity as the language has tremendous potential!

No, it’s not that.

We’re allocating a Ref (which is 16 bytes, so the 100 elements totalling 1600 bytes make sense here), since the underlying C API expects a pointer, and taking a stack pointer is not really always legal for Julia, especially when FFI is involved. We’re very conservative so that no pointers escape by accident.

Broadcasting itself (when broadcasting into an existing array) definitely should not allocate extra stuff; that’s its whole raison d’être! If you have something that shouldn’t allocate but does, please do ask here or open an issue on GitHub with a reproducible example, as that’s a bug that should be fixed.
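To illustrate that point, here is a minimal sketch (update! and measure are made-up names for illustration): once the function is compiled, broadcasting with .= into a preallocated array allocates nothing.

```julia
# Sketch: in-place, fused broadcasting into an existing array
# should report zero bytes allocated after compilation.
update!(y, x) = (y .= 2 .* x .+ 1; nothing)

function measure()
    x = rand(100)
    y = similar(x)
    update!(y, x)              # warm up (compile)
    @allocated update!(y, x)   # bytes allocated by the second call
end
```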

So I understand that for the write there’s nothing to do.

Regarding my struggles with the vector notation (I’m not sure I used the term “broadcasting” correctly; isn’t it the . notation, as in .=, .+=, abs.( )?): in the function below you can see a lot of short loops over ii that were written in order to avoid the allocations I was getting with vector notation. I also tried to use StaticArrays, which worked in some cases. The whole experience has been “a nightmare” for me, and that makes me angry because I really like Julia a lot.

(Btw, that way I was able, mostly thanks to a clever algorithm in another function, to reach 2.4x the speed of C.)

function read_xtc_atoms(file::XtcFile, frame::Integer, coords::AbstractMatrix{T}) where {T <: Real}
    local smallnum::Int
    local smaller::Int
    minint = MVector{3, Int32}(undef)
    maxint = MVector{3, Int32}(undef)
    thiscoord = MVector{3, Int}(undef)
    prevcoord = MVector{3, Int}(undef)
    # magicints = file.magicints
    FIRSTINDEX = 9
    seek(file.file, file.offsets[frame] + 36) # we don't read box here
    size = ntoh(read(file.file, Int32))
    if size <= 9
        for j in 1:3
            for i in 1:size
                coords[i, j] = ntoh(read(file.file, Float32))
            end
        end
    else
        precision = ntoh(read(file.file, Float32))
        read!(file.file, minint)
        minint .= ntoh.(minint)
        read!(file.file, maxint)
        maxint .= ntoh.(maxint)
        small_idx = ntoh(read(file.file, Int32))
        smallidx = Int(small_idx)
        smaller = magicints[max(FIRSTINDEX, smallidx - 1) + 1] Ă· 2 
        smallnum = magicints[smallidx + 1] Ă· 2 
        sizesmall = SA[magicints[smallidx + 1], magicints[smallidx + 1], magicints[smallidx + 1]]
        nbytes = ntoh(read(file.file, Int32))
        buffer = BitBuffer(nbytes)
        readbytes!(file.file, buffer.bits, nbytes)
        sizeint = SA[maxint[1] - minint[1] + 1,
                     maxint[2] - minint[2] + 1,
                     maxint[3] - minint[3] + 1]
        if any( sizeint .> 2^24-1)
            bitsizeint = sizeofint.(sizeint)
            bitsize = 0
        else    
            bitsize = sizeofints(sizeint)
        end
        i = 1 # atom index
        run = 0
        while i <= file.natoms
            if bitsize == 0
                for ii in 1:3
                    thiscoord[ii] = receivebits!(buffer, bitsizeint[ii])
                end
            else
                receiveints!(buffer, bitsize, sizeint, thiscoord)
            end
            for ii in 1:3
                thiscoord[ii] += minint[ii]
            end
            flag = receivebits!(buffer, 1)
            is_smaller = 0
            if flag == 1
                run = receivebits!(buffer, 5)
                is_smaller = run % 3
                run -= is_smaller
                is_smaller -= 1
            end
            if run == 0
                for ii in 1:3
                    coords[ii, i] = thiscoord[ii] / precision 
                end
                i += 1
            else
                for ii in 1:3
                    prevcoord[ii] = thiscoord[ii]
                end
                for k in 1:3:run
                    receiveints!(buffer, smallidx, sizesmall, thiscoord)
                    for ii in 1:3
                        thiscoord[ii] += prevcoord[ii] - smallnum # smallnum ??? WTF!!
                    end
                    if k == 1 # exchange first with second atom WTF!!
                        thiscoord, prevcoord = prevcoord, thiscoord
                        for ii in 1:3
                            coords[ii,i] = prevcoord[ii] / precision
                        end
                        i += 1
                    else
                        for ii in 1:3
                            prevcoord[ii] = thiscoord[ii]
                        end
                    end
                    for ii in 1:3
                        coords[ii,i] = thiscoord[ii] / precision
                    end
                    i += 1
                end
            end
            smallidx += is_smaller;
            if is_smaller < 0
                smallnum = smaller
                if smallidx > FIRSTINDEX 
                    smaller = magicints[smallidx] >>> 1
                else
                    smaller = 0
                end
            elseif is_smaller > 0
                smaller = smallnum
                smallnum = magicints[smallidx + 1] >>> 1
            end
            sizesmall = SA[magicints[smallidx + 1], magicints[smallidx + 1], magicints[smallidx + 1]]
        end
    end
    return nothing      
end

Not sure why you’re using ntoh all over the place (is the endianness of your input data indeterminate?), but without knowing what e.g. receivebits! does, it’s hard to say why it is necessary and can’t be written as directly writing to thiscoord instead.

I’m assuming you wrote this like thiscoord .+= minint before? This should not have allocated.

Same goes for this: coords[:, i] .= thiscoord ./ precision should not allocate.

prevcoord .= thiscoord

thiscoord .+= prevcoord .- smallnum


and similar for the other loops. Just from the code you posted I don’t see any reason why broadcasting should need to allocate anything; I suspect the problem was somewhere else. Hard to say without seeing the initial code.

I’m guessing the switch to StaticArrays was done to avoid allocating an array in this function? Why not use a tuple here? Since you’re not using any linear algebra and always seem to operate either on whole arrays or only read from them, tuples should be just as good, without pulling in a dependency.

Well, not “nothing”, but it’s just tricky to solve directly. Since you have your own loop around write, though, you could create a Ref in your outer loop, pass it to unsafe_write directly, and reuse it for subsequent calls in that function.
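A sketch of that suggestion (writematton_ref is a made-up name; this assumes the stream supports unsafe_write, as IOStream and IOBuffer do):

```julia
# Sketch: allocate one Ref outside the loop and reuse it for every element,
# instead of letting each write(file, x) allocate its own Ref internally.
function writematton_ref(file::IO, m::AbstractMatrix)
    ref = Ref{Float32}()
    for j in axes(m, 2), i in axes(m, 1)
        ref[] = hton(Float32(m[i, j]))
        # GC.@preserve keeps ref alive while its raw pointer is in use,
        # which is exactly the pointer-escape concern mentioned above.
        GC.@preserve ref unsafe_write(file,
            Base.unsafe_convert(Ptr{Float32}, ref), sizeof(Float32))
    end
    return nothing
end
```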

I don’t even know what to say: I put in your modifications and it doesn’t allocate anymore. Most of the code was exactly as you suggested (maybe I used prevcoord = thiscoord instead of .=, which still doesn’t allocate, I just tried). BTW, is there a difference?

Maybe at the time when I switched to explicit loops the surroundings were much different, but I really can’t remember.

Anyway, good to know, and thank you for your time!

Random notes:
ntoh is required because the file format is based on the XDR library, whose purpose is to be endianness-independent.

The loop in the write is required because of “row major storage”, and transposing in place, as I understand it, means reallocating the array. Still, it is probably better to allocate once; when the writing function is ready I’ll measure whether it makes any difference.

I forgot about the tuples: it depends on the arguments expected by the other functions called.

It all started with Vector{UInt32}, mimicking the C code; then I grokked AbstractWhatever, so the functions now expect AbstractVectors. Can I pass tuples instead? Are tuples type-homogeneous?

The dependency would still be there, as it is the same one used for the four MVectors allocated at the beginning. Are they worth something, or can I refactor to Tuples & regular Vectors?

.= writes into the existing object on the left hand side of the assignment, while = just rebinds the variable name (called a binding) to the object that results on the right hand side. So .= in some sense requires mutability.
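A tiny sketch of the difference:

```julia
a = [1, 2, 3]
b = a             # `=` rebinds: b and a are now the same array
c = zeros(Int, 3)
c .= a            # `.=` copies element-wise into the existing array c

a[1] = 99
# b sees the mutation (same object); c does not (separate storage)
(b[1], c[1])      # (99, 1)
```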

You only seem to have Vectors, not matrices - how does transposing make a difference here? A Vector in julia is always a one dimensional storage. It’s not the same as a nx1 matrix, which has two dimensions (albeit one of them being only one wide).
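For example:

```julia
v = [1, 2, 3]          # Vector{Int}: one dimension
m = reshape(v, 3, 1)   # 3x1 Matrix{Int}: two dimensions, same data
(ndims(v), size(v))    # (1, (3,))
(ndims(m), size(m))    # (2, (3, 1))
```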

The final coords can (and should, depending on what you use this for) stay as AbstractVector{T}, I was more talking about these MVectors:

If you move those to tuples, you wouldn’t .= those anymore, since tuples are immutable. So you’d use regular assignment, but the operations on the right hand side of those assignments you should still broadcast.

Tuples are not homogeneous, no; they store the element type of each element in their type. It shouldn’t make any difference here, since the element types are all well defined, right?
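For example, the element types show up directly in the tuple's type, and an all-same-type tuple is still just a Tuple whose parameters happen to coincide:

```julia
t = (1, 2.0, "x")           # each element's type is part of the tuple's type
mixed = typeof(t)           # Tuple{Int64, Float64, String}
h = (1, 2, 3)
homog = typeof(h)           # NTuple{3, Int}, an alias for Tuple{Int, Int, Int}
inline = isbitstype(homog)  # true: such a tuple is stored inline, no boxing
```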

So rebinding would mean that if I mutate thiscoord after the assignment the value of prevcoord gets mutated too, correct?

The loop on write is for a part of the file record not read in the shown part of the code (the + 36 skipped at the beginning); these are three direction vectors stored as a 3x3 matrix.

About static vectors I was referring to

sizesmall = SA[magicints[smallidx + 1], magicints[smallidx + 1], magicints[smallidx + 1]]

which appears twice. (Also, here I had to unpack components, is there a vector notation for that?)

It would make a difference if the values in the tuple were stored as boxed types due to inhomogeneity. How would a homogeneous tuple be stored? Memory-contiguous, like an array?

Only if the object is mutable, else you can’t mutate at all, but yes.

If your code is type stable, no boxing will occur. In general, yes, tuples are memory-contiguous, but since they’re immutable values that most of the time end up in registers, it’s not really relevant.

I tried to get rid of MVectors to remove the dependency from StaticArrays

    # minint = MVector{3, Int32}(undef)
    # maxint = MVector{3, Int32}(undef)
    # thiscoord = MVector{3, Int}(undef)
    # prevcoord = MVector{3, Int}(undef)
    minint = Vector{Int32}(undef, 3)
    maxint = Vector{Int32}(undef, 3)
    thiscoord = Vector{Int}(undef, 3)
    prevcoord = Vector{Int}(undef, 3)

I get

9.893195 seconds (72.48 k allocations: 3.058 GiB, 3.17% gc time, 0.06% compilation time)

vs.

7.598854 seconds (52.48 k allocations: 3.055 GiB, 3.70% gc time, 0.05% compilation time)

that’s a huge difference!

Which of those is which, and what did you benchmark exactly?

Seems like you also have another bottleneck causing more allocations than necessary.

It is a 10k-iteration run (time frames) of that function, reading a file of about 700 MB of compressed data and expanding it to 2.3 GB of Float32. I’m pretty sure this is the only difference between the runs.