I don't understand allocations

I was trying to investigate why write(file, hton(Float32(somereal))) was allocating, and I came up with this simple test, run with --track-allocation=user, whose results left me more puzzled than before.

        - 
        - 
        - function writematton(file::IOStream, m::AbstractMatrix)
        -     for j in 1:size(m)[2]
        -         for i in 1:size(m)[1]
        -             write(file, hton(Float32(m[i,j])))
        -         end
        -     end
        - end
        - 
        - function writematf32(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     1600             write(file, Float32(m[i,j]))
        0         end
        0     end
        - end
        - 
        - function writemat(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     3200             write(file, m[i,j])
        0         end
        0     end
        - end
        - 
        - a=rand(10,10)
        - fileton = open("testton.dat", "w")
        - writemat(fileton, a)
        - filef32 = open("testf32.dat", "w")
        - writematf32(filef32,a)
        - file = open("test.dat", "w")
        - writemat(file, a)
        - 

I believe write is allocating a String for the output of the Float32.

I thought that write wrote the canonical binary representation…
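A quick check confirms that expectation (a minimal sketch, using an IOBuffer instead of a file): write on a Float32 does emit the raw 4-byte binary representation, in host byte order; hton only byte-swaps the value before writing.

```julia
# Sketch: write(io, x::Float32) emits the canonical 4-byte binary
# representation of the value (no text conversion involved).
io = IOBuffer()
n = write(io, 1.0f0)     # returns the number of bytes written: 4
bytes = take!(io)
# Reinterpreting the 4 bytes recovers the original value.
value = reinterpret(Float32, bytes)[1]
```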

I was “more” puzzled because I had made a silly mistake (I forgot to save); this was the intended test:

        - 
        - 
        - function writematton(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     1600             write(file, hton(Float32(m[i,j])))
        0         end
        0     end
        - end
        - 
        - function writematf32(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     1600             write(file, Float32(m[i,j]))
        0         end
        0     end
        - end
        - 
        - function writemat(file::IOStream, m::AbstractMatrix)
        0     for j in 1:size(m)[2]
        0         for i in 1:size(m)[1]
     1600             write(file, m[i,j])
        0         end
        0     end
        - end
        - 
        - a=rand(10,10)
        - fileton = open("testton.dat", "w")
        - writematton(fileton, a)
        - filef32 = open("testf32.dat", "w")
        - writematf32(filef32,a)
        - file = open("test.dat", "w")
        - writemat(file, a)
        - 

Now it is clear to me that the allocation is write’s fault.

I think it would be more technically correct to say that it’s allocating a Vector{UInt8} (and arguably shouldn’t be). If I understand what’s happening here correctly, we are allocating for storage of the 4-byte value.

I see; however, I can’t help noticing that this “buffer” is completely redundant.

It is not the first time I have found myself struggling with more allocations than expected, e.g. array broadcasting that made me turn everything into loops (and ask what the purpose is of a fancy synthetic vector notation that penalizes performance).

It is a pity as the language has tremendous potential!

No, it’s not that.

We’re allocating a Ref (which is 16 bytes, so the 100 elements totalling 1600 bytes make sense here), since the underlying C API expects a pointer, and taking a stack pointer is not really always legal for Julia, especially when FFI is involved. We’re very conservative so that no pointers escape by accident.

Broadcasting itself (when broadcasting into an existing array) definitely should not allocate extra stuff; that’s its whole raison d’être! If you have something that shouldn’t allocate but does, please do ask here or open an issue on GitHub with a reproducible example, as that’s a bug that should be fixed.
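To illustrate that point, here is a minimal sketch (update! and measure are made-up names for illustration): once the function is compiled, broadcasting with .= into a preallocated array allocates nothing.

```julia
# Sketch: in-place, fused broadcasting into an existing array
# should report zero bytes allocated after compilation.
update!(y, x) = (y .= 2 .* x .+ 1; nothing)

function measure()
    x = rand(100)
    y = similar(x)
    update!(y, x)              # warm up (compile)
    @allocated update!(y, x)   # bytes allocated by the second call
end
```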

So I understand that for the write there’s nothing to do.

Regarding my struggles with the vector notation (I’m not sure I used the term “broadcasting” correctly; isn’t it the . notation, as in .=, .+=, abs.( )?): in the function below you can see a lot of short loops over ii that were written in order to avoid the allocations I was getting with vector notation. I also tried to use StaticArrays, which worked in some cases. The whole experience has been “a nightmare” for me, and that makes me angry because I really like Julia a lot.

(Btw, that way I was able, mostly thanks to a clever algorithm in another function, to reach 2.4x the speed of C.)

function read_xtc_atoms(file::XtcFile, frame::Integer, coords::AbstractMatrix{T}) where {T <: Real}
    local smallnum::Int
    local smaller::Int
    minint = MVector{3, Int32}(undef)
    maxint = MVector{3, Int32}(undef)
    thiscoord = MVector{3, Int}(undef)
    prevcoord = MVector{3, Int}(undef)
    # magicints = file.magicints
    FIRSTINDEX = 9
    seek(file.file, file.offsets[frame] + 36) # we don't read box here
    size = ntoh(read(file.file, Int32))
    if size <= 9
        for j in 1:3
            for i in 1:size
                coords[i, j] = ntoh(read(file.file, Float32))
            end
        end
    else
        precision = ntoh(read(file.file, Float32))
        read!(file.file, minint)
        minint .= ntoh.(minint)
        read!(file.file, maxint)
        maxint .= ntoh.(maxint)
        small_idx = ntoh(read(file.file, Int32))
        smallidx = Int(small_idx)
        smaller = magicints[max(FIRSTINDEX, smallidx - 1) + 1] Ă· 2 
        smallnum = magicints[smallidx + 1] Ă· 2 
        sizesmall = SA[magicints[smallidx + 1], magicints[smallidx + 1], magicints[smallidx + 1]]
        nbytes = ntoh(read(file.file, Int32))
        buffer = BitBuffer(nbytes)
        readbytes!(file.file, buffer.bits, nbytes)
        sizeint = SA[maxint[1] - minint[1] + 1,
                     maxint[2] - minint[2] + 1,
                     maxint[3] - minint[3] + 1]
        if any( sizeint .> 2^24-1)
            bitsizeint = sizeofint.(sizeint)
            bitsize = 0
        else    
            bitsize = sizeofints(sizeint)
        end
        i = 1 # atom index
        run = 0
        while i <= file.natoms
            if bitsize == 0
                for ii in 1:3
                    thiscoord[ii] = receivebits!(buffer, bitsizeint[ii])
                end
            else
                receiveints!(buffer, bitsize, sizeint, thiscoord)
            end
            for ii in 1:3
                thiscoord[ii] += minint[ii]
            end
            flag = receivebits!(buffer, 1)
            is_smaller = 0
            if flag == 1
                run = receivebits!(buffer, 5)
                is_smaller = run % 3
                run -= is_smaller
                is_smaller -= 1
            end
            if run == 0
                for ii in 1:3
                    coords[ii, i] = thiscoord[ii] / precision 
                end
                i += 1
            else
                for ii in 1:3
                    prevcoord[ii] = thiscoord[ii]
                end
                for k in 1:3:run
                    receiveints!(buffer, smallidx, sizesmall, thiscoord)
                    for ii in 1:3
                        thiscoord[ii] += prevcoord[ii] - smallnum # smallnum ??? WTF!!
                    end
                    if k == 1 # exchange first with second atom WTF!!
                        thiscoord, prevcoord = prevcoord, thiscoord
                        for ii in 1:3
                            coords[ii,i] = prevcoord[ii] / precision
                        end
                        i += 1
                    else
                        for ii in 1:3
                            prevcoord[ii] = thiscoord[ii]
                        end
                    end
                    for ii in 1:3
                        coords[ii,i] = thiscoord[ii] / precision
                    end
                    i += 1
                end
            end
            smallidx += is_smaller;
            if is_smaller < 0
                smallnum = smaller
                if smallidx > FIRSTINDEX 
                    smaller = magicints[smallidx] >>> 1
                else
                    smaller = 0
                end
            elseif is_smaller > 0
                smaller = smallnum
                smallnum = magicints[smallidx + 1] >>> 1
            end
            sizesmall = SA[magicints[smallidx + 1], magicints[smallidx + 1], magicints[smallidx + 1]]
        end
    end
    return nothing      
end

Not sure why you’re using ntoh all over the place (is the endianness of your input data indeterminate?), but without knowing what e.g. receivebits! does, it’s hard to say why it is necessary and can’t be written as directly writing to thiscoord instead.

I’m assuming you wrote this like thiscoord .+= minint before? This should not have allocated.

Same goes for this: coords[:, i] .= thiscoord ./ precision should not allocate.

prevcoord .= thiscoord

thiscoord .+= prevcoord .- smallnum


and similar for the other loops. Just from the code you posted I don’t see any reason why broadcasting should need to allocate anything; I suspect the problem was somewhere else. Hard to say without seeing the initial code.

I’m guessing the switch to StaticArrays was done to avoid allocating an array in this function? Why not use a tuple here? Since you’re not using any linear algebra and always seem to operate either on whole arrays or only read from them, tuples should be just as good, without pulling in a dependency.

Well, not “nothing”, but it’s just tricky to solve directly. Since you have your own loop around write, though, you could create a Ref in your outer loop, pass it to unsafe_write directly, and reuse it for subsequent calls in that function.
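A sketch of that suggestion (writematton_ref is a made-up name; this assumes the stream supports unsafe_write, as IOStream and IOBuffer do):

```julia
# Sketch: allocate one Ref outside the loop and reuse it for every element,
# instead of letting each write(file, x) allocate its own Ref internally.
function writematton_ref(file::IO, m::AbstractMatrix)
    ref = Ref{Float32}()
    for j in axes(m, 2), i in axes(m, 1)
        ref[] = hton(Float32(m[i, j]))
        # GC.@preserve keeps ref alive while its raw pointer is in use,
        # which is exactly the pointer-escape concern mentioned above.
        GC.@preserve ref unsafe_write(file,
            Base.unsafe_convert(Ptr{Float32}, ref), sizeof(Float32))
    end
    return nothing
end
```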

I don’t even know what to say: I put in your modifications and it doesn’t allocate anymore. Most of the code was exactly as you suggested (maybe I used prevcoord = thiscoord instead of .=, which still doesn’t allocate, I just tried). BTW, is there a difference?

Maybe at the time when I switched to explicit loops the surroundings were much different, but I really can’t remember.

Anyway, good to know, and thank you for your time!

Random notes:
ntoh is required because the file format is based on the XDR library, whose purpose is to be endianness-independent.

The loop in the write is required because of “row major storage”, and transposing in place, as I understand it, means reallocating the array. Still, it is probably better to allocate once; when the writing function is ready I’ll measure whether it makes any difference.

I forgot about the tuples: it depends on the arguments expected by the other functions called.

It all started with Vector{UInt32}, mimicking the C code; then I grokked AbstractWhatever, so the functions now expect AbstractVectors. Can I pass tuples instead? Are tuples type-homogeneous?

The dependency would still be there, as it is the same one used for the four MVectors allocated at the beginning. Are they worth something, or can I refactor to Tuples & regular Vectors?

.= writes into the existing object on the left hand side of the assignment, while = just rebinds the variable name (called a binding) to the object that results on the right hand side. So .= in some sense requires mutability.
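A tiny sketch of the difference:

```julia
a = [1, 2, 3]
b = a             # `=` rebinds: b and a are now the same array
c = zeros(Int, 3)
c .= a            # `.=` copies element-wise into the existing array c

a[1] = 99
# b sees the mutation (same object); c does not (separate storage)
(b[1], c[1])      # (99, 1)
```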

You only seem to have Vectors, not matrices - how does transposing make a difference here? A Vector in julia is always a one dimensional storage. It’s not the same as a nx1 matrix, which has two dimensions (albeit one of them being only one wide).
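For example:

```julia
v = [1, 2, 3]          # Vector{Int}: one dimension
m = reshape(v, 3, 1)   # 3x1 Matrix{Int}: two dimensions, same data
(ndims(v), size(v))    # (1, (3,))
(ndims(m), size(m))    # (2, (3, 1))
```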

The final coords can (and should, depending on what you use this for) stay as AbstractVector{T}, I was more talking about these MVectors:

If you move those to tuples, you wouldn’t .= those anymore, since tuples are immutable. So you’d use regular assignment, but the operations on the right hand side of those assignments you should still broadcast.

Tuples are not homogeneous, no; they store the element type of each element in their type. It shouldn’t make any difference here, since the element types are all well defined, right?
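For example, the element types show up directly in the tuple's type, and an all-same-type tuple is still just a Tuple whose parameters happen to coincide:

```julia
t = (1, 2.0, "x")           # each element's type is part of the tuple's type
mixed = typeof(t)           # Tuple{Int64, Float64, String}
h = (1, 2, 3)
homog = typeof(h)           # NTuple{3, Int}, an alias for Tuple{Int, Int, Int}
inline = isbitstype(homog)  # true: such a tuple is stored inline, no boxing
```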

So rebinding would mean that if I mutate thiscoord after the assignment the value of prevcoord gets mutated too, correct?

The loop on write is for a part of the file record not read in the shown part of the code (the + 36 skipped at the beginning); these are three direction vectors stored as a 3x3 matrix.

About static vectors I was referring to

sizesmall = SA[magicints[smallidx + 1], magicints[smallidx + 1], magicints[smallidx + 1]]

which appears twice. (Also, here I had to unpack components, is there a vector notation for that?)

It would make a difference if the values in the tuple were stored as boxed types due to inhomogeneity. How would a homogeneous tuple be stored? Memory-contiguous, like an array?

Only if the object is mutable, else you can’t mutate at all, but yes.

If your code is type stable, no boxing will occur. In general, yes, tuples are memory-contiguous, but since they’re immutable values that most of the time end up in registers, it’s not really relevant.

I tried to get rid of MVectors to remove the dependency from StaticArrays

    # minint = MVector{3, Int32}(undef)
    # maxint = MVector{3, Int32}(undef)
    # thiscoord = MVector{3, Int}(undef)
    # prevcoord = MVector{3, Int}(undef)
    minint = Vector{Int32}(undef, 3)
    maxint = Vector{Int32}(undef, 3)
    thiscoord = Vector{Int}(undef, 3)
    prevcoord = Vector{Int}(undef, 3)

I get

9.893195 seconds (72.48 k allocations: 3.058 GiB, 3.17% gc time, 0.06% compilation time)

vs.

7.598854 seconds (52.48 k allocations: 3.055 GiB, 3.70% gc time, 0.05% compilation time)

that’s a huge difference!

Which of those is which, and what did you benchmark exactly?

Seems like you also have another bottleneck causing more allocations than necessary.

It is a 10k-iteration run (time frames) of that function, reading a file of about 700 MB of compressed data and expanding it to 2.3 GB of Float32. I’m pretty sure this is the only difference between the runs.