Why does explicit declaration of ::Int save 40% runtime?

I am relatively new to Julia and have written some scripts. My main issue is that my scripts always perform poorly, and I don’t understand why or what I am doing wrong.

Take this code as an example of where I am failing:

dimension = 2
typeof(dimension)
dimension2::Int = 2
typeof(dimension2)
testMatrix = ones(1500, dimension)
countMatrix = zeros(Int, 5, 5)

@elapsed for NN in 1:1000
    for k in eachindex(testMatrix[:, 1])
        coordMatrix= zeros(Int, dimension)
        # due to precision issues, we subtract a small value from the values
        for i in 1:dimension
            coordMatrix[i] = trunc(testMatrix[k, i] / 4.3) + 1
        end
        #nTuple = Tuple(coordMatrix)
        #countMatrix[coordMatrix[1], coordMatrix[2]] += 1
        #countMatrix[coordMatrix...] += 1
    end
end

The code as it is needs on my PC about 1.02 seconds to run.

But when I replace ‘dimension’ with ‘dimension2’ in the for k loop, the same code runs about 40% faster (about 0.608 seconds).

Why is that the case? typeof(dimension) and typeof(dimension2) show me “Int64” as the datatype in both cases.


Next problem:
when I keep “dimension” and uncomment the line
nTuple = Tuple(coordMatrix)
the runtime jumps from 1.02 to 1.50 seconds. But all this line does is create a tuple from a vector with just 2 entries.

I use Windows 10 (64-bit) and Julia 1.11.4. I get the same results whether I work in the REPL or in VS Code.

Have you looked at the performance tips in the manual? The first two explain what you are doing wrong.

Welcome to Julia.

You should check out the first performance tip. The issue is that you are benchmarking code in global scope. Since global variables can change at any time for any reason, the compiler has to emit code that spends most of its time checking what type each variable currently has and then figuring out what to do with it (rather than just doing the work).

Your ::Int annotation helps because it gives that global variable a constant type, so the compiler can at least do a little with it (and with the things that follow). But if you wrote your calculation inside a function, this would be unnecessary.
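
(As an aside, here is a minimal sketch of the other way around this, in case you really want to stay at top level: declaring the globals const also gives them a known type. The names DIM and TEST below are just placeholders for your variables.)

# sketch only: const globals have a fixed type (and value), so the compiler
# no longer has to look them up dynamically on every access inside the loop
const DIM = 2
const TEST = ones(1500, DIM)

@elapsed for NN in 1:1000
    for k in axes(TEST, 1)
        coordMatrix = zeros(Int, DIM)
        for i in 1:DIM
            coordMatrix[i] = trunc(Int, TEST[k, i] / 4.3) + 1
        end
    end
end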

Less important, but consider using packages to benchmark your code. Benchmarking packages include BenchmarkTools.jl and Chairmarks.jl.

So your function might look something like

function countstuff(testMatrix)
    dimension = size(testMatrix, 2)
    countMatrix = zeros(Int, 5, 5)
    coordMatrix = zeros(Int, dimension)
    for k in axes(testMatrix, 1)
        for i in 1:dimension
            coordMatrix[i] = trunc(Int, testMatrix[k, i] / 4.3) + 1
        end
        countMatrix[coordMatrix...] += 1
    end
    return countMatrix
end

But, written this way, the indexing with coordMatrix is very expensive. Measuring with BenchmarkTools.jl:

julia> using BenchmarkTools

julia> testMatrix = ones(1500, 2);

julia> @btime countstuff($testMatrix)
  175.268 μs (5970 allocations: 117.00 KiB)
5×5 Matrix{Int64}:
 1500  0  0  0  0
    0  0  0  0  0
    0  0  0  0  0
    0  0  0  0  0
    0  0  0  0  0

Using Tuple here is tricky if I’m guessing your intended use correctly and if you care about performance. The easiest way is the somewhat advanced technique of moving values into the type domain via Val. Here’s an example of that:

function countstuff2(testMatrix, ::Val{dimension}) where dimension
    if size(testMatrix, 2) != dimension 
        error("dimension should match size(testMatrix, 2)")
    end
    countMatrix = zeros(Int, ntuple(i -> 5, Val(dimension))) # may want to adjust size here
    for k in axes(testMatrix, 1)
        coord = ntuple(i -> trunc(Int, testMatrix[k, i] / 4.3) + 1, Val(dimension))
        countMatrix[coord...] += 1
    end
    return countMatrix
end

julia> @btime countstuff2($testMatrix, Val(2))
  2.783 μs (2 allocations: 272 bytes)
5×5 Matrix{Int64}:
 1500  0  0  0  0
    0  0  0  0  0
    0  0  0  0  0
    0  0  0  0  0
    0  0  0  0  0

It would be easier if testMatrix wasn’t actually a matrix, but rather was a vector of tuples or something that encodes the dimension that way.
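
For example, here is a rough sketch of the vector-of-tuples version (assuming the same 5×5 grid and 4.3 divisor as above; countstuff3 is just a name I made up):

function countstuff3(points::Vector{NTuple{2,Float64}})
    countMatrix = zeros(Int, 5, 5)
    for p in points
        # map over a tuple returns a tuple, so coord is an NTuple{2,Int}
        coord = map(x -> trunc(Int, x / 4.3) + 1, p)
        countMatrix[coord...] += 1
    end
    return countMatrix
end

points = [(1.0, 1.0) for _ in 1:1500]  # same data as ones(1500, 2), one tuple per row
countstuff3(points)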

My code is actually inside a function. But the function is very slow, so I posted only an excerpt here, with just as much code as necessary.
Here is my complete function:

function createCountMatrix(dataMatrix, resolution::Int, minMatrix, maxMatrix,
                            dimensions::Int)
    # at first calculate the cell size for every dimension
    sizeMatrix= zeros(dimensions)
    for i in 1:dimensions
        sizeMatrix[i] = (maxMatrix[i] - minMatrix[i]) / resolution
    end
    # we need now a tensor in that has resolution entries for every dimension
    # one option would be to create it using a tuple:
    #   gridTuple::Tuple= ntuple(i->resolution, dimensions)
    #   countMatrix = zeros(Int, gridTuple)
    # but this is extremely slow. Therefore we use a vector:
    creationVector = zeros(Int, dimensions)
    for i in 1:dimensions
        creationVector[i] = resolution
    end
    countMatrix = zeros(Int, creationVector...)

    # create a matrix with coordinates in our grid
    # every of its columns will get the coordinate values for the particular data point
    coordMatrix= zeros(Int, dimensions)
    for k in eachindex(dataMatrix[:, 1])
        for i in 1:dimensions
            coordMatrix[i] = trunc((dataMatrix[k, i] - minMatrix[i]) /
                                        sizeMatrix[i]) + 1
        end
        countMatrix[coordMatrix...] += 1
    end
    return countMatrix
end

This function is the bottleneck of my whole program, and as it is I always get an
OutOfMemoryError()

When I replace

countMatrix = zeros(Int, creationVector...)

with

gridTuple::Tuple= ntuple(i->resolution, dimensions)
countMatrix = zeros(Int, gridTuple)

Then I get no error, but it is really slow.

Regarding the Int: I have now added the explicit ::Int to the function header, and this alone sped up the function by 40%.

OK, the OutOfMemoryError went away after a restart of Julia. I will now try out your approach with Val.

Here’s a pretty big speedup:

function createCountMatrix2(dataMatrix, resolution, minMatrix, maxMatrix, ::Val{dim}) where dim
    if !(length(minMatrix) == length(maxMatrix) == size(dataMatrix, 2) == dim)
        error("unexpected set of sizes")
    end
    sz = ntuple(i -> (maxMatrix[i] - minMatrix[i]) / resolution, Val(dim))
    countMatrix = zeros(Int, ntuple(i -> resolution, Val(dim)))
    for k in axes(dataMatrix, 1)
        coord = ntuple(i -> trunc(Int, (dataMatrix[k, i] - minMatrix[i]) / sz[i]) + 1, Val(dim))
        countMatrix[coord...] += 1
    end
    return countMatrix
end

julia> dataMat = rand(1000, 2);

julia> @btime createCountMatrix($dataMat, $10, $(zeros(2)), $(ones(2)), 2)
  121.242 μs (2010 allocations: 55.88 KiB)
10×10 Matrix{Int64}:
 13  10  13  15   9  10   6   9  13   7
  8  11   6   5   8  10  15   6  10  14
 17  13   7   6   7  19  10  12  11   5
  7   7   9  12  12  10  10   9  10  11
 13  13   8  15   7  15   6  10  10   8
  6  12  12   8   8   6  16   7   5  11
  9  12   7  12  10  15   6   6  11  12
  9  14   9   9  11  10  11   7  12   7
 15  16  15  11  11   7   5  10   8   7
 10   5  13   9  11  13  13  12   8   4

julia> @btime createCountMatrix2($dataMat, $10, $(zeros(2)), $(ones(2)), Val(2))
  1.737 μs (2 allocations: 944 bytes)
10×10 Matrix{Int64}:
 13  10  13  15   9  10   6   9  13   7
  8  11   6   5   8  10  15   6  10  14
 17  13   7   6   7  19  10  12  11   5
  7   7   9  12  12  10  10   9  10  11
 13  13   8  15   7  15   6  10  10   8
  6  12  12   8   8   6  16   7   5  11
  9  12   7  12  10  15   6   6  11  12
  9  14   9   9  11  10  11   7  12   7
 15  16  15  11  11   7   5  10   8   7
 10   5  13   9  11  13  13  12   8   4

I had to use Val to put the dimension in the type domain here to let me use tuples in a type stable way. This is probably useful in this case, but isn’t always.

Many thanks! This works pretty well.

But of course I want to understand why this is so much faster.

Background info: the dataMatrix is read from a CSV file. I combined every column of the CSV file via hcat() into the dataMatrix. Maybe I can use another data type. The actual data values are all floats.

Can you please explain a bit why it was necessary to handle the dimension as a Val()? I treated it as an Int64, and it is not yet clear to me why using it to create tuples is so slow, while with Val() it is so fast.

A tuple is a type that contains information about its length in the type itself. This generally makes tuples faster than vectors (for small enough tuples), as long as the compiler knows ahead of time what this length will be.
If it does not, the compiler will not be able to figure out the types, which leads to a lot of overhead, as Julia needs to check the types at runtime and then choose the appropriate compiled methods for these types, e.g. for indexing your matrix.

In your case, dimensions is not actually known ahead of compilation, but the trick is that we can move all the uncertainty out of the function that is actually performance sensitive.
This means that we create a value type which carries the length of the tuple in its type. When this Val is created, the compiler does not immediately know what the outcome is, so that step is relatively slow by itself. However, afterwards it dispatches to a function that is compiled just in time specifically for that value of dimensions.
Since the cost of this function is much larger than whatever happened before, this leads to a significant speedup.
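
Here is a small way to see that difference for yourself (just a sketch; f_runtime and f_val are made-up names):

# the length of the tuple is only inferable when it is in the type domain
f_runtime(n) = ntuple(i -> 0, n)                    # length known only at runtime
f_val(::Val{N}) where {N} = ntuple(i -> 0, Val(N))  # length is part of the type

# compare e.g. @code_warntype f_runtime(2) with @code_warntype f_val(Val(2)):
# the first cannot infer a concrete tuple type, the second infers Tuple{Int64, Int64}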

The key to understanding why is what the compiler knows when it compiles your function. When dimensions is an Int, the compiler does not know its value. So it knows neither the size of creationVector, nor the shape and size of countMatrix, nor the size of coordMatrix. This means a lot more work inside the main loop for computing the actual indices into countMatrix.

With mikmoore’s suggestion, the value of dimensions is known to the compiler (since it’s a type). So the shape of countMatrix is known, and the size of coord is known. Thus the loop is much better optimized; there is no uncertainty about what to do.
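
One way to see this concretely (just a sketch using Base.return_types; the exact printed types may vary between Julia versions):

# the dimensionality of the result is inferable from a tuple of sizes,
# but not from a runtime-length vector that gets splatted
Base.return_types(v -> zeros(Int, v...), (Vector{Int},))   # non-concrete Array{Int64}
Base.return_types(t -> zeros(Int, t), (NTuple{2,Int},))    # concrete Matrix{Int64}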

Many thanks @Salmon and @sgaure. I understand it a bit, but not yet completely:

To use the Val() method consistently, I have now changed my function header to

function createCountTensor(dataMatrix, ::Val{resolution}, minMatrix::Vector, maxMatrix::Vector,
                            ::Val{dimensions}) where {resolution, dimensions}

But when I now call

countTensor = zeros(Int, ntuple(i -> Val(resolution), Val(dimensions)))

I get the error:

no method matching zeros(::Type{Int64}, ::Tuple{Val{8}, Val{8}})

calling instead

countTensor = zeros(Int, ntuple(i -> resolution, Val(dimensions)))

works.
So why can’t I use Val() also for resolution? It has the same data type as dimensions.

A Val is not a “normal” value and it can’t be used in place of one. In this case, passing the dimension in a Val tells the compiler to compile a new set of instructions custom-made for that particular dimension. Because the number of dimensions is used to generate a lot of code relating to the use of arrays, it’s useful to have that particular part known to the compiler. See more about Val here and here.

ntuple specially accepts a Val in place of its length argument to ensure the compiler knows the length. It’s not possible to use Val in most places because most functions don’t work like this (because it’s not beneficial for them).

Since resolution is used to determine the size of arrays (but not the number of dimensions), and the size is a runtime value (that the compiler does not significantly benefit from knowing up-front), it’s not beneficial to pass it as a Val, and none of the functions related to your resolution would benefit even if they did support it.

The use of Val is a somewhat advanced technique and should be applied sparingly. It can require the compiler to compile a lot more functions than it otherwise would, so misuse can be detrimental.
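
To make the distinction concrete, here is a small sketch (make_count_tensor is a made-up name): only the length argument of ntuple takes a Val, while the element values and the sizes handed to zeros stay ordinary integers.

function make_count_tensor(resolution::Int, ::Val{dimensions}) where {dimensions}
    sz = ntuple(i -> resolution, Val(dimensions))  # an NTuple{dimensions, Int}
    return zeros(Int, sz)                          # zeros wants plain Ints, not Vals
end

make_count_tensor(8, Val(2))  # 8×8 Matrix{Int64}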

I think it is probably appropriate to mention alternatives to using Val. A more natural solution would actually be the following pattern, which boils down to using function barriers:

function createArray(N, dimensions)
    # the number of dimensions is only known at runtime, so this is type unstable
    return zeros(ntuple(i -> N, dimensions))
end
function computeArrayValues!(array, params)
    # function barrier: here the concrete array type is known, so this loop compiles to fast code
    for i in eachindex(array)
        # ... fill in the actual computation here
    end
    return array
end

arr = createArray(5, 3)          # a 5×5×5 Array{Float64, 3}
computeArrayValues!(arr, params) # params stands for whatever your computation needs

It might require some thinking to see how to apply this pattern in your case, but I’m pretty sure it will be worth it and will also make your code much simpler.

In this case, I think, the function barrier pattern is helpful:

function createCountMatrix(data::AbstractMatrix, resolution::Integer, upper_limits, lower_limits)
    nvars = size(data, 2)

    nlevels = ntuple(i->resolution, nvars)
    # function barrier: move all size-dependent logic to make_heatmap
    # for which the number of dimensions are encoded in the size of nlevels
    counts = make_heatmap(data, nlevels, upper_limits, lower_limits)

    return counts
end

function make_heatmap(
    data::AbstractMatrix{T},
    nlevels::NTuple{N,Integer},
    upper_limits,
    lower_limits,
) where {N, T<:Real}
    # at first calculate the step size for every dimension
    cols = eachcol(data)
    col_steps = ntuple(N) do i
        return (upper_limits[i] - lower_limits[i]) / nlevels[i]
    end
    
    counts = zeros(Int, nlevels)

    # for every data point, compute its grid-cell coordinates as a tuple
    for k in axes(data, 1)
        coords = ntuple(N) do i
            trunc(Int, (data[k, i] - lower_limits[i]) / col_steps[i]) + 1
        end
        counts[coords...] += 1
    end
    return counts
end

I’ll echo the above poster that Val is not needed very often and can usually be avoided (and the code made better) by restructuring things. In the case of the original question it was useful because the information was only available as a runtime value (not in a type anywhere), namely size(testMatrix, 2).

“Array-ifying” data by smashing data points together into a higher-dimensional array is extremely common in Python/MATLAB/etc because of their performance characteristics. But in Julia it’s not necessary and code is usually cleaner if it can be avoided.

Currently your testMatrix is an N × dimension Matrix{Float64}. Things might be a lot cleaner if it were a length-N Vector{NTuple{dimension, Float64}} or Vector{SVector{dimension, Float64}}, where SVector comes from StaticArrays.jl. In either version the dimension is encoded in the type, so it would not be necessary to pass it as a separate Val. These would probably also be more idiomatic ways to handle the data.

Or the above post using a function barrier is also very idiomatic. It isolates the type instability in such a way that it has a negligible effect on performance.

What’s actually best will depend on where the data comes from and what you do with it.

Another possibility, as I now see, is to require tuples or static arrays for the lower and upper limits, which also makes the type of the output inferrable from the types of the inputs.
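
Roughly like this (a sketch building on the make_heatmap above; the outer method is just an illustration, not a drop-in replacement):

# take the number of variables from the (tuple) limits instead of size(data, 2),
# so the dimensionality of the output is inferable from the argument types
function count_with_tuple_limits(data::AbstractMatrix, resolution::Integer,
                                 upper_limits::NTuple{N,Real}, lower_limits::NTuple{N,Real}) where {N}
    nlevels = ntuple(i -> resolution, Val(N))
    return make_heatmap(data, nlevels, upper_limits, lower_limits)
end

# e.g. count_with_tuple_limits(rand(1000, 2), 10, (1.0, 1.0), (0.0, 0.0))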

I tried this, but there is a general issue with Julia; I opened a new thread for it:

Regarding SVector: the run time is not stable, so it is not an option for me. Sometimes it is unbeatably fast, but sometimes it is slow.

The problem here is that you are putting everything into a static array. They are only made for small arrays. For dimensions like 2 they are perfect, so use

Vector{SVector{2, Float64}}

instead of a big SMatrix.

To generate the random data, do

using StaticArrays
sDataMatrix = [rand(SVector{2,Float64}) for n in 1:1500]

And, as mentioned in the other thread, the slowness is due to JIT compilation, since large static arrays put an enormous strain on the compiler. SMatrixes larger than about 10×10 are not recommended.
