Improving performance for a beginner

I’ve recently started a side project at work to see if I can use julia to replace python for the majority of my data processing due to the possible gains in performance. Over the years the python code has become quite quick thanks to numpy and numba so I’ve been quite curious as to what julia could bring to the party.

So I’ve done a very basic initial port of the code to get some of the basic processes operational. These are probably not very well written julia code, but they are giving the correct results and already seem to be almost twice as quick as the python code. I’ll describe a few basic functions and hopefully someone can critique them and see if there are some simple tweaks that could be made to make things quicker. Asking for guidance early on is usually a good way to ensure time is not wasted.

I’m reading my data from an HDF5 file, so I have a basic function that reads the required datasets:

using HDF5

function get_generic_property(filepath::String, property::String)
    fid = h5open(filepath, "r")
    timestep_key = keys(fid["Data"])[1]
    query = read(fid["Data"][timestep_key]["groupA"][property])
    close(fid)
    return query
end

So most of the functions have maybe several calls to get the raw data, depending on what’s being processed; others just one. I don’t think there’s any optimisation that can be done here as it’s just using HDF5.jl.
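For context, a typical processing function starts with one or more of these reads before doing any maths; the dataset names below are made up just to show the pattern:

function example_process(filepath::String)
    # each call opens the file, reads one dataset and closes it again
    cvecs = get_generic_property(filepath, "cvecs")
    radii = get_generic_property(filepath, "radii")   # hypothetical second dataset
    # ... processing on cvecs and radii ...
end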

The first processing function is like this:

function normalize_by_row(arr)
    arr ./ sqrt.(sum(arr.^2, dims=2))
end

function getFabricTensor(filepath::String)
    cv1 = get_generic_property(filepath, "cvecs")
    norm_vec = normalize_by_row(cv1)
    tensor_prod = norm_vec' * norm_vec
   
    return tensor_prod / size(norm_vec)[1]
end

where cv1 is an n×3 array of Float64 and n is typically in the region of 50-100K.
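To make the shapes concrete, this is roughly what the calculation amounts to (sizes and values here are just representative, not my real data):

cv1 = rand(60_000, 3)                                  # stand-in for the "cvecs" dataset, n x 3
norm_vec = normalize_by_row(cv1)                       # every row scaled to unit length
tensor = (norm_vec' * norm_vec) / size(norm_vec, 1)    # 3x3 fabric tensor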

The second function is calculating some location indices and then looking up the correct result based on several datasets read from the HDF5 file:

function getlocIDs(filepath::String)
    # read data 
    indxs = get_generic_property(filepath, "loc_indx")
    type_elements, type_IDs = _gettypeData(filepath)
    
    # all types will have at least 1 element, so generate vectors of ones
    num_elems1 = ones(Int, size(indxs)[1])
    num_elems2 = ones(Int, size(indxs)[1])

    # update the number of elements by looping through the Dict.
    # I probably should include a check to see if the number of elements for the type is > 1 before replacing
    for type in type_elements
        num_elems1[indxs[:, 1] .== type.first] .= type.second
        num_elems2[indxs[:, 3] .== type.first] .= type.second
    end

    # calculate correct index
    indexNumber1 = ((indxs[:, 2] - (indxs[:, 2] .% num_elems1)) .÷ num_elems1).+1
    indexNumber2 = ((indxs[:, 4] - (indxs[:, 4] .% num_elems2)) .÷ num_elems2).+1

    # add 1 to convert to julia indexing and extract correct ID
    ids1 = getindex.(Ref(type_IDs), indexNumber1, indxs[:, 1] .+ 1)
    ids2 = getindex.(Ref(type_IDs), indexNumber2, indxs[:, 3] .+ 1)

    # concatenate the result into a matrix
    return hcat(ids1, ids2)
end

type_elements is a Dict() where each key is a string name and each value is an Int giving the number of elements in that object type. type_IDs is an m×n Matrix of Ints, where m is the maximum number of elements for any of the types and n is the number of types in type_elements (typically 2-3 types). type_IDs may have parts of a column where the data is NaN, since not all types will have the same number of elements.
indxs is a px4 array, where p is typically about 6x larger than n.
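As a made-up example of the shapes involved (ignoring the NaN padding in type_IDs for simplicity):

type_elements = Dict("typeA" => 2, "typeB" => 5)   # elements per object type
type_IDs = rand(1:1000, 5, 2)                      # m x n: m = max elements for any type, n = number of types
indxs = hcat(rand(0:1, 30), rand(0:9, 30),         # p x 4: columns 1 and 3 hold zero-based type indices,
             rand(0:1, 30), rand(0:9, 30))         # columns 2 and 4 hold element indices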

Both functions run fine and return results that are correct, but I’m trying to understand julia and how to write quick code, and I’m wondering about things like bounds checking (@inbounds), memory allocations and type stability.

I think all my functions are type stable. I may have a few more memory allocations than necessary. What about function size - is it better to create lots of smaller functions? Should I break getlocIDs() into multiple smaller, specialised functions, for example one for calculating num_elems1/num_elems2, as in the sketch below?
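Something like this, maybe (just a sketch to show what I mean, not tested):

function count_elements(type_col, type_elements)
    # one element per row by default, overwritten for every type found in the Dict
    num_elems = ones(Int, length(type_col))
    for (key, n) in type_elements
        num_elems[type_col .== key] .= n
    end
    return num_elems
end

# inside getlocIDs this would become:
# num_elems1 = count_elements(view(indxs, :, 1), type_elements)
# num_elems2 = count_elements(view(indxs, :, 3), type_elements)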

Thanks in advance.

One of the advantages of smaller functions is, for example, that you could show one of them here with an input that works, and get very specific suggestions on how to improve the performance of that specific code.

I won’t speculate on how much difference that will make to the performance of your code, but the immediate thing to improve is to remove the intermediate vectors that are allocated in many of the broadcasting operations you perform. For example, here:

    indexNumber1 = ((indxs[:, 2] - (indxs[:, 2] .% num_elems1)) .÷ num_elems1).+1

The indxs[:, 2] creates a new intermediate vector (use views). Then indxs[:, 2] .% num_elems1 creates another intermediate vector, then (indxs[:, 2] - (indxs[:, 2] .% num_elems1)) creates yet another, and further intermediates are created for the remaining operations. To avoid creating intermediate arrays in all these operations, use @views, and it is probably easiest to broadcast the whole expression with @.:

Thus, for example:

julia> function index(indxs,num_elems1)
          indexNumber1 = ((indxs[:, 2] - (indxs[:, 2] .% num_elems1)) .÷ num_elems1).+1
       end
index (generic function with 1 method)

julia> using BenchmarkTools

julia> indxs = rand(1:100,100,2); num_elems1 = 100;

julia> @btime index($indxs,$num_elems1);
  750.114 ns (5 allocations: 4.38 KiB)

Can be written as:

julia> function index3(indxs,num_elems1)
          indexNumber1 = @views @. (indxs[:,2] - indxs[:,2] % num_elems1) ÷ num_elems1 + 1
       end
index3 (generic function with 1 method)

julia> @btime index3($indxs,$num_elems1);
  558.812 ns (1 allocation: 896 bytes)

The remaining allocation is that of the indexNumber1 vector itself, which, inside your loop, you might want to preallocate or avoid completely.
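For example, one way to preallocate (a sketch, assuming the output vector is allocated once outside and has the same element type as indxs) is to fill it with an in-place broadcast:

function index_inplace!(indexNumber1, indxs, num_elems1)
    # @. turns the assignment into .=, so nothing new is allocated here
    @views @. indexNumber1 = (indxs[:, 2] - indxs[:, 2] % num_elems1) ÷ num_elems1 + 1
    return indexNumber1
end

# allocate once, reuse on every call:
# indexNumber1 = similar(indxs, size(indxs, 1))
# index_inplace!(indexNumber1, indxs, num_elems1)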

You can also write the broadcast as an explicit loop:

julia> function index2(indxs,num_elems1)
          indexNumber1 = similar(indxs, size(indxs, 1))
          for i in axes(indxs,1)
            indexNumber1[i] = (indxs[i,2] - indxs[i,2] % num_elems1) ÷ num_elems1 + 1
          end
          return indexNumber1
       end
index2 (generic function with 1 method)

julia> @btime index2($indxs,$num_elems1);
  505.316 ns (1 allocation: 896 bytes)

Now that loop looks like a good candidate for loop vectorization:

julia> using LoopVectorization

julia> function index3(indxs,num_elems1)
          indexNumber1 = similar(indxs, size(indxs, 1))
          @turbo for i in axes(indxs,1)
            indexNumber1[i] = (indxs[i,2] - indxs[i,2] % num_elems1) ÷ num_elems1 + 1
          end
          return indexNumber1
       end
index3 (generic function with 1 method)

julia> @btime index3($indxs,$num_elems1);
  342.514 ns (1 allocation: 896 bytes)

@lmiq Thanks for these tips. I’ve just been looking at the first example you gave to see how that improves things and it does make quite a difference:

@btime indexNumber1 = ((cc2[:, 2] - (cc2[:, 2] .% numSpheres1)) .÷ numSpheres1).+1
  384.600 μs (18 allocations: 1.97 MiB)

@btime indexNumber1 = index3(cc2, numSpheres1)
  291.400 μs (2 allocations: 404.27 KiB)

So I thought about how to make it generic rather than have a separate function for each index calculation, since it reads the whole indxs array and the calculation is the same, just with different inputs. So I tried two other options:

function index4(indxs, num_elems)
    indexNumber1 = @views @. (indxs - indxs % num_elems) ÷ num_elems + 1
end

function index5(indxs, num_elems)
    indexNumber1 = @. (indxs - indxs % num_elems) ÷ num_elems + 1
end

However, both of these options are a bit slower. I’m not sure why, but probably because of the extra allocations? I don’t know where the extra allocations come from; I’m still trying to get my head around this aspect of writing code for julia.

@btime indexNumber1 = index4(cc2[:,2], numSpheres1)
  324.000 μs (5 allocations: 808.55 KiB)

@btime indexNumber1 = index5(cc2[:,2], numSpheres1)
  312.300 μs (5 allocations: 808.55 KiB)

I’m going to have a look at the other suggested improvements later tonight or tomorrow. Thanks again!

Always use views when slicing, and to use @btime properly you need to interpolate the variables with $:

julia> function index5(indxs, num_elems)
           indexNumber1 = @. (indxs - indxs % num_elems) ÷ num_elems + 1
       end
index5 (generic function with 1 method)

julia> indxs = rand(1:100,100,2); num_elems1 = 100;

julia> @btime indexNumber1 = index5(indxs[:,2], num_elems1) # as you did;
  779.379 ns (3 allocations: 1.77 KiB)

julia> @btime indexNumber1 = index5($(@view(indxs[:,2])), $num_elems1) # as should be done;
  559.914 ns (1 allocation: 896 bytes)

Outside the benchmark, call your function with a view instead of a slice, for example:
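index5(@view(indxs[:, 2]), num_elems1)

# or, equivalently:
@views index5(indxs[:, 2], num_elems1)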
