Improving performance for a beginner

I’ve recently started a side project at work to see if I can use Julia to replace Python for the majority of my data processing, due to the possible gains in performance. Over the years the Python code has become quite quick thanks to NumPy and Numba, so I’ve been quite curious as to what Julia could bring to the party.

So I’ve done a very basic initial port of the code to get some of the basic processes operational. These are probably not very well written Julia code, but they are giving the correct results and already seem to be almost twice as quick as the Python code. I’ll describe a few basic functions and hopefully someone can critique them and see if there are some simple tweaks that could be made to make things quicker. Asking for guidance early on is usually a good way to ensure time is not wasted.

I’m reading my data from HDF5 file so I have a basic function that reads the required datasets:

function get_generic_property(filepath::String, property::String)
    fid = h5open(filepath, "r")
    timestep_key = keys(fid["Data"])[1]
    query = read(fid["Data"][timestep_key]["groupA"][property])
    close(fid)
    return query
end

So most of the functions have maybe several calls to get the raw data, depending on what’s being processed. Others have just one, but I don’t think there’s any optimisation that can be done here as it’s just using HDF5.jl.
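One small robustness tweak (a suggestion, not a performance change): HDF5.jl supports the do-block form of h5open, which guarantees the file is closed even if reading throws. A sketch of the same reader in that form:

```julia
using HDF5

# Sketch of the reader above using the do-block form of h5open:
# the file is closed automatically, even if an error is thrown.
function get_generic_property(filepath::String, property::String)
    h5open(filepath, "r") do fid
        timestep_key = keys(fid["Data"])[1]
        read(fid["Data"][timestep_key]["groupA"][property])
    end
end
```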

The first processing function is like this:

function normalize_by_row(arr)
    arr ./ sqrt.(sum(arr .^ 2, dims=2))
end

function getFabricTensor(filepath::String)
    cv1 = get_generic_property(filepath, "cvecs")
    norm_vec = normalize_by_row(cv1)
    tensor_prod = norm_vec' * norm_vec
    return tensor_prod / size(norm_vec, 1)
end

where cv1 is nx3 array of Float64 and n is typically in the region of 50-100K.
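For an n×3 array like this, the broadcast in normalize_by_row allocates several temporaries (arr.^2, the row sums, the sqrt, the final result). A hypothetical allocation-light alternative, written as explicit loops with a single output allocation:

```julia
# Hypothetical alternative to normalize_by_row for an n×3 matrix:
# one pass per row, a single output allocation, no temporaries.
function normalize_by_row2(arr::AbstractMatrix{<:Real})
    out = similar(arr, float(eltype(arr)))
    for i in axes(arr, 1)
        # accumulate the squared norm of row i
        s = 0.0
        for j in axes(arr, 2)
            s += arr[i, j]^2
        end
        invnorm = 1 / sqrt(s)
        for j in axes(arr, 2)
            out[i, j] = arr[i, j] * invnorm
        end
    end
    return out
end
```

Whether this beats the broadcast version for n in the 50-100K range is worth benchmarking rather than assuming.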

The second function is calculating some location indices and then looking up the correct result based on several datasets read from the HDF5 file:

function getlocIDs(filepath::String)
    # read data
    indxs = get_generic_property(filepath, "loc_indx")
    type_elements, type_IDs = _gettypeData(filepath)
    # all types will have at least 1 element so generate a vector of ones
    num_elems1 = ones(Int, size(indxs, 1))
    num_elems2 = ones(Int, size(indxs, 1))

    # update the number of elements by looping through the Dict.
    # I probably should include a check to see if the number of elements for the type is >1 before replacing
    for type in type_elements
        num_elems1[indxs[:, 1] .== type.first] .= type.second
        num_elems2[indxs[:, 3] .== type.first] .= type.second
    end

    # calculate correct index
    indexNumber1 = ((indxs[:, 2] - (indxs[:, 2] .% num_elems1)) .÷ num_elems1) .+ 1
    indexNumber2 = ((indxs[:, 4] - (indxs[:, 4] .% num_elems2)) .÷ num_elems2) .+ 1

    # add 1 to convert to julia indexing and extract correct ID
    ids1 = getindex.(Ref(type_IDs), indexNumber1, indxs[:, 1] .+ 1)
    ids2 = getindex.(Ref(type_IDs), indexNumber2, indxs[:, 3] .+ 1)

    # concatenate the return into a matrix
    return hcat(ids1, ids2)
end

type_elements is a Dict where each key is a string name storing an Int value for the number of elements in that object type. type_IDs is an m×n Matrix of Ints where m is the maximum number of elements for any of the types and n is the number of types in type_elements. This is typically 2-3 types. type_IDs may have parts of the columns where the data is a NaN, since not all types will have the same number of elements.
indxs is a p×4 array, where p is typically about 6x larger than n.
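One small algebraic simplification worth noting (assuming the index columns are non-negative Ints): for x ≥ 0 and n > 0, x - x % n equals n * (x ÷ n), so (x - x % n) ÷ n reduces to x ÷ n. The whole index calculation could then collapse to:

```julia
# Hypothetical simplification of the index calculation above, assuming
# non-negative integer indices: (x - x % n) ÷ n == x ÷ n.
indexNumber_simple(col, num_elems) = col .÷ num_elems .+ 1
```

This removes two of the temporaries before any other optimisation is applied.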

Both functions run fine and return results that are correct, but I’m trying to understand julia and how to write quick code and I’m wondering about things like @inbounds checking, memory allocations and type stability.

I think all my functions are type stable. I may have a few more memory allocations than is necessary. What about function size - is it better to create lots of smaller functions? Should I break getlocIDs() into multiple smaller specialised functions, like for calculating num_elems1/num_elems2?
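For example, the num_elems1/num_elems2 computation could be pulled out into one helper that takes the relevant index column (a sketch, with a name I just made up):

```julia
# Hypothetical helper: build the per-row element counts for one index
# column, defaulting to 1 for types not present in the Dict.
function num_elems_for(col::AbstractVector, type_elements::Dict)
    num_elems = ones(Int, length(col))
    for (name, count) in type_elements
        num_elems[col .== name] .= count
    end
    return num_elems
end
```

Then getlocIDs would just call num_elems_for(indxs[:, 1], type_elements) and num_elems_for(indxs[:, 3], type_elements), which also makes the helper easy to benchmark in isolation.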

Thanks in advance.

One of the advantages of smaller functions is, for example, that you could show one of them here with an input that works, and get very specific suggestions on how to improve the performance of that specific code.

I won’t speculate on how much that will make a difference in performance in your code, but the immediate thing to improve is to remove the intermediate vectors that are allocating in many of the broadcasting operations you perform. For example, here:

    indexNumber1 = ((indxs[:, 2] - (indxs[:, 2] .% num_elems1)) .÷ num_elems1).+1

Each indxs[:, 2] slice creates a new intermediate vector (use views). Then indxs[:, 2] .% num_elems1 creates another intermediate vector, (indxs[:, 2] - (indxs[:, 2] .% num_elems1)) creates a third, and then another one is created for the final result. To avoid creating intermediate arrays in all these operations, use @views, and it is probably easiest to broadcast the whole expression with @.:

Thus, for example:

julia> function index(indxs, num_elems1)
           indexNumber1 = ((indxs[:, 2] - (indxs[:, 2] .% num_elems1)) .÷ num_elems1) .+ 1
       end
index (generic function with 1 method)

julia> indxs = rand(1:100,100,2); num_elems1 = 100;

julia> @btime index($indxs,$num_elems1);
  750.114 ns (5 allocations: 4.38 KiB)

Can be written as:

julia> function index3(indxs, num_elems1)
           indexNumber1 = @views @. (indxs[:, 2] - indxs[:, 2] % num_elems1) ÷ num_elems1 + 1
       end
index3 (generic function with 1 method)

julia> @btime index3($indxs,$num_elems1);
  558.812 ns (1 allocation: 896 bytes)

The remaining allocation is that of the indexNumber vector, which inside your loop you might want to preallocate or avoid it completely.
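A preallocated, in-place variant could look like this (a sketch; the ! suffix is just the usual Julia convention for mutating functions):

```julia
# Hypothetical in-place version: write into a caller-supplied buffer,
# so repeated calls allocate nothing.
function index!(out::AbstractVector, indxs::AbstractMatrix, num_elems1)
    for i in axes(indxs, 1)
        out[i] = (indxs[i, 2] - indxs[i, 2] % num_elems1) ÷ num_elems1 + 1
    end
    return out
end
```

Usage: allocate out = Vector{Int}(undef, size(indxs, 1)) once, then reuse it across calls with index!(out, indxs, num_elems1).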

You can also expand that loop:

julia> function index2(indxs, num_elems1)
           indexNumber1 = similar(indxs, size(indxs, 1))
           for i in axes(indxs, 1)
               indexNumber1[i] = (indxs[i, 2] - indxs[i, 2] % num_elems1) ÷ num_elems1 + 1
           end
           return indexNumber1
       end
index2 (generic function with 1 method)

julia> @btime index2($indxs,$num_elems1);
  505.316 ns (1 allocation: 896 bytes)

Now that loop looks a good candidate for loop vectorization:

julia> using LoopVectorization

julia> function index3(indxs, num_elems1)
           indexNumber1 = similar(indxs, size(indxs, 1))
           @turbo for i in axes(indxs, 1)
               indexNumber1[i] = (indxs[i, 2] - indxs[i, 2] % num_elems1) ÷ num_elems1 + 1
           end
           return indexNumber1
       end
index3 (generic function with 1 method)

julia> @btime index3($indxs,$num_elems1);
  342.514 ns (1 allocation: 896 bytes)


@lmiq Thanks for these tips. I’ve just been looking at the first example you gave to see how that improves things and it does make quite a difference:

@btime indexNumber1 = ((cc2[:, 2] - (cc2[:, 2] .% numSpheres1)) .÷ numSpheres1).+1
  384.600 μs (18 allocations: 1.97 MiB)

@btime indexNumber1 = index3(cc2, numSpheres1)
  291.400 μs (2 allocations: 404.27 KiB)

So I thought about how to make it generic rather than have a separate function for each index calculation, since it’s reading the whole indxs array and the calculation is the same, just with different inputs, so I tried two other options:

function index4(indxs, num_elems)
    indexNumber1 = @views @. (indxs - indxs % num_elems) ÷ num_elems + 1
end

function index5(indxs, num_elems)
    indexNumber1 = @. (indxs - indxs % num_elems) ÷ num_elems + 1
end

However, both of these options are a bit slower. I’m not sure why, but probably because of the extra allocations? I don’t know where the extra allocations are; I’m still trying to get my head around this aspect of writing code for Julia.

@btime indexNumber1 = index4(cc2[:,2], numSpheres1)
  324.000 μs (5 allocations: 808.55 KiB)

@btime indexNumber1 = index5(cc2[:,2], numSpheres1)
  312.300 μs (5 allocations: 808.55 KiB)

I’m going to have a look at the other suggested improvements later tonight or tomorrow. Thanks again!

Always use views when slicing, and to benchmark with @btime correctly you need to interpolate the arguments with $:

julia> function index5(indxs, num_elems)
           indexNumber1 = @. (indxs - indxs % num_elems) ÷ num_elems + 1
       end
index5 (generic function with 1 method)

julia> indxs = rand(1:100,100,2); num_elems1 = 100;

julia> @btime indexNumber1 = index5(indxs[:, 2], num_elems1); # as you did
  779.379 ns (3 allocations: 1.77 KiB)

julia> @btime indexNumber1 = index5($(@view(indxs[:, 2])), $num_elems1); # as should be done
  559.914 ns (1 allocation: 896 bytes)

Outside the benchmark, call your function with a view of the column:

index5(@view(indxs[:, 2]), num_elems1)
