I’ve recently started a side project at work to see if I can use Julia to replace Python for the majority of my data processing, due to the possible gains in performance. Over the years the Python code has become quite quick thanks to numba, so I’ve been quite curious as to what Julia could bring to the party.
So I’ve done a very basic initial port of the code to get some of the basic processes operational. These are probably not very well written Julia functions, but they are giving the correct results and already seem to be almost twice as quick as the Python code. I’ll describe a few basic functions and hopefully someone can critique them and see if there are some simple tweaks that could be made to make things quicker. Asking for guidance early on is usually a good way to ensure time is not wasted.
I’m reading my data from an HDF5 file, so I have a basic function that reads the required datasets:
function get_generic_property(filepath::String, property::String)
    fid = h5open(filepath, "r")
    # keys() returns a vector of names; take the (only) timestep group
    timestep_key = first(keys(fid["Data"]))
    query = read(fid["Data"][timestep_key]["groupA"][property])
    close(fid)
    return query
end
So most of the functions have maybe several calls to get the raw data, depending on what’s being processed. Others have just one, but I don’t think there’s any optimisation that can be done here as it’s just using the HDF5 library to read the datasets.
The first processing function is like this:
function normalize_by_row(arr)
    arr ./ sqrt.(sum(arr.^2, dims=2))
end

function getFabricTensor(filepath::String)
    cv1 = get_generic_property(filepath, "cvecs")
    norm_vec = normalize_by_row(cv1)
    tensor_prod = norm_vec' * norm_vec
    # size() alone returns a tuple; divide by the row count
    return tensor_prod / size(norm_vec, 1)
end
where cv1 is an nx3 array, and n is typically in the region of 50-100K.
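As a point of comparison, here is a minimal sketch of the row normalization on a small made-up matrix; `sum(abs2, arr, dims=2)` is a common way to avoid the temporary array that `arr.^2` allocates:

```julia
# Sketch of the row normalization, assuming Float64 input.
# sum(abs2, A; dims=2) computes the squared row norms without
# materialising an intermediate A.^2 array.
normalize_by_row2(arr) = arr ./ sqrt.(sum(abs2, arr, dims=2))

A = [3.0 4.0 0.0;
     0.0 0.0 2.0]          # illustrative 2×3 matrix
N = normalize_by_row2(A)
# every row of N now has unit Euclidean length
```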
The second function is calculating some location indices and then looking up the correct result based on several datasets read from the HDF5 file:
function getlocIDs(filepath::String)
    # read data
    indxs = get_generic_property(filepath, "loc_indx")
    type_elements, type_IDs = _gettypeData(filepath)

    # all types will have at least 1 element, so start with vectors of ones
    num_elems1 = ones(Int, size(indxs, 1))
    num_elems2 = ones(Int, size(indxs, 1))

    # update the number of elements by looping through the Dict.
    # I probably should check whether the number of elements for the type is > 1 before replacing
    for type in type_elements
        num_elems1[indxs[:, 1] .== type.first] .= type.second
        num_elems2[indxs[:, 3] .== type.first] .= type.second
    end

    # calculate the correct index
    indexNumber1 = ((indxs[:, 2] - (indxs[:, 2] .% num_elems1)) .÷ num_elems1) .+ 1
    indexNumber2 = ((indxs[:, 4] - (indxs[:, 4] .% num_elems2)) .÷ num_elems2) .+ 1

    # add 1 to convert to Julia indexing and extract the correct ID
    ids1 = getindex.(Ref(type_IDs), indexNumber1, indxs[:, 1] .+ 1)
    ids2 = getindex.(Ref(type_IDs), indexNumber2, indxs[:, 3] .+ 1)

    # concatenate the return into a matrix
    return hcat(ids1, ids2)
end
type_elements is a Dict() where each key is a string name storing an Int value for the number of elements in that object type.
type_IDs is an mxn array, where m is the maximum number of elements for any of the types and n is the number of types in type_elements. There are typically 2-3 types.
type_IDs may have parts of the columns where the data is NaN, since not all types will have the same number of elements.
indxs is a px4 array, where p is typically about 6x larger than n.
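As an aside on the index arithmetic: for non-negative integers, `(i - i % k) ÷ k` is just integer division, so the expression can be written more directly. A sketch with a hypothetical `index_number` helper:

```julia
# Sketch of the per-element index calculation used in getlocIDs.
# For i ≥ 0 and k > 0, (i - i % k) ÷ k reduces to i ÷ k,
# and the + 1 converts to Julia's 1-based indexing.
index_number(i::Integer, k::Integer) = (i ÷ k) + 1

# e.g. with k = 4 elements per type, raw indices 0..7 map to 1,1,1,1,2,2,2,2
```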
Both functions run fine and return results that are correct, but I’m trying to understand Julia and how to write quick code, and I’m wondering about things like @inbounds checking, memory allocations and type stability.
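For what it’s worth, here is a minimal sketch of how allocations and type stability are usually inspected, using a toy stand-in function since the real ones need an HDF5 file on disk:

```julia
# Toy stand-in for the real functions, which need an HDF5 file.
f(x) = x ./ sqrt.(sum(abs2, x, dims=2))

x = rand(1000, 3)
f(x)                    # first call includes compilation time
@time f(x)              # second call reports runtime and allocations only
# @code_warntype f(x)   # red Any/Union annotations would indicate type instability
# BenchmarkTools.jl's @btime gives more reliable timings than @time
```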
I think all my functions are type stable. I may have a few more memory allocations than necessary. What about function size - is it better to create lots of smaller functions? Should I break getlocIDs() into multiple smaller specialised functions, like one for calculating the index numbers?
Thanks in advance.