I’ve recently started a side project at work to see if I can use `julia` to replace `python` for the majority of my data processing, due to the possible gains in performance. Over the years the python code has become quite quick thanks to `numpy` and `numba`, so I’ve been quite curious as to what `julia` could bring to the party.

So I’ve done a very basic initial port of the code to get some of the basic processes operational. This is probably not very well written `julia` code, but it gives the correct results and already seems to be almost twice as quick as the python code. I’ll describe a few basic functions and hopefully someone can critique them and see if there are some simple tweaks that could make things quicker. Asking for guidance early on is usually a good way to ensure time is not wasted.

I’m reading my data from an HDF5 file, so I have a basic function that reads the required datasets:

```julia
function get_generic_property(filepath::String, property::String)
    fid = h5open(filepath, "r")
    timestep_key = keys(fid["Data"])[1]
    query = read(fid["Data"][timestep_key]["groupA"][property])
    close(fid)
    return query
end
```

So most of the functions make several calls like this to get the raw data, depending on what’s being processed; others make just one. I don’t think there’s any optimisation to be done here, as it’s just using `HDF5.jl`.
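One thing I did wonder about: `h5open` also has a do-block form that closes the file automatically even if a read throws. A sketch of the same function in that style (behaviour should otherwise be identical; `get_generic_property_do` is just a placeholder name):

```julia
using HDF5

# Do-block variant: the file handle is closed automatically when the block
# exits, including on error, so the explicit close(fid) is no longer needed.
function get_generic_property_do(filepath::String, property::String)
    h5open(filepath, "r") do fid
        timestep_key = keys(fid["Data"])[1]
        read(fid["Data"][timestep_key]["groupA"][property])
    end
end
```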

The first processing function is like this:

```julia
function normalize_by_row(arr)
    arr ./ sqrt.(sum(arr.^2, dims=2))
end

function getFabricTensor(filepath::String)
    cv1 = get_generic_property(filepath, "cvecs")
    norm_vec = normalize_by_row(cv1)
    tensor_prod = norm_vec' * norm_vec
    return tensor_prod / size(norm_vec)[1]
end
```

where `cv1` is an `nx3` array of `Float64` and `n` is typically in the region of 50-100K.
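One variant I have been meaning to try: `sum(abs2, arr, dims=2)` computes the same row sums without allocating the intermediate `arr.^2` matrix. A sketch operating on an already-loaded array so it’s self-contained (`fabric_tensor` is just a stand-in for the body of `getFabricTensor`, with the HDF5 read unchanged):

```julia
# Same row normalisation, but sum(abs2, ...) avoids allocating arr.^2.
normalize_by_row2(arr) = arr ./ sqrt.(sum(abs2, arr, dims=2))

# Fabric tensor from an already-loaded nx3 array; norm_vec' * norm_vec is
# only 3x3, so the cost is dominated by the normalisation and the product.
function fabric_tensor(cv1::AbstractMatrix)
    norm_vec = normalize_by_row2(cv1)
    (norm_vec' * norm_vec) ./ size(norm_vec, 1)
end
```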

The second function calculates some location indices and then looks up the correct result based on several datasets read from the HDF5 file:

```julia
function getlocIDs(filepath::String)
    # read data
    indxs = get_generic_property(filepath, "loc_indx")
    type_elements, type_IDs = _gettypeData(filepath)
    # all types will have at least 1 element so generate a vector of ones
    num_elems1 = ones(Int, size(indxs)[1])
    num_elems2 = ones(Int, size(indxs)[1])
    # update the number of elements by looping through the Dict.
    # I probably should include a check to see if the number of elements
    # for the type is >1 before replacing
    for type in type_elements
        num_elems1[indxs[:, 1] .== type.first] .= type.second
        num_elems2[indxs[:, 3] .== type.first] .= type.second
    end
    # calculate correct index
    indexNumber1 = ((indxs[:, 2] - (indxs[:, 2] .% num_elems1)) .÷ num_elems1) .+ 1
    indexNumber2 = ((indxs[:, 4] - (indxs[:, 4] .% num_elems2)) .÷ num_elems2) .+ 1
    # add 1 to convert to julia indexing and extract correct ID
    ids1 = getindex.(Ref(type_IDs), indexNumber1, indxs[:, 1] .+ 1)
    ids2 = getindex.(Ref(type_IDs), indexNumber2, indxs[:, 3] .+ 1)
    # concatenate the return into a matrix
    return hcat(ids1, ids2)
end
```
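One simplification I spotted while writing this up: for a non-negative `x` and positive `k`, `(x - x % k) ÷ k` is identical to `x ÷ k`, so the two `indexNumber` lines could collapse to a single broadcast integer division (assuming the relevant entries of `indxs` really are non-negative):

```julia
# Equivalent index calculation: for non-negative col and positive num_elems,
# (x - x % k) ÷ k == x ÷ k, so the subtraction and remainder are redundant.
index_number(col, num_elems) = (col .÷ num_elems) .+ 1
```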

`type_elements` is a `Dict()` where each key is a string name storing an `Int` value for the number of elements in that object type. `type_IDs` is an `mxn` `Matrix` of `Int`s, where `m` is the maximum number of elements for any of the types and `n` is the number of types in `type_elements`; this is typically 2-3 types. `type_IDs` may have parts of a column where the data is `NaN`, since not all types will have the same number of elements.

`indxs` is a `px4` array, where `p` is typically about 6x larger than `n`.

Both functions run fine and return results that are correct, but I’m trying to understand `julia` and how to write quick code, and I’m wondering about things like `@inbounds` checking, memory allocations and type stability.
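For reference, this is roughly the diagnostic workflow I’ve been using so far to check those things (a sketch: `@time`, `@allocated` and `@code_warntype` are built in, while `BenchmarkTools` is a separate package I’m assuming is the usual choice; `square_sum` is just a toy stand-in function):

```julia
# Toy stand-in for one of the functions above.
square_sum(arr) = sum(abs2, arr)

arr = rand(100_000, 3)
@time square_sum(arr)                # prints time plus allocation count
bytes = @allocated square_sum(arr)   # bytes allocated by this one call
# @code_warntype square_sum(arr)    # Any/red-highlighted types => instability
# using BenchmarkTools
# @btime square_sum($arr)           # interpolate globals with $ when timing
```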

I think all my functions are type stable. I may have a few more memory allocations than is necessary. What about function size - is it better to create lots of smaller functions? Should I break `getlocIDs()` into multiple smaller specialised functions, e.g. for calculating `num_elems1`/`num_elems2`?

Thanks in advance.