I have some code currently implemented via multiple dispatch that looks vaguely like:
```julia
function process_data(data, n)
    for i in 1:n
        # apply operation to entire data set, data
    end
end

function process_data(data_sets, n)
    for i in 1:n
        # apply operation to data = data_sets[i]
    end
end

function process_data(data, n, k)
    for i in 1:n
        # randomly subsample k elements from data and process those
    end
end
```
So:

- The first version works on the entire data set all at once.
- The second version works on an array of data sets, using a different data set at each iteration.
- The third version creates randomly subsampled data sets at each iteration and processes them.
Right now, I have coded (nearly) the entire loop in each case, and I feel like I should be able to use multiple dispatch so that only one of the loops is fully coded out. What I want to avoid is allocations, and that’s where I’m looking for help. Is there a way to map a single data set, data, into an abstract structure data_sets that will just return data at every data_sets[i]? Is there some similar way to get the random subsampling to work?
Hi! I think it’s impossible to answer your questions accurately without more details. The actual code would be most useful to find a solution, but if you cannot disclose that, can you make a minimal example that runs and has the same features as your actual code?
Here’s what comes to my mind based on your description:
- Are all the data elements of the same (concrete) type? If so, I think you don’t need a struct wrapping your array of data entries.
- For the subsampling, couldn’t you just draw the indices and iterate over a view of the full data array? (See the sketch after this list.)
- What is the meaning of n? In the first function, it looks like you are doing something n times to a single data object, whereas in the second one you apply something to the i-th data object, so what is n for there?
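To illustrate the view idea, here is a minimal sketch; it assumes the data is a plain vector and uses sample from StatsBase, neither of which appears in the original post:

```julia
using StatsBase: sample

data = collect(1:10)
idx = sample(eachindex(data), 4; replace=false)  # draw 4 distinct random indices
sub = view(data, idx)  # a lazy view: the underlying data is not copied
sum(sub)               # stand-in for whatever processing is applied
```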
To make it a bit more concrete, let us assume that

```julia
data = 1:10;
```

and I want to apply some function n times to this same data set, where n need have nothing to do with the size of the data set:
```julia
for i in 1:n
    # do something with all of data
end
```
But what I would also like to be able to do is have an array of such data sets (all of the same size):
```julia
data1 = 1:10;
data2 = 21:30;
data_sets = Iterators.cycle([data1, data2], n ÷ 2)  # assume n is even; cycle(iter, n) needs Julia ≥ 1.11
for (i, data) in enumerate(data_sets)
    # do something with all of data
end
```
The first case can be collapsed into the second case merely by letting

```julia
data_sets = Iterators.repeated(data, n)
```
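A quick check (my own sketch, not part of the original post) that this wrapper is lazy and never copies the data itself:

```julia
data = 1:10
n = 4
data_sets = Iterators.repeated(data, n)  # lazy: no copies of data are made
for (i, d) in enumerate(data_sets)
    @assert d === data  # every "element" is the very same object
end
```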
What I would now like to do is find a way to incorporate the third case without doing lots of allocations, where, as I iterate through, I extract a random subset of a fixed size, i.e.,
```julia
using StatsBase: sample  # sample comes from StatsBase

for i in 1:n
    sample_idx = sample(1:length(data), batch_size; replace=false)
    # apply process to data[sample_idx]
end
```
It seems like you should write a fourth function that just operates on a single data set:
```julia
function process(data::T) # T is your type of interest
    # apply operation to data
end
```
and then write the other three in terms of that:
```julia
function process_data(data::T, n)
    for _ in 1:n
        process(data)
    end
end

function process_data(data_sets::AbstractVector{T}, n)
    for i in 1:n
        process(data_sets[i])
    end
end

function process_data(data::T, n, k)
    for i in 1:n
        # randomly subsample k elements from data, e.g. with StatsBase:
        sampled_data = sample(data, k; replace=false)
        process(sampled_data)
    end
end
```
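Taking this one step further (my own sketch, not the poster’s code: process here is a hypothetical stand-in, since the real operation isn’t shown), all three cases can funnel through a single fully coded loop over an iterator of data sets:

```julia
using StatsBase: sample

process(d) = sum(d)  # hypothetical stand-in for the real per-data-set operation
process_all(data_sets) = foreach(process, data_sets)

data = collect(1:10); n = 4; k = 3
process_all(Iterators.repeated(data, n))  # case 1: the same data set, n times
process_all([1:10, 21:30])                # case 2: a vector of distinct data sets
process_all(view(data, sample(eachindex(data), k; replace=false)) for _ in 1:n)  # case 3: fresh random subsamples
```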
The above advice might change significantly once more background details are provided, though.