I have some code currently implemented via multiple dispatch that looks vaguely like:
```julia
function process_data(data, n)
    for i in 1:n
        # apply operation to entire data set, data
    end
end

function process_data(data_sets, n)
    for i in 1:n
        # apply operation to data = data_sets[i]
    end
end

function process_data(data, n, k)
    for i in 1:n
        # randomly subsample k elements from data and process those
    end
end
```
So:

- The first version works on the entire data set all at once.
- The second version works on an array of data sets, using a different data set at each iteration.
- The third version creates randomly subsampled data sets at each iteration and processes them.
Right now, I have coded (nearly) the entire loop in each case, and I feel like I should be able to use multiple dispatch so that only one of the loops is fully coded out. What I want to avoid is allocations, and that’s where I’m looking for help. Is there a way to map a single data set, data, into an abstract structure data_sets that will just return data at every data_sets[i]? Is there some similar way to get the random subsampling to work?
Hi! I think it’s impossible to answer your questions accurately without more details. The actual code would be most useful to find a solution, but if you cannot disclose that, can you make a minimal example that runs and has the same features as your actual code?
Here’s what comes to my mind based on your description:
- Are all the data elements of the same (concrete) type? If so, I think you don’t need a struct wrapping your array of data entries.
- For the subsampling, couldn’t you just draw the indices and iterate over a view of the full data array? (See the sketch after this list.)
- What is the meaning of n? In the first function, it looks like you are doing something n times to a single data object, whereas in the second one you apply something to the i-th data object, so what is n for there?
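To illustrate the view idea, here is a minimal sketch; it assumes the data is a plain vector and uses sample from StatsBase, neither of which appears in the original post:

```julia
using StatsBase: sample

data = collect(1:10)
idx = sample(eachindex(data), 4; replace=false)  # draw 4 distinct random indices
sub = view(data, idx)  # a lazy view: the underlying data is not copied
sum(sub)               # stand-in for whatever processing is applied
```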
To make it a bit more concrete, let us assume that

```julia
data = 1:10;
```

and I want to apply some function n times to this same data set, where n need have nothing to do with the size of the data set:
```julia
for i in 1:n
    # do something with all of data
end
```
But what I would also like to be able to do is have an array of such data sets (all of the same size):
```julia
data1 = 1:10;
data2 = 21:30;
data_sets = Iterators.cycle([data1, data2], n ÷ 2)  # assume n is even; cycle(iter, n) needs Julia ≥ 1.11
for (i, data) in enumerate(data_sets)
    # do something with all of data
end
```
The first case can be collapsed into the second case merely by letting

```julia
data_sets = Iterators.repeated(data, n)
```
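A quick check (my own sketch, not part of the original post) that this wrapper is lazy and never copies the data itself:

```julia
data = 1:10
n = 4
data_sets = Iterators.repeated(data, n)  # lazy: no copies of data are made
for (i, d) in enumerate(data_sets)
    @assert d === data  # every "element" is the very same object
end
```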
What I would now like to do is find a way to incorporate the third case without doing lots of allocations, where, as I iterate through, I extract a random subset of a fixed size, i.e.,
```julia
using StatsBase: sample  # sample comes from StatsBase

for i in 1:n
    sample_idx = sample(1:length(data), batch_size; replace=false)
    # apply process to data[sample_idx]
end
```
It seems like you should write a fourth function that just operates on a single data set:
```julia
function process(data::T) # T is your type of interest
    # apply operation to data
end
```
and then write the other three in terms of that:
```julia
function process_data(data::T, n)
    for _ in 1:n
        process(data)
    end
end

function process_data(data_sets::AbstractVector{T}, n)
    for i in 1:n
        process(data_sets[i])
    end
end

function process_data(data::T, n, k)
    for i in 1:n
        # randomly subsample k elements from data, e.g. with StatsBase:
        sampled_data = sample(data, k; replace=false)
        process(sampled_data)
    end
end
```
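Taking this one step further (my own sketch, not the poster’s code: process here is a hypothetical stand-in, since the real operation isn’t shown), all three cases can funnel through a single fully coded loop over an iterator of data sets:

```julia
using StatsBase: sample

process(d) = sum(d)  # hypothetical stand-in for the real per-data-set operation
process_all(data_sets) = foreach(process, data_sets)

data = collect(1:10); n = 4; k = 3
process_all(Iterators.repeated(data, n))  # case 1: the same data set, n times
process_all([1:10, 21:30])                # case 2: a vector of distinct data sets
process_all(view(data, sample(eachindex(data), k; replace=false)) for _ in 1:n)  # case 3: fresh random subsamples
```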
The above advice might change significantly once more background details are provided, though.