Best way to store subsets of a sample

Hello everyone,

I would like to store subsamples from an initial sample of data, perform different operations according to the subsample, and then return outputs in the same order than the initial inputs (** EDIT: with Julia Base types only, ie. Without DataFrames for example**). For now, what I’m doing is something like this:

  1. In the main function, initialize an empty Simple Algorithm struct
  2. Based on the X and Y input, create a Subset Sample and store it in Simple Algorithm struct
  3. Perform some subsampling to create Subset A, and store it in the Simple Algorithm struct
  4. Perform a particular computation for Subset A and store the results in the Subset A
  5. Create a subset B with all observations non-A and store it in the Simple Algorithm struct
  6. Perform some computation on Subset B and store the results in the Subset B
  7. Try to unite results from subset A and B and to sort it with a weird function in order to get the results in the same order than the initial inputs X and Y.

I think my implementation is a bit ugly, and is prone to errors. My questions would be:

  • Is there a Julia Base Type that you would suggest I use rather than this “Subset” one in order to robustly track the index of each observations in the case of subsampling?
  • Would you think that storing subsamples in another struct like this “SimpleAlgorithm” one is a good practice?

EDIT: I need to stay with Julia Base Type in order to stay close to the rest of the functions in the package I would like to contribute, with only Matrix or Vector allowed as inputs and giving matrix or vector as outputs.

Thank you in advance for your feedback / help.

Please find below a MWE:



Base.@kwdef mutable struct Subset
    """ Will store the subsample"""
    X::Union{Matrix,Vector} = []
    Y::Union{Matrix,Vector} = []
    indexobs::Vector{Int64} = []
    Results::Matrix = Matrix(zeros(2,2)) # This is only in order to store results for this subset later
end

Base.@kwdef mutable struct SimpleAlgorithm 
    """A collection of subsamples, the initial sample and the final results"""
    Sample::Subset = Subset()
    SubsetA::Subset = Subset()
    SubsetB::Subset = Subset()
    Results::Matrix = Matrix(zeros(2,2)) # This is only in order to store results for this subset later
end

function someselection!(simple_algorithm::SimpleAlgorithm)
    """ Performs simple subsampling """
    simple_algorithm.SubsetA = Subset(X = simple_algorithm.Sample.X[1:6,:],
                      Y = simple_algorithm.Sample.Y[1:6,:],
                      indexobs = simple_algorithm.Sample.indexobs[1:6])
end


function excludingsubset(sample::Subset, subset::Subset)
    """ Take the rest of the observations unselected by the someselection! function """
    index_to_keep = findall(x -> !(x in subset.indexobs), sample.indexobs)
    return Subset(X = sample.X[index_to_keep, :], Y = sample.Y[index_to_keep,:], indexobs = sample.indexobs[index_to_keep])
end


function somecalculationsforA!(simple_algorithm::SimpleAlgorithm)
    """ Perform calculation specific to A """
    simple_algorithm.SubsetA.Results = simple_algorithm.SubsetA.X .* 2
end

function somecalculationsforB!(simple_algorithm::SimpleAlgorithm)
    simple_algorithm.SubsetB.Results = simple_algorithm.SubsetB.X .* 3
end

function formatresults!(simple_algorithm::SimpleAlgorithm)
    """Objective: format the results in order to appear in the same order than the initial inputs X and Y"""
    results = vcat(simple_algorithm.SubsetA.Results, simple_algorithm.SubsetB.Results)
    results_index = vcat(simple_algorithm.SubsetA.indexobs, simple_algorithm.SubsetB.indexobs)
    results_matrix = hcat(results, results_index)
    results_matrix = results_matrix[sortperm(results_matrix[:, size(results,2)+1], rev = false), :]
    simple_algorithm.Results = results_matrix[:,1:size(results,2)]
end


function somemodel(X::Union{Vector,Matrix}, Y::Union{Vector,Matrix})
    simple_algorithm = SimpleAlgorithm()
    simple_algorithm.Sample = Subset(X = X, Y = Y, indexobs = [i for i in 1:size(X,1)])
    someselection!(simple_algorithm)
    somecalculationsforA!(simple_algorithm)
    simple_algorithm.SubsetB = excludingsubset(simple_algorithm.Sample, simple_algorithm.SubsetA)
    somecalculationsforB!(simple_algorithm)  
    formatresults!(simple_algorithm)
    return simple_algorithm
end

X = [5 13; 16 12; 16 26; 17 15; 18 14; 23 6; 25 10; 27 22; 37 14; 42 25; 5 17];

Y = [12; 14; 25; 26; 8; 9; 27; 30; 31; 26; 12];

results = somemodel(X, Y)
results.Results

Maybe you could use DataFrames.jl and one of the query packages DataFramesMeta.jl or Query.jl? DataFrames are build for tasks like this.

Here is a solution with just DataFrames. Notice that I used view=true. Hence, the original dataset is changed as well, if you don’t want that, you can remove that (or set view=false). Then the original data stays unaltered.

using DataFrames
df = DataFrame(
        X=[[5, 13], [16, 12], [16, 26], [17, 15], [18, 14], [23, 6], [25, 10], [27, 22], [37, 14], [42, 25], [5, 17]],
        Y=[12, 14, 25, 26, 8, 9, 27, 30, 31, 26, 12])

df[!,:id] = 1:nrow(df)
df[!,:sel] = zeros(Int64,11)
df.sel[1:6] .= 1

A = filter(row -> row.sel == 1, df, view=true)
B = filter(row -> row.sel == 0, df, view=true)

A.X .*= 2
B.X .*= 3

df = vcat(A,B)
sort!(df, [:id])

result = df.X

If you selection is always very easy, you could also do this instead of adding an extra column for the selection

A = df[1:6,:]
B = df[7:end,:]
A.X .*= 2
B.X .*= 3

Thank you for your answer.
I’m really sorry I was far from precise enough in my initial question. I need to perform an algorithm as a contribution for a package with no support for DataFrames currently…
But if there is no cleaner solution without using such format, I would be more than happy to use the solution you propose.
Indeed, even if the MWE « works », I’m really concern about its lack of robustness.

I think the problem is, that in principle you want to operate on both, the rows (subsamples) and colums (computation).

So, the best dataformat would be one which gives easy access to both of these things. If you cannot use other packages, how about a simple matrix for all your data?
Here, the first column stored the id. This allows you to easy carry the id with you throughout the computations.

X = [5 13; 16 12; 16 26; 17 15; 18 14; 23 6; 25 10; 27 22; 37 14; 42 25; 5 17];
Y = [12; 14; 25; 26; 8; 9; 27; 30; 31; 26; 12];
data = [1:n X Y]

A = data[1:2:end,:]
B = data[2:2:end,:]

result_A = [A[:,1] A[:,2:3] .* 2]
result_B = [B[:,1] B[:,2:3] .* 2]

result = vcat(result_A,result_B)
sort!(result, dims=1)
result

2 Likes

Maybe useful for you is the command Not

inds = 1:2:n
A = data[inds,:]
B = data[Not(inds),:]
2 Likes

Thank you so much!
I don’t know why I didn’t think that way, it’s far more understandable / quicker to think like this.
I will reframe my algo based on this.
Thank you again.

1 Like

Note that you can also take views on matrices allowing to skip the store result step.

3 Likes