How to sample a DataFrame

How can I sample a DataFrame,
like the data.sample() method in Python?
NB: I need the replace option (sampling with replacement).

You can just use random row indices, like:

julia> using DataFrames, Random

julia> df = DataFrame(a = 1:10, b = rand(10))
10×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.180922 │
│ 2   │ 2     │ 0.726072 │
│ 3   │ 3     │ 0.802304 │
│ 4   │ 4     │ 0.769662 │
│ 5   │ 5     │ 0.705299 │
│ 6   │ 6     │ 0.266686 │
│ 7   │ 7     │ 0.332831 │
│ 8   │ 8     │ 0.393075 │
│ 9   │ 9     │ 0.1936   │
│ 10  │ 10    │ 0.830922 │

julia> df[shuffle(1:nrow(df))[1:5], :]
5×2 DataFrame
│ Row │ a     │ b        │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 7     │ 0.332831 │
│ 2   │ 8     │ 0.393075 │
│ 3   │ 1     │ 0.180922 │
│ 4   │ 5     │ 0.705299 │
│ 5   │ 9     │ 0.1936   │

The shuffle function returns a random ordering of the range from 1 to the number of rows of your dataframe, which you can then index with [1:x] where x is the number of samples you want.
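The same idea can be wrapped in a small function. This is only a sketch: sample_rows is a hypothetical helper name, not part of DataFrames, and it uses randperm (equivalent here to shuffling 1:nrow(df)). Because it draws rows without replacement, it cannot return more rows than df has.

```julia
using DataFrames, Random

# Hypothetical helper: draw k distinct rows (no replacement) from df.
function sample_rows(df::AbstractDataFrame, k::Integer;
                     rng::AbstractRNG = Random.default_rng())
    k <= nrow(df) || throw(ArgumentError("cannot draw $k distinct rows from $(nrow(df))"))
    idx = randperm(rng, nrow(df))[1:k]   # first k entries of a random permutation
    return df[idx, :]
end

df = DataFrame(a = 1:10, b = rand(10))
s = sample_rows(df, 5; rng = MersenneTwister(42))   # seed chosen arbitrarily
```

Passing an rng is optional; it only makes the draw reproducible.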

Alternatively, there are ML/stats packages that implement their own way of splitting data into train and test data, like MLJ or Turing - check their docs if that’s of interest.

I need to draw 1,000 samples from 100 rows of data.

I’m not sure I understand - do you want to sample 100 rows from a 1,000 row DataFrame? Or do you want to draw 1,000 samples of length 100 from a larger data set? My suggestion above can work in both cases. Can you clarify what you’re looking for (and what isn’t working for you), ideally by way of a minimal working example?

Yes, I want 1,000 samples from a data set of length 100.

Okay, to adapt my example from above, you have a length-100 data set:

df = DataFrame(a = 1:100, b = rand(100))

Now we can get 1,000 random samples from this - I’m assuming each sample has length 10 here:

samples = [df[shuffle(1:nrow(df))[1:10], :] for _ in 1:1_000]

samples is now a vector of length 1,000 which holds a 10-row random sample of your original data set in each location.
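As a rough illustration of what you might do with that vector, here is a sketch that computes one statistic per sample. The choice of the mean of column b is just an assumption for the example; any per-sample summary would work the same way.

```julia
using DataFrames, Random, Statistics

df = DataFrame(a = 1:100, b = rand(100))
samples = [df[shuffle(1:nrow(df))[1:10], :] for _ in 1:1_000]

# One mean of column b per 10-row sample
sample_means = [mean(s.b) for s in samples]

# Center and spread of the 1,000 sample means
println("mean of means: ", mean(sample_means))
println("std of means:  ", std(sample_means))
```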

Or if you’d like to sample 1,000 rows with replacement:

df[rand(1:nrow(df),1000),:]
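If the goal is many resamples rather than one, the same indexing can go inside a comprehension. The sketch below assumes each resample should have as many rows as the original data (the usual bootstrap convention) and uses an arbitrarily seeded MersenneTwister only to make the draws reproducible.

```julia
using DataFrames, Random

df = DataFrame(a = 1:100, b = rand(100))
rng = MersenneTwister(2024)   # arbitrary seed, only for reproducibility

# 1,000 resamples; each draws nrow(df) row indices with replacement,
# so the same row can appear more than once within a resample
resamples = [df[rand(rng, 1:nrow(df), nrow(df)), :] for _ in 1:1_000]
```

Change the second nrow(df) to another number if each resample should have a different length.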

I imagine you are trying to bootstrap data. In addition to the solutions given here, see if Bootstrap.jl is a package that works for you.

DependentBootstrap will also work here. One of the options is an iid bootstrap, which will do what the OP wants, i.e.:

using DependentBootstrap
dbootdata(mydataframe, numresample=1000, bootmethod=:iid)

will return a vector of length 1000 where each element is a resampled DataFrame.

This is how I split my DataFrame into “training” and “testing”:

using DataFrames, Random

function createTrainTest(df::DataFrame, prop=0.5, randomseed=1234)
    # Empty DataFrames with the same columns as df
    df_training = similar(df, 0)
    df_testing  = similar(df, 0)

    # Now split the df into df_training and df_testing
    df_size = size(df, 1)
    training_proportion = prop
    trainingsize = round(Int, df_size * training_proportion)

    # Create a seeded random permutation of the row indices
    randvec = randperm!(MersenneTwister(randomseed),
                        Vector{Int64}(undef, df_size))

    # The first trainingsize shuffled rows go to training, the rest to testing
    for k in axes(df, 1)
        push!(k ≤ trainingsize ? df_training : df_testing,
              df[randvec[k], :])
    end
    return (df_training, df_testing)
end

If you want 1000 samples with each sample having 100 rows then just change the trainingsize to a fixed value of 100 and call the above function 1000 times.

PS: do not forget to use a different randomseed each time!
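One way to avoid managing 1,000 separate seeds is to seed a single random number generator once and pass it to every call, since its state advances with each draw. The following is a sketch of that idea, not a change to the function above: split_rows is a hypothetical helper, and it takes the RNG and a fixed number of training rows as arguments.

```julia
using DataFrames, Random

# Hypothetical helper: split df into n_train shuffled rows and the remaining rows.
function split_rows(rng::AbstractRNG, df::AbstractDataFrame, n_train::Integer)
    perm = randperm(rng, nrow(df))
    return df[perm[1:n_train], :], df[perm[n_train+1:end], :]
end

df = DataFrame(a = 1:1000, b = rand(1000))
rng = MersenneTwister(1234)   # seeded once

# The rng state advances with every call, so each split differs,
# yet the whole sequence is reproducible from the single seed
splits = [split_rows(rng, df, 100) for _ in 1:1_000]
```

Each element of splits is a (training, testing) tuple with 100 and 900 rows respectively.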

using StatsBase: sample
using DataFrames

df = DataFrame(a = 1:1000)

sample_rows = sample(1:nrow(df), 100, replace=false)

df_sample = df[sample_rows, :]

test_rows = setdiff(1:nrow(df), sample_rows)

df_test = df[test_rows, :]