Why does shuffling rows change the estimates?

jzr · December 23, 2020, 11:14pm

This is a small model with fake data in a linear regression. Unexpectedly, shuffling the rows of the dataframe changes the estimates.

import Random
using DataFrames
using Pipe: @pipe
using StatsBase: StatsBase, mad, median, percentile, sample, shuffle

using Distributions
using Turing

Random.seed!(1)
##

df, parameters = let
    β = (
        β_0 = 7,
        β_1 = 0.05,
        β_2 = 0.10,
        β_3 = 0.15,
        β_4 = 0.20,
    )
    σ = 2
    parameters = (β..., σ=σ)
    parameters_df = DataFrame(
        parameter=collect(keys(parameters)),
        value=collect(values(parameters)),
    )

    N = 100_000
    X = DataFrame(
        x_0=fill(1, N),
        x_1=rand([1,2], N),
        x_2=rand(1:1:10, N),
        x_3=rand([0,1], N),
        x_4=rand(0:1:10, N),
    )
    df = transform(X)
    μ = Matrix(df) * collect(β)
    ϵ = rand(Normal(0, σ), N)
    y = μ .+ ϵ

    (df = transform(
        X,
        [] => (() -> μ) => :μ,
        [] => (() -> ϵ) => :ϵ,
        [] => (() -> y) => :y,
        ),
     parameters = parameters_df,    )
end

##
@model function linear_outcome_model(X, y)
    β_0 ~ Normal(mean(y), 2std(y))
    β_1 ~ Normal(0, .5)
    β_2 ~ Normal(0, .5)
    β_3 ~ Normal(0, .5)
    β_4 ~ Normal(0, .5)

    μ = (
            β_0 .* X.x_0
            .+ β_1 .* X.x_1
            .+ β_2 .* X.x_2
            .+ β_3 .* X.x_3
            .+ β_4 .* X.x_4

    )
    σ ~ truncated(Normal(0, 2std(y)), 0, Inf)
    y ~ MvNormal(μ, σ)
end


fields = [
    :x_0,
    :x_1,
    :x_2,
    :x_3,
    :x_4,

]

samples_original = @pipe df |>
linear_outcome_model(_[:, fields], _.y) |>
sample(_, NUTS(0.65), 3000)

samples_shuffled = @pipe df |>
DataFrame(shuffle(eachrow(_))) |>
linear_outcome_model(_[:, fields], _.y) |>
sample(_, NUTS(0.65), 3000)



function compare_samples(samples)
    @pipe DataFrame(summarize(samples, mean, std)) |>
    rename(_, :parameters => :parameter) |>
    innerjoin(_, parameters, on=:parameter)
end


comparison_original = compare_samples(samples_original)

#comparison_original
#6×4 DataFrame
# Row │ parameter  mean       std         value 
#     │ Symbol     Float64    Float64     Real  
#─────┼─────────────────────────────────────────
#   1 │ β_0        6.97052    0.0266058       7
#   2 │ β_1        0.0417781  0.0130792    0.05
#   3 │ β_2        0.101241   0.00227608    0.1
#   4 │ β_3        0.170157   0.012559     0.15
#   5 │ β_4        0.203193   0.00201488    0.2
#   6 │ σ          1.99352    0.00446498      2

comparison_shuffled = compare_samples(samples_shuffled)

# 6×4 DataFrame
#  Row │ parameter  mean         std         value 
#      │ Symbol     Float64      Float64     Real  
# ─────┼───────────────────────────────────────────
#    1 │ β_0         8.70263     0.0270448       7
#    2 │ β_1        -0.00431323  0.0133114    0.05
#    3 │ β_2        -0.00108792  0.00232939    0.1
#    4 │ β_3         0.0106373   0.0134052    0.15
#    5 │ β_4        -0.00122068  0.00215966    0.2
#    6 │ σ           2.11683     0.0047302       2

cpfiffer · December 24, 2020, 4:34am

Try setting a seed with Random.seed! first – I think this is a seeding thing and has nothing to do with shuffling.

jzr · December 24, 2020, 4:50am

Unfortunately that doesn’t do it. I still get wrong results when shuffling and right results otherwise.

jzr · December 25, 2020, 7:15am

Turns out I was using @pipe wrong. Switching to @chain solves the problem.

Topic		Replies	Views
Why is `sort(x, by = _ -> rand())` not a good shuffler? General Usage	2	357	February 19, 2022
Shuffle values and sample indices of a sparse matrix Performance question , python , sparse	15	1301	March 23, 2021
Shuffle columns of a matrix General Usage question	8	93	June 30, 2025
Turing.jl run not reproducible! Probabilistic Programming turing , reproducibility	17	2551	October 10, 2022
How to permute the rows of a DataFrame in-place efficiently? Data performance , dataframes	11	6732	February 20, 2018

Why does shuffling rows change the estimates?

Related topics