Hi all, I found that the following chunk is the major bottleneck in my code. For each “id” (subject), it uses pipelines to filter out the data for that “id”, creates the columns
object, and then build the MyObs
type.
Is this approach the most efficient out there? Any suggestions for improvement?
The @profile
result is too long to paste here, but I found that |>
, iterate
(from TableOperations.jl), columns
and buildcolumns
have very large counts next to them.
I attached a reproducible example at the end.
for (i, id) in enumerate(subset_id)
obsvec[i] = datatable_cols |>
TableOperations.filter(x -> Tables.getcolumn(x, :id) == id) |>
Tables.columns |>
myobs(feformula, reformula)
end
Reproducible example:
using LinearAlgebra, Random, StatsModels, DataFrames, CSV, JuliaDB, TableOperations, Tables, Profile
############ Preparation ############
# define type and constructors
struct MyObs{T <: LinearAlgebra.BlasReal}
y::Vector{T}
X::Matrix{T}
Z::Matrix{T}
xty::Vector{T}
zty::Vector{T}
end
function MyObs(
y::Vector{T},
X::Matrix{T},
Z::Matrix{T}
) where T <: LinearAlgebra.BlasReal
xty = transpose(X) * y
zty = transpose(Z) * y
MyObs{T}(y, X, Z, xty, zty)
end
function myobs(data_obs, feformula::FormulaTerm, reformula::FormulaTerm)
y, X = StatsModels.modelcols(feformula, data_obs)
Z = StatsModels.modelmatrix(reformula, data_obs)
return MyObs(y, X, Z)
end
myobs(feformula::FormulaTerm, reformula::FormulaTerm) = data_obs -> myobs(data_obs, feformula, reformula)
# simulate a dataset and save it to csv
Random.seed!(1)
reps = 20
N = 10^6
p = 5
q = 2
X = Matrix{Float64}(undef, N*reps, p)
X[:, 1] = ones(N*reps)
@views randn!(X[:, 2:p])
rand_intercept = zeros(N*reps)
for j in 1:N
rand_intercept[(reps * (j-1) + 1) : reps * j] .= Random.randn(1)
end
y = X * ones(p) + # fixed effects
rand_intercept + # random intercept, standard normal
Random.randn(N*reps) # error, standard normal
id = repeat(1:N, inner = reps)
dat = DataFrame(hcat(id, y, X))
rename!(dat, Symbol.(["id", "y", "x1", "x2", "x3", "x4", "x5"]))
CSV.write("testfile.csv", dat)
############ Running ############
# load data
datatable = JuliaDB.loadtable("testfile.csv"; distributed = false)
feformula = @formula(y ~ 1 + x1 + x2 + x3 + x4 + x5)
reformula = @formula(y ~ 1)
feformula = apply_schema(feformula, schema(feformula, datatable))
reformula = apply_schema(reformula, schema(reformula, datatable))
datatable_cols = Tables.columns(datatable)
subset_size = 1000
obsvec = Vector{MyObs{Float64}}(undef, subset_size)
subset_id = [1:1:subset_size;]
Profile.clear()
# This is the bottle neck
@profile for (i, id) in enumerate(subset_id)
obsvec[i] = datatable_cols |>
TableOperations.filter(x -> Tables.getcolumn(x, :id) == id) |>
Tables.columns |>
myobs(feformula, reformula)
end
Profile.print(format = :flat)