Segmentation fault when applying PCA to a big dataset on an HPC

I’m not sure what the cause of the seg fault is, but for diagnosis I suggest you start with:

  1. Instead of a DataFrame, wrap your matrix as a table using MLJ.table(mat) (or Tables.table(mat)).
  2. If that still gives a seg fault, try using MultivariateStats.jl directly, without the MLJ interface. In that case you call it on a plain matrix, but with columns as observations. You do this with something like:
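using MultivariateStats
mat = rand(20, 10000)  # 20 features x 10000 observations (columns are observations)
theta = fit(PCA, mat; pratio=0.99, maxoutdim=10)  # keep at most 10 components
transform(theta, mat)  # project the data (recent MultivariateStats releases name this `predict`)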
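
For step 1, here is a minimal sketch of the wrapping, assuming Tables.jl is installed. Note that for a table the rows are the observations, unlike the direct MultivariateStats call above, so the matrix here is laid out as observations by features. The names `X`, `tbl`, `colnames`, `ncols` and `nrows` are only illustrative:

```julia
using Tables

# Illustrative data: 10_000 observations (rows) by 20 features (columns).
X = rand(10_000, 20)

# Wrap the matrix as a Tables.jl-compatible table instead of building a DataFrame.
tbl = Tables.table(X)

# Column names are generated automatically; check the shape seen through the table interface.
colnames = Tables.columnnames(tbl)
ncols = length(colnames)
nrows = length(Tables.getcolumn(tbl, 1))
```

You would then pass `tbl` to the MLJ machine wherever you were passing the DataFrame.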

By the way, I see that the MLJ interface computes a matrix transpose (where I reckon it ought to compute an adjoint), which means an extra copy of your data.
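
To show why that distinction matters for memory, here is a small sketch in plain Julia of a lazy adjoint versus a materialized transpose. It only illustrates the general difference and does not reproduce what the MLJ interface does internally; `lazy` and `copied` are illustrative names:

```julia
using LinearAlgebra

mat = rand(10_000, 20)

# `mat'` builds a lazy Adjoint wrapper: no data is copied.
lazy = mat'

# `permutedims` materializes a new 20 x 10_000 Matrix: a full extra copy.
copied = permutedims(mat)

# The wrapper still points at the original array; the copy does not.
shares_memory = parent(lazy) === mat
same_values = lazy == copied   # identical entries, since the matrix is real
```

For a dataset large enough to strain memory on a node, that second allocation can be the difference between fitting and not fitting.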