Segmentation fault when applying PCA to a big dataset on an HPC

I understand the following may be due to the fact that I am using an HPC remote service. If that’s the case, I would appreciate comments on where to ask this question best.

I have a big array (2765 x 12212224), and I want to reduce its dimensionality using PCA. Can’t do it on my personal machine because it runs out of memory, so I am trying to do it on a remote computer cluster.

The following code runs well:

using DataFrames, Serialization, Parameters, JLD, Images, FileIO, MappedArrays, MLJ, MLJMultivariateStatsInterface # some packages aren't necessary for this MWE, but these are the ones I am using for my project

X = DataFrame(rand(100, 100), :auto) #######################

begin
   PCA = MLJ.@load PCA 
   pca = PCA(maxoutdim=50) 
   mach = machine(pca, X) 
   fit!(mach, verbosity = 1) 
   transformed = MLJ.transform(mach, X)
   r = report(mach)
end

JLD.save(string(pwd(), "/Data/data_reduced.jld"), "pca_data", transformed, "report", r)

However, when I change the definition of X (the line with many ‘#’ above) to make it have the dimensions of my array (2765x12212224), I run into a segmentation fault. I am unfamiliar with this kind of error, but some online reading suggested that it is related to attempts to access restricted memory spaces. How can this be dependent on the size of the array I am working with? What other thing can it mean?

For some context on my project: I want to analyse a set of images and so far everything I’ve tried has either crashed, failed, or run forever. The large number here (12212224) is the number of pixels in each image, which I have reshaped into vectors.

Some references on using PCA to reduce the dimensionality of image processing problems:

Can you run this even on your local machine using incremental PCA? I just made public a very old package from my lab. Use it at your own risk (no real docs provided, sorry), and PRs would be welcome as long as they don’t break our internal code.

1 Like

I’m not sure what the cause of the seg fault, but for diagnosis I suggest you start with:

  1. Instead of a DataFrame, wrap your matrix as a table using MLJ.table(mat) (or Tables.table(mat))
  2. If that still gives seg fault, try using MultivariateStats.jl directly without MLJ interface. In this case you call directly on a matrix, but with columns as observations. You do this with something like:
using MultivariateStats
mat = rand(20, 10000)
theta = fit(PCA, mat, pratio=0.99, maxoutdim=10)
transform(theta, mat)

By the way, I see the the MLJ interface computes a matrix transpose (where I reckon it ought to compute an adjoint) which means an extra copy of your data.