Segmentation fault when applying PCA to a big dataset on an HPC

Ivan · February 22, 2022, 10:38am

I understand the following may be due to the fact that I am using an HPC remote service. If that’s the case, I would appreciate comments on where to ask this question best.

I have a big array (2765 x 12212224), and I want to reduce its dimensionality using PCA. Can’t do it on my personal machine because it runs out of memory, so I am trying to do it on a remote computer cluster.

The following code runs well:

using DataFrames, Serialization, Parameters, JLD, Images, FileIO, MappedArrays, MLJ, MLJMultivariateStatsInterface # some packages aren't necessary for this MWE, but these are the ones I am using for my project

X = DataFrame(rand(100, 100), :auto) #######################

begin
   PCA = MLJ.@load PCA 
   pca = PCA(maxoutdim=50) 
   mach = machine(pca, X) 
   fit!(mach, verbosity = 1) 
   transformed = MLJ.transform(mach, X)
   r = report(mach)
end

JLD.save(string(pwd(), "/Data/data_reduced.jld"), "pca_data", transformed, "report", r)

However, when I change the definition of X (the line with many ‘#’ above) to make it have the dimensions of my array (2765x12212224), I run into a segmentation fault. I am unfamiliar with this kind of error, but some online reading suggested that it is related to attempts to access restricted memory spaces. How can this be dependent on the size of the array I am working with? What other thing can it mean?

For some context on my project: I want to analyse a set of images and so far everything I’ve tried has either crashed, failed, or run forever. The large number here (12212224) is the number of pixels in each image, which I have reshaped into vectors.

Some references on using PCA to reduce the dimensionality of image processing problems:

tim.holy · February 22, 2022, 11:41am

Can you run this even on your local machine using incremental PCA? I just made public a very old package from my lab. Use it at your own risk (no real docs provided, sorry), and PRs would be welcome as long as they don’t break our internal code.

ablaom · February 22, 2022, 7:15pm

I’m not sure what the cause of the seg fault, but for diagnosis I suggest you start with:

Instead of a DataFrame, wrap your matrix as a table using MLJ.table(mat) (or Tables.table(mat))
If that still gives seg fault, try using MultivariateStats.jl directly without MLJ interface. In this case you call directly on a matrix, but with columns as observations. You do this with something like:

using MultivariateStats
mat = rand(20, 10000)
theta = fit(PCA, mat, pratio=0.99, maxoutdim=10)
transform(theta, mat)

By the way, I see the the MLJ interface computes a matrix transpose (where I reckon it ought to compute an adjoint) which means an extra copy of your data.

Topic		Replies	Views
Recommendations for larger than memory data General Usage big-data	4	152	May 15, 2025
Segmentation Fault working with complex sparse arrays General Usage question	2	36	August 9, 2025
Julia run using terminal for 1GB dataset showing out of memory error General Usage question	18	5006	August 31, 2017
Unexpected OutOfMemory error on HPC General Usage hpc , memory , clustering	17	1701	April 8, 2020
Julia Execution get out of memory error General Usage	3	4689	August 5, 2017

Segmentation fault when applying PCA to a big dataset on an HPC

Related topics