I understand the following may be due to the fact that I am using a remote HPC service. If so, I would appreciate pointers on where best to ask this question.
I have a large array (2765 × 12212224) whose dimensionality I want to reduce with PCA. It doesn't fit in memory on my personal machine, so I am trying to run the analysis on a remote computer cluster.
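For scale, a rough back-of-the-envelope estimate of the matrix's memory footprint (assuming dense Float64 storage, which is what `rand` produces):

```julia
# Dense Float64 matrix: 8 bytes per element.
nrows, ncols = 2765, 12_212_224
bytes = nrows * ncols * 8
gb = bytes / 1e9
println(gb)  # roughly 270 GB for the full matrix, before any copies are made
```

So even before PCA touches it, the raw matrix alone is on the order of hundreds of gigabytes.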
The following code runs well:
using DataFrames, Serialization, Parameters, JLD, Images, FileIO, MappedArrays, MLJ, MLJMultivariateStatsInterface # some packages aren't necessary for this MWE, but these are the ones I am using for my project
X = DataFrame(rand(100, 100), :auto) #######################
begin
    PCA = MLJ.@load PCA
    pca = PCA(maxoutdim = 50)
    mach = machine(pca, X)
    fit!(mach, verbosity = 1)
    transformed = MLJ.transform(mach, X)
    r = report(mach)
end
JLD.save(joinpath(pwd(), "Data", "data_reduced.jld"), "pca_data", transformed, "report", r)
However, when I change the definition of X (the line marked with many '#' above) so that it has the dimensions of my actual array (2765 × 12212224), I get a segmentation fault. I am unfamiliar with this kind of error, but some online reading suggests it is related to attempts to access restricted memory. How can this depend on the size of the array I am working with? What else could it mean?
For some context on my project: I want to analyse a set of images, and so far everything I've tried has either crashed, failed, or run forever. The large number here (12212224) is the number of pixels in each image, each of which I have reshaped into a vector.
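In case it helps, here is a minimal sketch of that reshaping step, using tiny synthetic stand-in images rather than my real data:

```julia
# Each 2-D image is flattened (column-major) into one row of the data matrix X.
imgs = [rand(Float32, 4, 4) for _ in 1:3]   # stand-ins for the real images
npix = length(imgs[1])                      # 16 here; 12_212_224 in my case
X = reduce(vcat, (reshape(img, 1, npix) for img in imgs))
size(X)  # (3, 16): one row per image, one column per pixel
```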
Some references on using PCA to reduce the dimensionality of image processing problems: