OutOfMemory() when loading a database of images

I am trying to work with a database of images (2767 images in total), but I run into an OutOfMemory() error when loading the data. This is how I am doing it:

path = string(pwd(), "/Data")
list = readdir(path, join = true)
imgs = load.(list)

If I change the last line to imgs = load.(list[1:100]) it works as desired, but then I get the OutOfMemory() error when I do imgs_2 = load.(list[101:200]).

I have seen this topic discussing a similar issue, and tried the solution proposed by Sukera on Nov 19 (memory mapping), but when trying to implement it, I got an error saying something like the IOStream could not read RGB nor N0f8 values.

What is the preferred way of loading data like this? My images are divided into different subfolders according to the image labels, and I would like to reconstruct a file with a single data frame containing all pictures and their labels.

  1. How can I -sequentially- load all my images to push them into the final data frame?
  2. What file type would be the most appropriate? I was thinking of a CSV file with a table, where one column “Image” contains the pixel arrays as elements (eltype Matrix{RGB{N0f8}}).
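One possible starting point for the table, sketched under the assumption that each image sits in a subfolder named after its label: store the file path and the label rather than the pixels, and load pixels only when needed. The function name label_index and the folder names in the demo are hypothetical, and only Base functions are used.

```julia
# Build a list of (path, label) rows, taking the label from the name of the
# folder that directly contains each file. Files placed directly in `root`
# are skipped because they have no label folder.
function label_index(root::AbstractString)
    rows = NamedTuple{(:path, :label),Tuple{String,String}}[]
    for (dir, _, files) in walkdir(root)
        dir == root && continue
        for file in files
            push!(rows, (path = joinpath(dir, file), label = basename(dir)))
        end
    end
    return rows
end

# Demo on a throwaway directory tree with empty placeholder files.
root = mktempdir()
for (label, name) in [("cats", "a.png"), ("cats", "b.png"), ("dogs", "c.png")]
    mkpath(joinpath(root, label))
    touch(joinpath(root, label, name))
end

index = label_index(root)
foreach(println, index)
```

A vector of named tuples like this can be written to a small CSV file or wrapped in a data frame, since it holds only strings and no pixel data.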

How large are they in total? Try running this:

total_size = sum(filesize, list)

julia> total_size = sum(filesize, list)
11328360809

where list is a 2767-element Vector containing the full path to all images.
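To put that number in context, here is a rough sketch that converts the byte count to GiB and compares it with the RAM that Julia reports through Sys.total_memory(). The helper name to_gib is made up for illustration.

```julia
# Convert a byte count to GiB (1 GiB = 2^30 bytes).
to_gib(bytes) = bytes / 2^30

total_size = 11_328_360_809          # value reported in the thread
size_gib = to_gib(total_size)        # size of the files on disk
ram_gib = to_gib(Sys.total_memory()) # physical RAM on this machine

println("images on disk: ", round(size_gib, digits = 2), " GiB")
println("physical RAM:   ", round(ram_gib, digits = 2), " GiB")
println("fits in RAM as stored on disk? ", size_gib < ram_gib)
```

This only compares the on-disk size. Decoded images usually take more memory than their files, so the comparison is a lower bound.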

The easiest solution is probably via MappedArrays.jl

using FileIO        # provides load
using MappedArrays

# End the line with ; in the REPL so that displaying the result does not trigger loading.
imgs = mappedarray(load, list);

But this is lazy-loading, so if you do, say, imgs[1] multiple times, you load the same image multiple times from the disk.
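If the repeated disk reads become a problem, one option is to wrap the loader in a small cache. The following is only a sketch: the name cached and the crude "empty the cache when it is full" eviction are my own choices, and the demo uses Base's read on a temporary file as a stand-in for load so that it runs without extra packages.

```julia
# Wrap any loader function so that repeated requests for the same path
# are served from memory. At most `maxitems` results are kept; when the
# limit is reached the whole cache is dropped to keep memory bounded.
function cached(loader; maxitems::Int = 8)
    cache = Dict{String,Any}()
    return function (path::AbstractString)
        haskey(cache, path) && return cache[path]
        length(cache) >= maxitems && empty!(cache)
        value = loader(path)
        cache[path] = value
        return value
    end
end

# Demo: count how often the underlying loader actually runs.
calls = Ref(0)
counting_read(path) = (calls[] += 1; read(path))

file = joinpath(mktempdir(), "a.bin")
write(file, UInt8[1, 2, 3])

cached_read = cached(counting_read)
first_read = cached_read(file)
second_read = cached_read(file)   # served from the cache, no disk access
println("loader calls: ", calls[])
```

With real images, something like mappedarray(cached(load), list) should combine both ideas, but keep maxitems small on a machine with little RAM.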

Also, I’d recommend FileTrees.jl (shashi/FileTrees.jl on GitHub), which makes parallel file processing easy for complicated folder structures. It also has a built-in lazy-loading strategy, but I’ve never tried it.

This did it! Thanks a lot.

So you’re trying to load about 10 GB of images into RAM, and you probably have an 8 GB or 16 GB computer? I believe the data would inflate further once the files are decoded into matrices of RGB pixels, so at any rate you shouldn’t try to load O(10 GB) of images into RAM.
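As a back-of-the-envelope check on that inflation, the sketch below estimates the decoded size. It assumes every pixel is stored as an RGB{N0f8}, which takes 3 bytes (one per channel), and it uses hypothetical image dimensions, since the thread does not state the real ones.

```julia
# Memory needed for one decoded image: height * width pixels,
# 3 bytes per pixel for an RGB image with 8-bit channels.
decoded_bytes(height, width; bytes_per_pixel = 3) = height * width * bytes_per_pixel

# Hypothetical dimensions, not taken from the thread.
height, width = 3000, 4000
n_images = 2767                      # number of images in the question

per_image = decoded_bytes(height, width)
total = n_images * per_image

println("per image:  ", round(per_image / 2^20, digits = 1), " MiB")
println("all images: ", round(total / 2^30, digits = 1), " GiB")
```

Compressed formats such as JPEG or PNG store far fewer bytes than this on disk, which is why the decoded total can be several times the file size.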

Yes, I’m on an 8 GB laptop. I did think about that, but I don’t come from a computer science background and didn’t really know how to investigate a better method.

Can you solve your problem while only loading one image at a time? What is the actual problem you are trying to solve with the images?

FileTrees.jl probably won’t work in lazy mode unless you limit how many files you load at once. I’m going to improve this behavior with some changes to Dagger.jl (the library supporting FileTrees’ lazy mode), but it’ll be a while before that’s available.

Generally, either memory mapping or loading only a subset of the images at a time is the best strategy for now.
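The "subset at a time" option can be sketched with Iterators.partition from Base. The function name process_in_batches and its arguments are hypothetical. The loader is passed in so that the demo can run with Base's read on temporary files, and with real images you would pass load instead.

```julia
# Load `paths` in chunks of `batchsize`, reduce each chunk to a small
# result with `process`, and keep only those results. The loaded items of
# a chunk become garbage as soon as the loop moves to the next chunk.
function process_in_batches(process, loader, paths; batchsize::Int = 100)
    results = Any[]
    for batch in Iterators.partition(paths, batchsize)
        items = map(loader, batch)   # only this batch is held in memory
        push!(results, process(items))
    end
    return results
end

# Demo: five small files, loaded two at a time.
dir = mktempdir()
paths = [joinpath(dir, "file$i.bin") for i in 1:5]
for (i, p) in enumerate(paths)
    write(p, fill(UInt8(i), i))      # file i contains i bytes
end

# The do-block is the `process` function: total bytes in each batch.
batch_sizes = process_in_batches(read, paths; batchsize = 2) do items
    sum(length, items)
end
println(batch_sizes)
```

What matters is that process returns something small, such as a feature vector or a resized thumbnail. If it returns the full images, memory use grows just as with load.(list).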
