How to load a portion of an image dataset

I am working with an 8GB image dataset, and I am trying to use only a portion of it in my model since my PC has pretty low RAM. I am not sure exactly how to go about this, but this is what I have tried so far. Once I load my image dataset, I want to resize all of the images and convert them to arrays. I am not sure if it will work with arrays, but I tried converting to a matrix and it did not work out so well. This is what I tried to do:


# sample path to load all car images from windows img="C:\\Users\\Adrian\\cars\\test\\blue"
# creating a function to load the data set
function load_data(path)
    x = readdir(path, join=true)
    # this is where I'm trying to use a portion of the data. I have 1000 images but only want to use 50 of them
    a = load.(x[1:50])
    # this is where I tried to resize those 50 images
    h = imresize(a, (240,240))
    # trying to convert to an array because it seems like arrays are the only things that don't give problems when fitting it in a convolutional model
    l = convert.(Matrix{Float32}, h)
    return l
end

# b= load_data("C:\\Users\\Adrian\\cars\\test\\blue")



If there is a better way to use only a portion of the dataset and convert it for use in a model, please send me a sample solution or a link for guidance.

If you’re memory-constrained, you may want to load and transform one image at a time, which prevents you from needing to hold all the input images and all the outputs in memory simultaneously. Broadcasting is often handy, but it’ll probably be easier to split this into multiple functions or use explicit loops to avoid those memory constraints.

Your example doesn’t make a ton of sense to me:

  • loading x then never referencing it again
  • indexing into img[1:50] - where was img defined?
  • doing h=imresize(a,(240,240), but never referencing h afterwards

It helps to make sure your example can be run as written by the person trying to assist you (see this note).

That said, I’ll make some guesses from your comments:

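using FileIO, Images  # `load` comes from FileIO, `imresize` from Images

function load_data(path, N)
    img_paths = readdir(path, join=true)
    l = Matrix{Float32}[]
    # iterate over at most N of the paths themselves, not over their indices
    for img_path in img_paths[1:min(N, length(img_paths))]
        a = load(img_path)
        h = imresize(a, (240,240))
        # store the resized image; this conversion assumes grayscale images
        # (for color images, apply Gray.(h) first)
        push!(l, convert(Matrix{Float32}, h))
    end
    return l
end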
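If the goal is a single array to feed a convolutional model, a variation on the same idea is to preallocate the output and fill it one image at a time. This is only a minimal sketch under some assumptions: `load_batch` and its `sz` keyword are hypothetical names, every file in the folder is assumed to be an image, grayscale conversion is assumed to be acceptable, and the (height, width, channel, batch) layout is a guess at what the model expects, since the framework is not named in this thread.

```julia
using FileIO, Images  # `load` from FileIO; `imresize` and `Gray` from Images

# Hypothetical helper: fill one preallocated Float32 array, one image at a time,
# so only a single full-size image is held in memory at any moment.
function load_batch(path, N; sz=(240, 240))
    img_paths = readdir(path, join=true)
    n = min(N, length(img_paths))                 # never ask for more than exist
    batch = Array{Float32}(undef, sz..., 1, n)    # sz[1] x sz[2] x 1 channel x n images
    for i in 1:n
        img = load(img_paths[i])                  # full-size image, discarded after this iteration
        resized = imresize(img, sz)
        batch[:, :, 1, i] .= Float32.(Gray.(resized))  # assumption: grayscale is fine
    end
    return batch
end
```

Calling `load_batch(path, 50)` would then give a `240×240×1×50` array when the folder holds at least 50 images. For color input you would need three channels instead of one, which this sketch does not cover.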

Thank you, and sorry. I will re-edit the problem so it doesn’t cause more confusion.

I just wanted to ask: in function load_data(path, N), what does N represent in this case? What value will I need to pass in?

N is the number of images to load, which should almost always be an input parameter rather than a constant in this sort of function - it makes it easy to test on just one image, or to process many on a computer with more memory without tweaking the function itself. I used min(N, length(img_paths)) to make sure no more images are loaded than actually exist in the folder.
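To make the clamping concrete, here is a small sketch that uses plain strings in place of real file paths. `take_upto` is a hypothetical name used only for illustration; on Julia 1.6 or later, `first(items, N)` from Base should behave the same way.

```julia
# Hypothetical helper mirroring the clamping described above.
take_upto(items, N) = items[1:min(N, length(items))]

paths = ["a.png", "b.png", "c.png"]  # stand-in for the output of readdir

one  = take_upto(paths, 1)    # a single item, handy for a quick test
many = take_upto(paths, 50)   # asks for 50, gets only the 3 that exist
same = first(paths, 50)       # Base equivalent (Julia 1.6 or later)
```

Without the `min`, `paths[1:50]` would throw a `BoundsError` on a folder with fewer than 50 files.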


If your images are sequentially numbered, IMO it’s better to use “glob” (from the Glob.jl package) instead of “readdir”, because with “glob” you can define a file filter as a pattern. For example, a pattern can be written so that only the paths/filenames of 5 of the n pictures are collected.
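A minimal sketch of such a filter, assuming the Glob.jl package is installed and assuming a hypothetical naming scheme of `car_1.png`, `car_2.png`, and so on (neither the file names nor the helper `first_five` come from this thread):

```julia
using Glob  # third-party package, installed with `] add Glob`

# Hypothetical layout: files named car_1.png, car_2.png, ..., car_N.png.
# The character class [1-5] matches exactly one digit from 1 to 5,
# so car_10.png or car_15.png are not picked up.
function first_five(path)
    return glob("car_[1-5].png", path)
end
```

The returned paths already include the directory prefix, so they can be passed straight to `load`.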