Custom dataset

Hello! I am new here.
I have been trying to use Metalhead.jl/Flux.jl to train custom image classifiers, but I can't seem to load data from a folder directly (i.e., data that isn't part of the standard datasets).
Say I have a folder with 10 subfolders, each of them containing 1000 images.
How can I load this up to perform a simple classification task?

Thank you!
Great to see such an active community :slight_smile:


Welcome, @SubhadityaMukherjee!
It is great that you want to use Flux.

It is true that there is no DataLoader able to read directly from a folder.
I ran into the same problem, so I had to implement it myself. Here is the code, in case it helps:

using Metalhead
using FileIO
using Images
using Serialization
using Random

absolute(dir::AbstractString) = expanduser(abspath(dir))

"""
    apply_model(apply_model::Function, dir::AbstractString, diroutput::AbstractString)

Read the images in the directory `dir`, apply the model (e.g. a ResNet) and save
the results to a new directory `diroutput`.

- `dir` should have the structure `partitionX/test/<category>`.
- `diroutput` is a new directory that stores the results, with the structure
  `partitionX/test`, `partitionX/train`, in which the results are serialized
  as the pair (matrix, category).
"""
function apply_model(apply_model::Function, dir::AbstractString, diroutput::AbstractString)
    partitions = readdir(dir)
    subdirs = ["test", "train"]
    categories = String[]

    if !isdir(diroutput)
        mkpath(diroutput)
    end

    # Put in absolute path
    dir = absolute(dir)
    diroutput = absolute(diroutput)

    # Check that every partition/subdir has the same categories
    for subdir in subdirs, partition in partitions
        cats = readdir(joinpath(dir, partition, subdir))

        if isempty(categories)
            categories = cats
        else
            @assert size(cats) == size(categories) && all(cats .== categories) "Error, categories '$categories' and '$cats' are not the same"
        end
    end

    for subdir in subdirs
        for partition in partitions
            for category in categories
                files = readdir(joinpath(dir, partition, subdir, category), join=true)
                outputdir = joinpath(diroutput, partition)

                if !isdir(outputdir)
                    mkpath(outputdir)
                end

                outputfile = joinpath(outputdir, "$(subdir)_$(category).bin")

                if isfile(outputfile)
                    println("Ignore '$outputfile'")
                    continue
                end

                # Apply the model to every image and concatenate the columns
                output = reduce(hcat, [apply_model(file)::Array{Float32,2} for file in files])

                open(outputfile, "w") do file
                    serialize(file, output)
                end
                println("Written '$outputfile'")
            end
        end
    end
end

function main_apply_resnet()
    model = ResNet()

    function apply_resnet(file)
        img = RGB.(Images.load(file))
        output = model.layers(Metalhead.preprocess(img))
        return output
    end

    apply_model(apply_resnet, "data/", "resnet_data/")
end

isinteractive() || main_apply_resnet()
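Reading the serialized features back for training is then just a `deserialize` call per `.bin` file. A minimal round-trip sketch (the matrix here is random stand-in data, and the file name is only illustrative):

```julia
using Serialization

# Stand-in for a saved feature matrix: 1000 features for 4 images
features = rand(Float32, 1000, 4)
file = joinpath(mktempdir(), "test_cats.bin")

# Write it the same way apply_model does
open(file, "w") do io
    serialize(io, features)
end

# Recover the Float32 matrix in the training script
loaded = open(deserialize, file)
```
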

Nowadays, there is an effort to create a DataLoader that allows this type of thing.

I hope it helps you.


Thank you so much! This really helps a lot. @dmolina

I was actually trying to make something like fastai's data loader (from the PyTorch world) for Julia. I thought it would be really helpful, since this part of Flux isn't that developed yet, and Flux is really awesome, so why not contribute a bit haha. But I couldn't figure out how to get the files into an array in usable time.
Now hopefully I will be able to :slight_smile:
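For what it's worth, gathering every image path together with its label from a `root/<category>/...` layout (the "10 subfolders with 1000 images each" case) only needs the standard library; a sketch along those lines, with made-up folder names for the demo:

```julia
# Collect (path, label) pairs from a directory of per-class subfolders
function labeled_files(root::AbstractString)
    pairs = Tuple{String,String}[]
    for category in readdir(root)
        dir = joinpath(root, category)
        isdir(dir) || continue
        for file in readdir(dir; join=true)   # join=true gives full paths
            push!(pairs, (file, category))
        end
    end
    return pairs
end

# Demo on a throwaway directory structure: 2 classes, 3 files each
root = mktempdir()
for cat in ["cats", "dogs"], i in 1:3
    mkpath(joinpath(root, cat))
    touch(joinpath(root, cat, "img$i.png"))
end

data = labeled_files(root)
```

The actual image decoding (`Images.load`) can then be done lazily per batch, which is what keeps the listing step fast even for thousands of files.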

Have a great day!
