Integrating MLUtils.DataLoader and image augmentation pipeline on custom dataset

I’m trying to create a custom dataset where getobs performs random image augmentations.
The DataLoader docs suggest that my dataset type has to implement its own numobs and getobs methods.

So this is my code:

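struct my_dataset{T}
    data_arr::T   # images, with observations along the last dimension
    pipeline      # augmentation pipeline
end

# Constructor sets the data and the pipeline; the default
# my_dataset(data, pipeline) generated for the struct does exactly that.

function getobs(data::my_dataset, i)
    # Perform augmentations on single image (omitted here).
    return data.data_arr[:, :, :, i]
end

numobs(data::my_dataset) = size(data.data_arr)[end]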

Instantiating a DataLoader works fine:

dset = my_dataset(randn(Float32, 24, 8, 3, 10_000), FlipX() * NoOp())
loader = DataLoader(dset, batchsize=-1)

But when I try to iterate, instead of my numobs method, a generic routine is called:

julia> first(loader)
ERROR: MethodError: no method matching length(::my_dataset{Array{Float32, 4}})
Closest candidates are:
  length(::Union{Base.KeySet, Base.ValueIterator}) at abstractdict.jl:58
  length(::Union{LinearAlgebra.Adjoint{T, <:Union{StaticArraysCore.StaticArray{Tuple{var"#s2"}, T, 1} where var"#s2", StaticArraysCore.StaticArray{Tuple{var"#s3", var"#s4"}, T, 2} where {var"#s3", var"#s4"}}}, LinearAlgebra.Diagonal{T, <:StaticArraysCore.StaticArray{Tuple{var"#s13"}, T, 1} where var"#s13"}, LinearAlgebra.Hermitian{T, <:StaticArraysCore.StaticArray{Tuple{var"#s10", var"#s11"}, T, 2} where {var"#s10", var"#s11"}}, LinearAlgebra.LowerTriangular{T, <:StaticArraysCore.StaticArray{Tuple{var"#s18", var"#s19"}, T, 2} where {var"#s18", var"#s19"}}, LinearAlgebra.Symmetric{T, <:StaticArraysCore.StaticArray{Tuple{var"#s7", var"#s8"}, T, 2} where {var"#s7", var"#s8"}}, LinearAlgebra.Transpose{T, <:Union{StaticArraysCore.StaticArray{Tuple{var"#s2"}, T, 1} where var"#s2", StaticArraysCore.StaticArray{Tuple{var"#s3", var"#s4"}, T, 2} where {var"#s3", var"#s4"}}}, LinearAlgebra.UnitLowerTriangular{T, <:StaticArraysCore.StaticArray{Tuple{var"#s24", var"#s25"}, T, 2} where {var"#s24", var"#s25"}}, LinearAlgebra.UnitUpperTriangular{T, <:StaticArraysCore.StaticArray{Tuple{var"#s21", var"#s22"}, T, 2} where {var"#s21", var"#s22"}}, LinearAlgebra.UpperTriangular{T, <:StaticArraysCore.StaticArray{Tuple{var"#s15", var"#s16"}, T, 2} where {var"#s15", var"#s16"}}, StaticArraysCore.StaticArray{Tuple{var"#s25"}, T, 1} where var"#s25", StaticArraysCore.StaticArray{Tuple{var"#s1", var"#s3"}, T, 2} where {var"#s1", var"#s3"}, StaticArraysCore.StaticArray{<:Tuple, T}} where T) at ~/.julia/packages/StaticArrays/jA1zK/src/abstractarray.jl:1
  length(::Union{LinearAlgebra.Adjoint{T, S}, LinearAlgebra.Transpose{T, S}} where {T, S}) at ~/Software/julia-1.8.5/share/julia/stdlib/v1.8/LinearAlgebra/src/adjtrans.jl:172
 [1] numobs(::Type{SimpleTraits.Not{MLUtils.IsTable{kstar_ecei_dataset{Array{Float32, 4}}}}}, data::kstar_ecei_dataset{Array{Float32, 4}})
   @ MLUtils ~/.julia/packages/MLUtils/KcBtS/src/observation.jl:53
 [2] numobs
   @ ~/.julia/packages/SimpleTraits/l1ZsK/src/SimpleTraits.jl:331 [inlined]
 [3] ObsView(data::kstar_ecei_dataset{Array{Float32, 4}})
   @ MLUtils ~/.julia/packages/MLUtils/KcBtS/src/obsview.jl:145
 [4] iterate(e::DataLoader{kstar_ecei_dataset{Array{Float32, 4}}, Random._GLOBAL_RNG, Val{nothing}})
   @ MLUtils ~/.julia/packages/MLUtils/KcBtS/src/eachobs.jl:158
 [5] first(itr::DataLoader{kstar_ecei_dataset{Array{Float32, 4}}, Random._GLOBAL_RNG, Val{nothing}})
   @ Base ./abstractarray.jl:424
 [6] top-level scope
   @ REPL[40]:1

I don’t follow what is happening here. Why is my custom numobs method not called?

You likely need to import numobs (the same applies to getobs):

import MLUtils: numobs

or qualify the name with the module when defining the method:

MLUtils.numobs(data::my_dataset) = size(data.data_arr)[end]

By implementing it without importing or namespacing, you are defining a new function in the current module rather than adding a new method to the existing function.
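The difference can be reproduced without MLUtils. The following is a minimal sketch in which `Lib`, `nitems`, `report`, `Wrong`, `Right`, and `Box` are all hypothetical names: `Lib` stands in for a library whose internal code calls its own generic function, the way DataLoader calls MLUtils.numobs.

```julia
# Toy stand-in for a library; every name here is hypothetical.
module Lib
export nitems, report
nitems(x) = length(x)            # generic fallback, like numobs -> length
report(x) = "n = $(nitems(x))"   # library code always calls Lib.nitems
end

module Wrong
using ..Lib
struct Box
    items::Vector{Int}
end
# Unqualified definition: creates a brand-new function Wrong.nitems.
nitems(b::Box) = length(b.items)
end

module Right
using ..Lib
struct Box
    items::Vector{Int}
end
# Qualified definition: adds a method to the existing Lib.nitems.
Lib.nitems(b::Box) = length(b.items)
end

# The library never sees Wrong.nitems, so it hits the length fallback and fails.
err = try
    Lib.report(Wrong.Box([1, 2, 3]))
catch e
    e
end
println(err isa MethodError)                 # true
println(Lib.report(Right.Box([1, 2, 3])))    # n = 3
```

The MethodError mentions length rather than nitems for the same reason the stack trace in the question mentions length rather than numobs: the library's fallback method is the one that ends up being called.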

Thanks, that works. Here is the MWE:

using MLUtils
using Random

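struct my_dset{T}
    data_arr::T   # observations along the last dimension
    trf           # function applied to every element of an observation
end

function MLUtils.getobs(dset::my_dset, ix)
    obs = dset.data_arr[:, ix]
    map(dset.trf, obs)
end

MLUtils.numobs(data::my_dset) = size(data.data_arr)[end]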

d = randn(Float32, 3, 20)
ds = my_dset(d, x -> x + 12.3)

loader = DataLoader(ds, batchsize=-1)

for obs ∈ loader
    @show obs
end

As a side note, this approach would be the Julia counterpart of PyTorch-style dataloaders. Is there a tutorial like this for Julia anywhere on the web?

For reference, when dealing with larger image datasets that have to be loaded from disk, the following can be used: resnet-optim.jl in jeremiedb/ImageNetTrain.jl on GitHub (commit b7cc19676a74525d9b4ec007435f2ff9c892c604).

Note that it is sufficient to extend Base’s length and getindex to get a working custom dataset for DataLoader (MLUtils’s numobs and getobs are not required).

getobs and numobs fall back to getindex and length, so the following works as well:

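struct my_dset{T}
    data_arr::T   # observations along the last dimension
    trf           # function applied to every element of an observation
end

function Base.getindex(dset::my_dset, ix)
    obs = dset.data_arr[:, ix]
    map(dset.trf, obs)
end

Base.length(data::my_dset) = size(data.data_arr)[end]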
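To tie this back to the original question, here is a rough sketch of the same fallback route for a 4D image array with a random per-observation augmentation. `ImageSet`, its fields, and `random_flip` are hypothetical names, and the flip is written with Base only as a stand-in for a real augmentation pipeline. It assumes MLUtils is installed and that batchsize=-1 yields one observation at a time, as in the examples above.

```julia
using MLUtils

# Hypothetical dataset type: images stored with observations along the last dimension.
struct ImageSet{A<:AbstractArray,F}
    images::A
    augment::F   # applied to one image each time it is fetched
end

# Only Base methods are extended; MLUtils.numobs and MLUtils.getobs fall back to these.
Base.length(ds::ImageSet) = size(ds.images)[end]
Base.getindex(ds::ImageSet, i) = ds.augment(selectdim(ds.images, ndims(ds.images), i))

# Stand-in augmentation: flip along the second dimension half of the time.
random_flip(img) = rand(Bool) ? reverse(img, dims=2) : copy(img)

ds = ImageSet(randn(Float32, 24, 8, 3, 10), random_flip)

n = numobs(ds)          # uses Base.length(ds)
img1 = getobs(ds, 1)    # uses ds[1], so the augmentation runs here

for img in DataLoader(ds, batchsize=-1)
    @assert size(img) == (24, 8, 3)
end
```

Because the augmentation runs inside getindex, every pass over the loader draws fresh random flips, which is usually what is wanted for training.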