Data spread over multiple files

Tomas_Pevny · September 25, 2018, 12:10pm

Hi,

I would like to ask whether anyone has dealt, in a principled manner, with the problem where the data for machine learning is spread over several files.

I once defined an iterator that takes a load function and a list of files as input and outputs minibatches with a fixed number of observations, irrespective of the number of samples in each file.

The code was written for Julia 0.6 and I have not yet updated it to 0.7 / 1.0. I would like to know whether anyone has done something similar, or whether there is interest in such a thing.

Tamas_Papp · September 25, 2018, 1:26pm

Hard to say more without an MWE, but if you have iterators for individual files, you can use Base.Iterators.flatten to concatenate them, and then optionally Base.Iterators.partition or similar for small batches.
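As a rough illustration of that suggestion, here is a minimal sketch. The `data` dictionary and the `loadfile` function are hypothetical stand-ins for real files and a real loader. Wrapping the loader in a generator should mean that a file is only loaded once `flatten` reaches it, though that is worth checking against your own loader. Note that the batches come out as vectors of individual observations, not as matrices.

```julia
# Hypothetical stand-in for real files: each "file" holds a vector of observations.
data = Dict("a" => [1, 2, 3, 4, 5], "b" => [6, 7, 8, 9, 10], "c" => [11, 12, 13, 14])

# Hypothetical loader; the println shows when a "file" is actually read.
loadfile(name) = (println("reading ", name); data[name])

files = ["a", "b", "c"]

# The generator calls loadfile only when flatten moves on to the next file.
observations = Iterators.flatten(loadfile(f) for f in files)

# Group the flattened stream into batches of two observations,
# regardless of where the file boundaries fall.
batches = Iterators.partition(observations, 2)

for batch in batches
    println(batch)
end
```

Because `partition` only sees a stream of observations, a batch can span two files without any special handling.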

Tomas_Pevny · September 25, 2018, 6:37pm

Hi Tamas,

Thank you very much for the answer.
The following is a test I have written:

using DataIterators: FileIterator

d = Dict("a" => [1 2 3 4 5], 
	"b" => [6 7 8 9 10], 
	"c" => [11 12 13 14])
loadfun(f) = (println("reading ",f); d[f])

files = ["a", "b", "c"]

begin 
	iter = FileIterator(loadfun, files, 2)
	nxt = iterate(iter)
	i = 0
	while nxt !== nothing && i < 10
		(x, state) = nxt
		println(x)
		nxt = iterate(iter, state)
		i += 1
	end
end

The output should be:

reading a
[1 2]
[3 4]
reading b
[5 6]
[7 8]
[9 10]
reading c
[11 12]
[13 14]

I think that flatten does not do the job here, since I need the data to be loaded lazily, only when it is needed. I have a first proof-of-concept implementation, which is certainly far from perfect.
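The behaviour exercised by the test above (read a file only when the buffered observations cannot fill a whole batch) could be implemented roughly as follows. This is a hypothetical sketch and not the code of `DataIterators.FileIterator`. The type name `LazyFileBatches` is made up for illustration, and the sketch assumes that the load function returns a matrix with one observation per column.

```julia
# Hypothetical sketch of a lazy, file-spanning minibatch iterator.
struct LazyFileBatches{F}
    loadfun::F               # called as loadfun(file); assumed to return a matrix
    files::Vector{String}
    batchsize::Int
end

# The number of batches is unknown without loading every file.
Base.IteratorSize(::Type{<:LazyFileBatches}) = Base.SizeUnknown()

# State: (index of the next file to load, observations loaded but not yet emitted).
function Base.iterate(it::LazyFileBatches, state = (1, nothing))
    nextfile, buffer = state
    # Load further files only while the buffer cannot fill a whole batch.
    while (buffer === nothing || size(buffer, 2) < it.batchsize) &&
          nextfile <= length(it.files)
        chunk = it.loadfun(it.files[nextfile])
        buffer = buffer === nothing ? chunk : hcat(buffer, chunk)
        nextfile += 1
    end
    # Nothing buffered and no files left: iteration is finished.
    (buffer === nothing || size(buffer, 2) == 0) && return nothing
    # The last batch may be shorter if the total count is not a multiple of batchsize.
    n = min(it.batchsize, size(buffer, 2))
    return buffer[:, 1:n], (nextfile, buffer[:, n+1:end])
end

# Usage with the same toy data as in the test.
d = Dict("a" => [1 2 3 4 5], "b" => [6 7 8 9 10], "c" => [11 12 13 14])
loadfun(f) = (println("reading ", f); d[f])

for x in LazyFileBatches(loadfun, ["a", "b", "c"], 2)
    println(x)
end
```

The sketch keeps leftover observations in the iteration state, so the iterator itself stays immutable and can be restarted. The price is an `hcat` copy each time a new file is appended to the leftovers.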

Tomas_Pevny · September 26, 2018, 4:55am

If anyone is interested in this problem, here is a first attempt at it
