Data spread over multiple files


#1

Hi,

I would like to ask whether anyone has dealt, in a principled manner, with the problem where the data for machine learning is spread over several files.

I once defined an iterator that takes a load function and a list of files as input and outputs minibatches with a fixed number of observations, irrespective of the number of samples in each file.

The code was written for Julia 0.6 and I have not yet updated it to 0.7 / 1.0. I would like to know whether anyone has done something similar, or whether there is a desire for such a thing.


#2

Hard to say more without an MWE, but if you have iterators for individual files, you can use Base.Iterators.flatten to concatenate them, and then optionally Base.Iterators.partition or similar for small batches.
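
A minimal sketch of that suggestion, assuming each file can be loaded into a plain vector of observations. The `data` dictionary, the `load` function and the `loaded` log are hypothetical stand-ins for real files and a real reader. Because the generator passed to flatten is lazy, `load` should only be called once iteration actually reaches the corresponding file.

```julia
using Base.Iterators: flatten, partition

# Hypothetical in-memory stand-in for files on disk.
data = Dict("a" => [1, 2, 3, 4, 5],
            "b" => [6, 7, 8, 9, 10],
            "c" => [11, 12, 13, 14])

loaded = String[]                      # records the order in which "files" are read
load(f) = (push!(loaded, f); data[f])  # hypothetical load function

files = ["a", "b", "c"]

# The generator is lazy: load(f) runs only when flatten reaches file f.
observations = flatten(load(f) for f in files)

# Group the concatenated stream into batches of 2 observations.
batches = partition(observations, 2)

first_batch = first(batches)        # only needs data from file "a"
loaded_after_first = copy(loaded)   # which files have been read so far

all_batches = collect(batches)      # iterates from the start again, reading every file
```

Note that a batch can span two files here (5 comes from "a" and 6 from "b"), and that every fresh pass over `batches` reads the files again.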


#3

Hi Tamas,

Thank you very much for the answer.
The following is a test I have written:

using DataIterators: FileIterator

d = Dict("a" => [1 2 3 4 5], 
	"b" => [6 7 8 9 10], 
	"c" => [11 12 13 14])
loadfun(f) = (println("reading ",f); d[f])

files = ["a", "b", "c"]

begin 
	iter = FileIterator(loadfun, files, 2)
	nxt = iterate(iter)
	i = 0
	while nxt !== nothing && i < 10
		(x, state) = nxt
		println(x)
		nxt = iterate(iter, state)
		i += 1
	end
end

The output should be:

reading a
[1 2]
[3 4]
reading b
[5 6]
[7 8]
[9 10]
reading c
[11 12]
[13 14]

I think that flatten does not do the job here, since I need the data to be loaded lazily, only when it is needed. I have a first proof-of-concept implementation, which is certainly far from perfect.
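
For readers who want to see the idea spelled out, here is a rough sketch of such a lazy batching iterator using the Julia 1.0 `iterate` protocol. It is not the `FileIterator` from `DataIterators`: the type name `LazyBatches`, its fields and the `reads` log are made up for illustration, and it assumes that every file loads as a matrix with one observation per column. A file is read only when the leftover buffer is too small to fill the next batch, and the last batch may be shorter if the total number of observations is not a multiple of the batch size.

```julia
# Hypothetical type for illustration, not DataIterators.FileIterator.
struct LazyBatches{F}
    loadfun::F              # function mapping a file name to a matrix
    files::Vector{String}   # files to read, in order
    batchsize::Int          # number of observations (columns) per batch
end

# The number of batches is unknown until all files have been read.
Base.IteratorSize(::Type{<:LazyBatches}) = Base.SizeUnknown()

# State: (index of the next file to load, leftover columns not yet emitted).
function Base.iterate(it::LazyBatches, state = (1, nothing))
    next_file, buffer = state
    # Load more files only while the buffer cannot fill a whole batch.
    while (buffer === nothing || size(buffer, 2) < it.batchsize) &&
          next_file <= length(it.files)
        chunk = it.loadfun(it.files[next_file])
        buffer = buffer === nothing ? chunk : hcat(buffer, chunk)
        next_file += 1
    end
    # No data left and no files left: iteration is finished.
    (buffer === nothing || size(buffer, 2) == 0) && return nothing
    n = min(it.batchsize, size(buffer, 2))
    batch = buffer[:, 1:n]
    rest = buffer[:, n+1:end]
    return batch, (next_file, rest)
end

# Same toy data as in the test above; `reads` logs which files get loaded.
d = Dict("a" => [1 2 3 4 5],
         "b" => [6 7 8 9 10],
         "c" => [11 12 13 14])
reads = String[]
loadfun(f) = (push!(reads, f); d[f])

batches = collect(LazyBatches(loadfun, ["a", "b", "c"], 2))
```

With the toy data this yields seven batches of two columns each, and each file is read exactly once, in order.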


#4

If anyone is interested in this problem, here is a first shot at it