Download and read MNIST images

I am trying to access the MNIST images from the "MNIST handwritten digit database" page by Yann LeCun, Corinna Cortes and Chris Burges.
The code below works, but it looks a bit cumbersome.

Is there a better way? Perhaps chaining everything without saving the files on disk?


using HTTP, GZip, IDX # https://github.com/jlegare/IDX.git
# Download the gzipped IDX file into memory
r = HTTP.get("http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz", cookies=true);
destPath = joinpath(dirname(Base.find_package("Bmlt")),"..","test","data","mnist")
zippedFile = joinpath(destPath,"test.gz")
unZippedFile = joinpath(destPath,"test.idx3")
# Save the raw bytes as they are (no String conversion needed)
open(zippedFile,"w") do f
    write(f,r.body)
end
# Decompress to a second file, since IDX.load expects a path
fh = GZip.open(zippedFile)
open(unZippedFile,"w") do f
    write(f,read(fh))
end
close(fh)
train_set = load(unZippedFile)
img1 = train_set[3][:,:,1] # first image

You can get them easily via https://github.com/JuliaML/MLDatasets.jl/


@ericphanson is absolutely right: for MNIST data it’s better to use MLDatasets.jl.

But I would also like to advertise UrlDownload.jl, which was developed specifically for cases like this.

Unfortunately, the author of IDX.jl didn’t provide a way to process IDX data from a stream, but fortunately that is rather easy to do by hand.

function parseidx(data)
    type_constructors = [ UInt8,
                          Int8,
                          Int16, # placeholder: type code 0x0A is unused in the IDX format,
                                 # so this entry is never selected
                          Int16,
                          Int32,
                          Float32,
                          Float64 ]
    idxtype = type_constructors[data[3] - 0x07]
    dimensions = data[4]
    sizes = map(i -> reinterpret(UInt32, reverse(data[(4 + (i - 1)*4 + 1):(4 + i*4)]))[1], 1:dimensions)
    reshape(convert(Array{idxtype}, data[4*(dimensions + 1) + 1:end]), Tuple(reverse(sizes)))
end

With this parser, it’s easy to get the data through the custom parser feature of UrlDownload.jl:

using UrlDownload

url = "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"

# The second positional argument (true) turns on the progress bar
data = urldownload(url, true, parser=parseidx)
data[:, :, 1] # 28 x 28 array holding the first MNIST digit

As a bonus, you’ll get a nice ProgressMeter.jl download bar.

A couple of notes, though. Firstly, custom parser support for compressed data is available on master or in version 0.2.1 (soon to be registered). Secondly, you may be asked to install CodecZlib.jl.
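
One more remark on the parser: it converts the payload byte by byte, which is fine for UInt8 data such as MNIST but would misread files whose elements are Int16, Int32, Float32 or Float64. Below is a rough sketch of a variant that reinterprets the payload and swaps the byte order instead. The names `IDX_TYPES` and `parseidx_sketch` and the hand-built buffer are purely illustrative (they are not part of IDX.jl or UrlDownload.jl), and it assumes the whole file is already in memory as a byte vector.

```julia
# Illustrative sketch, not part of IDX.jl or UrlDownload.jl.
# Type codes as listed in the IDX format description (0x0A is unused).
const IDX_TYPES = Dict{UInt8,DataType}(
    0x08 => UInt8, 0x09 => Int8, 0x0B => Int16,
    0x0C => Int32, 0x0D => Float32, 0x0E => Float64)

function parseidx_sketch(data::AbstractVector{UInt8})
    # Bytes 1-2 are zero, byte 3 is the type code, byte 4 the number of dimensions.
    T = IDX_TYPES[data[3]]
    ndim = Int(data[4])
    # Each dimension size is a big-endian UInt32 stored right after the first 4 bytes.
    sizes = [Int(ntoh(reinterpret(UInt32, data[4i+1:4i+4])[1])) for i in 1:ndim]
    payload = data[4*(ndim + 1) + 1:end]
    # The payload is big-endian as well; ntoh leaves single-byte types unchanged.
    values = ntoh.(reinterpret(T, payload))
    # IDX is row-major and Julia is column-major, so the dimensions are reversed.
    reshape(values, reverse(sizes)...)
end

# Hand-built buffer: a 2 x 3 Int16 matrix with rows [1 2 3] and [4 5 6].
buf = UInt8[0x00, 0x00, 0x0B, 0x02,              # type code Int16, 2 dimensions
            0x00, 0x00, 0x00, 0x02,              # first dimension: 2
            0x00, 0x00, 0x00, 0x03,              # second dimension: 3
            0x00, 0x01, 0x00, 0x02, 0x00, 0x03,  # row 1
            0x00, 0x04, 0x00, 0x05, 0x00, 0x06]  # row 2

A = parseidx_sketch(buf)
size(A)   # (3, 2), because the dimensions are reversed
A[:, 1]   # Int16[1, 2, 3], the first row of the original matrix
```

Since it takes a byte vector just like `parseidx`, it should also work as the `parser=` argument, though I have only tried the idea on small buffers like the one above.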
