How to read a "pickled" NumPy multiarray in Julia?

I am trying to port a Python exercise to Julia, but its first step reads some “pickled” data.
If I open the file in a text editor, I see that it is a binary file with “numpy.core.multiarray” stamped on the first line.

This is the code given in Python:

def get_MNIST_data():
    """
    Reads mnist dataset from file

    Returns:
        train_x - 2D Numpy array (n, d) where each row is an image
        train_y - 1D Numpy array (n, ) where each row is a label
        test_x  - 2D Numpy array (n, d) where each row is an image
        test_y  - 1D Numpy array (n, ) where each row is a label

    """
    train_set, valid_set, test_set = read_pickle_data('../Datasets/mnist.pkl.gz')
    train_x, train_y = train_set
    valid_x, valid_y = valid_set
    train_x = np.vstack((train_x, valid_x))
    train_y = np.append(train_y, valid_y)
    test_x, test_y = test_set
    return (train_x, train_y, test_x, test_y)

Is there a native Julia package to read this format? If not, should I use PyCall instead?

Yes, PyCall should work. Alternatively, you can copy the MNIST-loading code from a Julia ML package such as Knet: https://github.com/denizyuret/Knet.jl/blob/master/data/mnist.jl (see also the examples under Knet.jl/examples/mnist-mlp on the master branch of denizyuret/Knet.jl on GitHub), or use https://github.com/JuliaML/MLDatasets.jl

Perhaps it would make sense to store the data in a different file format for the exercise? That is, load it once using pickle = pyimport("pickle") and then save it in a generic data format such as HDF5.

Alternatively, why not store the data as .npz in Python and use NPZ.jl to load it in Julia?
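For what it's worth, the one-off conversion on the Python side could look roughly like the sketch below. It assumes the pickle holds three (x, y) pairs, as the get_MNIST_data code in the question suggests; the function name pickle_to_npz and the array names inside the archive are made up for illustration.

```python
import gzip
import pickle

import numpy as np


def pickle_to_npz(pickle_path, npz_path):
    """Convert a gzipped pickle of three (x, y) pairs into one .npz archive.

    Hypothetical helper: the (train, valid, test) layout is an assumption
    taken from the get_MNIST_data code in the question.
    """
    with gzip.open(pickle_path, "rb") as f:
        # encoding="latin1" lets Python 3 read a pickle written by Python 2
        train_set, valid_set, test_set = pickle.load(f, encoding="latin1")

    # Store every array under an explicit name so it can be looked up by key
    np.savez_compressed(
        npz_path,
        train_x=train_set[0], train_y=train_set[1],
        valid_x=valid_set[0], valid_y=valid_set[1],
        test_x=test_set[0], test_y=test_set[1],
    )


if __name__ == "__main__":
    # Paths are placeholders; adjust them to where the dataset actually lives
    pickle_to_npz("../Datasets/mnist.pkl.gz", "../Datasets/mnist.npz")
```

On the Julia side, npzread from NPZ.jl should then return a dictionary keyed by those array names, so no Python is needed at load time.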

Yes, this works, and I get the output as Julia Arrays:

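using PyCall
pickle = pyimport("pickle")  # function form of the import; @pyimport is the older macro form
gzip = pyimport("gzip")

function read_pickle_data(file_name)
    f = gzip.open(file_name, "rb")
    try
        # encoding="latin1" lets Python 3 read a pickle written by Python 2
        return pickle.load(f, encoding="latin1")
    finally
        f.close()  # close the file even if unpickling fails
    end
end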

(I thought read_pickle_data was a Python function)
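For comparison, a Python version of that helper would presumably look something like this. It is only a sketch mirroring the Julia function above, since the exercise's own read_pickle_data is not shown in the question.

```python
import gzip
import pickle


def read_pickle_data(file_name):
    """Load one pickled object from a gzip-compressed file.

    Sketch only: the exercise's real helper is not shown in this thread.
    """
    with gzip.open(file_name, "rb") as f:
        # encoding="latin1" lets Python 3 read a pickle written by Python 2
        return pickle.load(f, encoding="latin1")
```

Calling it on '../Datasets/mnist.pkl.gz' should give back the three sets that get_MNIST_data unpacks.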
