Unable to load pickle data

question

#1

Hello,

I am trying to load a Python pickle object, but I keep getting the error message below and am unable to resolve the issue. The data I am trying to load is the CIFAR-10 dataset. Below is the code with which I am trying to load the batches.

using PyCall
@pyimport pickle

function load_pickle_data(ROOT)
	datadict = Dict()
	for b=1:5
		f=joinpath(ROOT, "data_batch_$b")
		fo=open(f,"r")
		datadict=pickle.load(fo)
	end
	datadict
end

ERROR

PyError (ccall(@pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, arg, C_NULL)) <type 'exceptions.TypeError'>
TypeError("unhashable type: 'bytearray'",)
  File "/Users/Saran/.julia/v0.6/Conda/deps/usr/lib/python2.7/pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "/Users/Saran/.julia/v0.6/Conda/deps/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)

cifar-10-batches-py Directory has the following files in it

batches.meta
data_batch_1
data_batch_2
data_batch_3
data_batch_4
data_batch_5
readme.html
test_batch

The cifar-10-batches-py directory and the Julia file which I am running are in the same folder. Kindly help me fix this issue.

Thank You


#2

CIFAR-10 also comes in a binary version, which really should be preferred over pickle anyway: pickle is insecure and can run malicious code (e.g. if the author's site had been hacked).

The binary version is easily parseable using Julia.
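For anyone going that route: in the CIFAR-10 binary format, each record is 3073 bytes, one label byte followed by 3072 image bytes. A minimal sketch of a parser (the function name is my own, and this uses current Julia syntax rather than v0.6):

```julia
# Parse a CIFAR-10 binary batch: each record is 1 label byte + 3072 image bytes.
function load_cifar_binary(io::IO; nrecords::Int = 10_000)
    labels = UInt8[]
    data = Array{UInt8}(undef, nrecords, 3072)
    for i in 1:nrecords
        push!(labels, read(io, UInt8))  # class label 0-9
        data[i, :] = read(io, 3072)     # raw pixel bytes for one image
    end
    labels, data
end
```

With the real dataset you would call this as `open(load_cifar_binary, "data_batch_1.bin")`.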


#3

The problem is that Julia UInt8 arrays are by default converted to Python bytearray objects, whereas pickle only allows bytes objects for some reason (hence the unhashable type: 'bytearray' error). You can do:

datadict = pickle.loads(pybytes(readbytes(fo)))

instead. (Note that your for b=1:5 loop overwrites datadict 5 times, so that you are only returning the last datadict. Maybe you want merge!(datadict, pickle.loads(...)) instead?)

See also https://github.com/JuliaPy/PyCall.jl/pull/388 for how PyCall uses pickle for serialization.
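The overwrite-versus-merge! point is easy to demonstrate standalone, without PyCall (the demo function names are my own):

```julia
# Plain assignment inside the loop keeps only the last Dict:
function overwrite_demo()
    d = Dict{String,Int}()
    for b in 1:3
        d = Dict("batch_$b" => b)  # replaces the previous contents entirely
    end
    d  # holds only "batch_3"
end

# merge! accumulates entries from every iteration instead:
function merge_demo()
    d = Dict{String,Int}()
    for b in 1:3
        merge!(d, Dict("batch_$b" => b))  # adds this batch's entries to d
    end
    d  # holds "batch_1", "batch_2", and "batch_3"
end
```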


#4

(This should be fixed in the latest PyCall master, which fixes Python I/O with Julia IO objects to produce bytes objects as required by the Python 3 documentation: https://github.com/JuliaPy/PyCall.jl/commit/1b58d9a3543229c78661117f8c34508ea3fea3e8 … the Python 2 docs were unclear on this point.)


#5

@stevengj Thank you very much for pointing out the mistake; I did need merge!.
I have made the changes you suggested, but I am still getting an error:

using PyCall
@pyimport pickle

function load_pickle_data(ROOT)
	datadict = Dict()
	for b=1:5
		f=joinpath(ROOT, "data_batch_$b")
		fo=open(f,"r")
		merge!(datadict,pickle.load(pybytes(readbytes(fo))))
	end
	datadict
end
UndefVarError: readbytes not defined
in load_pickle_data at loadbatchutil.jl:9

I am currently using Julia v0.6.0, so I tried changing readbytes to readbytes!, but I still get an error:

merge!(datadict,pickle.load(pybytes(readbytes!(fo, UInt8))))

ERROR

MethodError: no method matching readbytes!(::IOStream, ::Type{UInt8})
Closest candidates are:
  readbytes!(::IOStream, !Matched::Array{UInt8,N} where N) at iostream.jl:278
  readbytes!(::IOStream, !Matched::Array{UInt8,N} where N, !Matched::Any; all) at iostream.jl:278
  readbytes!(::IO, !Matched::AbstractArray{UInt8,N} where N) at io.jl:503
  ...
in load_pickle_data at loadbatchutil.jl:9

Please let me know what I am missing here.


#6

Sorry, it is just read(fo) in Julia 0.6.
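For the record, read(io) with no further arguments returns the rest of the stream as a Vector{UInt8}, which is what pybytes expects. A quick standalone check (using a temporary file as a stand-in for a data batch):

```julia
path, io = mktemp()       # temporary file standing in for data_batch_1
write(io, "hello")
close(io)
bytes = open(read, path)  # open the file, read all of it as bytes, close it
rm(path)
# bytes is now a Vector{UInt8} holding the file's contents
```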


#7

@stevengj I tried read(fo), but I still get an error message:

merge!(datadict,pickle.load(pybytes(read(fo))))

PyError (ccall(@pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, arg, C_NULL)) <type 'exceptions.AttributeError'>
AttributeError("'str' object has no attribute 'readline'",)
  File "/Users/Saran/.julia/v0.6/Conda/deps/usr/lib/python2.7/pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "/Users/Saran/.julia/v0.6/Conda/deps/usr/lib/python2.7/pickle.py", line 847, in __init__
    self.readline = file.readline

#8

You have to use pickle.loads, not pickle.load, to read from a pybytes object. pickle.load only takes an I/O stream.


#9

@stevengj Thank you very much. pickle.loads did sort out the issue:

"batch_label" → "training batch 5 of 5"
"labels" → Any[10000]
"data" → 10000×3072 Array{UInt8,2}:
"filenames" → Any[10000]
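For anyone continuing from here: each 3072-byte row of "data" stores a 32×32 image channel by channel (1024 red, then 1024 green, then 1024 blue bytes, row-major within each channel). A sketch of unpacking one row into an image array (the function name is my own):

```julia
# Reshape one 3072-element CIFAR-10 row into a 32×32×3 array (height × width × channel).
function row_to_image(row::AbstractVector{UInt8})
    @assert length(row) == 3072
    # CIFAR stores each 1024-byte channel row-major, but Julia reshapes
    # column-major, so permute the first two axes to get height × width × channel.
    permutedims(reshape(row, 32, 32, 3), (2, 1, 3))
end
```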