Problem Reading Python Pandas object into Julia

question

#1

I don’t kow if there is a simple way of doing this but any help is appeciated. The problem occurs when I convert Pandas dataframe to Dataframe.Dataframe. The output is messed up.

The altenative I know is to convert HDFs data to a csv file and them read directly to dataframe data.

using DataFrames,JLD2   #,CSVfiles

using PyCall: @pyimport
@pyimport pandas as pd
file=joinpath(pwd(),"Data") 
data_store = pd.HDFStore(joinpath(file,"Dpli_data_warehouse.h5"))
dpli_data = data_store["data_0818"];
# data_store.close()
typeof(dpli_data)
executed in 976ms, finished 14:17:00 2018-10-27

dpli_data
executed in 455ms, finished 14:21:03 2018-10-27
Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
PolNo																																												
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	32	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
3	32	1	2008	50000.0	31	3	3	20180905	70001019	0.00	1	0	30	30	13	19760822	1	0	0	0	10000.0000	20080905	1	6	2000000.0	1980	21	21	39	1980	2123	0	0	0	0	0	0	1	0	9	9	120	119	0
4	38	0	2008	250000.0	31	0	3	20140908	70001019	0.00	1	0	14	14	8	19700804	1	0	0	0	50000.0000	20080908	1	6	2400000.0	1980	21	21	32	1980	2123	0	0	0	0	0	0	1	0	9	9	72	72	0
5	48	0	2008	50000.0	31	0	3	20110911	70001019	0.00	1	0	5	5	9	19600108	1	0	0	0	10000.0000	20080911	1	2	0.0	1399	27	27	20	1980	2123	0	0	0	0	0	0	0	0	9	9	36	36	0
6	39	0	2008	100000.0	31	1	3	20130911	70001019	0.00	1	0	5	5	9	19690212	1	0	0	0	20000.0000	20080911	1	6	350000.0	1399	18	18	4	1980	2123	0	0	0	0	0	0	0	0	9	9	60	60	0
7	43	1	2008	200000.0	31	3	3	20180911	70001019	0.00	1	0	20	20	13	19650124	1	0	0	0	10000.0000	20080911	1	6	6000000.0	1399	21	21	39	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
8	39	1	2008	5000.0	0	0	2	20090812	70001019	0.00	1	0	10	10	7	19690122	1	0	0	0	2100.0000	20080911	1	6	360000.0	1399	4	4	39	1980	2123	0	0	0	0	0	0	0	0	9	9	11	11	0
9	26	1	2008	10000.0	0	0	2	20090911	70001019	0.00	1	0	10	10	7	19820620	1	0	0	0	2230.0000	20080911	1	6	360000.0	1399	21	21	23	1980	2123	0	0	0	0	0	0	0	0	9	9	12	12	0
10	29	1	2008	5000.0	0	0	2	20090911	70001019	0.00	1	0	30	30	7	19790720	1	0	0	0	2815.0000	20080911	1	6	285000.0	1399	18	18	9	1980	2123	0	0	0	0	0	0	0	0	9	9	12	12	0

608034 rows × 44 columns

using Pandas
dpli_data1=Pandas.DataFrame(dpli_data)
typeof(dpli_data1)
executed in 6ms, finished 14:02:53 2018-10-27
Pandas.DataFrame

dpli_data1[0:5]
executed in 61ms, finished 14:14:24 2018-10-27
       Age  Sex   YOR        SA  Plan   ...     MOR  MOI  DUR  DURIF  AGT_BLK
PolNo                                   ...                                  
1       40    0  2008  250000.0    31   ...       9    9  120    119        0
2       40    0  2008  250000.0    31   ...       9    9  120    119        0
3       32    1  2008   50000.0    31   ...       9    9  120    119        0
4       38    0  2008  250000.0    31   ...       9    9   72     72        0
5       48    0  2008   50000.0    31   ...       9    9   36     36        0
6       39    0  2008  100000.0    31   ...       9    9   60     60        0

[6 rows x 44 columns]

dpli_data2=DataFrames.DataFrame(dpli_data1);
head(dpli_data2)
executed in 7.12s, finished 14:10:38 2018-10-27
Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
Int64	Int64	Int32	Float64	Int64	Int64	Int64	Int64	Int32	Float64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Float64	Int64	Int64	Int64	Float64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int32	Int32	Int32	Int32	Int32
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.0	1	0	30	30	13	19680803	1	0	0	0	50000.0	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	0	31	70001019	0.0	0	3	20180903	1	9	50000.0	0	30	30	13	19680803	1	0	0	0	20080903	500000.0	1	7	1399	250000.0	21	21	4	1980	2123	0	0	0	0	0	0	0	0	40	9	120	119	0	2008
3	31	0	9	50000.0	3	20180903	1	0	9	500000.0	30	30	13	19680803	1	0	0	0	20080903	1	250000.0	7	1399	21	0.0	21	4	1980	2123	0	0	0	0	0	0	0	0	40	0	120	119	0	2008	70001019
4	0	3	9	500000.0	20180903	1	0	30	120	250000.0	30	13	19680803	1	0	0	0	20080903	1	7	0.0	1399	21	21	50000.0	4	1980	2123	0	0	0	0	0	0	0	0	40	0	31	119	0	2008	70001019	9
5	3	20180903	120	250000.0	1	0	30	30	119	0.0	13	19680803	1	0	0	0	20080903	1	7	1399	50000.0	21	21	4	500000.0	1980	2123	0	0	0	0	0	0	0	0	40	0	31	0	0	2008	70001019	9	9
6	20180903	1	119	0.0	0	30	30	13	0	50000.0	19680803	1	0	0	0	20080903	1	7	1399	21	500000.0	21	4	1980	50000.0	2123	0	0	0	0	0	0	0	0	40	0	31	0	3	2008	70001019	9	9	120

#2

Try Pandas.jl.


#3

It is being used at line 26 or so. It looks like that it is able to read python object but subsequent conversion to DataFrames.dataframe is not working. I am looking for a direct way without using pandas.jl in-between.

I have also tried reading HDfs file directly but it does not gets converted in desired format.


#4

Oh, sorry, I didn’t scroll down in the code example :slight_smile:

Can you post what exact versions of packages you are using? I’m slightly confused, because for example head(df) on my system always shows a column with the row number first, which I don’t see in what you pasted.


#5

@davidanthoff

I am using Julia to read HDF file created in Python. Then I convert it to a Pandas DataFrame which seems to work fine. Subsequently I try to convert this to DataFrames.DataFrame (which I cannot do directly from Python), and the output I get is all messed up.

I am using Julia 1.0 and list of package version is given below.

using PyCall: @pyimport
@pyimport pandas as pd

#import os
file=joinpath(pwd(),"Data") 
# dpli_data = pd.read_csv( os.path.join(os.getcwd(), "dpli_data_Final.csv"),low_memory=False)
data_store = pd.HDFStore(joinpath(file,"Dpli_data_warehouse.h5"))

# ########Retrieve data using key
dpli_data = data_store["data_0818"];

# # data_store.close()
@show typeof(dpli_data)
typeof(dpli_data) = PyCall.PyObject

dpli_data
   Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
PolNo																																												
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	32	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
3	32	1	2008	50000.0	31	3	3	20180905	70001019	0.00	1	0	30	30	13	19760822	1	0	0	0	10000.0000	20080905	1	6	2000000.0	1980	21	21	39	1980	2123	0	0	0	0	0	0	1	0	9	9	120	119	0
4	38	0	2008	250000.0	31	0	3	20140908	70001019	0.00	1	0	14	14	8	19700804	1	0	0	0	50000.0000	20080908	1	6	2400000.0	1980	21	21	32	1980	2123	0	0	0	0	0	0	1	0	9	9	72	72	0
5	48	0	2008	50000.0	31	0	3	20110911	70001019	0.00	1	0	5	5	9	19600108	1	0	0	0	10000.0000	20080911	1	2	0.0	1399	27	27	20	1980	2123	0	0	0	0	0	0	0	0	9	9	36	36	0

using Pandas
dpli_data1=Pandas.DataFrame(dpli_data)
Pandas.head(dpli_data1)

        Age  Sex   YOR        SA  Plan   ...     MOR  MOI  DUR  DURIF  AGT_BLK
PolNo                                   ...                                  
1       40    0  2008  250000.0    31   ...       9    9  120    119        0
2       40    0  2008  250000.0    31   ...       9    9  120    119        0
3       32    1  2008   50000.0    31   ...       9    9  120    119        0
4       38    0  2008  250000.0    31   ...       9    9   72     72        0
5       48    0  2008   50000.0    31   ...       9    9   36     36        0

[5 rows x 44 columns]

using DataFrames
dpli_data2=DataFrames.DataFrame(dpli_data1);
typeof(dpli_data2)
DataFrames.DataFrame

DataFrames.head(dpli_data2)

	Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.0	1	0	30	30	13	19680803	1	0	0	0	50000.0	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	0	31	70001019	0.0	0	3	20180903	1	9	50000.0	0	30	30	13	19680803	1	0	0	0	20080903	500000.0	1	7	1399	250000.0	21	21	4	1980	2123	0	0	0	0	0	0	0	0	40	9	120	119	0	2008
3	31	0	9	50000.0	3	20180903	1	0	9	500000.0	30	30	13	19680803	1	0	0	0	20080903	1	250000.0	7	1399	21	0.0	21	4	1980	2123	0	0	0	0	0	0	0	0	40	0	120	119	0	2008	70001019
4	0	3	9	500000.0	20180903	1	0	30	120	250000.0	30	13	19680803	1	0	0	0	20080903	1	7	0.0	1399	21	21	50000.0	4	1980	2123	0	0	0	0	0	0	0	0	40	0	31	119	0	2008	70001019	9
5	3	20180903	120	250000.0	1	0	30	30	119	0.0	13	19680803	1	0	0	0	20080903	1	7	1399	50000.0	21	21	4	500000.0	1980	2123	0	0	0	0	0	0	0	0	40	0	31	0	0	2008	70001019	9	9
6	20180903	1	119	0.0	0	30	30	13	0	50000.0	19680803	1	0	0	0	20080903	1	7	1399	21	500000.0	21	4	1980	50000.0	2123	0	0	0	0	0	0	0	0	40	0	31	0	3	2008	70001019	9	9	120

(v1.0) pkg> st
    Status `C:\Users\chatura\.julia\environments\v1.0\Project.toml`
  [28f2ccd6] ApproxFun v0.10.1
  [c52e3926] Atom v0.7.6
  [6e4b80f9] BenchmarkTools v0.4.1
  [a74b3585] Blosc v0.5.1
  [336ed68f] CSV v0.4.2
  [5d742f6a] CSVFiles v0.10.0
  [aaaa29a8] Clustering v0.12.1
  [944b1d66] CodecZlib v0.5.1
  [f3117721] CombineML v1.1.1
  [8f4d0f93] Conda v1.1.1
  [a93c6f00] DataFrames v0.14.1
  [1313f7d8] DataFramesMeta v0.4.0
  [7806a523] DecisionTree v0.8.1
  [31c24e10] Distributions v0.16.4
  [587475ba] Flux v0.6.8
  [da1fdf0e] FreqTables v0.3.0
  [38e38edf] GLM v1.0.1
  [c91e804a] Gadfly v1.0.0
  [bc5e4493] GitHub v5.0.2
  [cd3eb016] HTTP v0.7.1
  [7073ff75] IJulia v1.14.0
  [033835bb] JLD2 v0.1.2
  [682c06a0] JSON v0.19.0
  [4076af6c] JuMP v0.18.4
  [e5e0dc1b] Juno v0.5.3
  [f0e99cf1] MLBase v0.8.0
  [9920b226] MLDataPattern v0.5.0
  [cc2ba9b6] MLDataUtils v0.4.0
  [1914dd2f] MacroTools v0.4.4
  [6f286f6a] MultivariateStats v0.6.0
  [b8a86587] NearestNeighbors v0.4.2
  [429524aa] Optim v0.17.2
  [eadc2687] Pandas v1.0.2
  [91a5bcdd] Plots v0.21.0
  [92933f4c] ProgressMeter v0.6.1
  [438e738f] PyCall v1.18.5
  [1a8c2f83] Query v0.10.1
  [295af30f] Revise v0.7.12
  [fdea26ae] SIMD v2.0.1
  [3e6341c9] SLEEF v0.5.1
  [3646fa90] ScikitLearn v0.5.0
  [6e75b9c4] ScikitLearnBase v0.4.1
  [60ddc479] StatPlots v0.8.1
  [2913bbd2] StatsBase v0.25.0
  [37b6cedf] Traceur v0.2.0
  [c17dfb99] WinRPM v0.4.2
  [009559a3] XGBoost v0.2.0+ #master (https://github.com/dmlc/XGBoost.jl.git)
  [37e2e46d] LinearAlgebra
  [3fa0cd96] REPL
  [9a3f8284] Random
  [10745b16] Statistics

#6

Could you try to update your packages? For example, you have Pandas.jl v1.0.2, but the latest released version is v1.0.4. And there have been updates in other underlying packages as well.

And then, any chance you could post the file?


#7

Updated the packages but output is same. I will try to share a sample of data.


#8

I have prepared a sample of data but don’t know how/where to upload it.


#9

A git repository could work, eg on Github. Note that you can also use a gist as a git repo to upload binary data.


#10

Thanks. I have created a git repository and uploaded the file with sample data. The file is python hdf file with data key as ‘data’

PS: Just to add that if I save the file from pandas.dataframe to CSV and then read it back into DataFrames.dataframe it seem to work fine