Problem Reading Python Pandas object into Julia

question

#1

I don’t know if there is a simple way of doing this, but any help is appreciated. The problem occurs when I convert a Pandas DataFrame to a DataFrames.DataFrame: the output is scrambled.

The alternative I know of is to convert the HDF data to a CSV file and then read that file directly into a DataFrame.
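For reference, that CSV round trip could look roughly like the sketch below. It is only a sketch under some assumptions: a PyCall version that supports the `obj.method(...)` dot syntax, a CSV.jl version that accepts `CSV.read(path, DataFrame)`, and a Python installation where pandas can read HDF5 files. The function name `hdf_key_to_dataframe` and the intermediate CSV path are made up for illustration.

```julia
using PyCall, CSV, DataFrames

# Hypothetical helper: dump one key of an HDF5 store to CSV on the Python side,
# then parse that CSV into a DataFrames.DataFrame on the Julia side.
function hdf_key_to_dataframe(h5_path::AbstractString, key::AbstractString,
                              csv_path::AbstractString)
    pd = pyimport("pandas")
    # index=true writes the pandas index (PolNo in my data) as the first CSV column.
    pd.read_hdf(h5_path, key).to_csv(csv_path, index=true)
    # No pandas object crosses into DataFrames here; only the CSV file does.
    return CSV.read(csv_path, DataFrame)
end
```

With my file this would be called as `hdf_key_to_dataframe(joinpath(pwd(), "Data", "Dpli_data_warehouse.h5"), "data_0818", "dpli_data_0818.csv")`, where the CSV file name is arbitrary. It avoids the conversion problem but writes all 608034 rows to disk first, which is why I would prefer a direct way.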

using DataFrames,JLD2   #,CSVfiles

using PyCall: @pyimport
@pyimport pandas as pd
file=joinpath(pwd(),"Data") 
data_store = pd.HDFStore(joinpath(file,"Dpli_data_warehouse.h5"))
dpli_data = data_store["data_0818"];
# data_store.close()
typeof(dpli_data)
executed in 976ms, finished 14:17:00 2018-10-27

dpli_data
executed in 455ms, finished 14:21:03 2018-10-27
Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
PolNo																																												
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	32	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
3	32	1	2008	50000.0	31	3	3	20180905	70001019	0.00	1	0	30	30	13	19760822	1	0	0	0	10000.0000	20080905	1	6	2000000.0	1980	21	21	39	1980	2123	0	0	0	0	0	0	1	0	9	9	120	119	0
4	38	0	2008	250000.0	31	0	3	20140908	70001019	0.00	1	0	14	14	8	19700804	1	0	0	0	50000.0000	20080908	1	6	2400000.0	1980	21	21	32	1980	2123	0	0	0	0	0	0	1	0	9	9	72	72	0
5	48	0	2008	50000.0	31	0	3	20110911	70001019	0.00	1	0	5	5	9	19600108	1	0	0	0	10000.0000	20080911	1	2	0.0	1399	27	27	20	1980	2123	0	0	0	0	0	0	0	0	9	9	36	36	0
6	39	0	2008	100000.0	31	1	3	20130911	70001019	0.00	1	0	5	5	9	19690212	1	0	0	0	20000.0000	20080911	1	6	350000.0	1399	18	18	4	1980	2123	0	0	0	0	0	0	0	0	9	9	60	60	0
7	43	1	2008	200000.0	31	3	3	20180911	70001019	0.00	1	0	20	20	13	19650124	1	0	0	0	10000.0000	20080911	1	6	6000000.0	1399	21	21	39	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
8	39	1	2008	5000.0	0	0	2	20090812	70001019	0.00	1	0	10	10	7	19690122	1	0	0	0	2100.0000	20080911	1	6	360000.0	1399	4	4	39	1980	2123	0	0	0	0	0	0	0	0	9	9	11	11	0
9	26	1	2008	10000.0	0	0	2	20090911	70001019	0.00	1	0	10	10	7	19820620	1	0	0	0	2230.0000	20080911	1	6	360000.0	1399	21	21	23	1980	2123	0	0	0	0	0	0	0	0	9	9	12	12	0
10	29	1	2008	5000.0	0	0	2	20090911	70001019	0.00	1	0	30	30	7	19790720	1	0	0	0	2815.0000	20080911	1	6	285000.0	1399	18	18	9	1980	2123	0	0	0	0	0	0	0	0	9	9	12	12	0

608034 rows × 44 columns

using Pandas
dpli_data1=Pandas.DataFrame(dpli_data)
typeof(dpli_data1)
executed in 6ms, finished 14:02:53 2018-10-27
Pandas.DataFrame

dpli_data1[0:5]
executed in 61ms, finished 14:14:24 2018-10-27
       Age  Sex   YOR        SA  Plan   ...     MOR  MOI  DUR  DURIF  AGT_BLK
PolNo                                   ...                                  
1       40    0  2008  250000.0    31   ...       9    9  120    119        0
2       40    0  2008  250000.0    31   ...       9    9  120    119        0
3       32    1  2008   50000.0    31   ...       9    9  120    119        0
4       38    0  2008  250000.0    31   ...       9    9   72     72        0
5       48    0  2008   50000.0    31   ...       9    9   36     36        0
6       39    0  2008  100000.0    31   ...       9    9   60     60        0

[6 rows x 44 columns]

dpli_data2=DataFrames.DataFrame(dpli_data1);
head(dpli_data2)
executed in 7.12s, finished 14:10:38 2018-10-27
Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
Int64	Int64	Int32	Float64	Int64	Int64	Int64	Int64	Int32	Float64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Float64	Int64	Int64	Int64	Float64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int64	Int32	Int32	Int32	Int32	Int32
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.0	1	0	30	30	13	19680803	1	0	0	0	50000.0	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	0	31	70001019	0.0	0	3	20180903	1	9	50000.0	0	30	30	13	19680803	1	0	0	0	20080903	500000.0	1	7	1399	250000.0	21	21	4	1980	2123	0	0	0	0	0	0	0	0	40	9	120	119	0	2008
3	31	0	9	50000.0	3	20180903	1	0	9	500000.0	30	30	13	19680803	1	0	0	0	20080903	1	250000.0	7	1399	21	0.0	21	4	1980	2123	0	0	0	0	0	0	0	0	40	0	120	119	0	2008	70001019
4	0	3	9	500000.0	20180903	1	0	30	120	250000.0	30	13	19680803	1	0	0	0	20080903	1	7	0.0	1399	21	21	50000.0	4	1980	2123	0	0	0	0	0	0	0	0	40	0	31	119	0	2008	70001019	9
5	3	20180903	120	250000.0	1	0	30	30	119	0.0	13	19680803	1	0	0	0	20080903	1	7	1399	50000.0	21	21	4	500000.0	1980	2123	0	0	0	0	0	0	0	0	40	0	31	0	0	2008	70001019	9	9
6	20180903	1	119	0.0	0	30	30	13	0	50000.0	19680803	1	0	0	0	20080903	1	7	1399	21	500000.0	21	4	1980	50000.0	2123	0	0	0	0	0	0	0	0	40	0	31	0	3	2008	70001019	9	9	120

#2

Try Pandas.jl.


#3

It is already being used, at around line 26 of the code. Pandas.jl seems able to read the Python object, but the subsequent conversion to DataFrames.DataFrame does not work. I am looking for a direct way that does not go through Pandas.jl in between.

I have also tried reading the HDF file directly, but it does not get converted into the desired format.


#4

Oh, sorry, I didn’t scroll down in the code example :slight_smile:

Can you post what exact versions of packages you are using? I’m slightly confused, because for example head(df) on my system always shows a column with the row number first, which I don’t see in what you pasted.