I don’t kow if there is a simple way of doing this but any help is appeciated. The problem occurs when I convert Pandas dataframe to Dataframe.Dataframe. The output is messed up.
The altenative I know is to convert HDFs data to a csv file and them read directly to dataframe data.
using DataFrames,JLD2 #,CSVfiles
using PyCall: @pyimport
@pyimport pandas as pd
file=joinpath(pwd(),"Data")
data_store = pd.HDFStore(joinpath(file,"Dpli_data_warehouse.h5"))
dpli_data = data_store["data_0818"];
# data_store.close()
typeof(dpli_data)
executed in 976ms, finished 14:17:00 2018-10-27
dpli_data
executed in 455ms, finished 14:21:03 2018-10-27
Age Sex YOR SA Plan Mode LOB PTD AGT EXPR Type EMR PT PPT SubStatus DOBLA UW TPCL PML TPML APE DOI CHNL LA_OCC LA_INC LA_CITY LA_ID PO_ID NOMINEE AG_CITY AGT_BR REDF_CITY DTH_CLM POLINV ADVFIND EXAPP DEF_ND LA_AG_CITY LA_BR_CITY MOR MOI DUR DURIF AGT_BLK
PolNo
1 40 0 2008 250000.0 31 0 3 20180903 70001019 0.00 1 0 30 30 13 19680803 1 0 0 0 50000.0000 20080903 1 7 500000.0 1399 21 21 4 1980 2123 0 0 0 0 0 0 0 0 9 9 120 119 0
2 40 0 2008 250000.0 31 0 3 20180903 70001019 0.00 1 0 30 30 13 19680803 1 0 0 0 50000.0000 20080903 1 7 500000.0 1399 21 21 32 1980 2123 0 0 0 0 0 0 0 0 9 9 120 119 0
3 32 1 2008 50000.0 31 3 3 20180905 70001019 0.00 1 0 30 30 13 19760822 1 0 0 0 10000.0000 20080905 1 6 2000000.0 1980 21 21 39 1980 2123 0 0 0 0 0 0 1 0 9 9 120 119 0
4 38 0 2008 250000.0 31 0 3 20140908 70001019 0.00 1 0 14 14 8 19700804 1 0 0 0 50000.0000 20080908 1 6 2400000.0 1980 21 21 32 1980 2123 0 0 0 0 0 0 1 0 9 9 72 72 0
5 48 0 2008 50000.0 31 0 3 20110911 70001019 0.00 1 0 5 5 9 19600108 1 0 0 0 10000.0000 20080911 1 2 0.0 1399 27 27 20 1980 2123 0 0 0 0 0 0 0 0 9 9 36 36 0
6 39 0 2008 100000.0 31 1 3 20130911 70001019 0.00 1 0 5 5 9 19690212 1 0 0 0 20000.0000 20080911 1 6 350000.0 1399 18 18 4 1980 2123 0 0 0 0 0 0 0 0 9 9 60 60 0
7 43 1 2008 200000.0 31 3 3 20180911 70001019 0.00 1 0 20 20 13 19650124 1 0 0 0 10000.0000 20080911 1 6 6000000.0 1399 21 21 39 1980 2123 0 0 0 0 0 0 0 0 9 9 120 119 0
8 39 1 2008 5000.0 0 0 2 20090812 70001019 0.00 1 0 10 10 7 19690122 1 0 0 0 2100.0000 20080911 1 6 360000.0 1399 4 4 39 1980 2123 0 0 0 0 0 0 0 0 9 9 11 11 0
9 26 1 2008 10000.0 0 0 2 20090911 70001019 0.00 1 0 10 10 7 19820620 1 0 0 0 2230.0000 20080911 1 6 360000.0 1399 21 21 23 1980 2123 0 0 0 0 0 0 0 0 9 9 12 12 0
10 29 1 2008 5000.0 0 0 2 20090911 70001019 0.00 1 0 30 30 7 19790720 1 0 0 0 2815.0000 20080911 1 6 285000.0 1399 18 18 9 1980 2123 0 0 0 0 0 0 0 0 9 9 12 12 0
608034 rows × 44 columns
using Pandas
dpli_data1=Pandas.DataFrame(dpli_data)
typeof(dpli_data1)
executed in 6ms, finished 14:02:53 2018-10-27
Pandas.DataFrame
dpli_data1[0:5]
executed in 61ms, finished 14:14:24 2018-10-27
Age Sex YOR SA Plan ... MOR MOI DUR DURIF AGT_BLK
PolNo ...
1 40 0 2008 250000.0 31 ... 9 9 120 119 0
2 40 0 2008 250000.0 31 ... 9 9 120 119 0
3 32 1 2008 50000.0 31 ... 9 9 120 119 0
4 38 0 2008 250000.0 31 ... 9 9 72 72 0
5 48 0 2008 50000.0 31 ... 9 9 36 36 0
6 39 0 2008 100000.0 31 ... 9 9 60 60 0
[6 rows x 44 columns]
dpli_data2=DataFrames.DataFrame(dpli_data1);
head(dpli_data2)
executed in 7.12s, finished 14:10:38 2018-10-27
Age Sex YOR SA Plan Mode LOB PTD AGT EXPR Type EMR PT PPT SubStatus DOBLA UW TPCL PML TPML APE DOI CHNL LA_OCC LA_INC LA_CITY LA_ID PO_ID NOMINEE AG_CITY AGT_BR REDF_CITY DTH_CLM POLINV ADVFIND EXAPP DEF_ND LA_AG_CITY LA_BR_CITY MOR MOI DUR DURIF AGT_BLK
Int64 Int64 Int32 Float64 Int64 Int64 Int64 Int64 Int32 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Int64 Int64 Int64 Float64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int32 Int32 Int32 Int32 Int32
1 40 0 2008 250000.0 31 0 3 20180903 70001019 0.0 1 0 30 30 13 19680803 1 0 0 0 50000.0 20080903 1 7 500000.0 1399 21 21 4 1980 2123 0 0 0 0 0 0 0 0 9 9 120 119 0
2 0 31 70001019 0.0 0 3 20180903 1 9 50000.0 0 30 30 13 19680803 1 0 0 0 20080903 500000.0 1 7 1399 250000.0 21 21 4 1980 2123 0 0 0 0 0 0 0 0 40 9 120 119 0 2008
3 31 0 9 50000.0 3 20180903 1 0 9 500000.0 30 30 13 19680803 1 0 0 0 20080903 1 250000.0 7 1399 21 0.0 21 4 1980 2123 0 0 0 0 0 0 0 0 40 0 120 119 0 2008 70001019
4 0 3 9 500000.0 20180903 1 0 30 120 250000.0 30 13 19680803 1 0 0 0 20080903 1 7 0.0 1399 21 21 50000.0 4 1980 2123 0 0 0 0 0 0 0 0 40 0 31 119 0 2008 70001019 9
5 3 20180903 120 250000.0 1 0 30 30 119 0.0 13 19680803 1 0 0 0 20080903 1 7 1399 50000.0 21 21 4 500000.0 1980 2123 0 0 0 0 0 0 0 0 40 0 31 0 0 2008 70001019 9 9
6 20180903 1 119 0.0 0 30 30 13 0 50000.0 19680803 1 0 0 0 20080903 1 7 1399 21 500000.0 21 4 1980 50000.0 2123 0 0 0 0 0 0 0 0 40 0 31 0 3 2008 70001019 9 9 120
It is being used at line 26 or so. It looks like that it is able to read python object but subsequent conversion to DataFrames.dataframe is not working. I am looking for a direct way without using pandas.jl in-between.
I have also tried reading HDfs file directly but it does not gets converted in desired format.
Oh, sorry, I didn’t scroll down in the code example
Can you post what exact versions of packages you are using? I’m slightly confused, because for example head(df)
on my system always shows a column with the row number first, which I don’t see in what you pasted.
@davidanthoff
I am using Julia to read HDF file created in Python. Then I convert it to a Pandas DataFrame which seems to work fine. Subsequently I try to convert this to DataFrames.DataFrame (which I cannot do directly from Python), and the output I get is all messed up.
I am using Julia 1.0 and list of package version is given below.
using PyCall: @pyimport
@pyimport pandas as pd
#import os
file=joinpath(pwd(),"Data")
# dpli_data = pd.read_csv( os.path.join(os.getcwd(), "dpli_data_Final.csv"),low_memory=False)
data_store = pd.HDFStore(joinpath(file,"Dpli_data_warehouse.h5"))
# ########Retrieve data using key
dpli_data = data_store["data_0818"];
# # data_store.close()
@show typeof(dpli_data)
typeof(dpli_data) = PyCall.PyObject
dpli_data
Age Sex YOR SA Plan Mode LOB PTD AGT EXPR Type EMR PT PPT SubStatus DOBLA UW TPCL PML TPML APE DOI CHNL LA_OCC LA_INC LA_CITY LA_ID PO_ID NOMINEE AG_CITY AGT_BR REDF_CITY DTH_CLM POLINV ADVFIND EXAPP DEF_ND LA_AG_CITY LA_BR_CITY MOR MOI DUR DURIF AGT_BLK
PolNo
1 40 0 2008 250000.0 31 0 3 20180903 70001019 0.00 1 0 30 30 13 19680803 1 0 0 0 50000.0000 20080903 1 7 500000.0 1399 21 21 4 1980 2123 0 0 0 0 0 0 0 0 9 9 120 119 0
2 40 0 2008 250000.0 31 0 3 20180903 70001019 0.00 1 0 30 30 13 19680803 1 0 0 0 50000.0000 20080903 1 7 500000.0 1399 21 21 32 1980 2123 0 0 0 0 0 0 0 0 9 9 120 119 0
3 32 1 2008 50000.0 31 3 3 20180905 70001019 0.00 1 0 30 30 13 19760822 1 0 0 0 10000.0000 20080905 1 6 2000000.0 1980 21 21 39 1980 2123 0 0 0 0 0 0 1 0 9 9 120 119 0
4 38 0 2008 250000.0 31 0 3 20140908 70001019 0.00 1 0 14 14 8 19700804 1 0 0 0 50000.0000 20080908 1 6 2400000.0 1980 21 21 32 1980 2123 0 0 0 0 0 0 1 0 9 9 72 72 0
5 48 0 2008 50000.0 31 0 3 20110911 70001019 0.00 1 0 5 5 9 19600108 1 0 0 0 10000.0000 20080911 1 2 0.0 1399 27 27 20 1980 2123 0 0 0 0 0 0 0 0 9 9 36 36 0
using Pandas
dpli_data1=Pandas.DataFrame(dpli_data)
Pandas.head(dpli_data1)
Age Sex YOR SA Plan ... MOR MOI DUR DURIF AGT_BLK
PolNo ...
1 40 0 2008 250000.0 31 ... 9 9 120 119 0
2 40 0 2008 250000.0 31 ... 9 9 120 119 0
3 32 1 2008 50000.0 31 ... 9 9 120 119 0
4 38 0 2008 250000.0 31 ... 9 9 72 72 0
5 48 0 2008 50000.0 31 ... 9 9 36 36 0
[5 rows x 44 columns]
using DataFrames
dpli_data2=DataFrames.DataFrame(dpli_data1);
typeof(dpli_data2)
DataFrames.DataFrame
DataFrames.head(dpli_data2)
Age Sex YOR SA Plan Mode LOB PTD AGT EXPR Type EMR PT PPT SubStatus DOBLA UW TPCL PML TPML APE DOI CHNL LA_OCC LA_INC LA_CITY LA_ID PO_ID NOMINEE AG_CITY AGT_BR REDF_CITY DTH_CLM POLINV ADVFIND EXAPP DEF_ND LA_AG_CITY LA_BR_CITY MOR MOI DUR DURIF AGT_BLK
1 40 0 2008 250000.0 31 0 3 20180903 70001019 0.0 1 0 30 30 13 19680803 1 0 0 0 50000.0 20080903 1 7 500000.0 1399 21 21 4 1980 2123 0 0 0 0 0 0 0 0 9 9 120 119 0
2 0 31 70001019 0.0 0 3 20180903 1 9 50000.0 0 30 30 13 19680803 1 0 0 0 20080903 500000.0 1 7 1399 250000.0 21 21 4 1980 2123 0 0 0 0 0 0 0 0 40 9 120 119 0 2008
3 31 0 9 50000.0 3 20180903 1 0 9 500000.0 30 30 13 19680803 1 0 0 0 20080903 1 250000.0 7 1399 21 0.0 21 4 1980 2123 0 0 0 0 0 0 0 0 40 0 120 119 0 2008 70001019
4 0 3 9 500000.0 20180903 1 0 30 120 250000.0 30 13 19680803 1 0 0 0 20080903 1 7 0.0 1399 21 21 50000.0 4 1980 2123 0 0 0 0 0 0 0 0 40 0 31 119 0 2008 70001019 9
5 3 20180903 120 250000.0 1 0 30 30 119 0.0 13 19680803 1 0 0 0 20080903 1 7 1399 50000.0 21 21 4 500000.0 1980 2123 0 0 0 0 0 0 0 0 40 0 31 0 0 2008 70001019 9 9
6 20180903 1 119 0.0 0 30 30 13 0 50000.0 19680803 1 0 0 0 20080903 1 7 1399 21 500000.0 21 4 1980 50000.0 2123 0 0 0 0 0 0 0 0 40 0 31 0 3 2008 70001019 9 9 120
(v1.0) pkg> st
Status `C:\Users\chatura\.julia\environments\v1.0\Project.toml`
[28f2ccd6] ApproxFun v0.10.1
[c52e3926] Atom v0.7.6
[6e4b80f9] BenchmarkTools v0.4.1
[a74b3585] Blosc v0.5.1
[336ed68f] CSV v0.4.2
[5d742f6a] CSVFiles v0.10.0
[aaaa29a8] Clustering v0.12.1
[944b1d66] CodecZlib v0.5.1
[f3117721] CombineML v1.1.1
[8f4d0f93] Conda v1.1.1
[a93c6f00] DataFrames v0.14.1
[1313f7d8] DataFramesMeta v0.4.0
[7806a523] DecisionTree v0.8.1
[31c24e10] Distributions v0.16.4
[587475ba] Flux v0.6.8
[da1fdf0e] FreqTables v0.3.0
[38e38edf] GLM v1.0.1
[c91e804a] Gadfly v1.0.0
[bc5e4493] GitHub v5.0.2
[cd3eb016] HTTP v0.7.1
[7073ff75] IJulia v1.14.0
[033835bb] JLD2 v0.1.2
[682c06a0] JSON v0.19.0
[4076af6c] JuMP v0.18.4
[e5e0dc1b] Juno v0.5.3
[f0e99cf1] MLBase v0.8.0
[9920b226] MLDataPattern v0.5.0
[cc2ba9b6] MLDataUtils v0.4.0
[1914dd2f] MacroTools v0.4.4
[6f286f6a] MultivariateStats v0.6.0
[b8a86587] NearestNeighbors v0.4.2
[429524aa] Optim v0.17.2
[eadc2687] Pandas v1.0.2
[91a5bcdd] Plots v0.21.0
[92933f4c] ProgressMeter v0.6.1
[438e738f] PyCall v1.18.5
[1a8c2f83] Query v0.10.1
[295af30f] Revise v0.7.12
[fdea26ae] SIMD v2.0.1
[3e6341c9] SLEEF v0.5.1
[3646fa90] ScikitLearn v0.5.0
[6e75b9c4] ScikitLearnBase v0.4.1
[60ddc479] StatPlots v0.8.1
[2913bbd2] StatsBase v0.25.0
[37b6cedf] Traceur v0.2.0
[c17dfb99] WinRPM v0.4.2
[009559a3] XGBoost v0.2.0+ #master (https://github.com/dmlc/XGBoost.jl.git)
[37e2e46d] LinearAlgebra
[3fa0cd96] REPL
[9a3f8284] Random
[10745b16] Statistics
Could you try to update your packages? For example, you have Pandas.jl v1.0.2, but the latest released version is v1.0.4. And there have been updates in other underlying packages as well.
And then, any chance you could post the file?
Updated the packages but output is same. I will try to share a sample of data.
1 Like
I have prepared a sample of data but don’t know how/where to upload it.
A git repository could work, eg on Github. Note that you can also use a gist as a git repo to upload binary data.
Thanks. I have created a git repository and uploaded the file with sample data. The file is python hdf file with data key as ‘data’
https://github.com/ajaychat3/Test_data.git
PS: Just to add that if I save the file from pandas.dataframe to CSV and then read it back into DataFrames.dataframe it seem to work fine