Converting Pandas DataFrame to Julia DataFrame?

Pandas.jl is pretty nice. Is there an easy way to convert from Pandas DataFrame to Julia DataFrame?

e.g.

julia> df  = read_csv("iris.csv")
     Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0             1           5.1          3.5           1.4          0.2     setosa
1             2           4.9          3.0           1.4          0.2     setosa
2             3           4.7          3.2           1.3          0.2     setosa
3             4           4.6          3.1           1.5          0.2     setosa
4             5           5.0          3.6           1.4          0.2     setosa
5             6           5.4          3.9           1.7          0.4     setosa
6             7           4.6          3.4           1.4          0.3     setosa
7             8           5.0          3.4           1.5          0.2     setosa
8             9           4.4          2.9           1.4          0.2     setosa
9            10           4.9          3.1           1.5          0.1     setosa
10           11           5.4          3.7           1.5          0.2     setosa
...

julia> typeof(df)
Pandas.DataFrame

As far as I understand, the standard conversion tool is

Provided that there is an interface from pandas.jl. If not, one should be built.

Pandas.jl has support for this, so it should just work.

Not for me…

julia> DataFrames.DataFrame(df)
ERROR: MethodError: no method matching DataFrames.DataFrame(::Pandas.DataFrame)
Closest candidates are:
  DataFrames.DataFrame(::Any, ::DataStreams.Data.Schema, ::Type{S}, ::Bool; reference) where S at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/abstractdataframe/io.jl:295
  DataFrames.DataFrame(::Array{Any,1}, ::DataFrames.Index) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:87
  DataFrames.DataFrame(; kwargs...) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:142

So I tried converting a TypedTable to a DataFrame, as noted in the IterableTables README, and it gives weird results. Perhaps things broke in the transition to 1.0?

julia> t = TypedTables.Table(a = [1, 2, 3], b = [2.0, 4.0, 6.0])
Table with 2 columns and 3 rows:
     a  b
   ┌───────
 1 │ 1  2.0
 2 │ 2  4.0
 3 │ 3  6.0

julia> DataFrames.DataFrame(t)
┌ Warning: passing columns argument with non-AbstractVector entries is deprecated
│   caller = top-level scope at none:0
└ @ Core none:0
1×3 DataFrames.DataFrame
│ Row │ x1               │ x2               │ x3               │
├─────┼──────────────────┼──────────────────┼──────────────────┤
│ 1   │ (a = 1, b = 2.0) │ (a = 2, b = 4.0) │ (a = 3, b = 6.0) │

I did this and it works. I am using Julia 1.0 on Windows 10.

using Pandas

df=read_csv("iris.csv");
typeof(df)

Pandas.DataFrame

df1=DataFrames.DataFrame(df);
typeof(df1)

DataFrame

I replicated the same problem with a new project environment. I’m using Mac but I doubt the OS plays a role in this issue. Do you have the same package versions as mine?

(v1.0) pkg> activate .

(PandasTest) pkg> add Pandas DataFrames
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `Project.toml`
  [a93c6f00] + DataFrames v0.13.1
  [eadc2687] + Pandas v1.0.1
  Updating `Manifest.toml`
  [b99e7846] + BinaryProvider v0.5.0
  [324d7699] + CategoricalArrays v0.3.13
  [944b1d66] + CodecZlib v0.5.0
  [34da2185] + Compat v1.1.0
  [8f4d0f93] + Conda v1.0.1
  [a93c6f00] + DataFrames v0.13.1
  [9a8bc11e] + DataStreams v0.4.1
  [864edb3b] + DataStructures v0.13.0
  [e7dc6d0d] + DataValues v0.4.5
  [82899510] + IteratorInterfaceExtensions v0.1.1
  [682c06a0] + JSON v0.19.0
  [50d2b5c4] + Lazy v0.13.1
  [1914dd2f] + MacroTools v0.4.4
  [e1d29d7a] + Missings v0.3.1
  [bac558e1] + OrderedCollections v1.0.1
  [eadc2687] + Pandas v1.0.1
  [438e738f] + PyCall v1.18.4
  [189a3867] + Reexport v0.2.0
  [a2af1166] + SortingAlgorithms v0.3.1
  [2913bbd2] + StatsBase v0.25.0
  [3783bdb8] + TableTraits v0.3.1
  [382cd787] + TableTraitsUtils v0.3.1
  [3bb67fe8] + TranscodingStreams v0.8.1
  [81def892] + VersionParsing v1.1.2
  [ea10d353] + WeakRefStrings v0.5.3
  [2a0f44e3] + Base64 
  [ade2ca70] + Dates 
  [8bb1440f] + DelimitedFiles 
  [8ba89e20] + Distributed 
  [9fa8497b] + Future 
  [b77e0a4c] + InteractiveUtils 
  [76f85450] + LibGit2 
  [8f399da3] + Libdl 
  [37e2e46d] + LinearAlgebra 
  [56ddb016] + Logging 
  [d6f4376e] + Markdown 
  [a63ad114] + Mmap 
  [44cfe95a] + Pkg 
  [de0858da] + Printf 
  [3fa0cd96] + REPL 
  [9a3f8284] + Random 
  [ea8e919c] + SHA 
  [9e88b42a] + Serialization 
  [1a1011a3] + SharedArrays 
  [6462fe0b] + Sockets 
  [2f01184e] + SparseArrays 
  [10745b16] + Statistics 
  [8dfed614] + Test 
  [cf7118a7] + UUIDs 
  [4ec0a83e] + Unicode 

julia> using Pandas: read_csv

julia> using DataFrames: DataFrame

julia> df1 = read_csv("iris.csv"); 

julia> df2 = DataFrame(df1)
ERROR: MethodError: no method matching DataFrame(::Pandas.DataFrame)
Closest candidates are:
  DataFrame(::Any, ::DataStreams.Data.Schema, ::Type{S}, ::Bool; reference) where S at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/abstractdataframe/io.jl:295
  DataFrame(::Array{Any,1}, ::DataFrames.Index) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:87
  DataFrame(; kwargs...) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:142

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4

Try with DataFrames master

@tk3369

I don’t know what is happening. Today I ran the same commands and they threw an error:

using Pandas

df=read_csv("iris.csv");
@show typeof(df)

Pandas.DataFrame

using DataFrames,Data
df1=DataFrames.DataFrame(df);
typeof(df1)

MethodError: no method matching DataFrames.DataFrame(::Pandas.DataFrame)
Closest candidates are:
  DataFrames.DataFrame(::Any, !Matched::DataStreams.Data.Schema, !Matched::Type{S}, !Matched::Bool; reference) where S at C:\Users\chatura\.julia\packages\DataFrames\utxEh\src\abstractdataframe\io.jl:295
  DataFrames.DataFrame(!Matched::Array{Any,1}, !Matched::DataFrames.Index) at C:\Users\chatura\.julia\packages\DataFrames\utxEh\src\dataframe\dataframe.jl:87
  DataFrames.DataFrame(; kwargs...) at C:\Users\chatura\.julia\packages\DataFrames\utxEh\src\dataframe\dataframe.jl:142
  ...

Stacktrace:
 [1] top-level scope at In[4]:2

It looks like you’re not using master, which is the same problem I had before.

See if you have this:

(v1.0) pkg> st
  [a93c6f00] DataFrames v0.13.1+ #master (https://github.com/JuliaData/DataFrames.jl.git)

If not, just switch over:

] add DataFrames#master

Yes, it works after switching to master.

So the conversion between Pandas.jl and DataFrames.jl should just work on Julia 1.0:

(foo) pkg> st
    Status `C:\Users\david\.julia\environments\foo\Project.toml`
  [a93c6f00] DataFrames v0.14.0
  [eadc2687] Pandas v1.0.2

julia> using DataFrames, Pandas

julia> df = DataFrames.DataFrame(a=rand(10), b=rand(10))
10×2 DataFrames.DataFrame
│ Row │ a         │ b         │
│     │ Float64   │ Float64   │
├─────┼───────────┼───────────┤
│ 1   │ 0.098737  │ 0.912536  │
│ 2   │ 0.66538   │ 0.770032  │
│ 3   │ 0.767376  │ 0.635237  │
│ 4   │ 0.353171  │ 0.169174  │
│ 5   │ 0.330284  │ 0.453514  │
│ 6   │ 0.363861  │ 0.64091   │
│ 7   │ 0.622878  │ 0.672581  │
│ 8   │ 0.0130092 │ 0.0542869 │
│ 9   │ 0.779855  │ 0.0753927 │
│ 10  │ 0.943342  │ 0.395862  │

julia> pd = Pandas.DataFrame(df)
          a         b
0  0.098737  0.912536
1  0.665380  0.770032
2  0.767376  0.635237
3  0.353171  0.169174
4  0.330284  0.453514
5  0.363861  0.640910
6  0.622878  0.672581
7  0.013009  0.054287
8  0.779855  0.075393
9  0.943342  0.395862


julia> DataFrames.DataFrame(pd)
10×2 DataFrames.DataFrame
│ Row │ a         │ b         │
│     │ Float64   │ Float64   │
├─────┼───────────┼───────────┤
│ 1   │ 0.098737  │ 0.912536  │
│ 2   │ 0.66538   │ 0.770032  │
│ 3   │ 0.767376  │ 0.635237  │
│ 4   │ 0.353171  │ 0.169174  │
│ 5   │ 0.330284  │ 0.453514  │
│ 6   │ 0.363861  │ 0.64091   │
│ 7   │ 0.622878  │ 0.672581  │
│ 8   │ 0.0130092 │ 0.0542869 │
│ 9   │ 0.779855  │ 0.0753927 │
│ 10  │ 0.943342  │ 0.395862  │

julia>

The TableTraits.jl/IterableTables.jl integration for TypedTables.jl does not yet work on Julia 1.0.
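Until that integration works, one possible workaround is to build the columns by hand, so that each field becomes a column instead of each row becoming a column. This is only a minimal sketch: it assumes a recent DataFrames.jl (the `df[!, name]` syntax), uses a plain vector of NamedTuples as a stand-in for the rows of the table above, and `rows_to_dataframe` is a hypothetical helper name, not part of any package.

```julia
using DataFrames

# Stand-in for the rows of the table above: each row is a NamedTuple.
rows = [(a = 1, b = 2.0), (a = 2, b = 4.0), (a = 3, b = 6.0)]

# Hypothetical helper: create one column per field name by collecting
# that field from every row.
function rows_to_dataframe(rows)
    df = DataFrame()
    for name in keys(first(rows))
        df[!, name] = [getproperty(row, name) for row in rows]
    end
    return df
end

df = rows_to_dataframe(rows)
```

The result has three rows and the columns `a` and `b`, rather than the single row of NamedTuples shown earlier.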

I’m on DataFrames#master and I’m getting this error:

ERROR: PyError ($(Expr(:escape, :(ccall(#= /Users/brilhana/.julia/packages/PyCall/0jMpb/src/pyfncall.jl:44 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'KeyError'> KeyError('0')

Creating a Pandas.DataFrame works fine but conversion to a DataFrames.DataFrame fails.

Any chance you could provide a small code example that constructs a minimal Pandas.DataFrame that then can’t be converted to a DataFrames.DataFrame?

using DataFrames, HTTP, Pandas

const URL = "https://api.bitfinex.com/v1"

function symbols()
    r = HTTP.get("$URL/symbols")
    s = String(r.body)
    return Pandas.read_json(s) # Creates a Pandas.DataFrame.
end

df = symbols()

df2 = DataFrames.DataFrame(df) # This will fail.

I tracked this down to a weird indexing situation in Pandas.jl. For now, it might be enough to rename the column, but I’m not sure.
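If the pandas round trip is not essential for this payload, another option is to skip it and parse the JSON directly. The sketch below is not a fix for the Pandas.jl indexing issue. It assumes the endpoint returns a flat JSON array of strings, uses a hard-coded string as a stand-in for `String(r.body)`, and picks the column name `symbol` purely for illustration.

```julia
using DataFrames, JSON

# Stand-in for `String(r.body)`; assumed shape: a flat JSON array of strings.
s = """["btcusd", "ltcusd", "ethusd"]"""

# JSON.parse returns a Vector{Any}; narrow it to Vector{String}
# and give the single column an explicit name.
symbols = String.(JSON.parse(s))
df = DataFrame(symbol = symbols)
```

Naming the column explicitly avoids depending on whatever default column label pandas would assign.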

I want to highlight a separate but related issue. I am using Julia to read an HDF file created in Python. I then convert it to a Pandas DataFrame, which seems to work fine. When I subsequently try to convert this to a DataFrames.DataFrame (which I cannot do directly from Python), the output I get is all messed up.

I am using Julia 1.0, and the list of package versions is given below.

using PyCall: @pyimport
@pyimport pandas as pd

#import os
file=joinpath(pwd(),"Data") 
# dpli_data = pd.read_csv( os.path.join(os.getcwd(), "dpli_data_Final.csv"),low_memory=False)
data_store = pd.HDFStore(joinpath(file,"Dpli_data_warehouse.h5"))

# ########Retrieve data using key
dpli_data = data_store["data_0818"];

# # data_store.close()
@show typeof(dpli_data)
typeof(dpli_data) = PyCall.PyObject

dpli_data
   Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
PolNo																																												
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	32	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
3	32	1	2008	50000.0	31	3	3	20180905	70001019	0.00	1	0	30	30	13	19760822	1	0	0	0	10000.0000	20080905	1	6	2000000.0	1980	21	21	39	1980	2123	0	0	0	0	0	0	1	0	9	9	120	119	0
4	38	0	2008	250000.0	31	0	3	20140908	70001019	0.00	1	0	14	14	8	19700804	1	0	0	0	50000.0000	20080908	1	6	2400000.0	1980	21	21	32	1980	2123	0	0	0	0	0	0	1	0	9	9	72	72	0
5	48	0	2008	50000.0	31	0	3	20110911	70001019	0.00	1	0	5	5	9	19600108	1	0	0	0	10000.0000	20080911	1	2	0.0	1399	27	27	20	1980	2123	0	0	0	0	0	0	0	0	9	9	36	36	0

using Pandas
dpli_data1=Pandas.DataFrame(dpli_data)
Pandas.head(dpli_data1)

        Age  Sex   YOR        SA  Plan   ...     MOR  MOI  DUR  DURIF  AGT_BLK
PolNo                                   ...                                  
1       40    0  2008  250000.0    31   ...       9    9  120    119        0
2       40    0  2008  250000.0    31   ...       9    9  120    119        0
3       32    1  2008   50000.0    31   ...       9    9  120    119        0
4       38    0  2008  250000.0    31   ...       9    9   72     72        0
5       48    0  2008   50000.0    31   ...       9    9   36     36        0

[5 rows x 44 columns]

using DataFrames
dpli_data2=DataFrames.DataFrame(dpli_data1);
typeof(dpli_data2)
DataFrames.DataFrame

DataFrames.head(dpli_data2)

	Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.0	1	0	30	30	13	19680803	1	0	0	0	50000.0	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	0	31	70001019	0.0	0	3	20180903	1	9	50000.0	0	30	30	13	19680803	1	0	0	0	20080903	500000.0	1	7	1399	250000.0	21	21	4	1980	2123	0	0	0	0	0	0	0	0	40	9	120	119	0	2008
3	31	0	9	50000.0	3	20180903	1	0	9	500000.0	30	30	13	19680803	1	0	0	0	20080903	1	250000.0	7	1399	21	0.0	21	4	1980	2123	0	0	0	0	0	0	0	0	40	0	120	119	0	2008	70001019
4	0	3	9	500000.0	20180903	1	0	30	120	250000.0	30	13	19680803	1	0	0	0	20080903	1	7	0.0	1399	21	21	50000.0	4	1980	2123	0	0	0	0	0	0	0	0	40	0	31	119	0	2008	70001019	9
5	3	20180903	120	250000.0	1	0	30	30	119	0.0	13	19680803	1	0	0	0	20080903	1	7	1399	50000.0	21	21	4	500000.0	1980	2123	0	0	0	0	0	0	0	0	40	0	31	0	0	2008	70001019	9	9
6	20180903	1	119	0.0	0	30	30	13	0	50000.0	19680803	1	0	0	0	20080903	1	7	1399	21	500000.0	21	4	1980	50000.0	2123	0	0	0	0	0	0	0	0	40	0	31	0	3	2008	70001019	9	9	120

(v1.0) pkg> st
    Status `C:\Users\chatura\.julia\environments\v1.0\Project.toml`
  [28f2ccd6] ApproxFun v0.10.1
  [c52e3926] Atom v0.7.6
  [6e4b80f9] BenchmarkTools v0.4.1
  [a74b3585] Blosc v0.5.1
  [336ed68f] CSV v0.4.2
  [5d742f6a] CSVFiles v0.10.0
  [aaaa29a8] Clustering v0.12.1
  [944b1d66] CodecZlib v0.5.1
  [f3117721] CombineML v1.1.1
  [8f4d0f93] Conda v1.1.1
  [a93c6f00] DataFrames v0.14.1
  [1313f7d8] DataFramesMeta v0.4.0
  [7806a523] DecisionTree v0.8.1
  [31c24e10] Distributions v0.16.4
  [587475ba] Flux v0.6.8
  [da1fdf0e] FreqTables v0.3.0
  [38e38edf] GLM v1.0.1
  [c91e804a] Gadfly v1.0.0
  [bc5e4493] GitHub v5.0.2
  [cd3eb016] HTTP v0.7.1
  [7073ff75] IJulia v1.14.0
  [033835bb] JLD2 v0.1.2
  [682c06a0] JSON v0.19.0
  [4076af6c] JuMP v0.18.4
  [e5e0dc1b] Juno v0.5.3
  [f0e99cf1] MLBase v0.8.0
  [9920b226] MLDataPattern v0.5.0
  [cc2ba9b6] MLDataUtils v0.4.0
  [1914dd2f] MacroTools v0.4.4
  [6f286f6a] MultivariateStats v0.6.0
  [b8a86587] NearestNeighbors v0.4.2
  [429524aa] Optim v0.17.2
  [eadc2687] Pandas v1.0.2
  [91a5bcdd] Plots v0.21.0
  [92933f4c] ProgressMeter v0.6.1
  [438e738f] PyCall v1.18.5
  [1a8c2f83] Query v0.10.1
  [295af30f] Revise v0.7.12
  [fdea26ae] SIMD v2.0.1
  [3e6341c9] SLEEF v0.5.1
  [3646fa90] ScikitLearn v0.5.0
  [6e75b9c4] ScikitLearnBase v0.4.1
  [60ddc479] StatPlots v0.8.1
  [2913bbd2] StatsBase v0.25.0
  [37b6cedf] Traceur v0.2.0
  [c17dfb99] WinRPM v0.4.2
  [009559a3] XGBoost v0.2.0+ #master (https://github.com/dmlc/XGBoost.jl.git)
  [37e2e46d] LinearAlgebra
  [3fa0cd96] REPL
  [9a3f8284] Random
  [10745b16] Statistics

The discussion here is about converting from Pandas.jl to DataFrames. I actually have a Python pandas dataframe (a PyObject). I suppose support for it would need to be added to IterableTables.jl too? Or should such a conversion live in Pandas.jl, or does it already? I guess in the long term I should get rid of all the Python code (I’m porting in phases).

julia> using IterableTables

julia> p_df = DataFrame(py"df_SA")
ERROR: ArgumentError: 'PyObject' iterates 'String' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
Stacktrace:
 [1] invalidtable(::PyObject, ::String) at /home/pharaldsson_sym/.julia/packages/Tables/okt7x/src/tofromdatavalues.jl:42
 [2] iterate at /home/pharaldsson_sym/.julia/packages/Tables/okt7x/src/tofromdatavalues.jl:48 [inlined]
 [3] buildcolumns at /home/pharaldsson_sym/.julia/packages/Tables/okt7x/src/fallbacks.jl:185 [inlined]
 [4] columns at /home/pharaldsson_sym/.julia/packages/Tables/okt7x/src/fallbacks.jl:237 [inlined]
 [5] DataFrame(::PyObject; copycols::Bool) at /home/pharaldsson_sym/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
 [6] DataFrame(::PyObject) at /home/pharaldsson_sym/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
 [7] top-level scope at /home/pharaldsson_sym/.julia/packages/PyCall/zqDXB/src/pyeval.jl:232

I have a similar requirement. As far as I can tell, Pandas.jl has no function for reading Parquet files, such as read_parquet(). So I read the Parquet file with the Python pandas library, but I couldn’t convert the result to a Julia DataFrame.
Any suggestions or updates?

PandasLite.jl implements read_parquet. See also post #2 by lungben in the "Julia data storage" thread.

For anyone coming here from Google to convert Pandas dataframes to DataFrames.DataFrame: this post from another thread contains a function that can do this.
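For a Python pandas dataframe held as a `PyObject` (the case that failed with the `AbstractRow` error earlier in the thread), a column-by-column copy through PyCall is one way to do it. This is a minimal sketch and not necessarily the function from the linked post. It assumes PyCall with a Python that has pandas installed and a recent DataFrames.jl. `pandas_to_dataframe` is a hypothetical helper name. The row index (for example `PolNo`) is not copied, and non-numeric dtypes may need extra conversion.

```julia
using PyCall, DataFrames

# Hypothetical helper: copy a Python pandas DataFrame (a PyObject)
# into a DataFrames.DataFrame, one column at a time.
function pandas_to_dataframe(df_py::PyObject)
    df = DataFrame()
    for col in df_py.columns
        # get(o, PyObject, key) is Python's o[key] without conversion;
        # `.values` is the column's NumPy array, which PyCall converts
        # to a Julia array for common numeric dtypes.
        df[!, Symbol(col)] = get(df_py, PyObject, col).values
    end
    return df
end

# Small usage example with made-up data.
pd = pyimport("pandas")
df_py = pd.DataFrame(Dict("a" => [1, 2, 3], "b" => [2.0, 4.0, 6.0]))
df = pandas_to_dataframe(df_py)
```

Going through `Symbol(col)` also turns a non-string column label such as `0` into a usable column name. The column order follows whatever order pandas reports.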