Converting Pandas DataFrame to Julia DataFrame?

Pandas.jl is pretty nice. Is there an easy way to convert from Pandas DataFrame to Julia DataFrame?

e.g.

julia> df  = read_csv("iris.csv")
     Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0             1           5.1          3.5           1.4          0.2     setosa
1             2           4.9          3.0           1.4          0.2     setosa
2             3           4.7          3.2           1.3          0.2     setosa
3             4           4.6          3.1           1.5          0.2     setosa
4             5           5.0          3.6           1.4          0.2     setosa
5             6           5.4          3.9           1.7          0.4     setosa
6             7           4.6          3.4           1.4          0.3     setosa
7             8           5.0          3.4           1.5          0.2     setosa
8             9           4.4          2.9           1.4          0.2     setosa
9            10           4.9          3.1           1.5          0.1     setosa
10           11           5.4          3.7           1.5          0.2     setosa
...

julia> typeof(df)
Pandas.DataFrame

As far as I understand, the standard conversion tool is

Provided that there is an interface from Pandas.jl. If not, one should be built.

Pandas.jl has support for this, so it should just work.

Not for me…

julia> DataFrames.DataFrame(df)
ERROR: MethodError: no method matching DataFrames.DataFrame(::Pandas.DataFrame)
Closest candidates are:
  DataFrames.DataFrame(::Any, ::DataStreams.Data.Schema, ::Type{S}, ::Bool; reference) where S at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/abstractdataframe/io.jl:295
  DataFrames.DataFrame(::Array{Any,1}, ::DataFrames.Index) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:87
  DataFrames.DataFrame(; kwargs...) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:142

So I tried to convert a TypedTable to DataFrame as noted in IterableTables’s README file and it’s giving weird results. Perhaps things got broken in the transition to 1.0?

julia> t = TypedTables.Table(a = [1, 2, 3], b = [2.0, 4.0, 6.0])
Table with 2 columns and 3 rows:
     a  b
   ┌───────
 1 │ 1  2.0
 2 │ 2  4.0
 3 │ 3  6.0

julia> DataFrames.DataFrame(t)
┌ Warning: passing columns argument with non-AbstractVector entries is deprecated
│   caller = top-level scope at none:0
└ @ Core none:0
1×3 DataFrames.DataFrame
│ Row │ x1               │ x2               │ x3               │
├─────┼──────────────────┼──────────────────┼──────────────────┤
│ 1   │ (a = 1, b = 2.0) │ (a = 2, b = 4.0) │ (a = 3, b = 6.0) │

I did this and it works. I am using Julia 1.0 on Windows 10.

using Pandas

df=read_csv("iris.csv");
typeof(df)

Pandas.DataFrame

df1=DataFrames.DataFrame(df);
typeof(df1)

DataFrame

I replicated the same problem with a new project environment. I’m using Mac but I doubt the OS plays a role in this issue. Do you have the same package versions as mine?

(v1.0) pkg> activate .

(PandasTest) pkg> add Pandas DataFrames
  Updating registry at `~/.julia/registries/General`
  Updating git-repo `https://github.com/JuliaRegistries/General.git`
 Resolving package versions...
  Updating `Project.toml`
  [a93c6f00] + DataFrames v0.13.1
  [eadc2687] + Pandas v1.0.1
  Updating `Manifest.toml`
  [b99e7846] + BinaryProvider v0.5.0
  [324d7699] + CategoricalArrays v0.3.13
  [944b1d66] + CodecZlib v0.5.0
  [34da2185] + Compat v1.1.0
  [8f4d0f93] + Conda v1.0.1
  [a93c6f00] + DataFrames v0.13.1
  [9a8bc11e] + DataStreams v0.4.1
  [864edb3b] + DataStructures v0.13.0
  [e7dc6d0d] + DataValues v0.4.5
  [82899510] + IteratorInterfaceExtensions v0.1.1
  [682c06a0] + JSON v0.19.0
  [50d2b5c4] + Lazy v0.13.1
  [1914dd2f] + MacroTools v0.4.4
  [e1d29d7a] + Missings v0.3.1
  [bac558e1] + OrderedCollections v1.0.1
  [eadc2687] + Pandas v1.0.1
  [438e738f] + PyCall v1.18.4
  [189a3867] + Reexport v0.2.0
  [a2af1166] + SortingAlgorithms v0.3.1
  [2913bbd2] + StatsBase v0.25.0
  [3783bdb8] + TableTraits v0.3.1
  [382cd787] + TableTraitsUtils v0.3.1
  [3bb67fe8] + TranscodingStreams v0.8.1
  [81def892] + VersionParsing v1.1.2
  [ea10d353] + WeakRefStrings v0.5.3
  [2a0f44e3] + Base64 
  [ade2ca70] + Dates 
  [8bb1440f] + DelimitedFiles 
  [8ba89e20] + Distributed 
  [9fa8497b] + Future 
  [b77e0a4c] + InteractiveUtils 
  [76f85450] + LibGit2 
  [8f399da3] + Libdl 
  [37e2e46d] + LinearAlgebra 
  [56ddb016] + Logging 
  [d6f4376e] + Markdown 
  [a63ad114] + Mmap 
  [44cfe95a] + Pkg 
  [de0858da] + Printf 
  [3fa0cd96] + REPL 
  [9a3f8284] + Random 
  [ea8e919c] + SHA 
  [9e88b42a] + Serialization 
  [1a1011a3] + SharedArrays 
  [6462fe0b] + Sockets 
  [2f01184e] + SparseArrays 
  [10745b16] + Statistics 
  [8dfed614] + Test 
  [cf7118a7] + UUIDs 
  [4ec0a83e] + Unicode 

julia> using Pandas: read_csv

julia> using DataFrames: DataFrame

julia> df1 = read_csv("iris.csv"); 

julia> df2 = DataFrame(df1)
ERROR: MethodError: no method matching DataFrame(::Pandas.DataFrame)
Closest candidates are:
  DataFrame(::Any, ::DataStreams.Data.Schema, ::Type{S}, ::Bool; reference) where S at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/abstractdataframe/io.jl:295
  DataFrame(::Array{Any,1}, ::DataFrames.Index) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:87
  DataFrame(; kwargs...) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:142

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4

Try with DataFrames master

@tk3369

I don’t know what is happening. Today I ran the same command and it throws an error:

using Pandas

df=read_csv("iris.csv");
@show typeof(df)

Pandas.DataFrame

using DataFrames,Data
df1=DataFrames.DataFrame(df);
typeof(df1)

MethodError: no method matching DataFrames.DataFrame(::Pandas.DataFrame)
Closest candidates are:
  DataFrames.DataFrame(::Any, !Matched::DataStreams.Data.Schema, !Matched::Type{S}, !Matched::Bool; reference) where S at C:\Users\chatura\.julia\packages\DataFrames\utxEh\src\abstractdataframe\io.jl:295
  DataFrames.DataFrame(!Matched::Array{Any,1}, !Matched::DataFrames.Index) at C:\Users\chatura\.julia\packages\DataFrames\utxEh\src\dataframe\dataframe.jl:87
  DataFrames.DataFrame(; kwargs...) at C:\Users\chatura\.julia\packages\DataFrames\utxEh\src\dataframe\dataframe.jl:142
  ...

Stacktrace:
 [1] top-level scope at In[4]:2

It looks like you’re not using master, which is the same problem I had before.

See if you have this:

(v1.0) pkg> st
  [a93c6f00] DataFrames v0.13.1+ #master (https://github.com/JuliaData/DataFrames.jl.git)

If not, just switch over:

] add DataFrames#master

Yes it works after using master.

So the conversion between Pandas.jl and DataFrames.jl should just work on Julia 1.0:

(foo) pkg> st
    Status `C:\Users\david\.julia\environments\foo\Project.toml`
  [a93c6f00] DataFrames v0.14.0
  [eadc2687] Pandas v1.0.2

julia> using DataFrames, Pandas

julia> df = DataFrames.DataFrame(a=rand(10), b=rand(10))
10×2 DataFrames.DataFrame
│ Row │ a         │ b         │
│     │ Float64   │ Float64   │
├─────┼───────────┼───────────┤
│ 1   │ 0.098737  │ 0.912536  │
│ 2   │ 0.66538   │ 0.770032  │
│ 3   │ 0.767376  │ 0.635237  │
│ 4   │ 0.353171  │ 0.169174  │
│ 5   │ 0.330284  │ 0.453514  │
│ 6   │ 0.363861  │ 0.64091   │
│ 7   │ 0.622878  │ 0.672581  │
│ 8   │ 0.0130092 │ 0.0542869 │
│ 9   │ 0.779855  │ 0.0753927 │
│ 10  │ 0.943342  │ 0.395862  │

julia> pd = Pandas.DataFrame(df)
          a         b
0  0.098737  0.912536
1  0.665380  0.770032
2  0.767376  0.635237
3  0.353171  0.169174
4  0.330284  0.453514
5  0.363861  0.640910
6  0.622878  0.672581
7  0.013009  0.054287
8  0.779855  0.075393
9  0.943342  0.395862


julia> DataFrames.DataFrame(pd)
10×2 DataFrames.DataFrame
│ Row │ a         │ b         │
│     │ Float64   │ Float64   │
├─────┼───────────┼───────────┤
│ 1   │ 0.098737  │ 0.912536  │
│ 2   │ 0.66538   │ 0.770032  │
│ 3   │ 0.767376  │ 0.635237  │
│ 4   │ 0.353171  │ 0.169174  │
│ 5   │ 0.330284  │ 0.453514  │
│ 6   │ 0.363861  │ 0.64091   │
│ 7   │ 0.622878  │ 0.672581  │
│ 8   │ 0.0130092 │ 0.0542869 │
│ 9   │ 0.779855  │ 0.0753927 │
│ 10  │ 0.943342  │ 0.395862  │

julia>

The TableTraits.jl/IterableTables.jl integration for TypedTables.jl does not yet work on Julia 1.0.

I’m on DataFrames#master and I’m getting this error:

ERROR: PyError ($(Expr(:escape, :(ccall(#= /Users/brilhana/.julia/packages/PyCall/0jMpb/src/pyfncall.jl:44 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'KeyError'> KeyError('0')

Creating a Pandas.DataFrame works fine but conversion to a DataFrames.DataFrame fails.

Any chance you could provide a small code example that constructs a minimal Pandas.DataFrame that then can’t be converted to a DataFrames.DataFrame?

using DataFrames, HTTP, Pandas

const URL = "https://api.bitfinex.com/v1"

function symbols()
    r = HTTP.get("$URL/symbols")
    s = String(r.body)
    return Pandas.read_json(s) # Creates a Pandas.DataFrame.
end

df = symbols()

df2 = DataFrames.DataFrame(df) # This will fail.

I tracked this down to a weird indexing situation in Pandas.jl. For now, it might be enough to rename the column, but I’m not sure.

I am highlighting a separate but related issue. I am using Julia to read an HDF file created in Python. I then convert it to a Pandas DataFrame, which seems to work fine. Subsequently I try to convert this to a DataFrames.DataFrame (which I cannot do directly from Python), and the output I get is all messed up.

I am using Julia 1.0; the list of package versions is given below.

using PyCall: @pyimport
@pyimport pandas as pd

#import os
file=joinpath(pwd(),"Data") 
# dpli_data = pd.read_csv( os.path.join(os.getcwd(), "dpli_data_Final.csv"),low_memory=False)
data_store = pd.HDFStore(joinpath(file,"Dpli_data_warehouse.h5"))

# ########Retrieve data using key
dpli_data = data_store["data_0818"];

# # data_store.close()
@show typeof(dpli_data)
typeof(dpli_data) = PyCall.PyObject

dpli_data
   Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
PolNo																																												
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	40	0	2008	250000.0	31	0	3	20180903	70001019	0.00	1	0	30	30	13	19680803	1	0	0	0	50000.0000	20080903	1	7	500000.0	1399	21	21	32	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
3	32	1	2008	50000.0	31	3	3	20180905	70001019	0.00	1	0	30	30	13	19760822	1	0	0	0	10000.0000	20080905	1	6	2000000.0	1980	21	21	39	1980	2123	0	0	0	0	0	0	1	0	9	9	120	119	0
4	38	0	2008	250000.0	31	0	3	20140908	70001019	0.00	1	0	14	14	8	19700804	1	0	0	0	50000.0000	20080908	1	6	2400000.0	1980	21	21	32	1980	2123	0	0	0	0	0	0	1	0	9	9	72	72	0
5	48	0	2008	50000.0	31	0	3	20110911	70001019	0.00	1	0	5	5	9	19600108	1	0	0	0	10000.0000	20080911	1	2	0.0	1399	27	27	20	1980	2123	0	0	0	0	0	0	0	0	9	9	36	36	0

using Pandas
dpli_data1=Pandas.DataFrame(dpli_data)
Pandas.head(dpli_data1)

        Age  Sex   YOR        SA  Plan   ...     MOR  MOI  DUR  DURIF  AGT_BLK
PolNo                                   ...                                  
1       40    0  2008  250000.0    31   ...       9    9  120    119        0
2       40    0  2008  250000.0    31   ...       9    9  120    119        0
3       32    1  2008   50000.0    31   ...       9    9  120    119        0
4       38    0  2008  250000.0    31   ...       9    9   72     72        0
5       48    0  2008   50000.0    31   ...       9    9   36     36        0

[5 rows x 44 columns]

using DataFrames
dpli_data2=DataFrames.DataFrame(dpli_data1);
typeof(dpli_data2)
DataFrames.DataFrame

DataFrames.head(dpli_data2)

	Age	Sex	YOR	SA	Plan	Mode	LOB	PTD	AGT	EXPR	Type	EMR	PT	PPT	SubStatus	DOBLA	UW	TPCL	PML	TPML	APE	DOI	CHNL	LA_OCC	LA_INC	LA_CITY	LA_ID	PO_ID	NOMINEE	AG_CITY	AGT_BR	REDF_CITY	DTH_CLM	POLINV	ADVFIND	EXAPP	DEF_ND	LA_AG_CITY	LA_BR_CITY	MOR	MOI	DUR	DURIF	AGT_BLK
1	40	0	2008	250000.0	31	0	3	20180903	70001019	0.0	1	0	30	30	13	19680803	1	0	0	0	50000.0	20080903	1	7	500000.0	1399	21	21	4	1980	2123	0	0	0	0	0	0	0	0	9	9	120	119	0
2	0	31	70001019	0.0	0	3	20180903	1	9	50000.0	0	30	30	13	19680803	1	0	0	0	20080903	500000.0	1	7	1399	250000.0	21	21	4	1980	2123	0	0	0	0	0	0	0	0	40	9	120	119	0	2008
3	31	0	9	50000.0	3	20180903	1	0	9	500000.0	30	30	13	19680803	1	0	0	0	20080903	1	250000.0	7	1399	21	0.0	21	4	1980	2123	0	0	0	0	0	0	0	0	40	0	120	119	0	2008	70001019
4	0	3	9	500000.0	20180903	1	0	30	120	250000.0	30	13	19680803	1	0	0	0	20080903	1	7	0.0	1399	21	21	50000.0	4	1980	2123	0	0	0	0	0	0	0	0	40	0	31	119	0	2008	70001019	9
5	3	20180903	120	250000.0	1	0	30	30	119	0.0	13	19680803	1	0	0	0	20080903	1	7	1399	50000.0	21	21	4	500000.0	1980	2123	0	0	0	0	0	0	0	0	40	0	31	0	0	2008	70001019	9	9
6	20180903	1	119	0.0	0	30	30	13	0	50000.0	19680803	1	0	0	0	20080903	1	7	1399	21	500000.0	21	4	1980	50000.0	2123	0	0	0	0	0	0	0	0	40	0	31	0	3	2008	70001019	9	9	120

(v1.0) pkg> st
    Status `C:\Users\chatura\.julia\environments\v1.0\Project.toml`
  [28f2ccd6] ApproxFun v0.10.1
  [c52e3926] Atom v0.7.6
  [6e4b80f9] BenchmarkTools v0.4.1
  [a74b3585] Blosc v0.5.1
  [336ed68f] CSV v0.4.2
  [5d742f6a] CSVFiles v0.10.0
  [aaaa29a8] Clustering v0.12.1
  [944b1d66] CodecZlib v0.5.1
  [f3117721] CombineML v1.1.1
  [8f4d0f93] Conda v1.1.1
  [a93c6f00] DataFrames v0.14.1
  [1313f7d8] DataFramesMeta v0.4.0
  [7806a523] DecisionTree v0.8.1
  [31c24e10] Distributions v0.16.4
  [587475ba] Flux v0.6.8
  [da1fdf0e] FreqTables v0.3.0
  [38e38edf] GLM v1.0.1
  [c91e804a] Gadfly v1.0.0
  [bc5e4493] GitHub v5.0.2
  [cd3eb016] HTTP v0.7.1
  [7073ff75] IJulia v1.14.0
  [033835bb] JLD2 v0.1.2
  [682c06a0] JSON v0.19.0
  [4076af6c] JuMP v0.18.4
  [e5e0dc1b] Juno v0.5.3
  [f0e99cf1] MLBase v0.8.0
  [9920b226] MLDataPattern v0.5.0
  [cc2ba9b6] MLDataUtils v0.4.0
  [1914dd2f] MacroTools v0.4.4
  [6f286f6a] MultivariateStats v0.6.0
  [b8a86587] NearestNeighbors v0.4.2
  [429524aa] Optim v0.17.2
  [eadc2687] Pandas v1.0.2
  [91a5bcdd] Plots v0.21.0
  [92933f4c] ProgressMeter v0.6.1
  [438e738f] PyCall v1.18.5
  [1a8c2f83] Query v0.10.1
  [295af30f] Revise v0.7.12
  [fdea26ae] SIMD v2.0.1
  [3e6341c9] SLEEF v0.5.1
  [3646fa90] ScikitLearn v0.5.0
  [6e75b9c4] ScikitLearnBase v0.4.1
  [60ddc479] StatPlots v0.8.1
  [2913bbd2] StatsBase v0.25.0
  [37b6cedf] Traceur v0.2.0
  [c17dfb99] WinRPM v0.4.2
  [009559a3] XGBoost v0.2.0+ #master (https://github.com/dmlc/XGBoost.jl.git)
  [37e2e46d] LinearAlgebra
  [3fa0cd96] REPL
  [9a3f8284] Random
  [10745b16] Statistics

The discussion here is about converting from Pandas, meaning Pandas.jl, to DataFrame. I actually have a Python pandas dataframe. I suppose support for it would need to be added to IterableTables.jl too? Or should such a conversion live in Pandas.jl, or is it already there? I guess long term I should get rid of all the Python code (I’m porting, in phases).

julia> using IterableTables

julia> p_df = DataFrame(py"df_SA")
ERROR: ArgumentError: 'PyObject' iterates 'String' values, which doesn't satisfy the Tables.jl `AbstractRow` interface
Stacktrace:
 [1] invalidtable(::PyObject, ::String) at /home/pharaldsson_sym/.julia/packages/Tables/okt7x/src/tofromdatavalues.jl:42
 [2] iterate at /home/pharaldsson_sym/.julia/packages/Tables/okt7x/src/tofromdatavalues.jl:48 [inlined]
 [3] buildcolumns at /home/pharaldsson_sym/.julia/packages/Tables/okt7x/src/fallbacks.jl:185 [inlined]
 [4] columns at /home/pharaldsson_sym/.julia/packages/Tables/okt7x/src/fallbacks.jl:237 [inlined]
 [5] DataFrame(::PyObject; copycols::Bool) at /home/pharaldsson_sym/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
 [6] DataFrame(::PyObject) at /home/pharaldsson_sym/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
 [7] top-level scope at /home/pharaldsson_sym/.julia/packages/PyCall/zqDXB/src/pyeval.jl:232

I have a similar requirement. As far as I can tell, Pandas.jl has no supported function for reading parquet files, such as read_parquet(). So I read the parquet file with the Python pandas library, but I couldn’t convert the result to a Julia DataFrame.
Any suggestions/updates?

PandasLite.jl implements read_parquet; see also here: Julia data storage - #2 by lungben

For anyone coming here from Google to convert pandas dataframes to DataFrames.DataFrame, this post from another thread contains a function which can do this.
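
For readers who just want the general idea, here is a minimal sketch of a column-by-column conversion from a Python pandas dataframe (held as a PyCall `PyObject`) to a `DataFrames.DataFrame`. It is not necessarily the function from the linked post. The helper name `pandas_to_julia` is made up for illustration, and the sketch assumes a working PyCall installation with Python pandas available, plus a DataFrames.jl version recent enough to support the `df[!, col] = ...` assignment syntax.

```julia
using PyCall, DataFrames

# Hypothetical helper: the name `pandas_to_julia` is illustrative only and is
# not provided by PyCall, Pandas.jl, or DataFrames.jl.
function pandas_to_julia(pdf::PyObject)
    out = DataFrames.DataFrame()
    # Iterating the pandas column index yields the column labels.
    for name in pdf.columns
        # `get(o, PyObject, key)` is PyCall's equivalent of Python's `o[key]`,
        # returned without automatic conversion.
        series = get(pdf, PyObject, name)
        # `tolist()` returns a Python list, which PyCall converts to a Julia Vector.
        out[!, Symbol(name)] = series.tolist()
    end
    return out
end

# Small usage example with a pandas dataframe built from Julia.
pd = pyimport("pandas")
pdf = pd.DataFrame(Dict("a" => [1, 2, 3], "b" => [2.0, 4.0, 6.0]))
df = pandas_to_julia(pdf)
```

Going through `tolist()` copies every column and is probably slow for large tables, but it avoids relying on how NumPy object arrays are converted. The pandas row index (for example a `PolNo` index like the one shown earlier in the thread) is not carried over by this sketch; calling `reset_index()` on the pandas side first would turn it into an ordinary column.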
